A modified multiple-criteria decision-making approach based on a protein-protein interaction network to diagnose latent tuberculosis

Ayalvari, Somayeh; Kaedi, Marjan; Sehhati, Mohammadreza

doi:10.1186/s12911-024-02668-z

Research
Open access
Published: 30 October 2024


BMC Medical Informatics and Decision Making volume 24, Article number: 319 (2024)


Abstract

Background

DNA microarrays provide informative data for transcriptional profiling and identifying gene expression signatures to help prevent progression of latent tuberculosis infection (LTBI) to active disease. However, constructing a prognostic model for distinguishing LTBI from active tuberculosis (ATB) is very challenging due to the noisy nature of data and lack of a generally stable analysis approach.

Methods

In the present study, we proposed an accurate predictive model with the help of data fusion at the decision level. In this regard, results of filter feature selection and wrapper feature selection techniques were combined with multiple-criteria decision-making (MCDM) methods to select 10 genes from six microarray datasets that can be the most discriminative genes for diagnosing tuberculosis cases. As the main contribution of this study, the final ranking function was constructed by combining protein-protein interaction (PPI) network with an MCDM method (called Decision-making Trial and Evaluation Laboratory or DEMATEL) to improve the feature ranking approach.

Results

Decision-level data fusion was applied to the 10 introduced genes by fusing random forests (RF) and k-nearest neighbors (KNN) classifiers according to Yager’s theory; the proposed algorithm reached a sensitivity of 0.97, specificity of 0.90, and accuracy of 0.95. Finally, with the help of cumulative clustering, the genes involved in the diagnosis of latent and active tuberculosis were introduced.

Conclusions

The combination of MCDM methods and PPI networks can significantly improve the diagnosis of different states of tuberculosis.

Clinical trial number

Not applicable.


Introduction

Tuberculosis is a common infectious disease with a high mortality rate that is caused by Mycobacterium tuberculosis, a species of mycobacteria. Conventional methods for diagnosing active tuberculosis (ATB) include skin tests, blood tests, sputum tests, sputum cultures, and chest radiographs [1]. However, these techniques may fail because tuberculosis infections are latent and asymptomatic most of the time. Genetic factors in a person carrying latent tuberculosis bacteria may play an essential role in the development of ATB, which makes the topic an attractive research area in the study of the human immune system. Over time, as the person’s immune system weakens, the bacteria may reactivate, and the person may develop ATB [1]. If ATB persists in the body or is not adequately treated, it may turn into drug-resistant tuberculosis, a condition in which the infection does not respond to medication. Thus, early detection of latent tuberculosis and ATB is beneficial for improving clinical research plans [1,2,3].

Gene expression, the process by which the information within a gene is used to obtain a functional product, can be measured by different technologies such as microarrays and RNA sequencing [4]. In gene expression profiling, the activity of thousands of genes is measured and assessed to form a picture of cell function, which shows how cells respond to a disease or treatment [5]. The analysis of gene expression data in the study of Sun et al. showed that the genes responsible for the activation or latency of tuberculosis tend to be enriched in different clusters. The genes introduced in that research can identify tuberculosis at different stages [6]. In the study of Deng et al., the gene expression data of 123 patients with tuberculosis were analyzed, and 24 genes were identified as being associated with tuberculosis activation [7]. The study of Bah et al. showed that the genes responsible for tuberculosis in adults and children differ significantly [8].

In the study of Tavasoli et al., a new weighting method was presented to select the best subset of genes. The approach combines five feature selection methods with different gene ranking methods [9].

The primary purpose of this study is to provide a method for selecting the appropriate feature and classifier for the diagnosis of latent tuberculosis with high reliability. Machine learning algorithms are valuable tools for classifying transcription data. In Multiple Criteria Decision Making (MCDM), the best criteria are selected from among several criteria [10].

To this end, we investigate which feature selection criteria (MIM [11], FDR [12], MIFS [11], correlation coefficient, ANOVA [13,14,15], entropy, CMIM [11], Ave, Gen-score [16], and JMI [11]) and which fused classifiers improve the accuracy of identifying the genes that distinguish latent tuberculosis from healthy controls and ATB. MCDM models (Linear Assignment and Decision-making Trial and Evaluation Laboratory (DEMATEL)) are applied to select the appropriate feature selection criteria and the best classifiers. Filter feature selection methods (t-test, Fisher’s discriminant ratio (FDR) [12], analysis of variance (ANOVA) [13,14,15], mutual information maximization (MIM) [11], mutual information feature selection (MIFS) [11], joint mutual information (JMI) [11], conditional mutual information maximization (CMIM) [11], entropy, correlation coefficient, and Ave) were used to rank all genes. Using Sequential Forward Feature Selection (SFFS) [17, 18] as a wrapper feature selection method, the combinations of genes that are most discriminative for the Random Forests (RF) classifier were identified. This ranking was then combined with the DEMATEL MCDM method to select a limited number of genes. Finally, 10 genes are introduced that can differentiate between types of tuberculosis.

The main weaknesses of previous studies, which motivate our study, are the low accuracy of the proposed biomarkers, their limited ability to discriminate latent tuberculosis infection (LTBI), their instability due to the small overlap of genes among studies, and the lack of integration of helpful information such as the protein-protein interaction (PPI) network before a decision is made.

This study utilizes PPI scores to construct a ranking matrix in DEMATEL analysis. DEMATEL is an analytical method used to assess mutual impacts among factors in decision-making processes. PPI refers to intracellular protein collaborations crucial for biological processes. Utilizing DEMATEL alongside PPI enhances the quality of biological analysis, despite potential time complexity, leading to error reduction and improved predictions.

Section 2 is dedicated to related work. Section 3 describes the feature selection methods used in this study and introduces the data fusion and multiple-attribute decision-making methods. Section 4 presents the dataset. In Sect. 5, the proposed method is presented. The results are discussed in Sect. 6. Finally, Sect. 7 concludes the paper with a summary.

Related work

In this section, previous studies on tuberculosis diagnosis through the analysis of gene expression data are reviewed first. Then, studies on the diagnosis of other diseases using data fusion are discussed. Afterward, applications of MCDM methods and different feature selection methods in the diagnosis of various diseases are reviewed.

Tuberculosis diagnosis by analyzing the gene expression data

In the study of Sun et al. [6], several genes were introduced by constructing molecular networks and studying protein interactions. The identified genes and gene pairs were shown to diagnose tuberculosis patients at different stages of the disease. In the study of Deng et al. [7], 24 signature genes were identified that could predict tuberculosis activation. In-depth biological analyses of these 24 genes showed that cytokine production is a crucial process in the course of tuberculosis activation, and that signature genes such as TSPO, CYBB, STAT1, and CD36 merit further study. In recent years, the expression of human genes in response to active and latent tuberculosis has been investigated by Wang et al. [19] and Bah et al. [8], who concluded that host genes often show little change in expression in latent tuberculosis infection. The study of Juan Zhang et al. [20] focused on identifying potential blood biomarkers for diagnosing tuberculosis and their role in Mycobacterium tuberculosis-infected macrophages. Initially, Weighted Correlation Network Analysis (WGCNA) of 9451 genes revealed significant changes in the whole blood of tuberculosis patients. Subsequently, 220 interferon-gamma-related genes were identified, and 30 key genes were prioritized using Cytoscape. The Area Under the Curve (AUC) values of these genes were calculated for better feature selection. Nine genes were identified, among which SAMD9L showed high diagnostic value (AUC = 0.925) and significant discriminative ability (AUC > 0.865) in ROC analysis. The objective of the study of Wu et al. [21] was to identify diagnostic biomarkers for tuberculosis. Gene ontology analysis indicated significant changes primarily in cell-cell adhesion regulation and T cell activation.
KEGG analysis showed that the host response in tuberculosis primarily involves cytokine-receptor interactions and folate biosynthesis. Using protein-protein interaction networks, IRF1 was identified as a biomarker. Validation datasets showed increased IRF1 levels in tuberculosis patients compared to healthy individuals. ELISA confirmed IRF1 as a significant biomarker with AUC = 0.801, suggesting its potential use as a new marker for pulmonary tuberculosis diagnosis. In the study of Natarajan et al. [22], integrated analysis identified transcriptional profiles and gene expression signatures distinguishing ATB from latent tuberculosis infection. Pathway analysis indicated that upregulated genes are associated with signaling pathways such as IFN and interleukin-1 production. Furthermore, a seven-gene signature was proposed as a biomarker for distinguishing between ATB and LTBI, demonstrating high diagnostic accuracy in ROC analysis. The study of Liu et al. [23] focused on ATB caused by Mycobacterium tuberculosis. It uses several methods, such as a backpropagation neural network, WGCNA, and Single Sample Gene Set Enrichment Analysis (ssGSEA), combining machine learning approaches and gene data analysis techniques to develop and evaluate predictive and diagnostic models. In the study of Dai et al. [24], six genes (CASP1, TNFSF10, CASP4, CASP5, IFI16, and ATF3) with strong diagnostic performance (AUC > 0.7) were identified for distinguishing ATB from LTBI. The study focused on developing diagnostic models based on biomarker expression data, using RF, Least Absolute Shrinkage and Selection Operator (LASSO), and logistic regression. Delgobo et al. [25] examined whether Mycobacterium tuberculosis directly controls myeloid commitment. Results indicated that Mycobacterium tuberculosis can drive human CD34+ cells to differentiate into monocytes/macrophages, a transformation that occurs in vitro without type I or II IFN signaling. Moreover, Mycobacterium tuberculosis increased the IL-6 response in these cells, and inhibiting IL-6R reduced myeloid commitment and Mycobacterium tuberculosis growth in vitro. Genetic, proteomic, and genomic data analysis showed that the IL-6/IL6R/CEBP gene module is associated with disease severity in tuberculosis patients and has recently evolved to include Neanderthal introgression and human microbe adaptation. Chen et al. [26] used differential analysis and WGCNA to identify central genes capable of distinguishing between ATB and LTBI. Five central genes (FBXO6, ATF3, GBP1, GBP4 and GBP5) were identified as potential markers for the progression of LTBI to ATB, and ROC analysis showed that these genes had high diagnostic accuracy, with AUC values between 0.8 and 0.9. In the study of Yu et al. [27], a seven-gene profile was presented using the RF model, which can be used as a potential marker for distinguishing ATB from LTBI in children (AUC = 0.888). The study developed and evaluated machine learning models such as Support Vector Machine (SVM), RF, and Generalized Linear Models (GLM) using specific gene expression profiles, with the aim of identifying cluster-specific genes with high diagnostic potential for tuberculosis. The RF model demonstrated significantly lower residual errors than the other models. In the study of Chen et al. [28], a prediction model was developed using machine-learning classifiers (RF, GLM, SVM, and XGB), with SVM showing the highest AUC in predicting tuberculosis subtypes among pediatric patients. In the study of Wang et al. [29], blood levels of autophagy-related genes (ARGs) were analyzed to differentiate between ATB and latent tuberculosis infection. Three genes (FOXO1, CCL2, ITGA3) correlated positively with adaptive immune lymphocytes and negatively with myeloid and inflammatory cells. A nomogram using these genes accurately distinguished ATB from LTBI patients in subsequent dataset validations.

Diagnosis of various diseases using data fusion

In the study of Meng et al. [30], a comprehensive review of data fusion methods based on machine learning is presented, and several requirements are proposed as criteria for evaluating the performance of existing machine-learning-based fusion methods. A new neural network classifier based on a convolutional neural network (CNN) and Dempster-Shafer theory is also presented for classifying valuable sets. The study by Olivan et al. [31] provides a comprehensive review of recent developments in data integration and machine learning for industrial forecasting, with an emphasis on identifying research trends, opportunities, and unexplored challenges. In the study of Ali et al. [32], an intelligent model for predicting heart disease was proposed using deep learning and data fusion. This model fuses data extracted from sensors and medical records to achieve a reliable diagnosis. In a study by Hu et al. [33], a machine-learning-based model for the diagnosis of COVID-19 is presented, which fuses medical information to diagnose the symptoms of the disease reliably. In the study of Simjanoska et al. [34], a multi-level information fusion approach is proposed to learn a blood pressure predictor model from electrocardiogram sensor data. In a study by Razavifar et al. [35], two methods based on the k-nearest neighbor model were proposed to estimate missing values. The first method uses local search to find the best value of k, and the second uses the best k-nearest neighbors (KNN) to estimate the missing values. The proposed model uses Dempster-Shafer theory (DST) for the final estimation. In the study of Nachappa et al. [36], the performance of several multi-criteria decision analysis (MCDA) models, machine learning models, and fused modeling methods was investigated.
The Dempster-Shafer method was also analyzed in that study. In the study of Razi et al. [37], a decision-level data fusion method is proposed to fuse the results of the classifiers using DST. Wang et al. [38] designed a distributed intrusion detection system that uses DST to combine evidence from distributed sensors. In the study of Saeed et al. [39], an automated disease diagnosis system is presented that uses partial least squares regression to select features from a set of deeply extracted features. In the study of Arshad et al. [40], an integrated framework for HGR is presented that uses a deep neural network and a fuzzy entropy-controlled skewness approach. In the study of Jee and Namin [41], a model is designed that uses fuzzy criteria to select the best evidence, with the aim of demonstrating the effect of DST and fuzzy reasoning on the accuracy of web spam classification. In the study of Tang et al. [42], a framework based on the fused classification of RF and D-S evidence theory is presented for detecting single faults. In the study of Wang et al. [43], a two-step framework consisting of hierarchical fusion and heterogeneous fusion was introduced to fuse the results of different classifiers.

Diagnosis of various diseases using MCDM

Kim et al. [44] and Hashemi et al. [45] modeled the feature selection process as a multiple-criteria decision-making procedure. Their technique employs TOPSIS (Technique for Order of Preference by Similarity to Ideal Solution) to evaluate features on the basis of their association with a number of labels, which serve as the criteria. TOPSIS favors the alternatives that lie nearest to the positive ideal solution and farthest from the negative ideal solution. He et al. [46] introduced an integrated MCDM approach that combines a variety of classifiers within the MCDM framework to evaluate their relative weights. Farhadi et al. [47] identified and prioritized the factors contributing to service quality from the viewpoint of healthcare providers employed in the educational hospitals affiliated with Shiraz University of Medical Sciences. Hsieh et al. [48] presented a case study pertaining to a food and beverage information system.

Diagnosis of various diseases using different feature selection methods

In the study of Maghsoudloo et al. [12], feature selection methods such as FDR were used to diagnose asthma and to introduce biomarkers and genes that cause the disease. Shrivastava et al. [49] used the FDR feature selection method and an SVM classifier to design a new system for the diagnosis and classification of the skin disease psoriasis. Pascal Ezenkwu et al. [50] used a variety of feature selection methods, such as SFFS and MI, to select features related to the entire dataset. In the study of Xu et al. [51], various feature selection methods were compared for decoding brain states from fMRI data. Two feature selection methods, the Kendall matching coefficient and ANOVA, were used. Yadegaridehkordi et al. [52] used the DEMATEL method to discover the interdependencies between factors as well as their significance in big data. In the study of Saghapour et al. [53], a new feature ranking method was proposed to predict cancer conditions from protein data. This method combines ten feature selection techniques with the TOPSIS method to identify the most distinctive proteins for detecting different types of cancer.

Materials and methods

Feature selection, data fusion, and MCDM

In the proposed method, the concepts of feature selection, data fusion, and MCDM are used. These concepts are therefore introduced in the rest of this section.

Feature selection

Feature selection methods are divided into three categories: filter-based methods, wrapper techniques, and hybrid approaches. Filter-based methods are independent of learning algorithms and use statistical properties of the data to calculate a score for each feature. Features are sorted by their scores, and the features with the worst ranks are removed. Filter-based methods are fast because they do not use learning algorithms, which makes them suitable for high-dimensional data. The wrapper technique instead evaluates subsets of candidate features, generated by a specific search strategy, with a machine learning algorithm. This method therefore has high accuracy, but its computational complexity is also high because a learning algorithm must be trained for each candidate subset. Hybrid methods consist of two steps. In the first step, filter-based methods are used to reduce the data dimensions, and in the second step, wrapper methods are applied to select the best subset of features [35, 54, 55].
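As an illustration of the hybrid strategy, the following minimal Python sketch ranks features with a filter score and then runs a greedy forward wrapper over the top-ranked candidates. It is not the pipeline used in this study: the two-class Fisher ratio, the nearest-centroid classifier, the training-accuracy criterion, and all function names and toy values are assumptions chosen to keep the example self-contained.

```python
from statistics import mean, pvariance


def fisher_ratio(values, labels):
    """Two-class Fisher ratio: squared mean gap over summed class variances."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    denom = pvariance(a) + pvariance(b)
    return (mean(a) - mean(b)) ** 2 / denom if denom > 0 else float("inf")


def filter_rank(X, labels):
    """Filter step: feature indices sorted from highest to lowest score."""
    n_features = len(X[0])
    scores = [fisher_ratio([row[j] for row in X], labels) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def centroid_accuracy(X, labels, subset):
    """Training accuracy of a nearest-centroid classifier on the given features."""
    centroids = {}
    for c in set(labels):
        rows = [r for r, y in zip(X, labels) if y == c]
        centroids[c] = [mean(r[j] for r in rows) for j in subset]
    hits = 0
    for r, y in zip(X, labels):
        pred = min(
            centroids,
            key=lambda c: sum((r[j] - m) ** 2 for j, m in zip(subset, centroids[c])),
        )
        hits += pred == y
    return hits / len(X)


def forward_select(X, labels, candidates, max_features):
    """Wrapper step: greedily add the candidate that improves accuracy the most."""
    chosen, best = [], 0.0
    while len(chosen) < max_features:
        best_j, best_acc = None, best
        for j in candidates:
            if j in chosen:
                continue
            acc = centroid_accuracy(X, labels, chosen + [j])
            if acc > best_acc:
                best_j, best_acc = j, acc
        if best_j is None:  # no remaining candidate improves the accuracy
            break
        chosen.append(best_j)
        best = best_acc
    return chosen, best


# Hypothetical toy data: 6 samples, 3 features, two classes (0 and 1).
X = [[1.0, 5.0, 2.0], [1.2, 3.0, 2.5], [0.9, 4.0, 1.8],
     [3.0, 4.5, 2.6], [3.1, 3.5, 3.0], [2.9, 5.5, 2.4]]
labels = [0, 0, 0, 1, 1, 1]

ranking = filter_rank(X, labels)            # step 1: filter
selected, accuracy = forward_select(X, labels, ranking[:2], 2)  # step 2: wrapper
```

In practice the wrapper criterion would be a cross-validated score of the chosen classifier rather than training accuracy, which is used here only to keep the sketch short.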

Data fusion

Data fusion is a technique that fuses data collected from different sources. Its purpose is to create a predictive model of a system based on data obtained from several sources and classifiers. This study employs data fusion at three levels: the data level, the feature level, and the decision level. At the data level, features (genes) from various microarray datasets are integrated. At the feature level, the improved distance evaluation (IDE) feature selection method is applied to identify significant genes based on IDE scores. At the decision level, the Dempster-Shafer and Yager methods are used to optimize classifier combinations. The levels of data fusion and the strengths and weaknesses of each are presented in Table 1. More details about these three data fusion levels are provided in the rest of this section [35, 41, 56,57,58,59].

Table 1 Strengths and weaknesses of different levels of data fusion method [56, 59]


Data level

In this approach, no analysis or processing is performed on the data. In this method, the data from different sources are directly fused [35, 56].

Feature level

At this level, practical information features are extracted from various sources. Features extracted from different sources are fused to form a group of features that will represent the state of a system [35, 56].

Improved distance evaluation (IDE)

IDE is a concept used in data fusion, particularly in the context of multi-sensor data fusion or sensor data integration. IDE is a method used to enhance the accuracy of data fusion by considering the distances between the measurements obtained from different sensors or sources. In data fusion, multiple sensors or sources are used to collect data about a particular phenomenon or object. These sensors might have different characteristics, noise levels, or biases. IDE aims to address the challenges of fusing data from diverse sources by evaluating the distances between data points in the combined feature space. The key idea behind IDE is to assess the consistency and reliability of the measurements from different sources. By considering the distances between measurements, the method can identify outliers, inconsistencies, or unreliable data points. This helps in making more informed decisions during the data fusion process and improves the overall quality and accuracy of the fused data. IDE can be particularly useful in scenarios where the sources have different levels of accuracy, noise, or uncertainty. By incorporating distance evaluation, the fusion process can mitigate the impact of unreliable or inconsistent measurements, leading to more robust and accurate results.

The IDE method is one of the feature selection methods used in data fusion at the feature level. In this method, the best features satisfy the following condition: their values are as close to each other as possible within one class, and as far apart as possible between two different classes. A score is assigned to each feature by combining the average distance between its values within each class and the distance between its class centers. The higher the score of a feature, the farther apart its values are for two different classes and the closer together they are within a class. Then, by setting a threshold, features that score above the threshold are selected as the best features. The main idea of this method is explained below:

Suppose that $c=1,2,\dots,C$ indexes the classes, $m=1,2,\dots,M$ indexes the samples in each class, and $j=1,2,\dots,J$ indexes the extracted features. Let $q_{m,c,j}$ be the value of the $j$-th feature of the $m$-th sample of class $c$. The mean distance between the values of a feature that are extracted from different samples of a class is obtained from Eq. (1) [35, 54]:

$$d_{c,j}=\frac{1}{M\times\left(M-1\right)}\times\sum_{l,m=1}^{M}\left|q_{m,c,j}-q_{l,c,j}\right|;\quad l\ne m$$

(1)

The variation factor of $d_{c,j}$ across the classes is obtained using Eq. (2) [35, 57]:

$$V_{j}^{\left(w\right)}=\left|\frac{\max\left(d_{c,j}\right)}{\min\left(d_{c,j}\right)}\right|$$

(2)

The larger the $V_{j}^{\left(w\right)}$ index for a feature, the more the within-class spread of that feature differs from one class to another, so a larger index lowers the feature score through the reward index of Eq. (7). To calculate the mean distance of a feature between different classes, first the average value of the feature over all samples in a class is obtained as follows [35, 57]:

$$u_{c,j}=\frac{1}{M}\times\sum_{m=1}^{M}q_{m,c,j}$$

(3)

Now the distance between the mean values of a feature in different classes is calculated as follows [35, 57]:

$$d_{j}^{\left(b\right)}=\frac{1}{C\times\left(C-1\right)}\times\sum_{c,e=1}^{C}\left|u_{c,j}-u_{e,j}\right|;\quad c\ne e$$

(4)

The higher this index is for a feature, the greater the distance between the values of that feature for different classes, and the more appropriate the feature is for differentiating between those classes. So far we have considered the average distances within and between classes, but not how much they vary. The more variation, the less reliable the system: a feature may be close together within one class and far apart between classes on average, yet with large fluctuations and uncertainty. Therefore, we define a compensation factor, or reward index ($\lambda$). The greater the variation ($V$), the larger the denominator of the reward index, the smaller $\lambda$, and the lower the feature score.

The variation factor of the $d_{j}^{\left(b\right)}$ index is calculated as follows [35, 57]:

$$V_{j}^{\left(b\right)}=\frac{\max\left(\left|u_{c,j}-u_{e,j}\right|\right)}{\min\left(\left|u_{c,j}-u_{e,j}\right|\right)};\quad c\ne e$$

(5)

The greater the distance between the values of a feature within a class, the lower its score. The greater the distance between the values of a feature between different classes, the higher its score. So the score of each feature is proportional to [35, 57]:

$$\alpha_{j}\propto\frac{d_{j}^{\left(b\right)}}{d_{j}^{\left(w\right)}}$$

(6)

Any feature that has less variance in the values of $\:{d}_{j}^{\left(b\right)}$ and $\:{d}_{j}^{\left(w\right)}$ indices will have better quality, so it should get more points. Accordingly, using the variances of these values, the reward index is defined for each feature [35, 57]:

$$\lambda_{j}=\frac{1}{\frac{V_{j}^{\left(w\right)}}{\max\left(V_{j}^{\left(w\right)}\right)}+\frac{V_{j}^{\left(b\right)}}{\max\left(V_{j}^{\left(b\right)}\right)}}$$

(7)

Using the reward factor, the score of each feature in raw form is equal to [35, 57]:

$$\alpha_{j}=\lambda_{j}\times\frac{d_{j}^{\left(b\right)}}{d_{j}^{\left(w\right)}}$$

(8)

Finally, using the following formula, the score of each feature is normalized so that the best features can be selected by applying the desired threshold [35, 57]:

$$\overline{\alpha}_{j}=\frac{\alpha_{j}}{\max\left(\alpha_{j}\right)}$$

(9)
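Equations (1) to (9) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function name `ide_scores`, the nested-list layout `data[c][m][j]`, and the toy values are hypothetical. The text does not define $d_{j}^{(w)}$ explicitly, so the sketch assumes it is the average of $d_{c,j}$ over the $C$ classes, as in the usual formulation of IDE. It also assumes the same number of samples $M$ in every class and that no within-class or between-class distance is zero.

```python
def ide_scores(data):
    """Normalized IDE score per feature; data[c][m][j] is feature j of sample m in class c."""
    C, M, J = len(data), len(data[0]), len(data[0][0])
    d_w, d_b, v_w, v_b = [], [], [], []
    for j in range(J):
        # Eq. (1): mean pairwise distance between samples of the same class.
        d_c = []
        for c in range(C):
            q = [data[c][m][j] for m in range(M)]
            total = sum(abs(q[m] - q[l]) for m in range(M) for l in range(M) if l != m)
            d_c.append(total / (M * (M - 1)))
        d_w.append(sum(d_c) / C)           # assumed definition of d_j^(w)
        v_w.append(max(d_c) / min(d_c))    # Eq. (2)
        # Eq. (3): class means of the feature.
        u = [sum(data[c][m][j] for m in range(M)) / M for c in range(C)]
        gaps = [abs(u[c] - u[e]) for c in range(C) for e in range(C) if c != e]
        d_b.append(sum(gaps) / (C * (C - 1)))  # Eq. (4)
        v_b.append(max(gaps) / min(gaps))      # Eq. (5)
    # Eq. (7): reward index, smaller when the variation factors are larger.
    lam = [1.0 / (v_w[j] / max(v_w) + v_b[j] / max(v_b)) for j in range(J)]
    # Eq. (8): raw score; Eq. (9): normalization by the largest score.
    alpha = [lam[j] * d_b[j] / d_w[j] for j in range(J)]
    top = max(alpha)
    return [a / top for a in alpha]


# Hypothetical toy data: 2 classes, 3 samples per class, 2 features.
toy = [
    [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0]],   # class 1
    [[3.0, 4.5], [3.1, 3.5], [2.9, 5.5]],   # class 2
]
scores = ide_scores(toy)
selected = [j for j, s in enumerate(scores) if s >= 0.5]  # 0.5 is an arbitrary threshold
```

With only two classes there is a single between-class gap, so $V_{j}^{(b)}$ equals 1 for every feature and the reward index is driven by $V_{j}^{(w)}$ alone.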

Decision level

This is the highest level of data fusion. Fusion at this level offers the highest accuracy and the lowest error, but it also requires the largest volume of computation [35, 57].

DST and Yager’s theory

As a generalization of the Bayesian technique, DST allows a degree of uncertainty to be represented and paves the way for explicitly accounting for undetermined potential causes of the observed data [41]. As a well-known evidence theory, it defines the basic probability assignment (BPA) function in order to represent the combination of evidence [58]. In addition, DST employs belief uncertainty intervals, built from the evidence obtained via a number of observations, to express the assumed belief [57]. As an efficient technique for assessing uncertainty and modeling imprecision, DST provides more flexibility for specifying uncertainty in probabilistic models and for testing hypotheses. Theoretically, two functions are essential for representing this information: the belief function (Bel) and the plausibility function (PLS). Bel and PLS give the lower and upper bounds, respectively, of the unknown probability of a proposition. The uncertainty of the knowledge about the proposition can be determined from the difference between PLS and Bel. DST is used to combine data at the decision level. Three essential functions are used in DST [58, 59]:

The basic function of probability mass (m).
Belief functions (Bel).
Plausibility function (PLS).

The basic function of probability mass, called the mass function, is the most critical part of evidence theory and is denoted by m or referred to as the Basic Probability Assignment (BPA). The PLS and Bel functions are the upper and lower limits of the probability of a proposition, respectively, and are defined from the mass function as in Eqs. (10) and (11) [41, 57]. In evidence theory, severe conflict between classifiers (evidence) may lead to an entirely erroneous estimate. Yager introduced an efficient method in which the possibility of conflict between the evidence is adequately considered.

$$\mathrm{Bel}\left(A\right)=\sum_{B\mid B\subseteq A}m\left(B\right),\qquad \mathrm{Pls}\left(A\right)=\sum_{B\mid B\cap A\ne\emptyset}m\left(B\right)$$

(10)

$$\mathrm{Bel}\left(A\right)\le P\left(A\right)\le \mathrm{Pls}\left(A\right)\;\xrightarrow{\text{if }\mathrm{Pls}\left(A\right)=\mathrm{Bel}\left(A\right)}\;\mathrm{Bel}\left(A\right)=P\left(A\right)=\mathrm{Pls}\left(A\right)$$

(11)

The real probability of the occurrence of an event $A$, denoted by $P\left(A\right)$, lies between the belief and plausibility values of that event. Figure 1 shows the schematic of DST combinations.
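The belief and plausibility functions of Eq. (10), together with the two combination rules mentioned above, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code. It relies on the standard formulations: Dempster's rule renormalizes the non-conflicting mass, whereas Yager's rule moves the conflicting mass to the whole frame of discernment. The function names and the mass values attributed to the two classifiers are hypothetical.

```python
from itertools import product


def belief(m, A):
    """Bel(A): total mass of all focal sets contained in A (Eq. 10)."""
    return sum(v for B, v in m.items() if B <= A)


def plausibility(m, A):
    """Pls(A): total mass of all focal sets that intersect A (Eq. 10)."""
    return sum(v for B, v in m.items() if B & A)


def combine(m1, m2, frame, rule="yager"):
    """Combine two mass functions with Dempster's or Yager's rule."""
    joint, conflict = {}, 0.0
    for (B, v1), (C, v2) in product(m1.items(), m2.items()):
        inter = B & C
        if inter:
            joint[inter] = joint.get(inter, 0.0) + v1 * v2
        else:
            conflict += v1 * v2  # mass of contradictory evidence
    if rule == "dempster":
        if conflict >= 1.0:
            raise ValueError("total conflict: Dempster's rule is undefined")
        return {A: v / (1.0 - conflict) for A, v in joint.items()}
    # Yager: keep the products as they are and assign the conflict to the frame.
    joint[frame] = joint.get(frame, 0.0) + conflict
    return joint


# Hypothetical masses from two classifiers over the frame {LTBI, ATB}.
LTBI, ATB = frozenset({"LTBI"}), frozenset({"ATB"})
FRAME = LTBI | ATB
m_rf = {LTBI: 0.7, ATB: 0.2, FRAME: 0.1}
m_knn = {LTBI: 0.6, ATB: 0.3, FRAME: 0.1}

fused = combine(m_rf, m_knn, FRAME, rule="yager")
interval = (belief(fused, LTBI), plausibility(fused, LTBI))  # bounds on P(LTBI), Eq. (11)
```

Because Yager's rule keeps the conflict as ignorance on the frame, the interval between Bel and Pls widens when the classifiers disagree, which is the behavior that protects the fused decision from strongly conflicting evidence.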

Multiple-criteria decision-making (MCDM)

MCDM is a method of selection when dealing with several different criteria. For example, when several classifiers or different feature selection methods are used to diagnose a disease, the final diagnosis can be announced after examining and integrating their results. The principles of MCDM make it possible to analyze each criterion separately and then produce an overall result. In MCDM, instead of measuring optimality with one criterion, several different criteria are used, so that no criterion is neglected. This makes it possible to select the best solution among multiple alternatives that sometimes even conflict with each other, for one or several goals and for one or several decision makers. It is the process that leads to an answer from among the candidate solutions to the problem [60,61,62].

Linear assignment

This method is one of the simplest available in multiple-attribute decision-making. Criteria are generally denoted by $C_j$, $j=1,\dots,n$. Alternatives are usually denoted by $A_i$, $i=1,\dots,m$. The $x_{ij}$ are the scores assigned to the performance of the alternatives relative to the criteria [60].

If the criterion is increasing (larger is better), it is normalized using Eq. (12):

$$r_{ij}=\frac{x_{ij}}{\sum_{i=1}^{m}x_{ij}}\quad\text{or}\quad r_{ij}=\frac{x_{ij}}{\max_{i}\{x_{ij}\}}$$

(12)

If the criterion is decreasing (smaller is better), it is normalized using Eq. (13):

$$r_{ij}=1-x_{ij}\quad\text{or}\quad r_{ij}=\frac{\min_{i}\{x_{ij}\}}{x_{ij}}$$

(13)

The next step is to form the weighted matrix: the normalized scores are multiplied by criterion weights calculated by other methods (such as Shannon entropy). Finally, the score of each alternative is calculated by summing the corresponding row of the weighted matrix, the alternatives are ranked by these scores, and the best alternative is selected.
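The normalization, weighting, and row-sum ranking described above can be sketched as follows. This is only an illustration under stated assumptions: it uses the max and min variants of Eqs. (12) and (13), and the score matrix, the weights, and the function names are hypothetical rather than taken from the study.

```python
def normalise(column, increasing=True):
    """Max variant of Eq. (12) or min variant of Eq. (13) for one criterion."""
    if increasing:
        top = max(column)
        return [x / top for x in column]
    low = min(column)
    return [low / x for x in column]

def rank_alternatives(scores, weights, increasing):
    """Return the weighted row sums and the alternative indices, best first."""
    n_alt, n_crit = len(scores), len(weights)
    cols = [
        normalise([row[j] for row in scores], increasing[j])
        for j in range(n_crit)
    ]
    totals = [
        sum(weights[j] * cols[j][i] for j in range(n_crit))
        for i in range(n_alt)
    ]
    order = sorted(range(n_alt), key=lambda i: totals[i], reverse=True)
    return totals, order

# Hypothetical data: 3 alternatives scored on 2 criteria.
# Criterion 0 is increasing (e.g. accuracy), criterion 1 is decreasing.
scores = [
    [0.90, 0.40],
    [0.80, 0.20],
    [0.60, 0.50],
]
weights = [0.6, 0.4]  # assumed weights, e.g. obtained from Shannon entropy
totals, order = rank_alternatives(scores, weights, increasing=[True, False])
```

In this toy example the second alternative ranks first: its advantage on the decreasing criterion outweighs its slightly lower score on the increasing one.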

DEMATEL

DEMATEL is an effective technique for analyzing the influence relationships between the factors of a system. By analyzing all influence relations among the factors, DEMATEL gives a better understanding of the structural relations and helps solve complicated problems of the system. It is one of the MCDM methods and structures the problem based on the opinions of experts; here, different classifiers play the role of experts. With the help of a communication table, the superiority vector and the communication vector of the features (genes) are calculated, which give the effectiveness and dependence of each feature. The intensity of the effect of feature $i$ on feature $j$ is denoted by one of the numbers 0, 1, 2, 3, 4. R indicates the degree of effectiveness and C indicates the degree of dependency of each feature.

R + C (Superiority vector): The higher the index value, the more the feature interacts with the other features and the more important that feature is.

R-C (Communication vector): represents the net effect of the feature in the system. If the value of R-C is greater than zero, we have an effective feature, and if it is less than zero, the corresponding feature (gene) is dependent [60,61,62].

As will be mentioned later, in the current study, the scores of the table presented in Appendix E have been used to create the communication table.
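To make the R + C and R-C vectors concrete, here is a minimal sketch of the standard DEMATEL computation, which the text does not spell out and which is therefore an assumption here: the direct-influence matrix is normalized by its largest row or column sum, the total-relation matrix is accumulated as the series N + N² + …, and R and C are its row and column sums. The 3 × 3 matrix on the 0 to 4 scale is hypothetical.

```python
def matmul(a, b):
    """Product of two square matrices stored as lists of lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]

def dematel(direct, tol=1e-12, max_iter=10_000):
    """Return (R, C): row and column sums of the total-relation matrix."""
    n = len(direct)
    row_sums = [sum(row) for row in direct]
    col_sums = [sum(direct[i][j] for i in range(n)) for j in range(n)]
    scale = max(max(row_sums), max(col_sums))
    norm = [[x / scale for x in row] for row in direct]
    total = [row[:] for row in norm]   # running sum N + N^2 + ...
    power = [row[:] for row in norm]   # current power N^k
    for _ in range(max_iter):
        power = matmul(power, norm)
        for i in range(n):
            for j in range(n):
                total[i][j] += power[i][j]
        if max(max(row) for row in power) < tol:
            break
    r = [sum(row) for row in total]
    c = [sum(total[i][j] for i in range(n)) for j in range(n)]
    return r, c

# Hypothetical direct-influence scores between three features (0-4 scale).
direct = [
    [0, 3, 2],
    [1, 0, 1],
    [0, 2, 0],
]
r, c = dematel(direct)
prominence = [ri + ci for ri, ci in zip(r, c)]  # R + C, superiority vector
relation = [ri - ci for ri, ci in zip(r, c)]    # R - C, communication vector
```

In this example the first feature has a positive R-C value (it mainly influences the others), while the second has a negative value (it is mainly influenced).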

Sequential Forward feature selection (SFFS)

SFFS is a wrapper feature selection technique used in machine learning and statistics to choose a subset of relevant features from a larger set and thereby improve the performance of a predictive model. The process starts with an empty set of selected features and iteratively adds one feature at a time, choosing the feature that gives the best improvement in model performance, until a predefined stopping criterion is met. At each step, the algorithm evaluates the model using cross-validation or some other evaluation metric and selects the next feature based on its impact on performance. SFFS is a forward selection technique because it starts with no features and incrementally builds up the selected set. It can reduce the dimensionality of the feature space and improve model accuracy, as it aims to include only the most relevant features while excluding irrelevant or redundant ones. Through the iterative elimination of the worst features or aggregation of the best features, sequential feature selection algorithms seek an efficient subset of features [17, 18].

Beginning the search from a null/random subset $X_0$, SFFS iteratively selects the most significant feature (MFS) from the remaining features (at iteration $k$: $Y_k = U - X_k$, where $U$ is the full feature set) and adds it to the current subset ($X_k = X_k \cup MFS$). SFFS then repeatedly finds and deletes the least significant feature (LFS) from the new subset. After each such step, the result is compared with the result obtained for the preceding subset ($X_k$); if the outcome improves, then $X_{k+1} = X_k - LFS$. The process is iterated until a stopping criterion is reached. The MFS and LFS are chosen by applying an evaluation criterion and a wrapper algorithm.
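The add-then-conditionally-remove loop can be sketched as below. This is an illustrative sketch rather than the implementation used in the study: the `score` function stands in for the wrapper evaluation (for example, cross-validated accuracy of a classifier), and the toy relevance values, the redundancy penalty, and the feature names are hypothetical.

```python
def sffs(features, score, k_max):
    """Floating forward selection: add the best feature, then drop the
    least significant one while doing so beats the best score recorded
    for the smaller subset size."""
    selected = []
    best_by_size = {}
    while len(selected) < k_max:
        remaining = [f for f in features if f not in selected]
        if not remaining:
            break
        # Forward step: add the most significant feature (MFS).
        mfs = max(remaining, key=lambda f: score(selected + [f]))
        selected = selected + [mfs]
        size = len(selected)
        best_by_size[size] = max(
            best_by_size.get(size, float("-inf")), score(selected)
        )
        # Floating step: remove the least significant feature (LFS).
        while len(selected) > 2:
            lfs = max(
                selected, key=lambda f: score([g for g in selected if g != f])
            )
            reduced = [g for g in selected if g != lfs]
            if score(reduced) > best_by_size[len(reduced)]:
                selected = reduced
                best_by_size[len(reduced)] = score(reduced)
            else:
                break
    return selected

# Toy stand-in for a wrapper evaluation: relevance minus a redundancy penalty.
relevance = {"g1": 0.50, "g2": 0.45, "g3": 0.30, "g4": 0.10}

def toy_score(subset):
    total = sum(relevance[g] for g in sorted(subset))
    if "g1" in subset and "g2" in subset:
        total -= 0.40  # g1 and g2 are assumed to be redundant
    return total

chosen = sffs(list(relevance), toy_score, k_max=3)
```

Here `g2` is individually strong but is never selected, because its redundancy with `g1` makes every subset containing both score lower than the alternatives.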

PPI

According to recent studies, any functional defect in one of the pathway proteins may indicate biological disorders such as tuberculosis. PPI is a protein interaction identification method and a popular application of scientific visualization techniques. PPIs are extracted from the HIPPIE database [63], which is a very comprehensive repository that integrates information from several well-established databases (such as BioGRID [64] and HPRD [65]). PPI provides an estimated confidence score and is very significant in terms of investigating the functionality of proteins [66,67,68].

Dataset

The datasets used in this study are microarray data collected from the National Center for Biotechnology Information (NCBI) [69,70,71,72]. Each row in a dataset represents a data sample and each column represents a gene. The data samples used in our study belong to three classes: healthy control, ATB, and latent tuberculosis. The original datasets included other classes, which are not considered in this study. In the present work, each gene is considered a feature. The details of the datasets and their sample counts are listed in Table A1 in Appendix A. Datasets are referred to by their Gene Expression Omnibus series (GSE) accession codes.

The proposed approach

The aim of this study is to identify key genes that distinguish between individuals with LTBI, ATB, and healthy individuals. To achieve this goal, we employ data fusion techniques at both the feature and decision levels, enhancing the reliability of our findings by integrating diverse sources of evidence and information.

Our approach integrates multiple microarray datasets (detailed in Sect. 4) to leverage a combination of Multi-Criteria Decision Making (MCDM) methods and PPI networks for tuberculosis diagnosis. Specifically, we utilize the IDE method for feature-level data fusion and apply Dempster-Shafer and Yager’s theory at the decision level.

Here is a breakdown of our proposed method (illustrated in Figs. 2 and 3):

1)
Feature Selection: We apply filter-based feature selection methods (as shown in Appendix B) to the genes across five microarray datasets. Through 10-fold cross-validation, we identify the top 500 genes consistently selected by all methods from the training datasets. These genes are then tested on dataset GSE 19444 [71].
2)
Classifier Evaluation: Using Random Forest (RF), Naïve Bayes (NB), k-Nearest Neighbors (KNN), and Support Vector Machine (SVM), we evaluate the performance of these 500 genes. Results from this step guide the selection of optimal feature selection methods, detailed in Appendix C.
3)
Ranking and Selection: The Linear Assignment method is applied to rank classifiers based on the outcomes of Step 2, determining the most effective classifiers.
4)
Feature Fusion: The IDE method is used to fuse features selected in Step 1, identifying the most discriminative genes.
5)
Optimal Gene Combination: SFFS is employed on genes identified in Step 4 to refine gene selection and determine the optimal combination of genes and classifiers to enhance tuberculosis classification across the microarray datasets.
6)
Protein Interaction Analysis: Utilizing the STRING database [73], we construct a PPI network to assess gene interactions. The network is used to generate a communication table for weight matrix construction via the DEMATEL method, ranking the top 236 discriminative genes identified in Step 5.
7)
Decision Fusion: Decision-level data fusion is conducted to optimize classifier combinations for classifying microarray data into LTBI, ATB, and healthy states. This fusion approach reveals that integrating RF and KNN classifiers on features selected by criteria like correlation coefficient, MIM [11], MIFS [11], and entropy achieves the highest accuracy. RF and KNN are applied on 26 introduced genes in Step 6.
8)
Accumulative clustering and hierarchical clustering: Accumulative clustering and hierarchical clustering are used to introduce pairs of genes responsible for the latency and activation of tuberculosis. Genes common to the current study and the studies of Sun et al. [6] and Bah et al. [8] were examined on GSE 19444 [71] to introduce the genes that distinguish between the latent and active states of TB, and the top 10 discriminative genes were identified.
9)
Decision Fusion: Finally, RF and KNN are applied on 10 genes introduced in Step 8.

The proposed approach aims to address challenges associated with noisy data in Microarray datasets and the lack of a stable analytical framework.

Feature selection plays a pivotal role by identifying and prioritizing relevant features using various models and algorithms. This process offers several benefits as follows:

Dimensionality reduction:

Selecting important features reduces data dimensionality, enhancing algorithm efficiency.

Noise reduction:

Eliminating unnecessary features improves model accuracy and mitigates noise effects in Microarray datasets.

Enhanced generalization:

Significant features improve algorithm robustness and generalization capabilities.

Ambiguity reduction:

Selecting appropriate features minimizes data ambiguities, thereby improving distance estimation accuracy.

In data fusion, effective feature selection addresses challenges posed by noisy data through:

Removing noisy features:

Identifying and removing features associated with data noise improves decision-making and distance evaluation.

Improving pattern detection:

Focusing on significant features enhances pattern recognition amidst noise.

Enhancing algorithm robustness:

Eliminating noise in Microarray datasets and irrelevant features improves algorithm performance with new data.

The SFFS method is highlighted for its role in feature selection in machine learning, offering benefits such as noise reduction and improved model performance. However, it should be integrated with other methods for comprehensive noise mitigation and enhanced model accuracy.

Therefore, in the following, PPI and DEMATEL have been used. Additionally, the PPI network approach is recommended for analyzing microarray data due to its ability to study protein networks holistically, detect patterns, predict interactions, and reduce noise in Microarray datasets. This study introduces a novel approach by utilizing PPI degrees to construct a ranking matrix for DEMATEL analysis. DEMATEL is an analytical method for assessing mutual influences among factors in decision-making processes. The PPI refers to collaborations between proteins within cells, crucial for biological processes. Integrating PPI with DEMATEL offers several advantages: identifying and evaluating protein networks, understanding mutual effects between proteins, predicting physiological changes, and managing complexities in biological systems. This combined approach enhances decision-making in biology, medicine, and drug development by providing deeper insights into protein interactions and their implications.

The average degree of connectivity in a protein-protein interaction (PPI) network is defined as the average number of connections (edges) per node (protein). If $V$ is the number of nodes (proteins) and $E$ is the number of edges (connections), the average degree $\langle K\rangle$ is calculated as follows:

$$\langle K\rangle \, = \,{{2E} \over V}$$

(14)

Here, $2E$ is the sum of the degrees of all nodes, because each edge connects two nodes and is therefore counted once at each of its endpoints. Dividing $2E$ by $V$ gives the average number of connections per node (protein), which is crucial in analyzing complex networks such as PPI networks. The simplified equations for using PPI data in constructing the weight matrix for DEMATEL are as follows:

Weight matrix for DEMATEL

To construct the weight matrix in DEMATEL using PPI data, we define the weights $W_{ij}$ as follows:

$$W_{ij}=PPI_{ij}$$

(15)

Here, $PPI_{ij}$ represents the positive or negative impact of factor $i$ on factor $j$, derived from protein-protein interactions. If $PPI_{ij}$ is positive, $W_{ij}$ is positive as well, and if $PPI_{ij}$ is negative, $W_{ij}$ is negative too.

Degree of dependence matrix in DEMATEL

To calculate the degree of dependence matrix $DEM$ in DEMATEL, we apply Eq. (16) to the weight matrix $W$ obtained from Eq. (15):

$$DEM_{ij}=\frac{W_{ij}+W_{ji}}{2}$$

(16)

Here, $W_{ij}$ and $W_{ji}$ are weights obtained from PPI data, representing the positive and negative impacts between factors $i$ and $j$.

These equations illustrate how PPI data can be utilized to construct the weight matrix and calculate the degree of dependence matrix in DEMATEL methodology.
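Equations (14) to (16) can be illustrated with a small sketch. The three-protein network and the directed weights below are hypothetical and serve only to show the arithmetic; the function names are likewise assumptions.

```python
def average_degree(n_nodes, edges):
    """Eq. (14): <K> = 2E / V for an undirected network."""
    return 2 * len(edges) / n_nodes

def dependence_matrix(w):
    """Eq. (16): DEM_ij = (W_ij + W_ji) / 2, which symmetrises W."""
    n = len(w)
    return [[(w[i][j] + w[j][i]) / 2 for j in range(n)] for i in range(n)]

# Hypothetical network: protein 0 interacts with proteins 1 and 2.
edges = [(0, 1), (0, 2)]

# Hypothetical directed weights W_ij = PPI_ij (Eq. 15).
w = [
    [0.0, 0.9, 0.4],
    [0.7, 0.0, 0.0],
    [0.4, 0.0, 0.0],
]

k_avg = average_degree(3, edges)
dem = dependence_matrix(w)
```

With two edges among three proteins, the average degree is 4/3. The resulting `dem` is symmetric by construction, so any asymmetry between $W_{ij}$ and $W_{ji}$ is averaged out.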

This approach results in selecting the top 500 genes from each dataset. Omitting any step of the presented method leads to incorrect results. For example, omitting the filter-based feature selection methods and preprocessing before information fusion and IDE feature selection results in a large number of features (genes) entering the IDE method and complicates its analysis. Omitting the wrapper method (SFFS feature selection) prevents us from determining which gene combinations lead to better results. Additionally, omitting the feature selection methods and using PPI directly leads to increased noise and overfitting, making the results difficult to analyze. If the PPI scores are not used in DEMATEL, the relationships between genes and their impact on each other may not be properly identified. The ablation study and a more detailed examination of the impact of each step of the proposed method are presented in Sect. 6 (Table 10).

Experiments

Evaluation measures

In this study, sensitivity, specificity, and accuracy have been used to evaluate the proposed method.

A confusion matrix, as an evaluation tool for predictive model performance, particularly in classification problems, is used. In the current study, which aims to identify distinguishing genes of LTBI, this matrix aids in assessing the accuracy of these genes. The matrix consists of four main components:

True Positive (TP): Number of samples correctly classified as LTBI.

True Negative (TN): Number of samples correctly classified as non-LTBI.

False Positive (FP): Number of samples incorrectly classified as LTBI (instead of non-LTBI).

False Negative (FN): Number of samples incorrectly classified as non-LTBI (instead of LTBI).

Based on these four values, various metrics can be calculated to evaluate the accuracy of the model. The most important of these metrics include:

Accuracy: The ratio of correct predictions $(TP+TN)$ to the total number of samples $(TP+TN+FN+FP)$.

Accuracy is calculated using Eq. (17).

$$Accuracy=\frac{TP+TN}{TP+TN+FN+FP}$$

(17)

This metric indicates how accurate the model’s classifications are across all classes (latent and non-latent).

Furthermore, from the confusion matrix, other metrics such as sensitivity and specificity have also been calculated, which are particularly useful in problems with imbalanced data.

Sensitivity is calculated using Eq. (18).

$$Sensitivity=\frac{TP}{TP+FN}$$

(18)

Specificity is calculated using Eq. (19).

$$Specificity=\frac{TN}{TN+FP}$$

(19)
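The three evaluation measures can be computed directly from the confusion-matrix counts, as in the short sketch below. Specificity uses the standard definition, TN / (TN + FP). The counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # share of LTBI samples detected
        "specificity": tn / (tn + fp),   # share of non-LTBI samples rejected
    }

# Hypothetical counts: 20 LTBI samples and 30 non-LTBI samples.
metrics = classification_metrics(tp=18, tn=27, fp=3, fn=2)
```

With imbalanced classes, accuracy alone can hide poor performance on the minority class, which is why sensitivity and specificity are reported alongside it.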

Results and discussion

In the rest of this section, the results related to the several steps of the proposed method are presented and discussed.

Data fusion at the feature level, and use of the IDE method

Improved Distance Evaluation (IDE) is a feature selection method used to assess and select important features in problems such as detecting latent tuberculosis genes. In this approach, the importance of each feature is computed based on the variations and differences present in the data, so that features with a higher capability to differentiate between data categories are selected, enhancing the accuracy and precision of data modeling and analysis. The inputs at this stage are the top 500 genes from each dataset, to which the IDE feature selection method is applied at the information fusion stage, and the best features (genes) according to IDE are identified. IDE thus serves as a computational tool for selecting top genes in microarray data analysis, and its graphical representation helps researchers make informed decisions about which genes to prioritize for further analysis and investigation. The figures related to IDE are presented in Appendix D. The statistical report of the results, which contains the average and standard deviation (SD) of the importance of features (genes), is shown in Table 2.

Table 2 IDE statistical report


Creating ranking matrices using linear assignment and identifying the best classifiers

The Linear Assignment method can rank the classifiers and find the best classifier. In this study, the accuracy values obtained from applying five classifiers to the genes have been used to select the best classifiers with the help of Linear Assignment. These values serve as the main questionnaire of the Linear Assignment method and must be normalized; the rank matrix is then created according to the definitions and equations presented in Sect. 2.3.1. In this study, we have five classifiers and ten feature selection criteria; the goal here is to find the best combination of classifiers and feature selection criteria. The rank matrix for five datasets is shown in Fig. 4. In this study, the correlation coefficient, MIM [11], and MIFS [11] are increasing criteria, and the entropy is a decreasing criterion.

According to Fig. 4, the best classifiers are RF, KNN, NB, and SVM, in that order. The rows of these matrices correspond to NB, KNN, SVM, and RF, respectively. The numbers in each matrix indicate the significance of each classifier in a pairwise way. For example, the presence of 1 in the (4, 1) entry of the matrix (fourth row, first column) for all of the datasets indicates the superiority of the RF classifier.

Hyperparameters in classifiers such as NB, KNN, SVM, and RF are determined before the training process, directly influencing the performance and accuracy of the algorithms. The hyperparameters setting in the applied classifiers are as follows:

RF: Tree depth = 9, Number of trees = 100, Criterion = Entropy.
NB: NB does not have traditional hyperparameters to tune.
SVM: Kernel Function = RBF, data normalization using mean and variance of features.
KNN: Number of neighbors = 5, Distance = Minkowski, data normalization using mean and variance of features.

Data fusion at the decision level

In this section, data fusion at the decision level is evaluated for the four classifiers NB, SVM, KNN, and RF. The confusion matrix has been used for each of these classifiers. The four classifiers have been applied to the 500 best shared genes from Step 1, and their results have been combined using Yager’s theory to improve the classification accuracy and achieve more reliable results. Fusing the RF and KNN classifiers achieves this goal.

Using wrapper feature selection method, SFFS

By applying SFFS to the genes selected in Step 4, the best features with respect to the fusion of the RF and KNN classifiers were selected for each dataset. The results are shown in Table 3. In summary, after 10 feature selection methods are applied to each dataset and the top 500 genes are selected based on classifier performance, these genes are fed into the IDE algorithm to identify the best genes according to it. The top genes chosen by IDE are then evaluated using SFFS, which determines the features (genes) to be used to achieve the desired results. Figures related to SFFS are presented in Appendix D.

Table 3 SFFS statistical report


The statistical report of the results, which contains the average and standard deviation (SD) of the importance of features (genes), is shown in Table 3.

PPI

A PPI network has been used to prepare a communication table between the features (genes) presented in Table 4. The STRING database (string-db.org) [73] was used to investigate the protein interactions. The results obtained are shown in Fig. 5. The network nodes represent proteins that are the products of the selected genes. According to Fig. 5, there are isolated nodes in the PPI network, which should be removed and not ranked by the DEMATEL method. After removing the isolated nodes, the correlation table was designed for the PPIs. This table is presented in Appendix E.

Table 4 Superiority vector (R + C) and communication vector (R-C) for features of datasets


Using DEMATEL method for weighing features

Here, the PPI network is used to obtain the weight matrix of the DEMATEL method. The superior genes selected by feature selection methods (shown in Fig. 5) are re-ranked by DEMATEL method.

The communication table shown in Appendix E has been used in DEMATEL method to re-rank the genes in Table 4. Therefore, in this study, a new feature ranking approach is introduced in which the PPI network is used to obtain the weight matrix of the DEMATEL method. The results are given in Table 4. Finally, 26 genes are introduced that can be the most discriminative.

In Table 4, genes with negative R-C values have the lowest importance, while genes with the highest R + C values indicate greater significance in diagnosing different types of tuberculosis. More information on the gene interactions in Table 5 is provided in Appendix E.

Table 4 shows that the CYBB, AIM2, CDC42, MAPK14, TLR5, IL15, NOD2, IL7R, HIF1A, IFITM3, GBP2, TNFSF13B, CD274, GBP4, EPSTI1, IL2, MMP9, JAK2, GBP5, CASP1, STAT3, NFKB1, GBP1, IFIH1, IL1B, and STAT1 genes are more important and can be used as biomarkers to identify types of tuberculosis.

Comparison of the present study with several related studies

Figure 6 shows that the results obtained from the present study share about 70% with previous studies [6,7,8]. Finally, the classifiers RF and KNN were applied on 26 introduced genes.

In this study, a combination of PPI and DEMATEL methods has been utilized, where the weights obtained from PPI were used to construct the DEMATEL ranking matrix. While this approach may increase time complexity, using appropriate feature selection methods alongside can resolve this issue, which has not been addressed in previous studies [6,7,8, 19].

Previous studies [6, 19] did not employ feature selection methods and used PPI directly. Using PPI directly, without accompanying feature selection, leads to computational complexity and increases the likelihood of errors. Data fusion at the feature and decision levels has led to the identification of top genes and suitable classifier combinations that were not utilized in prior research [6,7,8, 19]. Ultimately, employing these methods has enabled the discovery of latent and ATB genes. Other previous studies [7, 8] did not utilize PPI at all. Because PPI can deeply investigate gene interactions and uncover relationships between them, its use seems essential.

The time complexity of feature selection methods, PPI, and DEMATEL are analyzed in the following:

Feature selection methods

Depending on the algorithm used, these methods can have different time complexities. Simple methods such as filters generally have linear time complexity, $O(n)$, where n denotes the number of features or samples. More complex methods such as wrapper methods, which use machine learning models to evaluate features, may have higher time complexities that grow with the number of features, the number of samples, and the number of model iterations. SFFS iteratively adds and removes features. Adding a feature requires training the model and evaluating its performance for each candidate feature, so the number of steps depends on the total number of features and the dataset size. After features are added, unnecessary ones are removed in a similar way. Overall, the time complexity of SFFS is $O(n^2)$ due to the training and evaluation needed at each stage.

PPI

Time complexity in analyzing PPI networks depends on the size of the network and on the complexity of the analysis. For instance, for graph-based algorithms used to analyze protein networks, the time complexity is a function of the number of nodes (proteins) and the number of edges (interactions) in the graph.

DEMATEL

DEMATEL, used to analyze relationships between variables, has a cubic time complexity, $O(n^3)$, where n is the number of variables.

In Table 5, the results of these classifiers on the 26 introduced genes, in terms of fusion of classifiers of RF and KNN in Yager’s theory, have been compared with the results of former studies [7]. The fusion of RF and KNN classifiers has achieved the highest accuracy. The results are shown in Table 5. The weight of the classifiers is regarded as w = 0.98 for RF, w = 0.95 for KNN, w = 0.92 for NB, and w = 0.90 for SVM. In Yager’s theory, w = 0.90 means that a 0.10 probability of error or lack of awareness is considered for the classifier.
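One common way to express such a reliability weight in evidence theory is classical discounting, in which a share 1 − w of every belief mass is moved to the whole frame of discernment (total ignorance). The sketch below shows this operation as an assumption about how the weights could enter the fusion; the text does not state the exact formulation used, and the classifier output is hypothetical.

```python
def discount(m, frame, w):
    """Scale each mass by the reliability w and assign the remaining
    1 - w to the whole frame, i.e. to 'lack of awareness'."""
    out = {s: w * v for s, v in m.items() if s != frame}
    out[frame] = 1.0 - w + w * m.get(frame, 0.0)
    return out

frame = frozenset({"LTBI", "ATB"})
ltbi = frozenset({"LTBI"})

# Hypothetical SVM output that is fully committed to LTBI, with w = 0.90.
m_svm = discount({ltbi: 1.0}, frame, w=0.90)
```

After discounting, the classifier supports LTBI with mass 0.90 and leaves 0.10 uncommitted, which matches the interpretation of w = 0.90 given above.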

Table 5 The sensitivity, specificity, and accuracy of data fusion using Yager’s theory


In this study, RF-KNN fusion, applied to the output of the best feature selection criteria (i.e., correlation coefficient, MIM, MIFS, and entropy), achieved the highest accuracy in classifying LTBI expression data (accuracy of 0.92 on the test data, GSE 19444 [71]). The results are shown in Table 6.

Table 6 The sensitivity, specificity, and accuracy of applying classifiers on 26 introduced genes compared to the results of former studies


The sensitivity, specificity, and accuracy of applying the fused classifiers to the 26 introduced genes are shown in Fig. 7 (the highest accuracy is obtained by the fusion of the RF and KNN classifiers on the test data, GSE 19444 [71]).

The genes common to the current study and the research of Sun et al. [6] and Bah et al. [8], shown in the Venn diagram of Fig. 6, were examined on GSE 19444 [71] (the test dataset) to introduce the genes that distinguish between the latent and active states of TB. The result is presented in Table 7.

Table 7 Distinguishing genes between the latent and active states of TB


Hierarchical clustering and accumulative clustering were used for further investigation and to obtain more assurance of the introduced genes that differentiate LTBI and ATB states.

The accumulative clustering indicates whether the genes located in one cluster and one data set are located similarly in the same cluster in another data set or not. This procedure is applied to the GSE37250 [72] and the GSE39939 [72] data sets.
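The consistency check described above can be sketched as follows, assuming that cluster labels for the genes have already been obtained in each dataset (for example, by hierarchical clustering). The gene names, the labels, and the function names are placeholders; the sketch shows only the pairwise comparison, not the clustering itself or the exact procedure used in the study.

```python
from itertools import combinations

def co_clustered_pairs(labels):
    """Unordered gene pairs that share a cluster label."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(labels), 2)
        if labels[a] == labels[b]
    }

def stable_pairs(labels_a, labels_b):
    """Gene pairs that fall in the same cluster in both datasets."""
    shared = set(labels_a) & set(labels_b)
    pairs_a = co_clustered_pairs({g: labels_a[g] for g in shared})
    pairs_b = co_clustered_pairs({g: labels_b[g] for g in shared})
    return pairs_a & pairs_b

# Placeholder cluster labels for the same genes in two datasets.
clusters_ds1 = {"gene1": 0, "gene2": 0, "gene3": 1, "gene4": 1}
clusters_ds2 = {"gene1": 2, "gene2": 2, "gene3": 0, "gene4": 1}

stable = stable_pairs(clusters_ds1, clusters_ds2)
```

Only the pair gene1-gene2 is retained here, because gene3 and gene4 share a cluster in the first dataset but not in the second. The cluster numbering itself does not need to match across datasets.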

Accumulative clustering for genes in Table 8 is shown in Fig. 8.

The colored significant relations shown in Fig. 8 are analyzed separately in Fig. 9. In Fig. 8, the labeled gene numbers on the right side correspond to the GSE37250 [72] dataset, and those on the left correspond to GSE39939 [72].

Genes CD36, PSMA4, TNFSF13B, DUSP2, STAT1, TSPO, GBP4, GBP5, SAMD9L, and DAB1 correspond to labeled gene numbers 1 to 10 in Fig. 9. The accumulative clustering analyses in Fig. 9 demonstrated that our gene pairs preferred to cluster within the topological and functional modules. Common gene pairs extracted from Table 8 show the strongest tendency in this regard. Genes 1, 2, 3, 4, 5, 6, 8, and 10, which are related to the latent state of tuberculosis, were located in identical or close clusters. The same tendency holds for genes 7 and 9, which are the genes that activate tuberculosis. The ATB-specific genes showed higher expression values and fold changes than the LTBI-specific genes. Furthermore, the ATB-related pairs generally displayed higher expression correlations and were more activated than their LTBI-related counterparts.

According to the results of Fig. 9, it was found that the pairs of genes GBP5-TSPO, STAT1-TNFSF13B, DUSP2-PSMG2, and DAB1-CD36, are the latent factors of tuberculosis, and the pair of genes SAMD9L-GBP4 are the factors of activating tuberculosis.

After applying cumulative clustering on the GSE19491 dataset [69] and GSE19444 dataset [71], we found that MATR3-NR2C2 gene pair is the cause of TB latency, and SAMD9L-GBP4 gene pair is the cause of TB activation.

The sensitivity, specificity, and accuracy of applying classifiers to the 10 introduced genes, compared to the results of former studies [7], are presented in Table 8.

Table 8 The sensitivity, specificity, and accuracy of applying classifiers on 10 introduced genes compared to the results of former studies


In order to compare the results presented in this study with previous studies in the field of LTBI diagnosis, the AUC value has been approximated using Eq. (21).

$${\rm{AUC}} \approx {{{\rm{Accuracy}} - 0.5} \over {0.5}}$$

(21)
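As a short worked example of this approximation, the sketch below applies Eq. (21) to an accuracy of 0.95, the overall accuracy reported in the abstract. The function name is illustrative.

```python
def approx_auc(accuracy):
    """Eq. (21): rescale accuracy so that 0.5 maps to 0 and 1.0 maps to 1."""
    return (accuracy - 0.5) / 0.5

auc_095 = approx_auc(0.95)  # (0.95 - 0.5) / 0.5, approximately 0.90
```

The formula is a linear rescaling: chance-level accuracy (0.5) maps to 0 and perfect accuracy maps to 1, so the approximated value is always lower than the accuracy itself except at 1.0.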

The most relevant studies [20,21,22,23,24, 26,27,28,29] in the field of detection of TB differential genes, which were reviewed in the related work section, are compared with the present study in Table 9 in terms of AUC. The genes introduced in Table 9 are the combination of the genes introduced in Step 6 and Step 8 of the proposed method: 26 discriminative genes are introduced in Step 6 and 10 discriminative genes in Step 8. Common genes between this study and previous studies are shown in bold.

Table 9 Comparison of the present study with previous studies


To compare the current study with the studies listed in Table 9 [20,21,22,23,24, 26,27,28,29], the following points can be highlighted:

• AUC:

In the present study, the AUC obtained is higher compared to several studies [20, 21, 24, 26, 29]. This indicates an improvement in the performance of the proposed model over previous methods.

• Number of Microarray Datasets:

In this study, the number of microarray datasets examined is greater than those used in all studies listed in Table 9 [21, 28]. This demonstrates a more comprehensive exploration of datasets, potentially providing broader applicability for our model.

• Number of Genes Introduced:

Furthermore, the number of genes introduced in the current study is smaller than the number introduced in several studies listed in Table 9 [20,21,22,23,24, 26,27,28,29], indicating a higher discriminatory ability of the introduced genes.

Studies that solely rely on bioinformatic methods such as protein-protein interaction networks (PPIN) without using machine learning [20,21,22, 26, 29] may face several challenges and weaknesses such as low prediction accuracy and inability to understand biological complexities.

To investigate the contribution of each component of the proposed method to the overall performance, an ablation study was conducted. Its results are presented in Table 10.

Table 10 The ablation study results in terms of sensitivity, specificity, and accuracy (applying RF-KNN classifiers on GSE 19444)
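For reference, the three measures named in the caption can be computed from a confusion matrix as in the following minimal sketch. It uses the standard definitions; the function name `diagnostic_metrics` and the counts are hypothetical and are not taken from Table 10.

```python
def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts for illustration only (not results from this study).
sens, spec, acc = diagnostic_metrics(tp=29, fn=1, tn=18, fp=2)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
```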

Conclusion

This study had two goals: first, to minimize the number of features of tuberculosis data; second, to classify tuberculosis data using the subset of genetic features obtained in the first step. Data fusion methods, MCDM, feature selection, and a PPI network were used to identify genes that distinguish LTBI from healthy controls and ATB. Filter feature selection methods and the SFFS method were used to increase the overlap of genes among studies and to improve the accuracy of introducing biomarkers with greater distinguishing power. The PPI network was used to obtain the weight matrix of the MCDM method.
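To make the role of a pairwise relation matrix in DEMATEL concrete, the sketch below runs the standard DEMATEL steps: normalize the direct-relation matrix, compute the total-relation matrix T = N(I − N)⁻¹, and derive prominence (R + C) and relation (R − C) values. The 3 × 3 matrix is hypothetical, and the way the PPI network populates the matrix in the proposed method is not reproduced here; this is an illustration under those assumptions, not the implementation used in this study.

```python
def dematel(direct):
    """Standard DEMATEL on a square direct-relation matrix (list of lists)."""
    n = len(direct)
    row_max = max(sum(row) for row in direct)
    col_max = max(sum(direct[i][j] for i in range(n)) for j in range(n))
    scale = max(row_max, col_max)
    norm = [[v / scale for v in row] for row in direct]

    # Solve (I - N) T = N by Gauss-Jordan elimination on the block [I - N | N].
    aug = [
        [(1.0 if i == j else 0.0) - norm[i][j] for j in range(n)] + norm[i][:]
        for i in range(n)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("I - N is singular; total-relation matrix undefined")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    total = [row[n:] for row in aug]

    given = [sum(row) for row in total]                                # R_i
    received = [sum(total[i][j] for i in range(n)) for j in range(n)]  # C_j
    prominence = [r + c for r, c in zip(given, received)]
    relation = [r - c for r, c in zip(given, received)]
    return norm, total, prominence, relation


# Hypothetical pairwise relation scores among three genes (not from the paper).
direct = [
    [0.0, 0.8, 0.3],
    [0.5, 0.0, 0.6],
    [0.2, 0.4, 0.0],
]
norm, total, prominence, relation = dematel(direct)
weights = [p / sum(prominence) for p in prominence]  # normalized feature weights
print([round(w, 3) for w in weights])
```

Normalizing the prominence values so that they sum to one is one common way of turning DEMATEL output into feature weights; other weighting schemes are possible.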

In data fusion at the feature level, the IDE feature selection method was applied to the top 500 genes according to the best classifier identified for each feature selection criterion. In data fusion at the decision level, it was determined which classifiers should be fused to achieve better results. In this paper, a new approach based on Dempster-Shafer and Yager’s theory was proposed to fuse the outputs of the classifiers. The proposed method therefore introduces a suitable set of feature selection criteria and a suitable set of classifiers for achieving a reliable diagnosis of latent tuberculosis. According to the results, the fusion of the correlation coefficient, MIM [11], MIFS [11], and entropy feature selection methods, together with the fusion of the RF and KNN classifiers, can be used to identify latent tuberculosis genes.
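The difference between the two combination rules can be illustrated with a small sketch. It assumes a two-class frame {LTBI, ATB} and hypothetical mass assignments for the RF and KNN classifiers; the function name `combine` and all numbers are illustrative, and the sketch shows the textbook Dempster and Yager rules rather than the modified fusion approach proposed in this paper.

```python
from itertools import product

FRAME = frozenset({"LTBI", "ATB"})
LTBI = frozenset({"LTBI"})
ATB = frozenset({"ATB"})


def combine(m1, m2, frame, rule="yager"):
    """Combine two mass functions (dicts: frozenset -> mass) on the same frame."""
    joint = {}
    conflict = 0.0
    for (b, mass_b), (c, mass_c) in product(m1.items(), m2.items()):
        inter = b & c
        if inter:
            joint[inter] = joint.get(inter, 0.0) + mass_b * mass_c
        else:
            conflict += mass_b * mass_c  # mass falling on the empty set
    if rule == "dempster":
        if conflict >= 1.0:
            raise ValueError("sources are in total conflict")
        # Dempster: renormalize the non-conflicting mass.
        return {a: v / (1.0 - conflict) for a, v in joint.items()}
    if rule == "yager":
        # Yager: move the conflicting mass to the whole frame (ignorance).
        joint[frame] = joint.get(frame, 0.0) + conflict
        return joint
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical classifier outputs expressed as mass functions.
rf_mass = {LTBI: 0.7, ATB: 0.2, FRAME: 0.1}
knn_mass = {LTBI: 0.6, ATB: 0.3, FRAME: 0.1}

dempster = combine(rf_mass, knn_mass, FRAME, rule="dempster")
yager = combine(rf_mass, knn_mass, FRAME, rule="yager")
print({tuple(sorted(k)): round(v, 4) for k, v in yager.items()})
```

With these hypothetical masses, the conflicting mass is 0.33. Dempster’s rule redistributes it proportionally across the focal sets, whereas Yager’s rule assigns it to the whole frame, so the fused result keeps an explicit measure of uncertainty.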

Finally, 26 genes were selected, some of which were shared with the results of previous studies [6,7,8]. Compared with the studies conducted by Wang et al. [19] and Bah et al. [7], our study identified more LTBI genes with higher accuracy, analyzed more datasets, and provided a more limited set of genes differentiating LTBI. The main weaknesses of other approaches [7, 8] are the low accuracy of previous biomarkers, the lack of stability due to the small overlap of genes among studies, and the lack of integration of helpful information such as the PPI network. The introduced genes can be used in many applications, such as disease risk prediction systems.

At the end of the current research, with the help of hierarchical clustering and accumulative clustering, the introduced genes were reanalyzed to differentiate between latent and ATB states more reliably. Additionally, several pairs of genes responsible for the activation and hiding of tuberculosis were introduced.

In the future, the proposed procedure may be extended and applied to other datasets, and new classification and feature selection methods may be used to diagnose tuberculosis and other diseases. The current study applied traditional machine learning methods; in future work, deep learning techniques could be applied instead to improve the results [74].

Data availability

The datasets used in this study were collected from the NCBI website (https://www.ncbi.nlm.nih.gov/). The part of the data related to the results of the article has been uploaded as Supplementary Material and is included in Appendix E of the paper.

References

Larry Jameson J, Fauci AS, Kasper DL, Hauser SL, Longo DL, Loscalzo J. Harrison’s principles of Internal Medicine. Twentieth ed.: The McGraw-Hill Companies; 1950. pp. 216–1488.
Meraj SS, Yaakob R, Azreen Azman. Artificial intelligence in diagnosing tuberculosis: a review. Int J Adv Sci Eng Inform Technol. 2019;9. https://doiorg.publicaciones.saludcastillayleon.es/10.18517/ijaseit.9.1.7567.
Mithra KS, Sam Emmanuel WR. GFNN: gaussian-fuzzy-neural network for diagnosis of tuberculosis using sputum smear microscopic images. J King Saud Univ Comput Inf Sci. 2018;1319–1578. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.jksuci.2018.08.004
Alessandra Tessitore G, Cicciarelli FD, Vecchio A, Gaggiano D, Verzella M, Fischietti D, Vecchiotti D, Capece F, Zazzeroni, Edoardo Alesse. MicroRNAs in the DNA damage/repair network and cancer. Int J Genomics. 2014;12:32–42. https://doiorg.publicaciones.saludcastillayleon.es/10.1155/2014/820248.
Hala Alshamlan G, Badr, Yousef Alohali. mRMR-ABC: A hybrid gene selection algorithm for cancer classification using microarray gene expression profiling. Biomed Res Int. 2015;2015:604910. https://doiorg.publicaciones.saludcastillayleon.es/10.1155/2015/604910
Sun J, Shi Q, Chen X, Liu R. Decoding the similarities and specific differences between latent and active tuberculosis infections based on consistently differential expression networks. Brief Bioinform. 2020;21(6):2084–98. https://doiorg.publicaciones.saludcastillayleon.es/10.1093/bib/bbz127.
Deng M, Lv X-D, Fang Z-X, Xie X-S, Wen-Yu Chen. The blood transcriptional signature for active and latent. Infect Drug Resist. 2019;12:321–8. https://doiorg.publicaciones.saludcastillayleon.es/10.2147/IDR.S184640.
Bah SY, Forster T, Dickinson P, Kampmann B, Ghazal P. Meta-analysis identification of highly robust and differential immune-metabolic signatures of systemic host response to acute and latent tuberculosis in children and adults. Front Genet. 2018;9. https://doiorg.publicaciones.saludcastillayleon.es/10.3389/fgene.2018.00457
Niloofar Tavasoli K, Rezaee M, Momenzadeh, Mohammadreza Sehhati. An ensemble soft weighted gene selection-based approach and cancer classification using modified metaheuristic learning. J Comput Des Eng. 2021;8(4):1172–89. https://doiorg.publicaciones.saludcastillayleon.es/10.1093/jcde/qwab039.
Mohammad, Atai. Multi-criteria decision-making. third ed. Shahroud University of Technology; 2017.
Wang X, Guo B, Shen Y, Zhou C, Xuliang Duan. Input Feature Selection Method Based on Feature Set Equivalence and Mutual Information Gain Maximization. 2019;7. https://doiorg.publicaciones.saludcastillayleon.es/10.1109/ACCESS.2019.2948095
Maghsoudloo M, Jamalkandi SA, Najafi A, Masoudi-Nejad A. An efficient hybrid feature selection method to identify potential biomarkers in common chronic lung inflammatory diseases. Genomics. 2020;112(5):3284–93. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.ygeno.2020.06.010.
Liangwei Yang H, Gao K, Wu H, Zhang C, Li, Lixia Tang. Identification of cancerlectins by using cascade linear discriminant analysis and optimal g-gap tripeptide composition. Bentham Sci Publishers. 2020;15(6):528–37. https://doiorg.publicaciones.saludcastillayleon.es/10.2174/1574893614666190730103156.
Li H-F, Wang X-F, Tang H. Predicting bacteriophage enzymes and hydrolases by using combined features. Front Bioeng Biotechnol. 2015;8:183.
Ding H, Li D. Identification of mitochondrial proteins of malaria parasite using analysis of variance. Amino Acids. 2014;47(2):329–33.
Tabatabaei A, Derhami V, Sheikhpour R, Pajoohan M-R. Diagnosis of breast Cancer subtypes using the selection of effective genes from microarray data. Iran Q J Breast Disease. 2019;12(1):39–47.
Somol P, Novovicova J, Pudil JP. Flexible hybrid sequential floating search in statistical feature selection. In: Lecture Notes in Computer Science. Vol. 41. Berlin: Springer-Verlag; 2006. p. 632-639.
Shirbani F, Soltanian Zadeh H. Fast SFFS-Based algorithm for feature selection in Biomedical Datasets. Amirkabir Int J Sci Res (Electr Electron Eng). 2013;45(2):43-56.
Zhang Wang S, Arat M, Magid-Slav JR, Brown. Meta-analysis of human gene expression in response to Mycobacterium tuberculosis infection reveals potential therapeutic targets. BMC Syst Biol. 2018;12:3. https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12918-017-0524-z.
Zhang Xiang-juan, Xu Hai-shan, Li Chong-hui, Fu Yu-rong. Zheng-Jun Yi. (2021) Up-regulated SAMD9L modulated by TLR2 and HIF-1α as a promising biomarker in tuberculosis. J Cell Mol Med. 2021. https://doiorg.publicaciones.saludcastillayleon.es/10.1111/jcmm.17307
Liwei Wu, Qiliang Cheng. IRF1 as a potential biomarker in Mycobacterium tuberculosis infection. J Cell Mol Med. 2021. https://doiorg.publicaciones.saludcastillayleon.es/10.1111/jcmm.16756.
Sudhakar Natarajan M, Ranganathan LE, Hanna, Srikanth Tripathy. Transcriptional profiling and deriving a seven-gene signature that discriminates active and latent tuberculosis: An integrative bioinformatics approach. 2022. https://doiorg.publicaciones.saludcastillayleon.es/10.3390/genes13040616
Yuchen Liu L, Zhang F, Wu Y, Liu Y, Li Y, Chen. Identification and validation of a pyroptosis-related signature in identifying ATB via a deep learning algorithm. Front Cell Infect Microbiol. 2023. https://doiorg.publicaciones.saludcastillayleon.es/10.3389/fcimb.2023.1273140.
Dai X, Zhou L, He X, Hua J, Chen L, Yingying Lu. Identification of apoptosis-related gene signatures as potential biomarkers for differentiating active from latent tuberculosis via bioinformatics analysis. Front Cell Infect Microbiol. 2024. https://doiorg.publicaciones.saludcastillayleon.es/10.3389/fcimb.2024.1285493.
Delgobo M, Mendes DA, Kozlova E, Rocha EL, Rodrigues-Luiz GF, Mascarin L, Dias G, Patrício DO, Dierckx T, Bicca MA, Bretton G. An evolutionary recent IFN/IL-6/CEBP axis is linked to monocyte expansion and tuberculosis severity in humans. eLife. 2019;8:e47013. https://doiorg.publicaciones.saludcastillayleon.es/10.7554/eLife.47013.
Liang Chen J, Hua, Xiaopu He. Coexpression network analysis-based identification of critical genes differentiating between latent and ATB. Dis Markers. 2022. https://doiorg.publicaciones.saludcastillayleon.es/10.1155/2022/2090560
Yang Yu J, Hua, Liang Chen. Autophagy-related molecular clusters identified as indicators for distinguishing active and latent TB infection in pediatric patients. BMC Pediatr. 2024. https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12887-024-04881-1.
Liang Chen J, Hua, Xiaopu He. Identification of cuproptosis-related molecular subtypes as a biomarker for differentiating active from latent tuberculosis in children. BMC. 2023. https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12864-023-09491-2.
Chengbin Wang J, Hua X, He L, Chen S, Lv. A diagnostic model for distinguishing between ATB and latent tuberculosis infection based on the blood expression profiles of autophagy-related genes. Ther Adv Respir Dis. 2023. https://doiorg.publicaciones.saludcastillayleon.es/10.1177/17534666231217798.
Meng T, Jing X, Yan Z, Pedrycz W. A survey on machine learning for data fusion. Inf Fusion. 2020;57:115–29. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.inffus.12.001.
Ser AD-OJD, Galar D, Basilio Sierra. Data Fusion and Machine Learning for Industrial Prognosis: Trends and Perspectives towards Industry 4.0. Inf Fusion. 2018. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.inffus.2018.10.005
Ali F, El-Sappagh S, Islam SR, Kwak D, Ali A, Imran M, Kwak KS. A smart healthcare monitoring system for heart disease prediction based on ensemble deep learning and feature fusion. Inf Fusion. 2020;63:208–22. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.inffus.2020.06.008.
Hu F, Huang M, Sun J, Zhang X, Liu J. An analysis model of diagnosis and treatment for COVID-19 pandemic based on medical information fusion. Inf Fusion. 2021;73:11–21. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.inffus.2021.02.016.
Simjanoska M, Kochev S, Tanevski J, Bogdanova AM, Papa G, Eftimov T. Multi-level information fusion for learning a blood pressure predictive model using sensor data. Inf Fusion. 2020;58:24–39. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.inffus.2019.12.008.
Cheng RR-FB, Saif M, Majid Ahmadi. Similarity-learning information-fusion schemes for missing data imputation. Knowl Based Syst. 2019. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.knosys.2019.06.013.
Nachappa TG, Piralilou ST, Gholamnia K, Ghorbanzadeh O, Rahmati O, Blaschke T. Flood susceptibility mapping with machine learning, multi-criteria decision analysis and ensemble using Dempster Shafer Theory. J Hydrol. 2020;590:125275. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.jhydrol.2020.125275.
Sara Razi MRK, Mollaei, Jamal Ghasemi. A novel method for classification of BCI multi-class motor imagery task based on Dempster–Shafer theory. Inf Sci. 2019;484:14–26. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.ins.2019.01.053.
Wang Y, Yang H, Wang X, Zhang R. Distributed intrusion detection system based on data fusion method. In: Proceedings of the Fifth World Congress on Intelligent Control and Automation. 2004. p. 4331-4334.
Saeed F, Khan MA, Sharif M, Mittal M, Goyal LM, Roy S. Deep neural network features fusion and selection based on PLS regression with an application for crops diseases classification. Appl Soft Comput. 2021;103:107164. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.asoc.2021.107164.
Habiba Arshad MA, Khan MI, Sharif M, Yasmin RS, Tavares Y-D, Zhang. Suresh Chandra Satapathy. A multilevel Paradigm for Deep Convolutional Neural Network Features Selection with an Application to Human Gait Recognition. Expert Syst. 2020. https://doiorg.publicaciones.saludcastillayleon.es/10.1111/exsy.12541
Chatterjee, Siami Namin. A fuzzy Dempster–Shafer classifier for detecting web spams. J Inform Secur Appl. 2021;59. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.jisa.2021.102793.
Xianghong Tang XG, LeiRao JL. A single fault detection method of gearbox based on random forest hybrid classifier and improved Dempster-Shafer information fusion. Comput Electr Eng. 2021;92:107101. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.compeleceng.2021.107101.
Wang L, Mo T, Wang X, Chen W, He Q, Li X, Zhang S, Yang R, Wu J, Gu X, Wei J. A hierarchical fusion framework to integrate homogeneous and heterogeneous classifiers for medical decision-making. Knowl Based Syst. 2021;212:106517. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.knosys.2020.106517.
Kim C, Lee H, Seol H, Changyong Lee. Identifying core technologies based on technological cross-impacts: an association rule mining (ARM) and analytic network process (ANP) approach. Expert Syst Appl. 2011;38(12):12559-12564. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.eswa.2011.04.042
Hashemi A, Dowlatshahi MB, Nezamabadi-Pour H. MFS-MCDM: Multi-label feature selection using multi-criteria decision-making. Knowl Based Syst. 2020;206:106365. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.knosys.2020.106365.
He Q, Li X, Kim DN, Jia X, Gu X, Zhen X, Zhou L. Feasibility study of a multi-criteria decision-making based hierarchical model for multi-modality feature and multi-classifier fusion: applications in medical prognosis prediction. Inform Fusion. 2020;55:207–19. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.inffus.2019.09.001.
Payam Farhadi M, Niyas N, Shokrpour, Ramin Ravangard. Prioritizing Factors Affecting Health Service Quality using Integrated fuzzy DEMATEL and ANP: a case of Iran. Open Public Health J. 2020;13:263–72. https://doiorg.publicaciones.saludcastillayleon.es/10.2174/1874944502013010263.
Hsieh Y-F, Lee Y-C, Lin S-B. Rebuilding DEMATEL threshold value: an example of a food and beverage information system. SpringerPlus. 2016;5:1385. https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s40064-016-3083-7
Vimal KS, Rajendra NDL, Jasjit SS S.Suri. A novel approach to multiclass psoriasis machine Disease risk stratification: learning paradigm. Biomed Signal Process Control. 2016;28:27–40. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.bspc.2016.04.001.
Chinedu PascalEzenkwu U, IdioAkpan, Bliss Utibe-AbasiStephen. A class-specific metaheuristic technique for explainable relevant feature selection. Mach Learn Appl. 2021;6:100142. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.mlwa.2021.100142
Xu W, Li Q, Liu X, Zhen Z, Wu X. Comparison of feature selection methods based on discrimination and reliability for fMRI decoding analysis. J Neurosci Methods. 2020;335:108567. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.jneumeth.2019.108567.
Elaheh Yadegaridehkordi M, Hourmand M, Nilashi, LiyanaShuib A, Ahani, Othman Ibrahim. Influence of big data adoption on manufacturing companies’ performance: an integrated DEMATEL-ANFIS approach. Technol Forecast Soc Change. 2018;137. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.techfore.2018.07.043.
Ehsan Saghapour S, Kermani, Mohammadreza Sehhati. A novel feature ranking method for prediction of cancer stages using proteomics data. PLoS ONE. 2017. https://doiorg.publicaciones.saludcastillayleon.es/10.1371/journal.pone.0184203.
Jeffrey DU. Mining of massive datasets. Camb Univ Press. 2011;112–226. https://doiorg.publicaciones.saludcastillayleon.es/10.1017/CBO9781139924801.
Richard O, Duda PE, Hart DG, Stork. Pattern classification, 2nd ed., 2003.
Ala’a El-Nabawy N, El-Bendary NA, Belal. A feature-fusion framework of clinical, genomics, and histopathological data for METABRIC breast cancer subtype classification. Appl Soft Comput J. 2020. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.asoc.2020.106238.
Majid Khazaee AS, Nobari. Application of Improved Distance Evaluation Technique in Feature Selection of Vibration for Steel Beam. In: Proceedings of the 3rd International Conference on Acoustic and Vibration (ISAV2013). 2013.
Chen TM, Venkataramanan V. Dempster–Shafer theory for intrusion detection in ad hoc networks. In: Proceedings of the IEEE Internet Computing; November 2005. p. 35-41.
Qifeng, Zhou et al. The structural damage detection based support on posteriori probability vector machine and Dempster–Shafer evidence theory. Appl Soft Comput. 2015. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.asoc.2015.06.057
Yu-Jie Wang. Interval-valued fuzzy multi-criteria decision-making based on simple additive weighting and relative preference relation. Inf Sci. 2019;503:319-335. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.ins.2019.07.012.
Adel Azar F, Khosravani. Soft Operations Research (Problem Structuring Approaches), Industrial Management Institute, second ed., 2009.
Du Y-W, Wen Zhou. New improved DEMATEL method based on both subjective experience and objective data. Eng Appl Artif Intell. 2019;83:57-71. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.engappai.2019.05.001
Alanis-Lobato G, Andrade-Navarro MA, Schaefer MH. HIPPIE v2.0: enhancing meaningfulness and reliability of protein–protein interaction networks. Nucleic Acids Res. 2017. https://doiorg.publicaciones.saludcastillayleon.es/10.1093/nar/gkw985.
Chatr-Aryamontri A, Oughtred R, Boucher L, et al. The BioGRID interaction database: 2017 update. Nucleic Acids Res. 2017. https://doiorg.publicaciones.saludcastillayleon.es/10.1093/nar/gky1079.
Keshava Prasad TS, Goel R, Kandasamy K et al. Human protein reference database–2009 update. Nucleic Acids Res 2009;37:D767–72.
Lun H, Yang S, Luo X, Yuan H, Sedraoui K, MengChu Zhou. A Distributed Framework for Large-scale Protein-protein Interaction Data Analysis and Prediction Using MapReduce. IEEE/CAA J Automatica Sinica. 2022. https://doiorg.publicaciones.saludcastillayleon.es/10.1109/JAS.2021.1004198
Xiaojuan Wang W, Yang Y, Yang Y, He J, Zhang L, Wang, Lun, Hu. PPISB: a Novel Network-based Algorithm of Predicting protein-protein interactions with mixed membership Stochastic Blockmodel. IEEE/ACM Trans Comput Biol Bioinform. 2023. https://doiorg.publicaciones.saludcastillayleon.es/10.1109/TCBB.2022.3196336
Lun Hu, Keith CC, Chan. Discovering variable-length patterns in protein sequences for protein-protein Interaction Prediction. IEEE Trans Nanobiosci. 2015. https://doiorg.publicaciones.saludcastillayleon.es/10.1109/TNB.2015.2429672.
Matthew PR, Berry CM, Graham FW, McNab Z, Xu, Susannah AA, Bloch T, Oni KA, Wilkinson R, Banchereau J, Skinner RJ, Wilkinson C, Quinn D, Blankenship R, Dhawan JJ, Cush A, Mejias O, Ramilo, Onn M, Kon. An interferon-inducible neutrophil-driven blood transcriptional signature in human tuberculosis. Nature. 2010;466:973–7. https://doiorg.publicaciones.saludcastillayleon.es/10.1542/peds.2011-2107LLLL. Virginia Pascual, Jacques Banchereau, Damien Chaussabel, Anne O’Garra.
Kalum Clayton ME, Polak CH, Woelk, Paul Elkington. Gene expression signatures in Tuberculosis have Greater Overlap with Autoimmune diseases than with infectious diseases. Am J Respir Crit Care Med. 2017;196(5):655–6. https://doiorg.publicaciones.saludcastillayleon.es/10.1164/rccm.201706-1248LE.
Chuan Wang S, Yang G, Tang SX, Lu S, Neyrolles O, Qian Gao. Comparative miRNA expression profiles in individuals with latent and active. PLoS One e. 2011;25832:6–10. https://doiorg.publicaciones.saludcastillayleon.es/10.1371/journal.pone.0025832.
Suzanne T, Anderson M, Kaforou AJ, Brent VJ, Wright CM, Banwell G, Chagaluka, Amelia C, Crampin, Hazel M, Dockrell N, French MS, Hamilton ML, Hibberd F, Kern PR, Langford L, Ling R, Mlotha, Tom HM, Ottenhoff S, Pienaar V, Pillay J, Anthony G, Scott H, Twahir RJ, Wilkinson, Lachlan J, Coin RS, Heyderman M, Levin, Brian Eley. Diagnosis of childhood tuberculosis and host RNA expression in Africa. N Engl J Med. 2014;370:1712-1723. https://doiorg.publicaciones.saludcastillayleon.es/10.1056/NEJMoa1303657
STRING Consortium. 2022. Available: https://string-db.org/
Yue Yang X, Su B, Zhao GD, Li P, Hu J, Zhang, Lun Hu. Fuzzy-based deep attributed graph clustering. IEEE Trans Fuzzy Syst. 2023. https://doiorg.publicaciones.saludcastillayleon.es/10.1109/tfuzz.2023.3338565.

Acknowledgements

Not applicable.

Funding

This research received no external funding.

Author information

Authors and Affiliations

Faculty of Computer Engineering, University of Isfahan, Isfahan, Iran
Somayeh Ayalvari & Marjan Kaedi
Department of Biomedical Engineering, School of Advanced Medical Technology, Isfahan University of Medical Sciences, Isfahan, Iran
Mohammadreza Sehhati

Contributions

All authors contributed to the study. S.A. performed data analysis, developed the method, prepared the tables and figures, analyzed the results, and wrote the paper. M.K. supervised the research, developed the method, analyzed the results, and wrote, reviewed, and edited the paper. M.S. prepared the dataset, reviewed and edited the paper, and designed the tables and figures. S.A., M.K. and M.S. read and approved the final manuscript.

Corresponding author

Correspondence to Marjan Kaedi.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable, as the work was carried out on publicly available datasets.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Ayalvari, S., Kaedi, M. & Sehhati, M. A modified multiple-criteria decision-making approach based on a protein-protein interaction network to diagnose latent tuberculosis. BMC Med Inform Decis Mak 24, 319 (2024). https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12911-024-02668-z

Received: 28 April 2024
Accepted: 05 September 2024
Published: 30 October 2024
DOI: https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12911-024-02668-z
