Here we provide statistical code and data for the paper:
Presson AP, Sobel EM, Papp JC, Suarez CJ, Whistler T, Rajeevan MS, Vernon SD, Horvath S (2008) Integrated weighted gene co-expression network analysis with an application to chronic fatigue syndrome. BMC Systems Biology 2:95.
Background: Systems biologic approaches such as Weighted Gene Co-expression Network Analysis (WGCNA) can effectively integrate gene expression and trait data to identify pathways and candidate biomarkers. Here we show that the additional inclusion of genetic marker data allows one to characterize network relationships as causal or reactive in a chronic fatigue syndrome (CFS) data set.
Results:
We combine WGCNA with genetic marker data to identify a disease-related pathway and its causal drivers, an analysis which we refer to as "Integrated WGCNA" or IWGCNA. Specifically, we present the following IWGCNA approach: 1) construct a co-expression network, 2) identify trait-related modules within the network, 3) use a trait-related genetic marker to prioritize genes within the module, 4) apply an integrated gene screening strategy to identify candidate genes and 5) carry out causality testing to verify and/or prioritize results. By applying this strategy to a CFS data set consisting of microarray, SNP and clinical trait data, we identify a module of 299 highly correlated genes that is associated with CFS severity. Our integrated gene screening strategy results in 20 candidate genes. We show that our approach yields biologically interesting genes that function in the same pathway and are causal drivers for their parent module. We use a separate data set to replicate findings and use Ingenuity Pathways Analysis software to functionally annotate the candidate gene pathways.
Conclusions:
We show how WGCNA can be combined with genetic marker data to identify disease-related pathways and the causal drivers within them. The systems genetics approach described here can easily be used to generate testable genetic hypotheses in other complex disease studies.
Data, R Software Tutorials, and Analysis Outline (Last Updated: 5/11/10)
The chronic fatigue syndrome data were generated by the Centers for Disease Control and Prevention (CDC) and generously provided as a challenge data set for the 2006 Critical Assessment of Microarray Data Analysis (CAMDA) conference.
Data: The file CFS.Data.zip (11.2 MB) contains 6 data files: "Clinical_data_CFS.txt", "CFS_trait_legend.xls", "Expression_data_CFS.txt", "SNP_data_CFS.txt", "std-analysis-29-candidate-genes-IPA.txt" and "CFS_trait_data_127x47.txt".
Network & causality functions: The file IWGCNA_2010.zip (258 KB) contains 4 files with the R functions required for the IWGCNA: "NetworkFunctions_Jan2010.txt", "neo.txt", "sma_package.txt" and "CausalityFunctions.txt". These functions were first posted in 2008 and received minor updates on 1/26/10 in response to valuable user input.
The tutorial for the CFS integrated weighted gene co-expression network analysis (IWGCNA) is available in both MS Word (CFS_Online_Tutorial_Jan2010.doc) and Adobe Acrobat (CFS_Online_Tutorial_Jan2010.pdf) formats. The tutorial contains all analyses described in our manuscript and was updated with minor changes on 1/26/10 (see the green bolded comments in the tutorial) to reflect valuable user input.
Construction of a gene co-expression network
Data pre-processing
Code to remove outlying arrays
Code to remove outlying genes
Remove all arrays/samples relating to the intake classification control group (level 5); this leaves 127 arrays
Use soft thresholding to determine the power for transforming the correlation matrix into an adjacency matrix
Reduce the 8966-gene set to a more manageable size (~3000 genes) by discarding genes with low connectivity
Create the adjacency and topological overlap matrices
Use hierarchical clustering to define gene modules
Check that these modules are legitimate using heat maps and multi-dimensional scaling plots
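The network construction steps listed above (soft thresholding, adjacency matrix, topological overlap matrix) can be illustrated with a small sketch. This is plain Python written for illustration only, not the R code from the tutorial. It assumes an unsigned adjacency of the form |cor|^power and the usual topological overlap formula; the function names, the toy expression values and the power of 6 are all hypothetical choices.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def adjacency(expr, power):
    """Unsigned soft-threshold adjacency: a_ij = |cor(x_i, x_j)|^power, diagonal 0.

    expr is a list of gene profiles; each profile is a list of sample values.
    """
    g = len(expr)
    adj = [[0.0] * g for _ in range(g)]
    for i in range(g):
        for j in range(i + 1, g):
            a = abs(pearson(expr[i], expr[j])) ** power
            adj[i][j] = adj[j][i] = a
    return adj

def connectivity(adj):
    """Whole-network connectivity k_i = sum over j of a_ij."""
    return [sum(row) for row in adj]

def topological_overlap(adj):
    """TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    g = len(adj)
    k = connectivity(adj)
    tom = [[1.0 if i == j else 0.0 for j in range(g)] for i in range(g)]
    for i in range(g):
        for j in range(i + 1, g):
            shared = sum(adj[i][u] * adj[u][j] for u in range(g) if u != i and u != j)
            t = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
            tom[i][j] = tom[j][i] = t
    return tom

# Toy data (hypothetical): 3 genes measured on 5 samples.
toy_expr = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],
    [3.0, 1.0, 4.0, 1.0, 5.0],
]
toy_power = 6  # illustrative value only; the tutorial chooses the power from the data
toy_adj = adjacency(toy_expr, toy_power)
toy_k = connectivity(toy_adj)
toy_tom = topological_overlap(toy_adj)
```

In a sketch like this, the dissimilarity 1 - TOM would be the input to hierarchical clustering, and low-connectivity genes could be dropped by ranking `toy_k` before the overlap matrix is computed.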
Examining network properties
Create data subsets
Compute the SNP significance measure for each subgroup
Compute the connectivity for each subgroup
Construct correlation bar and scatter plots stratified by module to compare the male and female samples
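The subgroup computations in this part of the outline can be sketched as follows. The example assumes that the SNP significance of a gene is the absolute Pearson correlation between its expression and an additively coded SNP (0, 1 or 2 copies of an allele), computed separately within each subgroup. The function names, group labels and toy values are hypothetical, and the code is an illustration in plain Python rather than the tutorial's R code.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0

def split_by_group(values, labels):
    """Split a per-sample list into {label: [values for that label]}."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return groups

def snp_significance_by_group(expr, snp, labels):
    """Return {label: [|cor(gene, SNP)| for each gene]} computed within each subgroup."""
    snp_groups = split_by_group(snp, labels)
    result = {}
    for label, snp_values in snp_groups.items():
        result[label] = [
            abs(pearson(split_by_group(gene, labels)[label], snp_values))
            for gene in expr
        ]
    return result

# Toy data (hypothetical): 2 genes, 8 samples, two subgroups.
toy_labels = ["M", "M", "M", "M", "F", "F", "F", "F"]
toy_snp = [0, 1, 2, 1, 0, 1, 2, 2]
toy_expr = [
    [1.0, 2.0, 3.0, 2.0, 5.0, 1.0, 3.0, 2.0],  # tracks the SNP in group "M" only
    [4.0, 4.5, 3.9, 4.2, 1.0, 2.1, 2.9, 3.2],  # tracks the SNP in group "F"
]
toy_gs_snp = snp_significance_by_group(toy_expr, toy_snp, toy_labels)
```

The same splitting helper could be reused to compute connectivity within each subgroup, which would give the per-group quantities that the stratified bar and scatter plots compare.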
Gene screening strategy
Examine quantiles of the connectivities and correlations between the gene expressions, severity and SNP data
Screen for genes based on correlation thresholds imposed in both males and homogenized females
The screening strategy results in 20 candidate genes
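The idea of screening with thresholds that must hold in both subgroups can be sketched as a simple set intersection. The statistic names, threshold values and gene names below are hypothetical and do not reproduce the cut-offs used in the tutorial; the sketch only shows the shape of the filter, in plain Python rather than R.

```python
def passes(stats, thresholds):
    """True if one gene's statistics meet every threshold (correlations by absolute value)."""
    return (
        abs(stats["cor_severity"]) >= thresholds["cor_severity"]
        and abs(stats["cor_snp"]) >= thresholds["cor_snp"]
        and stats["connectivity"] >= thresholds["connectivity"]
    )

def screen_genes(stats_by_group, thresholds):
    """Return the sorted genes that pass the thresholds in every subgroup."""
    surviving = None
    for group_stats in stats_by_group.values():
        passing = {gene for gene, s in group_stats.items() if passes(s, thresholds)}
        surviving = passing if surviving is None else surviving & passing
    return sorted(surviving or [])

# Toy per-gene statistics (hypothetical values) for two subgroups.
toy_stats = {
    "males": {
        "geneA": {"cor_severity": 0.35, "cor_snp": -0.30, "connectivity": 0.8},
        "geneB": {"cor_severity": 0.40, "cor_snp": 0.25, "connectivity": 0.7},
        "geneC": {"cor_severity": 0.50, "cor_snp": 0.40, "connectivity": 0.1},
    },
    "females": {
        "geneA": {"cor_severity": 0.28, "cor_snp": -0.22, "connectivity": 0.9},
        "geneB": {"cor_severity": 0.05, "cor_snp": 0.30, "connectivity": 0.6},
        "geneC": {"cor_severity": 0.45, "cor_snp": 0.35, "connectivity": 0.2},
    },
}
toy_thresholds = {"cor_severity": 0.2, "cor_snp": 0.2, "connectivity": 0.5}
candidates = screen_genes(toy_stats, toy_thresholds)
```

Requiring a gene to pass in every subgroup is stricter than screening the pooled samples, which is the point of imposing the thresholds in both groups. In practice the thresholds would be chosen after inspecting the quantiles of each statistic.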
Second data set results
First check for outliers
Compute the equivalent connectivities and correlations in the second data set
Create a gene co-expression network based on the second data set samples and color the clustered genes by their definitions in the original (127 sample) data set
Now check whether the same candidate genes are selected when a similar screening strategy is applied
Summarizing the results in a table of correlations
Causality analysis using LEO (single.marker.analysis)
Calculate LEO.NB.SingleMarker scores for all genes in the candidate module using all samples with severity scores (87)
Calculate LEO.NB.SingleMarker scores for all genes in the candidate module using male and homogenized females with severity scores (76)
Standard analysis of trait and gene expression data (ignoring the SNP marker)
Calculate p-values and q-values for the correlation between severity and the gene expression data; 346 genes have the smallest q-values
These 346 genes were analyzed using Ingenuity Pathways Analysis (IPA) software (August 2008) and the top network was selected, which consisted of 29 candidate genes
Calculate correlations for the 29 candidate genes selected using IPA
Calculate the correlations between these 29 candidate genes and severity
Calculate their correlations with SNP12
Calculate the ranks of their correlations with the blue module eigengene (out of 8966 genes)
Extract these results for the 20 IWGCNA genes
Compare the 20 IWGCNA results with the 29 standard analysis candidates
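For the standard analysis, the step from p-values to a multiple-testing-adjusted ranking can be illustrated with Benjamini-Hochberg adjusted p-values. This is an assumption made for the sake of a simple example: the tutorial may compute q-values with a different estimator, and the function name and toy p-values below are hypothetical.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy p-values (hypothetical), one per gene.
toy_pvals = [0.01, 0.04, 0.03, 0.005]
toy_adjusted = bh_adjust(toy_pvals)

# Rank genes by adjusted p-value; the top of this ranking would be passed on.
toy_ranking = sorted(range(len(toy_pvals)), key=lambda i: (toy_adjusted[i], toy_pvals[i]))
```

Genes at the top of such a ranking would form the list handed to pathway software, analogous to the 346 genes mentioned in the outline.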