Simulation Studies for Weighted and Unweighted Co-expression networks

Tova Fuller, Steve Horvath

Human Genetics and Department of Biostatistics
University of California, Los Angeles

The following simulation studies are based on an example in the following paper:

Bin Zhang and Steve Horvath (2005) "A General Framework for Weighted Gene Co-Expression Network Analysis", Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 17.

General Description of the Simulation Model

Here we present the original simulated network model, it has some of the properties observed in real networks. An explanation of this model can also be found in the article Zhang and Horvath (2005). The example exhibits a nearly optimal (simulated) signal for weighted and unweighted networks constructed using the scale free topology criterion. The network was simulated to consist of a brown, a blue and a turquoise module with n1=100, n2=200, and n3=300 genes, respectively. Further, it contained 500 grey (non-module) genes. For the genes of the brown module, we simulated an external gene significance measure GS(i)=vsignal(i) and vsignal(i)=(1-0.3 i/n1)5 where 1 < i < n1.

The similarity matrix between genes of a given module contained a signal and a noise part, see the figure below. Specifically, for i< 0.95*n1 and j < 0.95*n1, we assumed that the similarity was given by S(i,j)=min(vsignal(i),vsignal(j)). The remaining 5 percent of (noise) genes were assumed to have a moderate similarity with the true signal genes. Specifically, for i > 0.95*n1 and j < 0.95*n1 (or for i < 0.95*n1 and j>0.95*n1), we set S(i,j)=vnoise(i)* vnoise(j) where vnoise(i)=(0.85+0.1* i/n)5. Further, we assumed that the noise genes had a high similarity between each other: for i > 0.95*n1 and j > 0.95*n1 the similarity matrix was set to S(i,j)=0.955.

Note that that the structure of the whole network similarity matrix is given by the 1100 times 1100 block-diagonal matrix S = BlockDiag(S1,S2,S3,0). Noise epsilon(i,j) was added to the entries S(i,j) such that the result remained in the unit interval [0,1]. Specifically, epsilon(i,j)=(1-S(i,j)) U(i,j)5 where the U(i,j) followed independent uniform distributions on the the unit interval [0,1].

Goals:

  1. To ascertain the relation between simulated biological signal and intramodular connectivity for different connectivity measures
  2. To determine how the biological signal changes for different values of the adjacency function parameters beta (in soft thresholding) and tau (in hard thresholding)
  3. To study the cases when soft thresholding is superior to hard thresholding
  4. To find the relation between the clustering coefficient, CC, and the connectivity measure, k
  5. To study module retrieval: does the TOM matrix and average linkage hierarchical clustering retrieve the underlying module structure in our simulated example?

Summary of Findings

In our simulation example, there are two types of noise present. The first type of noise is added directly to each module's similarity matrix; the parameter that governs the ratio of signal to noise within each module is prop1. The second type is added as a random uniform function to the correlation matrix; this noise is present in non-module genes, or grey genes, as well as module members and simulates "background" or "sampling" noise.

We first consider the omission of both types of noise from our network construction. Without any noise, power (beta) has no influence on biological signal in our weighted network. Furthermore, our biological signal is perfect for all values of beta in soft thresholding; the spearman correlation between gene significance and intramodular connectivity is equal to 1. In our weighted network, our intramodular TOM-based connectivity measure, omega.in, and the intramodular clustering coefficient, cc.in, perform as well as our standard connectivity measure, k.in. There is also a very clear positive, linear correlation between clustering coefficient and connectivity in our weighted network. Hard thresholding does not yield robust results with respect to the adjacency function parameter, tau. In this simulation alone, every gene (excluding grey genes) of the unweighted network has an intramodular clustering coefficient of 1, signifying each module is fully connected.

We also modify prop1 to omit noisy genes from our modules, while still including the "background" noise added to all genes. In the original simulation example, prop1 = 0.95, signifying 95% of each module's genes are true signal genes, and the last 5% are noisy genes. Increasing prop1 to 1.0 yields more robust results and higher biological signal than seen in our original simulation example. Yet, the inclusion of background noise in this simulation leads to imperfect biological signal for most values of beta. In this example, the scale-free criterion works particularly well in determining the power of beta with the highest biological signal. Furthermore, k.in outperforms omega.in for nearly all values of beta.

Similarly, decreasing prop1 leads to lower biological signal values in both unweighted and weighted networks. With 75% or less signal genes per module, we are unable to detect gene significance using our intramodular connectivity measure (biological signal is less than zero for all values of beta).

We also considered the effect power1, the exponent of the signal, has on biological signal. Note we did not alter the exponent of our background noise, so in effect the ratio of noise to signal is being modified. Biological signal is inversely related to the value of power1, with low parameter values leading to high biological signal and very robust results in our weighted network. Interestingly, with low power1 (power1 = 1), cc.in outperforms both connectivity measures. In addition, biological signal is a monotonically decreasing function of beta for low values of power1 with soft thresholding. With high values of power1, there is poor module identification and our results are not robust for all values of beta. Also, with values of power1 greater than or equal to 10, we are again unable to detect gene significance using k.in.

Module size is directly correlated with biological signal and robustness; increased module size leads to more robust results and stronger biological signal in our weighted network. We note that in simulations with both larger and smaller module sizes, omega.in performs better than our standard connectivity measure k.in for nearly all, if not all, values of beta. Choosing identical module sizes led to very poor results due to how the example was constructed. Specifically, gene significance is permuted in all but the smallest module. With three modules of the same size, however, the wrong module may be identified as biologically meaningful, leading to low biological signal.

In nearly all simulations, soft thresholding produced more robust results than hard thresholding. In addition, clustering coefficient nearly always had a positive, linear relationship with k.in in our weighted networks. The only exceptions to these two rules were simulations where the ratio of noise to signal was unrealistically high (prop1 ≤ 0.5 and power1 ≥ 10), or where module sizes were identical.

Selected Studies: R tutorials

More examples can be found on our wiki-blog here.