##*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#* #===================================================================================== # # Included file: version # #===================================================================================== #*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#* # Version: 1.1.1.0 # Created: Fri May 30 15:34:59 PDT 2008 ##*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#* #===================================================================================== # # Included file: ../CommonFunctions/NetworkFunctions-V2.txt # #===================================================================================== #*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#* # revision: 05/28/2008: # # . restrConnectivity[MS] now uses a sample of genes to compute the # connectivity if number of genes is large. # revision: 05/16/2008: # # . restrConnectivity[MS] now works when there's one gene in the last # batch # # revision: 05/15/2008: # # . batchSize*nGenes in connectivity calculations is checked against an upper bound # .largestBlockSize and batchSize is adjusted if necessary # revision: 05/10/2008: # # . saving and restoring random seed corrected. # revision: 04/30/2008: # # . PickSoftThreshold does not accept "method" anymore; use corFnc and corOptions instead # . fixed a bug in restrConnMS in which sofr powers were used incorrectly # # # # revision: 04/21/2008: # . removed duplicate stderr1. # . add biweight correlation. # . blockwiseModules gets parameters corFnc = "cor" and corOptions = "use = 'p'" # that allow one to specify correlation method and options to be used. # revision: 03/31/2008 # . Steve's changes: # I have updated the function **ImFeelingLazy** and **ImFeelingLazyGS**. # Now these functions output p-values for module membership. # Further, I have created a new function called **HeatmapNetwork**. # And I have made the function **TOMplot1** more robust. # # revision: 03/22/2008 # . Fixed bug in blockwiseModules which caused a crash when all genes # were assigned at the end of a block. # revision: 03/14/2008 # . Steve revised PlotClusterTreeSamples and ModuleEnrichment1 # . added print.flush as a wrapper for printFlush. # . removed printing number of calculated module in ModulePrinComps1 # # # revision: 03/13/2008 # . Minor changes, bugfixes and cleanup in blockwiseModules, blockwiseConsensusModules # this version should hopefully run alright both with old and new moduleColor versions. It # also does some error trapping in blockwiseModules and blockwiseConsensusModules. # revision: 03/06/2008 # Changes: # . blockwiseModules significantly more robust and take a new option # trapErrors # . MergeCloseModules: the default of StandardColors is now # standardColors() (from moduleColor) # revision: 02/29/2008 # Changes: # . blockwiseModules now also returns unmergedColors, mosule colors before merging. Good for # diagnostic purposes. # revision: 02/18/2008 # Changes: # . more robust hclustplotn (by Steve Horvath) # revision: 02/15/2008 # Changes: # . more robust KME # . bug-fixed (minKME was not correctly applies) and more robust blockwiseModules, now also work # with small numbers of genes/samples (see ..minNGenes, ..minNSamples) # . introduced ..minNGenes and ..minNSamples constants to hold the minimum # number of genes and samples for which analysis will be performed. # V2, revision: 01/30/2008: # Changes: This is a relatively major update that may break some old code that uses cutreeDynamic: # . removed cutreeDynamic as it conflicts with the library dynamicTreeCut. Use cutreeDynamic in # package dynamicTreeCut available from # http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/BranchCutting/. The usage is # almost the same but please take a look at the documentation before making the switch. # . Added allGeneModules that is capable of calculating modules and eigengenes from many more # genes than standard methods. This function uses the new cutreeDynamic from package # dynamicTreeCut. # . Some functions now take an optional parameter verbose controling the verbosity of their # output and the parameter indent controling the indentation. These require the dynamicTreeCut # package as well. # Version: 11/16/2007 # Change: \! changed to \n in modulecolor2 to get rid of a warning message. # No functional changes. # Version: 09/21/2007 # Change: ModulePrinComps1: code checks the structure of value returned by # impute.knn to make it compatible both with the CRAN version 1.0-5 and # Bioconductor version 1.8. # This document contains functions that we find useful for gene co-expression network analysis # We recommend to read the tutorial first. # The functions were written by Steve Horvath, Jun Dong, Peter Langfelder, Andy Yip, Bin Zhang # To cite this code or the statistical methods please use the following 2 references: #1) Horvath S, Zhang B, Carlson M, Lu KV, Zhu S, Felciano RM, Laurance MF, Zhao W, Shu, Q, Lee Y, # Scheck AC, Liau LM, Wu H, Geschwind DH, Febbo PG, Kornblum HI, Cloughesy TF, Nelson SF, Mischel PS (2006) #"Analysis of Oncogenic Signaling Networks in Glioblastoma Identifies ASPM as a Novel Molecular Target", # PNAS | November 14, 2006 | vol. 103 | no. 46 | 17402-17407 #2) Zhang B, Horvath S (2005) "A General Framework for Weighted Gene Co-Expression Network Analysis", #Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 17 # Technical Report and software code at: www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork. # CONTENTS # This document contains function for carrying out the following tasks # A) Assessing scale free topology and choosing the parameters of the adjacency function # using the scale free topology criterion (Zhang and Horvath 05) # B) Computing the topological overlap matrix # C) Defining gene modules using clustering procedures # D) Summing up modules by their first principal component (first eigengene) # E) Relating a measure of gene significance to the modules # F) Carrying out a within module analysis (computing intramodular connectivity) # and relating intramodular connectivity to gene significance. # G) Miscellaneous other functions, e.g. for computing the cluster coefficient. # H) Network functions from Bin Zhang and Peter Langfelder for dynamic tree cutting #of a hierarchical clustering tree (dendrogram) # I) Peter Langfelder's functions for merging modules, e.g. when their module eigengenes have a high correlation # J) General statistical functions, e.g. for scatterplots # The functions below are organized according into categories A-J. ##################################################################################################### ################################################################################################################################ # A) Assessing scale free topology and choosing the parameters of the adjacency function # using the scale free topology criterion (Zhang and Horvath 05) ################################################################################################################# # Below these tow numbers functions will not work. ..minNGenes = 4; ..minNSamples = 4; if (!exists(".largestBlockSize")) .largestBlockSize = 1e7; # =================================================== #For hard thresholding, we use the signum (step) function if(exists("signum1") ) rm(signum1); signum1=function(corhelp,tau1) { adjmat1= as.matrix(abs(corhelp)>=tau1) dimnames(adjmat1) <- dimnames(corhelp) diag(adjmat1) <- 0 adjmat1} # =================================================== # For soft thresholding, one can use the sigmoid function # But we have focused on the power adjacency function in the tutorial... if (exists("sigmoid1") ) rm(sigmoid1); sigmoid1=function(ss,mu1=0.8,alpha1=20) { 1/(1+exp(-alpha1*(ss-mu1)))} #This function is useful for speeding up the connectivity calculation. #The idea is to partition the adjacency matrix into consecutive baches of a #given size. #In principle, the larger the batch size the faster is the calculation. But #smaller batchsizes require #less memory... # Input: gene expression data set where *rows* correspond to microarray samples #and columns correspond to genes. # If fewer than MinimumNoSamples contain gene expression information for a given # gene, then its connectivity is set to missing. SoftConnectivity=function(datE, power=6, batchsize=1500, MinimumNoSamples=10, method = "pearson", verbose = 2, indent = 0) { spaces = indentSpaces(indent); nGenes=dim(datE)[[2]] no.samples=dim(datE)[[1]] if (nGenes<..minNGenes | no.samples<..minNSamples ) {stop(paste("Error: Something seems to be wrong. Make sure that the input data frame has genes as rows and array samples as columns. Alternatively, there could be fewer than", ..minNGenes, "genes or fewer than", ..minNSamples, "samples. ") ) } else { if (nGenes.largestBlockSize) batchsize = as.integer(.largestBlockSize/nGenes); if (batchsize < 1) { batchsize = 1; printFlush(paste("The data set may be too large for the connectivity calculation.\n", " ..trying anyway; keep your fingers crossed.")); } k=rep(NA,nGenes) no.batches=as.integer(nGenes/ batchsize) if (verbose>0) { cat(paste(spaces, "SoftConnectivity: calculating connectivities..")); pind = initProgInd(); } if (no.batches>0) { for (i in 1:no.batches) { index1=c(1:batchsize)+(i-1)* batchsize ad1=as.matrix(abs(cor(datE[,index1], datE,use="p", method = method))^power) ad1[is.na(ad1)]=0 k[index1]=apply(ad1,1,sum, na.rm = TRUE) # If fewer than MinimumNoSamples contain gene expression information for a given # gene, then we set its connectivity to 0. NoSamplesAvailable=apply(!is.na(datE[,index1]),2,sum) k[index1][NoSamplesAvailable< MinimumNoSamples]=NA if (verbose>0) pind = updateProgInd(i/no.batches, pind); } # end of for (i in 1:no.batches } # end of if (no.batches>0)... if (nGenes-no.batches*batchsize>0 ) { restindex=c((no.batches*batchsize+1):nGenes) ad1=as.matrix(abs(cor(datE[,restindex], datE,use="p", method = method))^power) ad1[is.na(ad1)]=0 k[restindex]=apply(ad1,1,sum, na.rm = TRUE) NoSamplesAvailable=apply(!is.na(datE[,restindex]),2,sum) k[restindex][NoSamplesAvailable< MinimumNoSamples]=NA } # end of if k } # end of else statement } # end of function # =================================================== # The function PickHardThreshold can help one to estimate the cut-off value # when using the signum (step) function. # The first column lists the threshold ("cut"), # the second column lists the corresponding p-value based on the Fisher Transform # of the correlation. # The third column reports the resulting scale free topology fitting index R^2. # The fourth column reports the slope of the fitting line, it shoud be negative for # biologically meaningul networks. # The fifth column reports the fitting index for the truncated exponential model. # Usually we ignore it. # The remaining columns list the mean, median and maximum resulting connectivity. # To pick a hard threshold (cut) with the scale free topology criterion: # aim for high scale free R^2 (column 3), high connectivity (col 6) and negative slope # (around -1, col 4). # The output is a list with 2 components. The first component lists a sugggested cut-off # while the second component contains the whole table. # The removeFirst option removes the first point (k=0, P(k=0)) from the regression fit. # no.breaks specifies how many intervals used to estimate the frequency p(k) i.e. the no. of points in the # scale free topology plot. if (exists("PickHardThreshold")) rm(PickHardThreshold); PickHardThreshold=function(datExpr1,RsquaredCut=0.85, cutvector=seq(0.1,0.9,by=0.05) ,removeFirst=FALSE,no.breaks=10, method = "pearson") { nGenes <- dim(datExpr1)[[2]] nGenes <- dim(datExpr1)[[2]] no.samples= dim(datExpr1)[[1]] colname1=c("Cut","p-value", "scale law R^2", "slope=" ,"truncated R^2","mean(k)","median(k)","max(k)") datout=data.frame(matrix(666,nrow=length(cutvector),ncol=length(colname1) )) names(datout)=colname1 datout[,1]=cutvector for (i in c(1:length(cutvector) ) ){ cut1=cutvector[i] datout[i,2]=2*(1-pt(sqrt(no.samples-1)*cut1/sqrt(1-cut1^2),no.samples-1))} if(exists("fun1")) rm(fun1) fun1=function(x) { corx=abs(cor(x,datExpr1,use="p", method = method)) out1=rep(NA, length(cutvector) ) for (j in c(1:length(cutvector))) {out1[j]=sum(corx>cutvector[j])} out1 } # end of fun1 datk=t(apply(datExpr1,2,fun1)) for (i in c(1:length(cutvector) ) ){ nolinkshelp <- datk[,i]-1 cut2=cut(nolinkshelp,no.breaks) binned.k=tapply(nolinkshelp,cut2,mean) freq1=as.vector(tapply(nolinkshelp,cut2,length)/length(nolinkshelp)) # The following code corrects for missing values etc breaks1=seq(from=min(nolinkshelp),to=max(nolinkshelp),length=no.breaks+1) hist1=hist(nolinkshelp,breaks=breaks1,equidist=F,plot=FALSE,right=TRUE) binned.k2=hist1$mids binned.k=ifelse(is.na(binned.k),binned.k2,binned.k) binned.k=ifelse(binned.k==0,binned.k2,binned.k) freq1=ifelse(is.na(freq1),0,freq1) xx= as.vector(log10(binned.k)) if(removeFirst) {freq1=freq1[-1]; xx=xx[-1]} plot(xx,log10(freq1+.000000001),xlab="log10(k)",ylab="log10(p(k))" ) lm1= lm(as.numeric(log10(freq1+.000000001))~ xx ) lm2=lm(as.numeric(log10(freq1+.000000001))~ xx+I(10^xx) ) datout[i,3]=summary(lm1)$adj.r.squared datout[i,4]=summary(lm1)$coefficients[2,1] datout[i,5]=summary(lm2)$adj.r.squared datout[i,6]=mean(nolinkshelp) datout[i,7]= median(nolinkshelp) datout[i,8]= max(nolinkshelp) } datout=signif(datout,3) print(data.frame(datout)); # the cut-off is chosen as smallest cut with R^2>RsquaredCut ind1=datout[,3]>RsquaredCut indcut=NA indcut=ifelse(sum(ind1)>0,min(c(1:length(ind1))[ind1]),indcut) # this estimates the cut-off value that should be used. # Don't trust it. You need to consider slope and mean connectivity as well! cut.estimate=cutvector[indcut][[1]] list(cut.estimate, data.frame(datout)); } # end of function # =========================================================== # The function PickSoftThreshold allows one to estimate the power parameter when using # a soft thresholding approach with the use of the power function AF(s)=s^Power # The removeFirst option removes the first point (k=1, P(k=1)) from the regression fit. # PL: a rewrite that splits the data into a few batches. if (exists("PickSoftThreshold")) rm(PickSoftThreshold); PickSoftThreshold=function(datExpr1,RsquaredCut=0.85, powervector=c(seq(1,10,by=1),seq(12,20,by=2)), removeFirst=FALSE,no.breaks=10, batchSize = 1000, corFnc = "cor", corOptions = "use = 'p'", verbose = 0, indent = 0) { nGenes <- dim(datExpr1)[[2]] no.samples= dim(datExpr1)[[1]] colname1=c("Power", "scale law R^2" ,"slope", "truncated R^2","mean(k)","median(k)","max(k)") datout=data.frame(matrix(666,nrow=length(powervector),ncol=length(colname1) )) names(datout)=colname1 datout[,1]=powervector spaces = indentSpaces(indent); if (verbose > 0) { cat(paste(spaces, "PickSoftThreshold: calculating connectivity for given powers..." )); pind = initProgInd(); } datk = matrix(0, nrow = nGenes, ncol = length(powervector)); startG = 1; while (startG <= nGenes) { endG = startG + batchSize - 1; if (endG > nGenes) endG = nGenes; if (verbose > 1) printFlush(paste(spaces, " ..working on genes", startG, "through", endG, "of ", nGenes)); corEval = parse(text = paste(corFnc, "(datExpr1, datExpr1[, c(startG:endG)],", corOptions, ")")); #corx = abs(cor(datExpr1, datExpr1[, c(startG:endG)], use="p", method = method)) corx = abs(eval(corEval)) if (sum(is.na(corx))!=0) warning(paste("Some correlations are NA in batch", startG, ":", endG, ".")); for (j in c(1:length(powervector))) {datk[c(startG:endG), j]=apply(corx^powervector[j], 2, sum, na.rm = TRUE)} startG = endG+1; if (verbose>0) pind = updateProgInd(endG/nGenes, pind) } if (verbose>0) cat("\n\n"); for (i in c(1:length(powervector) ) ){ nolinkshelp <- datk[,i]-1 cut2=cut(nolinkshelp,no.breaks) binned.k=tapply(nolinkshelp,cut2,mean) freq1=as.vector(tapply(nolinkshelp,cut2,length)/length(nolinkshelp)) # The following code corrects for missing values etc breaks1=seq(from=min(nolinkshelp),to=max(nolinkshelp),length=no.breaks+1) hist1=hist(nolinkshelp,breaks=breaks1,equidist=F,plot=FALSE,right=TRUE) binned.k2=hist1$mids binned.k=ifelse(is.na(binned.k),binned.k2,binned.k) binned.k=ifelse(binned.k==0,binned.k2,binned.k) freq1=ifelse(is.na(freq1),0,freq1) xx= as.vector(log10(binned.k)) if(removeFirst) {freq1=freq1[-1]; xx=xx[-1]} plot(xx,log10(freq1+.000000001),xlab="log10(k)",ylab="log10(p(k))" ) lm1= lm(as.numeric(log10(freq1+.000000001))~ xx ) lm2=lm(as.numeric(log10(freq1+.000000001))~ xx+I(10^xx) ) datout[i,2]=summary(lm1)$adj.r.squared datout[i,3]=summary(lm1)$coefficients[2,1] datout[i,4]=summary(lm2)$adj.r.squared datout[i,5]=mean(nolinkshelp) datout[i,6]= median(nolinkshelp) datout[i,7]= max(nolinkshelp) } datout=signif(datout,3) print(data.frame(datout)); # the cut-off is chosen as smallest cut with R^2>RsquaredCut ind1=datout[,2]>RsquaredCut indcut=NA indcut=ifelse(sum(ind1)>0,min(c(1:length(ind1))[ind1]),indcut) # this estimates the power value that should be used. # Don't trust it. You need to consider slope and mean connectivity as well! power.estimate=powervector[indcut][[1]] list(power.estimate, data.frame(datout)); } # =================================================== # The function ScaleFreePlot1 creates a plot for checking scale free topology # when truncated1=T is specificed, it provides the R^2 measures for the following # degree distributions: a) scale free topology, b) log-log R^2 and c) truncated exponential R^2 # The function ScaleFreePlot1 creates a plot for checking scale free topology if(exists("ScaleFreePlot1")) rm(ScaleFreePlot1) ; ScaleFreePlot1=function(kk,no.breaks=10,AF1="" ,truncated1=FALSE, removeFirst=FALSE,cex.lab1=1){ cut1=cut(kk,no.breaks) binned.k=tapply(kk,cut1,mean) freq1=tapply(kk,cut1,length)/length(kk) # The following code corrects for missing values etc breaks1=seq(from=min(kk),to=max(kk),length=no.breaks+1) hist1=hist(kk,breaks=breaks1,equidist=F,plot=FALSE,right=TRUE) binned.k2=hist1$mids binned.k=ifelse(is.na(binned.k),binned.k2,binned.k) binned.k=ifelse(binned.k==0,binned.k2,binned.k) freq1=ifelse(is.na(freq1),0,freq1) plot(log10(binned.k),log10(freq1+.000000001),xlab="log10(k)",ylab="log10(p(k))",cex.lab=cex.lab1 ) xx= as.vector(log10(binned.k)) if(removeFirst) {freq1=freq1[-1]; xx=xx[-1]} lm1=lm(as.numeric(log10(freq1+.000000001))~ xx ) lines(xx,predict(lm1),col=1) OUTPUT=data.frame(ScaleFreeRsquared=round(summary(lm1)$adj.r.squared,2),Slope=round(lm1$coefficients[[2]],2)) if (truncated1==TRUE) { lm2=lm(as.numeric(log10(freq1+.000000001))~ xx+I(10^xx) ); OUTPUT=data.frame(ScaleFreeRsquared=round(summary(lm1)$adj.r.squared,2),Slope=round(lm1$coefficients[[2]],2), TruncatedRsquared=round(summary(lm2)$adj.r.squared,2)) print("the red line corresponds to the truncated exponential fit") lines(xx,predict(lm2),col=2); title(paste(AF1, ", scale free R^2=",as.character(round(summary(lm1)$adj.r.squared,2)), ", slope=", round(lm1$coefficients[[2]],2), ", trunc.R^2=",as.character(round(summary(lm2)$adj.r.squared,2))))} else { title(paste(AF1, ", scale R^2=",as.character(round(summary(lm1)$adj.r.squared,2)) , ", slope=", round(lm1$coefficients[[2]],2))) } OUTPUT } # end of function ############################################################################################ ############################################################################################ # B) Computing the topological overlap matrix ############################################################################################ ############################################################################################ # =================================================== #The function TOMdist1 computes a dissimilarity # based on the topological overlap matrix (Ravasz et al) # Input: an Adjacency matrix with entries in [0,1] if(exists("TOMdist1")) rm(TOMdist1); TOMdist1=function(adjmat1, maxADJ=FALSE) { diag(adjmat1)=0; adjmat1[is.na(adjmat1)]=0; maxh1=max(as.dist(adjmat1) ); minh1=min(as.dist(adjmat1) ); if (maxh1>1 | minh1 < 0 ) {print(paste("ERROR: the adjacency matrix contains entries that are larger than 1 or smaller than 0!!!, max=",maxh1,", min=",minh1)) } else { if ( max(c(as.dist(abs(adjmat1-t(adjmat1)))))>10^(-12) ) {print("ERROR: non-symmetric adjacency matrix!!!") } else { adjmat1= (adjmat1+ t(adjmat1) )/2 kk=apply(adjmat1,2,sum) maxADJconst=1 if (maxADJ==TRUE) maxADJconst=max(c(as.dist(adjmat1 ))) Dhelp1=matrix(kk,ncol=length(kk),nrow=length(kk)) denomTOM= pmin(as.dist(Dhelp1),as.dist(t(Dhelp1))) +as.dist(maxADJconst-adjmat1); gc();gc(); numTOM=as.dist(adjmat1 %*% adjmat1 +adjmat1); #TOMmatrix=numTOM/denomTOM # this turns the TOM matrix into a dissimilarity out1=1-as.matrix(numTOM/denomTOM) diag(out1)=1 # setting the diagonal to 1 is unconventional (it should be 0) # but it leads to nicer looking TOM plots... out1 }}} #--------------------------------------------------------------------------- # This is a somewhat modified TOMdist1 - most checks are left out as they are # often not necessary. TOMdist = function(adjmat1, maxADJ=FALSE) { diag(adjmat1)=0; adjmat1[is.na(adjmat1)]=0; kk=apply(adjmat1,2,sum) maxADJconst=1 if (maxADJ==TRUE) maxADJconst=max(c(as.dist(adjmat1 ))) Dhelp1 = matrix(kk,ncol=length(kk),nrow=length(kk)) denomTOM = pmin(as.dist(Dhelp1),as.dist(t(Dhelp1))) + as.dist(maxADJconst-adjmat1); rm(Dhelp1); numTOM=as.dist(adjmat1 %*% adjmat1 +adjmat1); #TOMmatrix=numTOM/denomTOM # this turns the TOM matrix into a dissimilarity out1=1-as.matrix(numTOM/denomTOM) rm(numTOM); rm(denomTOM); collectGarbage(); diag(out1)=1 out1 } #--------------------------------------------------------------------------- # exact equivalent of TOMdist above, but returns similarity. TOMsim = function(adjmat1, maxADJ=FALSE) { diag(adjmat1)=0; adjmat1[is.na(adjmat1)]=0; kk=apply(adjmat1,2,sum) maxADJconst=1 if (maxADJ==TRUE) maxADJconst=max(c(as.dist(adjmat1 ))) Dhelp1 = matrix(kk,ncol=length(kk),nrow=length(kk)) denomTOM = pmin(as.dist(Dhelp1),as.dist(t(Dhelp1))) + as.dist(maxADJconst-adjmat1); rm(Dhelp1); numTOM=as.dist(adjmat1 %*% adjmat1 +adjmat1); #TOMmatrix=numTOM/denomTOM # this turns the TOM matrix into a dissimilarity out1=as.matrix(numTOM/denomTOM) rm(numTOM); rm(denomTOM); collectGarbage(); diag(out1)=1 out1 } # =================================================== # This function computes a TOMk dissimilarity # which generalizes the topological overlap matrix (Ravasz et al) # Input: an Adjacency matrix with entries in [0,1] # WARNING: ONLY FOR UNWEIGHTED NETWORKS, i.e. the adjacency matrix contains binary entries... # This function is explained in Yip and Horvath (2005) # http://www.genetics.ucla.edu/labs/horvath/GTOM/ if(exists("TOMkdist1")) rm(TOMkdist1); TOMkdist1 = function(adjmat1,k=1){ maxh1=max(as.dist(adjmat1) ); minh1=min(as.dist(adjmat1) ); if (k!=round(abs(k))) { stop("k must be a positive integer!!!", call.=TRUE);} if (maxh1>1 | minh1 < 0 ){ print(paste("ERROR: entries of the adjacency matrix must be between inclusively 0 and 1!!!, max=",maxh1,", min=",minh1))} else { if ( max(c(as.dist(abs(adjmat1-t(adjmat1)))))>0 ) {print("ERROR: non-symmetric adjacency matrix!!!") } else { B <- adjmat1; if (k>=2) { for (i in 2:k) { diag(B) <- diag(B) + 1; B = B %*% adjmat1;}} # this gives the number of paths with length at most k connecting a pair B <- (B>0); # this gives the k-step reachability from a node to another diag(B) <- 0; # exclude each node being its own neighbor B <- B %*% B # this gives the number of common k-step-neighbor that a pair of nodes share Nk <- diag(B); B <- B +adjmat1; # numerator diag(B) <- 1; denomTOM=outer(Nk,Nk,FUN="pmin")+1-adjmat1; diag(denomTOM) <- 1; 1 - B/denomTOM # this turns the TOM matrix into a dissimilarity }} } # IGNORE THIS function... # The function TOMdistROW computes the TOM distance of a gene (node) # with that of all other genes in the network. # WhichRow is an integer that specifies which row of the adjacency matrix # corresponds to the gene of interest. # Output=vector of TOM distances. if (exists("TOMdistROW") ) rm(TOMdistROW) TOMdistROW=function(WhichRow=1, adjmat1, maxADJ=FALSE) { diag(adjmat1)=0; maxh1=max(as.dist(adjmat1) ); minh1=min(as.dist(adjmat1) ); if (maxh1>1 | minh1 < 0 ) {print(paste("ERROR: the adjacency matrix contains entries that are larger than 1 or smaller than 0!!!, max=",maxh1,", min=",minh1)) } else { kk=apply(adjmat1,2,sum) numTOM=adjmat1[WhichRow,] %*% adjmat1 +adjmat1[WhichRow,]; numTOM[WhichRow]=1 maxADJconst=1 if (maxADJ==TRUE) maxADJconst=max(c(as.dist(adjmat1 ))) denomTOM=pmin(kk[WhichRow],kk)+maxADJconst-adjmat1[WhichRow,]; denomTOM[WhichRow]=1 #TOMmatrix=numTOM/denomTOM # this turns the TOM matrix into a dissimilarity 1-numTOM/denomTOM } } #============================================================================================= # # vectorTOM: calculate TOM of a vector (or a 'small' matrix) with expression # data. If the number of columns in vect is small (or 1), number of columns in # expr can be large. # #============================================================================================ vectorTOM = function(expr, vect, subtract1 = FALSE, batchSize = 2000, corFnc = "cor", corOptions = "use = 'p'", type = "unsigned", softPower = 6, verbose = 1, indent = 0) { spaces = indentSpaces(indent); if (is.null(dim(vect))) { vect = as.matrix(vect) vectIsVector = TRUE; } else vectIsVector = FALSE; if (nrow(vect)!=nrow(expr)) stop("Input error: numbers of samples in 'vect' and 'expr' must be the same."); if (ncol(vect)>batchSize) stop(paste("Input error: number of columns in 'vect' is too large. ", "If you are certain you want to try anyway, increase 'batchSize' to at least", "the number of columns in 'vect'.")); corEval = parse(text = paste(corFnc, "(expr, vect ", corOptions, ")")); corVE = eval(corEval); if (type=="unsigned") { corVE = abs(corVE); } else if (type=="signed") { corVE = (1+corVE)/2; } else if (type=="signed hybrid") { corVE[corVE < 0] = 0; } else stop("Unrecognized type argument. Recognized values are 'unsigned', 'signed', and 'signed hybrid'."); corVE = corVE^softPower; subtract1 = as.numeric(subtract1); nVect = ncol(vect); nGenes = ncol(expr); TOM = matrix(NA, nrow = nGenes, ncol = nVect); if (verbose > 0) { if (verbose > 1) cat(paste(spaces, "Calculating TOM of a set of vectors with genes")); pind = initProgInd(); } start = 1; denomArr = array(0, dim = c(2, batchSize, nVect)); while (start <= nGenes) { end = min(start + batchSize, nGenes); batchInd = c(start:end); corEval = parse(text = paste(corFnc, "(expr[, batchInd], expr, ", corOptions, ")")); corEE = cor(expr[, batchInd], expr, use = "p"); if (type=="unsigned") { corEE = abs(corEE); } else if (type=="signed") { corEE = (1+corEE)/2; } else if (type=="signed hybrid") { corEE[corEE < 0] = 0; } corEE = corEE^softPower; num = corEE %*% corVE + corVE[c(start:end), ] kV = apply(corVE, 2, sum, na.rm = TRUE) - subtract1 kE = apply(corEE, 1, sum, na.rm = TRUE) - 1; # denomArr[1, batchIndex, ] = matrix(kV, nrow = end-start+1, ncol = nVect, byrow = TRUE); # denomArr[2, batchIndex, ] = matrix(kE, nrow = end-start+1, ncol = nVect); # denom = apply(denomArr[, batchIndex, ], c(2,3), min) + 1 - corVE[batchInd, ]; denom = pmin(kV, kE) + 1 - corVE[batchInd, ]; TOM[batchInd, ] = num/denom; if (verbose > 0) pind = updateProgInd(end/nGenes); start = end + 1; collectGarbage(); } if (verbose>0) printFlush(" "); TOM; } #--------------------------------------------------------------------- # # AdjacencyMatrix # #--------------------------------------------------------------------- # Computes the adjacency from the expression data: takes cor, transforms it as appropriate and possibly # adds a sign if requested. No subselection on ExprData is performed. AdjacencyMatrix = function(ExprData, ExprData2=NULL, SoftPower, signed = FALSE, type = NULL, verbose=1, indent = 0) { spaces = indentSpaces(indent); No.Samples = dim(ExprData)[1]; Methods = c("probability", "power"); if (SoftPower<=0) { method = 1; Power = -SoftPower; } else { method = 2; Power = SoftPower; } if (verbose>2) printFlush(paste(spaces, "Transforming the correlation matrix using", Methods[method], "method with power", Power)); cor_mat = cor(ExprData, ExprData2, use="p"); # trafoed_cor = TrafoCorMatrix(cor_mat, method, SoftPower, No.Samples); collectGarbage(); if (!is.null(type)) { if (type=="unsigned") { cor_mat = abs(cor_mat); } else if (type=="signed") { cor_mat = (1+cor_mat)/2; } else if (type=="signed hybrid") { cor_mat[cor_mat < 0] = 0; } else stop("Unrecognized type argument. Recognized values are 'unsigned', 'signed', and 'signed hybrid'."); } else { if (signed) { cor_mat = (1+cor_mat)/2; } else { cor_mat = abs(cor_mat); } } if (method==2) { cor_mat^Power; } else { norm = pnorm(1, sd = 1/sqrt(No.Samples)); raw_weight = pnorm(cor_mat, sd = 1/sqrt(No.Samples)); weight = (raw_weight - 0.5)/(norm - 0.5); cor_mat * weight^Power; } } # A slighly reworked version that assumes one wants the adjacency matrix of data with itself or a # subset. The data are given only once, and an additional selection index for columns is given. # Caution: no checking of select2 validity is performed. # The probability method is removed as it's not used. AdjacencyMatrix2 = function(ExprData, select2=NULL, softPower, signed = FALSE, type = NULL, verbose=1, indent = 0) { spaces = indentSpaces(indent); No.Samples = dim(ExprData)[1]; if (verbose>2) printFlush(paste(spaces, "Transforming the correlation matrix with power", softPower)); if (is.null(select2)) { cor_mat = cor(ExprData, use = "p"); } else cor_mat = cor(ExprData, ExprData[, select2], use="p"); # trafoed_cor = TrafoCorMatrix(cor_mat, method, softPower, No.Samples); collectGarbage(); if (!is.null(type)) { if (type=="unsigned") { cor_mat = abs(cor_mat); } else if (type=="signed") { cor_mat = (1+cor_mat)/2; } else if (type=="signed hybrid") { cor_mat[cor_mat < 0] = 0; } else stop("Unrecognized type argument. Recognized values are 'unsigned', 'signed', and 'signed hybrid'."); } else { if (signed) { cor_mat = (1+cor_mat)/2; } else { cor_mat = abs(cor_mat); } } cor_mat^softPower; } # A faster and less memory-intensive version, but it only works for method==power and it won't keep # overlap sign AdjacencyMatrixR = function(ExprData, ExprData2 = NULL, SoftPower = 6, verbose=1, indent = 0) { abs(cor(ExprData, ExprData2, use="p"))^SoftPower; } ##################################################################################################### ################################################################################################################################ # C) Defining gene modules using clustering procedures ##################################################################################################### ################################################################################################################################ # =================================================== #The function modulecolor2 function assigns colors to the observations # in the branches of a dendrogram. Now enhanced to handle all possible R colors. # Needs GlobalStandardColors. # we use it to define modules.... if (exists("modulecolor2")) rm(modulecolor2); modulecolor2=function(hier1, h1=0.9,minsize1=50) { # here we define modules by using a height cut-off for the branches labelpred= cutree(hier1,h=h1) sort1=-sort(-table(labelpred)) modulename= as.numeric(names(sort1)) modulebranch= sort1>minsize1 no.modules=sum(modulebranch) # now we assume that there are fewer than a certain number of colors colorcode=GlobalStandardColors # "grey" means not in any module; colorhelp=rep("grey",length(labelpred)) if ( no.modules==0 | no.modules >length(colorcode)){ print(paste("The number of modules is problematic! \n Number of modules = ", as.character(no.modules)))} else { for (i in c(1:no.modules)) {colorhelp=ifelse(labelpred==modulename[i],colorcode[i],colorhelp)}; colorhelp=factor(colorhelp,levels=c(colorcode[1:no.modules],"grey")) } factor(colorhelp, levels=unique(colorhelp[hier1$order] )) } #------------------------------------------------------------------------------------------------ # Create a barplot of colors given in the couleur parameter ordered according # to the order of the hierarchical tree hier1. # Parameters: # hier1: Hierarchical tree (such as returned by hclust). # couleur: Either a vector of colors for each element in the tree, or a matrix in which every row # is a set of colors for the corresponding object in the tree. If a matrix, the colors # will be plotted starting at the bottom. # RowLabels: If given, each color row will be labeled by the corresponding entry in RowLabels; # otherwise the number of the row will be used as a label. # cex.RowLabels: Scale factor for the font size used for labeling rows. # # Return value: None. # # hclustplot1.old implements the case where couleur is a vector, hclustplotn # the case of a matrix; hclustplot1 is a common wrapper. #------------------------------------------------------------------------------------------------ # # hclustplot1, hclustplot1.old, hclustplotn # The funtion hclusplotn was created by Peter Langfelder. #------------------------------------------------------------------------------------------------ # Create a barplot of colors given in the couleur parameter ordered according # to the order of the hierarchical tree hier1. # Parameters: # hier1: Hierarchical tree (such as returned by hclust). # couleur: Either a vector of colors for each element in the tree, or a matrix in which every row # is a set of colors for the corresponding object in the tree. If a matrix, the colors # will be plotted starting at the bottom. # RowLabels: If given, each color row will be labeled by the corresponding entry in RowLabels; # otherwise the number of the row will be used as a label. # cex.RowLabels: Scale factor for the font size used for labeling rows. # # Return value: None. # # hclustplot1.old implements the case where couleur is a vector, hclustplotn # the case of a matrix; hclustplot1 is a common wrapper. if (exists("hclustplot1")) rm(hclustplot1); hclustplot1 = function(hier1, couleur, title1="Colors sorted by hierarchical clustering", RowLabels=NULL, cex.RowLabels = 0.9, ...) { if (length(dim(couleur))>1) { hclustplotn(hier1, couleur, RowLabels, cex.RowLabels, main = title1, ...); } else { hclustplot1.old(hier1,couleur,title1); } } if (exists("hclustplot1.old")) rm(hclustplot1.old); hclustplot1.old=function(hier1,couleur,title1="Colors sorted by hierarchical clustering") { if (length(hier1$order) != length(couleur) ) {print("ERROR: length of color vector not compatible with no. of objects in the hierarchical tree")}; if (length(hier1$order) == length(couleur) ) { barplot(height=rep(1, length(couleur)), col= as.character(couleur[hier1$order]),border=F, main=title1,space=0, axes=F)} } if (exists("hclustplotn") ) rm(hclustplotn); hclustplotn=function(hier1, Color, RowLabels=NULL, cex.RowLabels = 0.9, ...) { if ( is.null(RowLabels) ) RowLabels=names(data.frame(Color)) options(stringsAsFactors=FALSE); if (length(hier1$order) != dim(as.matrix(Color))[[1]] ) { stop("ERROR: length of color vector not compatible with no. of objects in the hierarchical tree"); } else { No.Sets = dim(as.matrix(Color))[[2]]; C = as.matrix(Color)[hier1$order, ]; step = 1/dim(as.matrix(Color))[[1]]; ystep = 1/No.Sets; barplot(height=1, col = "white", border=F,space=0, axes=F, ...) for (j in 1:No.Sets) { for (i in 1:dim(as.matrix(C))[[1]]) { rect(xleft=((i-1)*step), ybottom=ystep*(j-1), xright = (i) * step, ytop = ystep*j, col = as.character(as.matrix(C)[i,j]), border = as.character(as.matrix(C)[i,j])); } if (is.null(RowLabels)) { text(as.character(j), pos=2, x=0, y=ystep*(j-0.5), cex=cex.RowLabels, xpd = TRUE); } else { text(RowLabels[j], pos=2, x=0, y=ystep*(j-0.5), cex=cex.RowLabels, xpd = TRUE); } } for (j in 1:No.Sets) lines(x=c(0,1), y=c(ystep*j,ystep*j)); } } # end of function # This function can be used to create an average linkage hierarchical # clustering tree # or the microarray samples. The rows of datExpr correspond to the samples and # the columns to the genes # You can optionally input a quantitative microarray sample trait. if(exists("PlotClusterTreeSamples")) rm(PlotClusterTreeSamples); PlotClusterTreeSamples=function(datExpr, y=NULL) { hierSamples=hclust( dist( datExpr ), method="average" ) if (is.null(y) ) {plot(hierSamples, main="Cluster Tree of the Samples (average linkage, Euclidean dist.)", sub="")} if (!is.null(y) ) { if (!is.numeric(y) ) {warning("The microarray sample trait y is not numeric. Therefore, the function PlotClusterTreeSamples will transform it to a numeric variable"); y=as.numeric(as.character(y)) } # end of if (!is.numeric(y) ) if ( dim(as.matrix(datExpr))[[1]] != length(y) ) stop("Input Error: The number of microarray sample arrays does not match the length of the sample trait. Hint: consider transposing datExpr. Also make sure that y is a vector. dim(as.matrix(datExpr))[[1]] != length(y)") par(mar=c(2,2,2,2), mfrow=c(2,1)) plot(hierSamples, main="Cluster Tree of the Samples (Euclidean dist.)", sub="") if ( is.integer(y) ) { if (min(y, na.rm=T)<0 ) y=y- min(y, na.rm=T)+1 hclustplotn(hierSamples, data.frame(y),main="Array Samples Colored by the Sample Trait") } # end of if ( is.integer(y) ) if ( !is.integer(y) ) { yBinary =as.integer(ifelse( y>=median(y, na.rm=T), 2 , 1 )) hclustplotn(hierSamples, data.frame(yBinary),main="Samples Colored by the Dichotomized (Binary) Trait") } # end of if ( !is.integer(y) ) } # end of if }# end of function PlotClusterTreeSamples # =================================================== # The function TOMplot1 creates a TOM plot # Inputs: distance measure, hierarchical (hclust) object, color label=couleur TOMplot1=function(disttom,hier1, couleur=NULL,terrainColors=FALSE) { if ( is.null(couleur) ) couleur=rep("white", dim(as.matrix(disttom))[[1]] ) no.nodes=length(couleur) if (no.nodes<2) warning("You have only 1 or 2 genes in TOMplot1. No plot will be produced") if (no.nodes>=2) { if (no.nodes != dim(disttom)[[1]] ) stop("ERROR: number of genes for the color label does not equal number of nodes in disttom. no.nodes != dim(disttom)[[1]] ") labeltree=as.character(couleur) labelrow = labeltree labelrow[hier1$order[length(labeltree):1]]=labelrow[hier1$order] options(expressions = 10000) if (terrainColors) heatmap(as.matrix(disttom),Rowv=as.dendrogram(hier1),Colv= as.dendrogram(hier1), scale="none",revC=T, ColSideColors=as.character(labeltree),RowSideColors=as.character(labelrow), labRow=F, labCol=F, col = terrain.colors(1000)) else heatmap(as.matrix(disttom),Rowv=as.dendrogram(hier1),Colv= as.dendrogram(hier1), scale="none",revC=T, ColSideColors=as.character(labeltree),RowSideColors=as.character(labelrow), labRow=F, labCol=F) } #end of if } #end of function # =================================================== # The function TOMplot2 creates a TOM plot where the top and left color bars can be different # Inputs: distance measure, hierarchical (hclust) object, color label=couleurTop, couleurLeft if (exists("TOMplot2")) rm(TOMplot2); TOMplot2=function(disttom,hier1, couleurTop, couleurLeft) { no.nodes=length(couleurTop) if (no.nodes != length(couleurLeft)) {stop("ERROR: number of top color labels does not equal number of left color labels")} if (no.nodes != dim(disttom)[[1]] ) {stop("ERROR: number of color labels does not equal number of nodes in disttom")} else { labeltree = as.character(couleurTop) labelrow = as.character(couleurLeft) labelrow[hier1$order[length(labeltree):1]]=labelrow[hier1$order] options(expressions = 10000) heatmap(as.matrix(disttom),Rowv=as.dendrogram(hier1),Colv= as.dendrogram(hier1), scale="none", revC=T, ColSideColors=as.character(labeltree),RowSideColors=as.character(labelrow), labRow=F, labCol=F) } } #end of function if (exists("HeatmapNetwork")) rm(HeatmapNetwork); HeatmapNetwork=function(datE, gene.name.vector, TOM=T, power1=6 , networkType="unsigned",main="Heatmap of the network") { match1=match( gene.name.vector ,names(data.frame(datE)) ) match1=match1[ !is.na(match1)] nGenes=length(match1) if ( sum( !is.na(match1) ) != length(gene.name.vector) ) {warning(paste("Not all gene names were recognized. Only the above genes were recognized. ")); print(paste(names(data.frame(datE))[match1] )) } if (nGenes< 3 ) {warning("Since you have fewer than 3 genes, the network will not be visualized. Hint: please input more genes."); plot(1,1)} if (nGenes>2 ) { datErest=datE[, match1 ] if ( networkType=="unsigned" ) ADJ1= abs(cor(datErest,use="p"))^power1 if ( networkType=="signed" ) ADJ1= ( (1+cor(datErest,use="p"))/2)^(power1) diss1=1-ADJ1 if (TOM) diss1=TOMdist1(ADJ1) diag(diss1)=NA hier1=hclust(as.dist(diss1), method="average" ) couleur1=rep("white", nGenes) labeltree = names(data.frame(datErest)) labelrow = names(data.frame(datErest)) labelrow[hier1$order[length(labeltree):1]]=labelrow[hier1$order] options(expressions = 10000) heatmap(as.matrix(diss1),Rowv=as.dendrogram(hier1),Colv= as.dendrogram(hier1), scale="none", revC=T, labRow= labeltree, labCol= labeltree,main=main) } # end of if (nGenes> 2 ) } # end of function # IGNORE THIS FUNCTION... # The function "BestHeightCut" allows one to find the best height cut-off # for a hierarchical clustering tree when external gene information is available # It computes a Kruskal Wallis-test p-value for each height cut-off # based on determining whether gene significance differs across branch membership. if(exists("BestHeightCut")) rm(BestHeightCut); BestHeightCut=function(hier1, GeneSignif, hcut=seq(0.1,.95,by=0.01) ) { pvalues=rep(NA, length(hcut)) for (i in c(1:length(hcut))) { colorhelp=modulecolor2(hier1,hcut[i]) if (length(unique(colorhelp))>1 ) {pvalues[i]=kruskal.test(GeneSignif, colorhelp)$p.value} data.frame(hcut,pvalues) }} ##################################################################################################### ################################################################################################################################ # D) Summing up modules using their first principal components (first eigengene) ##################################################################################################### ################################################################################################################################ # =================================================== #The function ModulePrinComps1 finds the first principal component (eigengene) in each # module defined by the colors of the input vector "couleur" (Pardon my French). # It also reports the variances explained by the first 5 principal components. # This requires the R library impute # The output is a list with 2 components: # 1) a data frame of module eigengenes (PCs), # 2) a data frame that lists the percent variance explained by the first 5 PCs of a module # Options: if removeGrey=T, then no output is generated for the grey module. # Recall that grey often denotes genes outside proper modules. if (exists("ModulePrinComps1" ) ) rm(ModulePrinComps1); ModulePrinComps1=function(datexpr,couleur,removeGrey=F, FiveComponents=F) { if (require(moduleColor, quietly = TRUE, warn.conflicts = FALSE) | exists("moduleEigengenes", mode = "function", inherits = TRUE)) { MEs = moduleEigengenes(datexpr, couleur, excludeGrey = removeGrey, nPC = ifelse(FiveComponents, 5, 1), verbose = 0) list(PrinComps = MEs$eigengenes, varexplained = MEs$varExplained); } else { cat("\n\n"); cat(paste("Warning: The function ModulePrinComps1 is not maintained anymore. \n", " Please install the packages 'dynamicTreeCut' and 'moduleColor'\n", " that contain an uptodate, improved version of this function.\n", " Once these packages are installed, you may continue using the same call;\n", " the new version of this function will be called automatically. \n", " This is only a warning: the old calculation will proceed normally.\n\n")) modlevels=levels(factor(couleur)) if ( removeGrey ) modlevels=setdiff(modlevels, c("grey") ); if (FiveComponents ) {print("To speed up the calculation, we only compute the five principal components of each module. Therefore, the estimate of the proportion of variance explained is no longer accurate. If you want an accurate estimate of the proportion of var explained, please choose the option FiveComponents=F ") ;} PrinComps=data.frame(matrix(666,nrow=dim(datexpr)[[1]],ncol= length(modlevels))) varexplained= data.frame(matrix(666,nrow= 5,ncol= length(modlevels))) names(PrinComps)=paste("ME",modlevels,sep="") for(i in c(1:length(modlevels)) ){ #print(i) modulename = modlevels[i] restrict1= as.character(couleur)== modulename # in the following, rows are genes and columns are samples datModule=t(datexpr[, restrict1]) is.saved = FALSE; if (exists(".Random.seed")) { saved.seed = .Random.seed; is.saved = TRUE; } datModule=impute.knn(as.matrix(datModule), k = min(10, nrow(datModule))) if ( (length(datModule)==3) && (!is.null(names(datModule))) && (names(datModule)[1]=="data") ) datModule = datModule$data; if (is.saved) .Random.seed <<- saved.seed; datModule=t(scale(t(datModule))) if (FiveComponents ) { svd1 =svd(datModule, nu = 5, nv = 5) } else {svd1=svd(datModule)} varexplained[,i]= (svd1$d[1:5])^2/sum(svd1$d^2) # this is the first principal component pc1=svd1$v[,1] signh1=sign(sum(cor(pc1, t(datModule)))) if (signh1 != 0) pc1=signh1* pc1 PrinComps[,i]= pc1 } list(PrinComps=PrinComps, varexplained=varexplained) } } ##################################################################################################### ################################################################################################################################ # E) Relating a measure of gene significance to the modules ##################################################################################################### ################################################################################################################################ # =================================================== # The function ModuleEnrichment1 creates a bar plot that shows whether modules are enriched with # significant genes. # More specifically, it reports the mean gene significance for each module. # The gene significance can be a binary variable or a quantitative variable. # It also plots the 95% confidence interval of the mean (CI=mean +/- 1.96* standard error). # It also reports a Kruskal Wallis P-value. if( exists("ModuleEnrichment1") ) rm(ModuleEnrichment1); ModuleEnrichment1=function(genesignif1,couleur,title1="gene significance across modules",labely="Gene Significance",boxplot=F) { if (length(genesignif1) != length(couleur) ) print("Error: vectors don't have the same lengths") else { if (boxplot != TRUE) { mean1=function(x) mean(x,na.rm=T) means1=as.vector(tapply(genesignif1,couleur,mean1)); se1= as.vector(tapply(genesignif1,couleur,stderr1)) par(mfrow=c(1,1)) barplot(means1, names.arg=names(table(couleur) ),col= names(table(couleur) ) ,ylab=labely) err.bp(as.vector(means1), as.vector(1.96*se1), two.side=T)} else { boxplot(split(genesignif1,couleur),notch=T,varwidth=T, col= names(table(couleur) ),ylab=labely)} no.colors=length(names(table(couleur) )) if (no.colors==1) pp=NA if (no.colors>1) { pp=try(kruskal.test(genesignif1,factor(couleur))$p.value) if (class(pp)=='try-error') pp=NA } title(paste(title1,", p-value=", signif(pp,2) )) } } # end of function # IGNORE THIS... # =================================================== #The function fisherPvector allows one to compute Fisher exact p-values # Thus it allows one to carry out an EASE analysis # Output: a table of Fisher's exact p-values # Input: annotation1 is a vector of gene annotations # Input: couleur (French for color) denotes the module color of each gene # Only those gene functions (levels of annotation1) that occur a certain mininum number of times # (parameter= minNumberAnnotation) in the data set will be considered. if (exists("fisherPvector" ) ) rm(fisherPvector); fisherPvector=function(couleur,annotation1,minNumberAnnotation=50) { levelsannotation1=levels(annotation1) levelscouleur=levels(factor(couleur)) no.couleur=length(levelscouleur) restannotation1=table(annotation1)>minNumberAnnotation no.annotation=sum( restannotation1) datoutP=data.frame(matrix(666,nrow=no.annotation,ncol=no.couleur) ) #datoutProp=data.frame(matrix(666,nrow=no.annotation,ncol=2*no.couleur) ) #names(datoutProp)=paste("Prop",paste( rep(levelscouleur ,rep(2, length(levelscouleur))) ) , c("Y","N") ,sep=".") datoutProp=data.frame(matrix(666,nrow=no.annotation,ncol=no.couleur) ) names(datoutProp)=paste("Perc",levelscouleur , sep=".") names(datoutP)=paste("p",levelscouleur,sep=".") restlevelsannotation1= levelsannotation1[restannotation1] row.names(datoutP)= restlevelsannotation1 for (i in c(1:no.annotation) ) { for (j in c(1:no.couleur) ){ tab1=table( annotation1 !=restlevelsannotation1[i], couleur !=levelscouleur[j]) datoutP[i,j]=signif(fisher.test(tab1)$p.value,2) #datoutProp[i,2*j-1]=signif(tab1[1,1]/sum(tab1[,1] ),2) #datoutProp[i,2*j]= signif(tab1[1,2]/sum(tab1[,2]) ,2) } table2=table(annotation1 !=restlevelsannotation1[i], couleur) datoutProp[i,]= signif(table2[1,]/apply(table2,2,sum),2) } data.frame(datoutP,datoutProp) } # end of function fisherPvector ##################################################################################################### ################################################################################################################################ # F) Carrying out a within module analysis (computing intramodular connectivity etc) ##################################################################################################### ################################################################################################################################ # =================================================== #The function DegreeInOut computes for each gene #a) the total number of connections, #b) the number of connections with genes within its module, #c) the number of connections with genes outside its module # When scaledToOne=TRUE, the within module connectivities are scaled to 1, i.e. the max(K.Within)=1 for each module if (exists("DegreeInOut")) rm(DegreeInOut); DegreeInOut =function(adj1, couleur,scaledToOne=FALSE) { no.nodes=length(couleur) couleurlevels=levels(factor(couleur)) no.levels=length(couleurlevels) kWithin=rep(-666,no.nodes ) diag(adj1)=0 for (i in c(1:no.levels) ) { rest1=couleur==couleurlevels[i]; if (sum(rest1) <3 ) { kWithin[rest1]=0 } else { kWithin[rest1]=apply(adj1[rest1,rest1],2,sum) if (scaledToOne) kWithin[rest1]=kWithin[rest1]/max(kWithin[rest1])} } kTotal= apply(adj1,2,sum) kOut=kTotal-kWithin if (scaledToOne) kOut=NA kDiff=kWithin-kOut data.frame(kTotal,kWithin,kOut,kDiff) } # ======================================================================= # The function WithinModuleCindex1 relates the node measures (e.g. connectivities) # to "external" node significance information within each module, # i.e. it carries out a by module analysis. # Output: first column reports the spearman correlation p-value between the network variable and the # node significance. The next columns contain the Spearman correlations between the variables. if (exists("WithinModuleAnalysis1")) rm(WithinModuleAnalysis1); WithinModuleAnalysis1=function(datnetwork,nodesignif, couleur) { cortesthelp=function( x ) { len1=dim(x)[[2]]-1 out1=rep(666, len1); for (i in c(1:len1) ) {out1[i]= signif( cor.test(x[,i+1], x[,1], method="s",use="p" )$p.value ,2) } data.frame( variable=names(x)[-1] , NS.CorPval=out1, NS.cor=t(signif(cor (x[,1], x[,-1],use="p",method="s"),2)), signif(cor(x[,-1],use="p",method="s"),2) ) } #end of function cortesthelp print("IGNORE the warnings..."); by( data.frame(nodesignif, datnetwork),couleur,cortesthelp); } #end of function WithinModuleAnalysis # ======================================================================= # The function WithinModuleCindex1 relates the node measures (e.g. connectivities) # to "external" node significance information within each module, # i.e. it carries out a by module analysis. # BUT it focuses on the C-index also known as area under the ROC curve # This measure is related to Kendall's Tau statistic and Somer's D, # see F. Harrel (Regression Modeling Strategies). Springer. # It requires the following library # library(Hmisc) # Output: the first column reports the C-index and the second, p-value if (exists("WithinModuleCindex1")) rm(WithinModuleCindex1); WithinModuleCindex1=function(datnetwork,nodesignif, couleur) { CindexFun=function( x ) { len1=dim(x)[[2]]-1 outC=rep(666, len1); outP=rep(666, len1); for (i in c(1:len1) ) {rcor1=rcorr.cens(x[,i+1], x[,1]) outC[i]=rcor1[[1]] outP[i]=1- pnorm(abs(rcor1[[2]]/rcor1[[3]])) } data.frame( variable=names(x)[-1] , C.index=outC, p.value=outP) } #end of function CindexFun #print("IGNORE the warnings..."); by( data.frame(nodesignif, datnetwork),couleur,CindexFun); } #end of function WithinModuleAnalysis # The following function allows on to plot a gene (node) significance measure versus # connectivity. if(exists("plotConnectivityGeneSignif1") ) rm( plotConnectivityGeneSignif1); plotConnectivityGeneSignif1=function(degree1,genesignif1,color1="black", title1="Gene Significance vs Connectivity" , xlab1="Connectivity", ylab1="GeneSignificance") { lm1=lm(genesignif1~degree1 ,na.action="na.omit") plot(degree1, genesignif1, col=color1,ylab=ylab1,xlab=xlab1,main=paste(title1, ", cor=", signif(cor( genesignif1,degree1, method="s",use="p" ) ,2) )) abline(lm1) } var1=function(x) var(x, na.rm=T) if (exists("no.present")) rm(no.present); no.present=function(x) sum(!is.na(x)) ##################################################################################################### ################################################################################################################################ # G) Miscellaneous other functions, e.g. for computing the cluster coefficient. ##################################################################################################### ################################################################################################################################ # The function signedKME computes the module eigengene based connectivity. # Input: datExpr= a possibly very large gene expression data set where the rows # correspond to samples and the columns represent genes # datPC=data frame of module eigengenes (columns correspond to module eigengenes or PCs) # A module eigengene based connectivity KME value will be computed if the gene has # a non missing expression value in at least MinimumNoSamples arrays. # Output a data frame where columns are the KME values # corresponding to different modules. # By splitting the expression data into different batches, the function can deal with # tens of thousands of gene expression data. # If there are many eigengenes (say hundreds) consider decreasing the batch size. if(exists("signedKME") ) rm(signedKME) signedKME=function(datExpr, datPC, outputColumnName="kME") { datExpr=data.frame(datExpr) datPC=data.frame(datPC) output=list() if (dim(as.matrix(datPC))[[1]] != dim(as.matrix(datExpr))[[1]] ) stop("ERROR in signedKME function: dim(datPC)[[1]] != dim(datExpr)[[1]] ") varianceZeroIndicatordatExpr=as.vector(apply(as.matrix(datExpr),2,var1))==0 varianceZeroIndicatordatPC =as.vector(apply(as.matrix(datPC),2,var1))==0 if (sum(varianceZeroIndicatordatExpr,na.rm=T)>0 ) warning("warning in signedKME: some genes are constant. Hint: consider removing constant columns from datExpr." ) if (sum(varianceZeroIndicatordatPC,na.rm=T)>0 ) warning("warning in signedKME: some module eigengenes are constant, which is very weird. Module eigengenes should have mean zero and variance 1. Hint: consider removing constant columns from datPC." ) no.presentdatExpr=as.vector(apply(!is.na(as.matrix(datExpr)),2, sum) ) if (min(no.presentdatExpr)<4 ) warning("warning in signedKME: some gene expressions have fewer than 4 observations. Hint: consider removing genes with too many missing values or collect more arrays.") output=data.frame(cor(datExpr, datPC, use="p")) output[no.presentdatExpr<4,]=NA names(output)=paste(outputColumnName, substring(names(datPC), first=3, last=100), sep="") dimnames(output)[[1]] = names(datExpr) output } # end of function signedKME # =================================================== # The function ClusterCoef.fun computes the cluster coefficients. # Input is an adjacency matrix if(exists("ClusterCoef.fun")) rm(ClusterCoef.fun) ; ClusterCoef.fun=function(adjmat1) { diag(adjmat1)=0 no.nodes=dim(adjmat1)[[1]] computeLinksInNeighbors <- function(x, imatrix){x %*% imatrix %*% x} nolinksNeighbors <- c(rep(-666,no.nodes)) total.edge <- c(rep(-666,no.nodes)) maxh1=max(as.dist(adjmat1) ); minh1=min(as.dist(adjmat1) ); if (maxh1>1 | minh1 < 0 ) {print(paste("ERROR: the adjacency matrix contains entries that are larger than 1 or smaller than 0!!!, max=",maxh1,", min=",minh1)) } else { nolinksNeighbors <- apply(adjmat1, 1, computeLinksInNeighbors, imatrix=adjmat1) plainsum <- apply(adjmat1, 1, sum) squaresum <- apply(adjmat1^2, 1, sum) total.edge = plainsum^2 - squaresum CChelp=rep(-666, no.nodes) CChelp=ifelse(total.edge==0,0, nolinksNeighbors/total.edge) CChelp} } # end of function # =================================================== # The function err.bp is used to create error bars in a barplot # usage: err.bp(as.vector(means), as.vector(stderrs), two.side=F) err.bp<-function(daten,error,two.side=F){ if(!is.numeric(daten)) { stop("All arguments must be numeric")} if(is.vector(daten)){ xval<-(cumsum(c(0.7,rep(1.2,length(daten)-1)))) }else{ if (is.matrix(daten)){ xval<-cumsum(array(c(1,rep(0,dim(daten)[1]-1)), dim=c(1,length(daten))))+0:(length(daten)-1)+.5 }else{ stop("First argument must either be a vector or a matrix") } } MW<-0.25*(max(xval)/length(xval)) ERR1<-daten+error ERR2<-daten-error for(i in 1:length(daten)){ segments(xval[i],daten[i],xval[i],ERR1[i]) segments(xval[i]-MW,ERR1[i],xval[i]+MW,ERR1[i]) if(two.side){ segments(xval[i],daten[i],xval[i],ERR2[i]) segments(xval[i]-MW,ERR2[i],xval[i]+MW,ERR2[i]) } } } # =================================================== # this function computes the standard error if (exists("stderr1")) rm(stderr1) stderr1 <- function(x){ sqrt( var(x,na.rm=T)/sum(!is.na(x)) ) } # =================================================== # The following two functions are for displaying the pair-wise correlation in a panel when using the command "pairs()" # Typically, we use "pairs(DATA, upper.panel=panel.smooth, lower.panel=panel.cor, diag.panel=panel.hist)" to # put the correlation coefficients on the lower panel. panel.cor <- function(x, y, digits=2, prefix="", cex.cor){ usr <- par("usr"); on.exit(par(usr)) par(usr = c(0, 1, 0, 1)) r <- abs(cor(x, y)) txt <- format(c(r, 0.123456789), digits=digits)[1] txt <- paste(prefix, txt, sep="") if(missing(cex.cor)) cex <- 0.8/strwidth(txt) text(0.5, 0.5, txt, cex = cex * r) } panel.hist <- function(x, ...){ usr <- par("usr"); on.exit(par(usr)) par(usr = c(usr[1:2], 0, 1.5) ) h <- hist(x, plot = FALSE) breaks <- h$breaks; nB <- length(breaks) y <- h$counts; y <- y/max(y) rect(breaks[-nB], 0, breaks[-1], y, col="cyan", ...) } # =================================================== # This function collects garbage if (exists("collect_garbage")) rm(collect_garbage); collect_garbage=function(){while (gc()[2,4] != gc()[2,4] | gc()[1,4] != gc()[1,4]){}} collect_garbage() # this function is used for computing the Rand index below... # =================================================== if (exists("choosenew") ) rm(choosenew) choosenew <- function(n,k){ n <- c(n) out1 <- rep(0,length(n)) for (i in c(1:length(n)) ){ if (n[i]=MinLabel) & (Labels <= length(StandardColors)) )!= length(Labels)) stop(paste("Input error: something's wrong with Labels. Either they are not a numeric vector,", "or some values are below", MinLabel, "or some values are above the maximum number of colors", length(StandardColors))); Colors = rep("grey", length(Labels)); Colors[Labels!=0] = StandardColors[Labels[Labels!=0]]; Colors;} #--------------------------------------------------------------------------------------------------------- # # DisplayColors # #--------------------------------------------------------------------------------------------------------- # This function plots a barplot with all colors given. If Colors are not given, GlobalStandardColors are # used, i.e. if you want to see the GlobalStandardColors, just call this function without parameters. DisplayColors = function(Colors = NULL) { if (is.null(Colors)) Colors = GlobalStandardColors; barplot(rep(1, length(Colors)), col = Colors, border = Colors);} #----------------------------------------------------------## ColorsFromLabels##------------------------------ ############################################################################### # I) Functions for merging modules based on a high correlation of the module eigengenes ############################################################################### # The following network functions were created by Peter Langfelder. # This function is an alternative to the print command that enforces immediate output. # This function forms the average gene expression for each color in the color vector. # Input: gene expression data (rows are samples, columns are genes) # In many applications, we recommend to input normalized gene expression values, i.e. # NormExprData=scale(datExpr). colors= is a vector that encodes the module color of each gene. AverageExprMatrix = function(NormExprData, colors) { nGenes = dim(NormExprData)[2] no.samples = dim(NormExprData)[1] colorsf = as.factor(colors) AverageExpr = matrix(ncol=nlevels(colorsf), nrow = no.samples) ExprDataMtrx = as.matrix(NormExprData) for (i in (1:no.samples)) AverageExpr[i,] = tapply(ExprDataMtrx[i,], colorsf, mean) AverageExpr } #--------------------------------------------------------------------------------- # # ModuleNumber. Function from Peter Langfelder # #--------------------------------------------------------------------------------- # Similar to modulecolor2 above, but returns numbers instead of colors, which is oftentimes more useful. # 0 means unassigned. # This is particularly useful when dealing with a lot of branches in the tree. # Or when integer values instead of strings are preferred to label modules. # Return value is a simple vector, not a factor. # Caution: the module numbers are neither sorted nor sequential, the only guarranteed fact is that grey # probes are labeled by 0 and all probes belonging to the same module have the same number. ModuleNumber = function(HierTree, CutHeight = 0.9, MinSize = 50) { Branches = cutree(HierTree, h = CutHeight); NOnBranches = table(Branches); #NOnBranches[i][[1]] somehow gives the number of elements on branch i. TrueBranch = NOnBranches >= MinSize; Branches[!TrueBranch[Branches]] = 0; #NewLabels = levels(factor(Branches)); #for (lab in 1:length(NewLabels)) if (NewLabels[lab]!=0) # Branches[Branches==NewLabels[lab]] = lab; Branches; } #------------------------------------------------------------------------------------- # # ModulePrincipalComponents # #------------------------------------------------------------------------------------- # This function is an alternative to ModulePrincomps1. Probably it produces the same output. # This function computes the principal components of modules of a given network. # The first PC of each module is the module eigengene. # Input: Data: expression data, module colors. # In general, the first PC is only defined up to a sign. # This function optionally orients the first PC so that it has a positive correlation # with the average normalized gene expression profile, see the function AverageExpression # AlignPCs can take the values "", "along average". # output : a dataframe of module eigengenes (PCs) ModulePrincipalComponents = function(Data, ModuleColors, AlignPCs = "along average", verbose = 1, print.level=0) { spaces = PrintSpaces(print.level); AlignPCsRecognizedValues = c("", "along average"); if (!is.element(AlignPCs, AlignPCsRecognizedValues)) { stop(paste("ModulePrincipalComponents: Error:", "parameter AlignPCs has an unrecognised value:", AlignPCs, "; Recognized values are ", AlignPCsRecognizedValues)); } if (verbose>0) printFlush(paste(spaces, "ModulePrincipalComponents: Calculating PCs")); FullPCs = ModulePrinComps1(Data, ModuleColors); PCs = FullPCs$PrinComps; if (AlignPCs == "") AlignedPCs = PCs; if (AlignPCs == "along average") { if (verbose>0) printFlush(paste(spaces,"ModulePrincipalComponents:", "Aligning PCs with average expression for each module.")) if (verbose>1) printFlush(paste(spaces," ++ Calculating averages...")); NormData = scale(Data); AverageModuleExpr = data.frame(AverageExprMatrix(NormData, ModuleColors)); if (verbose>1) printFlush(paste(spaces," ++ Aligning principal components...")); AverageAndPCCor = diag(cor(PCs, AverageModuleExpr, use = "p")); sign.matrix = matrix(0, nrow = dim(PCs)[2], ncol = dim(PCs)[2]); diag(sign.matrix) = sign(AverageAndPCCor); AlignedPCs = as.data.frame(as.matrix(PCs) %*% sign.matrix); names(AlignedPCs) = names(PCs); rownames(AlignedPCs) = rownames(PCs); names(AverageModuleExpr) = names(PCs); rownames(AverageModuleExpr) = rownames(PCs); if (verbose>1) printFlush(paste(spaces," ++ done.")); } RetPCs = list(data = AlignedPCs, VarExplained = FullPCs$varexplained, ModuleConformity = FullPCs$ModuleConformity, AverageExpr = AverageModuleExpr); RetPCs; } #-------------------------------------------------------------------------------------- # # ModulePCs # This helper function is used for the function MergeCloseModules. #-------------------------------------------------------------------------------------- # Input: Data are gene expression data where columns correspond to genes # and rows correspond to samples that may correspond to different subsets (e.g. study 1 and study 2) # Subsets is a vector of labels "1", "2", "3", etc (i.e. positive integers). # Output: a vector of lists where each list contains the following # data=module eigengenes, # AverageExpr = average normalized module expression # VarExplained= variance explained by the eigengenes # Options: AlignPCs= chooses the sign of each module eigengene by # enforcing a positive correlation with the average normalized expression. ModulePCs = function(Data, Subsets, ModuleColors, OnlySet = NULL, AlignPCs="along average", verbose=1, print.level=0) { spaces = PrintSpaces(print.level) SubsetFactor = factor(Subsets); No.Subsets = nlevels(SubsetFactor); if (verbose>0) printFlush(paste(spaces,"ModulePCs: Looking for module PCs.")); PCs = vector(mode="list", length=No.Subsets); if (is.null(OnlySet)) { CalculatedSubsets = c(1:No.Subsets); } else { CalculatedSubsets = c(OnlySet); } for (subs.ind in CalculatedSubsets) { subset = levels(SubsetFactor)[subs.ind]; if (verbose>0) printFlush(paste(spaces," Working on subset", as.character(subset), "...")); SubsetNetworkData = Data[SubsetFactor == subset, ]; SubsetColors = ModuleColors; SubsetPCs = ModulePrincipalComponents(Data = SubsetNetworkData, ModuleColors = SubsetColors, AlignPCs = AlignPCs, verbose = verbose-1, print.level = print.level+1); PCs[[subs.ind]] = list(data = SubsetPCs$data, AverageExpr = SubsetPCs$AverageExpr, ModuleConformity = SubsetPCs$ModuleConformity, VarExplained = SubsetPCs$VarExplained); rm(SubsetNetworkData); rm(SubsetColors); rm(SubsetPCs); collect_garbage(); } PCs; } #--------------------------------------------------------------------------------------------- # # DynamicMergeCut # #--------------------------------------------------------------------------------------------- if(exists("DynamicMergeCut") ) rm(DynamicMergeCut); DynamicMergeCut=function(n, truemergecor=.9, Zquantile=2.35) { if (truemergecor>1 | truemergecor<0 ) stop("Input error in the function DynamicMergeCut. Please specify 0<=truemergecor<=1") if (truemergecor==1) { print("Warning in function DynamicMergeCut: truemergecor=1. I will set it to .999."); truemermergecor=.999} if (n<4 ) {print("Warning in function DynamicMergeCut: too few observations for the dynamic assignment of the merge threshold. I will set the threshold to .35"); mergethreshold=.35} if (n>3 ) { # Fisher transform of the true merge correlation Fishertruemergecor=.5*log((1+truemergecor)/(1-truemergecor)) E=exp(2*( Fishertruemergecor -Zquantile/sqrt(n-3))) LowerBoundCIcor=(E-1)/(E+1) mergethreshold=1- LowerBoundCIcor} if (mergethreshold>1) 1 mergethreshold }# end of function DynamicMergeCut #============================================================================================= # # Correlation p-value for multiple correlation values # #============================================================================================= corPVal = function(cor, nSamples) { z = abs(0.5 * log((1+cor)/(1-cor)) * sqrt(nSamples-3)); 1-pnorm(z); } #====================================================================================================== # # print.flush # # ===================================================================================================== print.flush = function(...) { printFlush(...); } #--------------------------------------------------------------------------------------------- # # MergeCloseModules # #--------------------------------------------------------------------------------------------- # This function merges modules whose PCs fall on one branch of a hierarchical clustering tree # Method: # First, the module eigenes are computed corresponding to each color (but see option PCs) # Second, we use average linkage hierarchical clustering of the module eigengenes to arrive at a dendrogram # Third, branches are cut-off the dendrogram using a given height (CutHeight) # Fourth, modules whose PCs fall on one branch are merged. # If the option Relabel=True, then the colors of the resulting merged modules # are assigned according to size, e.g. turquoise encodes the largest module # If Relabel=F, then the color of the merged module is chosen according to that of the first module. # Options: the parameter PCs can be used to input a vector of lists # where each list must have a component named data that contains the data frame of module eigengenes. # UseAbs specifies whether the absolute value of the correlation should be used for constructing the clustering tree of the PCs. # if CalculateNewPCs = TRUE is specified then the PCs are newly calculated based on the new merged colors. # The options Subsets=NULL, OnlySet = NULL are useful when the data are comprised of multiple subsets (e.g. study 1 and 2). # StandardColors allows one to specify your own order of colors to be assigned according to module size. # Thus, specifying StandardColors only makes sense if Relabel=True. if(exists("MergeCloseModules")) rm(MergeCloseModules); MergeCloseModules = function(datE, OriginalColors, CutHeight=0.20, PCs = NULL, UseAbs = F, Relabel = FALSE, StandardColors = standardColors(), CalculateNewPCs = TRUE, Subsets = NULL, OnlySet = NULL, verbose = 1, print.level=0) { PCsInSingleFrame = FALSE; spaces = PrintSpaces(print.level); if (verbose>0) printFlush(paste(spaces, "MergeCloseModules: Merging modules whose distance is less than", CutHeight)); if (is.null(Subsets)) { Subsets = rep(1, dim(datE)[1]); if (!is.null(PCs)) { if (is.null(PCs[[1]]$data)) { if (verbose>1) printFlush(paste(spaces, " PCs appear to be a single data frame - will work on", "that assumption.")); if (length(dim(PCs))!=2) stop(paste("PCs must be given either as a vector of lists, one list for each subset, with", "element 'data' containing the frame or matrix of eigengenes", " OR for a single data set PCs can be given as a dataframe or matrix of correct", "dimensions.")); PCsX = vector(mode="list", length = 1); PCsX[[1]] = list(data = PCs); rm(PCs); PCs = PCsX; rm(PCsX); PCsInSingleFrame = TRUE; } if (dim(PCs[[1]]$data)[1]!=dim(datE)[1]) stop(paste("Number of samples in PCs is incompatible with number of samples in datE.")); } } else { SubsetsX = Subsets; Subsets = as.numeric(factor(Subsets)); } if (!is.null(PCs)) { SubsetsX = Subsets; No.Sets = max(Subsets); if (length(PCs)!=No.Sets) stop(paste("Number of sets given by Subsets is incompatible with the number of eigengene sets", "given in PCs.")); SamplesInSet = table(Subsets); for (i in 1:No.Sets) { if (dim(PCs[[i]]$data)[1]!=SamplesInSet[i]) stop(paste("Number of samples in PCs is incompatible with subset length for subset", levels(factor(SubsetsX))[i], " (first occurence).")); } } if (dim(datE)[1]!=length(Subsets)) stop("Number of genes in datE is different from the length of given subset vector. They must equal."); if (dim(datE)[2]!=length(OriginalColors)) stop("Number of genes in datE is different from the length of original colors. They must equal."); if ((CutHeight <0) | (CutHeight>(1+as.integer(UseAbs)))) stop(paste("Given CutHeight is out of sensible range between 0 and", 1+as.integer(UseAbs) )); # If ordered PCs were not given, calculate them if (is.null(PCs)) { PCs = ModulePCs(datE, Subsets, ModuleColors = OriginalColors, OnlySet = OnlySet, verbose = verbose-1, print.level = print.level+1); collect_garbage(); } else if (nlevels(as.factor(OriginalColors))!=dim(PCs[[1]]$data)[2]) { if (verbose>0) printFlush(paste(spaces, " Number of given module colors", "does not match number of given MEs => recalculating the MEs.")) PCs = ModulePCs(datE, Subsets, ModuleColors = OriginalColors, OnlySet = OnlySet, verbose = verbose-1, print.level = print.level+1); collect_garbage(); } # Cluster the found module eigengenes and merge ones that are too close to one another _in both sets_. No.Sets = nlevels(as.factor(Subsets)); MEDiss = vector(mode="list", length = No.Sets); if (is.null(OnlySet)) { CalculatedSubsets = c(1:No.Sets); } else { CalculatedSubsets = c(OnlySet); } for (set in CalculatedSubsets) { IndexRange = c(1:(nlevels(as.factor(OriginalColors))-1)); if (UseAbs) { diss = 1-abs(cor(PCs[[set]]$data[, IndexRange], use = "p")); } else { diss = 1-cor(PCs[[set]]$data[, IndexRange], use = "p"); } MEDiss[[set]] = list(Diss = diss); } if (is.null(OnlySet)) { ConsDiss = (MEDiss[[1]]$Diss) if (No.Sets>1) for (set in 2:No.Sets) ConsDiss = pmax(ConsDiss, MEDiss[[set]]$Diss); } else { ConsDiss = MEDiss[[OnlySet]]$Diss; } METree = hclust(as.dist(ConsDiss), method = "average"); METreeBranches = as.factor(ModuleNumber(HierTree = METree, CutHeight = CutHeight, MinSize = 1)); # Analyze the branches: look for the ones that contain more than one original module MEUniqueBranches = levels(METreeBranches); MENo.Branches = nlevels(METreeBranches) MENumberOnBranch = rep(0, times = MENo.Branches); for (branch in 1:MENo.Branches) { MENumberOnBranch[branch] = sum(METreeBranches==MEUniqueBranches[branch]); } MergedColors = OriginalColors; # Merge modules on the same branch for (branch in 1:MENo.Branches) if (MENumberOnBranch[branch]>1) { ModulesOnThisBranch = names(METreeBranches)[METreeBranches==MEUniqueBranches[branch]]; ColorsOnThisBranch = substring(ModulesOnThisBranch, 3); if (verbose>3) printFlush(paste(spaces, " Merging original colors", paste(ColorsOnThisBranch, collapse=", "))); for (color in 2:length(ColorsOnThisBranch)) MergedColors[MergedColors==ColorsOnThisBranch[color]] = ColorsOnThisBranch[1]; } No.Mods = nlevels(as.factor(MergedColors)); RawModuleColors = levels(as.factor(MergedColors)); # print(paste("No. of new modules: ", No.Mods)); # print(paste("Merged module colors:")); # print(table(as.factor(MergedColors))); if (Relabel) { # Relabel the merged colors to the usual order based on the number of genes in each module if (is.null(StandardColors)) StandardColors = standardColors(); No.GenesInModule = rep(0, No.Mods); for (mod in 1:No.Mods) No.GenesInModule[mod] = sum(MergedColors==RawModuleColors[mod]); # print(paste("No.GenesInModule: ", paste(No.GenesInModule, collapse=", "))); SortedRawModuleColors = RawModuleColors[order(-No.GenesInModule)] # print(paste("SortedRawModuleColors:", paste(SortedRawModuleColors, collapse=", "))); # Change the color names to the standard sequence, but leave grey grey (that's why rank in general does # not equal color) MergedNewColors = MergedColors; if (verbose>3) print(paste(spaces, " Changing original colors:")); rank = 0; for (color in 1:length(SortedRawModuleColors)) if (SortedRawModuleColors[color]!="grey") { rank = rank + 1; if (verbose>3) print(paste(spaces, " ", SortedRawModuleColors[color], "to ", StandardColors[rank])); MergedNewColors[MergedColors==SortedRawModuleColors[color]] = StandardColors[rank]; } # print("Table of new MergedColors:"); # print(table(as.factor(MergedNewColors))); } else { MergedNewColors = MergedColors; } if (CalculateNewPCs) { if (verbose>0) printFlush(paste(spaces, " Calculating new PCs...")); NewPCs = ModulePCs(datE, Subsets, ModuleColors = MergedNewColors, OnlySet = OnlySet, verbose = verbose-1, print.level = print.level+1); } else { NewPCs = NULL; } if (PCsInSingleFrame) { NewPCs = NewPCs[[1]]$data; PCs = PCs[[1]]$data; } list(Colors = MergedNewColors, ClustTree = METree, CutHeight = CutHeight, OldPCs = PCs, NewPCs = NewPCs); } #========================================================================================================== # # blockwiseModules # #========================================================================================================== # Function to calculate modules and eigengenes from all genes. blockwiseModules = function(expr, geneOrder = NULL, geneRank = NULL, corFnc = "cor", corOptions = "use = 'p'", power = 6, blockSize = min(2000, ncol(expr)), stopNGenes = 0.7*blockSize, KMEcut=0.8, TOMLevel = 1, networkType = "unsigned", deepSplit = 2, detectCutHeight = 0.995, minModuleSize = min(20, ncol(expr)/2 ), maxCoreScatter = NULL, minGap = NULL, maxAbsCoreScatter = NULL, minAbsGap = NULL, labelUnlabeled = TRUE, minModuleKME = KMEcut/2, minModuleKMESize = minModuleSize/3, mergeCutHeight = 0.20, impute = TRUE, getTOMs = FALSE, trapErrors = FALSE, numericLabels = FALSE, verbose = 0, indent = 0) { spaces = indentSpaces(indent); if (verbose>0) printFlush(paste(spaces, "Calculating module eigengenes batch-wise from all genes")); #debug = data.frame(debug = 0); #debug$debug = debug$debug+1; #write.table(debug, file = "debug.txt"); if (!is.null(geneOrder)) { warning("Argument 'geneOrder' is deprecated, please use 'geneRank' instead."); if (is.null(geneRank)) { geneRank = rep(0, times = length(geneOrder)); geneRank[geneOrder] = c(1:length(geneOrder)); } else warning(paste("blockwiseModules: 'geneRank' and 'geneOrder' given at the same", "time => disregarding 'geneOrder'.")); } if (length(corOptions)>1) stop("Input error: corOptions must have length 1."); if ( (TOMLevel != 0) & (TOMLevel != 1)) stop("TOMLevel must be 0 or 1."); if ( (power<1) | (power>30) ) stop("power must be between 1 and 30."); if ( (KMEcut>1) | (KMEcut <0) ) stop("KMEcut must be between 0 and 1."); if (is.na(match(deepSplit, c(0:3)))) stop("deepSplit must be an integer between 0 and 3."); intNetworkType = charmatch(networkType, c("unsigned", "signed", "signed hybrid")); if (is.na(intNetworkType)) stop("Unrecognized networkType argument.", "Recognized values are (unique abbreviations of) 'unsigned', 'signed', and 'signed hybrid'."); dimEx = dim(expr); if (length(dimEx)!=2) stop("expr has incorrect dimensions.") nGenes = dimEx[2]; nSamples = dimEx[1]; IndexToOrder = c(1:nGenes); allLabels = rep(0, nGenes); AllPCs = NULL; if (!is.null(geneRank) && (length(geneRank)!=nGenes)) stop("Input error: the length of 'geneRank' does not equal the number of genes in given 'expr'."); # Check data for genes and samples that have too many missing values var = apply(expr, 2, var, na.rm = TRUE); nNAsGenes = apply(is.na(expr), 2, sum); goodGenes = (nNAsGenes < 2*nSamples/3 & var>0 & (nSamples-nNAsGenes >= ..minNSamples)); if (sum(goodGenes) < ..minNGenes) stop("Too few genes with valid expression levels in the required number of samples."); if (verbose>0 & (nGenes - sum(goodGenes) > 0)) printFlush(paste(" ..Excluding", nGenes - sum(goodGenes), "genes from the calculation due to too many missing samples or zero variance.")); nNAsSamples = apply(is.na(expr), 1, sum); goodSamples = (nNAsSamples < 2*nGenes/3 & (nGenes - nNAsSamples >= ..minNGenes)); if (sum(goodSamples) < ..minNSamples) stop("Too few samples with valid expression levels for the required number of genes."); if (verbose>0 & (nSamples - sum(goodSamples)>0)) printFlush(paste(" ..Excluding", nSamples - sum(goodSamples), "samples from the calculation due to too many missing genes.")); expr = expr[goodSamples, goodGenes]; if (!is.null(geneRank)) geneRank = rank(geneRank[goodGenes], ties.method = "first"); nGGenes = sum(goodGenes); nGSamples = sum(goodSamples); IndexToOrder = IndexToOrder[goodGenes]; # Save the "good" expression data and IndexToOrder; will need it at the end. goodExpr = expr; goodIndexTO = IndexToOrder; if (is.null(geneRank)) { if (verbose>1) printFlush(paste(spaces, "....calculating connectivity..")); conn = restrConnectivity(expr, nBest = minModuleSize, softPower = power, verbose = verbose-1, indent = indent + 2); geneRank = rank(-conn, ties.method = "first"); } # Initialize various variables dendros = list(); TOMs = list(); batchGenes = list(); nRemainGenes = nGGenes; batch = 1; maxUsedLabel = 0; while (nRemainGenes > max(minModuleSize, stopNGenes-1)) { if (verbose>1) printFlush(paste(spaces, "..Working on batch", batch, ":", nRemainGenes, "genes remaining.")); # Select most connected genes block = c(1:nRemainGenes)[geneRank <= min(blockSize, nRemainGenes)]; batchGenes[[batch]] = IndexToOrder[block]; selExpr = expr[, block]; # Calculate adjacency if (verbose>2) printFlush(paste(spaces, "....calculating adjacency..")); corEval = parse(text = paste(corFnc, "(x = selExpr, ", corOptions, ")")) c = eval(corEval); if (intNetworkType==1) { c = abs(c); } else if (intNetworkType==2) { c = (1+c)/2; } else if (intNetworkType==3) { c[c < 0] = 0; } else stop("Internal error: intNetworkType has wrong value:", intNetworkType, ". Sorry!"); adj = c^power; # Calculate TOM, if requested if (TOMLevel==1) { if (verbose>2) printFlush(paste(spaces, "....calculating TOM..")); dissTOM = TOMdist1(adj); } else { dissTOM = 1-adj; } if (verbose>2) printFlush(paste(spaces, "....clustering..")); dendros[[batch]] = hclust(as.dist(dissTOM), method = "average"); if (verbose>2) printFlush(paste(spaces, "....detecting modules..")); labels = try(cutreeDynamic(dendro = dendros[[batch]], deepSplit = deepSplit, cutHeight = detectCutHeight, minClusterSize = minModuleSize, method ="hybrid", maxCoreScatter = maxCoreScatter, minGap = minGap, maxAbsCoreScatter = maxAbsCoreScatter, minAbsGap = minAbsGap, labelUnlabeled = labelUnlabeled, distM = dissTOM, verbose = verbose-3, indent = indent + 2), silent = TRUE); collectGarbage(); if (class(labels)=='try-error') { if (verbose>0) { printFlush(paste(spaces, "*** cutreeDynamic returned the following error:\n", spaces, labels, spaces, "Stopping the module detection here.")); } else warning(paste("blockwiseModules: cutreeDynamic returned the following error:\n", " ", labels, "---> Module detection stopped before", "all genes could be considered.")); break; } if (sum(labels>0)==0) { if (verbose>1) { printFlush(paste(spaces, "No modules detected in batch", batch, "with", nRemainGenes, "genes remaining.")); printFlush(paste(spaces, " Stopping the module detection here.")) } break; } labels[labels>0] = labels[labels>0] + maxUsedLabel; maxUsedLabel = max(labels); if (verbose>2) printFlush(paste(spaces, "....calculating module eigengenes..")); PCs = try(moduleEigengenes(selExpr[, labels!=0], labels[labels!=0], impute = impute, # subHubs = TRUE, trapErrors = FALSE, verbose = verbose - 3, indent = indent + 2), silent = TRUE); if (class(PCs)=='try-error') { if (trapErrors) { if (verbose>0) { printFlush(paste(spaces, "*** moduleEigengenes failed with the following message:")); printFlush(paste(spaces, " ", PCs)); printFlush(paste(spaces, " ---> Stopping module detection here.")); } else warning(paste("blockwiseModules: moduleEigengenes failed with the following message:", "\n ", PCs, "---> Module detection stopped before", "all genes could be considered.")); break; } else error(PCs); } else { if (!PCs$allOK) labels[labels!=0] = PCs$validColors; if (sum(labels!=0)==0) { if (verbose>1) { printFlush(paste(spaces, "All modules were removed in batch", batch, "with", nRemainGenes, "genes remaining.")); printFlush(paste(spaces, " Stopping the module detection here.")) } break; } } #propPCs = as.data.frame(PCs$eigengenes[, names(PCs$eigengenes)!="ME0"]); propPCs = PCs$eigengenes; currentLabelIndex = as.numeric(levels(as.factor(labels[labels!=0]))); # Check modules: make sure that of the genes present in the module, at least a minimum number # have a correlation with the eigengene higher than a given cutoff. if (verbose>2) printFlush(paste(spaces, "....checking modules for statistical meaningfulness..")); deleteModules = NULL; for (mod in 1:ncol(propPCs)) { modGenes = (labels==currentLabelIndex[mod]); corEval = parse(text = paste(corFnc, "(selExpr[, modGenes], propPCs[, mod],", corOptions, ")")); KME = as.vector(eval(corEval)); if (networkType=="unsigned") KME = abs(KME); if (sum(KME>minModuleKME) < minModuleKMESize) { labels[modGenes] = 0; deleteModules = c(deleteModules, mod); if (verbose>3) printFlush(paste(spaces, " ..deleting module ", mod, ": of ", sum(modGenes), " total genes in the module only ", sum(KME>minModuleKME), " have the requisite high correlation with the eigengene.", sep="")); } } if (sum(labels>0)==0) { if (verbose>1) { printFlush(paste(spaces, "No significant modules detected in batch", batch, "with", nRemainGenes, "genes remaining.")) printFlush(paste(spaces, "Stopping the module detection here.")); } break; } if (!is.null(deleteModules)) { propPCs = propPCs[, -deleteModules]; currentLabelIndex = currentLabelIndex[-deleteModules]; } if (is.null(dim(propPCs))) dim(propPCs) = c(length(propPCs), 1); if (is.null(AllPCs)) { AllPCs = propPCs; } else { AllPCs = cbind(AllPCs, propPCs); } # record the labels of assigned genes and remove the assigned genes from the pool assigned = block[labels!=0]; allLabels[IndexToOrder[assigned]] = labels[labels!=0]; if (length(assigned) == ncol(expr)) { nRemainGenes = 0; batch = batch + 1; break; } expr = expr[, -assigned]; IndexToOrder = IndexToOrder[-assigned]; geneRank = geneRank[-assigned]; # find genes whose closest module eigengene has cor higher than KMEcut, record labels and # remove them corEval = parse(text = paste(corFnc, "(expr, propPCs,", corOptions, ")")); KME = eval(corEval); if (networkType=="unsigned") KME = abs(KME); KMEmax = apply(KME, 1, max); ClosestModule = currentLabelIndex[apply(KME, 1, which.max)]; assign = (KMEmax >= KMEcut); if (sum(assign>0)) { allLabels[IndexToOrder[assign]] = ClosestModule[assign]; expr = expr[, !assign] IndexToOrder = IndexToOrder[!assign]; geneRank = geneRank[!assign]; } if (getTOMs) TOMs[[batch]] = 1-dissTOM; collectGarbage(); nRemainGenes = ncol(expr); if (!is.null(nRemainGenes)) { batch = batch + 1; geneRank = rank(geneRank, ties.method = "first") } else nRemainGenes = 0; } if (verbose>1) printFlush(paste(spaces, "..merging modules that are too close..")); if (numericLabels) { colors = allLabels } else { colors = labels2colors(allLabels) } mergedAllColors = colors; MEsOK = TRUE; mergedMods = try(mergeCloseModules(goodExpr, colors[goodGenes], cutHeight = mergeCutHeight, relabel = TRUE, # trapErrors = FALSE, impute = impute, verbose = verbose-2, indent = indent + 2), silent = TRUE); if (class(mergedMods)=='try-error') { warning(paste("blockwiseModules: mergeCloseModules failed with the following error message:\n ", mergedMods, "\n--> returning unmerged colors.\n")); MEs = try(moduleEigengenes(goodExpr, colors[goodGenes], # subHubs = TRUE, trapErrors = FALSE, impute = impute, verbose = verbose-3, indent = indent+3), silent = TRUE); if (class(MEs) == 'try-error') { if (!trapErrors) stop(MEs); if (verbose>0) { printFlush(paste(spaces, "*** moduleEigengenes failed with the following error message:")); printFlush(paste(spaces, " ", MEs)); printFlush(paste(spaces, "*** returning no module eigengenes.\n")); } else warning(paste("blockwiseModules: moduleEigengenes failed with the following error message:\n ", MEs, "\n--> returning no module eigengenes.\n")); allSamplePCs = NULL; MEsOK = FALSE; } else { if (sum(!MEs$validMEs)>0) { colors[goodGenes] = MEs$validColors; MEs = MEs$eigengenes[, MEs$validMEs]; } else MEs = MEs$eigengenes; allSamplePCs = as.data.frame(matrix(NA, nrow = nSamples, ncol = ncol(MEs))); allSamplePCs[goodSamples, ] = MEs[,]; names(allSamplePCs) = names(MEs); } } else { mergedAllColors[goodGenes] = mergedMods$colors; allSamplePCs = as.data.frame(matrix(NA, nrow = nSamples, ncol = ncol(mergedMods$newMEs))); allSamplePCs[goodSamples, ] = mergedMods$newMEs[,]; names(allSamplePCs) = names(mergedMods$newMEs); } if (!getTOMs) TOMs = NULL; list(colors = mergedAllColors, unmergedColors = colors, MEs = allSamplePCs, goodSamples = goodSamples, goodGenes = goodGenes, dendrograms = dendros, TOMs = TOMs, blockGenes = batchGenes, MEsOK = MEsOK); } #========================================================================================================== # # blockwiseConsensusModules # #========================================================================================================== # Function to calculate consensus modules and eigengenes from all genes. blockwiseConsensusModules = function(expr, geneRank = NULL, corFnc = "cor", corOptions = "use = 'p'", power = 6, blockSize = 2000, stopNGenes = 0.7*blockSize, checkStopNGenes = TRUE, KMEcut=0.8, TOMLevel = 1, networkType = "unsigned", scaleTOMs = TRUE, scaleQuantile = 0.95, deepSplit = 2, detectCutHeight = 0.995, minModuleSize = 20, checkMinModuleSize = TRUE, maxCoreScatter = NULL, minGap = NULL, maxAbsCoreScatter = NULL, minAbsGap = NULL, labelUnlabeled = TRUE, minModuleKME = KMEcut/2, minModuleKMESize = minModuleSize/3, mergeCutHeight = 0.20, impute = TRUE, getTOMs = FALSE, getOrigin = FALSE, getSetTOMs = FALSE, trapErrors = FALSE, checkPower = TRUE, numericLabels = FALSE, verbose = 0, indent = 0) { spaces = indentSpaces(indent); dataSize = checkSets(expr); nSets = dataSize$nSets; nGenes = dataSize$nGenes; if (length(power)!=1) { if (length(power)!=nSets) stop("Invalid arguments: Length of 'power' must equal number of sets given in 'expr'."); } else { power = rep(power, nSets); } if (!is.null(geneRank) && (length(geneRank)!=nGenes)) stop("Input error: length of 'geneRank' must equal number of genes in 'expr'."); if (length(corOptions)>1) stop("Input error: corOptions must have length 1."); if (blockSize>nGenes) blockSize = nGenes; if (checkStopNGenes & (stopNGenes>0.7*blockSize)) { stopNGenes = 0.7*blockSize; warning("blockwiseConsensusModules: stopNGenes appeared too large and was lowered to", stopNGenes, ". If you insist on keeping the original stopNGenes, set checkStopNGenes = FALSE."); } if (checkMinModuleSize & (minModuleSize > nGenes/2)) { minModuleSize = nGenes/2; warning("blockwiseConsensusModules: minModuleSize appeared too large and was lowered to", minModuleSize, ". If you insist on keeping the original minModuleSize, set checkMinModuleSize = FALSE."); } if (verbose>0) printFlush(paste(spaces, "Calculating consensus modules and module eigengenes", "block-wise from all genes")); if ( (TOMLevel != 0) & (TOMLevel != 1)) stop("TOMLevel must be 0 or 1."); if ( checkPower & ((sum(power<1)>0) | (sum(power>50)>0) ) ) stop("power must be between 1 and 50."); if ( (KMEcut>1) | (KMEcut <0) ) stop("KMEcut must be between 0 and 1."); if (is.na(match(deepSplit, c(0:3)))) stop("deepSplit must be an integer between 0 and 3."); intNetworkType = charmatch(networkType, c("unsigned", "signed", "signed hybrid")); if (is.na(intNetworkType)) stop("Unrecognized networkType argument.", "Recognized values are (unique abbreviations of) 'unsigned', 'signed', and 'signed hybrid'."); nGenes = dataSize$nGenes nSamples = dataSize$nSamples; IndexToOrder = c(1:nGenes); if (getSetTOMs) nTOMSets = nSets + 1 else nTOMSets = 1; allLabels = rep(0, nGenes); # Check data for genes and samples that have too many missing values goodGenes = rep(TRUE, nGenes); for (set in 1:nSets) { var = sd(expr[[set]]$data, na.rm = TRUE); nNAsGenes = apply(is.na(expr[[set]]$data), 2, sum); goodGenes[nNAsGenes > 2*nSamples[set]/3 | var==0 | (nSamples[set]-nNAsGenes <= ..minNSamples)] = FALSE; } if (sum(goodGenes) < ..minNGenes) stop("Too few genes with valid expression levels in the required number of samples in all sets."); if (verbose>0 & (nGenes - sum(goodGenes) > 0)) printFlush(paste(" ..Excluding", nGenes - sum(goodGenes), "genes from the calculation due to too many missing samples or zero variance.")); goodSamples = list(); for (set in 1:nSets) { nNAsSamples = apply(is.na(expr[[set]]$data), 1, sum); goodSamples[[set]] = (nNAsSamples < 2*nGenes/3 & (nGenes - nNAsSamples >= ..minNGenes)); if (sum(goodSamples[[set]]) < ..minNGenes) stop("Too few samples with valid expression levels for the required number of genes in set", set); if (verbose>0 & (nSamples[set] - sum(goodSamples[[set]])>0)) printFlush(paste(" ..Set", set,": Excluding", nSamples - sum(goodSamples[[set]]), "samples from the calculation due to too many missing genes.")); } for (set in 1:nSets) expr[[set]]$data = expr[[set]]$data[goodSamples[[set]], goodGenes]; if (!is.null(geneRank)) geneRank = rank(geneRank[goodGenes], ties.method = "first"); nGGenes = sum(goodGenes); nGSamples = rep(0, nSets); for (set in 1:nSets) nGSamples[set] = sum(goodSamples[[set]]); # Save the "good" expression data and IndexToOrder; will need it at the end. goodExpr = expr; IndexToOrder = IndexToOrder[goodGenes]; goodIndexTO = IndexToOrder; conn = matrix(0, nrow = nGenes, ncol = nSets); if (is.null(geneRank)) { if (verbose>1) printFlush(paste(spaces, '....calculating "consensus" connectivity to order genes..')); conn = restrConnectivityMS(expr, nBest = minModuleSize, softPower = power, blockSize = min(blockSize, 500), verbose = verbose-1, indent = indent + 2); for (set in 1:nSets) { conn[, set] = rank(conn[, set], ties.method = "first"); } consConn = apply(as.matrix(conn), 1, min); geneRank = rank(-consConn, ties.method = "first"); } # Initialize various variables TOMs = vector(mode = "list", length = nTOMSets); consMEs = vector(mode = "list", length = nSets); dendros = list(); scalePowers = rep(0, times = nSets); scaleQuants = rep(0, times = nSets); originCount = rep(0, nSets); batchGenes = list(); nRemainGenes = nGGenes; batch = 1; maxUsedLabel = 0; while (nRemainGenes > max(minModuleSize, stopNGenes)) { if (verbose>1) printFlush(paste(spaces, "..Working on batch", batch, ":", nRemainGenes, "genes remaining.")); # Select most connected genes block = c(1:nRemainGenes)[geneRank <= min(blockSize, nRemainGenes)]; nBlockGenes = length(block); batchGenes[[batch]] = IndexToOrder[block]; blockTOM = array(0, dim = c(nBlockGenes, nBlockGenes, nSets+1)); scaleQuant = rep(0, nSets); scalePowers = rep(1, nSets); selExpr = vector(mode = "list", length = nSets); errorOccurred = FALSE; for (set in 1:nSets) { if (verbose>2) printFlush(paste(spaces, "....Working on set", set)) selExpr[[set]] = list(data = expr[[set]]$data[, block]); # Calculate adjacency if (verbose>2) printFlush(paste(spaces, "......calculating adjacency..")); corEval = parse(text = paste(corFnc, "(x = selExpr[[set]]$data, ", corOptions, ")")) c = eval(corEval); if (intNetworkType==1) { c = abs(c); } else if (intNetworkType==2) { c = (1+c)/2; } else if (intNetworkType==3) { c[c < 0] = 0; } else stop("Internal error: intNetworkType has wrong value:", intNetworkType, ". Sorry!"); tom = c^power[set]; # Calculate TOM, if requested if (TOMLevel==1) { if (verbose>2) printFlush(paste(spaces, "......calculating TOM..")); tom = TOMsim(tom); } scaleQuant[set] = quantile(x = as.vector(tom), probs = scaleQuantile, type = 8); if (set>1) { scalePowers[set] = log(scaleQuant[1])/log(scaleQuant[set]); tom = tom^scalePowers[set]; } blockTOM[ , , set+1] = tom; } if (verbose>2) printFlush(paste(spaces, "....Calculating consensus network")) blockTOM[ , , 1] = apply(blockTOM[, , c(2:(nSets+1))], c(1:2), min); if (getOrigin) { origin = apply(blockTOM[, , c(2:(nSets+1))], c(1:2), which.min); originCount = originCount + as.vector(table(as.vector(origin))); rm(origin); collectGarbage(); } # Some debugging code: # blockTOMind = apply(blockTOM[, , c(2:(nSets+1))], c(1:2), which.min); # table(blockTOMind[upper.tri(blockTOMind)]) if (verbose>2) printFlush(paste(spaces, "....clustering and detecting modules..")); errorOccured = FALSE; dendros[[batch]] = hclust(as.dist(1-blockTOM[, , 1]), method = "average"); labs = try(cutreeDynamic(dendro = dendros[[batch]], deepSplit = deepSplit, cutHeight = detectCutHeight, minClusterSize = minModuleSize, method ="hybrid", maxCoreScatter = maxCoreScatter, minGap = minGap, maxAbsCoreScatter = maxAbsCoreScatter, minAbsGap = minAbsGap, labelUnlabeled = labelUnlabeled, distM = 1-blockTOM[, , 1], verbose = verbose-3, indent = indent + 2), silent = TRUE); if (class(labs)=='try-error') { warning(paste("blockwiseConsensusModules: cutreeDynamic failed:\n ", labs, "Error occured in batch", batch, "and set", set-1, "\nStopping module detection here.")); errorOccured = TRUE; } else { blockLabs = labs blockLabs[blockLabs>0] = blockLabs[blockLabs>0] + maxUsedLabel; maxUsedLabel = max(blockLabs); } if (errorOccured || sum(blockLabs>0)==0) { if (verbose>1) { printFlush(paste(spaces, "No modules detected in batch", batch, "with", nRemainGenes, "genes remaining.")); printFlush(paste(spaces, " Stopping the module detection here.")) } break; } # Check consensus modules for statistical meaningfulness if (verbose>2) printFlush(paste(spaces, "....checking modules for statistical meaningfulness..")); blockAssigned = c(1:nBlockGenes)[blockLabs!=0]; currentLabelIndex = as.numeric(levels(as.factor(blockLabs[blockAssigned]))); blockConsMEs = try(multiSetMEs(selExpr, universalColors = blockLabs, excludeGrey = TRUE, grey = 0, impute = impute, # trapErrors = TRUE, returnValidOnly = TRUE, verbose = verbose-4, indent = indent + 3), silent = TRUE); if (class(blockConsMEs)=='try-error') { if (verbose>0) { printFlush(paste(spaces, "*** multiSetMEs failed with the message:")); printFlush(paste(spaces, " ", blockConsMEs)); printFlush(paste(spaces, "*** --> Ending module detection here")); } else warning(paste("blocwiseConsensusModules: multiSetMEs failed with the message: \n", " ", blockConsMEs, "\n--> Ending module detection here")); break; } if (!blockConsMEs[[1]]$allOK) { blockLabs = blockConsMEs[[1]]$validColors; blockAssigned = c(1:nBlockGenes)[blockLabs!=0]; currentLabelIndex = as.numeric(levels(as.factor(blockLabs[blockAssigned]))); } if (sum(blockLabs>0)==0) { if (verbose>1) { printFlush(paste(spaces, "No modules detected in batch", batch, "with", nRemainGenes, "genes remaining.")); printFlush(paste(spaces, " Stopping the module detection here.")) } break; } # Check modules: make sure that of the genes present in the module, at least a minimum number # have a correlation with the eigengene higher than a given cutoff. if (verbose>2) printFlush(paste(spaces, "....checking consensus modules for statistical meaningfulness..")); deleteModules = NULL; for (mod in 1:ncol(blockConsMEs[[1]]$data)) { modGenes = (blockLabs==currentLabelIndex[mod]); KME = matrix(0, nrow = sum(modGenes), ncol = nSets); corEval = parse(text = paste(corFnc, "( selExpr[[set]]$data[, modGenes], blockConsMEs[[set]]$data[, mod],", corOptions, ")")) for (set in 1:nSets) KME[, set] = as.vector(eval(corEval)); if (networkType=="unsigned") KME = abs(KME); consKME = apply(KME, 1, min, na.rm = TRUE); if (sum(consKME>minModuleKME) < minModuleKMESize) { blockLabs[modGenes] = 0; deleteModules = c(deleteModules, mod); if (verbose>3) printFlush(paste(spaces, " ..deleting module ",currentLabelIndex[mod], ": of ", sum(modGenes), " total genes in the module only ", sum(consKME>minModuleKME), " have the requisite high correlation with the eigengene in all sets.", sep="")); } } blockAssigned = blockLabs!=0; if (sum(blockAssigned>0)==0) { if (verbose>1) { printFlush(paste(spaces, "** No significant modules detected in batch", batch, "with", nRemainGenes, "genes remaining.")) printFlush(paste(spaces, "** Stopping the module detection here.")); } break; } if (!is.null(deleteModules)) { for (set in 1:nSets) blockConsMEs[[set]]$data = blockConsMEs[[set]]$data[, -deleteModules]; currentLabelIndex = currentLabelIndex[-deleteModules]; } for (set in 1:nSets) if (is.null(dim(blockConsMEs[[set]]$data))) dim(blockConsMEs[[set]]$data) = c(length(blockConsMEs[[set]]$data), 1); if (is.null(consMEs[[1]])) { for (set in 1:nSets) consMEs[[set]] = blockConsMEs[[set]]$data; } else for (set in 1:nSets) consMEs[[set]] = cbind(consMEs[[set]], blockConsMEs[[set]]$data); # record the labels of assigned genes and remove the assigned genes from the pool assigned = block[blockAssigned]; allLabels[IndexToOrder[assigned]] = blockLabs[blockAssigned]; if (length(assigned) == nRemainGenes) { nRemainGenes = 0; batch = batch + 1; break; } for (set in 1:nSets) expr[[set]]$data = expr[[set]]$data[, -assigned]; nRemainGenes = ncol(expr[[1]]$data); IndexToOrder = IndexToOrder[-assigned]; geneRank = geneRank[-assigned]; # find genes whose closest module eigengene has cor higher than KMEcut, record labels and # remove them blockKME = array(0, dim = c(nRemainGenes, ncol(blockConsMEs[[1]]$data), nSets)); corEval = parse(text = paste(corFnc, "( expr[[set]]$data, blockConsMEs[[set]]$data,", corOptions, ")")) for (set in 1:nSets) blockKME[, , set] = eval(corEval); if (networkType=="unsigned") blockKME = abs(blockKME); consKME = apply(blockKME, c(1,2), min); consKMEmax = apply(consKME, 1, max); closestModule = currentLabelIndex[apply(consKME, 1, which.max)]; assign = (consKMEmax >= KMEcut); if (sum(assign>0)) { allLabels[IndexToOrder[assign]] = closestModule[assign]; for (set in 1:nSets) expr[[set]]$data = expr[[set]]$data[, !assign] IndexToOrder = IndexToOrder[!assign]; geneRank = geneRank[!assign]; } if (getTOMs) { for (set in 1:nTOMSets) TOMs[[set]][[batch]] = blockTOM[, , set]; } collectGarbage(); nRemainGenes = ncol(expr[[1]]$data); if (!is.null(nRemainGenes)) { batch = batch + 1; geneRank = rank(geneRank, ties.method = "first") } else nRemainGenes = 0; } if (verbose>1) printFlush(paste(spaces, "..merging consensus modules that are too close..")); #print(table(allLabels)); #print(is.numeric(allLabels)) if (numericLabels) { colors = allLabels } else { colors = labels2colors(allLabels) } mergedColors = colors; mergedMods = try(mergeCloseModules(goodExpr, colors[goodGenes], cutHeight = mergeCutHeight, relabel = TRUE, impute = impute, verbose = verbose-2, indent = indent + 2), silent = TRUE); if (class(mergedMods)=='try-error') { if (verbose>0) { printFlush(paste(spaces, 'blockwiseConsensusModules: mergeCloseModule failed with this message:\n', spaces, ' ', mergedMods, spaces, '---> returning unmerged consensus modules')); } else warning(paste('blockwiseConsensusModules: mergeCloseModule failed with this message:\n ', mergedMods, '---> returning unmerged consensus modules')); MEs = try(multiSetMEs(goodExpr, universalColors = colors[goodGenes, 1] # trapErrors = TRUE, returnValidOnly = TRUE ), silent = TRUE); if (class(MEs)=='try-error') { warning(paste('blockwiseConsensusModules: ME calculation failed with this message:\n ', MEs, '---> returning empty module eigengenes')); allSampleMEs = NULL; } else { if (!MEs[[1]]$allOK) mergedColors[goodGenes, 1] = MEs[[1]]$validColors; allSampleMEs = vector(mode = "list", length = nSets); for (set in 1:nSets) { allSampleMEs[[set]] = list(data = as.data.frame(matrix(NA, nrow = nSamples[set], ncol = ncol(MEs[[set]]$data)))); allSampleMEs[[set]]$data[goodSamples[[set]], ] = MEs[[set]]$data[,]; names(allSampleMEs[[set]]$data) = names(MEs[[set]]$data); } } } else { mergedColors[goodGenes] = mergedMods$colors; allSampleMEs = vector(mode = "list", length = nSets); for (set in 1:nSets) { allSampleMEs[[set]] = list(data = as.data.frame(matrix(NA, nrow = nSamples[set], ncol = ncol(mergedMods$newMEs[[1]]$data)))); allSampleMEs[[set]]$data[goodSamples[[set]], ] = mergedMods$newMEs[[set]]$data[,]; names(allSampleMEs[[set]]$data) = names(mergedMods$newMEs[[set]]$data); } } if (!getTOMs) TOMs = NULL; if (getSetTOMs) { setTOMs = list(); for (set in 1:nSets) setTOMs[[set]] = TOMs[[set+1]]; } else { setTOMs = NULL; } consTOM = TOMs[[1]]; rm(TOMs); collectGarbage(); list(consColors = mergedColors, consUnmergedColors = colors, consMEs = allSampleMEs, goodSamples = goodSamples, goodGenes = goodGenes, consDendros = dendros, consTOM = consTOM, setTOMs = setTOMs, blockGenes = batchGenes, originCount = originCount); } ############################################################################################## # I) GENERAL STATISTICAL FUNCTIONS ############################################################################################## mean1=function(x) mean(x,na.rm=T) var1=function(x) var(x,na.rm=T) if(exists("scatterplot1") ) rm(scatterplot1); scatterplot1=function(x,y, title1="",col1="black",xlab1=NA,ylab1=NA, cex1=1, cex.axis1=1.5,cex.lab1=1.5, cex.main1=1.5 ,ylim1=-1,correlationmethod="p" ){ if ( is.na(xlab1) ) xlab1= as.character(match.call(expand.dots = FALSE)$x) if ( is.na(ylab1) ) ylab1= as.character(match.call(expand.dots = FALSE)$y) x= as.numeric(as.character(x)) y= as.numeric(as.character(y)) cor1=signif(cor(x,y,use="p",method=correlationmethod),2) corp=signif(cor.test(x,y,use="p",method=correlationmethod)$p.value,2) if (corp<10^(-20) ) corp="<10^{-20}" if ( length(ylim1)==2) {plot(x,y, main=paste(title1,"cor=", cor1,"p=",corp),col=col1,xlab=xlab1,ylab=ylab1, cex=cex1, cex.axis=cex.axis1,cex.lab=cex.lab1, cex.main=cex.main1,ylim=ylim1)} else { plot(x,y, main=paste(title1,"cor=", cor1,"p=",corp),col=as.character(col1),xlab=xlab1,ylab=ylab1, cex=cex1, cex.axis=cex.axis1,cex.lab=cex.lab1, cex.main=cex.main1) } } if (exists("boxplot1") ) rm(boxplot1); boxplot1=function(x,g,title1=" ",xlab1=NA,ylab1=NA, cex.axis1=1.5,cex.lab1=1.5, cex.main1=1.5, cex1=1.5,col1="white") { if ( is.na(xlab1) ) xlab1= as.character(match.call(expand.dots = FALSE)$x) print(xlab1) if ( is.na(ylab1) ) ylab1= as.character( match.call(expand.dots = FALSE)$g) print(ylab1) p1=signif(kruskal.test(x, factor(g) )$p.value,2) if (p1< 5.0*10^(-22) ) p1="<5.0x10^{-22}" boxplot(x~factor(g), notch=T, varwidth=T, main=paste(title1,", p=",as.character(p1) ), col=col1,xlab=xlab1,ylab=ylab1, cex=cex1, cex.axis=cex.axis1, cex.lab=cex.lab1, cex.main=cex.main1) } # this function compute an asymptotic p-value for a given correlation (r) and sample size (n) if (exists("FisherTransformP") ) rm(FisherTransformP); FisherTransformP=function(r, n) { #Z=sqrt(n-3) * 0.5*log((1+r)/(1-r)) #2*(1-pnorm(abs(Z) )) # the following is implemented in the cor.test function T=sqrt(n-2) * r/sqrt(1-r^2) 2*(1-pt(abs(T),n-2)) } # Outlier resistant correlations corNoOutlier = function(x, y = NULL, use = use, method = "pearson", diagOnly = FALSE) { x = as.matrix(x); median = apply(x, 2, median, na.rm = TRUE); removeSampleX = apply(abs(x - matrix(median, nrow = nrow(x), ncol = ncol(x), byrow = TRUE)), 2, which.max); for (col in 1:ncol(x)) x[removeSampleX[col], col] = NA; if (is.null(y)) { if (diagOnly) stop("Bad input: diagOnly makes sense only if x and y are both given."); cor(x, use = use, method = method); } else { y = as.matrix(y); median = apply(y, 2, median, na.rm = TRUE); removeSampleY = apply(abs(y - matrix(median, nrow = nrow(y), ncol = ncol(y), byrow = TRUE)), 2, which.max); for (col in 1:ncol(y)) y[removeSampleY[col], col] = NA; if (diagOnly) { nc = min(ncol(x), ncol(y)); cor = rep(0, nc); for (col in 1:nc) cor[col] = cor(x[, col], y[, col], use = use, method = method); cor } else { cor(x, y, use = use, method = method) } } } # Biweight correlation. The original functions are from # http://www.unt.edu/benchmarks/archives/2001/december01/rss.htm and that in # turn seems to be based on Wilcox (1997, page 197). # The actually useful versions are PL's rewrites that use block code so the # functions can be used with matrices as inputs. PL's function could be # improved since it does some calculations unnecessarily twice when y==NULL. bicov.original<-function(x, y, na.rm = FALSE){ mx <- median(x, na.rm = na.rm) my <-median(y, na.rm = na.rm) ux <- abs((x - mx)/(9 * qnorm(0.75) * mad(x, na.rm = na.rm))) uy <- abs((y - my)/(9 * qnorm(0.75) * mad(y, na.rm = na.rm))) aval <- ifelse(ux <= 1, 1, 0) bval <- ifelse(uy <= 1, 1, 0) top <- sum(aval * (x - mx) * (1 - ux^2)^2 * bval * (y - my) * (1 - uy^2)^2, na.rm = na.rm) top <- sum(!is.na(x) & !is.na(y)) * top botx <- sum(aval * (1 - ux^2) * (1 - 5 * ux^2), na.rm = na.rm) boty <- sum(bval * (1 - uy^2) * (1 - 5 * uy^2), na.rm = na.rm) bi <- top/(botx * boty) bi } bicor.original<-function(x, y, na.rm = FALSE){ x <-as.numeric(x) y<-as.numeric(y) bicov(x,y, na.rm = na.rm)/(sqrt(bicov(x,x, na.rm = na.rm)*bicov(y,y, na.rm = na.rm))) } # This function should now work for both vector and matrix x and y (all combinations); # Big WARNING: The use = pairwise.complete.obs simply works by removing NAs independently in x and # y!!!!! bicov = function(x, y = NULL, robustX = TRUE, robustY = TRUE, use = "all.obs", diagOnly = FALSE) { na.method <- pmatch(use, c("all.obs", "pairwise.complete.obs")) if (is.na(na.method)) stop(paste("Unrecognized parameter 'use'. Recognized values are \n", "'all.obs', 'pairwise.complete.obs'")) na.rm = (na.method==2); if (is.null(y)) { x = as.matrix(x); if (robustX) { mx <- apply(x, 2, median, na.rm = na.rm) mxMat = matrix(mx, nrow = nrow(x), ncol = ncol(x), byrow = TRUE); madxMat = matrix(apply(x, 2, mad, na.rm = na.rm), nrow = nrow(x), ncol = ncol(x), byrow = TRUE); ux <- as.matrix(abs((x - mxMat)/(9 * qnorm(0.75) * madxMat))) } else { mx = apply(x, 2, mean, na.rm = na.rm); mxMat = matrix(mx, nrow = nrow(x), ncol = ncol(x), byrow = TRUE); ux = matrix(0, nrow = nrow(mxMat), ncol = ncol(mxMat)); } aval <- ifelse(ux <= 1, 1, 0) botx <- apply(as.matrix(aval * (1 - ux^2) * (1 - 5 * ux^2)), 2, sum, na.rm = na.rm); ux[is.na(ux)] = 1; aval[is.na(aval)] = 0; if (diagOnly) { # diagOnly makes sense particularly for y=NULL because it is used in the # normalization in bicor. topFact = apply(!is.na(x), 2, sum) x[is.na(x)] = mxMat[is.na(x)]; top <- apply(as.matrix(aval * (x - mxMat) * (1 - ux^2)^2)^2, 2, sum); top <- top * topFact; bi <- top/((botx^2) * (1-1/nrow(x))); } else { topFact = t(!is.na(x)) %*% !is.na(x) x[is.na(x)] = mxMat[is.na(x)]; top <- t(as.matrix(aval * (x - mxMat) * (1 - ux^2)^2)) %*% as.matrix((aval * (x - mxMat) * (1 - ux^2)^2)) top <- top * topFact; bi <- top/(botx %o% (botx *(1-1/nrow(x)) )) } bi } else { x = as.matrix(x); y = as.matrix(y); if (robustX) { mx <- apply(x, 2, median, na.rm = na.rm) mxMat = matrix(mx, nrow = nrow(x), ncol = ncol(x), byrow = TRUE); madxMat = matrix(apply(x, 2, mad, na.rm = na.rm), nrow = nrow(x), ncol = ncol(x), byrow = TRUE); ux <- as.matrix(abs((x - mxMat)/(9 * qnorm(0.75) * madxMat))) } else { mx = apply(x, 2, mean, na.rm = na.rm); mxMat = matrix(mx, nrow = nrow(x), ncol = ncol(x), byrow = TRUE); ux = matrix(0, nrow = nrow(mxMat), ncol = ncol(mxMat)); } if (robustY) { my <- apply(y, 2, median, na.rm = na.rm) myMat = matrix(my, nrow = nrow(y), ncol = ncol(y), byrow = TRUE); madyMat = matrix(apply(y, 2, mad, na.rm = na.rm), nrow = nrow(y), ncol = ncol(y), byrow = TRUE); uy <- as.matrix(abs((y - myMat)/(9 * qnorm(0.75) * madyMat))) } else { my = apply(y, 2, mean, na.rm = na.rm); myMat = matrix(my, nrow = nrow(y), ncol = ncol(y), byrow = TRUE); uy = matrix(0, nrow = nrow(myMat), ncol = ncol(myMat)); } aval <- ifelse(ux <= 1, 1, 0) bval <- ifelse(uy <= 1, 1, 0) botx <- apply(as.matrix(aval * (1 - ux^2) * (1 - 5 * ux^2)), 2, sum, na.rm = na.rm) boty <- apply(as.matrix(bval * (1 - uy^2) * (1 - 5 * uy^2)), 2, sum, na.rm = na.rm) ux[is.na(ux)] = 1; uy[is.na(uy)] = 1; aval[is.na(aval)] = 0; bval[is.na(bval)] = 0; topFact = t(!is.na(x)) %*% !is.na(y) x[is.na(x)] = mxMat[is.na(x)]; y[is.na(y)] = myMat[is.na(y)]; top <- t(as.matrix(aval * (x - mxMat) * (1 - ux^2)^2)) %*% as.matrix((bval * (y - myMat) * (1 - uy^2)^2)) top <- top * topFact; bi <- top/(botx %o% (boty * (1-1/nrow(x)))) bi } } #### Calculate biweight midcorrelation. # # if robustX,Y = FALSE, the calculation on that variable will be the usual correlation # The reason is that on bbinary variables the biweight correlation, covariance and variance returns NA. # bicor<-function(x, y = NULL, robustX = TRUE, robustY = TRUE, use = "all.obs") { na.method <- pmatch(use, c("all.obs", "pairwise.complete.obs")) if (is.na(na.method)) stop(paste("Unrecognized parameter 'use'. Recognized values are \n", "'all.obs', 'pairwise.complete.obs'")) if (is.null(y)) { bcx = sqrt(bicov(x, robustX = robustX, use = use, diagOnly = TRUE)); bi = bicov(x, robustX = robustX, use = use)/(bcx %o% bcx); (bi + t(bi))/2; } else bicov(x, y, robustX = robustX, robustY = robustY, use = use)/ (sqrt(bicov(x, robustX = robustX, use = use, diagOnly = TRUE)) %o% sqrt(bicov(y, robustX = robustY, use = use, diagOnly = TRUE))) } # This version of bicor handles NA's in y correctly, but may run somewhat slower. Note that NA's in # x are still not handled quite correctly bicorNAy<-function(x, y = NULL, robustX = TRUE, robustY = TRUE, use = "all.obs") { na.method <- pmatch(use, c("all.obs", "pairwise.complete.obs")) if (is.na(na.method)) stop(paste("Unrecognized parameter 'use'. Recognized values are \n", "'all.obs', 'pairwise.complete.obs'")) if (is.null(y)) { bcx = sqrt(bicov(x, robustX = robustX, use = use, diagOnly = TRUE)); bicov(x, robustX = robustX, use = use)/(bcx %o% bcx); } else { if (na.method==1) { bicov(x, y, robustX = robustX, robustY = robustY, use = use)/ (sqrt(bicov(x, robustX = robustX, use = use, diagOnly = TRUE)) %o% sqrt(bicov(y, robustX = robustY, use = use, diagOnly = TRUE))) } else { bi = matrix(NA, nrow = ncol(x), ncol = ncol(y)); for (i in 1:ncol(y)) { xx = x; xx[is.na(y[, i]), ] = NA; bi[, i] = bicov(xx, y[,i], robustX = robustX, robustY = robustY, use = use)/ (sqrt(bicov(xx, robustX = robustX, use = use, diagOnly = TRUE)) %o% sqrt(bicov(y[,i], robustX = robustY, use = use, diagOnly = TRUE))) } bi; } } } # Weighted version of bicov and bicor bicovw = function(x, xWeight, y = NULL, yWeight = NULL, robustX = TRUE, robustY = TRUE, use = "all.obs", diagOnly = FALSE) { x = as.matrix(x); xWeight = as.matrix(xWeight); if (!all.equal(dim(x), dim(xWeight))) stop("x and xWeight must have the same dimensions."); na.method <- pmatch(use, c("all.obs", "pairwise.complete.obs")) if (is.na(na.method)) stop(paste("Unrecognized parameter 'use'. Recognized values are \n", "'all.obs', 'pairwise.complete.obs'")) na.rm = (na.method==2); if (is.null(y)) { if (robustX) { #mx <- apply(x, 2, median, na.rm = na.rm) mx = medianw(x, xWeight); mxMat = matrix(mx, nrow = nrow(x), ncol = ncol(x), byrow = TRUE); madxMat = matrix(apply(x, 2, mad, na.rm = na.rm), nrow = nrow(x), ncol = ncol(x), byrow = TRUE); ux <- as.matrix(abs((x - mxMat)/(9 * qnorm(0.75) * madxMat))) } else { mx = meanw(x, xWeight) mxMat = matrix(mx, nrow = nrow(x), ncol = ncol(x), byrow = TRUE); ux = matrix(0, nrow = nrow(mxMat), ncol = ncol(mxMat)); } aval = ifelse(ux <= 1, 1, 0) * xWeight; botx <- apply(as.matrix(aval * (1 - ux^2) * (1 - 5 * ux^2)), 2, sum, na.rm = na.rm); ux[is.na(ux)] = 1; aval[is.na(aval)] = 0; if (diagOnly) { # diagOnly makes sense particularly for y=NULL because it is used in the # normalization in bicor. topFact = apply(!is.na(x), 2, sum) x[is.na(x)] = mxMat[is.na(x)]; top <- apply(as.matrix(aval * (x - mxMat) * (1 - ux^2)^2)^2, 2, sum); top <- top * topFact; bi <- top/((botx^2) * (1-1/nrow(x))); } else { topFact = t(!is.na(x)) %*% !is.na(x) x[is.na(x)] = mxMat[is.na(x)]; top <- t(as.matrix(aval * (x - mxMat) * (1 - ux^2)^2)) %*% as.matrix((aval * (x - mxMat) * (1 - ux^2)^2)) top <- top * topFact; bi <- top/(botx %o% (botx *(1-1/nrow(x)) )) } bi } else { y = as.matrix(y); yWeight = as.matrix(yWeight); if (robustX) { #mx <- apply(x, 2, median, na.rm = na.rm) mx = medianw(x, xWeight) mxMat = matrix(mx, nrow = nrow(x), ncol = ncol(x), byrow = TRUE); madxMat = matrix(apply(x, 2, mad, na.rm = na.rm), nrow = nrow(x), ncol = ncol(x), byrow = TRUE); ux <- as.matrix(abs((x - mxMat)/(9 * qnorm(0.75) * madxMat))) } else { mx = meanw(x, xWeight) mxMat = matrix(mx, nrow = nrow(x), ncol = ncol(x), byrow = TRUE); ux = matrix(0, nrow = nrow(mxMat), ncol = ncol(mxMat)); } if (robustY) { #my <- apply(y, 2, median, na.rm = na.rm) my = medianw(y, yWeight) myMat = matrix(my, nrow = nrow(y), ncol = ncol(y), byrow = TRUE); madyMat = matrix(apply(y, 2, mad, na.rm = na.rm), nrow = nrow(y), ncol = ncol(y), byrow = TRUE); uy <- as.matrix(abs((y - myMat)/(9 * qnorm(0.75) * madyMat))) } else { my = meanw(y, yWeight); myMat = matrix(my, nrow = nrow(y), ncol = ncol(y), byrow = TRUE); uy = matrix(0, nrow = nrow(myMat), ncol = ncol(myMat)); } aval <- ifelse(ux <= 1, 1, 0) * xWeight; bval <- ifelse(uy <= 1, 1, 0) * yWeight; botx <- apply(as.matrix(aval * (1 - ux^2) * (1 - 5 * ux^2)), 2, sum, na.rm = na.rm) boty <- apply(as.matrix(bval * (1 - uy^2) * (1 - 5 * uy^2)), 2, sum, na.rm = na.rm) ux[is.na(ux)] = 1; uy[is.na(uy)] = 1; aval[is.na(aval)] = 0; bval[is.na(bval)] = 0; topFact = t(!is.na(x)) %*% !is.na(y) x[is.na(x)] = mxMat[is.na(x)]; y[is.na(y)] = myMat[is.na(y)]; top <- t(as.matrix(aval * (x - mxMat) * (1 - ux^2)^2)) %*% as.matrix((bval * (y - myMat) * (1 - uy^2)^2)) top <- top * topFact; bi <- top/(botx %o% (boty * (1-1/nrow(x)))) bi } } # weighted version of biweight mid-corellation # bicorw<-function(x, xWeight, y = NULL, yWeight = NULL, robustX = TRUE, robustY = TRUE, use = "all.obs") { na.method <- pmatch(use, c("all.obs", "pairwise.complete.obs")) if (is.na(na.method)) stop(paste("Unrecognized parameter 'use'. Recognized values are \n", "'all.obs', 'pairwise.complete.obs'")) if (!all.equal(dim(x), dim(xWeight))) stop("x and xWeight must have the same dimensions"); if (is.null(y)) { bcx = sqrt(bicov(x, robustX = robustX, use = use, diagOnly = TRUE)); bicov(x, robustX = robustX, use = use)/(bcx %o% bcx); } else { if (!all.equal(dim(y), dim(yWeight))) stop("y and yWeight must have the same dimensions"); bicov(x, y, robustX = robustX, robustY = robustY, use = use)/ (sqrt(bicov(x, robustX = robustX, use = use, diagOnly = TRUE)) %o% sqrt(bicov(y, robustX = robustY, use = use, diagOnly = TRUE))) } } meanw = function(x, w, na.rm = FALSE) { x = as.matrix(x); w = as.matrix(w); if (!all.equal(dim(x), dim(w))) stop("Dimensions of x and w must be the same."); if (na.rm) { w[is.na(w)] = 0; w[is.na(x)] = 0; } s = apply(x*w, 2, sum, na.rm = TRUE); sw = apply(w, 2, sum, na.rm = TRUE); if (sum(sw==0)>0) stop("Sum of weights is zero in some columns.") s/sw; } medianw = function(x, w, na.rm = FALSE) { x = as.matrix(x); w = as.matrix(w); if (!all.equal(dim(x), dim(w))) stop("Dimensions of x and w must be the same."); if (na.rm) { w[is.na(w)] = 0; w[is.na(x)] = 0; } if (sum(w<0)>0) stop("Weights w must be non-negative."); order = apply(x, 2, order, na.last = TRUE); for (i in 1:ncol(x)) w[, i] = w[order[, i], i]; nSamples = nrow(w); csw = apply(w, 2, cumsum) sw = apply(w, 2, sum); if (sum(sw==0)>0) stop("Some columns of w have no non-zero entries.") # csw are cumulative sums of weights csw = csw * matrix(nSamples/csw[nSamples, ], nrow = nSamples, ncol = ncol(w), byrow = TRUE); midw = nSamples/2; # midh will be the index in each column where csw rises above midw. Can be anywehere between 1 and # nSamples. midh = apply(csw-midw>0, 2, match, x = TRUE) # Couldn't figure out a way to do this in block notation. medians = rep(0, ncol(x)); for (i in 1:ncol(x)) if (midh[i]==1) { medians[i] = x[order[1, i], i]; } else { w0 = abs(csw[midh[i]-1]); x0 = x[order[midh[i]-1, i], i]; w1 = csw[midh[i]]; x1 = x[order[midh[i], i], i]; medians[i] = (w0*x1 + w1* x0)/(w1+w0); } medians; } propVarExplained = function(datExpr, colors, MEs, corFnc = "cor", corOptions = "use = 'p'") { fc = as.factor(colors); mods = levels(fc); nMods = nlevels(fc); if (nMods!=ncol(MEs)) stop(paste("Input error: number of distinct 'colors' differs from\n", "the number of module eigengenes given in ME.")); if (ncol(datExpr)!=length(colors)) stop("Input error: number of probes (columns) in 'datExpr' differs from the length of goven 'colors'."); if (nrow(datExpr)!=nrow(MEs)) stop("Input error: number of observations (rows) in 'datExpr' and 'MEs' differ."); PVE = rep(0, nMods); col2MEs = match(mods, substring(names(MEs), 3)); if (sum(is.na(col2MEs))>0) stop("Input error: not all given colors could be matched to names of module eigengenes."); for (mod in 1:nMods) { modGenes = c(1:nGenes)[as.character(colors)==mods[mod]]; corExpr = parse(text = paste(corFnc, "(datExpr[, modGenes], MEs[, col2MEs[mod]],", corOptions, ")")); PVE[mod] = mean(as.vector(eval(corExpr)^2)); } names(PVE) = paste("PVE", mods, sep = ""); PVE } #=================================================================================== # # addGrid # #=================================================================================== # This function adds a horizontal grid to a plot addGrid = function(linesPerTick = NULL, horiz = TRUE, vert = FALSE, col = "grey", lty = 3) { box = par("usr"); if (horiz) { ticks = par("yaxp"); nTicks = ticks[3]; if (is.null(linesPerTick)) { if (nTicks < 6) linesPerTick = 5 else linesPerTick = 2; } spacing = (ticks[2]-ticks[1])/(linesPerTick*nTicks); first = ceiling((box[3] - ticks[1])/spacing); last = floor((box[4] - ticks[1])/spacing); #print(paste("addGrid: first=", first, ", last =", last, "box = ", paste(signif(box,2), collapse = ", "), #"ticks = ", paste(signif(ticks, 2), collapse = ", "), "spacing =", spacing )); for (k in first:last) lines(x = box[c(1,2)], y = rep(ticks[1] + spacing * k, 2), col = col, lty = lty); } if (vert) { ticks = par("xaxp"); nTicks = ticks[3]; if (is.null(linesPerTick)) { if (nTicks < 6) linesPerTick = 5 else linesPerTick = 2; } spacing = (ticks[2]-ticks[1])/(linesPerTick*ticks[3]); first = ceiling((box[1] - ticks[1])/spacing); last = floor((box[2] - ticks[1])/spacing); #print(paste("addGrid: first=", first, ", last =", last, "box = ", paste(signif(box,2), collapse = ", "), # "ticks = ", paste(signif(ticks, 2), collapse = ", "), "spacing =", spacing )); for (l in first:last) lines(x = rep(ticks[1] + spacing * l, 2), y = box[c(3,4)], col = col, lty = lty); } } # Related: this function plots a legend but cleans up the legend rectangle # first. Do not give it a plot argument, it will cause an error. legendClean = function(...) { box = legend(..., plot = FALSE)$rect rect(box$left, box$top-box$h, box$left +box$w, box$top, col = "white", border = "white"); legend(...); } #----------------------------------------------------------------------------------------------- # # Add vertical "grid" lines to a dendrogram to facilitate identification of clusters with color bars # #----------------------------------------------------------------------------------------------- addGuideLines = function(dendro, count = 50, positions = NULL, col = "grey60", lty = 3, hang = 0) { if (is.null(positions)) { lineSpacing = (length(dendro$height)+1)/count; positions = 1:count* lineSpacing; } objHeights = rep(0, length(dendro$height+1)); objHeights[-dendro$merge[dendro$merge[,1]<0,1]] = dendro$height[dendro$merge[,1]<0]; objHeights[-dendro$merge[dendro$merge[,2]<0,2]] = dendro$height[dendro$merge[,2]<0]; box = par("usr"); ymin = box[3]; ymax = box[4]; objHeights = objHeights - hang*(ymax - ymin); objHeights[objHeights nLinks); } if (sampleLinks) nLinks = min(nLinks, nGenes) else nLinks = nGenes; #printFlush(paste("blockSize =", blockSize)); #printFlush(paste("nGenes =", nGenes)); #printFlush(paste(".largestBlockSize =", .largestBlockSize)); if (blockSize * nLinks>.largestBlockSize) blockSize = as.integer(.largestBlockSize/nLinks); if (blockSize < 1) { blockSize = 1; printFlush(paste("The data set may be too large for the connectivity calculation.\n", " ..trying anyway; keep your fingers crossed.")); } intNetworkType = charmatch(type, c("unsigned", "signed", "signed hybrid")); if (is.na(intNetworkType)) stop("Unrecognized networkType argument.", "Recognized values are (unique abbreviations of) 'unsigned', 'signed', and 'signed hybrid'."); subtract = rep(1, nGenes); if (sampleLinks) { if (verbose > 0) printFlush(paste(spaces, "nearestNeighborConnectivity: selecting sample pool of size", nLinks, "..")) sd = sd(expr, na.rm = TRUE); order = order(-sd); saved = FALSE; if (exists(".Random.seed")) { saved = TRUE; savedSeed = .Random.seed if (is.numeric(setSeed)) set.seed(setSeed); } samplePool = order[sample(x = nGenes, size = nLinks)] if (saved) .Random.seed <<- savedSeed; poolExpr = expr[, samplePool]; subtract[-samplePool] = 0; } if (verbose>0) { printFlush(paste(spaces, "restrConnectivity: received", "dataset with nGenes =", as.character(nGenes))); cat(paste(spaces, "..using nBest =", nBest, "and blockSize =", blockSize, " ")); pind = initProgInd(trailStr = " done"); } restrConn = rep(0, nGenes); if (sampleLinks) { corEval = parse(text = paste(corFnc, "(poolExpr, expr[, BatchIndex], ", corOptions, ")")) } else { corEval = parse(text = paste(corFnc, "(expr, expr[, BatchIndex], ", corOptions, ")")) } No.Batches = as.integer((nGenes-1)/blockSize); SetRestrConn = NULL; start = 1; while (start <= nGenes) { end = start + blockSize-1; if (end>nGenes) end = nGenes; BatchIndex = c(start:end); #if (verbose>1) printFlush(paste(spaces, "..working on genes", start, "through", end, "of", nGenes)) c = eval(corEval); if (intNetworkType==1) { c = abs(c); } else if (intNetworkType==2) { c = (1+c)/2; } else if (intNetworkType==3) { c[c < 0] = 0; } else stop("Internal error: intNetworkType has wrong value:", intNetworkType, ". Sorry!"); adj_mat = as.matrix(c^softPower); adj_mat[is.na(adj_mat)] = 0; sortedAdj = as.matrix(apply(adj_mat, 2, sort, decreasing = TRUE)[1:(nBest+1), ]); restrConn[BatchIndex] = apply(sortedAdj, 2, sum)-subtract[BatchIndex]; start = end+1; if (verbose>0) pind = updateProgInd(end/nGenes, pind); collectGarbage(); } if (verbose>0) printFlush(" "); restrConn; } #------------------------------------------------------------------------------------------- # # restrConnectivityMS # #------------------------------------------------------------------------------------------- # This function takes expression data (rows=samples, colummns=genes) in the multi-set format # and the softPower exponent used in weighting the # correlations to get the network adjacency matrix, and returns an array of dimensions # nGenes * nSets containing the connectivities of each gene in each subset. restrConnectivityMS = function(expr, nBest = 20, softPower=6, signed = FALSE, type = "unsigned", corFnc = "cor", corOptions = "use = 'p'", blockSize = 1000, sampleLinks = NULL, nLinks = 5000, setSeed = 36492, verbose=1, indent=0) { spaces = indentSpaces(indent); setsize = checkSets(expr); nGenes = setsize$nGenes; nSamples = setsize$nSamples; nSets = setsize$nSets; if (is.null(sampleLinks)) { sampleLinks = (nGenes > nLinks); } if (sampleLinks) nLinks = min(nLinks, nGenes) else nLinks = nGenes; #printFlush(paste("blockSize =", blockSize)); #printFlush(paste("nGenes =", nGenes)); #printFlush(paste(".largestBlockSize =", .largestBlockSize)); if (blockSize * nLinks>.largestBlockSize) blockSize = as.integer(.largestBlockSize/nLinks); if (blockSize < 1) { blockSize = 1; printFlush(paste("The data set may be too large for the connectivity calculation.\n", " ..trying anyway; keep your fingers crossed.")); } intNetworkType = charmatch(type, c("unsigned", "signed", "signed hybrid")); if (length(softPower)==1) { softPower = rep(softPower, nSets); } else if (length(softPower)!=nSets) stop("Invalid arguments: length of 'softPower' must equal number sets in 'expr'"); intNetworkType = charmatch(type, c("unsigned", "signed", "signed hybrid")); if (is.na(intNetworkType)) stop("Unrecognized networkType argument.", "Recognized values are (unique abbreviations of) 'unsigned', 'signed', and 'signed hybrid'."); subtract = rep(1, nGenes); if (sampleLinks) { if (verbose > 0) printFlush(paste(spaces, "nearestNeighborConnectivityMS: selecting sample pool of size", nLinks, "..")) sd = sd(expr[[1]]$data, na.rm = TRUE); order = order(-sd); saved = FALSE; if (exists(".Random.seed")) { saved = TRUE; savedSeed = .Random.seed if (is.numeric(setSeed)) set.seed(setSeed); } samplePool = order[sample(x = nGenes, size = nLinks)] if (saved) .Random.seed <<- savedSeed; subtract[-samplePool] = 0; } if (verbose>0) printFlush(paste(spaces, "restrConnectivityMS: received", nSets, "datasets with nGenes =", as.character(nGenes))); if (verbose>0) printFlush(paste(spaces, " Using nBest =", nBest, "and blockSize =", blockSize, " ")); restrConn = matrix(nrow = nGenes, ncol = nSets); if (sampleLinks) { corEval = parse(text = paste(corFnc, "(expr[[set]]$data[, samplePool], expr[[set]]$data[, BatchIndex], ", corOptions, ")")) } else { corEval = parse(text = paste(corFnc, "(expr[[set]]$data, expr[[set]]$data[, BatchIndex], ", corOptions, ")")) } for (set in 1:nSets) { if (verbose>0) { cat(paste(spaces, " Working on set", set)); pind = initProgInd(trailStr = " done"); } No.Batches = as.integer((nGenes-1)/blockSize); SetRestrConn = NULL; start = 1; while (start <= nGenes) { end = start + blockSize-1; if (end>nGenes) end = nGenes; BatchIndex = c(start:end); #if (verbose>1) printFlush(paste(spaces, " .. working on genes", start, "through", end, "of", nGenes)) c = eval(corEval); if (intNetworkType==1) { c = abs(c); } else if (intNetworkType==2) { c = (1+c)/2; } else if (intNetworkType==3) { c[c < 0] = 0; } else stop("Internal error: intNetworkType has wrong value:", intNetworkType, ". Sorry!"); adj_mat = as.matrix(c^softPower[set]); adj_mat[is.na(adj_mat)] = 0; sortedAdj = as.matrix(apply(adj_mat, 2, sort, decreasing = TRUE)[1:(nBest+1), ]); restrConn[BatchIndex, set] = apply(sortedAdj, 2, sum)-subtract[BatchIndex]; collectGarbage(); start = end + 1; if (verbose > 0) pind = updateProgInd(end/nGenes, pind); collectGarbage(); } if (verbose>0) printFlush(" "); } restrConn; } #====================================================================================================== # # Nifty display of progress. # # ===================================================================================================== initProgInd = function( leadStr = "..", trailStr = "", quiet = !interactive()) { oldStr = " "; cat(oldStr); progInd = list(oldStr = oldStr, leadStr = leadStr, trailStr = trailStr); class(progInd) = "progressIndicator"; updateProgInd(0, progInd, quiet); } updateProgInd = function(newFrac, progInd, quiet = !interactive()) { if (class(progInd)!="progressIndicator") stop("Parameter progInd is not of class 'progressIndicator'. Use initProgInd() to initialize", "it prior to use."); if (quiet) { progInd$oldStr = paste(progInd$leadStr, as.integer(newFrac*100), "% ", progInd$trailStr, sep = "") } else { cat(paste(rep("\b", nchar(progInd$oldStr)), collapse="")); progInd$oldStr = paste(progInd$leadStr, as.integer(newFrac*100), "% ", progInd$trailStr, sep = "") cat(progInd$oldStr); if (exists("flush.console")) flush.console(); } progInd; } #====================================================================================================== # # Plot a dendrogram and a set of labels underneath # # ===================================================================================================== # plotDendroAndGroups = function(dendro, groups, groupLabels = NULL, setLayout = TRUE, groupHeight = 0.3, autoGroupHeight = TRUE, dendroLabels = NULL, addGuide = FALSE, guideAll = FALSE, guideCount = 50, guideHang = 0.20, group.cex = 0.8, marAll = c(1,5,3,1), cex = 0.9, saveMar = TRUE, abHeight = NULL, abCol = "red", ...) { oldMar = par("mar"); if (!is.null(dim(groups))) { nRows = dim(groups)[2]; } else nRows = 1; if (autoGroupHeight) groupHeight = 0.2 + 0.4 * (1-exp(-(nRows-1)/6)) if (setLayout) layout(matrix(c(1:2), 2, 1), heights = c(1-groupHeight, groupHeight)); par(mar = c(0, marAll[2], marAll[3], marAll[1])); plot(dendro, labels = dendroLabels, cex = cex, ...); if (addGuide) addGuideLines(dendro, count = ifelse(guideAll, length(dendro$height)+1, guideCount), hang = guideHang); if (!is.null(abHeight)) abline(h=abHeight, col = abCol); par(mar = c(marAll[1], marAll[2], 0, marAll[4])); plotHclustColors(dendro, groups, groupLabels, cex.rowLabels = group.cex) if (saveMar) par(mar = oldMar); } goodmargins1= c(0, 5, 2.3, 1) #------------------------------------------------------------------------------------------------------- # # Print some welcome remarks. # Remove old cutreeDynamic, if it exists. Define a wrapper cutreeDynamic.1 # that is compatible with the old function. # #------------------------------------------------------------------------------------------------------- cat("\n"); cat("\n"); cat("*****************************************************************************\n"); cat("\n"); cat(paste(" Welcome to NetworkFunctions-V2. \n\n", "Please be aware that there are some", "changes compared to old NetworkFunctions.\n", "We did our best to avoid conflicts and issue warnings if they arise,\n", "but occasionally an error may result from combining old code with", "this new one.\n")); cat("\n"); cat("*****************************************************************************\n"); cat("\n"); cat("\n"); if (exists("cutreeDynamic", inherits = FALSE, mode = "function")) { cat("\n"); cat(paste("Warning: a function of the name 'cutreeDynamic' appears to be defined", "in the main environment.\n")); cat(paste(" It is likely that an old version of NetworkFunctions has been loaded.\n")); cat(paste(" Please do not use old NetworkFunctions, they are not compatible with this code.\n")); cat("\n"); cat(paste(" The old definition has been removed; if the old functionality of cutreeDynamic is needed, \n", " replace calls to the old cutreeDynamic(...) by cutreeDynamic.1(...).\n", " Arguments to the function can be left unchanged.\n")); cat("\n"); cat(paste(" After loading the dynamicTreeCut package, type help(cutreeDynamic) to see information\n", " about the new dynamic tree cutting function and its use.\n")); cat("\n"); rm("cutreeDynamic", inherits = FALSE); } cutreeDynamic.1 = function(hierclust, maxTreeHeight=1, deepSplit=TRUE, minModuleSize=50, minAttachModuleSize=100, nameallmodules=FALSE, useblackwhite=FALSE) { if ( (minAttachModuleSize!=100) | (nameallmodules!=FALSE) | (useblackwhite!=FALSE) ) waring("cutreeDynamic.1: Parameters minAttachModuleSize, nameallmodules, useblackwhite are ignored."); labels2colors(cutreeDynamicTree(hierclust, maxTreeHeight, deepSplit, minModuleSize)); } ##*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#* #===================================================================================== # # Included file: ../RLibs/moduleColor/R/Functions.R # #===================================================================================== #*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#* # These functions are writen in the framework where for several sets the expression data are a vector of # lists, with each list having a component "data" in which the actual expression data for the set are # stored. #----------------------------------------------------------------------------------------------- # # Overall options and settings for the package # #----------------------------------------------------------------------------------------------- .moduleColorOptions = list(MEprefix = "ME", version = "1.06-1", revisionDate = "May 16, 2008") .onLoad = function(libname, pkgname) { cat(paste("Package moduleColor, version", .moduleColorOptions$version, "revision date", .moduleColorOptions$revisionDate, "\n")); cat(paste(" Active module eigengene name prefix:", .moduleColorOptions$MEprefix, "\n")); #cat(paste(" To change it, use moduleColor.setMEprefix(newPrefix).\n \n")); } moduleColor.setMEprefix = function(prefix) { stop("This function does not currently work. Sorry!"); prefix = as.character(prefix[1]); if (nchar(prefix)!=2) stop(paste("Module eigengene prefix is currently restricted to be two characters long.", "Recommended values are PC and ME")); .moduleColorOptions$MEprefix = prefix; } moduleColor.getMEprefix = function() { .moduleColorOptions$MEprefix; } moduleColor.version = function() { .moduleColorOptions$version; } moduleColor.revisionDate = function() { .moduleColorOptions$revisionDate; } #----------------------------------------------------------------------------------------------- # # plotHclustColors # #----------------------------------------------------------------------------------------------- plotHclustColors=function(dendro, colors, rowLabels = NULL, cex.rowLabels = 0.9, ...) { colors = as.matrix(colors); dimC = dim(colors) if (is.null(rowLabels) & (length(dimnames(colors)[[2]])==dimC[2])) rowLabels = names(as.data.frame(colors)); options(stringsAsFactors=FALSE); if (length(dendro$order) != dimC[1] ) stop("ERROR: length of colors vector not compatible with number of objects in the hierarchical tree."); nSets = dimC[2]; C = colors[dendro$order, ]; step = 1/(dimC[1]-1); ystep = 1/nSets; barplot(height=1, col = "white", border=FALSE, space=0, axes=FALSE, ...) for (j in 1:nSets) { ind = (1:dimC[1]); xl = (ind-1.5) * step; xr = (ind-0.5) * step; yb = rep(ystep*(j-1), dimC[1]); yt = rep(ystep * j, dimC[1]); if (is.null(dim(C))) { rect(xl, yb, xr, yt, col = as.character(C), border = as.character(C)); } else { rect(xl, yb, xr, yt, col = as.character(C[,j]), border = as.character(C[,j])); } if (is.null(rowLabels)) { text(as.character(j), pos=2, x=0, y=ystep*(j-0.5), cex=cex.rowLabels, xpd = TRUE); } else { text(rowLabels[j], pos=2, x=0, y=ystep*(j-0.5), cex=cex.rowLabels, xpd = TRUE); } } for (j in 1:nSets) lines(x=c(0,1), y=c(ystep*j,ystep*j)); } # =================================================== #The function moduleEigengenes finds the first principal component (eigengene) in each # module defined by the colors of the input vector "colors". # The theoretical underpinnings are described in Horvath, Dong, Yip (2005) # http://www.genetics.ucla.edu/labs/horvath/ModuleConformity/ # This requires the R library impute moduleEigengenes = function(expr, colors, impute = TRUE, nPC = 1, align = "along average", excludeGrey = FALSE, grey = ifelse(is.numeric(colors), 0, "grey"), subHubs = TRUE, trapErrors = FALSE, returnValidOnly = trapErrors, softPower = 6, verbose = 0, indent = 0) { spaces = indentSpaces(indent); if (verbose==1) printFlush(paste(spaces, "moduleEigengenes: Calculating", nlevels(as.factor(colors)), "module eigengenes in given set.")); if (is.null(expr)) { stop("moduleEigengenes: Error: expr is NULL. "); } if (is.null(colors)) { print("moduleEigengenes: Error: colors is NULL. "); stop() } if (is.null(dim(expr)) || length(dim(expr))!=2) stop("moduleEigengenes: Error: expr must be two-dimensional."); if (dim(expr)[2]!=length(colors)) stop("moduleEigengenes: Error: ncol(expr) and length(colors) must be equal (one color per gene)."); if (is.factor(colors)) { nl = nlevels(colors); nlDrop = nlevels(colors[, drop = TRUE]); if (nl > nlDrop) stop(paste("Argument 'colors' contains unused levels (empty modules). ", "Use colors[, drop=TRUE] to get rid of them.")); } if (softPower < 0) stop("softPower must be non-negative"); alignRecognizedValues = c("", "along average"); if (!is.element(align, alignRecognizedValues)) { printFlush(paste("ModulePrincipalComponents: Error:", "parameter align has an unrecognised value:", align, "; Recognized values are ", alignRecognizedValues)); stop() } maxVarExplained = 10; if (nPC>maxVarExplained) warning(paste("Given nPC is too large. Will use value", maxVarExplained)); nVarExplained = min(nPC, maxVarExplained); modlevels=levels(factor(colors)) if (excludeGrey) modlevels = modlevels[as.character(modlevels)!=as.character(grey)] PrinComps = data.frame(matrix(NA,nrow=dim(expr)[[1]], ncol= length(modlevels))) averExpr = data.frame(matrix(NA,nrow=dim(expr)[[1]], ncol= length(modlevels))) varExpl= data.frame(matrix(NA, nrow= nVarExplained, ncol= length(modlevels))) validMEs = rep(TRUE, length(modlevels)); validAEs = rep(FALSE, length(modlevels)); isPC = rep(TRUE, length(modlevels)); isHub = rep(FALSE, length(modlevels)); validColors = colors; names(PrinComps)=paste(moduleColor.getMEprefix(), modlevels, sep="") names(averExpr)=paste("AE",modlevels,sep="") for(i in c(1:length(modlevels)) ) { if (verbose>1) printFlush(paste(spaces, "moduleEigengenes : Working on ME for module", modlevels[i])); modulename = modlevels[i] restrict1 = as.character(colors)== as.character(modulename) datModule = as.matrix(t(expr[, restrict1])); n = dim(datModule)[1]; p = dim(datModule)[2]; pc = try( { if (nrow(datModule)>1 && impute) { seedSaved = FALSE; if (exists(".Random.seed")) { saved.seed = .Random.seed; seedSaved = TRUE; } datModule = impute.knn(as.matrix(datModule), k = min(10, nrow(datModule)-1)) # The <<- in the next line is extremely important. Using = or <- will create a local variable of # the name .Random.seed and will leave the important global .Random.seed untouched. if (seedSaved) .Random.seed <<- saved.seed; } datModule=t(scale(t(datModule))); svd1 = svd(datModule, nu = min(n, p, nPC), nv = min(n, p, nPC)); # varExpl[,i]= (svd1$d[1:min(n,p,nVarExplained)])^2/sum(svd1$d^2) veMat = cor(svd1$v[, c(1:min(n,p,nVarExplained))], t(datModule), use = "p") varExpl[c(1:min(n,p,nVarExplained)),i]= apply(veMat^2, 1, mean, na.rm = TRUE) # this is the first principal component svd1$v[,1] }, silent = TRUE); if (class(pc)=='try-error') { if ( (!subHubs) && (!trapErrors) ) stop(pc); if (subHubs) { if (verbose>0) { printFlush(paste(spaces, " ..principal component calculation for module", modulename, "failed with the following error:")); printFlush(paste(spaces, " ", pc, spaces, " ..hub genes will be used instead of principal components.")); } isPC[i] = FALSE; pc = try( { scaledExpr = scale(t(datModule)); covEx = cov(scaledExpr, use = "p"); modAdj = abs(covEx)^softPower; kIM = (apply(modAdj, 1, sum, na.rm = TRUE) - 1)^3; kIM[is.na(kIM)] = 0; hub = which.max(kIM) alignSign = sign(covEx[, hub]); alignSign[is.na(alignSign)] = 0; isHub[i] = TRUE; pcxMat = scaledExpr * matrix(kIM * alignSign, nrow = nrow(scaledExpr), ncol = ncol(scaledExpr), byrow = TRUE) / sum(kIM); pcx = apply(pcxMat, 1, sum, na.rm = TRUE); varExpl[1, i] = mean(cor(pcx, t(datModule), use = "p")^2, na.rm = TRUE) pcx }, silent = TRUE); } } if (class(pc)=='try-error') { if (!trapErrors) stop(pc); if (verbose>0) { printFlush(paste(spaces, " ..ME calculation of module", modulename, "failed with the following error:")); printFlush(paste(spaces, " ", pc, spaces, " ..the offending module has been removed.")); } warning(paste("Eigengene calculation of module", modulename, "failed with the following error \n ", pc, "The offending module has been removed.\n")); validMEs[i] = FALSE; isPC[i] = FALSE; isHub[i] = FALSE; validColors[restrict1] = grey; } else { PrinComps[, i] = pc; ae = try( { if (isPC[i]) scaledExpr = scale(t(datModule)); averExpr[, i] = apply(scaledExpr, 1, mean, na.rm = TRUE); if (align == "along average") { if (verbose>4) printFlush(paste(spaces, " .. aligning module eigengene with average expression.")) if (cor(averExpr[,i], PrinComps[,i], use = "p")<0) PrinComps[,i] = -PrinComps[,i] } 0; }, silent = TRUE); if (class(ae)=='try-error') { if (!trapErrors) stop(ae); if (verbose>0) { printFlush(paste(spaces, " ..Average expression calculation of module", modulename, "failed with the following error:")); printFlush(paste(spaces, " ", ae, spaces, " ..the returned average expression vector will be invalid.")); } warning(paste("Average expression calculation of module", modulename, "failed with the following error \n ", ae, "The returned average expression vector will be invalid.\n")); } validAEs[i] = !(class(ae)=='try-error'); } } allOK = (sum(!validMEs)==0) if (returnValidOnly && sum(!validMEs)>0) { PrinComps = PrinComps[, validMEs] averExpr = averExpr[, validMEs]; varExpl = varExpl[, validMEs]; validMEs = rep(TRUE, times = ncol(PrinComps)); isPC = isPC[validMEs]; isHub = isHub[validMEs]; validAEs = validAEs[validMEs]; } allPC = (sum(!isPC)==0); allAEOK = (sum(!validAEs)==0) list(eigengenes = PrinComps, averageExpr = averExpr, varExplained = varExpl, nPC = nPC, validMEs = validMEs, validColors = validColors, allOK = allOK, allPC = allPC, isPC = isPC, isHub = isHub, validAEs = validAEs, allAEOK = allAEOK) } #--------------------------------------------------------------------------------------------- # # removeGrey # #--------------------------------------------------------------------------------------------- # This function removes the grey eigengene from supplied module eigengenes. removeGreyME = function(MEs, greyMEName = paste(moduleColor.getMEprefix(), "grey", sep="")) { newMEs = MEs; if (is.vector(MEs) & mode(MEs)=="list") { warned = 0; newMEs = vector(mode = "list", length = length(MEs)); for (set in 1:length(MEs)) { if (!is.data.frame(MEs[[set]]$data)) stop("MEs is a vector list but the list structure is missing the correct 'data' component."); newMEs[[set]] = MEs[[set]]; if (greyMEName %in% names(MEs[[set]]$data)) { newMEs[[set]]$data = MEs[[set]]$data[, names(MEs[[set]]$data)!=greyMEName]; } else { if (warned==0) { warning("removeGreyME: The given grey ME name was not found among the names of given MEs."); warned = 1; } } } } else { if (length(dim(MEs))!=2) stop("Argument 'MEs' has incorrect dimensions.") MEs = as.data.frame(MEs); if (greyMEName %in% names(MEs)) { newMEs = MEs[, names(MEs)!=greyMEName]; } else { warning("removeGreyME: The given grey ME name was not found among the names of given MEs."); } } newMEs; } #------------------------------------------------------------------------------------- # # ModulePrincipalComponents # #------------------------------------------------------------------------------------- # Has been superseded by moduleEigengenes above. # =================================================== # This function collects garbage collectGarbage=function(){while (gc()[2,4] != gc()[2,4] | gc()[1,4] != gc()[1,4]){}} #-------------------------------------------------------------------------------------- # # orderMEs # #-------------------------------------------------------------------------------------- # # performs hierarchical clustering on MEs and returns the order suitable for plotting. orderMEs = function(MEs, greyLast = TRUE, greyName = paste(moduleColor.getMEprefix(), "grey", sep=""), orderBy = 1, order = NULL, useSets = NULL, verbose = 0, indent = 0) { spaces = indentSpaces(indent); check = checkSets(MEs, checkStructure = TRUE, useSets = useSets); if (check$structureOK) { multiSet = TRUE; } else { multiSet = FALSE; MEs = fixDataStructure(MEs); useSets = NULL; orderBy = 1; } if (!is.null(useSets)) if (is.na(match(orderBy, useSets))) orderBy = useSets[1]; if (is.null(order)) { if (verbose>0) printFlush(paste(spaces, "orderMEs: order not given, calculating using given set", orderBy)); corPC = cor(MEs[[orderBy]]$data, use="p") disPC = as.dist(1-corPC); order = .clustOrder(disPC, greyLast = greyLast, greyName = greyName); } if (length(order)!=dim(MEs[[orderBy]]$data)[2]) stop("orderMEs: given MEs and order have incompatible dimensions."); nSets = length(MEs); orderedMEs = MEs; if (is.null(useSets)) useSets = c(1:nSets); for (set in useSets) { orderedMEs[[set]]$data = as.data.frame(MEs[[set]]$data[,order]); names(orderedMEs[[set]]$data) = names(MEs[[set]]$data)[order]; if (!is.null(MEs[[set]]$averageExpr)) { orderedMEs[[set]]$averageExpr = MEs[[set]]$averageExpr[, order] names(orderedMEs[[set]]$averageExpr) = names(MEs[[set]]$data)[order]; } if (!is.null(MEs[[set]]$varExplained)) { orderedMEs[[set]]$varExplained = MEs[[set]]$varExplained[, order] names(orderedMEs[[set]]$varExplained) = names(MEs[[set]]$data)[order]; } } if (multiSet) { return(orderedMEs); } else { return(orderedMEs[[1]]$data); } } #--------------------------------------------------------------------------------------------- # # .clustOrder # #--------------------------------------------------------------------------------------------- .clustOrder = function(distM, greyLast = TRUE, greyName = paste(moduleColor.getMEprefix(), "grey", sep="")) { greyInd = match(greyName, names(distM)); if (greyLast && !is.na(greyInd)) { clusterMEs = (greyName!=names(distM)); if (sum(clusterMEs)>1) { h = hclust(as.dist(distM[clusterMEs, clusterMEs]), method = "average"); order = h$order; if (sum(order>=greyInd)>0) order[order>=greyInd] = order[order>=greyInd]+1; order = c(order, greyInd); } else if (length(names(distM))>1) { if (greyInd==1) { order = c(2, 1) } else order = c(1, 2); } else order = 1; } else { if (length(distM)>1) { h = hclust(as.dist(distM), method = "average"); order = h$order; } else order = 1; } order; # print(paste("names:", names(distM), collapse = ", ")); # print(paste("order:", order, collapse=", ")) } #--------------------------------------------------------------------------------------------- # # consensusOrderMEs # #--------------------------------------------------------------------------------------------- # Orders MEs by the dendrogram of their consensus dissimilarity. consensusOrderMEs = function(MEs, useAbs = FALSE, useSets = NULL, greyLast = TRUE, greyName = paste(moduleColor.getMEprefix(), "grey", sep=""), method = "consensus") { Diss = consensusMEDissimilarity(MEs, useAbs = useAbs, useSets = useSets, method = method); order = .clustOrder(Diss, greyLast, greyName); orderMEs(MEs, greyLast = greyLast, greyName = greyName, order = order, useSets = useSets); } #--------------------------------------------------------------------------------------------- # # consensusMEDissimilarity # #--------------------------------------------------------------------------------------------- # This function calcualtes a consensus dissimilarity (i.e., correlation) among sets of MEs (more generally, # any sets of vectors). # CAUTION: when not using absolute value, the minimum similarity will favor the large negative values! consensusMEDissimilarity = function(MEs, useAbs = FALSE, useSets = NULL, method = "consensus") { methods = c("consensus", "majority"); m = charmatch(method, methods); if (is.na(m)) stop("Unrecognized method given. Recognized values are", paste(methods, collapse =", ")); nSets = length(MEs); MEDiss = vector(mode="list", length = nSets); if (is.null(useSets)) useSets = c(1:nSets); for (set in useSets) { if (useAbs) { diss = 1-abs(cor(MEs[[set]]$data, use="p")); } else { diss = 1-cor(MEs[[set]]$data, use="p"); } MEDiss[[set]] = list(Diss = diss); } for (set in useSets) if (set==useSets[1]) { ConsDiss = MEDiss[[set]]$Diss; } else { if (m==1) { ConsDiss = pmax(ConsDiss, MEDiss[[set]]$Diss); } else { ConsDiss = ConsDiss + MEDiss[[set]]$Diss; } } if (m==2) ConsDiss = ConsDiss/nSets; ConsDiss = as.data.frame(ConsDiss); names(ConsDiss) = names(MEs[[useSets[1]]]$data); rownames(ConsDiss) = names(MEs[[useSets[1]]]$data); ConsDiss; } #====================================================================================================== # ColorHandler.R #====================================================================================================== # A set of global variables and functions that should help handling color names for some 400+ modules. # A vector called .GlobalStandardColors is defined that holds color names with first few entries # being the well-known and -loved colors. The rest is randomly chosen from the color names of R, # excluding grey colors. #--------------------------------------------------------------------------------------------------------- # # .GlobalStandardColors # #--------------------------------------------------------------------------------------------------------- # This code forms a vector of color names in which the first entries are given by BaseColors and the rest # is "randomly" chosen from the rest of R color names that do not contain "grey" nor "gray". BaseColors = c("turquoise","blue","brown","yellow","green","red","black","pink","magenta", "purple","greenyellow","tan","salmon","cyan", "midnightblue", "lightcyan", "grey60", "lightgreen", "lightyellow", "royalblue", "darkred", "darkgreen", "darkturquoise", "darkgrey", "orange", "darkorange", "white", "skyblue", "saddlebrown", "steelblue", "paleturquoise", "violet", "darkolivegreen", "darkmagenta" ); RColors = colors()[-grep("grey", colors())]; RColors = RColors[-grep("gray", RColors)]; InBase = match(BaseColors, RColors); ExtraColors = RColors[-c(InBase[!is.na(InBase)])]; nExtras = length(ExtraColors); # Here is the vector of colors that should be used by all functions: .GlobalStandardColors = c(BaseColors, ExtraColors[rank(sin(13*c(1:nExtras) +sin(13*c(1:nExtras))) )] ); standardColors = function(n = NULL) { if (is.null(n)) return(.GlobalStandardColors); if ((n>0) && (n<=length(.GlobalStandardColors))) { return(.GlobalStandardColors[c(1:n)]); } else { stop("Invalid number of standard colors requested."); } } rm(BaseColors, RColors, ExtraColors, nExtras, InBase); #--------------------------------------------------------------------------------------------------------- # # normalizeLabels # #--------------------------------------------------------------------------------------------------------- # "Normalizes" numerical labels such that the largest group is labeled 1, the next largest 2 etc. # If KeepZero == TRUE, label zero is preserved. normalizeLabels = function(labels, keepZero = TRUE) { if (keepZero) { NonZero = (labels!=0); } else { NonZero = rep(TRUE, length(labels)); } f = as.numeric(factor(labels[NonZero])); t = table(labels[NonZero]); # print(t) r = rank(-as.vector(t), ties.method = "first"); norm_labs = rep(0, times = length(labels)); norm_labs[NonZero] = r[f]; norm_labs; } #--------------------------------------------------------------------------------------------------------- # # labels2colors # #--------------------------------------------------------------------------------------------------------- # This function converts integer numerical labels labels into color names in the order either given by # colorSeq, # or (if colorSeq==NULL) by standardColors(). If GreyIsZero == TRUE, labels 0 will be assigned # the color grey; otherwise presence of labels below 1 will trigger an error. # dimensions of labels (if present) are preserved. labels2colors = function(labels, zeroIsGrey = TRUE, colorSeq = NULL) { if (is.null(colorSeq)) colorSeq = standardColors(); if (zeroIsGrey) minLabel = 0 else minLabel = 1 if (sum(is.na(labels)) > 0) stop("'labels' must not contain NA's"); if (sum( labels>=minLabel )!= length(labels)) stop(paste("Input error: something's wrong with labels. Either they are not a numeric vector,", "or some values are below", minLabel)); if (max(labels) > length(colorSeq)) { nRepeats = as.integer((max(labels)-1)/length(colorSeq)) + 1; warning(paste("labels2colors: Number of labels exceeds number of avilable colors.", "Some colors will be repeated", nRepeats, "times.")) extColorSeq = colorSeq; for (rep in 1:nRepeats) extColorSeq = c(extColorSeq, paste(colorSeq, ".", rep, sep="")); } else { nRepeats = 1; extColorSeq = colorSeq; } colors = rep("grey", length(labels)); colors[labels!=0] = extColorSeq[labels[labels!=0]]; if (!is.null(dim(labels))) { dim(colors) = dim(labels); } colors; } #======================================================================================== # # MergeCloseModules # #======================================================================================== #--------------------------------------------------------------------------------- # # moduleNumber # #--------------------------------------------------------------------------------- # Similar to modulecolor2 above, but returns numbers instead of colors, which is oftentimes more useful. # 0 means unassigned. # Return value is a simple vector, not a factor. # Caution: the module numbers are neither sorted nor sequential, the only guarranteed fact is that grey # probes are labeled by 0 and all probes belonging to the same module have the same number. moduleNumber = function(dendro, cutHeight = 0.9, minSize = 50) { Branches = cutree(dendro, h = cutHeight); NOnBranches = table(Branches); TrueBranch = NOnBranches >= minSize; Branches[!TrueBranch[Branches]] = 0; Branches; } #-------------------------------------------------------------------------------------- # # fixDataStructure # #-------------------------------------------------------------------------------------- # Check input data: if they are not a vector of lists, put them into the form of a vector of lists. fixDataStructure = function(data, verbose = 0, indent = 0) { spaces = indentSpaces(indent); if ((class(data)!="list") || (class(data[[1]])!="list")) { if (verbose>0) printFlush(paste(spaces, "fixDataStructure: data is not a vector of lists: converting it into one.")); x = data; data = vector(mode = "list", length = 1); data[[1]] = list(data = x); rm(x); } data; } #------------------------------------------------------------------------------------------- # # checkSets # #------------------------------------------------------------------------------------------- # Checks sets for consistency and returns some diagnostics. checkSets = function(data, checkStructure = FALSE, useSets = NULL) { nSets = length(data); if (is.null(useSets)) useSets = c(1:nSets); if (nSets<=0) stop("No data given."); structureOK = TRUE; if ((class(data)!="list") || (class(data[[useSets[1]]])!="list")) { if (checkStructure) { structureOK = FALSE; nGenes = 0; nSamples = 0; } else { stop("data does not appear to have the correct format. Consider using fixDataStructure", "or setting checkStructure = TRUE when calling this function."); } } else { nSamples = vector(length = nSets); nGenes = dim(data[[useSets[1]]]$data)[2]; for (set in useSets) { if (nGenes!=dim(data[[set]]$data)[2]) { if (checkStructure) { structureOK = FALSE; } else { stop(paste("Incompatible number of genes in set 1 and", set)); } } nSamples[set] = dim(data[[set]]$data)[1]; } } list(nSets = nSets, nGenes = nGenes, nSamples = nSamples, structureOK = structureOK); } #-------------------------------------------------------------------------------------- # # multiSetMEs # #-------------------------------------------------------------------------------------- multiSetMEs = function(exprData, colors, universalColors = NULL, useSets = NULL, impute = TRUE, nPC = 1, align = "along average", excludeGrey = FALSE, grey = ifelse(is.null(universalColors), ifelse(is.numeric(colors), 0, "grey"), ifelse(is.numeric(universalColors), 0, "grey")), subHubs = TRUE, trapErrors = FALSE, returnValidOnly = trapErrors, softPower = 6, verbose = 1, indent = 0) { spaces = indentSpaces(indent); nSets = length(exprData); setsize = checkSets(exprData, useSets = useSets); nGenes = setsize$nGenes; nSamples = setsize$nSamples; if (verbose>0) printFlush(paste(spaces,"multiSetMEs: Looking for module MEs.")); MEs = vector(mode="list", length=nSets); consValidMEs = NULL; if (!is.null(universalColors)) consValidColors = universalColors; if (is.null(useSets)) useSets = c(1:nSets); for (set in useSets) { if (verbose>0) printFlush(paste(spaces," Working on set", as.character(set), "...")); if (is.null(universalColors)) { setColors = colors[,set]; } else { setColors = universalColors; } setMEs = moduleEigengenes(expr = exprData[[set]]$data, color = setColors, impute = impute, nPC = nPC, align = align, excludeGrey = excludeGrey, grey = grey, trapErrors = trapErrors, subHubs = subHubs, returnValidOnly = FALSE, softPower = softPower, verbose = verbose-1, indent = indent+1); if (!is.null(universalColors) && (!setMEs$allOK)) { if (is.null(consValidMEs)) { consValidMEs = setMEs$validMEs; } else { consValidMEs = consValidMEs * setMEs$validMEs; } consValidColors[setMEs$validColors!=universalColors] = setMEs$validColors[setMEs$validColors!=universalColors] } MEs[[set]] = setMEs; names(MEs[[set]])[names(setMEs)=='eigengenes'] = 'data'; # Here's what moduleEigengenes returns: # # list(eigengenes = PrinComps, averageExpr = averExpr, varExplained = varExpl, nPC = nPC, # validMEs = validMEs, validColors = validColors, allOK = allOK, allPC = allPC, isPC = isPC, # isHub = isHub, validAEs = validAEs, allAEOK = allAEOK) } if (!is.null(universalColors)) { for (set in 1:nSets) { if (!is.null(consValidMEs)) MEs[[set]]$validMEs = consValidMEs; MEs[[set]]$validColors = consValidColors; } } for (set in 1:nSets) { MEs[[set]]$allOK = (sum(!MEs[[set]]$validMEs)==0); if (returnValidOnly) { valid = (MEs[[set]]$validMEs > 0); MEs[[set]]$data = MEs[[set]]$data[, valid]; MEs[[set]]$averageExpr = MEs[[set]]$averageExpr[, valid]; MEs[[set]]$varExplained = MEs[[set]]$varExplained[, valid]; MEs[[set]]$isPC = MEs[[set]]$isPC[valid]; MEs[[set]]$allPC = (sum(!MEs[[set]]$isPC)==0) MEs[[set]]$isHub = MEs[[set]]$isHub[valid]; MEs[[set]]$validAEs = MEs[[set]]$validAEs[valid]; MEs[[set]]$allAEOK = (sum(!MEs[[set]]$validAEs)==0) MEs[[set]]$validMEs = rep(TRUE, times = ncol(MEs[[set]]$data)); } } MEs; } #--------------------------------------------------------------------------------------------- # # MergeCloseModules # #--------------------------------------------------------------------------------------------- # This function merges modules whose MEs fall on one branch of a hierarchical clustering tree mergeCloseModules = function(exprData, colors, cutHeight = 0.2, MEs = NULL, impute = TRUE, useAbs = FALSE, iterate = TRUE, relabel = FALSE, colorSeq = NULL, getNewMEs = TRUE, getNewUnassdME = TRUE, useSets = NULL, checkDataFormat = TRUE, unassdColor = ifelse(is.numeric(colors), 0, "grey"), trapErrors = FALSE, verbose = 1, indent = 0) { MEsInSingleFrame = FALSE; spaces = indentSpaces(indent); #numCols = is.numeric(colors); #facCols = is.factor(colors); #charCols = is.character(colors); origColors = colors; colors = colors[, drop = TRUE]; greyMEname = paste(moduleColor.getMEprefix(), unassdColor, sep=""); if (verbose>0) printFlush(paste(spaces, "mergeCloseModules: Merging modules whose distance is less than", cutHeight)); if (!checkSets(exprData, checkStructure = TRUE, useSets = useSets)$structureOK) { if (checkDataFormat) { exprData = fixDataStructure(exprData); MEsInSingleFrame = TRUE; } else { stop("Given exprData appear to be misformatted."); } } setsize = checkSets(exprData, useSets = useSets); nSets = setsize$nSets; if (!is.null(MEs)) { checkMEs = checkSets(MEs, checkStructure = TRUE, useSets = useSets); if (checkMEs$structureOK) { if (nSets!=checkMEs$nSets) stop("Input error: numbers of sets in exprData and MEs differ.") for (set in 1:nSets) { if (checkMEs$nSamples[set]!=setsize$nSamples[set]) stop(paste("Number of samples in MEs is incompatible with subset length for set", set)); } } else { if (MEsInSingleFrame) { MEs = fixDataStructure(MEs); checkMEs = checkSets(MEs); } else { stop("MEs do not have the appropriate structure (same as exprData). "); } } } if (setsize$nGenes!=length(colors)) stop("Number of genes in exprData is different from the length of original colors. They must equal."); if ((cutHeight <0) | (cutHeight>(1+as.integer(useAbs)))) stop(paste("Given cutHeight is out of sensible range between 0 and", 1+as.integer(useAbs) )); done = FALSE; iteration = 1; MergedColors = colors; ok = try( { while (!done) { if (is.null(MEs)) { MEs = multiSetMEs(exprData, colors = NULL, universalColors = colors, useSets = useSets, impute = impute, subHubs = TRUE, trapErrors = FALSE, excludeGrey = TRUE, grey = unassdColor, verbose = verbose-1, indent = indent+1); MEs = consensusOrderMEs(MEs, useAbs = useAbs, useSets = useSets, greyLast = FALSE); collectGarbage(); } else if (nlevels(as.factor(colors))!=checkMEs$nGenes) { if ((iteration==1) & (verbose>0)) printFlush(paste(spaces, " Number of given module colors", "does not match number of given MEs => recalculating the MEs.")) MEs = multiSetMEs(exprData, colors = NULL, universalColors = colors, useSets = useSets, impute = impute, subHubs = TRUE, trapErrors = FALSE, excludeGrey = TRUE, grey = unassdColor, verbose = verbose-1, indent = indent+1); MEs = consensusOrderMEs(MEs, useAbs = useAbs, useSets = useSets, greyLast = FALSE); collectGarbage(); } if (iteration==1) oldMEs = MEs; # Check colors for number of distinct colors that are not grey colLevs = as.character(levels(as.factor(colors))); if ( length(colLevs[colLevs!=as.character(unassdColor)])<2 ) { printFlush(paste(spaces, "mergeCloseModules: less than two proper modules.")); printFlush(paste(spaces, " ..color levels are", paste(colLevs, collapse = ", "))); printFlush(paste(spaces, " ..there is nothing to merge.")); MergedNewColors = colors; MergedColors = colors; nOldMods = 1; nNewMods = 1; oldTree = NULL; Tree = NULL; break; } # Cluster the found module eigengenes and merge ones that are too close _in all sets_. MEDiss = vector(mode="list", length = nSets); if (is.null(useSets)) useSets = c(1:nSets); for (set in useSets) { useMEs = c(1:dim(MEs[[set]]$data)[2])[names(MEs[[set]]$data)!=greyMEname]; if (length(useMEs)<2) stop("Something is wrong: there are two or more proper modules, but less than two proper", "eigengenes. Please check that the grey color label and module eigengene label", "are correct."); if (useAbs) { diss = 1-abs(cor(MEs[[set]]$data[, useMEs], use = "p")); } else { diss = 1-cor(MEs[[set]]$data[, useMEs], use = "p"); } MEDiss[[set]] = list(Diss = diss); } for (set in useSets) if (set==useSets[1]) { ConsDiss = MEDiss[[set]]$Diss; } else { ConsDiss = pmax(ConsDiss, MEDiss[[set]]$Diss); } nOldMods = nlevels(as.factor(colors)); Tree = hclust(as.dist(ConsDiss), method = "average"); if (iteration==1) oldTree = Tree; TreeBranches = as.factor(moduleNumber(dendro = Tree, cutHeight = cutHeight, minSize = 1)); UniqueBranches = levels(TreeBranches); nBranches = nlevels(TreeBranches) NumberOnBranch = table(TreeBranches); MergedColors = colors; # Merge modules on the same branch for (branch in 1:nBranches) if (NumberOnBranch[branch]>1) { ModulesOnThisBranch = names(TreeBranches)[TreeBranches==UniqueBranches[branch]]; ColorsOnThisBranch = substring(ModulesOnThisBranch, 3); if (is.numeric(origColors)) ColorsOnThisBranch = as.numeric(ColorsOnThisBranch); if (verbose>3) printFlush(paste(spaces, " Merging original colors", paste(ColorsOnThisBranch, collapse=", "))); for (color in 2:length(ColorsOnThisBranch)) MergedColors[MergedColors==ColorsOnThisBranch[color]] = ColorsOnThisBranch[1]; } MergedColors = MergedColors[, drop = TRUE]; nNewMods = nlevels(as.factor(MergedColors)); if (nNewMods3) printFlush(paste(spaces, " Changing original colors:")); rank = 0; for (color in 1:length(SortedRawModuleColors)) if (SortedRawModuleColors[color]!=unassdColor) { rank = rank + 1; if (verbose>3) printFlush(paste(spaces, " ", SortedRawModuleColors[color], "to ", colorSeq[rank])); MergedNewColors[MergedColors==SortedRawModuleColors[color]] = colorSeq[rank]; } if (is.factor(MergedColors)) MergedNewColors = as.factor(MergedNewColors); } else { MergedNewColors = MergedColors; } MergedNewColors = MergedNewColors[, drop = TRUE]; if (getNewMEs) { if (nNewMods0) printFlush(paste(spaces, " Calculating new MEs...")); NewMEs = multiSetMEs(exprData, colors = NULL, universalColors = MergedNewColors, useSets = useSets, impute = impute, subHubs = TRUE, trapErrors = FALSE, excludeGrey = !getNewUnassdME, grey = unassdColor, verbose = verbose-1, indent = indent+1); newMEs = consensusOrderMEs(NewMEs, useAbs = useAbs, useSets = useSets, greyLast = TRUE, greyName = greyMEname); MEDiss = vector(mode="list", length = nSets); useMEs = c(1:dim(newMEs[[1]]$data)[2])[names(newMEs[[1]]$data)!=greyMEname]; if (length(useMEs)>1) { for (set in useSets) { if (useAbs) { diss = 1-abs(cor(newMEs[[set]]$data[, useMEs], use = "p")); } else { diss = 1-cor(newMEs[[set]]$data[, useMEs], use = "p"); } MEDiss[[set]] = list(Diss = diss); } for (set in useSets) if (set==useSets[1]) { ConsDiss = MEDiss[[set]]$Diss; } else { ConsDiss = pmax(ConsDiss, MEDiss[[set]]$Diss); } Tree = hclust(as.dist(ConsDiss), method = "average"); } else Tree = NULL; } else { newMEs = MEs; } } else { newMEs = NULL; } if (MEsInSingleFrame) { newMEs = newMEs[[1]]$data; oldMEs = oldMEs[[1]]$data; } }, silent = TRUE); if (class(ok)=='try-error') { if (!trapErrors) stop(ok); if (verbose>0) { printFlush(paste(spaces, "Warning: merging of modules failed with the following error:")); printFlush(paste(' ', spaces, ok)); printFlush(paste(spaces, " --> returning unmerged modules and *no* eigengenes.")); } warning(paste("mergeCloseModules: merging of modules failed with the following error:\n", " ", ok, " --> returning unmerged modules and *no* eigengenes.\n")); list(colors = origColors, allOK = FALSE); } else { list(colors = MergedNewColors, dendro = Tree, oldDendro = oldTree, cutHeight = cutHeight, oldMEs = oldMEs, newMEs = newMEs, allOK = TRUE); } } ##*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#* #===================================================================================== # # Included file: ../RLibs/dynamicTreeCut/R/PrintFlush.R # #===================================================================================== #*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#* if (exists("printFlush")) { remove(printFlush); collect_garbage(); } printFlush = function(...) { # x = print(...) cat(...); cat("\n"); if (exists("flush.console")) flush.console(); } if (exists("indentSpaces")) { remove(indentSpaces); collect_garbage(); } indentSpaces = function(indent = 0) { if (indent>0) { spaces = paste(rep(" ", times=indent), collapse=""); } else { spaces = ""; } spaces; } ##*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#* #===================================================================================== # # Included file: ../RLibs/dynamicTreeCut/R/TreeCut.R # #===================================================================================== #*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#* # Tree cut # 1.11-2 # . Bug fix in interpretation of deepSplit fixed # 1.11-1 # . If no merge lies below the cut height, simply exit with all labels=0 instead of throwing an # error. # . If cutHeight is above the highest merge, set it to the highest merge. # 1.11 # . change the default cutHeight to 99% of dendogram height range # 1.10-02: # . if distM isn't given, method defaults to "tree". # 1.10: # . Bug fix: since distM us necessary in the first stage as well, the function now complains if # distM is not given # 1.09: # . Bug fix: number of Unassigned = 1 is now handled correctly # 1.08: # . Changing the meaning of minGap, maxCoreScatter: both now relative (i.e., fractions). minGap will # be the minimum cluster gap expressed as a fraction of the range between a certain quantile of the # merging heights and the cutHeight; maxCoreScatter will be interpreted in the same way. Adding two # more parameters, namely minAbsGap, maxAbsCoreScatter that can be used to supply hitherto used # absolute values. If they are given, they override minGap and maxCoreScatter, respectively. # . Change of names: clusterMinSize -> minClusterSize # . Default value of minClusterSize now 20 # . Fixed stage 2 labeling for non-medoids: ClusterDiam{eter} now given is the maximum of average # distances of points to the rest of the cluster. # . Default for maxDistToLabel is now cutHeight even though it may mean that some objects above the # cutHeight may be labeled. There doesn't seem to be a clean and simple way to only label objects # whose merging distance to (the continuation of) the cluster is below cutHeight # . If cutHeight not given, will be set to max(dendro$height) for the DynamicTree as well. # 1.07: Changing parameter names to be more intuitive and in accord with generally used terminology # Also changing internal function names by prepending a . to them # . All extra functions (EvaluateCLusters etc) removed. # . Rename the function from GetClusters to cutreeHybrid # 1.06: The tree cut is exactly the same as 1.05, but the color handling is relegated to NetworkFunctions. # The ColorsFromLabels in NetworkFunctions use a slightly different input format; # in particular, 0 is considered grey. This means, among other things, that # 1.05: # . The PAM stage is changed: instead of calculating distances to medoids, will calculate average # distances to clusters. This is intuitively less desirable than the medoids, but simulations # seem to indicate that large clusters are too greedy. The average linkage may help that a bit. # It is not quite clear that cluster trimming makes a lot of sense in this scenario, but I'll # keep it in (assigning elements based on average distance to a trimmed cluster is not quite the # same as an untrimmed cluster, even for the elements that were trimmed). # . Improve trimming: instead of the lowest joining heights, keep all singletons up to the first joined # object (branch or singleton) whose merging height is above the threshold; the rest (content of all # branches merged higher than threshold) is trimmed. # 1.04: # . Implement Bin's idea that small branches unassigned in Stage 1 should not be broken # up by Stage 2. Implemented as follows: Only keep those branches unbroken whose only reason for not # being a cluster is that they don't have enough objects. # While going through the merge tree: mark branches that are (1) merged into composite clusters, (2) # not clusters themselves because of failing the minimum size requirement only. # . Change boudaries of clusters. Instead of regarding everything up to the merge automatically part # the respective cluster, only elements whose joining heights are less than a cutoff given for # the cluster are considered part of the cluster automatically; the rest is assigned in pam-like # manner. # 1.03: # . Changing the definition of the core from the most connected points to the first points added # to the cluster. Note that the order depends on how branches are merged, so that better be # correct as well. This makes the core stable against adding outliers to the cluster. # 1.02.01: fixing a bug in the main function that was referencing nonexistent heights member of # dendro. # . Fixing a correctness issue: the core average distance is the average distance between points # within the core, not the average distance of points within the core to all points in the # cluster. # 1.02: # . Changing core size: instead of minClusterSize + sqrt(Size - minClusterSize) it is now # CoreSize + sqrt(Size - CoreSize), with CoreSize = as.integer(minClusterSize) + 1. # 1.01.03: Fixing memory usage and a bug in which singletons were added twice (and more times). # In this version, to simplify things, Singletons are only kept for basic clusters. For composite # clusters they are NULL; CoreScatter should never be called for composite clusters as that can get # hugely time and memory intensive. # Another bug concerning couting unassigned points fixed. # -02: adding distance matrix to the parameters of GetClusters. # -03: Merging GetClusters and AssignLabel together; changes in variable names to make code more # readable. .NewBranch = function(IsBasic = TRUE, IsClosed = FALSE, Content = NULL, MergingHeights = NULL, RootHeight = 1, LastMerge = 0, Size = 0, MergedInto = 0, Singletons = NULL, SingletonHeights = NULL, AttachHeight = NULL, FailSize = FALSE ) { list(IsBasic = IsBasic, IsClosed = IsClosed, Content = Content, MergingHeights = MergingHeights, RootHeight = RootHeight, LastMerge = LastMerge, Size = Size, MergedInto = MergedInto, Singletons = Singletons, SingletonHeights = SingletonHeights, AttachHeight = AttachHeight, FailSize = FailSize); } # The following are supporting function for GetClusters. .CoreSize = function(BranchSize, minClusterSize) { BaseCoreSize = minClusterSize/2 + 1; if (BaseCoreSize < BranchSize) { CoreSize = as.integer(BaseCoreSize + sqrt(BranchSize - BaseCoreSize)); } else CoreSize = BranchSize; CoreSize; } # This assumes the diagonal of the distance matrix # is zero, BranchDist is a square matrix whose dimension is at least 2. .CoreScatter = function(BranchDist, minClusterSize) { nPoints = dim(BranchDist)[1]; PointAverageDistances = apply(BranchDist, 2, sum) / (nPoints-1); CoreSize = minClusterSize/2 + 1; if (CoreSize < nPoints) { EffCoreSize = as.integer(CoreSize + sqrt(nPoints - CoreSize)); ord = order(PointAverageDistances); Core = ord[c(1:EffCoreSize)]; } else { Core = c(1:nPoints); EffCoreSize = nPoints; } CoreAverageDistances = apply(BranchDist[Core, Core], 2, sum) / (EffCoreSize-1); mean(CoreAverageDistances); } #------------------------------------------------------------------------------------------- # # cutreeHybrid # #------------------------------------------------------------------------------------------- # Traverses a given clustering tree and detects branches whose size is at least minClusterSize, average # singleton joining height is at most maxCoreScatter and split (attaching height minus average # height) is at least minGap. If cutHeight is set, all clusters are cut at that height. # clusterTrim is the fraction of the cluster gap that will be trimmed away. # Objects whose joining height is above that will be (re-)assigned based on distance to medoids. If # clusterTrim<=0, all assigments of stage 1 will be respected. cutreeHybrid = function(dendro, distM, cutHeight = NULL, minClusterSize = 20, deepSplit = 1, maxCoreScatter = NULL, minGap = NULL, maxAbsCoreScatter = NULL, minAbsGap = NULL, clusterTrim = 0, labelUnlabeled = TRUE, useMedoids = FALSE, maxDistToLabel = cutHeight, respectSmallClusters = TRUE, verbose = 2, indent = 0) { spaces = indentSpaces(indent); MxBranches = length(dendro$height) Branches = vector(mode="list", length = MxBranches); nBranches = 0; if (verbose>0) printFlush(paste(spaces, "Detecting clusters...")); # No. of merges in the tree nMerge = length(dendro$height); if (nMerge < 1) stop("The given dendrogram is suspicious: number of merges is zero."); if (is.null(distM)) stop("distM must be non-NULL") if ( (dim(distM)[1] != nMerge+1) | (dim(distM)[2]!=nMerge+1) ) stop("distM has incorrect dimensions."); diag(distM) = 0; refQuantile = 0.05; refMerge = as.integer(nMerge * refQuantile + 0.5); if (refMerge < 1) refMerge = 1; refHeight = dendro$height[refMerge]; if (is.null(cutHeight)) { cutHeight = 0.99 * (max(dendro$height) - refHeight) + refHeight; if (verbose>0) printFlush(paste(spaces, "..cutHeight not given, setting it to", signif(cutHeight,3), " ===> 99% of the (truncated) height range in dendro.")); } else { if (cutHeight > max(dendro$height)) cutHeight = max(dendro$height); } nMergeBelowCut = sum(dendro$height <= cutHeight); if (nMergeBelowCut < minClusterSize) { if (verbose>0) printFlush(paste(spaces, "cutHeight set too low: no merges below the cut.")); return(list(labels = rep(0, times = nMerge+1))) } # Default values for maxCoreScatter and minGap: defMCS = c(0.64, 0.73, 0.82, 0.91); defMG = (1-defMCS)*3/4; nSplitDefaults = length(defMCS); # Convert deep split to range 1..4 if (is.logical(deepSplit)) deepSplit = as.integer(deepSplit)*(nSplitDefaults - 2); deepSplit = as.integer(deepSplit + 1); if ((deepSplit<1) | (deepSplit>nSplitDefaults)) stop(paste("Parameter deepSplit (value", deepSplit, ") out of range: allowable range is 0 through", nSplitDefaults-1)); # If not set, set the cluster gap and core scatter according to deepSplit. if (is.null(maxCoreScatter)) maxCoreScatter = defMCS[deepSplit]; if (is.null(minGap)) minGap = defMG[deepSplit]; # If maxDistToLabel is not set, set it to cutHeight if (is.null(maxDistToLabel)) maxDistToLabel = cutHeight; # Convert (relative) minGap and maxCoreScatter to corresponding absolute quantities if the latter were # not given. if (is.null(maxAbsCoreScatter)) maxAbsCoreScatter = refHeight + maxCoreScatter * (cutHeight - refHeight); if (is.null(minAbsGap)) minAbsGap = minGap * (cutHeight - refHeight); # For each merge, record the cluster that it belongs to IndMergeToBranch = rep(0, times = nMerge) # The root RootBranch = 0; if (verbose>2) printFlush(paste(spaces, "..Going through the merge tree")); for (merge in 1:nMerge) if (dendro$height[merge]<=cutHeight) { # are both merged objects sigletons? if (dendro$merge[merge,1]<0 & dendro$merge[merge,2]<0) { # Yes; start a new cluster. nBranches = nBranches + 1; Branches[[nBranches]] = .NewBranch(Content = dendro$merge[merge,], MergingHeights = rep(dendro$height[merge], 2), LastMerge = merge, Size = 2, Singletons = -dendro$merge[merge,], SingletonHeights = rep(dendro$height[merge], 2)); IndMergeToBranch[merge] = nBranches; RootBranch = nBranches; } else if (dendro$merge[merge,1] * dendro$merge[merge,2] <0) { # merge the sigleton into the cluster clust = IndMergeToBranch[max(dendro$merge[merge,])]; if (clust==0) stop("Internal error: a previous merge has no associated cluster. Sorry!"); gene = min(dendro$merge[merge,]); Branches[[clust]]$Content = c(Branches[[clust]]$Content, gene); if (Branches[[clust]]$IsBasic) Branches[[clust]]$Singletons = c(Branches[[clust]]$Singletons, -gene); Branches[[clust]]$MergingHeights = c(Branches[[clust]]$MergingHeights, dendro$height[merge]); Branches[[clust]]$SingletonHeights = c(Branches[[clust]]$SingletonHeights, dendro$height[merge]); Branches[[clust]]$LastMerge = merge; Branches[[clust]]$Size = Branches[[clust]]$Size + 1; IndMergeToBranch[merge] = clust; RootBranch = clust; } else { # attempt to merge two clusters: clusts = IndMergeToBranch[dendro$merge[merge,]]; sizes = c(Branches[[clusts[1]]]$Size, Branches[[clusts[2]]]$Size); rnk = rank(sizes, ties.method = "first"); smaller = clusts[rnk[1]]; larger = clusts[rnk[2]]; if (Branches[[smaller]]$IsBasic) { coresize = .CoreSize(length(Branches[[smaller]]$Singletons), minClusterSize); Core = Branches[[smaller]]$Singletons[c(1:coresize)]; SmAveDist = mean(apply(distM[Core, Core], 2, sum)/(coresize-1)); } else { SmAveDist = 0; } if (Branches[[larger]]$IsBasic) { coresize = .CoreSize(length(Branches[[larger]]$Singletons), minClusterSize); Core = Branches[[larger]]$Singletons[c(1:coresize)]; LgAveDist = mean(apply(distM[Core, Core], 2, sum)/(coresize-1)); } else { LgAveDist = 0; } # Is the smaller cluster small or shallow enough to be merged? SmallerScores = c(Branches[[smaller]]$IsBasic, (Branches[[smaller]]$Size < minClusterSize), (SmAveDist > maxAbsCoreScatter), (dendro$height[merge] - SmAveDist < minAbsGap)); if ( SmallerScores[1] * sum(SmallerScores[c(2:4)]) > 0 ) { DoMerge = TRUE; SmallerFailSize = !(SmallerScores[3] | SmallerScores[4]); # Smaller fails only due to size } else { LargerScores = c(Branches[[larger]]$IsBasic, (Branches[[larger]]$Size < minClusterSize), (LgAveDist > maxAbsCoreScatter), (dendro$height[merge] - LgAveDist < minAbsGap)); if ( LargerScores[1] * sum(LargerScores[c(2:4)]) > 0 ) { # Actually: the larger one is the one to be merged DoMerge = TRUE; SmallerFailSize = !(LargerScores[3] | LargerScores[4]); # cluster fails only due to size x = smaller; smaller = larger; larger = x; } else { DoMerge = FALSE; # None of the two satisfies merging criteria } } if (DoMerge) { # merge the smaller into the larger cluster and close it. Branches[[smaller]]$IsClosed = TRUE; Branches[[smaller]]$FailSize = SmallerFailSize; Branches[[smaller]]$MergedInto = larger; Branches[[smaller]]$AttachHeight = dendro$height[merge]; Branches[[larger]]$Content = c(Branches[[larger]]$Content, smaller); if (Branches[[larger]]$IsBasic) { Branches[[larger]]$Singletons = c(Branches[[larger]]$Singletons, Branches[[smaller]]$Singletons); Branches[[larger]]$SingletonHeights = c(Branches[[larger]]$SingletonHeights, Branches[[smaller]]$SingletonHeights); } Branches[[larger]]$MergingHeights = c(Branches[[larger]]$MergingHeights, dendro$height[merge]); Branches[[larger]]$Size = Branches[[larger]]$Size + Branches[[smaller]]$Size; Branches[[larger]]$LastMerge = merge; IndMergeToBranch[merge] = larger; RootBranch = larger; } else { # Close both clusters and start a composite cluster. nBranches = nBranches + 1; Branches[[smaller]]$IsClosed = TRUE; Branches[[larger]]$IsClosed = TRUE; Branches[[smaller]]$AttachHeight = dendro$height[merge]; Branches[[larger]]$AttachHeight = dendro$height[merge]; # print(paste(" Starting a composite cluster with number", nBranches)); Branches[[nBranches]] = .NewBranch(IsBasic = FALSE, Content = clusts, MergingHeights = rep(dendro$height[merge], 2), Size = sum(sizes), LastMerge = merge, #Singletons = c(Branches[[larger]]$Singletons, Branches[[smaller]]$Singletons) Singletons = NULL ); IndMergeToBranch[merge] = nBranches; RootBranch = nBranches; } } } if (verbose>2) printFlush(paste(spaces, "..Going through detected branches and marking clusters..")); nPoints = nMerge+1; IsBasic = rep(TRUE, times = nBranches); IsBranch = rep(FALSE, times = nBranches); SmallLabels = rep(0, times = nPoints); Trimmed = rep(0, times = nPoints); for (clust in 1:nBranches) { if (is.null(Branches[[clust]]$AttachHeight)) Branches[[clust]]$AttachHeight = cutHeight; IsBasic[clust] = Branches[[clust]]$IsBasic; if (Branches[[clust]]$IsBasic) { coresize = .CoreSize(length(Branches[[clust]]$Singletons), minClusterSize); Core = Branches[[clust]]$Singletons[c(1:coresize)]; Branches[[clust]]$Core = Core; CoreScatter = mean(apply(distM[Core, Core], 2, sum)/(coresize-1)); } else { CoreScatter = 0; } IsBranch[clust] = Branches[[clust]]$IsBasic & (Branches[[clust]]$Size >= minClusterSize) & (CoreScatter < maxAbsCoreScatter) & (Branches[[clust]]$AttachHeight - CoreScatter > minAbsGap); if (Branches[[clust]]$FailSize) SmallLabels[Branches[[clust]]$Singletons] = clust; } if (!respectSmallClusters) SmallLabels = rep(0, times = nPoints); # Trim objects from clusters that are too close to the boundary. if (clusterTrim>0) for (clust in 1:nBranches) if (Branches[[clust]]$IsBasic & (Branches[[clust]]$Size>2)) { if (is.null(Branches[[clust]]$AttachHeight)) Branches[[clust]]$AttachHeight = cutHeight; #if (length(Branches[[clust]]$SingletonHeights)!=length(Branches[[clust]]$Singletons)) # stop("Internal error: length of SingletonHeights differs from length of Singletons. Sorry!"); bottom = min(Branches[[clust]]$MergingHeights); top = Branches[[clust]]$AttachHeight; # First object in the cluster to be trimmed: FirstTrim = match( TRUE, (Branches[[clust]]$MergingHeights - bottom)/(top-bottom) > 1-clusterTrim); if (!is.na(FirstTrim)) { FirstCont = Branches[[clust]]$Content[FirstTrim[1]]; NSingls = Branches[[clust]]$Size; if (FirstCont<0) { FirstSingl = c(1:NSingls)[Branches[[clust]]$Singletons == -FirstCont]; if (length(FirstSingl)==0) { print(paste("FirstCont:", FirstCont)) print("Content:"); print( Branches[[clust]]$Content); print(paste("FirstSingl:", FirstSingl)) print("Singletons:"); print( Branches[[clust]]$Singletons); stop(paste("Internal error: Trimming: First trimmed content", "points to an invalid singleton. Sorry!")); } } else { FirstSingl = c(1:NSingls)[Branches[[clust]]$Singletons == Branches[[FirstCont]]$Singletons[1]]; if (length(FirstSingl)==0) stop(paste("Internal error: Trimming: First trimmed content", "points to an invalid cluster singleton. Sorry!")); } if ((FirstSingl < NSingls/2) | (FirstSingl < 3)) { #printFlush(paste("Trimming cluster ", clust)); #printFlush("Merging heights:") #print(Branches[[clust]]$MergingHeights); #printFlush(paste("FirstTrim:", FirstTrim)); #printFlush("Trim Condition:"); #printFlush((Branches[[clust]]$MergingHeights - bottom)/(top-bottom) <= clusterTrim); warning(paste("GetClusters: keeping a low proportion of a cluster:", FirstSingl, "out of", Branches[[clust]]$Size, "objects.\n", "Increasing the proportion of kept objects to preserve a branch.")); if (verbose>3) { printFlush(paste("GetClusters: keeping a low proportion of a cluster:", FirstSingl, "out of", Branches[[clust]]$Size, "objects.")); printFlush(paste(spaces, " ")); printFlush(paste(spaces, " ..SingletonHeights:", paste(signif(Branches[[clust]]$SingletonHeights,2), collapse=", "))); printFlush(paste(spaces, " ..bottom =", signif(bottom,2), ", top =", signif(top,2))); } # Make sure we keep a certain minimum of singletons FirstSingl = max(FirstSingl, as.integer(NSingls/3)+1, 3); } if (FirstSingl<=NSingls) { Trimmed[Branches[[clust]]$Singletons[c(FirstSingl:NSingls)]] = clust; Branches[[clust]]$Singletons = Branches[[clust]]$Singletons[-c(FirstSingl:NSingls)]; Branches[[clust]]$SingletonHeights = Branches[[clust]]$SingletonHeights[-c(FirstSingl:NSingls)]; Branches[[clust]]$Size = length(Branches[[clust]]$Singletons); } } } # Here's where the original AssignLabel starts if (verbose>2) printFlush(paste(spaces, "..Assigning stage 1 labels..")); Colors = rep(0, times = nPoints); IsCore = rep(0, times = nPoints); BranchBranches = c(1:nBranches)[IsBranch]; color = 0; for (clust in BranchBranches) { color = color+1; Colors[Branches[[clust]]$Singletons] = color; SmallLabels[Branches[[clust]]$Singletons] = 0; coresize = .CoreSize(length(Branches[[clust]]$Singletons), minClusterSize); Core = Branches[[clust]]$Singletons[c(1:coresize)]; IsCore[Core] = color; } Labeled = c(1:nPoints)[Colors!=0]; Unlabeled = c(1:nPoints)[Colors==0]; nUnlabeled = length(Unlabeled); UnlabeledExist = (nUnlabeled>0); if (length(Labeled)>0) { LabelFac = factor(Colors[Labeled]); nProperLabels = nlevels(LabelFac); } else { nProperLabels = 0; } if (labelUnlabeled & UnlabeledExist & nProperLabels>0) { if (verbose>1) printFlush(paste(spaces, "..Assigning stage 2 labels..")); # Assign some of the grey genes to the nearest module. Define nearest as the distance to the medoid, # that is the point in the cluster that has the lowest average distance to all other points in the # cluster. First get the medoids. if (useMedoids) { Medoids = rep(0, times = nProperLabels); ClusterRadii = rep(0, times = nProperLabels); for (cluster in 1:nProperLabels) { InCluster = c(1:nPoints)[Colors==cluster]; DistInCluster = distM[InCluster, InCluster]; DistSums = apply(DistInCluster, 2, sum); Medoids[cluster] = InCluster[which.min(DistSums)]; ClusterRadii[cluster] = max(DistInCluster[, which.min(DistSums)]) } # If small clusters are to be respected, assign those first based on medoid-medoid distances. if (respectSmallClusters) { FSmallLabels = factor(SmallLabels); SmallLabLevs = as.numeric(levels(FSmallLabels)); nSmallClusters = nlevels(FSmallLabels) - (SmallLabLevs[1]==0); if (nSmallClusters>0) for (sclust in SmallLabLevs[SmallLabLevs!=0]) { InCluster = c(1:nPoints)[SmallLabels==sclust]; # printFlush(paste("SmallCluster", sclust, "has", length(InCluster), "elements.")); DistInCluster = distM[InCluster, InCluster]; if (length(InCluster)>1) { DistSums = apply(DistInCluster, 2, sum); smed = InCluster[which.min(DistSums)]; DistToMeds = distM[Medoids, smed]; Nearest = which.min(DistToMeds); DistToNearest = DistToMeds[Nearest]; if ( (DistToNearest < ClusterRadii[Nearest]) | (DistToNearest < maxDistToLabel) ) { Colors[InCluster] = Nearest; } else Colors[InCluster] = -1; # This prevents individual points from being assigned later } } } # Assign leftover unlabeled objects to clusters with nearest medoids Unlabeled = c(1:nPoints)[Colors==0]; UnassdToMedoidDist = distM[Medoids, Unlabeled]; if (nProperLabels>1) { NearestMedoids = apply(UnassdToMedoidDist, 2, which.min); NearestCenterDist = apply(UnassdToMedoidDist, 2, min); } else { NearestMedoids = rep(1, times = nUnlabeled); NearestCenterDist = UnassdToMedoidDist; } Colors[Unlabeled] = ifelse((NearestCenterDist < ClusterRadii[NearestMedoids]) | (NearestCenterDist < maxDistToLabel) , NearestMedoids, 0); UnlabeledExist = (sum(Colors==0)>0); if (verbose>1) printFlush(paste(spaces, " ...assigned", sum((NearestCenterDist < ClusterRadii[NearestMedoids]) | (NearestCenterDist < maxDistToLabel)), "of", nUnlabeled, "previously unassigned points.")); } else # Instead of medoids, use average distances { ClusterDiam = rep(0, times = nProperLabels); for (cluster in 1:nProperLabels) { InCluster = c(1:nPoints)[Colors==cluster]; nInCluster = length(InCluster) DistInCluster = distM[InCluster, InCluster]; if (nInCluster>1) { AveDistInClust = apply(DistInCluster, 2, sum)/(nInCluster-1); ClusterDiam[cluster] = max(AveDistInClust); } else { ClusterDiam[cluster] = 0; } } # If small clusters are respected, assign them first based on average cluster-cluster distances. if (respectSmallClusters) { FSmallLabels = factor(SmallLabels); SmallLabLevs = as.numeric(levels(FSmallLabels)); nSmallClusters = nlevels(FSmallLabels) - (SmallLabLevs[1]==0); if (nSmallClusters>0) for (sclust in SmallLabLevs[SmallLabLevs!=0]) { InCluster = c(1:nPoints)[SmallLabels==sclust]; # printFlush(paste("SmallCluster", sclust, "has", length(InCluster), "elements.")); if (length(InCluster)>1) { DistSClustClust = distM[InCluster, Labeled]; MeanDist = apply(DistSClustClust, 2, mean); MeanMeanDist = tapply(MeanDist, LabelFac, mean); Nearest = which.min(MeanMeanDist); NearestDist = MeanMeanDist[Nearest]; if ( ((NearestDist < ClusterDiam[Nearest]) | (NearestDist < maxDistToLabel)) ) { Colors[InCluster] = Nearest; } else Colors[InCluster] = -1; # This prevents individual points from being assigned later } } } # Assign leftover unlabeled objects to clusters with nearest medoids Unlabeled = c(1:nPoints)[Colors==0]; if (length(Unlabeled)>0) { if (length(Unlabeled)>1) { UnassdToClustDist = apply(distM[Labeled, Unlabeled], 2, tapply, LabelFac, mean); if (nProperLabels>1) { NearestClusters = apply(UnassdToClustDist, 2, which.min); NearestClusterDist = apply(UnassdToClustDist, 2, min); } else { NearestClusters = rep(1, length(Unlabeled)); NearestClusterDist = UnassdToClustDist; } } else { UnassdToClustDist = tapply(distM[Labeled, Unlabeled], LabelFac, mean); NearestClusters = which.min(UnassdToClustDist); NearestClusterDist = min(UnassdToClustDist); } Colors[Unlabeled] = ifelse((NearestClusterDist < ClusterDiam[NearestClusters]) | (NearestClusterDist < maxDistToLabel), NearestClusters, 0); if (verbose>1) printFlush(paste(spaces, " ...assigned", sum((NearestClusterDist < ClusterDiam[NearestClusters]) | (NearestClusterDist < maxDistToLabel)), "of", nUnlabeled, "previously unassigned points.")); } } } # Relabel labels such that 1 corresponds to the largest cluster etc. Colors[Colors<0] = 0; UnlabeledExist = (sum(Colors==0)>0); NumLabs = as.numeric(as.factor(Colors)); Sizes = table(NumLabs); if (UnlabeledExist) { if (length(Sizes)>1) { SizeRank = c(1, rank(-Sizes[2:length(Sizes)], ties.method="first")+1); } else { SizeRank = 1; } OrdNumLabs = SizeRank[NumLabs]; } else { SizeRank = rank(-Sizes[1:length(Sizes)], ties.method="first"); OrdNumLabs = SizeRank[NumLabs]; } OrdIsCore = OrdNumLabs-UnlabeledExist; OrdIsCore[IsCore==0] = 0; list(labels = OrdNumLabs-UnlabeledExist, cores = OrdIsCore, smallLabels = SmallLabels, trimmed = as.numeric(factor(Trimmed))-1, branches = list(nBranches = nBranches, Branches = Branches, IndMergeToBranch = IndMergeToBranch, RootBranch = RootBranch, IsBasic = IsBasic, IsBranch = IsBranch, nPoints = nMerge+1)); } #---------------------------------------------------------------------------------------------- # # cutreeDynamic # #---------------------------------------------------------------------------------------------- # A wrapper function for cutreeHybrid and cutreeDynamicTree. cutreeDynamic = function(dendro, cutHeight = NULL, minClusterSize = 20, method = "hybrid", distM = NULL, deepSplit = (ifelse(method=="hybrid", 1, FALSE)), maxCoreScatter = NULL, minGap = NULL, maxAbsCoreScatter = NULL, minAbsGap = NULL, clusterTrim = 0, labelUnlabeled = TRUE, useMedoids = FALSE, maxDistToLabel = cutHeight, respectSmallClusters = TRUE, verbose = 2, indent = 0) { if (class(dendro)!="hclust") stop("Argument dendro must have class hclust."); methods = c("hybrid", "tree"); met = charmatch(method, methods); if ( (met==1) && (is.null(distM)) ) { warning('cutreeDynamic: method "hybrid" requires a valid dissimilarity matrix "distM". Defaulting to method "tree".'); met = 2; } if (is.na(met)) { stop(paste("Invalid method argument. Accepted values are (unique abbreviations of)", paste(methods, collapse = ", "))); } else if (met==1) { # if (is.null(distM)) stop('distM must be given when using method "hybrid"'); return(cutreeHybrid(dendro = dendro, distM = as.matrix(distM), minClusterSize = minClusterSize, cutHeight = cutHeight, deepSplit = deepSplit, maxCoreScatter = maxCoreScatter, minGap = minGap, maxAbsCoreScatter = maxAbsCoreScatter, minAbsGap = minAbsGap, labelUnlabeled = labelUnlabeled, useMedoids = useMedoids, maxDistToLabel = maxDistToLabel, clusterTrim = clusterTrim, respectSmallClusters = respectSmallClusters, verbose = verbose, indent = indent)$labels); } else { return(cutreeDynamicTree(dendro = dendro, maxTreeHeight = cutHeight, deepSplit = deepSplit, minModuleSize = minClusterSize)); } } ##*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#* #===================================================================================== # # Included file: ../RLibs/dynamicTreeCut/R/cutreeDynamic.R # #===================================================================================== #*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#* #------------------------------------------------------------------------------------------ # # cutreeDynamic # #------------------------------------------------------------------------------------------ # Modification(s) by Peter Langfelder: returns numerical labels (not colors) in a vector (not a factor). # Several rendundant blocks of code removed; duplicate function definitions removed; unused # functions removed. # maxTreeHeight now checked for being too large. #.minAttachModuleSize = 100; if( exists("cutreeDynamicTree") ) rm(cutreeDynamicTree) cutreeDynamicTree = function(dendro, maxTreeHeight=1, deepSplit=TRUE, minModuleSize=50) { if (is.null(maxTreeHeight)) maxTreeHeight = 0.99 * max(dendro$height); if (maxTreeHeight > max(dendro$height)) maxTreeHeight = 0.99 * max(dendro$height); staticCutCluster = .cutTreeStatic(dendro=dendro, heightcutoff=maxTreeHeight, minsize1=minModuleSize) #get tree height for every singleton #node_index tree_height demdroHeiAll= rbind( cbind(dendro$merge[,1], dendro$height), cbind(dendro$merge[,2], dendro$height) ) #singletons will stand at the front of the list myorder = order(demdroHeiAll[,1]) #get # of singletons no.singletons = length(dendro$order) demdroHeiAll.sort = demdroHeiAll[myorder, ] demdroHei.sort = demdroHeiAll.sort[c(1:no.singletons), ] demdroHei = demdroHei.sort[seq(no.singletons, 1, by=-1), ] demdroHei[,1] = -demdroHei[,1] # combine with prelimilary cluster-cutoff results demdroHei = cbind(demdroHei, as.integer(staticCutCluster)) # re-order the order based on the dendrogram order dendro$order demdroHei.order = demdroHei[dendro$order, ] static.clupos = .locateCluster(demdroHei.order[, 3]) if (is.null(static.clupos) ){ module.assign = rep(0, no.singletons) return ( module.assign ) } static.no = dim(static.clupos)[1] static.clupos2 = static.clupos static.no2 = static.no #split individual cluster if there are sub clusters embedded mcycle=1 while(1==1){ clupos = NULL for (i in c(1:static.no)){ mydemdroHei.order = demdroHei.order[ c(static.clupos[i,1]:static.clupos[i,2]), ] #index to [1, clusterSize] mydemdroHei.order[, 1] = mydemdroHei.order[, 1] - static.clupos[i, 1] + 1 #cat("Cycle ", as.character(mcycle), "cluster (", static.clupos[i,1], static.clupos[i,2], ")\n") #cat("i=", as.character(i), "\n") iclupos = .processIndividualCluster(mydemdroHei.order, cminModuleSize = minModuleSize, cminAttachModuleSize = 2*minModuleSize) iclupos[,1] = iclupos[,1] + static.clupos[i, 1] -1 #recover the original index iclupos[,2] = iclupos[,2] + static.clupos[i, 1] -1 clupos = rbind(clupos, iclupos) #put in the final output buffer } if(deepSplit==FALSE){ break } if(dim(clupos)[1] != static.no) { static.clupos = clupos static.no = dim(static.clupos)[1] }else{ break } mcycle = mcycle + 1 #static.clupos } final.cnt = dim(clupos)[1] #assign colors for modules module.assign = rep(0, no.singletons) module.cnt=1 for (i in c(1:final.cnt )) { sdx = clupos[i, 1] #module start point edx = clupos[i, 2] #module end point module.size = edx - sdx +1 if(module.size 0 : running length (with same sign) to right, otherwise to the left #mysign = -1: negative value, mysign = -1: positive value .runlengthSign = function(mysequence, leftOrright=-1, mysign=-1){ seqlen = length(mysequence) if(leftOrright<0){ pseq = rev(mysequence) }else{ pseq = mysequence } if(mysign<0){ #see where the first POSITIVE number occurs nonezero.bool = (pseq > 0) }else{ #see where the first NEGATIVE number occur nonezero.bool = (pseq < 0) } if( sum(nonezero.bool) > 0){ runlength = min( c(1:seqlen)[nonezero.bool] ) - 1 }else{ runlength = 0 } } #"0" is for grey module #.assignModuleColor = function(labelpred, minsize1=50, anameallmodules=FALSE, auseblackwhite=FALSE) { # here we define modules by using a height cut-off for the branches #labelpred= cutree(dendro,h=heightcutoff) #cat(labelpred) #"0", grey module doesn't participate color assignment, directly assigned as "grey" #labelpredNoZero = labelpred[ labelpred >0 ] #sort1=-sort(-table(labelpredNoZero)) ## sort1 #modulename= as.numeric(names(sort1)) #modulebranch= sort1 >= minsize1 #no.modules = sum(modulebranch) # #colorcode=GlobalStandardColors; # ##"grey" means not in any module; #colorhelp=rep("grey",length(labelpred)) #if ( no.modules==0){ #print("No module detected.") #} #else{ #if ( no.modules > length(colorcode) ){ #print( paste("Too many modules:", as.character(no.modules)) ) #} # #if ( (anameallmodules==FALSE) || (no.modules <=length(colorcode)) ){ #labeledModules = min(no.modules, length(colorcode) ) #for (i in c(1:labeledModules)) { #colorhelp=ifelse(labelpred==modulename[i],colorcode[i],colorhelp) #} #colorhelp=factor(colorhelp,levels=c(colorcode[1:labeledModules],"grey")) #}else{#nameallmodules==TRUE and no.modules >length(colorcode) #maxcolors=length(colorcode) #labeledModules = no.modules #extracolors=NULL #blackwhite=c("red", "black") #for(i in c((maxcolors+1):no.modules)){ #if(auseblackwhite==FALSE){ #icolor=paste("module", as.character(i), sep="") #}else{#use balck white alternatively represent extra colors, for display only ##here we use the ordered label to avoid put the same color for two neighboring clusters #icolor=blackwhite[1+(as.integer(modulename[i])%%2) ] #} #extracolors=c(extracolors, icolor) #} # ##combine the true-color code and the extra colorcode into a uniform colorcode for ##color assignment #allcolorcode=c(colorcode, extracolors) # #for (i in c(1:labeledModules)) { #colorhelp=ifelse(labelpred==modulename[i],allcolorcode[i],colorhelp) #} #colorhelp=factor(colorhelp,levels=c(allcolorcode[1:labeledModules],"grey")) #} #} # #colorhelp #} # This function written by Peter Langfelder, based on .assignModuleColor above but simplified. # Assigns module numbers, not colors. All modules are labeled. #"0" is for grey module .assignModuleNumber = function(labelpred, minsize1=50) { #"0", grey module doesn't participate color assignment, directly assigned as "grey" labelpredNoZero = labelpred[ labelpred >0 ] sort1=-sort(-table(labelpredNoZero)) # sort1 modulename= as.numeric(names(sort1)) modulebranch= sort1 >= minsize1 no.modules = sum(modulebranch) #"grey" means not in any module; colorhelp=rep(0,length(labelpred)) for (i in c(1:no.modules)) { colorhelp=ifelse(labelpred==modulename[i],i,colorhelp) } colorhelp } #locate the start/end positions of each cluster in the ordered cluster label sequence #where "-1" indicating no cluster #3-1 -1 1 1 1 1 2 2 2 #3 3 -1-1 1 1 1 1 2 2 2 (shift) #--------------------------------- #0-4 0 2 0 0 0 1 0 0 0 (difference) # * * @ .locateCluster = function(clusterlabels) { no.nodes = length(clusterlabels) clusterlabels.shift = c(clusterlabels[1], c(clusterlabels[1:(no.nodes-1)]) ) #a non-zero point is the start point of a cluster and it previous point is the end point of the previous #cluster label.diff = abs(clusterlabels - clusterlabels.shift) #process the first and last positions as start/end points if they belong to a cluster instead of no # cluster "-1" if(clusterlabels[1] >0) {label.diff[1]=1} if(clusterlabels[no.nodes]>0) {label.diff[no.nodes]=1} flagpoints.bool = label.diff > 0 if( sum(flagpoints.bool) ==0){ return(NULL) } flagpoints = c(1:no.nodes)[flagpoints.bool] no.points = length(flagpoints) myclupos=NULL for(i in c(1:(no.points-1)) ){ idx = flagpoints[i] if(clusterlabels[idx]>0){ if(flagpoints[i+1]==no.nodes) {#boundary effect myclupos = rbind(myclupos, c(idx, flagpoints[i+1]) ) break }else{ myclupos = rbind(myclupos, c(idx, flagpoints[i+1]-1) ) } } } myclupos } #input is the cluster demdrogram of an individual cluster, we want to find its embbeded subclusters #execution order: mean-height ==> (mean+max)/2 ==> (mean+min)/2 #useMean: =0 ~ use mean-height as calibation line # =1 ~ use (mean+max)/2 as calibation line to detect relatively a small cluster sitting on the head of a bigger one, # so mean-height is too low to detect the two modules. # =-1~ use (mean+min)/2 as calibation line to detect relatively a small cluster sitting on the tail of a bigger one, # so mean-height & (mean+max)/2 are too high to detect the two modules .processIndividualCluster = function(clusterDemdroHei, cminModuleSize=50, cminAttachModuleSize = 2* cminModuleSize, minTailRunlength= as.integer(cminModuleSize/3)+1, useMean=0) { #for debug: use all genes #clusterDemdroHei =demdroHei.order no.cnodes = dim(clusterDemdroHei)[1] cmaxhei = max(clusterDemdroHei[, 2]) cminhei = min(clusterDemdroHei[, 2]) cmeanhei = mean(clusterDemdroHei[, 2]) cmidhei = (cmeanhei + cmaxhei)/2.0 cdwnhei = (cmeanhei + cminhei)/2.0 if (useMean==1){ comphei = cmidhei }else if (useMean==-1){ comphei = cdwnhei }else{ #normal case comphei = cmeanhei } # compute height diffrence with mean height heidiff = clusterDemdroHei[,2] - comphei heidiff.shift = .shiftSequence(heidiff, -1) # get cut positions # detect the end point of a cluster, whose height should be less than meanhei # and the node behind it is the start point of the next cluster which has a height above meanhei cuts.bool = (heidiff<0) & (heidiff.shift > 0) cuts.bool[1] = TRUE cuts.bool[no.cnodes] = TRUE if(sum(cuts.bool)==2){ if (useMean==0){ new.clupos=.processIndividualCluster(clusterDemdroHei=clusterDemdroHei, cminModuleSize=cminModuleSize, cminAttachModuleSize=cminAttachModuleSize, useMean=1) }else if(useMean==1){ new.clupos=.processIndividualCluster(clusterDemdroHei=clusterDemdroHei, cminModuleSize=cminModuleSize, cminAttachModuleSize=cminAttachModuleSize, useMean=-1) }else{ new.clupos = rbind(c(1, no.cnodes)) } return (new.clupos) } #a good candidate cluster-end point should have significant # of ahead nodes with head < meanHei cutindex =c(1:no.cnodes)[cuts.bool] no.cutps = length(cutindex) runlens = rep(999, no.cutps) cuts.bool2 = cuts.bool for(i in c(2:(no.cutps-1)) ){ seq = c( (cutindex[i-1]+1):cutindex[i] ) runlens[i] = .runlengthSign(heidiff[seq], leftOrright=-1, mysign=-1) if(runlens[i] < minTailRunlength){ #cat("run length=", runlens[i], "\n") cuts.bool2[ cutindex[i] ] = FALSE } } #attach SMALL cluster to the left-side BIG cluster if the small one has smaller mean height cuts.bool3=cuts.bool2 if(sum(cuts.bool2) > 3) { curj = 2 while (1==1){ cutindex2 =c(1:no.cnodes)[cuts.bool2] no.clus = length(cutindex2) -1 if (curj>no.clus){ break } pre.sdx = cutindex2[ curj-1 ]+1 #previous module start point pre.edx = cutindex2[ curj ] #previous module end point pre.module.size = pre.edx - pre.sdx +1 pre.module.hei = mean(clusterDemdroHei[c(pre.sdx:pre.edx) , 2]) cur.sdx = cutindex2[ curj ]+1 #previous module start point cur.edx = cutindex2[ curj+1 ] #previous module end point cur.module.size = cur.edx - cur.sdx +1 cur.module.hei = mean(clusterDemdroHei[c(cur.sdx:cur.edx) , 2]) #merge to the leftside major module, don't change the current index "curj" #if( (pre.module.size >minAttachModuleSize)&(cur.module.hei 2){ if( (cutindex2[no.cutps] - cutindex2[no.cutps-1]+1) < cminModuleSize ){ cuts.bool2[ cutindex2[no.cutps-1] ] =FALSE } } cutindex2 = c(1:no.cnodes)[cuts.bool2] cutindex2[1]=cutindex2[1]-1 #the first no.cutps2 = length(cutindex2) if(no.cutps2 > 2){ new.clupos = cbind( cutindex2[c(1:(no.cutps2-1))]+1, cutindex2[c(2:no.cutps2)] ) }else{ new.clupos = cbind( 1, no.cnodes) } if ( dim(new.clupos)[1] == 1 ){ if (useMean==0){ new.clupos=.processIndividualCluster(clusterDemdroHei=clusterDemdroHei, cminModuleSize=cminModuleSize, cminAttachModuleSize=cminAttachModuleSize, useMean=1) }else if(useMean==1){ new.clupos=.processIndividualCluster(clusterDemdroHei=clusterDemdroHei, cminModuleSize=cminModuleSize, cminAttachModuleSize=cminAttachModuleSize, useMean=-1) } } new.clupos } #delta >0 : shift to right, otherwise to the left .shiftSequence = function(mysequence, delta){ seqlen = length(mysequence) if(delta>0){ finalseq=c(mysequence[1:delta], mysequence[1:(seqlen-delta)]) }else{ posdelta = -delta finalseq=c(mysequence[(posdelta+1):seqlen], mysequence[(seqlen-posdelta+1):seqlen]) } finalseq } #use height cutoff to remove .cutTreeStatic = function(dendro,heightcutoff=0.99, minsize1=50) { # here we define modules by using a height cut-off for the branches labelpred= cutree(dendro,h=heightcutoff) sort1=-sort(-table(labelpred)) sort1 modulename= as.numeric(names(sort1)) modulebranch= sort1 >= minsize1 no.modules=sum(modulebranch) colorhelp = rep(-1, length(labelpred) ) if ( no.modules==0){ print("No module detected") } else{ for (i in c(1:no.modules)) { colorhelp=ifelse(labelpred==modulename[i],i ,colorhelp) } } colorhelp } ##*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#* #===================================================================================== # # Included file: ../CommonFunctions/NetworkScreeningFunctions.txt # #===================================================================================== #*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#* # This document contains functions that we find useful for network screening. # We recommend to read the tutorial first. # The functions were written by Steve Horvath, Jun Dong, Peter Langfelder, Andy Yip, Bin Zhang # To cite this code or the statistical methods please use the following 2 references: #1) Horvath S, Zhang B, Carlson M, Lu KV, Zhu S, Felciano RM, Laurance MF, Zhao W, Shu, Q, Lee Y, # Scheck AC, Liau LM, Wu H, Geschwind DH, Febbo PG, Kornblum HI, Cloughesy TF, Nelson SF, Mischel PS (2006) #"Analysis of Oncogenic Signaling Networks in Glioblastoma Identifies ASPM as a Novel Molecular Target", # PNAS | November 14, 2006 | vol. 103 | no. 46 | 17402-17407 #2) Zhang B, Horvath S (2005) "A General Framework for Weighted Gene Co-Expression Network Analysis", #Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 17 # this function creates pairwise scatter plots between module eigengenes (above the diagonal) # Below the diagonal are the absolute values of the Pearson correlation coefficients. # The diagonal contains histograms of the module eigengene expressions. if (exists("plotMEpairs")) rm(plotMEpairs); plotMEpairs=function(datME, y=NULL, title="Relationship between module eigengenes",clusterMEs=T ){ if ( dim(as.matrix(datME))[[2]]==1 & is.null(y) ) hist( datME) if ( !( dim(as.matrix(datME))[[2]]==1 & is.null(y)) ){ datMEordered=datME if (clusterMEs & dim(as.matrix(datME))[[1]] >1 ) { dissimME=(1-t(cor(datME, method="p", use="p")))/2 hclustdatME=hclust(as.dist(dissimME), method="average" ) datMEordered=datME[,hclustdatME$order] } # end of if if ( !is.null(y) ) { if ( length(y) != dim(as.matrix(datMEordered))[[1]] ) stop("The outcome vector y does not match the number of rows of datME. Hint: consider transposing datME. The columns of datME should correspond to the module eigengenes. The rows correspond to the array samples.") datMEordered=data.frame(y, datMEordered) } # end of if pairs( datMEordered, upper.panel = panel.smooth, lower.panel = panel.cor , diag.panel=panel.hist ,main=title) } # end if } # end of function # This function can be used to create an average linkage hierarchical clustering tree # or the microarray samples. The rows of datExpr correspond to the samples and the columns to the genes # You can optionally input a quantitative microarray sample trait. if(exists("PlotClusterTreeSamples")) rm(PlotClusterTreeSamples); PlotClusterTreeSamples=function(datExpr, y=NULL) { hierSamples=hclust( dist( datExpr ), method="average" ) if (is.null(y) ) {plot(hierSamples, main="Cluster Tree of the Samples (average linkage, Euclidean dist.)", sub="")} if (!is.null(y) ) { if (!is.numeric(y) ) {warning("The microarray sample trait y is not numeric. Therefore, the function PlotClusterTreeSamples will transform it to a numeric variable"); y=as.numeric(as.character(y)) } # end of if (!is.numeric(y) ) if ( dim(as.matrix(datExpr))[[1]] != length(y) ) stop("Input Error: The number of microarray sample arrays does not match the length of the sample trait. Hint: consider transposing datExpr. Also make sure that y is a vector. dim(as.matrix(datExpr))[[1]] != length(y)") par(mar=c(2,2,2,2), mfrow=c(2,1)) plot(hierSamples, main="Cluster Tree of the Samples (Euclidean dist.)", sub="") if ( is.integer(y) ) { if (min(y, na.rm=T)<0 ) y=y- min(y, na.rm=T)+1 hclustplotn(hierSamples, data.frame(y),main="Array Samples Colored by the Sample Trait") } # end of if ( is.integer(y) ) if ( !is.integer(y) ) { yBinary =as.integer(ifelse( y>=mean(y, na.rm=T), max(y,na.rm=T) , min(y,na.rm=T))) if (min(yBinary, na.rm=T)<0 ) yBinary=yBinary- min(yBinary, na.rm=T)+1 hclustplotn(hierSamples, data.frame(yBinary),main="Samples Colored by the Dichotomized (Binary) Trait") } # end of if ( !is.integer(y) ) } # end of if }# end of function PlotClusterTreeSamples if (exists("WeightedGenePredictor") ) rm(WeightedGenePredictor); WeightedGenePredictor=function( datExprTraining, yTraining, datExprTest=NULL, yTest=NULL, MinimumNoArraysObserved=7,power=6){ UserSuppliedTestSetData=F if ( !is.null(datExprTest) ) UserSuppliedTestSetData=T UserSuppliedTestSetOutcome=T if ( is.null(yTest) ) UserSuppliedTestSetOutcome=F if (!is.numeric(yTraining) ) stop("Error: the training outcome yTraining is not numeric. Please turn it into a numeric variable, e.g. code cases as 1 and controls as 0.") if ( is.null(datExprTest) ) {warning("Since no test data were provided, the function GeneNetworkPredictor will be evaluated on the training data"); datExprTest=datExprTraining; yTest=yTraining} if (!is.numeric(yTraining) ) stop("Error: the training outcome yTraining is not numeric. Please turn it into a numeric variable, e.g. code cases as 1 and controls as 0.") if ( sum( !is.na(yTraining) ) < MinimumNoArraysObserved ) stop("The outcome variable yTraining has too many missing values or there are not enough samples. Hint: Consider decreasing MinimumNoArraysObserved or collect more microarray samples.") if ( dim(as.matrix(datExprTraining))[[2]] != dim(as.matrix(datExprTest))[[2]] ) stop("Input error in function GeneNetworkPredictor. The number of genes in the training data does not match the number of genes in the test data, i.e. dim(as.matrix(datExprTraining))[[2]] != dim(as.matrix(datExprTest))[[2]]. Hint: Maybe you need to transpose one or both of the data sets?") if ( dim(as.matrix(datExprTraining))[[1]] !=length(yTraining) ) stop("Input error in function GeneNetworkPredictor. The number of arrays does not match the length of the outcome in the training data. ( dim(as.matrix(datExprTraining))[[1]] !=length(yTraining). Maybe you need to transpose the gene expression data.") if (! is.null(yTest) & dim(as.matrix(datExprTest))[[1]] !=length(yTest) ) stop("Input error in function GeneNetworkPredictor. dim(as.matrix(datExprTest))[[1]] !=length(yTest) ") if ( length(levels(factor(yTraining)) ) ==1) {warning(paste("all entries of the outcome variable yTraining are equal. We wil use the trivial prediction: yPredicted=", levels(factor(yTraining) ) ) ) ; yPredictedAllGenes= rep(yTraining[1], dim(as.matrix(datExprTest))[[1]]) yPredictedtopHubs= rep(yTraining[1], dim(as.matrix(datExprTest))[[1]]) } # end of if if ( length(levels(factor(yTraining)) ) > 1) { varTraining=as.numeric(apply(datExprTraining,2,var,na.rm=T)) NoArraysObserved= as.numeric(apply(as.matrix(!is.na(datExprTraining)),1,sum,na.rm=T)) usefulgene= !is.na(varTraining) & varTraining>0 & NoArraysObserved>=MinimumNoArraysObserved if (sum(!usefulgene, na.rm=T)>0 ) print(paste("Warning:" ,sum(!usefulgene, na.rm=T) , "gene(s) have been removed since they were either constant or they had too many missing values. Consider decreasing the parameter MinimumNoArraysObserved")) if ( sum(usefulgene, na.rm=T)==0 ) stop("There are 0 useful genes in the training data. The genes are either constant or they have too many missing values or not enough microarrays. Consider decreasing the parameter MinimumNoArraysObserved") datExprTraining=as.matrix(datExprTraining)[, usefulgene] datExprTest=as.matrix(datExprTest)[, usefulgene] Cor.Weighted=as.numeric(cor(yTraining, datExprTraining, use="p")) weight=abs(Cor.Weighted)^power*sign(Cor.Weighted) varTraining=as.numeric(apply(as.matrix(datExprTraining),2,var,na.rm=T)) meanTraining=as.numeric(apply(as.matrix(datExprTraining),2,mean,na.rm=T)) datExprTestScaled=scale(datExprTest, center=meanTraining,scale=sqrt(varTraining) ) datExprTestScaled[is.na(datExprTestScaled)]=0 yPredictedAllGenes=as.numeric(as.matrix(datExprTestScaled)%*% as.matrix(weight)) rm(datExprTestScaled); gc(); # now we scale the predictor so that it has the same mean and variance as yTraining yPredictedAllGenes=c(as.numeric(scale(yPredictedAllGenes))) varyTraining=as.numeric(var(yTraining, na.rm=T)) if (varyTraining>0 & !is.na(varyTraining) ) { yPredictedAllGenes= sqrt(varyTraining)*yPredictedAllGenes } meanyTraining=as.numeric(mean(yTraining, na.rm=T)) yPredictedAllGenes=yPredictedAllGenes+ meanyTraining } # end of if ( length(levels(factor(yTraining)) ) > 1) if ( !UserSuppliedTestSetData ) { print(paste("GeneNetworkPredictor: The correlation between yPredictedAllGenes and yTraining on the training data equals ", signif(cor(yPredictedAllGenes, yTraining,use="p"),2))) } if (UserSuppliedTestSetData & UserSuppliedTestSetOutcome ) { print(paste("GeneNetworkPredictor: cor(yPredictedAllGenes, yTest) = ", signif(cor(yPredictedAllGenes, yTest,use="p"),2))); } yPredictedAllGenesBinary =ifelse( yPredictedAllGenes>=mean(yPredictedAllGenes, na.rm=T), max(yTraining,na.rm=T) , min(yTraining,na.rm=T)) data.frame(yPredictedAllGenes=yPredictedAllGenes, yPredictedAllGenesBinary) } # end of function WeightedGenePredictor if (exists("GeneNetworkPredictor") ) rm(GeneNetworkPredictor); GeneNetworkPredictor=function( datExprTraining, yTraining, datExprTest=NULL, yTest=NULL, MinimumNoArraysObserved=7,topNumberHubs=10){ UserSuppliedTestSetData=F if ( !is.null(datExprTest) ) UserSuppliedTestSetData=T UserSuppliedTestSetOutcome=T if ( is.null(yTest) ) UserSuppliedTestSetOutcome=F if (!is.numeric(yTraining) ) stop("Error: the training outcome yTraining is not numeric. Please turn it into a numeric variable, e.g. code cases as 1 and controls as 0.") if ( is.null(datExprTest) ) {warning("Since no test data were provided, the function GeneNetworkPredictor will be evaluated on the training data"); datExprTest=datExprTraining; yTest=yTraining} if (!is.numeric(yTraining) ) stop("Error: the training outcome yTraining is not numeric. Please turn it into a numeric variable, e.g. code cases as 1 and controls as 0.") if ( sum( !is.na(yTraining) ) < MinimumNoArraysObserved ) stop("The outcome variable yTraining has too many missing values or there are not enough samples. Hint: Consider decreasing MinimumNoArraysObserved or collect more microarray samples.") if ( dim(as.matrix(datExprTraining))[[2]] != dim(as.matrix(datExprTest))[[2]] ) stop("Input error in function GeneNetworkPredictor. The number of genes in the training data does not match the number of genes in the test data, i.e. dim(as.matrix(datExprTraining))[[2]] != dim(as.matrix(datExprTest))[[2]]. Hint: Maybe you need to transpose one or both of the data sets?") if ( dim(as.matrix(datExprTraining))[[1]] !=length(yTraining) ) stop("Input error in function GeneNetworkPredictor. The number of arrays does not match the length of the outcome in the training data. ( dim(as.matrix(datExprTraining))[[1]] !=length(yTraining). Maybe you need to transpose the gene expression data.") if (! is.null(yTest) & dim(as.matrix(datExprTest))[[1]] !=length(yTest) ) stop("Input error in function GeneNetworkPredictor. dim(as.matrix(datExprTest))[[1]] !=length(yTest) ") if ( length(levels(factor(yTraining)) ) ==1) {warning(paste("all entries of the outcome variable yTraining are equal. We wil use the trivial prediction: yPredicted=", levels(factor(yTraining) ) ) ) ; yPredictedAllGenes= rep(yTraining[1], dim(as.matrix(datExprTest))[[1]]) yPredictedtopHubs= rep(yTraining[1], dim(as.matrix(datExprTest))[[1]]) } # end of if if ( length(levels(factor(yTraining)) ) > 1) { varTraining=as.numeric(apply(datExprTraining,2,var,na.rm=T)) NoArraysObserved= as.numeric(apply(as.matrix(!is.na(datExprTraining)),1,sum,na.rm=T)) usefulgene= !is.na(varTraining) & varTraining>0 & NoArraysObserved>=MinimumNoArraysObserved if (sum(!usefulgene, na.rm=T)>0 ) print(paste("Warning:" ,sum(!usefulgene, na.rm=T) , "gene(s) have been removed since they were either constant or they had too many missing values. Consider decreasing the parameter MinimumNoArraysObserved")) if ( sum(usefulgene, na.rm=T)==0 ) stop("There are 0 useful genes in the training data. The genes are either constant or they have too many missing values or not enough microarrays. Consider decreasing the parameter MinimumNoArraysObserved") datExprTraining=as.matrix(datExprTraining)[, usefulgene] datExprTest=as.matrix(datExprTest)[, usefulgene] lazy1= ImFeelingLazy(datE=datExprTraining, y= yTraining ) Cor.Weighted=as.numeric(lazy1$NetworkScreening$Cor.Weighted); topHubs=rep(F, dim(as.matrix(datExprTraining))[[2]] ) for (i in c(1:dim(as.matrix(lazy1$datME))[[2]]) ) { kME=as.numeric(cor(lazy1$datME[,i], datExprTraining, use="p" )) topHubs[ rank(-kME,ties.method="first")<=topNumberHubs]=T } varTraining=as.numeric(apply(as.matrix(datExprTraining),2,var,na.rm=T)) meanTraining=as.numeric(apply(as.matrix(datExprTraining),2,mean,na.rm=T)) datExprTestScaled=scale(datExprTest, center=meanTraining,scale=sqrt(varTraining) ) datExprTestScaled[is.na(datExprTestScaled)]=0 yPredictedAllGenes=as.numeric(as.matrix(datExprTestScaled)%*% as.matrix(Cor.Weighted)^1) yPredictedtopHubs= as.numeric(as.matrix(as.matrix(datExprTestScaled)[ ,topHubs] ) %*% as.matrix( Cor.Weighted[topHubs] )) rm(datExprTestScaled); gc(); # now we scale the predictor so that it has the same mean and variance as yTraining yPredictedAllGenes=c(as.numeric(scale(yPredictedAllGenes))) yPredictedtopHubs=c(as.numeric(scale(yPredictedtopHubs))) varyTraining=as.numeric(var(yTraining, na.rm=T)) if (varyTraining>0 & !is.na(varyTraining) ) { yPredictedAllGenes= sqrt(varyTraining)*yPredictedAllGenes yPredictedtopHubs= sqrt(varyTraining)*yPredictedtopHubs } meanyTraining=as.numeric(mean(yTraining, na.rm=T)) yPredictedAllGenes=yPredictedAllGenes+ meanyTraining yPredictedtopHubs= yPredictedtopHubs +meanyTraining } # end of if ( length(levels(factor(yTraining)) ) > 1) if ( !UserSuppliedTestSetData ) { print(paste("GeneNetworkPredictor: The correlation between yPredictedAllGenes and yTraining on the training data equals ", signif(cor(yPredictedAllGenes, yTraining,use="p"),2))) print(paste("GeneNetworkPredictor: The correlation between yPredictedtopHubs and yTraining on the training data equals ", signif(cor(yPredictedAlltopHubs, yTraining,use="p"),2))) } if (UserSuppliedTestSetData & UserSuppliedTestSetOutcome ) { print(paste("GeneNetworkPredictor: cor(yPredictedAllGenes, yTest) = ", signif(cor(yPredictedAllGenes, yTest,use="p"),2))); print(paste("GeneNetworkPredictor: cor(yPredictedtopHubs, yTest) = ", signif(cor(yPredictedtopHubs, yTest,use="p"),2))) } yPredictedAllGenesBinary =ifelse( yPredictedAllGenes>=mean(yPredictedAllGenes, na.rm=T), max(yTraining,na.rm=T) , min(yTraining,na.rm=T)) yPredictedtopHubsBinary =ifelse( yPredictedtopHubs>=mean(yPredictedtopHubs, na.rm=T), max(yTraining,na.rm=T) , min(yTraining,na.rm=T)) data.frame(yPredictedAllGenes=yPredictedAllGenes, yPredictedtopHubs=yPredictedtopHubs, yPredictedAllGenesBinary, yPredictedtopHubsBinary) } # end of function GeneNetworkPredictor # This is the factor analysis predictor # input a data frame datE where the rows correspond to samples (with outcome y) and # the columns correspond to the number of genes. # if no outcome is provided, the factor analysis predictor uses the factor scores on # the test set data (newdataE) as outcome predictor. # The parameter MinimumGeneSignificanceCor can be used to remove genes # whose absolute correlation with the outcome y is below the threshold. if (exists("FAgeneselection") ) rm(FAgeneselection); FAgeneselection =function(datE, datPC, y=NULL, MinimumNumberObserved=5 ,MinimumGeneSignificanceCor=0.2) { datPC=data.frame(datPC) no.genes=dim(data.frame(datPC))[[2]] no.obs=dim(data.frame(datPC))[[1]] if ( dim(data.frame(datPC))[[1]] < MinimumNumberObserved ) {stop("ERROR: Not enough training data, i.e. datPC has too few row. Maybe you need to transpose datPC? The columns of datPC should correspond to the number of genes. The rows correspond to the number of observations. Alternatively, consider changing the parameter MinimumNumberObserved.")} if ( dim(data.frame(datPC))[[1]] < MinimumNumberObserved ) {stop("ERROR: Not enough training data, i.e. datPC has too few row. Maybe you need to transpose datPC? The columns of datPC should correspond to the number of genes. The rows correspond to the number of observations. Alternatively, consider changing the parameter MinimumNumberObserved.")} if ( !is.null(y) ) { if ( !is.numeric(y) ) {stop("ERROR: The outcome y is not numeric. Please input a numeric vector.")}} if ( !is.null(y) ) { if( length(y) != dim(data.frame(datPC) )[[1]] ) {stop("The length of the outcome y does not match the number of rows in the data (datPC). Maybe you need to transpose datPC? The columns of datPC should correspond to the number of module eigengenes. The rows correspond to the number of observations.")}} # Note that we consider the outcome as one of the PCs if ( !is.null(y) ) { datPC=data.frame(y,datPC) } meandatPC=as.numeric(apply(data.frame(datPC),2,mean,na.rm=T)) vardatPC=as.numeric(apply(data.frame(datPC),2,var,na.rm=T)) selectgenes= !is.na(vardatPC) & vardatPC>0 & !is.na(meandatPC) if ( sum( !selectgenes) >0) {print("Warning: some genes with missing variance, zero variance, or missing mean have been removed from the data.") datPC=datPC[,selectgenes] no.genes=dim(data.frame(datPC))[[2]] if (no.genes<1) stop("Error: All genes have missing variance, zero variance, or missing mean. A prediction cannot be computed since the predictor is based on correlations between the gene. Maybe you need to transpose datPC? The columns of datPC should correspond to the number of genes. The rows correspond to the number of observations.") } if (no.genes>1) { no.missingbygene=apply(is.na(datPC),2,sum) selectgenes=no.obs- no.missingbygene>= MinimumNumberObserved if (sum(!selectgenes)>0 ) print("Warning: Genes with too many missing data (or equivalently too few observed values have been removed. Consider changing the parameter MinimumNumberObserved. ") datPC=datPC[,selectgenes] no.genes=dim(data.frame(datPC))[[2]] if (no.genes<1) stop("Error: Not a single gene has enough observations. Too many missing observations. Consider changing the parameter MinimumNumberObserved. ") } if (no.genes==1) {no.missingbygene=sum(is.na(datPC)) if (no.obs- no.missingbygene < MinimumNumberObserved) {stop("Error: Not a single gene has enough observations. Too many missing observations. Consider changing the parameter MinimumNumberObserved.");}} # now we impute missing values no.missing=sum(c(is.na(datPC)) ) if (no.missing>0) print("Warning: missing values in the training data (datPC) have been imputed.") if (no.missing>0 && no.genes>1) {datPC=t(impute.knn(t(datPC), k=min(10,no.genes-1)))} if (no.missing>0 && no.genes==1) {datPC[is.na(datPC)]= median(datPC,na.rm=T)} # now we ensure that all the columns (genes) in datPC have the same sign if (is.null(y) ) y=as.numeric(datPC[,1]) sign1=sign(as.numeric(cor(y,datPC, use="p" ))) sign1[is.na(sign1)| sign1==0]=1 datPC=t(t(datPC)*sign1) # Now we scale the genes to mean 0 and variance 1 meandatPC=as.numeric(apply(data.frame(datPC),2,mean,na.rm=T)) vardatPC=as.numeric(apply(data.frame(datPC),2,var,na.rm=T)) datPC= data.frame(scale(datPC, center=meandatPC, scale=sqrt(vardatPC) )) if ( !is.null(y) ) { GeneSignificance=abs(as.numeric(cor(y, datPC, use="p"))) selectgenes= GeneSignificance>=MinimumGeneSignificanceCor if ( sum( !selectgenes) >0) {print("Warning: some genes have been removed from the data since their correlation with the outcome fell below the threshold. Consider lowering MinimumGeneSignificanceCor") if ( sum( !selectgenes) ==no.genes) {print("IMPORTANT Warning: None of the PCs has a correlation with the outcome y, which passes the threshold MinimumGeneSignificanceCor. I will with the gene that has the strongest correlation with y."); selectgenes= GeneSignificance==max(GeneSignificance,na.rm=T)} datPC=datPC[,selectgenes] no.genes=dim(data.frame(datPC))[[2]] if (no.genes<1) stop("Error: All genes have missing variance, zero variance, or missing mean. A prediction cannot be computed since the predictor is based on correlations between the gene. Maybe you need to transpose datPC? The columns of datPC should correspond to the number of genes. The rows correspond to the number of observations.") } } if (dim(data.frame(datPC))[[2]]<3 ) {print("Warning: since there are fewer than 3 genes, the predictor reduces to the average of the scaled (standardized) gene expression profiles")} if (dim(data.frame(datPC))[[2]]==2 ) {Prediction=as.numeric( apply(datPC, 1, mean, na.rm=T) ); PredictionPower1= Prediction} if (dim(data.frame(datPC))[[2]]==1 ) {Prediction=as.numeric(datPC); PredictionPower1= Prediction } # the factor analysis predictor requires more than 2 genes if (dim(data.frame(datPC))[[2]]>2 ) { F1=factanal(x=datPC, factors=1) Lambda= as.numeric(F1$loadings) # this is the factor analysis predictor Prediction=.5*as.numeric(as.matrix(datPC)%*%solve(cov(datPC, use="p")) %*% Lambda) # this is a connectivity based approximation ConnectivityPower1= apply(abs( cor(datPC, use="p"))^1,2,sum) ConnectivityPower1=ConnectivityPower1/sum(ConnectivityPower1^2)*sqrt(sum(ConnectivityPower1)) PredictionPower1=.5* as.numeric(as.matrix(datPC)%*%ConnectivityPower1) } data.frame(Prediction, PredictionPower1) } # end of function FApredictor # The function CorPredictionSuccess can be used to determine which method is best for predicting correlations # in a new test set. CorTestSet should be a vector of correlations in the test set. # The parameter topNumber specifies that the top number most positive and the top most negative predicted correlations # TopNumber is a vector of integers. # CorPrediction should be a data frame of predictions for the correlations. # Output a list with the following components: # meancorTestSetPositive= mean test set correlation among the topNumber of genes which are predicted to have positive correlations. # meancorTestSetNegative= mean test set correlation among the topNumber of genes which are predicted to have negative correlations. # meancorTestSetOverall=(meancorTestSetPositive-meancorTestSetNegative)/2 if(exists("CorPredictionSuccess")) rm(CorPredictionSuccess); CorPredictionSuccess=function( CorPrediction, CorTestSet, topNumber=100 ){ no.predictors=dim(as.matrix(CorPrediction))[[2]] no.genes=dim(as.matrix(CorPrediction))[[1]] if (length(as.numeric(CorTestSet))!=no.genes ) stop("Error: input error, non-compatible dimensions in the function CorPredictionSuccess") out1=rep(NA, no.predictors) meancorTestSetPositive=matrix(NA, ncol=no.predictors, nrow=length(topNumber) ) meancorTestSetNegative=matrix(NA, ncol=no.predictors, nrow=length(topNumber) ) for (i in c(1:no.predictors) ){ rankpositive=rank(-as.matrix(CorPrediction)[,i], ties.method="first") ranknegative=rank(as.matrix(CorPrediction)[,i], ties.method="first") for (j in c(1:length(topNumber) ) ) { meancorTestSetPositive[j,i]=mean(CorTestSet[rankpositive<= topNumber[j]],na.rm=T) meancorTestSetNegative[j,i]= mean(CorTestSet[ranknegative<=topNumber[j]],na.rm=T) } # end of j loop over topNumber } # end of i loop over predictors meancorTestSetOverall=data.frame((meancorTestSetPositive-meancorTestSetNegative)/2) dimnames(meancorTestSetOverall)[[2]]=names(data.frame(CorPrediction)) meancorTestSetOverall=data.frame(topNumber=topNumber, meancorTestSetOverall) meancorTestSetPositive=data.frame(meancorTestSetPositive) dimnames(meancorTestSetPositive)[[2]]=names(data.frame(CorPrediction)) meancorTestSetPositive=data.frame(topNumber=topNumber, meancorTestSetPositive) meancorTestSetNegative=data.frame(meancorTestSetNegative) dimnames(meancorTestSetNegative)[[2]]=names(data.frame(CorPrediction)) meancorTestSetNegative=data.frame(topNumber=topNumber, meancorTestSetNegative) datout=list(meancorTestSetOverall=meancorTestSetOverall, meancorTestSetPositive=meancorTestSetPositive, meancorTestSetNegative =meancorTestSetNegative) datout } # end of function CorPredictionSuccess # The function RelativeCorPredictionSuccess can be used to test whether a gene screening method is significantly better than a standard method. # For each gene screening method (column of CorPredictionNew) it provides a Kruskal Wallis test p-value for comparison # with the vector CorPredictionStandard, # TopNumber is a vector of integers. # CorTestSet should be a vector of correlations in the test set. CorPredictionNew should be a data frame of predictions for the # correlations. CorPredictionStandard should be the standard prediction (correlation in the training data). # The function outputs a p-value for the Kruskal test that # the new correlation prediction methods outperform the standard correlation prediction method. if(exists("RelativeCorPredictionSuccess")) rm(RelativeCorPredictionSuccess); RelativeCorPredictionSuccess=function( CorPredictionNew, CorPredictionStandard, CorTestSet, topNumber=100 ){ no.predictors=dim(as.matrix(CorPredictionNew))[[2]] no.genes=dim(as.matrix(CorPredictionNew))[[1]] if (length(as.numeric(CorTestSet))!=no.genes ) stop("Error:input error, non-compatible dimensions in the function CorPredictionSuccess") if (length(as.numeric(CorTestSet))!=length(CorPredictionStandard) ) stop("Error 2:input error, non-compatible dimensions in the function CorPredictionSuccess") kruskalp=matrix(NA,nrow=length(topNumber), ncol=no.predictors) for (i in c(1:no.predictors) ){ rankhighNew=rank(-as.matrix(CorPredictionNew)[,i], ties.method="first") ranklowNew=rank(as.matrix(CorPredictionNew)[,i],ties.method="first") for (j in c(1:length(topNumber)) ){ highCorNew=as.numeric(CorTestSet[rankhighNew <= topNumber[j] ]) lowCorNew=as.numeric(CorTestSet[ranklowNew <= topNumber[j] ]) highCorStandard=as.numeric(CorTestSet[rank(-as.numeric(CorPredictionStandard), ties.method="first")<= topNumber[j]]) lowCorStandard=as.numeric(CorTestSet[rank(as.numeric(CorPredictionStandard), ties.method="first")<= topNumber[j]]) signedCorNew=c(highCorNew,-lowCorNew) signedCorStandard=c(highCorStandard,-lowCorStandard) x1=c(signedCorNew,signedCorStandard) Grouping=rep(c(2,1), c(length(signedCorNew), length(signedCorStandard))) sign1=sign(cor(Grouping,x1, use="p")) if (sign1==0) sign1=1 kruskalp[j,i]=kruskal.test(x=x1, g=Grouping)$p.value*sign1 #print(names(data.frame(CorPredictionNew))[[i]]) #print(paste("This correlation is positive if the new method is better than the old method" , signif(cor(Grouping,x1, use="p"),3))) } # end of j loop } # end of i loop kruskalp[kruskalp<0]=1 kruskalp=data.frame(kruskalp) dimnames(kruskalp)[[2]]= paste(names(data.frame(CorPredictionNew)),".kruskalP", sep="") kruskalp=data.frame(topNumber=topNumber, kruskalp) kruskalp } # end of function RelativeCorPredictionSuccess # this function compute an asymptotic p-value for a given correlation (r) and sample size (n) if (exists("FisherTransformP") ) rm(FisherTransformP); FisherTransformP=function(r, n) { #Z=sqrt(n-3) * 0.5*log((1+r)/(1-r)) #2*(1-pnorm(abs(Z) )) # the following is implemented in the cor.test function T=sqrt(n-2) * r/sqrt(1-r^2) 2*(1-pt(abs(T),n-2)) } # This function can be used to ensure that all correlations between the columns (genes) in datE are positive. # If y is supplied, it ensures that all correlations between y and the columns of datE are positive. if(exists("SameSign")) rm(SameSign); SameSign=function(datE, y=NULL) { if ( !is.null(y) & dim(as.matrix(datE))[[1]] != length(y) ) stop("Input error in function SameSign") if (is.null(y) ) y=as.numeric(datE[,1]) sign1=sign(as.numeric(cor(y,datE, use="p" ))) scale(t(t(datE)*sign1)) } # end of function SameSign # this function can be used to rank the values in x. Ties are broken by the method first. rank1=function(x){ rank(x, ties.method="first") } # Copy and paste the following functions into R # This function takes a vector as input a vector (ME) and generates a module of a #specified size around it. #The vector ME will be the most highly connected vector in the resulting module. # It outputs a warning which should be ignored. if(exists("SimulateModule") ) rm(SimulateModule); SimulateModule=function(ME, size,minimumCor=.3) { if (size<3) print("WARNING: module size smaller than 3") if(minimumCor==0) minimumCor=0.0001; maxnoisevariance=var(ME,na.rm=T)*(1/minimumCor^2-1) SDvector=sqrt(c(1:size)/size*maxnoisevariance) datSignal=suppressWarnings(matrix(c(ME, ME ,-ME),nrow=size ,ncol=length(ME) ,byrow=T)) datNoise=SDvector* matrix(rnorm(size*length(ME)),nrow=size ,ncol=length(ME)) datModule=datSignal+datNoise t(datModule) } # end of function # The following function simulates 5 proper module and 1 improper module comprised # of grey genes. # The parameter truecolors gives the module colors and trueproportions gives the proportion of genes in each module. if(exists("SimulateExprData") ) rm(SimulateExprData); SimulateExprData=function(no.genes=2000, truecolors=c("turquoise","blue", "brown", "yellow", "green"), trueproportions=c(0.10,0.08, 0.06, 0.04, 0.02), MEturquoise, MEblue, MEbrown, MEyellow, MEgreen, SDnoise=1, backgroundcorrelation=T){ no.samples=length(MEturquoise) if( length(MEturquoise) != length(MEblue) | length(MEturquoise) != length(MEbrown) | length(MEturquoise) != length(MEyellow) | length(MEturquoise) != length(MEgreen) ) {print("Error: length of module eigengenes (MEs) is not consistent" ) ; no.samples=0} if ( sum(trueproportions)>1 ) { print("ERROR: the proportion of genes in each module is wrong, make sure that sum(trueproportions)<=1. " ); trueproportions=rep(1/10,5)} modulesizes=round(no.genes*c(trueproportions, 1-sum(trueproportions))) truemodule=rep(c( as.character(truecolors),"grey") , modulesizes ) ModuleEigengenes =data.frame(MEturquoise,MEblue,MEbrown,MEyellow,MEgreen) no.MEs=dim(ModuleEigengenes)[[2]] # This matrix contains the simulated expression values #(rows are samples, columns genes) # it contains some background noise datExpr=matrix(rnorm(no.samples*no.genes,mean=0,sd=SDnoise),nrow=no.samples,ncol=no.genes) if (backgroundcorrelation) { MEbackground=MEturquoise datSignal= (matrix(MEbackground,nrow=length(MEturquoise) ,ncol=no.genes,byrow=F)) datExpr= datExpr+ .3*datSignal }# end of if backgroundcorrelation for (i in c(1:no.MEs) ) { restrict1= truemodule== truecolors[i] datModule=SimulateModule(ModuleEigengenes[,i] , size=modulesizes[i]) datExpr[,restrict1]= datModule } # end of for loop # this is the output of the function list(datExpr =datExpr, truemodule =truemodule, datModuleEigengenes= ModuleEigengenes ) } # end of simulation function if(exists("ImFeelingLazy")) rm(ImFeelingLazy); ImFeelingLazy=function(datE, y, power=6, networkType="unsigned",detectCutHeight = 0.995, minModuleSize = min(20, ncol(as.matrix(datE))/2 ), datME=NULL, ...) { y=as.numeric(as.character(y)) if (length(y) != dim(as.matrix(datE))[[1]] ) stop("Input error in function ImFeelingLazy. The following should hold: length(y) = dim(as.matrix(datE))[[1]] ") No.Available=apply(as.matrix(!is.na(datE)), 2,sum) ExprVariance=apply(as.matrix(datE),2,var, na.rm=T ) restrictGenes=No.Available>=4 & ExprVariance>0 numberUsefulGenes=sum(restrictGenes,na.rm=T) if ( numberUsefulGenes<3 ) { warning("IMPORTANT warning in function ImFeelingLazy: there are not enough useful genes. Maybe your input genes have fewer than 4 observations or they are constant. WGCNA cannot be used for these data. Hint: collect more arrays or input genes that vary."); output=list(NetworkScreening=data.frame(NS1=rep(NA, dim(as.matrix(datE))[[2]] )), datME=rep(NA, dim(as.matrix(datE))[[1]] ), EigengeneSignificance=NA , AAcriterion=NA) } # end of if if ( numberUsefulGenes>= 3 ) { datEUsefulGenes=as.matrix(datE)[,restrictGenes & !is.na(restrictGenes)] if (is.null(datME) ){ mergeCutHeight1 = DynamicMergeCut(n= dim(as.matrix(datEUsefulGenes))[[1]]) B=blockwiseModules(datEUsefulGenes, mergeCutHeight = mergeCutHeight1, TOMLevel=0,power=power, networkType=networkType, detectCutHeight = detectCutHeight, minModuleSize= minModuleSize ); datME=data.frame(B$MEs) } if (dim(as.matrix(datME))[[1]] != dim(as.matrix(datE))[[1]] ) stop("Error: dim(as.matrix(datME))[[1]] != dim(as.matrix(datE))[[1]]") MMdata=signedKME(datExpr=datE, datPC=datME, outputColumnName="MM.") MMdataPvalue=as.matrix(FisherTransformP(as.matrix(MMdata), n= dim(as.matrix(datE))[[1]])) dimnames( MMdataPvalue)[[2]]=paste("Pvalue",names(MMdata), sep=".") NS1=NetworkScreening(y= y,datPC=datME, datExpr=datE) # here we compute the eigengene significance measures ES=data.frame(cor(y, datME, use="p")) rr=max(abs(ES),na.rm=T) AAcriterion=sqrt(length(y)-2) * rr/sqrt(1-rr^2) ESy=(1+max(abs(ES), na.rm=T))/2 ES=data.frame(ES, ESy=ESy) # to avoid dividing by zero, we set correlation that are 1 equal to .9999 ES.999=as.numeric(as.vector(ES)) ES.999[ES==1 & !is.na(ES) ]=.9999 ES.pvalue=FisherTransformP(r=abs(ES.999), n=sum(!is.na(y) )) ES.pvalue[length(ES.999)]=0 EigengeneSignificance.pvalue=data.frame(matrix(ES.pvalue, nrow=1) ) names(EigengeneSignificance.pvalue)=names(ES) datME=data.frame(datME,y=y) names(ES)=paste("ES", substr(names(ES),3,100), sep="") # now we compute the AA criterion #datMEALL= ModulePrincipalComponents(Data=datEUsefulGenes,ModuleColors= rep("turquoise", dim(datEUsefulGenes)[[2]]) )[[1]] #ES.overall=as.numeric(cor(y,datMEALL, use="p")) print(signif(ES,2)) #print(paste("ES.overall= ",signif(ES.overall,2))) #print(paste("AAcriterion=", signif(AAcriterion,2))) output=list(NetworkScreening=data.frame(NS1, MMdata, MMdataPvalue), datME=data.frame(datME), EigengeneSignificance=data.frame(ES) , EigengeneSignificance.pvalue=EigengeneSignificance.pvalue, AAcriterion=AAcriterion) } # end of if ( numberUsefulGenes>= 1 ) output } # end of function ImFeelingLazy if(exists("ImFeelingLazyGS")) rm(ImFeelingLazyGS); ImFeelingLazyGS=function(datE, GS, power=6, networkType="unsigned", detectCutHeight = 0.995, minModuleSize = min(20, ncol(as.matrix(datE))/2 ), datME=NULL, ...) { if (!is.numeric(GS) ) stop("Input error in function ImFeelingLazyGS: gene significance variable GS is not numeric.") if ( dim(as.matrix(datE))[[2]] != length(GS) ) stop("Input error in function ImFeelingLazyGS: length of gene significance variable GS does not equal the number of columns of datE."); mergeCutHeight1 = DynamicMergeCut(n= dim(as.matrix(datE))[[1]]) No.Available=apply(as.matrix(!is.na(datE)), 2,sum) ExprVariance=apply(as.matrix(datE),2,var, na.rm=T ) restrictGenes=No.Available>=4 & ExprVariance>0 numberUsefulGenes=sum(restrictGenes,na.rm=T) if ( numberUsefulGenes<3 ) { warning("IMPORTANT warning in function ImFeelingLazy: you input fewer than 3 useful genes. Violations: either fewer than 4 observations or the genes are constant. WGCNA cannot be used for these data. Hint: collect more arrays or input genes that vary."); output=list(NetworkScreening=data.frame(NS1=rep(NA, dim(as.matrix(datE))[[2]])) , datME=rep(NA, dim(as.matrix(datE))[[1]]) , HubGeneSignificance=NA); } # end of if if ( numberUsefulGenes>=3 ) { datEUsefulGenes=as.matrix(datE)[,restrictGenes & !is.na(restrictGenes)] if (is.null(datME) ){ B=blockwiseModules(datEUsefulGenes, mergeCutHeight = mergeCutHeight1, TOMLevel=0,power=power, networkType=networkType, detectCutHeight = detectCutHeight, minModuleSize= minModuleSize ); datME=data.frame(B$MEs) } #end of if MMdata=signedKME(datExpr=datE, datPC=datME, outputColumnName="MM.") MMdataPvalue=as.matrix(FisherTransformP(as.matrix(MMdata), n= dim(as.matrix(datE))[[1]])) dimnames( MMdataPvalue)[[2]]=paste("Pvalue",names(MMdata), sep=".") NS1= NetworkScreeningGS(datExpr=datE, datPC=datME, GS=GS ) # here we compute the eigengene significance measures HGS1=data.frame(as.matrix(t(HubGeneSignificance(MMdata ^3,GS^3)),nrow=1)) datME=data.frame(datME) names(HGS1)=paste("HGS", substr(names(MMdata),4,100), sep="") # now we compute the AA criterion print(signif(HGS1,2)) output=list(NetworkScreening=data.frame(NS1, MMdata, MMdataPvalue), datME=data.frame(datME), HubGeneSignificance=data.frame(HGS1)) } # end of if ( numberUsefulGenes>=3 ) output } # end of function ImFeelingLazyGS # The following function computes the hub gene significance as defined in # in the paper Horvath and Dong. Input a data frame with possibly signed # module membership measures ( also known as module eigengene based connectivity #kME. Further it requires a possibly signed gene significance measure. # GS=0 means that the gene is not significant, high positive or negative values mean # that it is significant. # The input to this function can include the sign of the correlation. HubGeneSignificance=function(datKME, GS ) { no.PCs=dim(as.matrix(datKME))[[2]] no.genes= dim(as.matrix(datKME))[[1]] if ( length(GS) != no.genes ) stop("Input error in function HubGeneSignificance. Dimensions are not compatible. ") Kmax=as.numeric(apply(as.matrix(abs(datKME)),2,max, na.rm=T)) Kmax[Kmax==0]=1 datKME=scale(datKME, center=F, scale=Kmax) sumKsq=as.numeric(apply(as.matrix(datKME^2) , 2, sum, na.rm=T)) sumKsq[sumKsq==0]=1 HGS=as.numeric(apply(I(GS)*datKME, 2, sum,na.rm=T))/ sumKsq as.numeric(HGS) } #end of function HubGeneSignificance if(exists("NetworkScreeningGS") ) rm(NetworkScreeningGS); NetworkScreeningGS=function(datExpr , datPC, GS ,oddpower1=3,batchsize=1000,MinimumSampleSize=4, addGS=T){ oddpower1=as.integer(oddpower1) if (as.integer(oddpower1/2)==oddpower1/2 ) {oddpower1=oddpower1+1} no.PCs=dim(as.matrix(datPC))[[2]] no.genes=dim(as.matrix(datExpr))[[2]] GS.Weighted=rep(0,no.genes) if ( dim(as.matrix(datExpr))[[1]] != dim(as.matrix(datPC))[[1]]) stop("Error in function NetworkScreeningGS. Expression data and the module eigengenes have different numbers of observations (arrays). Specifically: dim(as.matrix(datExpr))[[1]] != dim(as.matrix(datPC))[[1]] ") if ( dim(as.matrix(datExpr))[[2]] != length(GS) ) stop("Error in function NetworkScreeningGS. The number of genes in the expression data does not match the length of the genes significance variable. Specifically, dim(as.matrix(datExpr))[[2]] != length(GS) ") No.Available=apply(as.matrix(!is.na(datExpr)), 2,sum) ExprVariance=apply(as.matrix(datExpr),2,var, na.rm=T ) restrictGenes=No.Available>=4 & ExprVariance>0 numberUsefulGenes=sum(restrictGenes,na.rm=T) if ( numberUsefulGenes<3 ) { warning("IMPORTANT warning in function NetworkScreeningGS: you input fewer than 3 useful genes. Violations: either fewer than 4 observations or the genes are constant. WGCNA cannot be used for these data. Hint: collect more arrays or input genes that vary."); datout=data.frame(GS.Weighted=rep(NA, dim(as.matrix(datExpr))[[2]]), GS=GS) } # end of if if ( numberUsefulGenes>=3 ) { no.batches=as.integer(no.PCs/batchsize) if (no.batches>0) { for (i in 1:no.batches) { print(paste("batch number = ", i)) index1=c(1:batchsize)+(i-1)* batchsize datPCBatch= datPC[,index1] datKMEBatch=as.matrix(signedKME(datExpr,datPCBatch, outputColumnName="MM.")) ESBatch= HubGeneSignificance(datKMEBatch ^oddpower1,GS^oddpower1) # the following omits the diagonal when datPC=datExpr if (no.genes==no.PCs) {diag(datKMEBatch[index1,])=0 # missing values will not be used datKMEBatch[is.na(datKMEBatch)]=0 ESBatch[is.na(ESBatch)]=0 } # end of if GS.WeightedBatch= as.matrix(datKMEBatch)^oddpower1 %*% as.matrix(ESBatch) GS.Weighted=GS.Weighted+GS.WeightedBatch } # end of for (i in 1:no.batches } # end of if (no.batches>0)... if (no.PCs-no.batches*batchsize>0 ) { restindex=c((no.batches*batchsize+1):no.PCs) datPCBatch= datPC[,restindex] datKMEBatch=as.matrix(signedKME(datExpr,datPCBatch, outputColumnName="MM.")) ESBatch= HubGeneSignificance(datKMEBatch ^oddpower1,GS^oddpower1) # the following omits the diagonal when datPC=datExpr if (no.genes==no.PCs) {diag(datKMEBatch[restindex,])=0 # missing values will not be used datKMEBatch[is.na(datKMEBatch)]=0 ESBatch[is.na(ESBatch)]=0 } # end of if (no.genes==no.PCs) GS.WeightedBatch= as.matrix(datKMEBatch)^oddpower1 %*% ESBatch GS.Weighted=GS.Weighted+GS.WeightedBatch } # end of if (no.PCs-no.batches*batchsize>0 ) GS.Weighted=GS.Weighted/no.PCs GS.Weighted[No.Available< MinimumSampleSize]=NA rankGS.Weighted=rank(-GS.Weighted, ties.method="first") rankGS=rank(-GS, ties.method="first") print(paste("Proportion of agreement between GS.Weighted and GS:")) for (i in c(10,20,50,100,200,500,1000)) { print(paste("Top ", i, " list of genes: prop. of agreement = ", signif(sum(rankGS.Weighted<=i & rankGS<=i,na.rm=T)/i,3) )) } # end of for loop if (mean(abs(GS.Weighted),na.rm=T)>0) {GS.Weighted=GS.Weighted/mean(abs(GS.Weighted),na.rm=T)*mean(abs(GS),na.rm=T)} if (addGS ) GS.Weighted=apply(data.frame(GS.Weighted, GS), 1,mean, na.rm=T) datout=data.frame(GS.Weighted, GS) } # end of if ( numberUsefulGenes>=3 ) { datout } # end of function # The following function computes a merging threshold for combinings modules whose # module eigengenes are more highly correlated than a certain threshold. # The output of the function is 1-threshold (i.e. for use in a dissimilarity measure # The merging threshold depends on the sample size and the true merging correlation # parameter truemergecor. # Specifically, the function uses the Fisher transform to compute the lower # confidence interval of the truemerge correlation given the sample size # and the Zquantile. if(exists("DynamicMergeCut") ) rm(DynamicMergeCut); DynamicMergeCut=function(n, truemergecor=.9, Zquantile=2.6) { if (truemergecor>1 | truemergecor<0 ) stop("Input error in the function DynamicMergeCut. Please specify 0<=truemergecor<=1") if (truemergecor==1) { print("Warning in function DynamicMergeCut: truemergecor=1. I will set it to .999."); truemermergecor=.999} if (n<4 ) {print("Warning in function DynamicMergeCut: too few observations for the dynamic assignment of the merge threshold. I will set the threshold to .35"); mergethreshold=.35} if (n>3 ) { # Fisher transform of the true merge correlation Fishertruemergecor=.5*log((1+truemergecor)/(1-truemergecor)) E=exp(2*( Fishertruemergecor -Zquantile/sqrt(n-3))) LowerBoundCIcor=(E-1)/(E+1) mergethreshold=1- LowerBoundCIcor} if (mergethreshold>1) 1 mergethreshold }# end of function DynamicMergeCut if(exists("NetworkScreening") ) rm(NetworkScreening); NetworkScreening=function(y,datPC,datExpr , oddpower1=3,batchsize=1000,MinimumSampleSize=4,addPCy=T,removeDiag=F, weightESy=0.5){ oddpower1=as.integer(oddpower1) if (as.integer(oddpower1/2)==oddpower1/2 ) {oddpower1=oddpower1+1} no.PCs=dim(as.matrix(datPC))[[2]] no.genes=dim(as.matrix(datExpr))[[2]] # Here we add y as extra PC if (no.genes>no.PCs & addPCy) { datPC=data.frame(y,datPC) } no.PCs=dim(as.matrix(datPC))[[2]] RawCor.Weighted=rep(0,no.genes) Cor.Standard= as.numeric(cor(y,datExpr,use= "p") ) NoAvailable=apply(!is.na(datExpr), 2,sum) Cor.Standard[NoAvailable< MinimumSampleSize]=NA if (no.genes==1) RawCor.Weighted=as.numeric(cor(y,datExpr,use= "p") ) no.batches=as.integer(no.PCs/batchsize) if (no.batches>0) { for (i in 1:no.batches) { print(paste("batch number = ", i)) index1=c(1:batchsize)+(i-1)* batchsize datPCBatch= datPC[,index1] datKMEBatch=as.matrix(signedKME(datExpr,datPCBatch, outputColumnName="MM.")) ES.CorBatch= as.vector(cor( as.numeric(as.character(y)) ,datPCBatch, use="p")) #weightESy ES.CorBatch[ES.CorBatch>.999]= weightESy*1+ (1- weightESy)* max(abs(ES.CorBatch[ES.CorBatch <.999 ]),na.rm=T) # the following omits the diagonal when datPC=datExpr if (no.genes==no.PCs & removeDiag) {diag(datKMEBatch[index1,])=0} if (no.genes==no.PCs ){ # missing values will not be used datKMEBatch[is.na(datKMEBatch)]=0 ES.CorBatch[is.na(ES.CorBatch)]=0 } # end of if RawCor.WeightedBatch= as.matrix(datKMEBatch)^oddpower1 %*% as.matrix(ES.CorBatch^oddpower1) RawCor.Weighted=RawCor.Weighted+RawCor.WeightedBatch } # end of for (i in 1:no.batches } # end of if (no.batches>0)... if (no.PCs-no.batches*batchsize>0 ) { restindex=c((no.batches*batchsize+1):no.PCs) datPCBatch= datPC[,restindex] datKMEBatch=as.matrix(signedKME(datExpr,datPCBatch, outputColumnName="MM.")) ES.CorBatch= as.vector(cor( as.numeric(as.character(y)) ,datPCBatch, use="p")) #weightESy ES.CorBatch[ES.CorBatch>.999]= weightESy*1+ (1- weightESy)* max(abs(ES.CorBatch[ES.CorBatch <.999 ]),na.rm=T) # the following omits the diagonal when datPC=datExpr if (no.genes==no.PCs & removeDiag) {diag(datKMEBatch[restindex,])=0} if (no.genes==no.PCs ){ # missing values will not be used datKMEBatch[is.na(datKMEBatch)]=0 ES.CorBatch[is.na(ES.CorBatch)]=0 } # end of if RawCor.WeightedBatch= as.matrix(datKMEBatch)^oddpower1 %*% ES.CorBatch^oddpower1 RawCor.Weighted=RawCor.Weighted+RawCor.WeightedBatch } # end of if (no.PCs-no.batches*batchsize>0 ) RawCor.Weighted=RawCor.Weighted/no.PCs RawCor.Weighted[NoAvailable< MinimumSampleSize]=NA #to avoid dividing by zero we scale it as follows if (max(abs(RawCor.Weighted),na.rm=T)==1) RawCor.Weighted=RawCor.Weighted/1.0000001 if (max(abs( Cor.Standard),na.rm=T)==1) Cor.Standard=Cor.Standard/1.0000001 RawZ.Weighted=sqrt(NoAvailable -2)*RawCor.Weighted/sqrt(1-RawCor.Weighted^2) Z.Standard= sqrt(NoAvailable -2)* Cor.Standard/sqrt(1-Cor.Standard^2) if (sum(abs(Z.Standard),na.rm=T) >0 ) { Z.Weighted=RawZ.Weighted/sum(abs(RawZ.Weighted),na.rm=T)*sum(abs(Z.Standard),na.rm=T) } # end of if h1=Z.Weighted/sqrt(NoAvailable-2) Cor.Weighted=h1/sqrt(1+h1^2) p.Weighted=as.numeric(2*(1-pt(abs(Z.Weighted),NoAvailable-2))) p.Standard=2*(1-pt(abs(Z.Standard),NoAvailable-2)) # since the function qvalue cannot handle missing data, we set missing p-values to 1. p.Weighted2=p.Weighted p.Standard2=p.Standard p.Weighted2[is.na(p.Weighted)]=1 p.Standard2[is.na(p.Standard)]=1 q.Weighted=try(qvalue(p.Weighted2)$qvalues) q.Standard=try(qvalue(p.Standard2)$qvalues) if (class(q.Weighted)=="try-error") q.Weighted=rep(NA, length(p.Weighted) ) if (class(q.Standard)=="try-error") q.Standard=rep(NA, length(p.Standard) ) rankCor.Weighted=rank(-abs(Cor.Weighted), ties.method="first") rankCor.Standard=rank(-abs(Cor.Standard), ties.method="first") print(paste("Proportion of agreement between lists based on abs(Cor.Weighted) and abs(Cor.Standard):")) for (i in c(10,20,50,100,200,500,1000)) { print(paste("Top ", i, " list of genes: prop. agree = ", signif(sum(rankCor.Weighted<=i & rankCor.Standard<=i,na.rm=T)/i,3))) } # end of for loop datout=data.frame(p.Weighted,q.Weighted,Cor.Weighted,Z.Weighted,p.Standard,q.Standard,Cor.Standard, Z.Standard) datout } # end of function var1=function(x) var(x, na.rm=T) if (exists("no.present")) rm(no.present); no.present=function(x) sum(!is.na(x)) # The following function computes the module eigengene based connectivity. # Input: datExpr= a possibly very large gene expression data set where the rows # correspond to samples and the columns represent genes # datPC=data frame of module eigengenes (columns correspond to module eigengenes or PCs) # A module eigengene based connectivity kME value will be computed if the gene has # a non missing expression value in at least MinimumNoSamples arrays. # Output a data frame where columns are the kME values # corresponding to different modules. # By splitting the expression data into different batches, the function can deal with # tens of thousands of gene expression data. # If there are many eigengenes (say hundreds) consider decreasing the batch size. if(exists("signedKME") ) rm(signedKME) signedKME=function(datExpr, datPC, outputColumnName="kME") { datExpr=data.frame(datExpr) datPC=data.frame(datPC) output=list() if (dim(as.matrix(datPC))[[1]] != dim(as.matrix(datExpr))[[1]] ) stop("ERROR in signedKME function: dim(datPC)[[1]] != dim(datExpr)[[1]] ") varianceZeroIndicatordatExpr=as.vector(apply(as.matrix(datExpr),2,var1))==0 varianceZeroIndicatordatPC =as.vector(apply(as.matrix(datPC),2,var1))==0 if (sum(varianceZeroIndicatordatExpr,na.rm=T)>0 ) warning("warning in signedKME: some genes are constant. Hint: consider removing constant columns from datExpr." ) if (sum(varianceZeroIndicatordatPC,na.rm=T)>0 ) warning("warning in signedKME: some module eigengenes are constant, which is very weird. Module eigengenes should have mean zero and variance 1. Hint: consider removing constant columns from datPC." ) no.presentdatExpr=as.vector(apply(!is.na(as.matrix(datExpr)),2, sum) ) if (min(no.presentdatExpr)<4 ) warning("warning in signedKME: some gene expressions have fewer than 4 observations. Hint: consider removing genes with too many missing values or collect more arrays.") output=data.frame(cor(datExpr, datPC, use="p")) output[no.presentdatExpr<4,]=NA names(output)=paste(outputColumnName, substring(names(datPC), first=3, last=100), sep="") dimnames(output)[[1]] = names(datExpr) output } # end of function signedKME