Skip Navigation


Bioinformatics Advance Access originally published online on June 22, 2007
Bioinformatics 2007 23(17):2298-2305; doi:10.1093/bioinformatics/btm328
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary data
Right arrow All Versions of this Article:
23/17/2298    most recent
btm328v2
btm328v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Zhu, D.
Right arrow Articles by Li, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Zhu, D.
Right arrow Articles by Li, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Multivariate correlation estimator for inferring functional relationships from replicated genome-wide data

Dongxiao Zhu 1,*, Youjuan Li 2 and Hua Li 1

1Stowers Institute for Medical Research, 1000 E 50th Street, Kansas City, MO 64110 and 2Department of Statistics, University of Michigan, Ann Arbor, MI 48105, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 Appendix
 ACKNOWLEDGEMENTS
 REFERENCES
 

Summary: Estimating pairwise correlation from replicated genome-scale (a.k.a. OMICS) data is fundamental to cluster functionally relevant biomolecules to a cellular pathway. The popular Pearson correlation coefficient estimates bivariate correlation by averaging over replicates. It is not completely satisfactory since it introduces strong bias while reducing variance. We propose a new multivariate correlation estimator that models all replicates as independent and identically distributed (i.i.d.) samples from the multivariate normal distribution. We derive the estimator by maximizing the likelihood function. For small sample data, we provide a resampling-based statistical inference procedure, and for moderate to large sample data, we provide an asymptotic statistical inference procedure based on the Likelihood Ratio Test (LRT). We demonstrate advantages of the new multivariate correlation estimator over Pearson bivariate correlation estimator using simulations and real-world data analysis examples.

Availability: The estimator and statistical inference procedures have been implemented in an R package ‘CORREP’ that is available from CRAN [http://cran.r-project.org] and Bioconductor [http://www.bioconductor.org/].

Contact: doz{at}stowers-institute.org or dongxiaozhu{at}yahoo.com

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 Appendix
 ACKNOWLEDGEMENTS
 REFERENCES
 
The ever-increasing use of high-throughput data acquisition technologies, such as gene expression arrays and mass spectrometry, has generated an enormous amount of genome-scale data, in which every gene or gene product is associated with a series of numeric measurements. Analysis of these data provide many opportunities to understand the underlying cellular processes. Some of the familiar tasks in analysis of genome-wide measurements include pre-processing (Irizarry et al., 2003; Li and Wong, 2001; Zhou and Rocke, 2005), detecting differentially expressed genes (Cui and Churchill, 2003; Pavelka et al., 2004; Sartor et al., 2006), gene clustering (Medvedovic and Sivaganesan, 2002; Medvedovic et al., 2004; Liu et al., 2006; Yeung et al., 2001), sample classification and biomarker discovery (Nguyen and Rocke, 2002a, Nguyen and Rocke, 2002b; Yeung and Bumgarner, 2005) and gene network reconstruction (Lee et al., 2004; Zhu et al., 2005a). These analyses are useful in understanding the complex web of bio-molecule interaction and regulation in the living cell.

Many types of genome-wide data are noisy, and have to be replicated to account for the inherent variability. In this work, we discuss replication in two contexts, i.e. profiling gene expression using expression arrays, and profiling protein abundance in biological samples using mass-spectrometry.

Various forms of data replication are employed in both cDNA and Affymetrix microarray experiments. In cDNA microarray experiments, the differences between replication strategies are in the extent to which the replicates can be treated as statistically independent random samples from a population. The more popular replication methods can be divided into within-slide and between-slide replications (Speed, 2003). The former is generated by printing the same cDNA on a slide more than once for quality control purposes. The latter is generated by using different mRNA aliquot for each slide hybridization that can be further divided into technical replicates and biological replicates depends on whether the mRNA aliquots are from a single mouse or strain or not. In Affymetrix experiments, each probe set consists of about 11 probe pairs. The probe-level intensities can be treated as replicated measures of the gene expression scores.

In proteomics research, quantification experiments of the proteins or the peptides present in a sample are usually performed in replicates. For example, the recently introduced iTRAQ protein quantification technique (Ross et al., 2004) derivatizes peptides using multiple (e.g. four) isobaric mass tags i.e. mass-to-charge ratios (m/z) 114, 115, 116 and 117 Da, and it also imparts identical reversed-phase retention properties to differentially labeled peptides. Therefore, four separate peptide mixtures can be individually labeled with different mass tags, combined, and separated by reversed-phase high performance liquid chromatography (HPLC). The relative abundance of a protein is thus represented by a number of ratios of peptides abundance to the internal standard. These ratios can be viewed as replicated measurements of the protein abundance (Hardt et al., 2005).

Many data analysis approaches were able to draw reliable inference from noisy data by taking good advantage of replicates. For example, analysis of variance (ANOVA) have been applied to screen differentially expressed genes between different physiological/genetic conditions (Cui and Churchill, 2003). Bayesian mixture models have been developed and applied to replicated microarray data in order to infer gene clusters and classify tissues (Medvedovic and Sivaganesan, 2002; Medvedovic et al., 2004; Yeung and Bumgarner, 2005). However, many popular approaches to multivariate pattern discovery, such as hierarchical clustering or co-expression network construction, are based on estimating the statistical correlation between a pair of replicated biomolecule expression profiles, and the widespread bivariate correlation estimators used for that purpose (e.g. Pearson correlation coefficient) employ average expression profiles over replicates. The averaging, while on one hand reduces variance, on the other hand, tends to give rise to more biased correlation estimation. Thus the simple averaging over replicates is not completely satisfactory especially for noisy omics data that do not have good within-replicate correlation.

Here, we propose a new multivariate correlation estimator that exploits all replicates of a data set by assuming they are i.i.d. samples from the multivariate normal distribution. Based on a covariance structure that explicitly models within-replicate and between-replicate correlations of the data, we derive a maximum likelihood (ML)-based correlation estimator by maximizing the likelihood function of the replicated data. We establish a theoretical connection between the two correlation estimators when there is no replicate, and we provide a suite of statistical inference procedures for small sample size and large sample size data. Our work was strongly influenced by the previous works in estimating interclass and intraclass correlation from familial data (Chu and kshirsagar, 1994; Keen and Srivastava, 1991; Konishi, 1985; Konishi et al., 1991; Srivastava, 1984).


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 Appendix
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 Pearson bivariate correlation estimator
Suppose the biomolecule abundance levels are simultaneously measured over n independent samples. There are m replicated measures available for each sample. Let xij, yij denote the abundance levels for arbitrary biomolecule X and biomolecule Y in the i th replicate in the j th sample. The application of Pearson correlation coefficient requires averaging biomolecule abundance profiles, i.e. Formula and Formula , j = 1, 2,..., n. The sample Pearson bivariate correlation coefficient between biomolecules X and Y is defined as:


Formula 1

(1)
where Formula and Formula and Formula are the grand means for xij, yij, i = 1, 2,..., m, j = 1, 2,..., n, respectively.

2.2 Multivariate correlation estimator
Instead of averaging, we exploit all the replicated observations by assuming the data are i.i.d. samples from a multivariate normal distribution with a specified correlation matrix and a mean vector, i.e. Zj = (xj1, ...,xjm,yj1,...,yjm)T, j = 1, 2, ...,n, follows a 2m-variate normal distribution N(µ,{sum}), where Formula Formula is an m x 1 vector, the correlation matrix {sum} is a 2m x 2m matrix:


Formula 2

(2)
where the inter-molecule correlation {rho} is the parameter of interest, and the intra-molecule correlation {rho}x or {rho}y are nuisance parameters. The {rho}x and {rho}y indicate the quality of replicates that high quality replicates tend to have high value, and vice versa. We employ three parameters: {rho}, {rho}x and {rho}y to model the correlation structure of replicated data.

Assuming a multivariate normal model, the maximum likelihood estimate (MLE) of {rho} can be derived as follows (see Appendix for more details):


Formula 3

(3)
Similarly,


Formula 4

(4)
therefore, Formula

The MLE of {sum} is


Formula 5

(5)

To derive the MLE of {rho}, it would be preferable to obtain the likelihood explicitly as a function of {rho}. However, this is intractable in practice (see Appendix for a detailed discussion). Our approach is to use the average of the elements of Formula estimated via Equation (5):


Formula 6

(6)

It follows that the Formula remains invariant for data with unequal numbers of replicates (m) since it is not a function of m.

2.3 Theoretical connection between the two estimators
The sample Pearson correlation coefficient [Equation (1)] can be also written into the following form:


Formula 7

(7)
where SX and SY are SDs of X and Y, respectively. When there is no replicate (m = 1), the correlation matrix {sum} is reduced to a 2 by 2 matrix with diagonal elements equal to 1 and off-diagonal elements equal to {rho}. It is easy to show from Equation (5) that


Formula 8

(8)
Hence, we derive the connection between the two estimators when there is no replicate as follows:


Formula 9

(9)

2.4 Small sample inference procedure
For small sample size data, we provide a resampling-based statistical inference procedure. For n replicated bivariate observations Z1 = (X.1,Y.1), Z2 = (X.2,Y.2), ..., Zn = (X.n,Y.n), where . represents row index, and 1,..., n represent column index, we test the following hypothesis:


Formula 10

(10)
The steps for performing the permutation test are described in Algorithm 1, and the steps for deriving the asymptotically distribution-free bootstrap confidence interval is described in Algorithm 2 (Efron and Tibshirani, 1993, Hollander and Wolfe, 1999).
Algorithm 1 Permutation Test

1: Estimate the multivariate correlation (Table 4) of the original data using Eqs. 5, 6.
2: if sample size n ≤ 10 then
3:    for all n! permutations of (X.1, Y.1),...,(X.n,Y.n) do
4:        keep X.j constant, permute the Y.j, and re-estimate the multivariate correlation of the permutated data (i.e.Table 4, k = 1, 2, ..., n!)
5:    end for
6: end if
7: if sample size n > 10 then
8:    for P random permutations of (X.1, Y.1),...,(X.n, Y.n) do
9:        keep X.j constant, permute the Y.j, and recalculate the multivariate correlation of the permutated data (i.e. Table 4, k = 1, 2, ... , P, P ≤ n!)
10:    end for
11: end if
12: Table 4 forms empirical null distribution of {rho}
13: The empirical two-sided p-value is calculated as:

Table 4(11)

where K = n! or K = P


Algorithm 2 Bootstrap Confidence Interval (Hollander and Wolfe 1999)

1: for all B bootstrap trials do
2:    Make n random draws with replacement from the bivariate sample Z1, Z2, ..., Zn. i.e. each data vector Zi, i = 1,..., n is sampled with equivalent probability Table 5
3: Compute Table 5. Denote B values of Table 5 as Table 5
4: end for
5: An asymptotically distribution-free confidence interval for Table 5, with approximate confidence coefficient 100(1-{alpha})%, is Table 5 where

Table 5

and

Table 5(12)

6: if k=B({alpha}/2) is an integer then
7:    Table 5 is the kth-largest bootstrap replication and Table 5 is the (B+ 1–k)th-largest replication
8: end if
9: if k=B({alpha}/2) is not an integer then
10:    k=(B+1)({alpha}/2)>, the largest integer that is less than or equal to (B+1)({alpha}/2)
11: end if

2.5 Large sample inference procedure
For moderate to large sample data, we provide an LRT procedure for testing the hypothesis that the multivariate correlation {rho} vanishes. Under the multivariate normal distribution assumption stated in Section 2.2, Zj ~ N(µ, {sum}), and we test the following hypothesis:


Formula 13

(13)
Here, both {sum}0 and {sum}1 are (m1 + m2) x (m1 + m2) matrices, where m1 and m2 are number of replicates for biomolecule X and Y (in the case of equal numbers of replicates for X and Y, we have m1 = m2 = m) and Formula , where {sum}x and {sum}y, with diagonal elements identity and all the other entries being {rho}x and {rho}y, respectively. 0m12 is an m1 x m2 zero matrix and 0m21 is an m2 x m1 zero matrix, i.e. under the null hypothesis, the intermolecule correlation {rho} vanishes. {sum}xy is an m1 x m2 matrix with all entries equal to {rho}. Likewise, Formula is an m2 x m1 matrix with all entries equal to {rho}. The likelihood ratio (LR) statistic for testing the two different correlation structures can be derived as follows:


Formula 14

(14)
Note that for the test to be a true LRT, all the estimated quantities Formula in the above formula should be MLE's. In Section 2.2, Equations (3–5GoGo) give the formula of MLE's of the mean vector and the correlation matrix under H{alpha}. The MLE of the correlation matrix under H0 can be determined as Formula , where


Formula 15

(15)


Formula 16

(16)

The LR statistic, denoted by G2, is therefore:


Formula 17

(17)
where Formula . The LR statistic is asymptotically {chi}2 distributed with 2(m1x m2) degrees of freedom under H0 (Anderson, 1958).


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 Appendix
 ACKNOWLEDGEMENTS
 REFERENCES
 
3.1 Simulation setup
We used mean squared error (MSE) as an objective criterion for comparison. It is defined as


Formula 18

(18)
where l is the index, and S is the total number of simulation runs, {rho} is the true correlation coefficient and Formula is the lth (either multivariate or bivariate) estimation of the true correlation. An estimator with smaller MSE is considered to be better.

Genome-wide data may have different sample sizes (n), different number of replicates (m) and different, often not well-studied, correlation structure ({sum}). We aim to show that our multivariate estimator is consistently superior to Pearson bivariate correlation estimator in almost all the realistic combinations of the above parameters. In particular, we set the parameters to the following values:

  • The sample size n: sample size represents the number of independent biological samples of interest. We set n to be 5, 10, 20 and 50, corresponding to small/median/large samples.
  • The number of replicates m: most data only have a few replicates due to the cost of the genome-wide experiments. Therefore, we set m to be 3, 4 and 5.
  • The within-replicate correlations {rho}x or {rho}y (see Methods Section): it is an indicator of data quality. We set it at three levels: very noisy (L) (0.1–0.3), noisy (M) (0.3–0.5) and clean (H) (0.6–0.8). Evaluation based on the first two levels are of particular interest to us, since genome-wide profiling is typically noisy.
  • The between-replicate correlation {rho} (see Methods Section): similarly, {rho} can be chosen as low (0.2), median (0.4) and high (0.6). The high between-replicate correlation (typically 0.6) is of particular interest since it more reliably predict functional similarity (Zhu et al., 2005b).
A comprehensive comparisons of the two estimators can thus be done by exhaustively exploring all possible combinations of the above parameters.

3.2 Simulation results
In Figures 1 and 2, vertical axis represents the MSE ratio of Pearson estimator over multivariate estimator. Ratios greater than 1 indicate that the multivariate correlation estimator outperforms the traditional bivariate estimator. The horizontal axis represents the different combinations of categorical within-replicate correlations. We compared the two estimators under five within-replicate correlation structures: i.e. (L, L), (L, M), (L, H), (M, M) and (M, H) and three between-replicate correlation cut-offs, 0.2, 0.4 and 0.6. Note that under few combination, simulations could not be preformed because the corresponding correlation matrices were not positive definite (PD).


Figure 1
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. MSE ratios of the bivariate versus multivariate correlation estimators. Sample size is fixed (N = 20), and replicates vary from m = 3; 4 and 5. (a) Low true correlation cut-off (in logarithmic scale).(b) High true correlation cut-off. High true correlation cut-off. Colour version of this figure is available as Supplementary material online.

 

Figure 2
View larger version (17K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. MSE ratios of the bivariate versus the multivariate correlation estimators. Replicates are fixed (m = 4), and sample sizes vary (in logarithmic scale). (a) Low true correlation cut-off. (b) High true correlation cut-off. Colour version of this figure is available as Supplementary material online.

 
In Figure 1, we fixed the sample size at n = 20 to examine the performance of the multivariate estimator with varying number of replicates versus the bivariate estimator by averaging over replicates. In Figure 1a and b, almost all examined MSE ratio are greater than 1 indicating superior performance of the multivariate estimator to the bivariate estimator. In particular, Figure 1a (lower between-replicate correlation cut-off {rho} = 0.4) has a much higher overall MSE ratio. In Figure 2, we fixed the number of replicates at m = 4, which is typical in omics experiments to examine the performance over different number of samples. Similarly, almost all examined MSE ratio are greater than 1. In general, the superior performance of the multivariate estimator to the bivariate estimator is an increasing function of the sample size (n), number of replicates (m), but it is a decreasing function of data quality ({rho}x and {rho}y) and correlation cut-off. For real-world data, we expect that our correlation estimator work even better than Pearson estimator as sample size, number of replicates increase and data quality decrease. The set of comprehensive comparison results are summarized in the Supplementary Table S1.

Checking into the results more carefully, we found that the mean bivariate estimates are greatly biased upward and the mean multivariate estimates are slightly biased downward (Figure 3, Supplementary Table S1). This can be translated into a significantly reduced false positive rate when applying the multivariate estimator. We also examined variances of the two estimators, our estimator has smaller variance for noisy data, and larger variance otherwise (Supplementary Table S1). Our multivariate estimator, while a little conservative and sometime having larger variance, is much more accurate than the Pearson bivariate correlation estimator (Figure 3).


Figure 3
View larger version (17K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Comparison of the means of two estimations. The upper cluster of curves (pink) represents the bivariate estimator for omics data with different replication quality. The lower cluster of curves (blue) represents the multivariate estimator for omics data with different replication quality. Colour version of this figure is available as Supplementary material online.

 
3.3 Microarray data analysis examples
Many real-world data analysis tasks rely on accurate estimation of correlation matrix. Affymetrix made two spike-in data sets publicly available as benchmark to compare different probe set expression summarization methods, such as RMA (Irizarry et al., 2003) and GCRMA (Wu and Irizarry, 2005) [downloadable from http://affycomp.biostat.jhsph.edu/]. We instead use the two spike-in data sets as benchmark to compare correlation estimation methods: Pearson correlation coefficient using either RMA or GCRMA expression scores (GR and GG in Table 1). Multivariate correlation estimator using probe-level data where perfect match (PM) intensities of the same probe set are treated as replicated measures of the gene expression score (PR and PG in Table 1). For each spike-in data set, both nominal values of gene expression and observed probe-level intensities are available. The correlation matrix estimated by the four methods summarized in Table 1 are then compared to the nominal correlation matrix in terms of MSE. The smaller the MSE is, the closer the estimated correlation matrix to the nominal one.


View this table:
[in this window]
[in a new window]

 
Table 1. Summary of the four methods to estimate correlation from Affymetrix microarray data

 
Either array or probe set can be treated as multivariate random variable, correspondingly we calculated MSE of both array correlation matrix and probe set correlation matrix. Note that in the second data set, only probe set correlation matrix is computable due to unequal probe levels within array. In Table 2, we observe that MSE of the multivariate correlation estimators (PR and PG) are uniformly smaller than that of Pearson correlation estimator (GR and GG). It strongly indicates that pattern discovery methods based on multivariate correlation estimator are more capable to discover the true patterns of the data.

We then compared our multivariate correlation estimator with Pearson correlation estimator in the context of hierarchical gene clustering. We analyzed a subset of 205 genes whose expression were profiled using four replicates under 20 physiological/genetic conditions (Ideker et al., 2000). The 205 genes were previously classified into four functional groups (Yeung et al., 2003) that we employed it as the external knowledge for comparing performance of two correlation estimators through hierarchical gene clustering approaches. Similar to Medvedovic et al. (2004), we constructed six data sets of two replicates and four data sets of three replicates. The (average) adjusted RAND index (Hubert and Arabie, 1985) was used to measure the consistency between clustering results and the external knowledge.


View this table:
[in this window]
[in a new window]

 
Table 2. Comparing the four correlation estimation method listed in Table 1 using two spike-in data sets

 
As shown in Table 3, overall clustering algorithms based on the multivariate correlation estimator are better consistent to the external biological knowledge (higher adjusted RAND index). Note that we are interested in identifying four large clusters instead of many small clusters, therefore, we expect the divisive hierarchical clustering more suitable for this task (Speed, 2003). In Table 3, the (average) adjusted RAND index of divisive hierarchical clustering (represented by Diana) monotonically increases with the number of replicates increases and is consistent best to the external knowledge with four replicates (adjusted RAND index 0.95).


View this table:
[in this window]
[in a new window]

 
Table 3. Comparing the multivariate correlation estimator with the Pearson bivariate correlation estimator over different numbers of replicates (m) using adjusted RAND index

 
We further compared the sensitivity and specificity of the two test procedures over a wide biologically relevant range. For fair comparisons, asymptotic Fisher Z-transform was used as test procedure for Pearson correlation estimator in parallel with the asymptotic LRT test procedure for the multivariate correlation estimator. Sensitivity is calculated as the ratio between the number of gene pairs called significant by the test procedure and the number of genes pairs that does fall into the same one of four previously defined functional categories. Specificity is calculated as the ratio between the number of gene pairs NOT called significant and the number of gene pairs that does NOT fall into the same one of four previously defined functional categories.

We made comparisons using top correlated gene pairs (1–15%) and the selection is based on the empirical results that only top correlated gene pairs are likely to be functionally relevant (Griffith et al., 2005; Lee et al., 2004). In the context of co-expression network, it corresponds to the well-observed phenomenon that gene co-expression network is typically very sparse (Lee et al., 2004; Zhu et al., 2005a). Our LRT procedure performs slightly better than Fisher's Z-transform test procedure in terms of both sensitivity and specificity (Supplementary Table S2). Note that there is much overlap of significant calls between the two test procedures (Supplementary Table S2).

We claim that the better performance of our correlation estimator and test procedure was achieved under an adverse condition of high within-replicate correlation (75% quantile: 0.89, 50% quantile: 0.73, 25% quantile: 0.34, see Supplementary Fig. S1). As demonstrated in simulations, we expect our estimator to significantly outperform Pearson bivariate estimator with noisy data (low within-replicate correlation) and more replicates.


    4 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 Appendix
 ACKNOWLEDGEMENTS
 REFERENCES
 
In this article, we proposed a new multivariate correlation estimator that exploits all replicated data points in genome-scale data instead of averaging over replicates when using bivariate correlation estimator. Our simulation studies showed that the new estimator possesses low MSE, high accuracy and comparable variance to the Pearson correlation estimator. The analysis of a real-world data set further demonstrated the potential application of our method to pattern discovery problems from noisy genome-wide profiling data. In particular, our approach provide a new way of exploiting Affymetrix probe-level data to pattern discovery problems. Using two Affymetrix spike-in data sets, we have shown that our estimated correlation matrix is closer to the nominal correlation matrix than the competing methods. It seems to indicate that pattern discovery methods based on multivariate correlation estimator are better able to discover the true patterns from probe-level data. Since the multivariate correlation estimator was presented in a closed form, its computational cost is moderate compared to Pearson correlation estimator, therefore, pattern discovery algorithms based on the multivariate correlation estimator is very scalable to large omics data set.


    Appendix
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 Appendix
 ACKNOWLEDGEMENTS
 REFERENCES
 
The likelihood function of a 2m-variate normal family is as the following:


Formula 19

(19)
where Formula with em = (1, ... ,1)T. {sum} is the 2m x 2m covariance matrix with the structure specified in Equation (2). To obtain the MLE, equivalently, we wish to minimize:


Formula 20

(20)
By taking derivatives on µx and µy, we have:


Formula 21

(21)


Formula 22

(22)
Now, in order to find the MLE of {sum}, we do a slight transformation on the log-likelihood:

Formula


Formula 23

(23)
therefore,


Formula 24

(24)
This gives:


Formula 25

(25)
The ideal approach to obtain MLE of {rho} is to get the likelihood explicitly expressed as a function of the unknown parameters {theta} = (µx, µy, {rho}x, {rho}y, {rho})T. In order to do that, we can introduce the transformation used by Srivastava (1984) for interclass correlations in familial data.Let em = (1, ... , 1)T, 0m = (0, ... ,0)T be constant m x 1 vectors. Im denotes the m x m identity matrix and 0m is an m x m zero matrix. The canonical transformation of the original random vector Zj, Formula , is given by:


Formula 26

(26)
where Formula is an m x m matrix such that Cj e m = 0 m and Formula . The likelihood function of the original data is independent of the choice of Cj (Keen and Srivastava, 1991).Then the expectation and covariance matrix for the transformed random vector Tj can be expressed as:


Formula 27

(27)


Formula 28

(28)
where Formula Formula and Formula .Therefore, the {sum}–1 in the log likelihood function can be expressed as:


Formula 29

(29)
with Formula Formula We can also express |{sum}| in terms of the parameters {rho}x, {rho}y and {rho} explicitly:


Formula 30

(30)
The MLE can therefore be derived by minimizing:


Formula 31

(31)
where a = (m–1){rho}x + 1, b = (m 1){rho}y + 1 and h = [(m – 1){rho}x + 1][(m – 1){rho}y + 1] – m2 {rho}2, and SSxj, SSyj denote the sample sum of squares for gene X and Y under the jth condition, respectively:


Formula 32

(32)
Similarly,


Formula 33

(33)
By taking differentiation of this with respect to the unknown parameters {theta} = (µx, µy, {rho}x, {rho}y, {rho})T, we can get the true MLE estimates. However, we cannot get explicit solutions for any of the unknown parameters. The calculations were proved to be intractable in implementation, and we chose to use the estimation specified in Section 2 rather that the MLE solution for {rho}.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 Appendix
 ACKNOWLEDGEMENTS
 REFERENCES
 
We thank Arcady Mushegian, Norman Pavelka and Frank Emmert-Streib for helpful discussion. We would to thank anonymous reviewers for their helps in improving the quality of this manuscript.


    FOOTNOTES
 
Associate Editor: Golan Yona

Received on January 23, 2007; revised on May 31, 2007; accepted on June 17, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 Appendix
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Anderson TW. An introduction to Multivariate Analysis (1958) New York: Wiley,

    Chu SK, Kshirsagar AM. Correlation coefficient between two variables when the data set consists of observations on twins. Parisankhyan Samikkha: Int. J. Stat (1994) 1:1.

    Cui XQ, Churchill GA. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol (2003) 4:201.[CrossRef][Medline]

    Efron B, Tibshirani RJ. An Introduction to the Boostrap (1993) Chapman & Hall, New York, USA.

    Griffith OL, et al. Assessment and integration of publicly available SAGE, cDNA microarray, and oligonucleotide microarray expression data for global coexpression analyses. Genomics (2005) 86:476–488.[CrossRef][Web of Science][Medline]

    Hardt M, et al. Assessing the effects of diurnal variation on the composition of human parotid saliva: quantitative analysis of native peptides using iTRAQ reagents. Anal. Chem (2005) 77:4947–4954.[Medline]

    Hollander A, Wolfe D. Nonparametric Statistical Methods (1999) Wiley-Interscience, Hoboken NJ, USA.

    Hubert L, Arabie P. Comparing partitions. J. Classif (1985) 2:193–218.[CrossRef]

    Ideker T, et al. Testing for differentially-expressed genes by maximum-likelihood analysis of microarray data. J. Comput. Biol (2000) 7:805–817.[CrossRef][Web of Science][Medline]

    Irizarry RA, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics (2003) 4:249–264.[Abstract]

    Keen KJ, Srivastava MS. The asymptotic variance of the interclass correlation coefficient. Biometrika (1991) 78:225–228.[Abstract/Free Full Text]

    Konishi S. Normalizing and variance stabilizing transformations for intraclass correlations. Ann. Inst. Stat. Math (1985) 37:87–94.[CrossRef]

    Konishi S, et al. Inferences on multivariate measures of interclass and intraclass correlations in familial data. J. R. Stat. Soc. B (1991) 53:649–659.

    Lee HK, et al. Coexpression analysis of human genes across many microarray data sets. Genome Res (2004) 14:1085–1094.[Abstract/Free Full Text]

    Li C, Wong WH. Model-based analysis of oligonucleotide arrays: expression score computation and outlier detection. Proc. Natl Acad. Sci. USA (2001) 98:31–36.[Abstract/Free Full Text]

    Liu X, et al. Bayesian context-specific infinite mixture model for clustering of gene expression profiles across diverse microarray datasets. Bioinformatics (2006) 22:1737–1744.[Abstract/Free Full Text]

    Medvedovic M, Sivaganesan S. Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics (2002) 18:1194–1206.[Abstract/Free Full Text]

    Medvedovic M, et al. Bayesian mixtures for clustering replicated microarray data. Bioinformatics (2004) 20:1222–1232.[Abstract/Free Full Text]

    Nguyen DV, Rocke DM. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics (2002a) 18:39–50.[Abstract/Free Full Text]

    Nguyen DV, Rocke DM. Multi-class cancer classification via partial least squares using gene expression profiles. Bioinformatics (2002b) 18:1216–1226.[Abstract/Free Full Text]

    Pavelka N, et al. A power law global error model for the identification of differentially expressed genes in microarray data. BMC Bioinformatics (2004) 5:203.[CrossRef][Medline]

    Ross PL, et al. Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol. Cell Proteomics (2004) 3:1154–1169.[Abstract/Free Full Text]

    Sartor MA, et al. Intensity-based hierarchical Bayes method improves testing for differentially expressed genes in microarray experiments. BMC Bioinformatics (2006) 7:538.[CrossRef][Medline]

    Speed T, ed. Statistical analysis of gene expression microarray data. (2003) CRC Press. Chapman & Hall Boca Raton FL USA.

    Srivastava MA. Estimation of interclass correlations in familial data. Biometrika (1984) 71:177–185.[Abstract/Free Full Text]

    Wu Z, Irizarry RA. Stochastic models inspired by hybridization theory for short oligonucleotide arrays. J. Comput. Biol (2005) 12:882–893.[CrossRef][Web of Science][Medline]

    Yeung KY, Bumgarner R. Multi-class classification of microarray data with repeated measurements: application to cancer. Genome Biol (2005) 4:R83.

    Yeung KY, et al. Model-based clustering and data transformations for gene expression data. Bioinformatics (2001) 17:977–987.[Abstract/Free Full Text]

    Yeung KY, Medvedovic M, Bumgarner R. Clustering gene expression data with repeated measurements. Genome Biol (2003) 4:R34.[CrossRef][Medline]

    Zhou L, Rocke D. An expression index for Affymetrix GeneChips based on the generalized algorithm. Bioinformatics (2005) 21:3983–3989.[Abstract/Free Full Text]

    Zhu D, et al. High throughput screening of co-expressed gene pairs with controlled False Discovery Rate (FDR) and Minimum Acceptable Strength (MAS). J. Comput. Biol (2005a) 12:1027–1043.

    Zhu D, et al. Network constrined clustering for gene microarray data. Bioinformatics (2005b) 21:4014–4021.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary data
Right arrow All Versions of this Article:
23/17/2298    most recent
btm328v2
btm328v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Zhu, D.
Right arrow Articles by Li, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Zhu, D.
Right arrow Articles by Li, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?