Skip Navigation


Bioinformatics Advance Access originally published online on November 25, 2004
Bioinformatics 2005 21(7):1112-1120; doi:10.1093/bioinformatics/bti124
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/7/1112    most recent
bti124v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (14)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Reverter, A.
Right arrow Articles by Dalrymple, B. P.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Reverter, A.
Right arrow Articles by Dalrymple, B. P.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Validation of alternative methods of data normalization in gene co-expression studies

Antonio Reverter *, Wes Barris , Sean McWilliam , Keren A. Byrne , Yong H. Wang , Siok H. Tan , Nick Hudson and Brian P. Dalrymple

Bioinformatics Group, CSIRO Livestock Industries, Queensland Bioscience Precinct St Lucia, QLD 4067, Australia

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 MICROARRAY DATA SETS
 3 NORMALIZATION METHODS:...
 4 NORMALIZATION METHODS: RESULTS...
 5 CONCLUSIONS
 REFERENCES
 

Motivation: Clusters of genes encoding proteins with related functions, or in the same regulatory network, often exhibit expression patterns that are correlated over a large number of conditions. Protein associations and gene regulatory networks can be modelled from expression data. We address the question of which of several normalization methods is optimal prior to computing the correlation of the expression profiles between every pair of genes.

Results: We use gene expression data from five experiments with a total of 78 hybridizations and 23 diverse conditions. Nine methods of data normalization are explored based on all possible combinations of normalization techniques according to between and within gene and experiment variation. We compare the resulting empirical distribution of gene x gene correlations with the expectations and apply cross-validation to test the performance of each method in predicting accurate functional annotation. We conclude that normalization methods based on mixed-model equations are optimal.

Contact: tony.reverter-gomez{at}csiro.au


    1 INTRODUCTION
 TOP
 Abstract
 1 INTRODUCTION
 2 MICROARRAY DATA SETS
 3 NORMALIZATION METHODS:...
 4 NORMALIZATION METHODS: RESULTS...
 5 CONCLUSIONS
 REFERENCES
 
Microarray technology provides us with the opportunity to make inferences on protein associations and gene regulatory networks on a genome-wide scale. Genes of similar function yield similar expression patterns across a diverse range of conditions (Eisen et al., 1998; Hughes et al., 2000; Kim et al., 2003; Segal et al., 2003). Based on patterns of co-expression, often from merging several data sources, transcriptional regulatory networks have been proposed mostly within (Brown et al., 2000; Guelzim et al., 2002; Horak et al., 2002; Lee et al., 2002; Gardner et al., 2003; Yu et al., 2003; Nachman, et al., 2004) but also across species (Stuart et al., 2003; Winter et al., 2004).

Given an expression vector for each gene across various experimental conditions, the simplest kernel that can be used to measure the similarity between pairs of genes is the Pearson correlation coefficient. This correlation is equivalent to the dot product when the vectors have been normalized [to zero mean and standard deviation (SD) of 1], so that each has a Euclidean length 1 (achieved by dividing each element in the vector by the square root of its norm). It gives a measure of the strength of the linear relationship between two vectors, it pays attention only to differences in pattern ignoring differences in level and amplitude, and (most importantly) assumes that each pair of values is i.i.d., an independent realization from the same (not necessarily normal) distribution. Violations of this assumption can generate spurious correlations that in turn will increase the false discovery rate (FDR) when making inference on gene regulatory networks.

Microarray data are notorious for being noisy due to systematic biases. Accounting for systematic effects during normalization, and methods to adjust for such biases is a subject of great importance (Smyth et al., 2003; and references therein), for which alternative approaches are continuously being proposed (Baird et al., 2004; Benito et al., 2004; Fan et al., 2004). Even when these effects have been accounted, it might be of interest to standardize each vector to unit variance (dividing by its SD) to avoid uninterpretable correlations resulting from either one or both vectors having a negligible variation. However, while dividing by the SD is necessary for many techniques, it amounts to giving equal prior importance to all genes (Frank and Friedman, 1993).

Most research involving co-expression of genes based on microarray data ignore alternate methods of data normalization because they are based on the quite reasonable, yet non-testable assumption, that gene pairs exhibiting co-expression in multiple species and/or tissues and across a large number of arrays are more likely to be functionally relevant (true positives). Yeung et al. (2004) concluded that the proportion of co-regulated genes is directly related to the number of hybridization experiments and not so much to the diversity of experiments. This could be attributed to the fact that noise can be better described with more technical replicates (i.e. power increases with more data), while diverse microarray experiments do not help in characterizing such noise. To standardize data, the authors used a z-score as their primary evaluation criterion. Such standardization is achieved by subtracting the average expression value of each gene across all experiments from the expression value of each gene and then divided by the SD of its expression level across all experiments. Assymptotically, and by means of the Central Limit Theorem (for details see Mood et al., 1974), such standardization is expected to yield a ‘normalized’ set of data that will follow a standard normal distribution (i.e. with zero mean and unit variance). For a bivariate normal distribution, the distribution of the correlation coefficients and density function under the null hypothesis is given by (Bevington and Robinson, 2003):

where {Gamma} (·) is the gamma function and {upsilon} = N – 2 is the degrees of freedom.

We compare nine normalization methods to pre-adjust the data prior to computing all pair-wise clone x clone, clone x gene and gene x gene correlations. We tested the performance of each method by: (1) comparing the resulting empirical distributions of correlation coefficients with theoretical expectations in P(r); and (2) using validation on those clones for which a reliable functional annotation to a gene existed. Finally, once the normalization method of choice is identified, we show how the visualization of the resulting correlation matrices after sorting genes according to simple criteria can reveal structures that are consistent with functional biology.


    2 MICROARRAY DATA SETS
 TOP
 Abstract
 1 INTRODUCTION
 2 MICROARRAY DATA SETS
 3 NORMALIZATION METHODS:...
 4 NORMALIZATION METHODS: RESULTS...
 5 CONCLUSIONS
 REFERENCES
 
We used five microarray studies in genetic improvement of beef cattle from our laboratory spanning a total of 78 microarray hybridizations, with mRNA from muscle and adipose tissues from 23 experimental conditions. These experiments were performed using the same bovine muscle and fat cDNA slide with 9600 elements (from now on refer to as clones) spotted twice (Lehnert et al., 2005). Spots were arranged in a layout of 48 printing blocks each of 20 x 20 spots. The 48 blocks were arranged in 12 rows of four blocks each. Importantly, duplicate clones on the array were spotted adjacent within a single block.

Several studies involving statistical and biological aspects for some of these experiments have been reported (Byrne et al., 2005; Reverter et al., 2003, 2004, 2005). For the present study, we did not perform any data pre-processing with any filtering criteria based on differential expression or absolute level of expression. We edited out readings with foreground signal ≤ background signal, and clones not observed in the 23 experimental conditions. These criteria resulted in a total of 2 095 999 fluorescent signals from 8074 clones of which 1530 contained accurate (P < 0.01/8074 = 1.24E – 6) functional BLAST annotation for 624 genes determined by searching the NCBI human reference sequence (RefSeq) collection of mRNA sequences. The method of Reverter et al. (2005) to infer transcriptome coverage estimates that, assuming a genome with 30 000 genes, a library with 9600 clones would contain 2533 unique genes while a library with 1530 (fully informative) clones would contain 716 genes. The redundancy of the cDNA libraries used to develop the microarray slide is reflected by the fact that a total of 195 genes were represented by more than one clone. These genes will provide the basis for the validation study to test the performance of each normalization method. Table 1 provides details of the five experiments including number of hybridizations, experimental conditions and summary statistics on signals.


View this table:
[in this window]
[in a new window]
 
Table 1 Description of the five microarray experiments used in the present study: number of hybridisations (Hybs.), conditions (Cond.) and summary statistics for fluorescent signals

 

    3 NORMALIZATION METHODS: DEFINITION
 TOP
 Abstract
 1 INTRODUCTION
 2 MICROARRAY DATA SETS
 3 NORMALIZATION METHODS:...
 4 NORMALIZATION METHODS: RESULTS...
 5 CONCLUSIONS
 REFERENCES
 
We explore nine normalization methods based on raw means and on solutions to uni- and multi-variate mixed-model equations and after ignoring or accounting for between and within gene and experiment variation. These methods are briefly described in Table 2 and in more detail below.


View this table:
[in this window]
[in a new window]
 
Table 2 Normalization methods used to obtain the vector of expression for each of the 8074 clones and across the 23 experimental conditions

 
3.1 Methods based on raw means
Normalization methods (1) RMNA, (2) RMCE and (3) RMEC (acronyms for Raw Means Non-Adjusted, Raw Means adjusted between Clones within Experiment and Raw Means adjusted between Experiments within Clones, respectively) were computed for each clone in c (c = 1,..., 8074) as follows:
(1) ;
(2) ; and
(3) ,
where is the vector containing the average expression of the c-th clone in each of the 23 experimental treatments (or conditions) explored across the five microarray experiments (Table 1); µc/E is the vector containing the average expression of the c-th clone in the E-th experiment (thus, for each clone, there were 5 µc/E, one for each experiment, and each value in was adjusted by the µc/E corresponding to its experiment of origin); {sigma}E is the SD of all the expressions in the E-th experiment; µE/c is the vector containing the average expression of the E-th experiment; and {sigma}c is the SD of all the expressions in the c-th clone. It can be shown that, due to unit variance of RMECc and assuming independence between µE/i and for two clones in i and j where i != j, the correlation between RMNAi and RMNAj is equivalent to the correlation between RMECi and RMECj. Thus, and within rounding, results from RMNA are expected to be identical to those from RMEC.

3.2 Methods based on univariate mixed-models
Normalization methods (4) MM1NA, (5) MM1CE and (6) MM1EC (MM1 for univariate mixed-model and NA, CE and EC as defined earlier) were obtained after implementing the following univariate mixed-model:

(1)
where y is the vector of all the intensity signals (N = 2 095 999), background corrected and base-2 log transformed; X is the incidence matrix relating signals in y with systematic fixed effects in ß including array slide, printing block and fluorescent dye channel; Z1 is the incidence matrix relating y with random effects in c corresponding to the clones (N = 8074) printed on the array; Z2 is the incidence matrix relating y with random effects in a corresponding to the three-way interaction of clone by array and printing block; Z3 is the incidence matrix relating y with random effects in d corresponding to the interaction of clone by the two fluorescent dye channels; Z4 is the incidence matrix relating y with random effects in t corresponding to the interaction of clones by the 23 experimental treatment conditions; e is the random error associated with signals in y.

We make standard stochastic assumptions about model (1). Accordingly, random effects are assumed to follow a normal distribution with zero mean and variance , , , and for c, a, d, t and e, respectively. We estimated variance components by restricted maximum likelihood (REML) using the VCE software (available at http://www.w3.tzv.fal.de/~eg/vce4/vce4.html).

The fitting of model (1) provided solutions (known as BLUP for Best Linear Unbiased Predictions) that were the basis for the normalization methods in:

(4) ;
(5) ; and
(6) ,
where is the vector containing the BLUP of t for the clone by treatment interaction and remaining elements are defined as before except that means and variances refer to as opposed to raw means in .

3.3 Methods based on multivariate mixed-models
Normalization methods (7) MM5NA, (8) MM5CE and (9) MM5EC (MM5 for penta-variate mixed-model and NA, CE, and EC as defined earlier) were obtained after implementing the following penta-variate (one for each microarray experiment) mixed-model:

(2)
where elements are defined as in the univariate model in (1) except that they are treated separately for each experiment in E. Accordingly, variance components for random effects in model (2) are given by: Var[cE] = {Sigma} {otimes} I, where {Sigma} is a 5 x 5 (one for each experiment) between clones co-variance matrix, {otimes} indicates the Kronecker product and I is an identity matrix of dimension 8074 (the number of clones); , , and , where Diag{·} indicates a diagonal matrix with components defined in the argument and each multiplied by an identity matrix of appropriate dimension. Again, the 35 covariance components from model (2) were estimated via REML.

Similarly, the fitting of model (2) provided BLUP that were the basis for the normalization methods in:

(7) ;
(8) and
(9)
where is the vector containing the BLUP of tE for the clone by treatment interaction and remaining elements are defined as before except that means and variances refer to as opposed to either raw means in or BLUP solutions in .

Finally, and for all nine normalization methods, vectors with (normalized) expressions for each gene in g(g = 1,..., 624) as well as for the 48 printing blocks were obtained by averaging element by element (i.e. each of the 23 conditions) those vectors corresponding to clones that were annotated to a single gene (for the gene expression vectors) or that were printed in a given printing block (for the printing block expression vectors).


    4 NORMALIZATION METHODS: RESULTS AND COMPARISONS
 TOP
 Abstract
 1 INTRODUCTION
 2 MICROARRAY DATA SETS
 3 NORMALIZATION METHODS:...
 4 NORMALIZATION METHODS: RESULTS...
 5 CONCLUSIONS
 REFERENCES
 
4.1 Distributions
Figure 1 presents the empirical density distribution resulting from all pair-wise gene correlations computed according to each of the nine normalizations methods. The theoretical density (given by P(r) in the Introduction) is also drawn in Figure 1 to illustrate the amount by which each empirical distribution deviates from the expectation.



View larger version (39K):
[in this window]
[in a new window]
 
Fig. 1 Theoretical (line) and empirical (boxes) density distribution for the 194 376 correlation coefficients among the 624 genes and for each method of adjustment

 
Methods RMNA and RMEC based on raw means yielded similar distributions. These were vastly negatively skewed with small density on negative correlations while the bulk of the correlations were in the moderate to extreme positive. This skewness was attributed in part to the redundancy of the cDNA microarray platform, but mostly to the inherent correlation existing in microarray data. Method RMCE, which attempted to correct for the between gene and within experiment variation removed such skewness, stabilized the variance but generated a uniform-like distribution that failed to approach the expected bell-shaped density. Hence, most correlations computed using the simplest models (i.e. raw-means methods) departed from the expectation under the null hypothesis. In contrast, normalization methods based on mixed-model equations appeared to properly account for the various sources of variation and yielded a set of correlations whose distributions were closer to the expectation.

In order to test the goodness-of-fit, we applied the Kolmogorov–Smirnov (K–S) test (for details see Law and Kelton, 1991 pp. 387–391). The K–S test statistic is simply the largest vertical distance between the theoretical and the empirical distribution functions. The critical values for the K–S test statistic at (1 – {alpha}) = 0.95 and 0.99 are 1.358 and 1.628, respectively. The computed values for the K–S test statistic for normalization methods 1–9 were 10.176, 4.185, 10.097, 1.434, 1.319, 1.421, 1.617, 1.189 and 1.636, respectively. Thus, at 5% significance level, MM1CE and MM5CE yielded correlations that are distributed according to expectations. At 1% significance level, this statement is only true for MM5CE.

4.2 Pair-wise comparisons
Table 3 compares the nine normalization methods in a pair-wise fashion by exploring the correlations and maximum discrepancies. The correlation of correlations resulting from methods based on raw means was small positive (~0.20) between RMNA and RMCE, and between methods RMCE and RMEC, and nearly perfect (0.98) between RMNA and RMEC. Also, these raw mean-based methods produced results that were poorly correlated with those from methods based on mixed-models. Within the mixed-models normalization methods, results were highly correlated and above 0.90 in all cases. Between mixed-models methods, results were moderately correlated with the largest being observed at 0.69 between MM1CE and MM5CE. The maximum discrepancy between raw mean-based methods and mixed-model methods approached the upper bound of 2.0 in all comparisons. Smaller discrepancies, ~0.6 within and ~1.2 across, were observed for results from mixed-model methods.


View this table:
[in this window]
[in a new window]
 
Table 3 Comparison of the nine normalization methods: average across all pair-wise gene correlations (bold, italics and on the diagonal); correlation of correlations (above diagonal); and maximum absolute discrepancy between methods (below diagonal)

 
Due to the strong parallelisms observed between methods MM1CE and MM5CE, the origin of the top 20 discrepancies was further scrutinized. For all the genes in the disputed gene pairs, the SAGE Digital Northern libraries were downloaded from the Cancer Genome Anatomy Project (CGAP) site (http://cgap.nci.nih.gov/Genes/GeneFinder). These are across-tissue libraries that include the number of transcripts (in transcripts per 200 000) for the genes of interest. These values were correlated and compared with correlations obtained from methods MM1CE and MM5CE. Comparisons failed to unequivocally identify the most concordant method. However, ribosomal protein RPN1 was involved in four of the 20 biggest discrepancies and in all cases the correlation obtained from method 8 was similar to the one from the SAGE libraries. The same conclusion was achieved with genes OPRS1 and IFRD1, both involved in two discrepancies. Results from method MM5CE were also more concordant to SAGE results when the latter originated from a large number of libraries (and thus, tissues). MM5CE was favoured in comparisons involving a total of 666 libraries, while MM1CE was favoured in comparisons involving 527 libraries.

4.3 Validation studies
Genes represented in the microarray by more than one clone provided the basis to perform cross-validation studies. The assumption is made that, on average, clones of the same gene have a higher correlation among themselves than clones of different genes. Given in Table 4 are the average correlations among all pair-wise clones for genes with more than a specified number of clones and for each normalization method. Methods based of raw means show no particular trend in the average within gene correlations. These remained constant and equal to the average correlation across clones with their corresponding genes (Table 3 diagonals) for methods RMNA and RMEC, and constant and slightly higher than the average across all genes for RMCE. In contrast, normalization methods based on mixed-model equations, while having an average correlation across all genes close to zero (Table 3 diagonals), showed a monotonic increasing pattern for the average correlation among clones with genes highly represented in the microarray.


View this table:
[in this window]
[in a new window]
 
Table 4 Average correlation among all pair-wise clones for genes with more than x clones and by each normalization method

 
For the validation experiments, a random third of the 1530 clones with accurate functional annotation were treated as unknown and the performance of each normalization method in identifying the proper gene annotation tested. This procedure was repeated 100 times, each time using a different (random) third of the clones as test clones. In each run, gene expression vectors were obtained by averaging element by element (i.e. each of the 23 conditions) those vectors corresponding to clones that were annotated to the same gene. Finally, the correlation between each clone and the genes was computed. At various thresholds for the clone to gene correlation coefficient, each clone in the test sample can be categorized in one of four ways: true positives (TP; if the correlation is greater than the threshold and the clone corresponds to the gene), true negatives (TN; if the correlation is less than or equal to the threshold and the clone does not correspond to the gene), false positives (FP; if the correlation is greater than the threshold but the clone does not correspond to the gene), and false negatives (FN; if the correlation is less than or equal to the threshold but the clone corresponds to the gene). To judge overall performance, for each correlation threshold, three parameters were recorded: FDR from the proportion of clones wrongly assigned to a gene; sensitivity (SENS) from the proportion of clones correctly assigned to a gene; and specificity (SPEC) from the proportion of clones correctly not assigned to a gene.

Table 5 presents the results of the validation study. In all cases and as expected, FDR decreased with increasing threshold of correlations. However, at correlations as high as 0.85 and among the raw mean-based methods, only RMCE achieved an FDR below the nominal 5%. In contrast, and at this 0.85 threshold for the correlation, mixed-model-based methods achieved an FDR <0.5%. A higher threshold for the correlation coefficient before a decision is made amounts to a smaller statistical significance. This explains the reduction in SENS with increasing thresholds that was observed for all normalization methods. For a given correlation threshold, similar SENS was observed for raw mean-based and pentavariate mixed-model-based methods. These values were larger than those observed for the univariate mixed-model-based methods. However, at the same level of FDR (say 7–8% achieved at correlations >0.50 and >0.85 for mixed-model-based and raw mean-based methods, respectively), the SENS of the mixed-model-based methods was several magnitudes larger than the SENS of the raw mean-based methods (65–70% against 20–40%). The pattern for SPEC was the complement of that for FDR.


View this table:
[in this window]
[in a new window]
 
Table 5 Percentage of FDR and SENS for correlation between clone and gene greater than x and by each normalization method

 
4.4 Visualization studies
In order to further compare normalization methods, the resulting correlation matrices were visualized.

Figure 2 illustrates the gene x gene correlation matrices (624 x 624) that resulted from each normalization method and where genes are sorted by their NCBI RefSeq accession number (i.e. NM_######). As a result of their strong skewness, red dominates in RMNA, RMCE and RMEC, making the task of identifying specific patterns nearly impossible. The pictures from methods based on mixed-model equations are sparser and, mostly for the pentavariate methods, reveal a block of genes towards the upper left corner with a very particular pattern. These genes were identified as ribosomal proteins (RPs) that are known to be highly correlated to each other (Stuart et al., 2003) and possess (at least those from our data set) sequential RefSeq accession numbers.



View larger version (127K):
[in this window]
[in a new window]
 
Fig. 2 Gene x gene correlation matrices generated by each of the normalization methods. Genes are sorted by RefSeq sequence identification.

 
The relative positioning of cDNA probes on microarray slides has been shown to introduce significant correlations between pairs of genes (Kluger et al., 2003). Correlation matrices corresponding to printing block x printing block (48 x 48) are given in Figure 3. Unadjusted means (methods RMNA and RMEC) fail to account for the spatial features of the slide layout and present extreme correlations among all blocks. Instead, the mixed-model-based methods (that included the random effect of clone x array slide x printing block interaction) generated pictures that clearly demonstrate improved accounting for spatial features. Mixed-model-based methods that normalized for the between clones and within experiment variance (MM1CE and MM5CE) produced pictures that were identical to those from unadjusted methods (MM1NA and MM5NA). This phenomenon was due to the fact that no two samples from different experiments were hybridized on the same microarray. In contrast, and because no clones were printed on two different blocks, µE/i and for two blocks i and j are not independent but affected by the spatial location of both blocks on the array. For this reason, MM1NA and MM1EC (and hence, MM5NA and MM5EC) yielded different pictures for the printing block x printing block correlations.



View larger version (117K):
[in this window]
[in a new window]
 
Fig. 3 Printing block x printing block correlation matrices generated by each of the normalization methods. The microarray platform contained 48 blocks (with 200 spots each) arranged in 12 rows by 4 columns.

 
Finally, Figure 4 presents the gene x gene correlation matrices that resulted from (the possibly optimal) MM5CE after sorting the genes according to four different criteria as follows: average correlation with the rest of the genes, number of clones, gene symbolic name (using the standard human gene symbols), and chromosome location based on human gene order in the genome sequence. As expected, sorting genes by their symbolic name revealed a clear cluster of highly positively correlated RPs. In addition, smaller clusters of positively correlated genes corresponding to creatinine, collagens and cytochrome oxidases (negatively correlated to RP), follistatins (positively correlated to RPs and negatively correlated to collagens), and myosins (negatively correlated to RPs and positively correlated to collagens). Sorting genes by their chromosome did not reveal any particular pattern. Due to the sparse coverage of the genome in the present data set, none of the different genes are close enough to provide a reasonable expectation of correlated expression due to close physical association.



View larger version (134K):
[in this window]
[in a new window]
 
Fig. 4 Gene x gene correlation matrices using normalization method MM5CE after sorting the genes according to four different criteria: (A) average correlation; (B) number of clones; (C) symbolic name and (D) chromosome location.

 
One rather unexpected finding was the observation that the expression of the collagen and myosin structural subunits was negatively correlated with the expression of the cluster of ribosomal proteins. Given the range of conditions examined, including animals on a starvation diet and very well fed animals, we expected to find a positive correlation between ribosomal proteins and muscle structural proteins. However, it is possible that this negative correlation is a spurious result due to the standardization of the amount of mRNA in the hybridizations, because the ribosomal protein transcripts are major components of muscle mRNA. An alternative explanation is that this observation is due to the total amount of RNA per cell not being constant. Potentially, starved animals may produce substantially less RNA per cell than well-fed animals. We intend to explore the reasons for this rather unexpected result in future analysis of the data.

The clusters of proteins that include the myosins, myosin-associated proteins and the collagens were investigated in more detail (Figure 5). A major cluster of positively correlated proteins includes the myosin heavy chain proteins MYH1 (skeletal muscle) and MYH7 (cardiac, slow), the myosin essential light chain proteins MYL1 (skeletal, fast) and MYL3 (skeletal/ventricular, slow), and the myosin associated proteins MYBPC2 (myosin binding protein C, fast type) and MYOM2 (myomesin). The expression of most of these proteins is also highly correlated in the CGAP SAGE data. The expression of the regulatory light chain protein MYL2 (cardiac, slow) and a muscle sarcomeric protein, MYOZ2 (Frey et al., 2000), are highly positively correlated, but their expression is not correlated with the expression of the large cluster containing the other myosin subunits. A similar pattern of correlations is also present in the CGAP SAGE data. The expression of three of the four collagens is also correlated with the myosin cluster, albeit more weakly than within the myosin cluster. The co-expression of collagens I and III has been described (Miskulin et al., 1986) whilst collagen V is associated with different systems and in our analysis its expression is not significantly correlated with the expression of the other collagens. Again this matches the correlations observed with the CGAP SAGE data. Interestingly, the expression of the two muscle transcription factors, MYOG (myogenin) and MYF6 (myogenic factor 6) is not correlated, they are also not correlated with the expression of any of the other proteins in this section of the analysis.



View larger version (42K):
[in this window]
[in a new window]
 
Fig. 5 Correlation matrix for the clusters of proteins that include the myosins and the collagens.

 

    5 CONCLUSIONS
 TOP
 Abstract
 1 INTRODUCTION
 2 MICROARRAY DATA SETS
 3 NORMALIZATION METHODS:...
 4 NORMALIZATION METHODS: RESULTS...
 5 CONCLUSIONS
 REFERENCES
 
Microarray experiments are notorious for producing data sets that are affected by a large number of systematic (non-biological) effects. Nevertheless, these data are an invaluable source for inferring protein associations and gene regulatory networks. Mixed-model equations are a natural framework to account for systematic sources of variation because they can accommodate correlation structures in a rather general manner. We have shown that raw mean-based methods of normalization are sub-optimal for inferring gene co-expressions as measured by the correlation between pairs of genes in various conditions. The choice of univariate or multivariate mixed-models will be greatly influenced by the accuracy of parameter estimates and by computing demands. More complex models would not provide better predictions unless their parameters can be estimated with enough accuracy. With respect to computing demands, in the present study, the univariate model with five variance components contained 694 098 equations and took 7 h and 32 min of CPU time on a Dell PC 2.2 GHz running Red Hat Linux 9. In contrast, the penta-variate model with 35 (co)variance components contained 925 167 equations and took 21 h and 3 min of CPU time on the same computer.


    Acknowledgments
 
The authors gratefully acknowledge the Cooperative Research Centre for Cattle and Beef Quality, a joint venture of The University of New England, NSW Department of Agriculture, Queensland Department of Primary Industries and CSIRO, for making available the microarray datasets.

Received on October 17, 2004; revised on November 2, 2004; accepted on November 2, 2004

    REFERENCES
 TOP
 Abstract
 1 INTRODUCTION
 2 MICROARRAY DATA SETS
 3 NORMALIZATION METHODS:...
 4 NORMALIZATION METHODS: RESULTS...
 5 CONCLUSIONS
 REFERENCES
 

    Baird, D., Johnston, P., Wilson, T. (2004) Normalization of microarray data using a spatial mixed model analysis which includes splines. Bioinformatics, 20, 3196–3205[Abstract/Free Full Text].

    Benito, M., Parker, J., Du, Q., Wu, J., Xiang, D., Perou, C.M., Marron, J.S. (2004) Adjustment of systematic microarray data biases. Bioinformatics, 20, 105–114[Abstract/Free Full Text].

    Bevington, P.R. and Robinson, K.D. Data Reduction and Error Analysis for the physical Sciences, (2003) 2nd edn , New York McGraw-Hill.

    Brown, M.P.S., Grundy, W.N., Lin, D., Cristianini, N., Sugnet, C.W., Furey, T.S., Ares, M., Haussler, D. (2000) Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl Acad. Sci. USA, 97, , pp. 262–267[Abstract/Free Full Text].

    Byrne, K.A., Wang, Y.H., Lehnert, S.A., Harper, G.S., McWilliam, S.M., Bruce, H.L., Reverter, A. (2005) Gene expression profiling of brief muscle tissue in Brahman steers during nutritional restriction. J. Anim. Sci., 83, 1–12[Abstract/Free Full Text].

    Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863–14868[Abstract/Free Full Text].

    Fan, J., Tam, P., Vande Woude, G., Ren, Y. (2004) Normalization and analysis of cDNA microarrays using within-array replications applied to neuroblastoma cell response to a cytokine. Proc. Natl Acad. Sci. USA, 101, 1153–1140[Abstract/Free Full Text].

    Frank, I.E. and Friedman, J.H. (1993) A statistical view of some chemometrics regression tools. Technometrics, 35, 109–135[CrossRef].

    Frey, N., Richardson, J.A., Olson, E.N. (2000) Calsarcins, a novel family of sarcomeric calcineurin-binging proteins. Proc. Natl Acad. Sci. USA, 97, 14632–14637[Abstract/Free Full Text].

    Gardner, T.A., di Bernardo, E., Lorenz, D., Collins, J.J. (2003) Inferring genetic networks and identifying compound mode of action via expression profiling. Science, 301, 102–105[Abstract/Free Full Text].

    Guelzim, N., Bottani, S., Bourgine, P., Kepes, F. (2002) Topological and causal structure of the yeast transcriptional regulatory network. Nat. Genet., 31, 60–63[CrossRef][ISI][Medline].

    Horak, C.E., Luscombe, N.M., Quian, J., Bertone, P., Piccirrillo, S., Gerstein, M., Snyder, M. (2002) Complex transcriptional circuitry at the G1/S transition in Saccharomyces cerevisiae. Genes Dev., 16, 3017–3033[Abstract/Free Full Text].

    Hughes, T.R., Marton, M.J., Jones, A.R., Roberts, C.J., Stoughton, R., Armour, C.D., Bennett, H.A., Coffey, E., Dai, H., He, Y.D., Kidd, M.J., King, A.M., Meyer, M.R., Slade, D., Lum, P.Y., Stepaniants, S.B., Shoemaker, D.D., Gachotte, D., Chakraburtty, K., Simon, J., Bard, M., Friend, S.H. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109–126[CrossRef][ISI][Medline].

    Kim, S.Y., Imoto, S., Miyano, S. (2003) Inferring gene networks from time series microarray data using dynamics Bayesian networks. Brief. Bioinform., 4, 228–235[Abstract/Free Full Text].

    Kluger, Y., Yu, H., Qian, J., Gerstein, M. (2003) Relationship between gene co-expression and probe localization on microarray slides. BMC Genomics, 4, 49[CrossRef][Medline].

    Law, A.M. and Kelton, W.D. Simulation Modeling and Analysis, (1991) , New York McGraw-Hill.

    Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Josheph, A., Gerber, G.K., Hannett, N.M., Harbison, C.T., Thompson, C.M., Simon, I., et al. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298, , pp. 799–804[Abstract/Free Full Text].

    Lehnert, S.A., Byrne, K.A., Wang, Y.H. (2005) Development and application of a bovine cDNA microarray for expression profiling of muscle and adipose tissue. Aust. J. Exp. Agric., 44, 1127–1133[CrossRef].

    Miskulin, M., Dalgleish, R., Kluve-Beckerman, B., Rennard, S.I., Tolstoshev, P., Brantly, M., Crystal, R.G. (1986) Hutam type III collagen gene expression is coordinately modulated with the type I collagen genes during fibroblast growth. Biochemistry, 25, 1408–1413[CrossRef][Medline].

    Mood, A.M., Graybill, F.A., Boes, D.C. Introduction to the Theory of Statistics, (1974) , New York McGraw-Hill.

    Nachman, I., Regev, A., Friedman, N. (2004) Inferring quantitative models of regulatory networks from expression data. Bioinformatics, 20, , pp. i248–i256[Abstract].

    Reverter, A., Byrne, K.A., Bruce, H.L., Wang, Y.H., Dalrymple, B.P., Lehnert, S.A. (2003) A mixture model-based cluster analysis of cDNA microarray gene expression data on Brahman and Brahman composite steers fed high, medium and low quality diets. J. Anim. Sci., 81, 1900–1910[Abstract/Free Full Text].

    Reverter, A., Wang, Y.H., Byrne, K.A., Tan, S.K., Harper, G.S., Lehnert, S.A. (2004) Joint analysis of multiple cDNA microarray studies via multivariate mixed-models applied to genetic improvement of beef cattle. J. Anim. Sci., 82, 3430–3439[Abstract/Free Full Text].

    Reverter, A., McWilliam, S.M., Barris, W., Dalrymple, B.P. (2005) A rapid method for computationally inferring transcriptome coverage and microarray sensitivity. Bioinformatics, 31, 80–89 First published August 12, 2004, doi:10.1093/bioinformatics/bth472.

    Segal, E., Shapira, M., Regev, A., Pe'er, D., Botstein, D., Koller, D., Friedman, N. (2003) Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat. Genet., 34, 166–176[CrossRef][ISI][Medline].

    Smyth, G.K., Yang, Y.H., Speed, T. (2003) Statistical issues in cDNA microarray data analysis. Methods Mol. Biol., 224, 116–136.

    Stuart, J.M., Seal, E., Koller, D., Kim, S.K. (2003) A gene-coexpression network for global discovery of conserved genetic modules. Science, 302, 249–255[Abstract/Free Full Text].

    Winter, E.E., Goodstadt, L., Ponting, C.P. (2004) Elevated rates of protein secretion, evolution, and disease among tissue-specific genes. Genome Res., 14, 54–61[Abstract/Free Full Text].

    Yeung, K.Y., Medvedovic, M., Bumgarner, R.E. (2004) From co-expression to co-regulation: how many microarray experiments do we need?. Genome Biol., 5, R48.1–R48.11.

    Yu, H., Luscombe, N.M., Qian, J., Gerstein, M. (2003) Genomic analysis of gene expression relationships in transcriptional regulatory networks. TRENDS Genet., 19, 422–427[CrossRef][ISI][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Physiol. GenomicsHome page
E. de la Vega, M. R. Hall, K. J. Wilson, A. Reverter, R. G. Woods, and B. M. Degnan
Stress-induced gene expression profiling in the black tiger shrimp Penaeus monodon
Physiol Genomics, September 11, 2007; 31(1): 126 - 138.
[Abstract] [Full Text] [PDF]


Home page
Physiol. GenomicsHome page
A. Reverter, N. J. Hudson, Y. Wang, S.-H. Tan, W. Barris, K. A. Byrne, S. M. McWilliam, C. D. K. Bottema, A. Kister, P. L. Greenwood, et al.
A gene coexpression network for bovine skeletal muscle inferred from microarray data
Physiol Genomics, December 13, 2006; 28(1): 76 - 83.
[Abstract] [Full Text] [PDF]


Home page
J ANIM SCIHome page
S. A. Lehnert, K. A. Byrne, A. Reverter, G. S. Nattrass, P. L. Greenwood, Y. H. Wang, N. J. Hudson, and G. S. Harper
Gene expression profiling of bovine skeletal muscle in response to and during recovery from chronic and severe undernutrition
J Anim Sci, December 1, 2006; 84(12): 3239 - 3250.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
A. Reverter, A. Ingham, S. A. Lehnert, S.-H. Tan, Y. Wang, A. Ratnakumar, and B. P. Dalrymple
Simultaneous identification of differential gene expression and connectivity in inflammation, adipogenesis and cancer
Bioinformatics, October 1, 2006; 22(19): 2396 - 2404.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/7/1112    most recent
bti124v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (14)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Reverter, A.
Right arrow Articles by Dalrymple, B. P.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Reverter, A.
Right arrow Articles by Dalrymple, B. P.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?