Skip Navigation


Bioinformatics Advance Access originally published online on January 19, 2007
Bioinformatics 2007 23(11):1339-1347; doi:10.1093/bioinformatics/btm002
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/11/1339    most recent
btm002v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Zheng, X.
Right arrow Articles by Liu, Y.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Zheng, X.
Right arrow Articles by Liu, Y.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Modeling nonlinearity in dilution design microarray data

Xiuwen Zheng 1, Hung-Chung Huang 2, Wenyuan Li 2, Peng Liu 4, Quan-Zhen Li 5 and Ying Liu 2,3,*

1Department of Mathematical Sciences, 2Department of Computer Science, 3Department of Molecular and Cell Biology, University of Texas at Dallas, Richardson, TX 75083-0688, 4Department of Statistics, Iowa State University, Ames, IA 50011 and 5Microarray Core facility, University of Texas Southwestern Medical Center, Dallas, TX 75390 USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Dilution design (Mixed tissue RNA) has been utilized by some researchers to evaluate and assess the performance of multiple microarray platforms. Current microarray data analysis approaches assume that the quantified signal intensities are linearly related to the expression of the corresponding genes in the sample. However, there are sources of nonlinearity in microarray expression measurements. Such nonlinearity study in the expressions of the RNA mixtures provides a new way to analyze gene expression data, and we argue that the nonlinearity can reveal novel information for microarray data analysis. Therefore, we proposed a statistical model, called proportion model, which is based on the linear regression analysis. To approximately quantify the nonlinearity in the dilution design, a new calibration, beta ratio (BR) was derived from the proportion model. Furthermore, a new adjusted fold change (adj-FC) was proposed to predict the true FC without nonlinearity, in particular for large FC.

Results: We applied our method to one microarray dilution dataset. The experimental results indicated that, to some extent, there are global biases comparing with the linear assumption for the significant genes. Further analysis of those highly expressed genes with significant nonlinearity revealed some promising results, e.g. ‘poison’ effect was discovered for some genes in RNA mixtures. The adj-FCs of those genes with ‘poison’ effect, indicate that the nonlinearity can be also caused by the inherent feature of the genes besides signal noise and technical variation. Moreover, when percentage of overlapping genes (POG) was used as a cross-platform consistency measure, adj-FC outperformed simple fold change to show that Affymetrix and Illumina platforms are consistent.

Availability: The R codes which implements all described methods, and some Supplementary material, are freely available from http://www.utdallas.edu/~ying.liu/BetaRatio.htm

Contact: ying.liu{at}utdallas.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 
Microarray technology, one of the latest breakthroughs in molecular biology, is shifting experimental approaches from single gene studies to genome-level analysis. It allows monitoring of gene expression for tens of thousands of genes in parallel. One of the key aims of microarray is to determine a list of differentially expressed genes (DEGs) that distinguish biological change of states such as disease from normal.

The success of microarray technology has led to the production of multiple array platforms (Barnes et al. 2005). Although different microarrays are based on hybridization of nucleic acid strands, they differ in the types and composition of probes used (short-oligonucleotide, long-oligonucleotide, cDNA, etc.), the deposition technologies, the hybridization and labeling protocols (Barnes et al., 2005; Draghici et al., 2006; Thomson et al., 2005) The reproducibility and cross-platform consistency have drawn a lot of concerns on microarray technology (Barnes et al., 2005; MAQC Consortium, 2006; Marshall, 2004). To determine the cross-platform reproducibility, Barnes et al. (2005) proposed a study design based on a dilution series of two human tissues (blood and placenta) and tested in duplicate on Affymetrix (AFX) and Illumina (ILM) platforms. The dilution design is to mix two different RNA samples (from blood and placenta) at known proportions (Fig. 1) and the same RNA mixture is analyzed in duplicate on each platform. Microarray measurements normally assume a linear relationship between the quantified intensity and the amount of mRNA in the sample. However, as pointed out by Ramdas et al. (2001), there are sources of nonlinearity in microarray expression measurements and the true intensity may not follow exact linear relationships; also, ratio underestimation is a well-known phenomenon of microarray data (Dudley et al., 2002; Yuen et al., 2002). Some studies dealing with nonlinearity in microarray data have been reported, which mainly focused on correcting the nonlinear relationship between intensity data from different photomultiplier tube (PMT) gains. (Bengtsson et al., 2004; Dodd et al., 2004; Dudley et al., 2002 Lyng et al., 2004).


Figure 1
View larger version (17K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Dilution design: two different RNA samples are mixed at known proportions.

 
In our study, the different dilutions of the mixed RNA samples in the same PMT setting were utilized to quantify and correct the nonlinearity, so that we can predict the true fold change for improving reproducibility and consistency between platforms. When analyzing the dilution design gene expression data (Barnes et al., 2005), we found that the linear assumption does not always stand. In one extreme case, the intensity of the gene 208295_x_at in the sample with 5% placenta is almost as high as its intensity in the sample with 100% placenta (Fig. 2a). This is a significant nonlinear phenomenon in the intensities of a gene. Another example is shown in Figure 2b. Two genes with almost the same fold change (FC) are plotted and they exhibit different inclination of nonlinearity. Although those genes have the same FC, they should be considered differently when determining the preference of DEGs because of their different nonlinearity.


Figure 2
View larger version (12K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Gene intensities with two RNA samples mixed at known proportions. (a) Gene, probes 208295_x_at, from the Affymetrix platform, shows the significant nonlinear phenomenon. (b) Gene 213478_at and 202551_s_at with almost the same FC are plotted and they exhibit different inclination of nonlinearity. A color version of this figure can be found in www.bioinformatics.oxfordjournals.org online.

 
In an effort to model the nonlinearity in dilution design, we first estimated the ‘actual’ proportion (AP) according to the quantified intensities detected in the mixed RNA samples. Based on dilution proportion (DP) and AP, we defined the nonlinearity via the curve in the AP versus DP plot. Then we examined the nonlinearity in the gene expression data and developed a statistical model called proportion model, based on the linear regression analysis. According to proportion model, we define fold change (FC) as the estimated FC derived from all dilution samples, rather than the traditional FC (only pure samples). Moreover, a calibration, beta ratio (BR), derived from regularized beta function was calculated to approximately quantify the nonlinearity. We also discussed the effect of log2 transform on BR, and proposed a new adjusted fold change (adj-FC) based on the BR versus FC plot, to predict the true FC without nonlinearity.

We applied our model to the microarray dilution dataset (Barnes et al., 2005). The DEG lists were determined by ANOVA (Zar, 1999), and 0.001 was selected as P-value threshold. The experimental results indicated that, to some extent, there are global biases comparing with the linear assumption for the significant genes. Further analysis of those highly expressed genes with significant nonlinearity revealed some promising results as seen from the genes 208295_x_at (Fig. 2a) with ‘poison’ effect. The adj-FCs of such genes are much larger than normal genes, which indicates that the nonlinearity can be also caused by the inherent feature of the genes besides signal noise and technical variation. Two representative genes, CSH1 and CSH2, were identified to be such genes with ‘poison’ effect. They behave with the feature of ‘multiple initiation sites of transcription’, and are translated to become human placental lactogen which has the highest concentration known in pregnancy. Moreover, when percentage of overlapping genes (POG) was used as a cross-platform consistency measure (Shi et al., 2005), adj-FC outperformed simple fold change to show that Affymetrix and Illumina platforms are consistent, via 0.001 P-value threshold, in particular for large FC.


    2 MATERIALS AND METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 Gene expression dataset
The datasets, consisting of 54 675 genes for Affymetrix platform and 24 136 genes for Illumina platform were made publicly available by the original authors (Barnes et al., 2005). Briefly, Affymetrix ‘U133plus2.0 GeneChip’ and Illumina ‘Sentrix HumanRef-8 Expression BeadChip’ were utilized by Barnes et al. (2005) to get the intensity data for the RNA mixtures from whole blood and placenta samples. These expression data were retrieved from the web site ‘http://benzer.ubic.ca/platformCompare’. All these data were derived by the assay of six RNA mixtures (100% : 0%, 95% : 5%, 75% : 25%, 50% : 50%, 25% : 75%, 0% : 100%; blood : placenta) (Fig. 1) with two technical duplicates for each mixture (Barnes et al., 2005). AFX data were normalized with the RMA method from Bioconductor’s ‘affy’ package and ILM data were normalized using the ‘normalize.quantiles’ function from ‘affy’ package (Barnes et al., 2005). No further filtering or normalization on these data was performed. The common gene list for the inter-platform comparisons was derived from the ‘results table’ (http://benzer.ubic.ca/platformCompare/results.txt) prepared by Barnes et al. (2005), which identified probes that map to common genes between the platforms. There were 28 383 AFX probes and 17 711 ILM probes, which could be matched to a common gene (83 and 89% of probes mapped to genes), covering 14 929 known genes altogether (Barnes et al., 2005). We applied our method to all 54 675 genes for AFX platform and 24 136 genes for ILM platform respectively, in order to obtain the BR versus FC plots; and then POG analysis only utilized the genes from the common gene list.

2.2 Proportion model
Our approach employs the linear regression model and operates on a gene-by-gene basis. For each gene g:

Let {gamma}g be the true gene expression (transcription) level, µg be the true signal intensity. Ideally, if we could observe the true intensity µg, then we may infer the true expression level {gamma}g. There is a measurement function µ g = f({gamma} g) to describe the relationship between {gamma}g and µg, which may not be linear (Bengtsson et al., 2004).

Let {theta}1, ... ,{theta}m (0% = {theta} 1 < {theta} 2 < ... < {theta}m = 100%) denote the dilution proportions (DP) of placenta from the RNA mixture experimental design, where m is the number of dilution series (m = 6 and {theta} 1 = 0% , {theta} 2 =5% , {theta} 3 = 25% , {theta} 4 =50% , {theta} 5 =75% and {theta} 6 =100% in our study). Since the RNA mixture with the proportion {theta}i (1 ≤ i ≤ m) was created from sample blood and placenta, we can formulate the true expression level in this specific RNA mixture as


Formula 1

(1)
However, the linear relationship in Equation (1) does not always hold for the true intensities. Let µ g,{theta} i = f({gamma} g,{theta} i), and define the corresponding actual proportions (APs) pg1, ... , pgm as


Formula 2

(2)
which probably are different from DP. Please note that pg1 = 0 and pgm =1.

Our model estimates AP, pgi, using the weighted least square method, which is described in more details in section 2.2.1. Actually, the constraint 0% =pg1 ≤ pg2 ≤ ... ≤ pg(m-1) ≤ pgm =100% is enforced, because the true intensity from the mixed sample (e.g. 50% placenta) should be between sample A (0% placenta) and sample B (100% placenta) in general. In the perfect linear situation, AP is the same as DP, i.e. pgi = {theta} i (1 ≤ i ≤ m). Hence, to some extent, the information from pgi can be used to describe the nonlinearity. Then, we calculate BR value from the regularized beta function to combine all values of pgi together in order to describe the curvature of the nonlinearity.

2.2.1 The estimation of actual proportion
Let ni (1 ≤ i ≤ m) denote the number of replicates with the ith dilution series, and Yg,ij (1 ≤ j ≤ ni) for the jth quantified intensity in the ith dilution series. Hence, the quantified intensity Yg,ij can be expressed as


Formula 3

(3)
where eg, ij is the error with E(eg, ij) =0. The design of microarray experiments can be represented in terms of a linear model for each gene:


Formula 4

(4)
where the design matrix Xg is as the following:


Formula 5

(5)
Yg as (Yg,11, ... , Yg,1 n1, ..., Yg,m 1, ... , Yg,m nm)', and ß g = (µ g 1, µ g 2)'.

There are two unknown types of variables: the true intensity vector ßg from the sample A and B, the actual proportions {rho} g = (pg2, pg3, ... , pg(m-1))' in Equation (5). Based on the definition of AP in Equation (2), we should determine whether µ g, blood = µ g, placenta first via other statistical methods (e.g. ANOVA), and then analyze further those genes lower than critical P-value:

Considering the different variances in the same dilution replicates, we introduce a diagonal matrix of weights Wg. The weights to be used in weighted regression are inversely proportional to the sample variance Formula of the ith dilution series (Rocke and Durbin, 2001). A positive number (we chose Formula ) was added in deriving the weights so that the weights do not become too large or too small. Hence, we get Formula .

We need to minimize the residual sum of squares (RSS) to estimate these variables. Actually, RSS can be expressed as a function of ßg and {rho}g:


Formula 6

(6)
The best estimator for ßg is a function of {rho}g:


Formula 7

(7)
We substitute ßg with Formula so that Equation (6) can be rewritten as a function with one parameter {rho}g: RSS({rho} g). Therefore, to solve the weighted least square problem becomes to solve the following optimization problem:


Formula 8

(8)

We use ‘Nelder–Mead’ method with the constrained conditions to get the numerical results, which is provided by the function ‘constrOptim’ in R (2006). The DPs are used for the initial points.

2.2.2 Beta ratio: an approximate measure of nonlinearity
We applied Method (8) to normalized AFX data without log2 transform. The AP of each gene was calculated based on both of the replicates. The relationship between DP and AP, DP-versus-AP curve, was shown in Figure 3a (only some of the genes were shown). From Figure 3a, several facts can be observed: for some genes, only 5% DP behaves as if there was around 90% AP reflected on the intensity, i.e. DP << AP, (e.g. 208295_x_at); while for some other ones, 75% DP only corresponds to around 20% actual proportion, i.e. DP >> AP, (e.g. 237729_at). Furthermore, some genes follow the linear pattern that the DP is approximately equal to the AP (e.g. 213478_at). In order to measure the curvature of the nonlinearity of Figure 3a, we employed a smooth curve to approximate AP of each gene.


Figure 3
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. The relationship between DP and AP, and a regularized beta function to approximate AP of each gene. (a) DP-vs-AP curve based on two replicates of normalized AFX data without log2 transform, (only some of the genes were shown). For some genes, only 5% DP behaves as if there was around 90% AP reflected on the intensity (e.g. 208295_x_at); while for some other ones, 75% DP only corresponds to around 20% AP (e.g. 237729_at). (b) Regularized beta function to approximate AP of each gene to describe the curvature of the nonlinearity. The curve deviates gradually from the linear line y = x as BR approaches 1 or –1 from 0. A color version of this figure can be found in www.bioinformatics.oxfordjournals.org online.

 
Assume that there is a curve family {Psi} with the parameter {phi}, satisfying f{phi} ({theta} i) = pgi for 1 ≤ i ≤ m if f{phi} isin {Psi}. We substitute pgi with f{phi} ({theta} i) and get RSS1 ({phi}) = RSS(f{phi} ({theta} 1), ... , f{phi} ({theta} m)). Hence, Equation (8) becomes an optimization problem about the parameter {phi}:


Formula 9

(9)

Due to the constraint we mentioned before (0 = pg1} ≤ pg2 ≤ ... ≤ pg(m-1) ≤ pgm = 1), to determine an appropriate curve family, the following conditions (for {forall} f{phi} isin {Psi}) should be satisfied:

  1. f{phi} (0) =0 and f{phi} (1) =1
  2. f{phi} (x) should be non-decreasing function in [0, 1].
The curve patterns in Figure 3a lead us to choose the regularized beta function as the curve family. The regularized beta function Ia, b(x), with the parameters {phi} = (a, b) , is:


Formula 10

(10)

Beta Ratio can be defined as Formula (-1 < BR < 1) to approximately measure the curvature of the nonlinearity, where a and b can be derived numerically by ‘Nelder–Mead’ method in R (2006). Figure 3b showed the relationship between BR and the curvature of the regularized beta function: the curve deviates gradually from the linear line y = x as BR approaches 1 or –1 from 0. For BR greater than 0 (convex), it means that the sample B contributes more to the intensity than sample A; otherwise, sample A contributes more to the intensity than sample B. One more fact needs to be mentioned is that if we chose blood as sample B and placenta as sample A, the sign of BR value would be reversed, because of the permutation of the parameters a, b.

2.2.3 The estimated fold change
Now we can get the estimators of AP Formula from the solution of Equation (9), and substitute pg1 ... pgm with Formula to calculate the design matrix Xg in Equation (5). The true intensity vector Formula will be derived from Equation (7).

The estimated fold change (FCe) derived from Formula and Formula , considers each of the dilution series. It is more reasonable than the traditional fold change derived from the pure samples, and some discussions were shown in Figure S1 of Supplementary Material. If the intensity data have been log2 transformed, define Formula ; otherwise, define Formula for non-log2 data.

2.3 Effect of log2-transform on beta ratio and adjusted fold change
It is a normal practice that the microarray data is log2 transformed before data analysis. BR of log2-transformed data is different from that of non-log2 transformed data. We tried a method to adjust log ratio (i.e. FC), based on the relationship between BR and log2-transformed data.

In the perfect situation, the measurement function is f({gamma}) = c {gamma} where c is a positive constant, and AP will be accurately equal to DP. If the data has been log2 transformed, the measurement function will become f ast; ({gamma}) = log2(f({gamma})) = log2({gamma}) + log2(c), and FC satisfies FCg = log2g, placenta / µ g, blood) = log2({gamma} g, placenta / {gamma} g, blood). The formula of AP according to the new measurement function f*will be


Formula 11

(11)
for FCg != 0, where {theta} is DP. The DP-vs-AP curve family was shown in Figure 4a.


Figure 4
View larger version (14K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Effect of log2-transform on BR. (a) the DP-vs-AP curve family. The AP is a function of DP and FC. (b) the BR-vs-FC curve. For each gene (e.g. gene A), an adjusted FC can be defined by finding the corresponding horizontal point on the perfect curve.

 
To derive the relationship between BR and FCg under log2 transform, for each FCg != 0, we find a regularized beta function Ia, b({theta}), which is closest to APg({theta}), defined in L2 space. It is an optimization problem with the parameters a, b as:


Formula 12

(12)
BR can be calculated using the solution a* and b* from Equation (12). We get a increasing function fBR(FC) describing the relationship between FC and BR for log2-transformed data, called perfect BR-vs-FC curve (shown in Fig. 4b).

In fact, BR from log2-transformed data can be a calibration value for FC. In Figure 4b, BR of point A (e.g. gene A) is greater than the one on the perfect BR-vs-FC curve with the same FC. If the measurement function follows the form of f({gamma}) = c {gamma} where c > 0, only point B has appropriate FC to make its BR equal to the BR of A. We define adj-FC of gene A as this appropriate FC of B. Hence, the point maps horizontally to the perfect BR-vs-FC curve, and we get an adjusted FC : Formula .


    3 RESULTS AND DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 
A key assumption in the microarray data analysis is that the quantified signal intensities are linearly related to the expression of the corresponding genes in the target sample (Ramdas et al., 2001). However, as pointed out by (Ramdas et al., 2001), there are sources of nonlinearity in microarray expression measurements. Dilution design (mixed tissue RNA design) has been utilized by some researchers to evaluate and assess the performance of multiple microarray platforms (Barnes et al., 2005; MAQC Consortium, 2006; Thompson et al., 2005). RNA mixtures can be used to measure the gene transcription levels in different dilution profiles and measure the linear range as a function of both signal intensity and fold change; deviations from linearity can result from signal noise and assay imprecision as well as from signal plateauing at upper or lower limits of linearity out of dynamic range (Thompson et al., 2005).

We proposed proportion model to describe the nonlinearity phenomenon in dilution design of microarray data. A calibration value (BR) and adj-FC were calculated to approximately quantify and correct the nonlinearity respectively, based on log2-transformed data. We applied proportion model to the normalized log2-transformed data from AFX and ILM prepared by Barnes et al. (2005) respectively. By the definition of AP from Equation (2), we determined a DEG list via ANOVA (Zar, 1999) of which each gene had statistical significance according to the null hypothesis ‘µ g, blood = µ g, placenta’. As suggested by MAQC Consortium (2006) 0.001 was selected as P-value threshold.

3.1 Nonlinear characterization and adjusted FC
Figure 5 shows the relationship between BR and FC when the log2 transformed data was analyzed. The red lines denote the perfect BR-vs-FC curve, around which the genes should be plotted if the gene intensities in different samples follow linear relationship. The deviation of the genes’ data points from the red curves can be observed clearly. Most of the data points have higher BR values than they should have in the linear assumption (i.e. the corresponding values on the red curve), which indicates that there are global biases from the linear assumption to some extent. The APs for non-log transform according to 5% DP also indicate global biases (shown in Fig. S2 of Supplementary Material); and the average APs for all DEGs were greater than the corresponding DPs, which were shown in Figure S3 of Supplementary Material. The similar comparison and analysis have been reported by Shippy et al. (2006). Furthermore, such global biases imply that placenta RNA contributes more to the mixed intensity than blood RNA.


Figure 5
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. The relationship between BR and FC for different platforms: (a) Affymetrix; (b) Illumina. The red curves denote the perfect BR-vs-FC curve. The experimental results indicated that there are global biases from the linear assumption to some extent. Some of the genes, especially genes with large positive FC, do not even near the red line. The red points are the genes with ‘poison’ effect, which show large FC and very significant nonlinear pattern. A color version of this figure can be found in www.bioinformatics.oxfordjournals.org online.

 
From Figure 5, we can also observe that the genes with small FC (e.g. close to 0) distributed randomly without any pattern. It is widely accepted that the genes with small FC are usually not DEGs. When corresponding BRs are calculated for such genes, random pattern will dramatically influence the effectiveness and robustness of the adj-FCs,

Table 1 and Table 2 show the top 100 FC-sorted genes from AFX and ILM platforms respectively, and those genes were ranked by their adj-FCs shown in the tables. ‘Symbol’ and ‘Name’ columns were obtained from the ‘WebGestalt’ web site (http://bioinfo.vanderbilt.edu/webgestalt/) in Table 1. Dudley et al. (2002) and Yuen et al. (2002) has pointed out that intensity ratio underestimation is a well-known phenomenon of microarray data. Notably, there is a significant bias in log ratios calculated from intensities: absolute log ratios are dramatically underestimated compared to truth, in particular for large fold change (Shi et al., 2004). Hence, it is not surprising that the adj-FCs are significantly greater than FCs shown in Tables 1 and 2.


View this table:
[in this window]
[in a new window]

 
Table 1. Ranking of the top 100 FC-sorted genes by their adj-FCs for AFX platform, under 0.001 P-value threshold (only show the top 18 genes, the complete table is available in Table S1 of Supplementary Material.)

 

View this table:
[in this window]
[in a new window]

 
Table 2. Ranking of the top 100 FC-sorted genes by their adj-FCs for ILM platform, under 0.001 P-value threshold (only show the top eight genes, the complete table is available in Table S2 of Supplementary Material.)

 
3.2 ‘Poison’ Effect
Further analysis of those highly expressed genes with large absolute value of BRs shown in Figure 5 reveals that some of the genes with large FC exhibit highly significant nonlinearity. The adj-FCs of the top-rank genes shown in Table 1 and Table 2 are too large comparing with the other genes, which is what we call ‘poison’ effect.

The twelve target genes, identified by Affymetrix are shown to have the ‘poison’ effect (i.e. the genes with red points on Fig. 5a and listed in Table 1), were submitted to the ‘WebGestalt’ web site for further analyses. It turned out that these twelve target genes map to three major gene transcripts (i.e. CSH1, CSH2 and CSHL1 in Table 1). The two target genes identified by Illumina in Table 2 were also found to map to CSH1 and CSH2 via WebGestalt. CSH1 encodes for chorionic somatomammotropin hormone 1, CSH2 encodes for chorionic somatomammotropin hormone 2, and CSHL1 encodes for chorionic somatomammotropin hormone-like 1 proteins; these encoded protein isoforms belong to the same protein family called ‘placental lactogen’, which is a placental form of growth hormone variant. Chorionic somatomammotropin hormones 1 and 2 are upregulated and expressed to the identical mature proteins during development (Ng et al., 2003; Selvanayagam et al., 1984). Human placental lactogen (hPL), a potent glycoprotein, is made throughout gestation, increasing progressively until the 36th week, where it can be found in the maternal serum at a concentration of 5–15 µg/ml , the highest concentration of any known protein hormone (Kliman et al., 1988). This particular family member of genes (CSH1, CSH2 and CSHL1) is expressed mainly in the placenta and utilizes multiple transcription initiation sites.

This feature of ‘multiple transcription initiation’ can be the major reason for the ‘highest concentration seen for human protein hormones of hPL’ and ‘poison’ effect seen on these expressed target genes in our analyses. Due to ‘multiple initiation sites of transcription’, CSH1 or CSH2 can be found highly expressed even in a sample mixture with little proportion of placenta RNA (e.g. 5% placenta RNA mixture); also, because of high concentrations and the digital saturation of the scanner, the intensity no longer increases linearly with concentration (Shi et al., 2004). Furthermore, the similarity of the sequences of the target genes (CSH1, CSH2 and CSHL1) in the RNA mixture may also enhance the intensity by cross hybridizations to their complementary probe genes in the microarray plate. Nonetheless, in either case, it confirms ‘poison’ effect of the gene expression for some special genes.

The adj-FC of those genes with ‘poison’ effect provides a way to estimate the approximate true FC in quantity without nonlinearity. However, since we do not know the true values, it is difficult to verify the estimates directly. Some studies correcting underestimation by other specific ways have been reported. Hekstra et al. (2003), Naef et al. (2003) and Held et al. (2003) addressed the problem of sequence-specific response of fluorescent signal as a function of concentration, and proposed ways to correct ratio underestimation for genes with high-fold changes observed in Affymetrix chips based on Langmuir adsorption and free-energy calculations, respectively (Shi et al., 2004).

3.3 Consistency of adj-FC
Multiple commercial microarray platforms are widely used for genome-scale gene expression analysis (Shippy et al., 2004). Microarray manufacturers provide different underlying technologies and protocols for data analysis (Barnes et al., 2005; Draghici et al., 2006; Shippy et al., 2004; Thompson et al., 2005). Conflicting results have been reported on cross-platform reproducibility and consistency (Bammler et al., 2005; Draghici et al., 2006; Irizarry et al., 2005; Miller et al., 2004). Barnes et al. (2005) experimentally compared and validated Affymetrix and Illumina platforms. The results of platform comparison indicated high agreement between these two platforms, especially for the DEGs. Our results confirm this conclusion. The shapes of the BR-vs-FC figures derived from the two platforms are very similar (Fig. 5).

Percentage of overlapping genes (POG) has been reported to be an effective metric for cross-platform consistency analysis (Shi et al., 2005). POG represents the number of genes common in the two ‘significant’ gene lists (derived from the two platforms, AFX and ILM), divided by L, the number of genes in the selected gene list. The higher the POG values, the more consistent the two platforms are. Shi et al. (2005) found that ranking genes by their FC yields a better POG curve than ranking genes by their P-values. The results from MAQC Consortium (2006) had also indicated that a straightforward approach of FC ranking plus a non-stringent P-value cutoff can be successful in identifying reproducible gene lists. In this study, we intend to test if adj-FC can improve the POG curve and the cross-platform consistency via 0.001 P-value threshold.

In microarray experiments, many probes are not specific for a single transcript of a gene. Probes that could not be matched across platforms represent probes which did not map to a ‘known’ gene, or to genes represented on only one platform (Barnes et al., 2005). In this article, we only consider the genes in the common gene list in POG analysis.

The significant (P-value < 0.001) genes were sorted by their fold changes, and top 1500 genes from each platform were selected. In Figure 6, two POG curves were plotted: one was directly derived from the 1500 FC-sorted significant genes; the other was generated from these 1500 genes after they were sorted by adj-FC. L was set as 1, 2, 3, ... , 1500. Figure 6 indicates that adj-FC is more consistent than simple fold change to rank significant genes between two platforms, especially for the top 500 genes. When the genes after 1500 ranking were considered, these two POG curves are similar. As more candidate genes were selected, more noise and errors are also introduced because more genes with small FC were included, which affected BRs and finally influence adj-FCs.


Figure 6
View larger version (21K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. Percentage of overlapping genes (POGs) based on adj-FC and simple FC from the top 1500 FC-sorted common genes with 0.001 P-value threshold between two platforms. Adj-FC is more consistent than simple FC to select significant genes especially for the top 500 genes. A color version of this figure can be found in www.bioinformatics.oxfordjournals.org online.

 
3.4 Robustness and Application
When the log2 unnormalized data from AFX and ILM platforms were analyzed, their BR-vs-FC figures were very similar to Figure 5. We also identified 12 and 2 genes with ‘poison’ effect respectively from AFX platform and ILM platform. It should be noted that the different data processing techniques, such as PLIER, RMA, GCRMA, yield different numbers of genes showing significant deviations in expression values between samples A and B but normalization methods are not the primary causes for the deviation from the linear assumption (Shippy et al., 2006). Our results indicated that BRs were consistent between normalized data and unnormalized data especially for those genes with large FCs. Furthermore, the number of dilution series can be one of the important factors influencing robustness of BR. We suggest more dilution series and symmetrical dilution design in the microarray experiment, which can improve the analysis of proportion model.

The adj-FC from proportion model predicted the real level of gene expression outside the dynamic range of the current array technology, and also enhanced the reproducibility measure of differentially expressed genes (DEGs) among different platforms, especially for the top and highly significant DEGs that are of biologist’s interests. Furthermore, the BR-versus-FC figure can be a calibration metric to test whether the quantified signal intensities are linearly related to the gene expressions.


    4 CONCLUSION AND FUTURE WORK
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 
Our reanalysis of the dataset of Barnes et al. (2005) indicated that there are global biases comparing with the linear assumption to some extent. BR-vs-FC figure can clearly illustrate the distribution of BR and FC, which may deviate the perfect curve of linear assumption. Genes with ‘poison’ effect, those large absolute values of FC with BR close to 1 or –1, also can be clearly shown in BR-vs-FC figure. Moreover, their adj-FCs imply some inherited feature of the genes, such as ‘multiple initiation sites of transcription’. The comparison of different POGs between adj-FC and simple FC indicates that adj-FC outperformed simple FC to show that AFX and ILM platforms are consistent, via 0.001 P-value threshold.

Figure 5 showed that BRs distributed randomly without any pattern for those genes with small FC, although their P-values were lower than 0.001. The variance of BR or AP can be considered as one of the important factors, when we develop a specific method to estimate P-value for the proportion model in future. Moreover, BR from the regularized beta function is an approximate metric, and may be not enough for the future work.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 
We thank Michael Barnes and Paul Pavlidis for sharing the dilution design microarray datasets. We also would like to thank Michael Baron from Department of Mathematical Sciences in University of Texas at Dallas for critical reading of the manuscript.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Trey Ideker

Received on September 25, 2006; revised on January 2, 2007; accepted on January 8, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Bammler T, et al. Standardizing global gene expression analysis between laboratories and across platforms. Nat. Methods, ( (2005) ) 2, : 351–356.[CrossRef][ISI][Medline].

    Barnes M, et al. Experimental comparison and cross-validation of the Affymetrix and Illumina gene expression analysis platforms. Nucleic Acids Res, ( (2005) ) 33, : 5914–5923.[Abstract/Free Full Text].

    Bengtsson H, et al. Calibration and assessment of channel-specific biases in microarray data with extended dynamical range. BMC Bioinformatics, ( (2004) ) 5, : 177.[CrossRef][Medline].

    Dodd LE, et al. Correcting log ratios for signal saturation in cDNA microarrays. Bioinformatics, ( (2004) ) 20, : 2685–2693.[Abstract/Free Full Text].

    Draghici S, et al. Reliability and reproducibility issues in DNA microarray measurements. Trends Genet, ( (2006) ) 22, : 101–109.[CrossRef][ISI][Medline].

    Dudley AM, et al. Measuring absolute expression with microarray with a calibrated reference sample and an extended signal intensity range. Proc. Natl Acad. Sci. USA, ( (2002) ) 99, : 7554–7559.[Abstract/Free Full Text].

    Hekstra D, et al. Absolute mRNA concentrations from sequence-specific calibration of oligonucleotide arrays. Nucleic Acids Res, ( (2003) ) 31, : I1962–1968.[CrossRef].

    Held GA, et al. Modeling of DNA microarray data by using physical properties of hybridization. Proc. Natl Acad. Sci. USA, ( (2003) ) 100, : 7575–7580.[Abstract/Free Full Text].

    Irizarry RA, et al. Multiple-laboratory comparison of microarray platforms. Nat. Methods, ( (2005) ) 2, : 345–350.[CrossRef][ISI][Medline].

    Kliman HJ. From trophoblast to human placenta. In: Encyclopedia of Reproduction, —Knobil E, Neill JD, eds. ( (1998) ) San Diego, CA. ACADEMIC PRESS..

    Lyng H, et al. Profound influence of microarray scanner characteristics on gene expression ratios: analysis and procedure for correction. BMC Genomics, ( (2004) ) 5, : 10.[CrossRef][Medline].

    MAQC Consortium. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat. Biotechnol, ( (2006) ) 1151–1161..

    Marshall E. Getting the noise out of gene arrays. Science, ( (2004) ) 306, : 630–631.[Abstract/Free Full Text].

    Miller RM, et al. Dysregulation of gene expression in the 1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine-lesioned mouse substantia nigra. J. Neurosci, ( (2004) ) 24, : 7445–7454.[Abstract/Free Full Text].

    Naef F, et al. A study of accuracy and precision in oligonucleotide arrays: extracting more signal at large concentrations. Proc. Natl Acad. Sci. USA, ( (2003) ) 100, : 4748–4753.[Abstract/Free Full Text].

    Ng E, et al. mRNA of placental origin is readily detectable in maternal plasma. Proc. Natl Acad. Sci. USA, ( (2003) ) 100, : 4748–4753.[Abstract/Free Full Text].

    R (2006) R Development Core Team. R: A language and environment for statistical computing Vienna, Austria, URL http://www.R-project.org. In: R Foundation for Statistical Computing, . ISBN 3-900051-900007-900050..

    Ramdas L, et al. Sources of nonlinearity in cDNA microarray expression measurements. Genome Biol, ( (2001) ) 2, . RESEARCH0047..

    Rocke DM, Durbin B. A model for measurement error for gene expression arrays. J. Comput. Biol, ( (2001) ) 8, : 557–569.[CrossRef][ISI][Medline].

    Selvanayagam C, et al. Multiple origins of transcription for the human placental lactogen genes. J. Biol. Chem, ( (1984) ) 259, : 14642–14646.[Abstract/Free Full Text].

    Shi L, et al. Microarray scanner calibration curves: characteristics and implications. BMC Bioinformatics, ( (2004) ) 6, : S11..

    Shi L, et al. Cross-platform comparability of microarray technology: intra-platform consistency and appropriate data analysis procedures are essential. BMC Bioinformatics, ( (2005) ) 6, (Suppl. 2): S12..

    Shippy R, et al. Performance evaluation of commercial short-oligonucleotide microarrays and the impact of noise in making cross-platform correlations. BMC Bioinformatics, ( (2004) ) 5, : 61.[CrossRef][Medline].

    Shippy R, et al. Using RNA sample titrations to assess microarray platform performance and normalization techniques. Nat. Biotechnol, ( (2006) ) 1123–1131..

    Thompson KL, et al. Use of a mixed tissue RNA design for performance assessments on multiple microarray formats. Nucleic Acids Res, ( (2005) ) 33, : e187.[Abstract/Free Full Text].

    Yuen T, et al. Accuracy and calibration of commercial oligonucleotide and custom cDNA microarrays. Nucleic Acids Res, ( (2002) ) 30, : e48.[Abstract/Free Full Text].

    Zar JH. Biostatistical Analysis, ( (1999) ) 4th edn. NJ, USA: Prentice Hall..


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/11/1339    most recent
btm002v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Zheng, X.
Right arrow Articles by Liu, Y.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Zheng, X.
Right arrow Articles by Liu, Y.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?