Bioinformatics Advance Access originally published online on January 19, 2007
Bioinformatics 2007 23(11):1339-1347; doi:10.1093/bioinformatics/btm002
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Modeling nonlinearity in dilution design microarray data
1Department of Mathematical Sciences, 2Department of Computer Science, 3Department of Molecular and Cell Biology, University of Texas at Dallas, Richardson, TX 75083-0688, 4Department of Statistics, Iowa State University, Ames, IA 50011 and 5Microarray Core facility, University of Texas Southwestern Medical Center, Dallas, TX 75390 USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Dilution design (Mixed tissue RNA) has been utilized by some researchers to evaluate and assess the performance of multiple microarray platforms. Current microarray data analysis approaches assume that the quantified signal intensities are linearly related to the expression of the corresponding genes in the sample. However, there are sources of nonlinearity in microarray expression measurements. Such nonlinearity study in the expressions of the RNA mixtures provides a new way to analyze gene expression data, and we argue that the nonlinearity can reveal novel information for microarray data analysis. Therefore, we proposed a statistical model, called proportion model, which is based on the linear regression analysis. To approximately quantify the nonlinearity in the dilution design, a new calibration, beta ratio (BR) was derived from the proportion model. Furthermore, a new adjusted fold change (adj-FC) was proposed to predict the true FC without nonlinearity, in particular for large FC.
Results: We applied our method to one microarray dilution dataset. The experimental results indicated that, to some extent, there are global biases comparing with the linear assumption for the significant genes. Further analysis of those highly expressed genes with significant nonlinearity revealed some promising results, e.g. poison effect was discovered for some genes in RNA mixtures. The adj-FCs of those genes with poison effect, indicate that the nonlinearity can be also caused by the inherent feature of the genes besides signal noise and technical variation. Moreover, when percentage of overlapping genes (POG) was used as a cross-platform consistency measure, adj-FC outperformed simple fold change to show that Affymetrix and Illumina platforms are consistent.
Availability: The R codes which implements all described methods, and some Supplementary material, are freely available from http://www.utdallas.edu/~ying.liu/BetaRatio.htm
Contact: ying.liu{at}utdallas.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Microarray technology, one of the latest breakthroughs in molecular biology, is shifting experimental approaches from single gene studies to genome-level analysis. It allows monitoring of gene expression for tens of thousands of genes in parallel. One of the key aims of microarray is to determine a list of differentially expressed genes (DEGs) that distinguish biological change of states such as disease from normal.
The success of microarray technology has led to the production of multiple array platforms (Barnes et al. 2005). Although different microarrays are based on hybridization of nucleic acid strands, they differ in the types and composition of probes used (short-oligonucleotide, long-oligonucleotide, cDNA, etc.), the deposition technologies, the hybridization and labeling protocols (Barnes et al., 2005; Draghici et al., 2006; Thomson et al., 2005) The reproducibility and cross-platform consistency have drawn a lot of concerns on microarray technology (Barnes et al., 2005; MAQC Consortium, 2006; Marshall, 2004). To determine the cross-platform reproducibility, Barnes et al. (2005) proposed a study design based on a dilution series of two human tissues (blood and placenta) and tested in duplicate on Affymetrix (AFX) and Illumina (ILM) platforms. The dilution design is to mix two different RNA samples (from blood and placenta) at known proportions (Fig. 1) and the same RNA mixture is analyzed in duplicate on each platform. Microarray measurements normally assume a linear relationship between the quantified intensity and the amount of mRNA in the sample. However, as pointed out by Ramdas et al. (2001), there are sources of nonlinearity in microarray expression measurements and the true intensity may not follow exact linear relationships; also, ratio underestimation is a well-known phenomenon of microarray data (Dudley et al., 2002; Yuen et al., 2002). Some studies dealing with nonlinearity in microarray data have been reported, which mainly focused on correcting the nonlinear relationship between intensity data from different photomultiplier tube (PMT) gains. (Bengtsson et al., 2004; Dodd et al., 2004; Dudley et al., 2002 Lyng et al., 2004).
|
In our study, the different dilutions of the mixed RNA samples in the same PMT setting were utilized to quantify and correct the nonlinearity, so that we can predict the true fold change for improving reproducibility and consistency between platforms. When analyzing the dilution design gene expression data (Barnes et al., 2005), we found that the linear assumption does not always stand. In one extreme case, the intensity of the gene 208295_x_at in the sample with 5% placenta is almost as high as its intensity in the sample with 100% placenta (Fig. 2a). This is a significant nonlinear phenomenon in the intensities of a gene. Another example is shown in Figure 2b. Two genes with almost the same fold change (FC) are plotted and they exhibit different inclination of nonlinearity. Although those genes have the same FC, they should be considered differently when determining the preference of DEGs because of their different nonlinearity.
|
In an effort to model the nonlinearity in dilution design, we first estimated the actual proportion (AP) according to the quantified intensities detected in the mixed RNA samples. Based on dilution proportion (DP) and AP, we defined the nonlinearity via the curve in the AP versus DP plot. Then we examined the nonlinearity in the gene expression data and developed a statistical model called proportion model, based on the linear regression analysis. According to proportion model, we define fold change (FC) as the estimated FC derived from all dilution samples, rather than the traditional FC (only pure samples). Moreover, a calibration, beta ratio (BR), derived from regularized beta function was calculated to approximately quantify the nonlinearity. We also discussed the effect of log2 transform on BR, and proposed a new adjusted fold change (adj-FC) based on the BR versus FC plot, to predict the true FC without nonlinearity.
We applied our model to the microarray dilution dataset (Barnes et al., 2005). The DEG lists were determined by ANOVA (Zar, 1999), and 0.001 was selected as P-value threshold. The experimental results indicated that, to some extent, there are global biases comparing with the linear assumption for the significant genes. Further analysis of those highly expressed genes with significant nonlinearity revealed some promising results as seen from the genes 208295_x_at (Fig. 2a) with poison effect. The adj-FCs of such genes are much larger than normal genes, which indicates that the nonlinearity can be also caused by the inherent feature of the genes besides signal noise and technical variation. Two representative genes, CSH1 and CSH2, were identified to be such genes with poison effect. They behave with the feature of multiple initiation sites of transcription, and are translated to become human placental lactogen which has the highest concentration known in pregnancy. Moreover, when percentage of overlapping genes (POG) was used as a cross-platform consistency measure (Shi et al., 2005), adj-FC outperformed simple fold change to show that Affymetrix and Illumina platforms are consistent, via 0.001 P-value threshold, in particular for large FC.
| 2 MATERIALS AND METHODS |
|---|
|
|
|---|
2.1 Gene expression dataset
The datasets, consisting of 54 675 genes for Affymetrix platform and 24 136 genes for Illumina platform were made publicly available by the original authors (Barnes et al., 2005). Briefly, Affymetrix U133plus2.0 GeneChip and Illumina Sentrix HumanRef-8 Expression BeadChip were utilized by Barnes et al. (2005) to get the intensity data for the RNA mixtures from whole blood and placenta samples. These expression data were retrieved from the web site http://benzer.ubic.ca/platformCompare. All these data were derived by the assay of six RNA mixtures (100% : 0%, 95% : 5%, 75% : 25%, 50% : 50%, 25% : 75%, 0% : 100%; blood : placenta) (Fig. 1) with two technical duplicates for each mixture (Barnes et al., 2005). AFX data were normalized with the RMA method from Bioconductors affy package and ILM data were normalized using the normalize.quantiles function from affy package (Barnes et al., 2005). No further filtering or normalization on these data was performed. The common gene list for the inter-platform comparisons was derived from the results table (http://benzer.ubic.ca/platformCompare/results.txt) prepared by Barnes et al. (2005), which identified probes that map to common genes between the platforms. There were 28 383 AFX probes and 17 711 ILM probes, which could be matched to a common gene (83 and 89% of probes mapped to genes), covering 14 929 known genes altogether (Barnes et al., 2005). We applied our method to all 54 675 genes for AFX platform and 24 136 genes for ILM platform respectively, in order to obtain the BR versus FC plots; and then POG analysis only utilized the genes from the common gene list.
2.2 Proportion model
Our approach employs the linear regression model and operates on a gene-by-gene basis. For each gene g:
Let
g be the true gene expression (transcription) level, µg be the true signal intensity. Ideally, if we could observe the true intensity µg, then we may infer the true expression level
g. There is a measurement function µ g = f(
g) to describe the relationship between
g and µg, which may not be linear (Bengtsson et al., 2004).
Let
1, ... ,
m (0% =
1 <
2 < ... <
m = 100%) denote the dilution proportions (DP) of placenta from the RNA mixture experimental design, where m is the number of dilution series (m = 6 and
1 = 0% ,
2 =5% ,
3 = 25% ,
4 =50% ,
5 =75% and
6 =100% in our study). Since the RNA mixture with the proportion
i (1
i
m) was created from sample blood and placenta, we can formulate the true expression level in this specific RNA mixture as
|
| (1) |
i = f(
g,
i), and define the corresponding actual proportions (APs) pg1, ... , pgm as
|
| (2) |
Our model estimates AP, pgi, using the weighted least square method, which is described in more details in section 2.2.1. Actually, the constraint 0% =pg1
pg2
...
pg(m-1)
pgm =100% is enforced, because the true intensity from the mixed sample (e.g. 50% placenta) should be between sample A (0% placenta) and sample B (100% placenta) in general. In the perfect linear situation, AP is the same as DP, i.e. pgi =
i (1
i
m). Hence, to some extent, the information from pgi can be used to describe the nonlinearity. Then, we calculate BR value from the regularized beta function to combine all values of pgi together in order to describe the curvature of the nonlinearity.
2.2.1 The estimation of actual proportion
Let ni (1
i
m) denote the number of replicates with the ith dilution series, and Yg,ij (1
j
ni) for the jth quantified intensity in the ith dilution series. Hence, the quantified intensity Yg,ij can be expressed as
|
| (3) |
|
| (4) |
|
| (5) |
There are two unknown types of variables: the true intensity vector ßg from the sample A and B, the actual proportions
g = (pg2, pg3, ... , pg(m-1))' in Equation (5). Based on the definition of AP in Equation (2), we should determine whether µ g, blood = µ g, placenta first via other statistical methods (e.g. ANOVA), and then analyze further those genes lower than critical P-value:
Considering the different variances in the same dilution replicates, we introduce a diagonal matrix of weights Wg. The weights to be used in weighted regression are inversely proportional to the sample variance
of the ith dilution series (Rocke and Durbin, 2001). A positive number (we chose
) was added in deriving the weights so that the weights do not become too large or too small. Hence, we get
.
We need to minimize the residual sum of squares (RSS) to estimate these variables. Actually, RSS can be expressed as a function of ßg and
g:
|
| (6) |
g:
|
| (7) |
g: RSS(
g). Therefore, to solve the weighted least square problem becomes to solve the following optimization problem:
|
| (8) |
We use Nelder–Mead method with the constrained conditions to get the numerical results, which is provided by the function constrOptim in R (2006). The DPs are used for the initial points.
2.2.2 Beta ratio: an approximate measure of nonlinearity
We applied Method (8) to normalized AFX data without log2 transform. The AP of each gene was calculated based on both of the replicates. The relationship between DP and AP, DP-versus-AP curve, was shown in Figure 3a (only some of the genes were shown). From Figure 3a, several facts can be observed: for some genes, only 5% DP behaves as if there was around 90% AP reflected on the intensity, i.e. DP << AP, (e.g. 208295_x_at); while for some other ones, 75% DP only corresponds to around 20% actual proportion, i.e. DP >> AP, (e.g. 237729_at). Furthermore, some genes follow the linear pattern that the DP is approximately equal to the AP (e.g. 213478_at). In order to measure the curvature of the nonlinearity of Figure 3a, we employed a smooth curve to approximate AP of each gene.
|
Assume that there is a curve family
with the parameter
, satisfying f
(
i) = pgi for 1
i
m if f
. We substitute pgi with f
(
i) and get RSS1 (
) = RSS(f
(
1), ... , f
(
m)). Hence, Equation (8) becomes an optimization problem about the parameter
:
|
| (9) |
Due to the constraint we mentioned before (0 = pg1}
pg2
...
pg(m-1)
pgm = 1), to determine an appropriate curve family, the following conditions (for
f
) should be satisfied:
- f
(0) =0 and f
(1) =1
- f
(x) should be non-decreasing function in [0, 1].
= (a, b) , is:
|
| (10) |
Beta Ratio can be defined as
(-1 < BR < 1) to approximately measure the curvature of the nonlinearity, where a and b can be derived numerically by Nelder–Mead method in R (2006). Figure 3b showed the relationship between BR and the curvature of the regularized beta function: the curve deviates gradually from the linear line y = x as BR approaches 1 or –1 from 0. For BR greater than 0 (convex), it means that the sample B contributes more to the intensity than sample A; otherwise, sample A contributes more to the intensity than sample B. One more fact needs to be mentioned is that if we chose blood as sample B and placenta as sample A, the sign of BR value would be reversed, because of the permutation of the parameters a, b.
2.2.3 The estimated fold change
Now we can get the estimators of AP
from the solution of Equation (9), and substitute pg1 ... pgm with
to calculate the design matrix Xg in Equation (5). The true intensity vector
will be derived from Equation (7).
The estimated fold change (FCe) derived from
and
, considers each of the dilution series. It is more reasonable than the traditional fold change derived from the pure samples, and some discussions were shown in Figure S1 of Supplementary Material. If the intensity data have been log2 transformed, define
; otherwise, define
for non-log2 data.
2.3 Effect of log2-transform on beta ratio and adjusted fold change
It is a normal practice that the microarray data is log2 transformed before data analysis. BR of log2-transformed data is different from that of non-log2 transformed data. We tried a method to adjust log ratio (i.e. FC), based on the relationship between BR and log2-transformed data.
In the perfect situation, the measurement function is f(
) = c
where c is a positive constant, and AP will be accurately equal to DP. If the data has been log2 transformed, the measurement function will become f ast; (
) = log2(f(
)) = log2(
) + log2(c), and FC satisfies FCg = log2(µ g, placenta / µ g, blood) = log2(
g, placenta /
g, blood). The formula of AP according to the new measurement function f*will be
|
| (11) |
0, where
is DP. The DP-vs-AP curve family was shown in Figure 4a.
|
To derive the relationship between BR and FCg under log2 transform, for each FCg
0, we find a regularized beta function Ia, b(
), which is closest to APg(
), defined in L2 space. It is an optimization problem with the parameters a, b as:
|
| (12) |
In fact, BR from log2-transformed data can be a calibration value for FC. In Figure 4b, BR of point A (e.g. gene A) is greater than the one on the perfect BR-vs-FC curve with the same FC. If the measurement function follows the form of f(
) = c
where c > 0, only point B has appropriate FC to make its BR equal to the BR of A. We define adj-FC of gene A as this appropriate FC of B. Hence, the point maps horizontally to the perfect BR-vs-FC curve, and we get an adjusted FC :
.
| 3 RESULTS AND DISCUSSION |
|---|
|
|
|---|
A key assumption in the microarray data analysis is that the quantified signal intensities are linearly related to the expression of the corresponding genes in the target sample (Ramdas et al., 2001). However, as pointed out by (Ramdas et al., 2001), there are sources of nonlinearity in microarray expression measurements. Dilution design (mixed tissue RNA design) has been utilized by some researchers to evaluate and assess the performance of multiple microarray platforms (Barnes et al., 2005; MAQC Consortium, 2006; Thompson et al., 2005). RNA mixtures can be used to measure the gene transcription levels in different dilution profiles and measure the linear range as a function of both signal intensity and fold change; deviations from linearity can result from signal noise and assay imprecision as well as from signal plateauing at upper or lower limits of linearity out of dynamic range (Thompson et al., 2005).
We proposed proportion model to describe the nonlinearity phenomenon in dilution design of microarray data. A calibration value (BR) and adj-FC were calculated to approximately quantify and correct the nonlinearity respectively, based on log2-transformed data. We applied proportion model to the normalized log2-transformed data from AFX and ILM prepared by Barnes et al. (2005) respectively. By the definition of AP from Equation (2), we determined a DEG list via ANOVA (Zar, 1999) of which each gene had statistical significance according to the null hypothesis µ g, blood = µ g, placenta. As suggested by MAQC Consortium (2006) 0.001 was selected as P-value threshold.
3.1 Nonlinear characterization and adjusted FC
Figure 5 shows the relationship between BR and FC when the log2 transformed data was analyzed. The red lines denote the perfect BR-vs-FC curve, around which the genes should be plotted if the gene intensities in different samples follow linear relationship. The deviation of the genes data points from the red curves can be observed clearly. Most of the data points have higher BR values than they should have in the linear assumption (i.e. the corresponding values on the red curve), which indicates that there are global biases from the linear assumption to some extent. The APs for non-log transform according to 5% DP also indicate global biases (shown in Fig. S2 of Supplementary Material); and the average APs for all DEGs were greater than the corresponding DPs, which were shown in Figure S3 of Supplementary Material. The similar comparison and analysis have been reported by Shippy et al. (2006). Furthermore, such global biases imply that placenta RNA contributes more to the mixed intensity than blood RNA.
|
From Figure 5, we can also observe that the genes with small FC (e.g. close to 0) distributed randomly without any pattern. It is widely accepted that the genes with small FC are usually not DEGs. When corresponding BRs are calculated for such genes, random pattern will dramatically influence the effectiveness and robustness of the adj-FCs,
Table 1 and Table 2 show the top 100 FC-sorted genes from AFX and ILM platforms respectively, and those genes were ranked by their adj-FCs shown in the tables. Symbol and Name columns were obtained from the WebGestalt web site (http://bioinfo.vanderbilt.edu/webgestalt/) in Table 1. Dudley et al. (2002) and Yuen et al. (2002) has pointed out that intensity ratio underestimation is a well-known phenomenon of microarray data. Notably, there is a significant bias in log ratios calculated from intensities: absolute log ratios are dramatically underestimated compared to truth, in particular for large fold change (Shi et al., 2004). Hence, it is not surprising that the adj-FCs are significantly greater than FCs shown in Tables 1 and 2.
|
|
3.2 Poison Effect
Further analysis of those highly expressed genes with large absolute value of BRs shown in Figure 5 reveals that some of the genes with large FC exhibit highly significant nonlinearity. The adj-FCs of the top-rank genes shown in Table 1 and Table 2 are too large comparing with the other genes, which is what we call poison effect.
The twelve target genes, identified by Affymetrix are shown to have the poison effect (i.e. the genes with red points on Fig. 5a and listed in Table 1), were submitted to the WebGestalt web site for further analyses. It turned out that these twelve target genes map to three major gene transcripts (i.e. CSH1, CSH2 and CSHL1 in Table 1). The two target genes identified by Illumina in Table 2 were also found to map to CSH1 and CSH2 via WebGestalt. CSH1 encodes for chorionic somatomammotropin hormone 1, CSH2 encodes for chorionic somatomammotropin hormone 2, and CSHL1 encodes for chorionic somatomammotropin hormone-like 1 proteins; these encoded protein isoforms belong to the same protein family called placental lactogen, which is a placental form of growth hormone variant. Chorionic somatomammotropin hormones 1 and 2 are upregulated and expressed to the identical mature proteins during development (Ng et al., 2003; Selvanayagam et al., 1984). Human placental lactogen (hPL), a potent glycoprotein, is made throughout gestation, increasing progressively until the 36th week, where it can be found in the maternal serum at a concentration of 5–15 µg/ml , the highest concentration of any known protein hormone (Kliman et al., 1988). This particular family member of genes (CSH1, CSH2 and CSHL1) is expressed mainly in the placenta and utilizes multiple transcription initiation sites.
This feature of multiple transcription initiation can be the major reason for the highest concentration seen for human protein hormones of hPL and poison effect seen on these expressed target genes in our analyses. Due to multiple initiation sites of transcription, CSH1 or CSH2 can be found highly expressed even in a sample mixture with little proportion of placenta RNA (e.g. 5% placenta RNA mixture); also, because of high concentrations and the digital saturation of the scanner, the intensity no longer increases linearly with concentration (Shi et al., 2004). Furthermore, the similarity of the sequences of the target genes (CSH1, CSH2 and CSHL1) in the RNA mixture may also enhance the intensity by cross hybridizations to their complementary probe genes in the microarray plate. Nonetheless, in either case, it confirms poison effect of the gene expression for some special genes.
The adj-FC of those genes with poison effect provides a way to estimate the approximate true FC in quantity without nonlinearity. However, since we do not know the true values, it is difficult to verify the estimates directly. Some studies correcting underestimation by other specific ways have been reported. Hekstra et al. (2003), Naef et al. (2003) and Held et al. (2003) addressed the problem of sequence-specific response of fluorescent signal as a function of concentration, and proposed ways to correct ratio underestimation for genes with high-fold changes observed in Affymetrix chips based on Langmuir adsorption and free-energy calculations, respectively (Shi et al., 2004).
3.3 Consistency of adj-FC
Multiple commercial microarray platforms are widely used for genome-scale gene expression analysis (Shippy et al., 2004). Microarray manufacturers provide different underlying technologies and protocols for data analysis (Barnes et al., 2005; Draghici et al., 2006; Shippy et al., 2004; Thompson et al., 2005). Conflicting results have been reported on cross-platform reproducibility and consistency (Bammler et al., 2005; Draghici et al., 2006; Irizarry et al., 2005; Miller et al., 2004). Barnes et al. (2005) experimentally compared and validated Affymetrix and Illumina platforms. The results of platform comparison indicated high agreement between these two platforms, especially for the DEGs. Our results confirm this conclusion. The shapes of the BR-vs-FC figures derived from the two platforms are very similar (Fig. 5).
Percentage of overlapping genes (POG) has been reported to be an effective metric for cross-platform consistency analysis (Shi et al., 2005). POG represents the number of genes common in the two significant gene lists (derived from the two platforms, AFX and ILM), divided by L, the number of genes in the selected gene list. The higher the POG values, the more consistent the two platforms are. Shi et al. (2005) found that ranking genes by their FC yields a better POG curve than ranking genes by their P-values. The results from MAQC Consortium (2006) had also indicated that a straightforward approach of FC ranking plus a non-stringent P-value cutoff can be successful in identifying reproducible gene lists. In this study, we intend to test if adj-FC can improve the POG curve and the cross-platform consistency via 0.001 P-value threshold.
In microarray experiments, many probes are not specific for a single transcript of a gene. Probes that could not be matched across platforms represent probes which did not map to a known gene, or to genes represented on only one platform (Barnes et al., 2005). In this article, we only consider the genes in the common gene list in POG analysis.
The significant (P-value < 0.001) genes were sorted by their fold changes, and top 1500 genes from each platform were selected. In Figure 6, two POG curves were plotted: one was directly derived from the 1500 FC-sorted significant genes; the other was generated from these 1500 genes after they were sorted by adj-FC. L was set as 1, 2, 3, ... , 1500. Figure 6 indicates that adj-FC is more consistent than simple fold change to rank significant genes between two platforms, especially for the top 500 genes. When the genes after 1500 ranking were considered, these two POG curves are similar. As more candidate genes were selected, more noise and errors are also introduced because more genes with small FC were included, which affected BRs and finally influence adj-FCs.
|
3.4 Robustness and Application
When the log2 unnormalized data from AFX and ILM platforms were analyzed, their BR-vs-FC figures were very similar to Figure 5. We also identified 12 and 2 genes with poison effect respectively from AFX platform and ILM platform. It should be noted that the different data processing techniques, such as PLIER, RMA, GCRMA, yield different numbers of genes showing significant deviations in expression values between samples A and B but normalization methods are not the primary causes for the deviation from the linear assumption (Shippy et al., 2006). Our results indicated that BRs were consistent between normalized data and unnormalized data especially for those genes with large FCs. Furthermore, the number of dilution series can be one of the important factors influencing robustness of BR. We suggest more dilution series and symmetrical dilution design in the microarray experiment, which can improve the analysis of proportion model.
The adj-FC from proportion model predicted the real level of gene expression outside the dynamic range of the current array technology, and also enhanced the reproducibility measure of differentially expressed genes (DEGs) among different platforms, especially for the top and highly significant DEGs that are of biologists interests. Furthermore, the BR-versus-FC figure can be a calibration metric to test whether the quantified signal intensities are linearly related to the gene expressions.
| 4 CONCLUSION AND FUTURE WORK |
|---|
|
|
|---|
Our reanalysis of the dataset of Barnes et al. (2005) indicated that there are global biases comparing with the linear assumption to some extent. BR-vs-FC figure can clearly illustrate the distribution of BR and FC, which may deviate the perfect curve of linear assumption. Genes with poison effect, those large absolute values of FC with BR close to 1 or –1, also can be clearly shown in BR-vs-FC figure. Moreover, their adj-FCs imply some inherited feature of the genes, such as multiple initiation sites of transcription. The comparison of different POGs between adj-FC and simple FC indicates that adj-FC outperformed simple FC to show that AFX and ILM platforms are consistent, via 0.001 P-value threshold.
Figure 5 showed that BRs distributed randomly without any pattern for those genes with small FC, although their P-values were lower than 0.001. The variance of BR or AP can be considered as one of the important factors, when we develop a specific method to estimate P-value for the proportion model in future. Moreover, BR from the regularized beta function is an approximate metric, and may be not enough for the future work.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We thank Michael Barnes and Paul Pavlidis for sharing the dilution design microarray datasets. We also would like to thank Michael Baron from Department of Mathematical Sciences in University of Texas at Dallas for critical reading of the manuscript.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Trey Ideker
Received on September 25, 2006; revised on January 2, 2007; accepted on January 8, 2007
| REFERENCES |
|---|
|
|
|---|
Bammler T, et al. Standardizing global gene expression analysis between laboratories and across platforms. Nat. Methods, ( (2005) ) 2, : 351–356.[CrossRef][ISI][Medline].
Barnes M, et al. Experimental comparison and cross-validation of the Affymetrix and Illumina gene expression analysis platforms. Nucleic Acids Res, ( (2005) ) 33, : 5914–5923.
Bengtsson H, et al. Calibration and assessment of channel-specific biases in microarray data with extended dynamical range. BMC Bioinformatics, ( (2004) ) 5, : 177.[CrossRef][Medline].
Dodd LE, et al. Correcting log ratios for signal saturation in cDNA microarrays. Bioinformatics, ( (2004) ) 20, : 2685–2693.
Draghici S, et al. Reliability and reproducibility issues in DNA microarray measurements. Trends Genet, ( (2006) ) 22, : 101–109.[CrossRef][ISI][Medline].
Dudley AM, et al. Measuring absolute expression with microarray with a calibrated reference sample and an extended signal intensity range. Proc. Natl Acad. Sci. USA, ( (2002) ) 99, : 7554–7559.
Hekstra D, et al. Absolute mRNA concentrations from sequence-specific calibration of oligonucleotide arrays. Nucleic Acids Res, ( (2003) ) 31, : I1962–1968.[CrossRef].
Held GA, et al. Modeling of DNA microarray data by using physical properties of hybridization. Proc. Natl Acad. Sci. USA, ( (2003) ) 100, : 7575–7580.
Irizarry RA, et al. Multiple-laboratory comparison of microarray platforms. Nat. Methods, ( (2005) ) 2, : 345–350.[CrossRef][ISI][Medline].
Kliman HJ. From trophoblast to human placenta. In: Encyclopedia of Reproduction, —Knobil E, Neill JD, eds. ( (1998) ) San Diego, CA. ACADEMIC PRESS..
Lyng H, et al. Profound influence of microarray scanner characteristics on gene expression ratios: analysis and procedure for correction. BMC Genomics, ( (2004) ) 5, : 10.[CrossRef][Medline].
MAQC Consortium. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat. Biotechnol, ( (2006) ) 1151–1161..
Marshall E. Getting the noise out of gene arrays. Science, ( (2004) ) 306, : 630–631.
Miller RM, et al. Dysregulation of gene expression in the 1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine-lesioned mouse substantia nigra. J. Neurosci, ( (2004) ) 24, : 7445–7454.
Naef F, et al. A study of accuracy and precision in oligonucleotide arrays: extracting more signal at large concentrations. Proc. Natl Acad. Sci. USA, ( (2003) ) 100, : 4748–4753.
Ng E, et al. mRNA of placental origin is readily detectable in maternal plasma. Proc. Natl Acad. Sci. USA, ( (2003) ) 100, : 4748–4753.
R (2006) R Development Core Team. R: A language and environment for statistical computing Vienna, Austria, URL http://www.R-project.org. In: R Foundation for Statistical Computing, . ISBN 3-900051-900007-900050..
Ramdas L, et al. Sources of nonlinearity in cDNA microarray expression measurements. Genome Biol, ( (2001) ) 2, . RESEARCH0047..
Rocke DM, Durbin B. A model for measurement error for gene expression arrays. J. Comput. Biol, ( (2001) ) 8, : 557–569.[CrossRef][ISI][Medline].
Selvanayagam C, et al. Multiple origins of transcription for the human placental lactogen genes. J. Biol. Chem, ( (1984) ) 259, : 14642–14646.
Shi L, et al. Microarray scanner calibration curves: characteristics and implications. BMC Bioinformatics, ( (2004) ) 6, : S11..
Shi L, et al. Cross-platform comparability of microarray technology: intra-platform consistency and appropriate data analysis procedures are essential. BMC Bioinformatics, ( (2005) ) 6, (Suppl. 2): S12..
Shippy R, et al. Performance evaluation of commercial short-oligonucleotide microarrays and the impact of noise in making cross-platform correlations. BMC Bioinformatics, ( (2004) ) 5, : 61.[CrossRef][Medline].
Shippy R, et al. Using RNA sample titrations to assess microarray platform performance and normalization techniques. Nat. Biotechnol, ( (2006) ) 1123–1131..
Thompson KL, et al. Use of a mixed tissue RNA design for performance assessments on multiple microarray formats. Nucleic Acids Res, ( (2005) ) 33, : e187.
Yuen T, et al. Accuracy and calibration of commercial oligonucleotide and custom cDNA microarrays. Nucleic Acids Res, ( (2002) ) 30, : e48.
Zar JH. Biostatistical Analysis, ( (1999) ) 4th edn. NJ, USA: Prentice Hall..
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||






