Bioinformatics Advance Access originally published online on August 18, 2005
Bioinformatics 2005 21(20):3880-3886; doi:10.1093/bioinformatics/bti636
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
A statistical model providing comprehensive predictions for the mRNA differential display
1Institut EIT 1, Universitaet der Bundeswehr Muenchen 85577 Neubiberg, Germany
2Department of Physiology II, Universitaet Freiburg 79104 Freiburg, Germany
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: Differential display (DD) or arbitrarily primed fingerprinting serves to identify differentially expressed genes, but these techniques cannot determine how many of the theoretically available genes have been uncovered. Previous mathematical models are unsatisfying as they are not suitable to analyze experimental data.
Results: In the present study, we provide a statistical model based on the redundancy of cDNA fragments amplified during DD experiments. This model is applicable to any DD and predicts (1) the total number of genes expressed in a sample cell type or tissue, (2) the number of differentially expressed genes, (3) the coverage obtained with any given number of primer combinations. In a DD experiment comparing two developmental stages of the post natal rat inner ear, we estimated the total number of differentially expressed genes accessible by DD to be 445, and the number of primer combinations required to uncover 90% of these to be 127.
Availability: The algorithms were implemented in MatlabTM (The Mathworks, Inc., Natick, MA) environment and are available at www.physiologie.uni-freiburg.de/download.html
Contact: ellen.reisinger{at}physiologie.uni-freiburg.de
| 1 INTRODUCTION |
|---|
|
|
|---|
The mRNA differential display (DD) is a reliable and highly sensitive technique to compare gene expression in different cell types or between distinct developmental stages of a given cell type (Liang and Pardee, 1992; Liang et al., 1993; McClelland et al., 1995; Wan et al., 1996). cDNA fragments are amplified in a PCR using arbitrary primers combined with one- or two-base anchored oligo(dT)-primers each. These primers bind multiply in the transcriptome in a partly degenerated way (Liang and Pardee, 1992; Bauer et al., 1993). Mixtures of DNA fragments resulting from DD-PCRs are displayed after electrophoresis and autoradiography. Selected cDNA fragments are recovered from the gel and identified by cycle sequencing. As the intensity of a band in the display reflects the mRNA amount of the respective gene, the DD allows selective identification of genes that differ in their expression level.
To date, proven arbitrary primers as well as reliable protocols are available (see Systems and Methods section), and since genomic sequences of common laboratory animals are on hand, most sequences of the identified bands match with public database entries, making this technique easier to handle than some years ago. However, at present little is known about the coverage, i.e. the portion of differentially expressed genes detected by the DD.
Several efforts were undertaken to predict the number of arbitrary primers required to cover most expressed genes in a cell. First estimations by Liang and Pardee (1992) predicted 240 primer combinations required to statistically cover all mRNA sequences expressed in one type of cell. These predictions, however, suffered from inaccurate estimations on the total number of genes in higher organisms, and did not account for redundancy, i.e. the possibility that the same cDNA might be hit by several primer combinations. Recently, a highly elaborated mathematical model (Yang and Liang, 2004) predicted the probability for a single cDNA to be amplified at least once after a certain number of DD-PCRs, therefore being independent of redundancy or the total number of genes. This model is based on predictions on the hybridization of primers and cDNA under certain experimental conditions and considers the degeneracy of primer binding in detail. However, it exhibits major disadvantages: first, no experimental data on the degenerate binding of the 5' end of arbitrary primers other than 13mers are provided, making the model barely applicable to a DD using longer or shorter arbitrary primers. Second, changes in experimental conditions will alter the binding properties of the primers and affect the basic assumptions of this model. Third, at least some of the parameters required for predictions in this model may not be verified experimentally. Fourth, as small changes in the parameters result in major deviations in the predicted coverage, this model gives inaccurate predictions. Moreover, none of the models proposed so far is applicable to so-called RAP-PCR (RNA fingerprinting by arbitrarily primed PCR), where two arbitrary primers are combined with each other (Welsh et al., 1992).
In the present study, we provide a statistical model for mRNA DDs as well as RAP-PCRs that was developed on experimental data of a combined DD/RAP experiment consisting of 28 arbitrarily primed PCRs. Different from previous models, the model presented here is based on experimental observations of redundantly amplified genes, provided that a finite number of expressed genes in any examined tissue will result in repeated amplification of the same gene after a sufficient number of DD cycles. It is shown that this approach provides valid predictions for mRNA DD or RAP experiments, e.g. the total number of expressed genes in the examined tissue, the number of differentially expressed genes, and the course of coverage and redundancy.
| 2 SYSTEMS AND METHODS |
|---|
|
|
|---|
2.1 cDNA preparation for DD
mRNA samples were isolated from organs of Corti of newborn (P0) and of rats on post natal day 14 (P14) using DYNABeads (Dynal, Oslo, Norway) according to the manufacturer's instructions. cDNA was synthesized by reverse transcription of
150 ng of P0 or P14 mRNA with Superscript II (Invitrogen, Karlsruhe, Germany) in supplemented buffer and 10 mM DTT, 1 mM dNTPs, 1 pmol oligo(dT)-primer and 0.3 µl RNasin (Promega, Madison, WI) for 2 h at 42°C. The integrity of the cDNA was confirmed by a PCR amplifying a fragment of GAPDH cDNA.
2.2 DD-PCR
Primer sequences were derived from Delta Differential Display Kit (BD Biosciences Clontech, Palo Alto, CA). Arbitrary P-primers consisted of a consensus sequence of 16 nt (5'-ATTAACCCTCACTAAA-3') and a 9 nt arbitrary sequence at the 3' end:
5'-TGCTGGGGA-3' (P1), 5'-TCGGTCATAG-3' (P2), 5'-TGCTGGTAG-3' (P4), 5'-GATCTGACTG-3' (P5), 5'-TGCTGGGTG-3' (P6), 5'-TGCTGTATG-3' (P7), 5'-TGTGGCAGG-3' (P9) and 5'-GCACCGTCC-3' (P10).
The consensus sequence of the T-primers consisted of 28 nt (5'-CATTATGCTGAGTGATATCTTTTTTTTT-3') followed by 5'-AG-3' (T3), 5'-CC-3' (T5) or 5'-GA-3' (T7) at the 3' end. These primers were combined for 28 DD-PCRs as indicated in Table 1. The differential display procedure was performed as recommended in the user manual (PT 1173-1) supplemented with the Delta Differential Display Kit (Clontech), that is based on a long-distance DD-PCR described in Diachenko et al. (1996). Briefly, DD-PCRs were carried out in duplicate with 5 pmol of each primer and cDNA corresponding to 2.4 ng mRNA (higher concentration) or 0.6 ng mRNA (lower concentration) in a total volume of 20 µl consisting of 1x Klen Taq PCRBuffer (Clontech), 50 µM dNTPs each, 50 nM [
-33P]dATP (2500 Ci/mmol, Amersham, Freiburg, Germany) and 0.4 µl Advantage Klen Taq Polymerase Mix (Clontech). The temperature cycle protocol comprised 5 min 94°C, 5 min 40°C, 5 min 68°C followed by two cycles with 1 min 94°C, 2 min 40°C and 5 min 68°C and 25 cycles with 45 s at 94°C, 45 s 40°C and 2 min 68°C and a final elongation step at 68°C for 7 min. The resulting DNA fragments were separated electrophoretically on a denaturing acrylamide gel and visualized through autoradiography.
|
2.3 Analysis of the DD
The number of bands in each fingerprint was counted visually at least three times using the mean value as the result. To reduce the number of false positives, only bands occurring in both duplicates were considered (Diachenko et al., 1996), and bands were regarded as differential only if both bands in the duplicate fingerprint were stronger than both bands in the other. Owing to our special interest in upregulated genes, only DNA bands with higher intensity at the later stage of development are indicated as differential; bands with higher intensity at the earlier stage are included in the non-differential bands. Differential bands were excised from the gel, reamplified and sequenced by cycle sequencing. If possible, reamplified PCR products were sequenced directly, otherwise the cDNA fragments were cloned in pBluescript SK, prior to sequencing. In the latter case, to ensure that the major cDNA of this band had been identified, the sequences of at least two individual clones needed to be identical. The sequences of the DD bands were checked for redundancy by comparing these sequences against each other and furthermore, against sequences in public databases to identify the fragments and thus uncover non-overlapping but redundant sequences.
Redundancy occurring within the same primer combination, in most cases visible as double or triple bands, was eliminated prior to statistical analysis. This accorded to 73 of the identified 296 DNA bands. The number of unidentified non-differential bands was multiplied with the ratio of 223/296, assuming the same proportion of double or triple bands within the non-differential bands.
2.4 Computer simulation
The computer simulated experiments were performed by generating pseudo-random numbers as realizations of the random variables that will be introduced in the next section. Based on the 90 combinations of T-primers (two-base anchored primers with 5'-Linker-T9 VV-3', where V is A, G or C) with 10 different P-primers and 45 combinations of P-primers with each other possible in a commercially available DD kit, we exemplarily simulated a DD experiment consisting of n=135 primer combinations.
| 3 ALGORITHM |
|---|
|
|
|---|
A DD experiment identified genes that were upregulated during the post natal development of the inner ear. Identity and significance of the differentially expressed genes for inner ear physiology will be published elsewhere (Reisinger et al., 2005), an extract of the autoradiography is available as Supplementary information at the Journal's web page. In this DD, a total of 1407 cDNA fragments was amplified, and 223 of these bands were found to be differential. Subsequent analysis by quantitative real-time PCR on independent RNA isolates confirmed that 93% of the identified genes were indeed differential, i.e. exhibiting at least 2-fold upregulation during inner ear maturation (data not shown). From these 223 differential bands 174 originated from different mRNA sequences, whereas 49 bands corresponded to redundantly amplified genes. Closer investigation of this redundancy showed no clear correlation between the frequency with which a gene was identified and the expression level quantified by real-time PCR (data not shown), as observed earlier (Wan et al., 1996). Furthermore, even though it has been described that fingerprints using the same two-base anchored oligo(dT) primer exhibit increased redundancy owing to mispriming effects (Liang et al., 1993), only two redundantly occurring genes were amplified with the same two-base anchored oligo(dT)-primer, and both did not originate from mispriming. Therefore, we consider the event of two primers matching a cDNA so that a 2501500 bp fragment is amplified as a random event with equal probability.
The number of redundant bands increased with increasing numbers of DD-PCRs performed. From this observation we hypothesized that the number of genes identified during the experiment will saturate, leading to the idea of using a detailed analysis of the redundancy to predict the total number of differentially expressed cDNAs and furthermore, the total number of cDNAs in the compared tissues. To this end, we developed a statistical model for the mRNA DD providing reliable answers of the following questions in statistical mean:
- What is the total number of differentially expressed genes?
- What is the total number of expressed genes?
- How does the redundancy, i.e. the number of repeatedly identified genes increase during the series of DD-PCRs?
- What is the development of the coverage?
- How many primer combinations are required to achieve a certain coverage?
For the following considerations, the total of fingerprints originating from the same primer combination are termed as subexperiment. To describe the statistical model for the mRNA DD, the subsequent notation is introduced:
![]() |
![]() |
i:
![]() |
![]() |
i(br, bu, bs) is the probability that the i-th subexperiment produces br redundant differential bands, bu non-redundant differential bands and bs non-differential bands. In the DD, the sample size yi is unknown, too. Therefore, we also have to model the sample size in each subexperiment. To that end let the random variable
- Yi denote the sample size of the ith subexperiment, i = 1, ..., n.
- (
i,
i): probability space with multinomial distribution,
- Y1, ..., Yn: independent and identically distributed.
- Y1, ..., Yn: independent and identically distributed.
With this model in hand we can derive other random variables of interest. We denote by
- Xi the number of differential bands in the ith subexperiment,
- Di the total number of differential bands after i subexperiments,
- Zi the total number of non-redundant differential bands after i subexperiments,
- Ui the number of new non-redundant differential bands in the i-th subexperiment,
- Ri the redundancy (i.e. the number of redundant bands) in the i-th subexperiment,
- Ci the coverage after i subexperiments.
- Di the total number of differential bands after i subexperiments,
![]() | (1) |
the experimental sample sizes and by
the estimator, we use
![]() | (2) |
![]() | (3) |
is a sample of X1, ..., Xi, i.e. the experimentally observed number of differential bands. Similarly, we get an estimator
for the total number of cDNAs by maximizing the function
![]() | (4) |
![]() | (5) |
![]() |
denotes the number of new non-redundant bands in the i-th subexperiment. Note that the numerical maximization of F(N) will not exhibit a maximum if no redundancy in the experimental data is provided. Using the estimators
and
one can derive estimators
,
,
,
and
for the total number of differential cDNAs N1, the total number of non-redundant differential bands Zi, the number of new non-redundant differential bands Ui and the redundancy Ri in the i-th subexperiment and finally for the coverage Ci after i subexperiments (for details see the Appendix, Supplementary information).
![]() | (6) |
![]() | (7) |
![]() | (8) |
![]() | (9) |
![]() | (10) |
![]() | (11) |
| 4 IMPLEMENTATION |
|---|
|
|
|---|
The numbers of bands in total, differential and non-redundant bands in each of the 28 DD-PCRs as well as the applied primers are listed in Table 1. Note that in the first 12 subexperiments an arbitrary P-primer was combined with a T-primer (see Systems and Methods), whereas in the following subexperiments two P-primers were combined (RAP-PCRs). Using the data in Table 1, estimators for all parameters may be calculated.
First, according to Equation (2) the mean sample size is given as
![]() |
and the RAP-PCRs (1328) with
. In the following considerations, the mean value of all 28 subexperiments was used. Next, the maximum-likelihood estimator for p is calculated after any subexperiment according to Equation (3) and given in the last column of Table 1. Figure 1A illustrates the convergence of the graph for p with an increasing number of subexperiments. The most accurate estimator for p is the last one,
![]() |
increases with an increasing number of subexperiments. More precisely, the length of the confidence interval for
is proportional to
with t being the summed sample size.
|
According to Equation (4), to calculate
the function F(N) has to be maximized, which is the product of the densities fi(N)(
). Therefore, the data of Table 1 are implemented in fi(N)(k). For instance, for every
we have
![]() |
![]() |
with an increasing number of subexperiments. Hence, approximately 2800 different cDNA transcripts may be amplified under standard PCR conditions from the rat organ of Corti. Using Equation (6) leads to an estimated number of upregulated genes of
![]() |
Combining the estimators for
,
and
and Equations (7) and (8), the expected number of differential bands as well as the number of non-redundant differential bands
after i subexperiments can be computed, as illustrated in Figure 2A, and at an expanded scale (subexperiments 128) in Figure 2B, here complemented by the experimental data. To verify the predictions of the model, we performed computer simulated experiments by generating pseudo-random numbers for y and x whose distributions are given by the corresponding random variables Yi and Xi and determined by the calculated estimators
and
. The respective graphs (gray lines in Fig. 2) represent a computed DD experiment using the same estimators for
,
and
as in the experimental DD. These simulated experiments correlate with the calculated course for zi, ci and di (Fig. 2A). The graphs of the simulated experiments and of the experimental data show similar features, both supporting the suitability of the statistical model (Fig. 2B).
|
The model predictions were further evaluated by training the model with the first 1427 subexperiments and predicting the number of differential and non-redundant differential bands after 28 experiments (Fig. 3, gray dots). Although the number of training experiments is rather small, the predictions for d28 and z28 are in fair agreement with the experimentally obtained values. Another validation method that shows an even better match is obtained by performing a sort of resampling, also shown in Figure 3. For this purpose, 20 sets of subexperiments were chosen randomly without replacement and the resulting predictions for d28 and z28 were averaged. For example, the averaged predictions of 20 sets comprised 20 subexperiments each leading to a total number of expected differential genes to 226, compared with 223 experimentally obtained differential bands. Similarly, the averaged predictions for z28 using the data of 20 subexperiments resulted in 173 non-redundant differential bands to be expected after 28 subexperiments, what is in agreement with the experimentally observed 174 bands. The predictions of the resampling procedure improve with a growing number of considered subexperiments, as indicated by decreasing error intervals shown in Figure 3.
|
Implementation of the model requires redundancy to occur. Equation (10) estimates the number of redundant bands that may be expected in any subexperiment. Setting
leads to
![]() |
The statistical model allows further predictions on the coverage of DD experiments: according to Equation (11), the coverage after any number of subexperiments can be estimated. Theoretically, 28 subexperiments lead to a coverage of
![]() |
![]() | (12) |
Finally, the subexperiment where the number of redundant bands equals the non-redundant ones can be estimated by setting
, resulting in
![]() |
| 5 DISCUSSION |
|---|
|
|
|---|
Since the original publication of the mRNA DD by Liang and Pardee (1992), the method has systematically been improved (Liang et al., 1993, 1994; Mou et al., 1994) and is now available as a commercial set of primer sequences and a reliable description of the experimental procedure. Starting from such a DD experiment on the developing inner ear, we analyzed the frequency with which individual sequences were amplified within the first 28 subexperiments. The observation that redundancy increased with an increasing number of subexperiments was used for the establishment of a statistical model that allows predictions for further DD-PCRs. The resulting model enables one to estimate the total number of genes expressed in the examined tissue, the number of differentially expressed genes and the coverage in dependence of the number of applied primer combinations. The model is strongly supported by our experimental data, the validation of the predictions by the resampling procedure and the results of computer simulated experiments. These validations also show a reasonable robustness of the predictions toward small changes in the dataset.
Previous mathematical models relied on theoretical predictions of the hybridization of arbitrary P-primers and T-primers with cDNA molecules. The first estimations on how many primer combinations are required to cover the cellular mRNAs (Liang and Pardee, 1992; Bauer et al., 1993) based on the assumption that higher organisms express
100 000 genes and individual cells make use of 15% of these. This assumption might easily be corrected using the most recent approximations for the number of expressed genes from the International Human Genome Sequencing Consortium (IHGSC, 2004). However, their predictions on the DD coverage underestimate the required primer combinations as they do not account for the possibility that more than one cDNA fragment of the same gene might be amplified. A more detailed study of the statistics of primer binding of Yang and Liang (2004) predicted 160 arbitrary primers combined with 3 one-base anchored T-primers will amplify a cDNA of 600 bp length with a probability of 93%. The resulting 480 primer combinations for 93% coverage are obviously higher than the 167 primer combinations leading to 93% coverage estimated from Equation (12) in our model (using
). This difference might be explained by the fact that hybridization-based models rely on rough estimations of parameter values that are at least in part not accessible to experimental tests. Moreover, the predictions vary over a broad range if small changes in parameter values as length of the cDNAs or number of base matches between arbitrary primers and cDNA are assumed. Since all parameters in our model are calculated from experimental data, this model is supposed to give more accurate and reliable predictions than other mathematical models describing DD experiments. Furthermore, our statistical model is the first suitable model for experiments using combinations of internally binding arbitrary primers (RAP-PCRs). Another possible application are DD experiments comparing more than two samples, as these can be regarded as a set of one-by-one comparisons (e.g. bands that arise from only one sample but not from all others), and thus each set can be analyzed separately. Moreover, the model predictions are not affected by the use of total RNA instead of mRNA, since most mRNA preparations are not completely free of rRNA. Taken together, the major benefit of the mathematical model proposed here relies on the parameters that are estimated from experimental data, making this model widely applicable. When adopted to a new dataset, the reliability of the model predictions can be verified by the provided validation methods (as in Figs 1A, C and 3).
The usefullness of this model is demonstrated by implementation on a DD comparing two developmental stages of the organ of Corti, a rather small and homogeneous tissue of neuroepithelial origin. The considerable proportion of 15% of upregulated genes reflects intense morphological and physiological maturation processes that take place during inner ear development (Pujol et al., 1997). The calculated 2800 genes predict
1114% of the protein encoding genes of higher organisms to be expressed in the mammalian organ of Corti tissue (IHGSC, 2004). For two reasons, however, we expect that the total number of expressed genes in the organ of Corti is somewhat higher: first, a few cDNA fragments were amplified with only one (mostly arbitrary) primer binding in sense and antisense direction. Hence, the use of the same primer in different combinations can lead to a redundant amplification that is not truly independent, resulting in a small overestimation of redundancy. Consequently, the most accurate application for the model proposed here would be an experiment using every primer only once. Second, it should be kept in mind that not every DNA sequence is successfully amplified in a standard PCR, and an efficient amplification of sequences with high GC content requires the supplementation of DMSO or other hydrogen bond weakeners (Arezi et al., 2003). As our DD was performed under standard PCR conditions, we conclude that a series of cDNA sequences, difficult to amplify, is not accounted for the here proposed estimation on the total number of expressed genes in the inner ear and therefore, for the number of differentially expressed genes. However, according to the fact that such sequences will not be amplified with an increasing number of primer combinations, the coverage calculated from our statistical model respects only genes available under standard PCR conditions.
The statistical model proposed in this study is applicable to any DD experiment after the analysis of a sample of subexperiments as soon as substantial redundancy is observed. Hence for any examined tissue, a DD experiment will be suitable to determine the number of cDNA molecules that are amplified under standard PCR conditions and thus the number of expressed genes might be estimated. Although, to get reliable results it is necessary to avoid redundancy originating from mispriming of the one- or two-base anchored oligo(dT) primers described in Liang et al. (1993). This mispriming can be overcome if the reverse transcription reaction (RT) is performed with non-anchored oligo(dT) primers, and anchored primers are used during the PCR only (Rothschild et al., 1997; Frost and Guggenheim, 1999). The increased priming specificity might result from the higher annealing temperature for PCR than during RT.
| SUPPLEMENTARY DATA |
|---|
|
|
|---|
Autoradiography of the DD and Mathematical Appendix are available on Bioinformatics online.
| Acknowledgments |
|---|
We thank Dr Bernd Fakler for helpful the discussions and for critically reading the manuscript. This work was supported by DFG-Graduiertenkolleg 843.
Conflict of Interest: none declared.
Received on June 30, 2005; revised on August 9, 2005; accepted on August 16, 2005
| REFERENCES |
|---|
|
|
|---|
Arezi, B., et al. (2003) Amplification efficiency of thermostable DNA polymerases. Analyt. Biochem., 321, 226235.
Bauer, D., et al. (1993) Identification of differentially expressed mRNA species by an improved display technique (DDRTPCR). Nucleic Acids Res., 21, 42724280
Diachenko, L.B., et al. (1996) Combining the technique of RNA fingerprinting and differential display to obtain differentially expressed mRNA. Biochem. Biophys. Res. Commun., 219, 824828[CrossRef][ISI][Medline].
Frost, M.R. and Guggenheim, J.A. (1999) Mammalian polyadenylation sites: implications for differential display. Nucleic Acids Res., 27, 13861391
International Human Genome Sequencing Consortium (IHGSC). (2004) Finishing the euchromatic sequence of the human genome. Nature, 431, 931945[CrossRef][Medline].
Liang, P., et al. (1993) Distribution and cloning of eukaryotic mRNAs by means of differential display: refinements and optimization. Nucleic Acids Res., 21, 32963275.
Liang, P. and Pardee, A.B. (1992) Differential display of eukaryotic messenger RNA by means of the polymerase chain reaction. Science, 257, 967971
Liang, P., et al. (1994) Differential display using one-base anchored oligo-dT primers. Nucleic Acids Res., 22, 57635764
McClelland, M., et al. (1995) RNA fingerprinting and differential display using arbitrarily primed PCR. Trends Genet., 11, 242246[CrossRef][ISI][Medline].
Mou, L., et al. (1994) Improvements to the differential display method for gene analysis. Biochem. Biophys. Res. Commun., 199, 564569[CrossRef][ISI][Medline].
Reisinger, E., et al. (2005) Gene expression associated with the onset of hearing detected by mRNA differential display in mammalian organ of Corti. in press.
Rothschild, C.B., et al. (1997) DD/AP-PCR: combination of differential display and arbitrarily primed PCR of oligo(dT) cDNA. Analyt. Biochem., 245, 4854.
Pujol, R., Lavigne-Rebillard, M., Lenoir, M. In Rubel, E.W., Popper, A.W., Fay, R.R. (Eds.). Development of the Auditory System, (1997) , NY Springer, pp. 146192.
Wan, J.S., et al. (1996) Cloning differenitally expressed mRNAs. Nat. Biotechnol., 14, 16851691[CrossRef][ISI][Medline].
Welsh, J., et al. (1992) Arbitrarily primed PCR fingerprinting of RNA. Nucleic Acids Res., 20, 49654970
Yang, S. and Liang, P. (2004) Global analysis of gene expression by differential display: a mathematical model. Mol. Biotechnol., 27, 197208[CrossRef][ISI][Medline].
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



























