Bioinformatics Advance Access originally published online on August 12, 2004
Bioinformatics 2005 21(1):80-89; doi:10.1093/bioinformatics/bth472
Bioinformatics vol. 21 issue 1 © Oxford University Press 2005; all rights reserved.

A rapid method for computationally inferring transcriptome coverage and microarray sensitivity

A. Reverter*, S. M. McWilliam, W. Barris and B. P. Dalrymple

Bioinformatics Group, CSIRO Livestock Industries, Queensland Bioscience Precinct, 306 Carmody Road, St Lucia, QLD 4067, Australia

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: There are many different gene expression technologies, including cDNA and oligo-based microarrays, SAGE and MPSS. For each organism of interest, coverage of the transcriptome and the genome will be different. We address the question of what level of coverage is required to exploit the sensitivity of the different technologies, and what is the sensitivity of the different approaches in the experimental study.

Results: We estimate the transcriptome coverage by randomly sampling transcripts from a pre-defined tag-to-gene mapping function. For a given microarray experiment, we locate the thresholds in intensities that define the distribution of transcript abundance. These values are compared against the distribution obtained by applying the same thresholds to the intensities from differentially expressed genes. The ratio of these two distributions meets at the equilibrium defining sensitivity. We conclude that a collection of ~340 000 sequences is adequate for microarrays, but not large enough for maximum utilization of tag-based technologies. In the absence of large-scale sequencing, the majority of the tags detected by the latter approaches will remain unidentified until the genome sequence is available.

Contact: Tony.Reverter-Gomez{at}csiro.au


    1 INTRODUCTION
Sequencing technology allows researchers to acquire expressed sequence tags (ESTs) to characterize genes expressed in the organisms they study. However, to repeatedly sequence ESTs derived from the same gene is vastly inefficient. One goal of an EST sequencing project is to obtain as many distinct sequences as possible, with minimal redundancy. Knowing how many genes are represented in a library (the coverage of the transcriptome) before having sequenced them all can help to decide when to normalize a library, and can be used to compute the proportion of expected diversity that has been sampled, the library coverage (Hraber, 2001).

In many cases, EST libraries are constructed and sequenced to generate probes for use in microarray experiments. Owing to the low sensitivity of microarrays, the harder the transcripts from a particular gene are to isolate, the less likely it is that differential expression of this gene will be observed in an EST-based microarray experiment. As the size of public databases of ESTs continues to increase, individual groups will rely on sequences and clones from other groups for the construction of microarrays. How large does a collection of ESTs for a species have to be before a substantial proportion of the genes expressed above the sensitivity level of microarrays are likely to be represented in the collection?

Approaches such as massively parallel signature sequencing [MPSS; see Brenner et al. (2000) for a description of the technique] and SAGE (Velculescu et al., 1995) do not rely on the EST sequences for the detection of expression, but such a collection and/or a genome sequence (or a collection or genome from a closely related organism) is required for identification of the transcripts and hence genes from which the tag sequences were derived.

A reliable estimate of the number of genes represented in a transcriptome depends on the frequency distribution of transcripts. Although the true distribution is unknown, the results from expression assay studies (Lockhart and Winzeler, 2000) indicate that its general form is likely to vary across cell types. However, it has become apparent that the distribution associated with the relative frequencies of transcripts in a collection of cells is heavily skewed to the right (i.e. there are few frequent, and many rare classes). Skewness of this type is a characteristic of gene expression data (Velculescu et al., 1997, Kuznetsov, 2001, Kuznetsov et al., 2002, Morris et al., 2003, Ueda et al., 2004). The generalized discrete Pareto (GDP) model proposed by Kuznetsov et al. (2002) and given by:


$$ f(m) = z^{-1} (m + b)^{-(k+1)}, \qquad m = 1, 2, \ldots $$

where f(m) is the probability that a randomly chosen tag (representing a gene) occurs m times in the library, was found to fit all large-scale gene expression datasets in yeast, mouse and human. However, the parameters of the GDP model (the normalizing factor z, the skewness k and the deviation b) showed a dependence on sample size, eukaryote species and method used to generate the library.

Jongeneel et al. (2003) recently reported the empirical distribution of transcript abundance in human cell lines with MPSS. Interestingly, this function collapses into the same curve found to fit the experimental noise of microarray experiments (Tu et al., 2002) and given by:


$$ f(x) = \exp\!\left( -\frac{2x^{2}}{1 + x} \right) \qquad (1) $$

where x is the base-10 logarithm of tag abundance in transcripts per million (tpm).

Table 1 illustrates this phenomenon with the distribution of tag abundance and microarray noise. This coincidence provides further evidence of the existence of a universal distribution associated with gene expression (Ueda et al., 2004) and is used in this study to propose a simple method for computationally inferring the amount of transcriptome coverage and microarray sensitivity. Our approach to computing transcriptome coverage uses stochastic simulation to sample transcripts from a pre-defined tag-to-gene mapping function.


Table 1 Distribution of tag abundance from MPSS and microarray noise

 
Sensitivity is defined as an index of the performance of a diagnostic test, and calculated from the conditional probability of having a positive test result given having the disease (Everitt, 2002). However, discussions on sensitivity in microarray experiments are often confounded as this feature can be defined in more than one way (Kane et al., 2000, Lemon et al., 2003, Zien et al., 2003, Pepe et al., 2003). We define sensitivity not directly in terms of statistical confidence (e.g. the ability to detect a pre-specified fold-change), but rather by the minimum detectable concentration (Brown et al., 1996, O'Malley and Deely, 2003). We then ascertain when the probability of erroneously detecting a differential gene expression (Type I error) equals that of not detecting a genuine differential gene expression (Type II error). The appeal of this method is its efficiency, as it does not require the development of an expensive, time-consuming validation trial based on spike-in PCR probes.

We have used the transcriptome of Bos taurus, an example of a species for which a genome sequence will soon be available. The results of the simulation agree well with the coverage of the B.taurus transcriptome predicted from comparisons with the human transcriptome.


    2 INFERRING TRANSCRIPTOME COVERAGE
We propose to compute transcriptome coverage by organizing the total number of genes (a pre-defined parameter) in a given genome into the assumed distribution of transcript abundance. We generate as many transcripts as classes of genes (classes according to abundance and/or amount of expression). An EST library of the desired size is simulated by sampling without replacement from the available transcripts, and the total number of genes captured is then recorded.

We urge the reader to recall Table 1 where the similarity between the distribution of tag abundance from MPSS and that of microarray noise is shown. The former originates from Jongeneel et al. (2003) plus two MPSS datasets supplied to us by Lynx Therapeutics, Inc. (http://www.lynxgen.com/) each containing 25 503 unique tags. Microarray noise was characterized and found to collapse onto Equation (1) by Tu et al. (2002) after designing sets of replicate experiments.

In order to establish the number of genes and transcripts in each threshold, the following four-step initialization routine is required prior to the computation of the amount of transcriptome coverage:

Step 1. Declare N_g, the total number of genes in the genome (for instance, N_g = 30 000).
Step 2. Define the nine thresholds of transcript abundance (a_i) as 1, 5, 10, 50, 100, 500, 1000, 5000 and 10 000 (i.e. first column of Table 1).
Step 3. Compute g_i, the number of genes in each threshold, from:

$$ g_i = N_g \left[ f(x_i) - f(x_{i+1}) \right], \qquad i = 1, \ldots, 9, $$

where x_i = log10(a_i), f(x_i) is given in Equation (1) and re-defined here as the proportion of transcripts with concentration ≥a_i tpm, and f(x_10) is taken as zero so that the ninth class is open-ended.
Step 4. Compute N_t, the total number of transcripts, from: $N_t = \sum_{i=1}^{9} g_i a_i$.

From the above routine, two issues are worth noting: first, the parameter g_i indicates the number of genes with an abundance ≥a_i but <a_{i+1}. Second, the resulting total number of transcripts (N_t) depends on the pre-specified total number of genes in the genome (N_g) as well as on the assumed distribution of genes [f(x_i)]. As such, N_t can be greater than a million. This has particular relevance as transcript concentration is usually measured in tpm. Setting N_g = 30 000 and the assumed distribution f(x_i) in (1) produces N_t = 1 395 316 and g_i = 13 121, 5843, 7503, 1449, 1500, 251, 245, 39 and 49 for i = 1 to 9, respectively.
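The four-step routine can be sketched in Python as follows. This is a minimal illustration, not the authors' program: the closed form used for f(x) is an assumption chosen because it reproduces the gene counts quoted above, and the rounding scheme (round the first eight classes, assign the remainder to the ninth so that the counts sum to N_g) is likewise an assumption. The function names are hypothetical.

```python
import math

# Step 2: thresholds of transcript abundance a_i, in tpm
ABUNDANCE = [1, 5, 10, 50, 100, 500, 1000, 5000, 10000]


def f(x):
    """Assumed form of Equation (1): proportion at or above 10**x tpm."""
    return math.exp(-2.0 * x * x / (1.0 + x))


def genes_per_threshold(n_genes, abundance=ABUNDANCE):
    """Step 3: number of genes g_i falling in each abundance class."""
    cum = [f(math.log10(a)) for a in abundance]
    # Round classes 1..8; the open-ended top class takes the remainder
    # (assumption) so that the counts add up to n_genes exactly.
    g = [round(n_genes * (cum[i] - cum[i + 1])) for i in range(len(cum) - 1)]
    g.append(n_genes - sum(g))
    return g


def total_transcripts(g, abundance=ABUNDANCE):
    """Step 4: N_t, with every gene in class i contributing a_i transcripts."""
    return sum(g_i * a_i for g_i, a_i in zip(g, abundance))


g = genes_per_threshold(30000)   # Step 1: N_g = 30 000
n_t = total_transcripts(g)
print(g, n_t)
```

With N_g = 30 000 the sketch returns the g_i and N_t values given in the text, which is the only check made on the assumed form of f(x).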

Next, we declare variable clone, an array of length N_t with four arguments to indicate:

  1. the category or threshold (from 1 to 9) of abundance where it derives;
  2. the gene (from 1 to N_g) of origin;
  3. a found dummy flag to indicate whether or not it is contained in the EST library under study; and
  4. an inform dummy flag to indicate whether or not it is informative (i.e. it contains the genuine tag for the transcript and thus allows for the proper tag-to-transcript, and hence gene mapping, if the transcript has been annotated).

Because ESTs frequently do not include the 3' end of the transcript, not all sequences are informative for the MPSS or SAGE tag for a particular gene. In our study, we explored four levels of tag ‘informativeness’: 10, 20, 40 and 100%. A symbolic representation (FORTRAN-like pseudocode) of the repeated execution (loop) of statements that was used to initialize variable clone follows:

geneid = 0                          !Counter for genes
index = 0                           !Counter for transcripts
DO i = 1, 9                         !Loop for thresholds
  DO j = 1, g(i)                    !Loop for genes
    geneid = geneid + 1
    DO k = 1, a(i)                  !Loop for abundance
      index = index + 1
      clone(index)%category = i
      clone(index)%gene = geneid
    ENDDO
  ENDDO
ENDDO

In the above algorithm, variables geneid and index are temporary counters that go from 1 to N_g, and from 1 to N_t, respectively. Variable clone maps each transcript with its corresponding gene of origin. The final step involves sampling clones without replacement as many times as the size of the EST library (N_EST) under scrutiny. The assumption is made that N_EST < N_t. However, this assumption can be relaxed by either sampling with replacement or by increasing the proportion of non-informative transcripts.
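The sampling step can be sketched in Python as follows. This is a simplified illustration rather than the authors' implementation: instead of storing an N_t-long clone array it keeps only the cumulative transcript index at which each gene's block ends and locates the gene of a sampled transcript by binary search. The function names, the seed argument and the use of a single informativeness probability per draw are assumptions made for the example; the a_i and g_i values are those quoted above for N_g = 30 000.

```python
import bisect
import random
from itertools import accumulate

A = [1, 5, 10, 50, 100, 500, 1000, 5000, 10000]            # a_i (tpm)
G = [13121, 5843, 7503, 1449, 1500, 251, 245, 39, 49]      # g_i for N_g = 30 000


def build_gene_offsets(a, g):
    """Cumulative transcript count after each gene (same order as the DO loops)."""
    per_gene = [a_i for a_i, g_i in zip(a, g) for _ in range(g_i)]
    return list(accumulate(per_gene))


def simulate_coverage(offsets, n_est, informative=1.0, seed=0):
    """Sample n_est transcripts without replacement.

    Returns (genes found, genes with at least one informative EST).
    """
    rng = random.Random(seed)
    n_t = offsets[-1]
    found, informed = set(), set()
    for idx in rng.sample(range(n_t), n_est):
        gene = bisect.bisect_right(offsets, idx)   # 0-based gene of origin
        found.add(gene)
        if rng.random() < informative:             # EST carries the genuine tag
            informed.add(gene)
    return len(found), len(informed)


offsets = build_gene_offsets(A, G)
for level in (1.0, 0.4, 0.2, 0.1):
    print(level, simulate_coverage(offsets, 62500, informative=level))
```

Because the draw is stochastic, the counts vary with the seed; averaging over several seeds would give a smoother estimate.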

As each clone is sampled, a uniform random number from the [0,1] interval is drawn and compared against the set probability of the EST being informative. The scrutiny of sampled clones allows the direct computation of the transcriptome coverage in the given library. Table 2 presents the predicted number of genes to be represented, or with informative tags, in EST collections of five different sizes from 62 500 to 500 000. These figures are compared against those obtained from the coverage of similar sized EST collections derived from the cattle transcriptome.

The figures for the cattle transcriptome were determined using the following procedure. The set of ~325 000 cattle EST and mRNA sequences available on GenBank in July 2003 was downloaded. Sequences were screened for low-quality sequences and putative chimeras (Hawken et al., 2004). These sequences were removed from the collection and the first 62 500, 125 000 and 250 000 sequences, taken in sequential order by GenBank gi number, were clustered using stackPACK (Miller et al., 1999). The set of consensus sequences and singletons was then BLASTed against a set of 21 786 virtual mRNAs constructed from the exons of human Ensembl genes downloaded on January 20, 2004 using EnsMart. The exons for all coding transcripts derived from the Ensembl genes were sorted and concatenated to produce a single record for each gene containing all exons present in the coding transcripts derived from that gene. The BLAST searches were undertaken with the default parameters and a cut-off of 1 × 10^-10. A modified reciprocal best BLAST hit procedure was developed to allow for the presence of multiple bovine consensus sequences potentially derived from the same gene. For a reciprocal best hit, the top hit to a human virtual mRNA for each cattle sequence must also be the highest scoring human hit to the cattle sequence. The number of human virtual mRNAs with a reciprocal cattle best hit was then counted to identify the proportion of the probable cattle transcriptome covered by the collection of ESTs and mRNAs. The coverage lies between the number of hits counted and the number of hits scaled to a nominal genome size of 30 000 protein coding genes for humans.
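The counting step can be illustrated with a plain reciprocal best hit filter in Python. Note that this sketch implements the standard procedure, not the authors' modified variant that allows for several bovine consensus sequences per gene, whose details are not given here. The input format (triples of query, subject and score, parsed from BLAST output in both directions) and all function names are assumptions made for the example.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, score) triples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(cattle_vs_human, human_vs_cattle):
    """Pairs (cattle, human) that are each other's top-scoring hit."""
    forward = best_hits(cattle_vs_human)
    reverse = best_hits(human_vs_cattle)
    return {(c, h) for c, h in forward.items() if reverse.get(h) == c}


def coverage(pairs, n_human):
    """Fraction of human virtual mRNAs with a reciprocal cattle best hit."""
    return len({h for _, h in pairs}) / n_human


# Toy example with made-up identifiers and scores
cattle_vs_human = [("c1", "h1", 200), ("c1", "h2", 150),
                   ("c2", "h2", 90), ("c3", "h1", 120)]
human_vs_cattle = [("h1", "c1", 210), ("h1", "c3", 100),
                   ("h2", "c1", 140), ("h2", "c2", 95)]
pairs = reciprocal_best_hits(cattle_vs_human, human_vs_cattle)
print(pairs, coverage(pairs, n_human=2))
```

In the toy data only c1 and h1 are mutual best hits, so one of the two human records is counted as covered.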


Table 2 Number of genes predicted to be represented, or with informative tags, in EST collections of different sizes compared to the coverage of similar sized EST collections derived from the cattle and sheep transcriptomes

 
Interestingly, the transcriptome coverage of the bovine EST and mRNA sequences deposited in GenBank is close to the size predicted from the empirical distribution of transcripts from MPSS experiments. This indicates that the collective approaches of the different groups, using different samples and different cloning methodologies, appear to have produced the equivalent of a random sampling of the complete transcriptome of the cow. The results of the analysis of the complete cattle EST and mRNA sequence set are available at http://www.livestockgenomics.csiro.au.

Figure 1 shows the estimated proportion of transcriptome coverage by tag concentration in tpm for EST libraries of four sizes and four levels of informativeness. Even with as few as 10% of ESTs containing an informative tag, a randomly generated EST library of size 62 500 is estimated to contain all of the tags with a concentration ≥1000 tpm. However, according to Equation (1), such tags are expected to represent <1.2% of the whole transcriptome. At the other extreme, 90% of tags with a concentration ≥5 tpm (covering 56% of the transcriptome) are expected to be included in a fully informative EST library of size 500 000. This is clearly an unrealistically high level of informativeness for a multi-source collection of ESTs; thus very large collections of ESTs and/or very targeted libraries (3'-untranslated region directed) are required to be able to identify the gene of origin of all SAGE or MPSS tags. Indeed, even for humans, for which a very large EST collection has been built, the complete genome sequence still enabled a significant increase in the assignment of MPSS tags to be made (Jongeneel et al., 2003).

The NCBI SAGEmap (Lash et al., 2000) site predicts tags based on the clustering of ESTs and mRNAs in their Unigene database. For the B.taurus Unigene build 52, ~214 000 of the ~324 000 available sequences were included to generate 15 785 Unigene clusters. From these, SAGEmap predicted ~52 000 unique tags in ~53 500 tag-Unigene cluster pairs for the reliable set. However, this gives a ratio of 3.4 cluster-tag pairs per cluster, much greater than the ideal figure of 1. Taking a smaller cut at a reliability score ≥2 000 000 gave a set of 7697 cluster-tag pairs and 6394 clusters. A total of 66 641 sequences had contributed information to this set, giving a rate of informativeness for the complete set of B.taurus EST and mRNA sequences of ~20%.



Fig. 1 Estimated proportion of transcriptome coverage by tag concentration in tpm for four sizes of EST library (62 500, 125 000, 375 000 and 500 000) and four levels of informativeness (100, 40, 20 and 10%).

 
Two large- and one small-scale SAGE experiments using B.taurus transcripts have been described (Meissner et al., 2003, Neill and Ridpath, 2003, Berthier et al., 2003). The tag dataset is available for the first of these experiments, and two additional sets of data have been deposited in the NCBI Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/). The authors of these publications have tried to identify the genes from which these tags are derived, with varying degrees of reported success: from 400 of 5347 different tags (Neill and Ridpath, 2003), through ~2000 of each of two sets of 6634 and 10 476 different tags (Meissner et al., 2003), to 1306 of 2281 different tags (Berthier et al., 2003). However, the proportion of these assignments that are likely to be correct is not addressed.

The SAGE tag datasets from Meissner et al. (2003), and GSM3036 and GSM3037 from the GEO, were analysed in more detail using the latest available B.taurus Unigene and SAGEmap data. Our reduced set of 7697 highly reliable tags was used to assign likely identity to the sets of tags. Figure 2 presents the proportion of tags, annotated tags and reliably annotated tags by tag concentration for these four bovine SAGE libraries. Not unexpectedly, the curve for the proportion of total tags was very similar across the four datasets, and genes expressed at a high level were much more likely to be identifiable on the basis of their SAGE tag. For the γδ T-cells (Meissner et al., 2003), the transcript expression level with a ≥50% probability that the gene of origin could be reliably identified was ~1000 tpm, with a total of 7–14% of the tags sequenced identifiable. For the other two datasets, the level was ~100 tpm, with a total of 14.5–18% of the tags sequenced identifiable. These large differences presumably reflect the experimental methodology and the nature of the genes expressed in the different samples.



Fig. 2 Proportion of tags (PT), annotated tags (AT) and reliably annotated tags (RAT) by tag concentration in tpm for four bovine SAGE libraries (CD8POS, CD8NEG, GSM3036 and GSM3037) with varying number of tags and transcripts.

 

    3 INFERRING SENSITIVITY OF MICROARRAY EXPERIMENT
The basic premise underlying the use of microarrays is that the measured spot intensities are proportional to the abundance of the corresponding target genes in the original sample. The kinetics of hybridization is considered to be a second-order reaction. The higher the concentration of the probe, the higher the reannealing rate. Several studies have explored the relationship between signal intensity and transcript concentration (Chudin et al., 2001, Li and Wong, 2001, Dudley et al., 2002, Dror et al., 2003, Wang et al., 2003). Additionally, the correlation between intensity measurements and tag counts resulting from SAGE has been reported to be 0.82 by Ishii et al. (2000). However, the inferential validity of these studies is based on a set of controlled spike experiments at known dilution levels. The development of controlled spike experiments, although required to fully characterize the sensitivity and power of a given experiment, is a costly process and its application is limited to the particular study under consideration.

Our approach to inferring microarray sensitivity is based on locating the minimum concentration of transcript at which the probability of a gene being at this or greater concentration equals the probability of this same gene being identified as differentially expressed (DE). Hence, sensitivity is defined by the minimum detectable concentration at which the probability of erroneously detecting a differential gene expression equals that of not detecting a genuine differential gene expression. The rationale behind our approach is the concept of ‘equilibrium’ and the inspiration comes from the well-known concept of ‘price at market equilibrium’ from economic theory. Equilibrium price is the price that equates the quantity demanded and quantity supplied. In a market graph, the equilibrium price is found at the intersection of the demand curve and the supply curve (Nicholson, 1985). Our criterion for sensitivity bears some similarity to the use of the area under the receiver operating characteristic (ROC) curve, widely used in medical screening models (Hanley and McNeil, 1982; Heagerty et al., 2000) and more recently within the context of differential gene expression (Bickel, 2004; Li and Gui, 2004). The originality of the proposed method is in the formalization of the tag-to-gene mapping function and in taking account of the prevalence of DE genes at a particular concentration.

For a given microarray experiment, the thresholds in gene expression intensity readings that define the distribution of transcript abundance [and assumed to be given by Equation (1)] are recorded. These values are compared against the distribution obtained by applying the same thresholds to the expression intensity readings observed for those genes found to be DE. The ratio of these two distributions meets at the equilibrium defining the sensitivity of the experiment. The processing can be described by the following FORTRAN-like pseudocode:

!gene(1:ntot) is assumed sorted by increasing intensity;
!cat(i) is taken to be the percentage of genes at or above threshold i
i = 1
cat_nde(i) = nde                                  !For each category compute N and
cat_pde(i) = 100.0 * nde/ntot                     !Proportion of DE Genes
DO i = 2, 9
  j = ntot - int(ntot*cat(i)/100.00)              !Pointer location of threshold
  m = 0                                           !Counter for DE genes found so far
  DO k = 1, ntot
    IF( gene(k)%deflag > 0 )THEN                  !This gene is DE
      m = m + 1
      IF( gene(k)%intens > int(gene(j)%intens) )THEN
        cat_nde(i) = nde - m + 1
        cat_pde(i) = 100.0*(cat_nde(i)/(ntot*(cat(i)/100.0)))
        EXIT                                      !Look no further
      ENDIF
    ENDIF
  ENDDO
  WRITE(10,1000)i,cat(i),100.0*cat_nde(i)/nde,cat_pde(i)
ENDDO

Figure 3 shows a schematic representation of the inferential validity of the proposed method. There exists a distribution of all the genes (N_t) in the microarray experiment under study, defined by f(x_t) in (1) (note that here subscript t indicates ‘total’ genes). Similarly, there exists a distribution associated only with those genes (N_d) found to be DE and defined as f(x_d) (subscript d indicates ‘differentially expressed’ genes), which may or may not be equal to f(x_t). With equality, the ratio of both distributions is given by a flat line through the N_d/N_t y-coordinate. At the upper bound, this line collapses to either one or zero depending on whether or not the gene (or genes) with maximum intensity (and thus concentration) was (or were) identified as DE. The point (concentration) at which the ratio of both distributions intercepts with f(x_t) defines the sensitivity of the microarray experiment. The probability of any given gene being at this or higher concentration equals the probability of this same gene being DE. Thus, at this concentration, we can make an error either way (Type I or Type II) with identical probability. This equality defines the sensitivity of the experiment from the minimum concentration of transcript that can be ‘reliably’ measured. Below this concentration, Type I Error is less than Type II Error. A relatively high confidence is achieved for the few genes that are identified as being DE at concentrations below the sensitivity of the overall experiment. Equivalently, above this concentration (where the vast majority of DE elements are likely to be identified), Type I Error surpasses Type II Error and the power of the test is elevated at the expense of a relatively high proportion of false positives.
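The equilibrium criterion can be sketched in Python as follows. This is a simplified reading of the procedure, not the authors' program: genes are ranked by intensity, the top f(x_i) fraction is taken as lying at or above threshold a_i, and the sensitivity is reported as the first tabulated threshold at which the proportion of DE genes among them meets or exceeds f(x_i). Reporting a tabulated threshold rather than an interpolated crossing point, the closed form used for f(x) and all function names are assumptions made for the example.

```python
import math

ABUNDANCE = [1, 5, 10, 50, 100, 500, 1000, 5000, 10000]  # a_i (tpm)


def f(x):
    """Assumed form of Equation (1): proportion at or above 10**x tpm."""
    return math.exp(-2.0 * x * x / (1.0 + x))


def sensitivity_table(intensity, is_de):
    """For each threshold return (a_i, f(x_i), DE proportion above threshold)."""
    order = sorted(range(len(intensity)), key=lambda k: intensity[k], reverse=True)
    n_tot = len(order)
    rows = []
    for a in ABUNDANCE:
        frac = f(math.log10(a))
        n_above = max(1, round(n_tot * frac))      # genes at or above a_i
        n_de = sum(1 for k in order[:n_above] if is_de[k])
        rows.append((a, frac, n_de / n_above))
    return rows


def crossing(rows):
    """First threshold where the DE proportion reaches f(x); None if it never does."""
    for a, frac, ratio in rows:
        if ratio >= frac:
            return a
    return None


# Toy data: 1000 genes, the 50 most intense ones flagged as DE
intensity = [float(v) for v in range(1, 1001)]
is_de = [k >= 950 for k in range(1000)]
rows = sensitivity_table(intensity, is_de)
print(crossing(rows))
```

In the toy data the DE proportion first reaches f(x) at the 50 tpm threshold; with real data an interpolation between adjacent thresholds would give a finer estimate.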



Fig. 3 Schematic representation of the inferential validity of the proposed method for computationally inferring microarray sensitivity.

 
For illustration and validation purposes, we applied this approach to the MPSS data supplied to us by Lynx Therapeutics, Inc., as well as to the gene expression intensity data from four microarray experiments (unpublished data) from our laboratory dealing with genetic improvement of livestock. Figure 4 presents the design configuration for these four experiments. Also, the publicly available microarray gene expression intensity data from Callow et al. (2000) with 16 arrays and 6384 mouse genes, and that of Lin et al. (2002) with 2 arrays and 27 007 human genes were subjected to our sensitivity algorithm.



Fig. 4 Design configuration for the four microarray experiments performed in our laboratory and for which sensitivity will be assessed. The direction of the arrows indicates the labelling from red to green fluorescent dyes. Experiments 1, 2 and 3 explore the effect of diet quality, breed of cattle and in vitro adipogenesis, respectively. The three are performed using the same cDNA microarray platform containing 9600 elements printed in duplicate. The fourth experiment compares bacterial growth in two conditions, agar and broth, and was performed using a boutique array with 2116 spots representing 132 genes.

 
Three of the four experiments from our laboratory were performed using the same cDNA microarray platform containing 9600 elements printed in duplicate. Preliminary analyses revealed the presence of 450, 387 and 361 DE genes in experiments 1 (EXP1), 2 (EXP2) and 3 (EXP3), respectively. The fourth experiment from our laboratory deals with a ‘boutique’ gene-expression microarray study with 13 slides with 2116 spots each and representing 132 candidate genes from Mycobacterium avium ss. avium. The analysis of this experiment identified 47 genes that were DE across two growth-condition treatments.

In order to provide a detailed account of the mechanics of the proposed procedure, and for the case of EXP1, Table 3 presents the distribution of all the genes [f(x_t), with N_t = 7638] as well as DE genes [f(x_d), with N_d = 450]. Across N_t, the minimum average intensity was 84.1. At the second threshold, 56.19% of N_t and 83.78% of N_d showed an average intensity ≥720.0. At the third threshold, 36.79% of N_t and 62.67% of N_d showed an average intensity ≥1376.2. And so on, until the ninth threshold, where 0.16% of N_t and none of N_d showed an average intensity ≥31 375.8. The last column in Table 3 presents the ratio of the two distributions [f(x_d)/f(x_t)]. For instance, at the fifth threshold, corresponding to 100 tpm, this ratio equals 14.51%, which originates from 77 DE genes (or 17.11% of N_d) over 531 total genes (or 6.95% of 7638).


Table 3 Distribution of all of the genes [f(x_t)] and of differentially expressed genes [f(x_d)] in EXP1 from our laboratory by abundance (tpm) and intensity thresholds

 
Figure 5 provides a pictorial representation of the sensitivity analysis for all the experiments. The sensitivity for the MPSS test dataset was estimated at 7 tpm, in close agreement with the manufacturer's claim of 5 tpm (available at http://www.lynxgen.com). A sensitivity of 45 tpm was estimated for the B.taurus SAGE experiment of Meissner et al. (2003). The experiment using our ‘boutique’ array was estimated to yield a sensitivity of ~30 tpm. This low estimate (and thus, ‘good’ sensitivity) should be taken with caution as only 132 and 47 genes were used to evaluate f(x_t) and f(x_d), respectively. However, it could also be a true value of sensitivity because experiments involving ‘boutique’ arrays are usually performed in more controlled conditions. The sensitivity of the studies of Callow et al. (2000) and Lin et al. (2002) was estimated to be 250 and 55 tpm, respectively. This discrepancy was attributed to the latter experiment being more recent and using fewer array slides. Finally, EXP1, EXP2 and EXP3 from our laboratory were estimated to have a sensitivity of 100, 280 and 400 tpm, respectively.



Fig. 5 Sensitivity, in tpm, for MPSS test data, a bovine SAGE experiment (Meissner et al., 2003) and six gene expression studies. Sensitivity is estimated at the point where the proportion of differentially expressed elements intercepts with the curve defining the distribution of transcript abundance [f(x)].

 

    4 DISCUSSION
Using the method presented in this paper to estimate the sensitivity of cDNA-based microarray experiments, a wide range of values was obtained from the variety of datasets analysed. Taking three independent experiments conducted using identical slides, a 4-fold difference in sensitivity between the best and worst experiments was observed. Clearly, the sensitivity of a particular experiment depends on the skill of the scientist and on the nature, quality and number of samples being used. Our analyses confirm previous experimental approaches, which demonstrated that the sensitivity of long oligo-based microarrays is similar to, if not better than, the sensitivity of cDNA microarrays (Kane et al., 2000; Lee et al., 2004). A random (from different tissues, without normalization, etc.) collection of 62 500 ESTs from an organism with 30 000 unique genes is likely to be large enough such that most genes for which differential expression can be detected by microarrays will be represented. Such a collection would include sequence data from around 10 000 genes. With normalized libraries and for tissue-specific experiments, smaller collections of ESTs will be required. Affymetrix short-oligo arrays have been reported to have sensitivities in the range 3–20 tpm (Chudin et al., 2001). Thus, substantially larger collections of ESTs are required for the representation of genes able to be detected using Affymetrix type arrays.

In the absence of a complete genome, tag sequences from MPSS and SAGE experiments can only be identified bioinformatically using EST and mRNA collections, preferably after clustering. At the observed level of informativeness for SAGE tags, a collection of 62 500 ESTs is predicted to be inadequate to identify the gene of origin for tags except for very small SAGE experiments.

Given the higher sensitivity of MPSS versus microarray experiments and the lower level of informativeness of the clustered sequences for SAGE or MPSS tags, substantially larger libraries are required. Indeed, despite the availability of more than 4 million human EST sequences, the genome sequence enabled a 50% increase in the number of MPSS tags assigned to genes (Jongeneel et al., 2003).

Figure 5 also highlights the importance of making the correct choice of technology platform. The choice of a gene profiling technology finds its optimum at the point where the transcriptome coverage of the available library and its informativeness intersect with the sensitivity of the platform. For example, for B.taurus, given the current size of EST collections, the use of microarrays leads to a large technology gap, with a large number of annotated genes having expression levels below the sensitivity threshold of the microarray. In contrast, the use of MPSS and large SAGE experiments leads to an annotation gap, with a large number of unknown tags identified as DE. This gap can be bridged by extensive cloning and sequencing using primers based on tags. Currently, Affymetrix short-oligo-based chips appear to represent the best compromise between sensitivity, annotation and cost for the transcriptome coverage of B.taurus. In contrast, for sheep, which are closely related to cattle and for which there is a very small public EST collection and no genome sequence on the horizon, MPSS or SAGE with sequencing based on tag primers, or EST and long oligo-based microarrays based on B.taurus, appear to be the best technologies. With the advent of a complete genome sequence of the organism of interest, tag-based technologies will prevail.
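The distinction between the technology gap and the annotation gap can be made concrete with a toy calculation. The following is a minimal sketch under stated assumptions: the `Transcript` record, the `classify_gaps` function, the transcript names, their abundances and the two sensitivity thresholds are all hypothetical and were invented for illustration; none of them are data from this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    name: str
    abundance_tpm: float  # expression level in transcripts per million
    annotated: bool       # True if the EST/mRNA collection identifies it


def classify_gaps(transcripts, sensitivity_tpm):
    """Return (technology_gap, annotation_gap) as lists of transcript names.

    technology_gap: annotated transcripts expressed below the platform's
                    sensitivity, so they are known but cannot be detected.
    annotation_gap: transcripts the platform can detect but that the
                    sequence collection cannot assign to a gene.
    """
    technology_gap = [t.name for t in transcripts
                      if t.annotated and t.abundance_tpm < sensitivity_tpm]
    annotation_gap = [t.name for t in transcripts
                      if not t.annotated and t.abundance_tpm >= sensitivity_tpm]
    return technology_gap, annotation_gap


# HYPOTHETICAL toy transcriptome: names and abundances are invented.
TOY = [
    Transcript("gene_a", 500.0, True),
    Transcript("gene_b", 40.0, True),
    Transcript("gene_c", 8.0, True),
    Transcript("tag_x", 12.0, False),
    Transcript("tag_y", 2.0, False),
]

if __name__ == "__main__":
    # The thresholds (in tpm) are illustrative, not measured sensitivities.
    for label, sensitivity in (("less sensitive platform", 50.0),
                               ("more sensitive platform", 1.0)):
        tech, annot = classify_gaps(TOY, sensitivity)
        print(f"{label}: technology gap = {tech}, annotation gap = {annot}")
```

In this toy setting, raising the sensitivity of the platform empties the technology gap but exposes tags that the sequence collection cannot identify, which mirrors the trade-off described above.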


    Acknowledgments
 
We would like to thank Dr Rachel Hawken for her expertise in the development of the clustering of cattle ESTs. We are also grateful to Dr Christian D. Haudenschild from Lynx Therapeutics, Inc. for providing us with the MPSS test data.

Received on May 9, 2004; revised on July 21, 2004; accepted on August 3, 2004

    REFERENCES
 

    Berthier, D., Quéré, R., Thevenon, S., Belemsaga, E., Piquemal, D., Marti, J., Maillard, J.-C. (2003) Serial analysis of gene expression (SAGE) in bovine trypanotolerance: preliminary results. Genet. Sel. Evol., 35, Suppl. 1, S35.

    Bickel, D.R. (2004) Degrees of differential gene expression: detecting biologically significant expression differences and estimating their magnitudes. Bioinformatics, 20, 682–688.

    Brenner, S., Johnson, M., Bridgham, J., Golda, G., Lloyd, D.H., Johnson, D., Luo, S., McCurdy, S., Foy, M., Ewan, M., et al. (2000) Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat. Biotechnol., 18, 630–634.

    Brown, E.N., McDermott, T.J., Bloch, K.J., McCollom, A.D. (1996) Defining the smallest analyte concentration an immunoassay can measure. Clinical Chem., 42, 893–903.

    Callow, M.J., Dudoit, S., Gong, E.L., Speed, T.P., Rubin, E.M. (2000) Microarray expression profiling identifies genes with altered expression in HDL-deficient mice. Genome Res., 10, 2022–2029.

    Chudin, E., Walker, R., Kosaka, A., Wu, S.X., Rabert, D., Chang, T.K., Kreder, D.E. (2001) Assessment of the relationship between signal intensities and transcript concentration for Affymetrix GeneChip® arrays. Genome Biol., 3, research0005.1–research0005.10.

    Dror, R.O., Murnick, J.G., Rinaldi, N.J., Marinescu, V.D., Rifkin, R.M., Young, R.A. (2003) Bayesian estimation of transcript level using a general model of array measurement noise. J. Comput. Biol., 10, 433–452.

    Dudley, A.M., Aach, J., Steffen, M.A., Church, G.M. (2002) Measuring absolute expression with microarrays with a calibrated reference sample and an extended signal intensity range. Proc. Natl Acad. Sci. USA, 99, 7554–7559.

    Everitt, B.S. (2002) The Cambridge Dictionary of Statistics, 2nd edn. Cambridge University Press, Cambridge, UK.

    Hanley, J.A. and McNeil, B.J. (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29–36.

    Hawken, R.J., Barris, W.C., McWilliam, S., Dalrymple, B.P. (2004) An Interactive Bovine In silico SNP database (IBISS). Mamm. Genome, 15, 819–827.

    Heagerty, P.J., Lumley, T., Pepe, M. (2000) Time dependent ROC curves for censored survival data and a diagnostic marker. Biometrics, 56, 337–344.

    Hraber, P.T. (2001) Discovering molecular mechanisms of mutualism with computational approaches to endosymbiosis. PhD Dissertation, University of New Mexico, Albuquerque, NM, USA.

    Ishii, M., Hashimoto, S., Tsutsumi, S., Wada, Y., Matsushima, K., Kodama, T., Aburatani, H. (2000) Direct comparison of GeneChip and SAGE on the quantitative accuracy in transcript profiling analysis. Genomics, 68, 136–143.

    Jongeneel, C.V., Iseli, C., Stevenson, B.J., Riggins, G.J., Lal, A., Mackay, A., Harris, R.A., O'Hare, M.J., Neville, A.M., Simpson, A.J., Strausberg, R.L. (2003) Comprehensive sampling of gene expression in human cell lines with massively parallel signature sequencing. Proc. Natl Acad. Sci. USA, 100, 4702–4705.

    Kane, M.D., Jatkoe, T.A., Stumpf, C.R., Lu, J., Thomas, J.D., Madore, S.J. (2000) Assessment of the sensitivity and specificity of oligonucleotide (50 mer) microarrays. Nucleic Acids Res., 28, 4552–4557.

    Kuznetsov, V.A. (2001) Distribution associated with stochastic processes of gene expression in a single eukaryotic cell. EURASIP J. Appl. Signal Proc., 4, 285–296.

    Kuznetsov, V.A., Knott, G.D., Bonner, R.F. (2002) General statistics of stochastic process of gene expression in eukaryotic cells. Genetics, 161, 1321–1332.

    Lash, A.E., Tolstoshev, C.M., Wagner, L., Schuler, G.D., Strausberg, R.L., Riggins, G.J., Altschul, F. (2000) SAGEmap: a public gene expression resource. Genome Res., 10, 1051–1060.

    Lee, H.S., Wang, J., Tian, L., Jiang, H., Black, M.A., Madlung, A., Watson, B., Lukens, L., Pires, J.C., Wang, J.J., et al. (2004) Sensitivity of 70-mer oligonucleotides and cDNAs for microarray analysis of gene expression in Arabidopsis and its related species. Plant Biotechnol. J., 2, 45–52.

    Lemon, W.J., Liyanarachchi, S., You, M. (2003) A high performance test of differential gene expression for oligonucleotide arrays. Genome Biol., 4, R67.

    Li, C. and Wong, W.H. (2001) Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc. Natl Acad. Sci. USA, 98, 31–36.

    Li, H. and Gui, J. (2004) Partial Cox regression analysis for high-dimensional microarray gene expression data. Bioinformatics, 20, Suppl. 1, i208–i215.

    Lin, J.Y., Pollack, J.R., Chou, F.L., Rees, C.A., Christian, A.T., Bedford, J.S., Brown, P.O., Ginsberg, M.H. (2002) Physical mapping of genes in somatic cell radiation hybrids by comparative genomic hybridization to cDNA microarrays. Genome Biol., 3, research0026.1–research0026.7.

    Lockhart, D.J. and Winzeler, E.A. (2000) Genomics, gene expression, and DNA arrays. Nature, 405, 827–836.

    Meissner, N., Radke, J., Hedges, J.F., White, M., Behnke, M., Bertolino, S., Mitchell, A., Jutila, M.A. (2003) Serial analysis of gene expression in circulating γδ T cell subsets defines distinct immunoregulatory phenotypes and unexpected gene expression profiles. J. Immunol., 170, 356–364.

    Miller, R.T., Christoffels, A.G., Gopalakrishnan, C., Burke, J., Ptitsyn, A.A., Broveak, T.R., Hide, W.A. (1999) A comprehensive approach to clustering of expressed human gene sequence: the sequence tag alignment and consensus knowledge base. Genome Res., 11, 1143–1155.

    Morris, J.S., Baggerly, K.A., Coombes, K.R. (2003) Bayesian shrinkage estimation of the relative abundance of mRNA transcripts using SAGE. Biometrics, 59, 476–486.

    Neill, J.D. and Ridpath, J.F. (2003) Gene expression changes in BVDV2-infected MDBK cells. Biologicals, 31, 97–102.

    Nicholson, W. (1985) Microeconomic Theory: Basic Principles and Extensions, 3rd edn. The Dryden Press, New York.

    O'Malley, A.J. and Deely, J.J. (2003) Bayesian measures of the minimum detectable concentration of an immunoassay. Aust. N. Z. J. Stat., 45, 43–65.

    Pepe, M.S., Longton, G., Anderson, G.L., Schummer, M. (2003) Selecting differentially expressed genes from microarray experiments. Biometrics, 59, 133–142.

    Tu, Y., Stolovitzky, G., Klein, U. (2002) Quantitative noise analysis for gene expression microarray experiments. Proc. Natl Acad. Sci. USA, 99, 14031–14036.

    Ueda, H.R., Hayashi, S., Matsuyama, S., Yomo, T., Hashimoto, S., Kay, S.A., Hogenesch, J.B., Lino, M. (2004) Universality and flexibility in gene expression from bacteria to human. Proc. Natl Acad. Sci. USA, 101, 3765–3769.

    Velculescu, V.E., Zhang, L., Vogelstein, B., Kinzler, K.W. (1995) Serial analysis of gene expression. Science, 270, 484–487.

    Velculescu, V.E., Zhang, L., Zhou, W., Vogelstein, J., Basrai, M.A., Bassett, E.E., Hieter, P., Vogelstein, B., Kinzler, K.W. (1997) Characterization of the yeast transcriptome. Cell, 88, 243–251.

    Wang, H., Hubbell, E., Hu, J., Mei, G., Cline, M., Lu, G., Clark, T., Siani-Rose, M.A., Ares, M., Kulp, D.C., Haussler, D. (2003) Gene structure-based splice variant deconvolution using a microarray platform. Bioinformatics, 19, Suppl. 1, i315–i322.

    Zien, A., Fluck, J., Zimmer, R., Lengauer, T. (2003) Microarrays: how many do you need? J. Comput. Biol., 10, 653–667.





