Bioinformatics Advance Access originally published online on April 21, 2005
Bioinformatics 2005 21(14):3122-3130; doi:10.1093/bioinformatics/bti452
Classification of oligonucleotide fingerprints: application for microbial community and gene expression analyses
1Department of Statistics, University of California Riverside, CA 92521, USA
2Department of Plant Pathology, University of California Riverside, CA 92521, USA
3Central Laboratories, Israeli Ministry of Health Yaakov Eliav 9, 94467 Jerusalem, Israel
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: Oligonucleotide fingerprinting of ribosomal RNA genes (OFRG) is a procedure that sorts rRNA gene (rDNA) clones into taxonomic groups through a series of hybridization experiments. The hybridization signals are classified into three discrete values 0, 1 and N, where 0 and 1, respectively, specify negative and positive hybridization events and N designates an uncertain assignment. This study examined various approaches for classifying the values including Bayesian classification with normally distributed signal data, Bayesian classification with the exponentially distributed data, and with gamma distributed data, along with tree-based classification. All classification data were clustered using the unweighted pair group method with arithmetic mean.
Results: The performance of each classification/clustering procedure was compared with results from known reference data. Comparisons indicated that the approach using the Bayesian classification with normal densities followed by tree clustering out-performed all others. The paper includes a discussion of how this Bayesian approach may be useful for the analysis of gene expression data.
Contact: james.press{at}ucr.edu
| 1 INTRODUCTION |
|---|
Microorganisms are integral components of ecosystems and human civilization. They play important roles in the detoxification of polluted environments, provide essential nutrients for plants and transform waste materials into useful commodities such as compost (Atlas and Bartha, 1998; Christensen, 1989; Tuomela et al., 2000). Microorganisms are used in fermentation processes, producing a great number of important food and beverage products. In the biotechnology arena, they are a vital source of useful compounds and provide the means for the production of genetically engineered products such as pharmaceuticals (Bull et al., 2000; Chapela, 1997). Despite these important discoveries, the microbial contribution to natural ecosystems and their potential for society have yet to be fully realized, as current experimental methods do not allow thorough descriptions of the microbial communities inhabiting most environments.
One of the first steps in characterizing an ecosystem is to identify the organisms inhabiting it. Traditionally, microorganisms have been classified by characterizing their morphological and physiological traits. However, such traits do not provide a meaningful framework for evolutionary classifications. Moreover, this approach will only detect a fraction of the existing microorganisms, as the majority of them do not readily grow on laboratory media (Amann et al., 1995). In the 1970s, the development of comparative ribosomal RNA (rRNA) analysis provided an evolutionary basis for prokaryotic taxonomy (Fox et al., 1977; Sogin et al., 1972; Woese et al., 1975; Woese and Fox, 1977). The subsequent development of strategies to analyze rRNA molecules and genes (rDNA) obtained from the environment provided a culture-independent means to examine the immense diversity of microorganisms inhabiting the natural world (Giovannoni et al., 1990; Pace, 1997).
| 2 THE PROBLEM |
|---|
Numerous rDNA-based strategies have been developed for microbial community analysis. The most accurate approach is to analyze the nucleotide sequence of rRNA molecules or genes (Giovannoni et al., 1990; Olsen et al., 1986; Pace et al., 1986; Ward et al., 1990). However, because of the high costs associated with examining such diverse communities in this manner, this approach is usually impractical for thorough analysis of microbial community composition. Other methods such as denaturing gradient gel electrophoresis (Muyzer et al., 1993), terminal restriction fragment length polymorphisms (Liu et al., 1997), ribosomal intergenic space analysis (Borneman and Triplett, 1997) and amplified ribosomal DNA restriction analysis (Vaneechoutte et al., 1992) enable relatively inexpensive and rapid analysis of many samples, but they typically generate only partial descriptions of microbial community composition.
To overcome these experimental obstacles, a method termed oligonucleotide fingerprinting of ribosomal RNA genes (OFRG) was developed (Valinsky et al., 2002a, b). OFRG is an adaptation of a method used for gene expression profiling (Drmanac, 1999; Drmanac et al., 1991; Lennon and Lehrach, 1991). OFRG is an array-based method that enables extensive analysis of microbial community composition. OFRG works by sorting rDNA clones into taxonomic groups through a series of hybridization experiments, each using a single DNA probe. The probe sequences for the OFRG analysis are selected for their ability to differentiate known rDNA sequences in the GenBank database (NCBI) (Borneman et al., 2001). These hybridization experiments are used to produce hybridization fingerprints which specify the presence or absence of the probe sequences in each clone. The microorganisms are identified by clustering the hybridization fingerprints of the unknown rDNA clones with those of known rDNA sequences.
A brief description of the experimental process for OFRG follows. Microbial rDNAs are isolated from a sample of interest by extracting DNA from the microorganisms and then PCR amplifying the rDNA. Cloned rDNA fragments are arrayed on nylon membranes and subjected to a series of hybridization experiments, each using a single DNA oligonucleotide probe. The signal intensities from these experiments are transformed into binary vectors, which we call hybridization fingerprints. Hybridization fingerprints from the unidentified rDNA clones are clustered with fingerprints from known rDNA sequences using UPGMA (unweighted pair group method with arithmetic mean). The unidentified rDNA clones are identified by their association with known rDNA sequences in the UPGMA tree. To date, we have developed OFRG probe sets for bacteria and fungi. The bacterial analysis is capable of classifying rRNA gene sequences to the species level (Valinsky et al., 2002b). For fungal analysis, OFRG was able to resolve clones with average sequence identities of 99.2% (Valinsky et al., 2002a).
| 3 THE DATA |
|---|
One of the crucial components of the OFRG analysis is the process by which the signal intensity data are transformed into hybridization fingerprints. The hybridization fingerprints specify whether the probes hybridize, or do not hybridize, to the rDNA clones. Probes that hybridize to the rDNA clones should produce larger signal intensity values than those that do not hybridize. The following is a description of how prior OFRG studies have processed the data to produce the hybridization fingerprints.
Hybridization fingerprints have been generated by transforming the hybridization signal intensity data into three discrete values 0, 1 and N, where 0 and 1, respectively, specify negative and positive hybridization events and N designates an uncertain assignment. The signal intensity data from the unidentified rDNA clones were transformed into 0, 1 and N based on the signal intensities from control clones, which are clones with defined nucleotide sequences. For most probes, the control clones expected not to hybridize with the probe (negative controls) have signal intensity values lower than those of the control clones expected to hybridize with the probe (positive controls). For probes that function in this manner, clones with intensity values ≤x were given a 0 classification, where x is the highest intensity value generated by a negative control. Clones with intensity values ≥y were given a 1 classification, where y is the lowest value generated by a positive control. All other clones were given an N classification. For some probes, not all of the control clones perform in the predicted manner; for example, some positive control clones may have intensity values that are lower than some of the negative control values and vice versa. For probes that function in this manner, clones with intensity values <x were given a 0 classification, where x is the lowest intensity value generated from a positive control. Clones with intensity values >y were given a 1 classification, where y is the highest value generated by a negative control. All other clones were given an N classification. Performing this analysis with all probes for all clones creates a hybridization fingerprint for each clone. An example of a hybridization fingerprint created by 26 probes is 000101N001000N110101111000.
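The control-based rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_probe` and its arguments are hypothetical, and the sketch assumes inclusive comparisons (at or below x, at or above y) when the controls separate cleanly and strict comparisons when they overlap.

```python
def classify_probe(intensities, neg_controls, pos_controls):
    """Return a list of '0', '1' or 'N' calls for the clones of one probe."""
    max_neg = max(neg_controls)
    min_pos = min(pos_controls)
    calls = []
    if max_neg < min_pos:
        # Controls separate cleanly: 0 at or below the highest negative
        # control, 1 at or above the lowest positive control, N in between.
        for v in intensities:
            calls.append('0' if v <= max_neg else '1' if v >= min_pos else 'N')
    else:
        # Controls overlap: 0 strictly below the lowest positive control,
        # 1 strictly above the highest negative control, N inside the overlap.
        for v in intensities:
            calls.append('0' if v < min_pos else '1' if v > max_neg else 'N')
    return calls


# Made-up intensities for three clones and two controls of each kind.
print(classify_probe([1, 5, 9], [2, 3], [7, 8]))  # ['0', 'N', '1']
```

Joining the calls for one clone across all probes gives its hybridization fingerprint. The wider the gap (or overlap) between the control groups, the more clones receive an N.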
This report compares and evaluates several approaches for transforming the signal intensity data into hybridization fingerprints. As with most nucleic acid hybridization experiments, the signal intensity data do not consistently fall into discrete categories. This factor along with the aforementioned method for transforming these data lead to the production of hybridization fingerprints with a considerable number of N classifications. The goal of this work was to minimize the number of N classifications, which should increase the accuracy of the OFRG analysis.
| 4 STATISTICAL ANALYSIS |
|---|
4.1 Overview
Our data consist of signal intensities from hybridization experiments between arrayed rDNA genes and probes. We classify these hybridizations as 1 if hybridization took place, 0 if it did not, and N if we were unable to determine how to classify the result. The 0-1-N classification produces hybridization fingerprints, which are used to cluster the rDNA clones into taxonomic groups. In prior studies, this grouping has been less than fully satisfactory because the number of gene/probe combinations in the N group was not small, and we had no idea how to classify the N combinations. (However, we felt confident about the combinations that had already been classified as 1 or 0, according to whether hybridization had taken place in that experiment, or not.) In addition, the hierarchical clustering algorithm (UPGMA) used to cluster the gene/probe combinations classifies the Ns somewhat randomly, as it allocates them proportionally to already-established 0-1 values between pairs of genes, and then forms a distance matrix for the clustering procedure.
To illustrate the allocation procedure, we use a simple example. Suppose there are 3 genes and 10 probes, and the 0-1 sequences are given by
![]() |
![]() |
![]() |
![]() |
The symmetric pairwise-distance matrix for this example is given in Table 1, which the PAUP (Phylogenetic Analysis Using Parsimony) computer program, Version 4.0, Beta 9 (Swofford, 2001; Maddison et al., 1997) would then use to form clusters, using UPGMA. Because the proportional allocation approach for classifying and clustering the Ns was clearly quite arbitrary, we sought to improve upon that procedure.
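As a rough illustration of proportional allocation, the sketch below computes a pairwise distance as the proportion of mismatches among the probes where both clones have a 0 or 1 call. This is an assumption about how the allocation works, not a description of PAUP's internals: if the N positions are assumed to mismatch at the same rate as the known positions, the resulting proportion equals the mismatch rate over the known positions. The function names and the three fingerprints are hypothetical.

```python
def pairwise_distance(fp_a, fp_b):
    """Proportion of mismatched calls over positions where neither clone is N."""
    comparable = [(a, b) for a, b in zip(fp_a, fp_b) if a != 'N' and b != 'N']
    if not comparable:
        return None  # no probe is informative for this pair
    mismatches = sum(1 for a, b in comparable if a != b)
    return mismatches / len(comparable)


def distance_matrix(fingerprints):
    """Symmetric matrix of pairwise distances with a zero diagonal."""
    n = len(fingerprints)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = pairwise_distance(fingerprints[i], fingerprints[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Three made-up 10-probe fingerprints, used only to exercise the functions.
example = ["0101N10010", "0111N1001N", "1N10010110"]
for row in distance_matrix(example):
    print(row)
```

The sketch makes the weakness of the approach visible: two clones that share only a few informative probes get a distance estimated from very little data.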
The procedure we adopted to improve the clustering consists of first classifying all of the gene/probe combinations statistically as hybridized or not, thereby eliminating the unclassified, or unknown, category. We then use UPGMA to cluster a fully classified set of 0-1 data, so that no proportional allocation is necessary. We used a reference set of data for which we knew the nucleotide sequences of the rDNA clones to evaluate how well the statistical classification/clustering procedure would compare with the 0-1-N procedure. The statistical procedures used were:
- Bayesian classification; that is, the intensities of each of the gene/probe combinations were classified using procedures developed from Bayesian classification modeling (see, e.g. Press, 1982, 2003). We made various assumptions about the distributions of the data and we tested them, as described in Section 4.2. The data were classified into one of two classes, the hybridized class and the unhybridized class. To establish the structure of the populations we needed known data, that is, data whose classifications were known with certainty. For known data we used the reference data, and we refer to it as training data, to be consistent with terminology used in the classification literature.
- Multivariate hierarchical clustering using UPGMA.
- Tree modeling for the resulting clustering procedure that minimized the distance between the tree corresponding to the reference population, and the statistically clustered data.
The various distributions studied and compared were normal distributions, exponential distributions and gamma distributions. We also tried tree-based (non-parametric) classification. The complete dataset consists of 27 probes and 1464 genes, after excluding genes that did not hybridize properly. From the 1464 genes, 65 were randomly selected and used as training data. These 65 genes had known class membership. That is, we knew with certainty whether these 65 had hybridized or not. Our objective was to find the best methods for both classifying and clustering the training data (65 genes), as compared with the correct cluster values, which we knew, in order to apply the resulting optimal procedure to the entire dataset (1464 genes).
The performance of each classification procedure was evaluated by calculating the apparent error rate (APER), defined as the fraction of misclassified observations in the training sample (Johnson and Wichern, 1992). The APER did not depend on the form of population densities and could be readily calculated by constructing the 2 x 2 confusion matrix (Press, 1982), where the actual and predicted class memberships obtained from each classification approach were compared.
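The APER calculation can be illustrated with a short sketch. The function names and the toy labels are hypothetical; the only assumption is that actual and predicted class memberships are coded as 0 (unhybridized) and 1 (hybridized).

```python
def confusion_matrix(actual, predicted):
    """Counts keyed by (actual class, predicted class) for classes 0 and 1."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


def aper(actual, predicted):
    """Apparent error rate: misclassified observations / all observations."""
    cm = confusion_matrix(actual, predicted)
    return (cm[(0, 1)] + cm[(1, 0)]) / sum(cm.values())


actual = [0, 0, 1, 1, 1]
predicted = [0, 1, 1, 1, 0]
print(confusion_matrix(actual, predicted))
print(aper(actual, predicted))  # 0.4
```

Because the APER is computed on the same training sample used to fit the classifier, it tends to understate the error expected on new data.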
Prior to conducting classification and clustering analyses, all normalized intensities that were negative were truncated to zero. Then, we explored the distributions of the remaining data (in each dimension) using both histograms and normal probability plots (Venables and Ripley, 1999). Reference data points for each probe were used to estimate the parameters of the underlying distributions. The data usually looked non-normally distributed, and very much like exponentially distributed or gamma distributed data. We also examined classifications using transformations of the non-normal data (we attempted to transform very non-normal looking data to normality). Each classification procedure was therefore performed on either the original or the transformed intensities, depending on the distributional assumption of the procedure being applied. The derivations of the classification methods used may be found in Press (1982, 1989, 2003).
4.2 Bayesian classification with normal distributions
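The preliminary investigation of histograms and normal probability plots revealed that the normalized intensities in each probe tended to have a rather right-skewed, non-normal shape. The Box-Cox transformation (Box and Cox, 1964) was consequently used to transform the intensities in each probe to approximate normality. Then the Bayesian classification approach was performed on the transformed data. The unhybridized and hybridized classes or populations are denoted by $\pi_0$ and $\pi_1$, respectively, and the distributions are denoted by $N(\mu_0, \sigma_0^2)$ and $N(\mu_1, \sigma_1^2)$, respectively. All parameters $(\mu_j, \sigma_j^2)$, $j = 0, 1$, were assumed to be unknown.
For class $\pi_j$ ($j = 0, 1$) in the training sample with known class memberships corresponding to each clone, the intensities, $(x_1^{(j)}, \ldots, x_{n_j}^{(j)})$, were assumed to be independent and identically distributed (i.i.d.) according to $\pi_j$. For convenience, the superscript $j$ would be omitted when considering only one class, $j$. Let $\bar{x}_j$ and $s_j^2$ be sufficient, unbiased estimators of $\mu_j$ and $\sigma_j^2$, respectively. Since $\bar{x}_j$ and $s_j^2$ are independent, with
$$\bar{x}_j \mid \mu_j, \sigma_j^2 \sim N\!\left(\mu_j, \sigma_j^2/n_j\right), \qquad \frac{(n_j - 1)\, s_j^2}{\sigma_j^2} \sim \chi^2_{n_j - 1},$$
the likelihood function can be written as
$$L(\mu_j, \sigma_j^2 \mid \bar{x}_j, s_j^2) \propto (\sigma_j^2)^{-n_j/2} \exp\left\{-\frac{(n_j - 1)\, s_j^2 + n_j(\bar{x}_j - \mu_j)^2}{2\sigma_j^2}\right\}, \tag{1}$$
where $\propto$ denotes proportionality. Adopting a vague prior distribution (such a distribution represents minimum prior information), $p(\mu_j, \sigma_j^2) \propto 1/\sigma_j^2$. The posterior density is formed by Bayes' theorem as a product of the likelihood function and prior information, expressed as
$$p(\mu_j, \sigma_j^2 \mid \bar{x}_j, s_j^2) \propto (\sigma_j^2)^{-(n_j/2 + 1)} \exp\left\{-\frac{(n_j - 1)\, s_j^2 + n_j(\bar{x}_j - \mu_j)^2}{2\sigma_j^2}\right\}. \tag{2}$$
The interest here was to predict class membership of an observation, $z$, which is known to belong to either $\pi_0$ or $\pi_1$. After integrating the product of the likelihood of observation $z$ and the posterior distribution (2) with respect to all unknown parameters, the predictive distribution would then be obtained in the form of a Student's t-distribution with $n_j - 1$ degrees of freedom, shown as
$$p(z \mid \bar{x}_j, s_j^2, \pi_j) = \frac{\Gamma(n_j/2)}{\Gamma\!\left((n_j - 1)/2\right) \sqrt{\pi (n_j - 1)(1 + 1/n_j)}\; s_j} \left[1 + \frac{(z - \bar{x}_j)^2}{(n_j - 1)(1 + 1/n_j)\, s_j^2}\right]^{-n_j/2}, \tag{3}$$
where $(\bar{x}_0, s_0^2)$ and $(\bar{x}_1, s_1^2)$ are computed from the training samples of $\pi_0$ and $\pi_1$, respectively. The posterior odds ratio could then be calculated from the posterior classification probabilities, yielding the ratios of pairs of Student's t-densities as given by
$$\frac{P(\pi_1 \mid z)}{P(\pi_0 \mid z)} = \frac{q_1\, p(z \mid \bar{x}_1, s_1^2, \pi_1)}{q_0\, p(z \mid \bar{x}_0, s_0^2, \pi_0)}, \tag{4}$$
where $q_0$ and $q_1$ denote the prior probabilities of $\pi_0$ and $\pi_1$. The observation $z$ was classified into $\pi_1$ if the value of the odds ratio was larger than 1, and into $\pi_0$, otherwise.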
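The normal-theory classification rule of this section can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: it uses the standard result that, under a normal likelihood and the vague prior, the predictive density of a new observation is a Student's t-density with n - 1 degrees of freedom centered at the sample mean; the function names, the equal default prior probabilities `q0` and `q1`, and the toy training values are all hypothetical. The Box-Cox power `lam` is taken as given.

```python
import math
from statistics import mean, variance


def boxcox(x, lam):
    """Box-Cox transform of a positive value x with power parameter lam."""
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam


def log_t_predictive(z, sample):
    """Log predictive density of z given one class's training sample."""
    n = len(sample)
    xbar, s2 = mean(sample), variance(sample)  # variance() is the unbiased estimator
    nu = n - 1                                 # degrees of freedom
    scale2 = s2 * (1.0 + 1.0 / n)              # squared scale of the t-density
    return (math.lgamma(n / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi * scale2)
            - (n / 2) * math.log1p((z - xbar) ** 2 / (nu * scale2)))


def classify_normal(z, train0, train1, q0=0.5, q1=0.5):
    """Return 1 if the posterior odds of class 1 over class 0 exceed 1, else 0."""
    log_odds = (math.log(q1 / q0)
                + log_t_predictive(z, train1)
                - log_t_predictive(z, train0))
    return 1 if log_odds > 0 else 0


# Made-up (already transformed) intensities for the two training classes.
unhybridized = [0.10, 0.20, 0.15, 0.05]
hybridized = [1.00, 1.20, 0.90, 1.10]
print(classify_normal(0.12, unhybridized, hybridized))  # 0
print(classify_normal(1.05, unhybridized, hybridized))  # 1
```

Working on the log scale avoids underflow when an observation lies far in the tail of one class. Every call is resolved to 0 or 1, so no N category remains.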
4.3 Bayesian classification with exponential distributions
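The histograms and normal probability plots of normalized intensities in each probe appeared to have shapes skewed to the right, as pointed out previously. As an alternative, the Bayesian classification could be conducted directly on the original, untransformed intensities, which approximately exhibit characteristics similar to those of the family of gamma distributions. For convenience in computation, we first assume that the intensities in every probe follow an exponential distribution, $\pi_j \sim \exp(\beta_j)$, with density $f(x \mid \beta_j) = \beta_j e^{-\beta_j x}$, $x > 0$, where $j = 0, 1$ and $\beta_j$ denotes the unknown parameter in class $\pi_j$. That is, now assume $(x_1, x_2, \ldots, x_{n_j})$ are i.i.d. from $\pi_j$. Likewise, the likelihood function would be written in terms of a sufficient estimator of $\beta_j$, given by
$$L(\beta_j \mid \hat{\beta}_j) \propto \beta_j^{\,n_j} \exp\left(-\frac{n_j \beta_j}{\hat{\beta}_j}\right), \tag{5}$$
where $\hat{\beta}_j = 1/\bar{x}_j$, the maximum-likelihood estimator (MLE) of $\beta_j$. We see that the density of $\hat{\beta}_j$ is in the form of an Inverted-gamma $(n_j, n_j\beta_j)$. With the use of a vague prior, $p(\beta_j) \propto 1/\beta_j$, $0 < \beta_j < \infty$. After replacing the value of $\hat{\beta}_j$ by $1/\bar{x}_j$, the posterior density of $\beta_j$ could then be derived and given by
$$p(\beta_j \mid \bar{x}_j) \propto \beta_j^{\,n_j - 1} \exp\left(-n_j \bar{x}_j \beta_j\right). \tag{6}$$
Integrating the product of the exponential density of an observation $z$ and the posterior density (6) with respect to $\beta_j$ gives the predictive density
$$p(z \mid \bar{x}_j, \pi_j) = \frac{n_j\, (n_j \bar{x}_j)^{n_j}}{(n_j \bar{x}_j + z)^{n_j + 1}}, \qquad z > 0, \tag{7}$$
and the posterior odds ratio
$$\frac{P(\pi_0 \mid z)}{P(\pi_1 \mid z)} = \frac{q_0\, p(z \mid \bar{x}_0, \pi_0)}{q_1\, p(z \mid \bar{x}_1, \pi_1)}, \tag{8}$$
where $q_0$ and $q_1$ denote the prior probabilities of $\pi_0$ and $\pi_1$. The observation $z$ was classified into $\pi_0$ if the odds ratio was >1, and vice versa.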
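The exponential case can be sketched in the same way. The illustration below assumes the rate parameterization (density proportional to exp(-βx)) and the vague prior 1/β, for which integrating out β gives a closed-form predictive density depending only on the class size n and the class mean; the function names, the default prior probabilities and the toy data are hypothetical.

```python
import math
from statistics import mean


def log_exp_predictive(z, sample):
    """Log predictive density n (n*xbar)^n / (n*xbar + z)^(n+1) for z >= 0."""
    n = len(sample)
    total = n * mean(sample)  # n * xbar, the sum of the training intensities
    return math.log(n) + n * math.log(total) - (n + 1) * math.log(total + z)


def classify_exponential(z, train0, train1, q0=0.5, q1=0.5):
    """Return 0 if the posterior odds of class 0 over class 1 exceed 1, else 1."""
    log_odds = (math.log(q0 / q1)
                + log_exp_predictive(z, train0)
                - log_exp_predictive(z, train1))
    return 0 if log_odds > 0 else 1


# Made-up untransformed intensities for the two training classes.
unhybridized = [0.1, 0.2, 0.1, 0.2]
hybridized = [1.0, 2.0, 1.5, 1.5]
print(classify_exponential(0.05, unhybridized, hybridized))  # 0
print(classify_exponential(3.0, unhybridized, hybridized))   # 1
```

Both predictive densities are largest at zero and decrease monotonically, so the rule effectively places a single cut-off between the two class means.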
4.4 Bayesian classification with gamma distributions
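As mentioned earlier, the intensities in each probe mostly tended to have a distribution skewed to the right, suggesting a gamma distribution. In this section, we model the intensities in each probe with a flexible shape parameter, instead of with a fixed shape parameter, as in the exponential case. Similarly, all intensities in class $j$, $(x_1, x_2, \ldots, x_{n_j})$, were assumed to be i.i.d. from $\pi_j$, $\pi_j \sim \mathrm{Gamma}(\alpha_j, \beta_j)$, with density $f(x \mid \alpha_j, \beta_j) = \beta_j^{\alpha_j} x^{\alpha_j - 1} e^{-\beta_j x} / \Gamma(\alpha_j)$, $x > 0$, so that $\alpha_j = 1$ recovers the exponential case of Section 4.3. Having two unknown parameters and a likelihood function that involves a gamma function causes difficulties in the derivation of the predictive density. The result is that there is no closed form for the joint posterior density, so we cannot develop Bayesian estimates of the two parameters jointly. To circumvent this problem, we assumed the shape parameter, $\alpha_j$, was known and we carried out Bayesian estimation of the remaining parameter, $\beta_j$, for known $\alpha_j$. To estimate the fixed $\alpha_j$, we estimated $(\alpha_j, \beta_j)$ jointly by MLE, and then we discarded the estimate of $\beta_j$. We next describe the Bayesian estimation of $\beta_j$ conditional on $\alpha_j$.
Given all the data, and the estimated value of $\hat{\alpha}_j$, the likelihood as a function of the unknown parameter $\beta_j$ could be written as
$$L(\beta_j \mid \hat{\beta}_j, \hat{\alpha}_j) \propto \beta_j^{\,n_j \hat{\alpha}_j} \exp\left(-\frac{n_j \hat{\alpha}_j \beta_j}{\hat{\beta}_j}\right), \tag{9}$$
where $\hat{\beta}_j = \hat{\alpha}_j / \bar{x}_j$ denoted the sufficient estimator of $\beta_j$. Suppose that we adopt a vague prior, $p(\beta_j) \propto 1/\beta_j$, $0 < \beta_j < \infty$. The resulting posterior distribution of $\beta_j$ would be obtained with the substitution of $\hat{\beta}_j = \hat{\alpha}_j / \bar{x}_j$, expressed as
$$p(\beta_j \mid \bar{x}_j, \hat{\alpha}_j) \propto \beta_j^{\,n_j \hat{\alpha}_j - 1} \exp\left(-n_j \bar{x}_j \beta_j\right), \tag{10}$$
which is a $\mathrm{Gamma}(n_j \hat{\alpha}_j,\, n_j \bar{x}_j)$ distribution. Likewise, the predictive distribution of an observation $z$ would then be given by
$$p(z \mid \bar{x}_j, \hat{\alpha}_j, \pi_j) = \frac{\Gamma\!\left((n_j + 1)\hat{\alpha}_j\right)}{\Gamma(\hat{\alpha}_j)\, \Gamma(n_j \hat{\alpha}_j)} \cdot \frac{z^{\hat{\alpha}_j - 1}\, (n_j \bar{x}_j)^{n_j \hat{\alpha}_j}}{(n_j \bar{x}_j + z)^{(n_j + 1)\hat{\alpha}_j}}, \qquad z > 0. \tag{11}$$
The posterior odds ratio is
$$\frac{P(\pi_0 \mid z)}{P(\pi_1 \mid z)} = \frac{q_0\, p(z \mid \bar{x}_0, \hat{\alpha}_0, \pi_0)}{q_1\, p(z \mid \bar{x}_1, \hat{\alpha}_1, \pi_1)}, \tag{12}$$
where $q_0$ and $q_1$ denote the prior probabilities of $\pi_0$ and $\pi_1$ and the predictive densities are given by Equation (11). Equation (12), then, would be rewritten as
$$\frac{P(\pi_0 \mid z)}{P(\pi_1 \mid z)} = \frac{q_0}{q_1}\; z^{\hat{\alpha}_0 - \hat{\alpha}_1}\; \frac{\Gamma\!\left((n_0 + 1)\hat{\alpha}_0\right) \Gamma(\hat{\alpha}_1)\, \Gamma(n_1 \hat{\alpha}_1)}{\Gamma\!\left((n_1 + 1)\hat{\alpha}_1\right) \Gamma(\hat{\alpha}_0)\, \Gamma(n_0 \hat{\alpha}_0)} \cdot \frac{(n_0 \bar{x}_0)^{n_0 \hat{\alpha}_0}\, (n_1 \bar{x}_1 + z)^{(n_1 + 1)\hat{\alpha}_1}}{(n_1 \bar{x}_1)^{n_1 \hat{\alpha}_1}\, (n_0 \bar{x}_0 + z)^{(n_0 + 1)\hat{\alpha}_0}}. \tag{13}$$
The observation $z$ was assigned to $\pi_0$ if its odds ratio was larger than 1, and to $\pi_1$, otherwise.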
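A sketch of the gamma-based rule follows, under several assumptions: the shape parameters `alpha0` and `alpha1` are supplied by the caller (for example from a separate MLE fit, which is not shown), the second gamma parameter enters the density as a rate and carries the vague prior 1/β, and the observation `z` is strictly positive. The function names, default prior probabilities and toy data are hypothetical.

```python
import math
from statistics import mean


def log_gamma_predictive(z, sample, alpha):
    """Log predictive density of z > 0 for one class with a fixed shape alpha."""
    n = len(sample)
    total = n * mean(sample)  # n * xbar, the sum of the training intensities
    return ((alpha - 1.0) * math.log(z)
            + n * alpha * math.log(total)
            + math.lgamma((n + 1) * alpha)
            - math.lgamma(alpha) - math.lgamma(n * alpha)
            - (n + 1) * alpha * math.log(total + z))


def classify_gamma(z, train0, train1, alpha0, alpha1, q0=0.5, q1=0.5):
    """Return 0 if the posterior odds of class 0 over class 1 exceed 1, else 1."""
    log_odds = (math.log(q0 / q1)
                + log_gamma_predictive(z, train0, alpha0)
                - log_gamma_predictive(z, train1, alpha1))
    return 0 if log_odds > 0 else 1


# Made-up intensities; the shape values 2.0 are arbitrary illustrations.
unhybridized = [0.1, 0.2, 0.1, 0.2]
hybridized = [1.0, 2.0, 1.5, 1.5]
print(classify_gamma(0.1, unhybridized, hybridized, 2.0, 2.0))  # 0
print(classify_gamma(3.0, unhybridized, hybridized, 2.0, 2.0))  # 1
```

With a shape of 1 the expression reduces to the exponential predictive density, which gives a convenient check on the implementation. Intensities truncated to exactly zero would need separate handling because of the log(z) term.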
4.5 Tree classification
The classification tree (Venebles and Ripley, 1999; Crawley, 2002) was performed based on binary recursive partitioning for which the data were split consecutively along the coordinates of the independent variables. The partition resulted in a path from the top of the tree, called the root, and continued proceeding to one of the terminal nodes, called a leaf, following criteria for successive splits. At each splitting point, the threshold for the response variable was chosen and the splitting continued until no further splits were allowed owing to sufficient homogeneity of observations, or very small numbers of observations in each node. That is, from the root, the tree splits into two groups called 0 and 1 using the intensity value (threshold) as a criterion for splitting. Genes with intensity less than the threshold are placed in group 0, and those with intensity greater than the threshold are placed in group 1. Sometimes the process of splitting ended here and we got the threshold for a clear split. However, sometimes all genes in groups 0 and 1 could be split further, as long as the reduction in deviance could still be achieved, or the defaults of the program are not yet met.
In this study, a threshold of intensities for a given probe at each node was selected and the deviances (D) of the response above and below this threshold were calculated, as defined by
$$D = \sum_i D_i, \qquad D_i = -2 \sum_k n_{ik} \log \hat{p}_{ik}, \tag{14}$$
where $n_{ik}$ is the number of clones in node $i$ assigned to class $k$ and $\hat{p}_{ik} = n_{ik}/n_i$ is an estimate of the proportion of clones in node $i$ assigned to class $k$. The whole procedure would be repeated until there is no further reduction in deviances or too few data for further subdivision. With tree-based procedures, each intensity in every probe ultimately would be classified into either class $\pi_0$ or class $\pi_1$.
Along with the originally assigned data, all binary data obtained from the various classification approaches were then clustered using UPGMA as implemented in the computer program PAUP. The analyses could also be performed with the presence of unknown values (N) in the original classification. With the default parameters, UPGMA assigned the unknown values proportionally to the known 0-1 values that appeared in the pairs of clones in comparison, as pointed out previously.
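The deviance-based choice of a splitting threshold in the classification tree can be sketched for a single split. This is a simplified illustration, not the S-Plus tree routine: it evaluates only candidate thresholds midway between consecutive distinct intensities, performs one split rather than recursive partitioning, and uses hypothetical function names and toy data.

```python
import math


def node_deviance(labels):
    """Deviance -2 * sum_k n_k * log(n_k / n) of one node with 0/1 labels."""
    n = len(labels)
    dev = 0.0
    for k in (0, 1):
        n_k = sum(1 for y in labels if y == k)
        if n_k:  # classes absent from the node contribute nothing
            dev += -2.0 * n_k * math.log(n_k / n)
    return dev


def best_threshold(intensities, labels):
    """Return (threshold, total deviance) of the best single split.

    The threshold is None when no split lowers the deviance of the unsplit node.
    """
    pairs = list(zip(intensities, labels))
    values = sorted(set(intensities))
    best = (None, node_deviance(labels))
    for low, high in zip(values, values[1:]):
        t = (low + high) / 2.0
        below = [y for x, y in pairs if x < t]
        above = [y for x, y in pairs if x >= t]
        d = node_deviance(below) + node_deviance(above)
        if d < best[1]:
            best = (t, d)
    return best


# Made-up training intensities with known class memberships.
x = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]
y = [0, 0, 0, 1, 1, 1]
print(best_threshold(x, y))
```

A pure node has zero deviance, so a threshold that separates the two classes perfectly in the training data is always preferred. This helps explain why a tree can reach a very low apparent error rate on a small training sample.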
The pairwise-distance matrices based on the proportions of different, or mismatched, characters between pairs of clones were constructed, and the hierarchical clusters, or trees, would be formed afterwards. With the same clone-probe combination, all trees drawn from different binary data corresponding to each classification approach were eventually compared with the reference tree obtained by analogously applying the UPGMA on the binary reference data. Tree comparisons in this study were performed based on several different criteria:
- the agreement subtrees (Agree) (Swofford, 2001);
- symmetric-difference metric (SymDiff) (Penny and Hendy, 1985);
- agreement-subtree metric d (AgD) (Goddard et al., 1994);
- agreement-subtree metric d1 (AgD1) (Goddard et al., 1994).
The preferred classification approach is the one that yields a small classification error rate and whose binary data produce a tree in close agreement with the reference tree. The procedure exhibiting this improvement over the original classification was finally adopted to classify all the 1464 clones.
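The clustering and tree-comparison steps can be illustrated with a small, self-contained sketch. It is not PAUP: `upgma` is a plain average-linkage implementation on a full distance matrix, and `symmetric_difference` counts the clades found in one rooted tree but not the other, which is a simplified stand-in for the SymDiff criterion. All names and the toy distance matrix are hypothetical.

```python
from itertools import combinations


def upgma(matrix, names):
    """Average-linkage (UPGMA) clustering; returns the tree as nested tuples."""
    clusters = {i: (names[i], 1) for i in range(len(names))}  # id -> (tree, size)
    dist = {frozenset((i, j)): matrix[i][j]
            for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)  # closest pair of current clusters
        a, b = sorted(pair)
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            d_ac = dist.pop(frozenset((a, c)))
            d_bc = dist.pop(frozenset((b, c)))
            # Size-weighted mean equals the mean over all member pairs.
            dist[frozenset((next_id, c))] = (
                (size_a * d_ac + size_b * d_bc) / (size_a + size_b))
        del dist[pair]
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree


def clades(tree):
    """Set of leaf sets, one for every internal node of a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, tuple):
            members = frozenset().union(*(leaves(child) for child in node))
            found.add(members)
            return members
        return frozenset([node])

    leaves(tree)
    return found


def symmetric_difference(tree_a, tree_b):
    """Number of clades present in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))


# Toy distances: A and B are close, C and D are close, the two pairs are far apart.
names = ["A", "B", "C", "D"]
matrix = [[0.0, 0.1, 0.8, 0.8],
          [0.1, 0.0, 0.8, 0.8],
          [0.8, 0.8, 0.0, 0.2],
          [0.8, 0.8, 0.2, 0.0]]
tree = upgma(matrix, names)
print(tree)                                                  # (('A', 'B'), ('C', 'D'))
print(symmetric_difference(tree, (("A", "C"), ("B", "D"))))  # 4
```

A value of zero means the two trees contain exactly the same clades; larger values indicate greater disagreement with the reference tree.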
| 5 RESULTS |
|---|
With the same clone-probe combinations, all confusion matrices obtained from Bayesian classification approaches and tree-based classification, as described in the previous section, were established separately in comparison with the classification of reference data (Table 2).
The performances of the various classification approaches conducted on the reference data were evaluated from the APER corresponding to each confusion matrix, as provided in Table 3. Most Bayesian classification methods performed on a family of gamma distributions resulted in high percentages of misclassification. Between the two classification procedures based on the exponential distribution and the gamma distribution, the misclassification error tended to be smaller when an MLE was used to approximate the shape parameter. Using the transformed-to-normality signal intensities, the Bayesian classification exhibited a moderate rate of misclassification. The tree-based classification produced the smallest error rate of classification in this study. However, the classification error rate was only part of the story. We were ultimately concerned with how the tree results compared for the various approaches, i.e. how far from the reference tree (the tree corresponding to the known classifications and clusters) were the trees corresponding to the various methods of classification? We show this comparison in Table 4.
In addition to the percentages of misclassification, the performance of the various classification approaches could also be assessed from the clustering results. The trees constructed from the binary data produced by each classification approach, together with that of the originally assigned data, were compared with the reference tree. As shown in Table 4, the results of this comparison, based on four criteria, appeared to be consistent across all classification approaches. The tree from the Bayesian classification using normal densities tended to exhibit the most agreement with the reference tree, followed by those from the Bayesian classifications on exponential and gamma densities, which produced roughly the same level of agreement. The original approach (0-1-N classification with proportional allocation to clusters) produced a tree with the second smallest level of agreement. Among all classification methods considered here, the tree-based classification showed the least agreement of subtrees, although it yielded the minimum rate of misclassification. One possible explanation is that, because tree-based classification is non-parametric and therefore does not take the form of the data distribution into account when developing classification criteria, it may miss the characteristics of the data that are most relevant to tree distances. Consequently, with a moderate rate of misclassification and the most agreement of subtrees with the reference tree, the Bayesian classification with normal densities out-performed all others and improved on the original (0-1-N) approach. The tree that resulted from the Bayesian normal classification of the reference data and the reference tree are illustrated in Figures 1 and 2, respectively.
Finally, this Bayesian normal classification approach was adopted to classify all the transformed signal intensities in the set of 1464 clones. The tree that resulted from applying this relatively optimal approach to all of the 1464 clones is not displayed in this paper, since it required 14 pages for presentation.
| 6 CONCLUSIONS |
|---|
In this paper, we compared and evaluated several strategies for transforming hybridization signal intensity data from oligonucleotide fingerprint experiments into hybridization fingerprints composed of binary values, thereby resolving uncertainty in the originally assigned fingerprints. The performance of each classification approach was assessed in terms of both misclassification rates and the levels of agreement in the hierarchical clusters. The analysis revealed that the Bayesian classification on transformed-to-normal data appeared to out-perform all others considered in the study, including the original (0-1-N) approach. However, we were still limited in that there were just a very small number of clones falling in either the hybridized or the unhybridized group in some probes of the training data whose class memberships were established. This surely affected the performance of all statistics used in all of the Bayesian classification approaches. But most statistics have reasonably good properties as the sample size gets large. As a consequence, it is recommended that in applications, a larger training dataset be used, where possible, to gain maximum advantage from the training data. Overall, utilization of the Bayesian classification scheme should increase the reliability of oligonucleotide fingerprint analyses.
This Bayesian approach may also be useful for other studies including analysis of standard gene expression data. Typical goals of gene expression analysis include identifying genes that are expressed at similar/differential levels or identifying samples that have similar/different expression patterns. These studies typically utilize cluster analyses, which group objects by their similarities. One problem with these approaches is defining what constitutes similarity, as the results of any clustering experiment can be strongly influenced by how this parameter is defined (Brazma and Vilo, 2000). One way to approach this problem is to discretize the data before it is clustered. Shmulevich and Zhang (2002) developed an approach which included discretizing gene expression data into a binary format. This approach successfully separated tumor types while reducing data noise and increasing computational efficiency. Utilization of the Bayesian approach described in this paper will provide an alternative approach for discretizing expression data based on prior knowledge, which could lead to new strategies for gene expression analysis.
| Acknowledgments |
|---|
We wish to thank Professor Tao Jiang and Andres Figueroa (Department of Computer Science and Engineering) and Professor Mark S. Springer (Department of Biology, University of California, Riverside, CA) for their useful advice about tree comparisons. This work was funded in part by grants from the NSF BDI Program (J.B. and S.J.P.) and the UC Biotechnology Research & Training programme (K.J., J.B. and S.J.P.). L.V. was supported by Vaadia-BARD postdoctoral award FI-306-00 from BARD, The United States-Israel Binational Agricultural Research and Development Fund. K.J. was supported in part by a graduate student fellowship from the Government of Thailand.
Received on February 3, 2005; revised on April 13, 2005; accepted on April 13, 2005
| REFERENCES |
|---|
Amann, R.I., et al. (1995) Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol. Rev., 59, 143169
Atlas, R.M. and Bartha, R. Microbial Ecology Fundamental and Applications, (1998) Benjamin/Cummings.
Borneman, J. and Triplett, E.W. (1997) Molecular microbial diversity in soils from eastern Amazonia: evidence for unusual microorganisms and microbial population shifts associated with deforestation. Appl. Environ. Microbiol, 63, 26472653[Abstract].
Borneman, J., et al. (2001) Probe selection algorithms with applications in the analysis of microbial communities. Bioinformatics, 17, S39S48[Abstract].
Box, G.E.P. and Cox, D.R. (1964) An analysis of transformations. J. R. Stat. Soc. Ser. B, 26, 211252.
Brazma, A. and Vilo, J. (2000) Gene expression data analysis. FEBS Lett, 480, 1724[CrossRef][Web of Science][Medline].
Bull, A.T., et al. (2000) Search and discovery strategies for biotechnology: the paradigm shift. Microbiol. Mol. Biol. Rev, 64, 573606
Chapela, I.H. (1997) Bioprospecting: myths, realities and potential impact on sustainable development. In Palm, M.E. and Chapela, I.H. (Eds.). Mycology in Sustainable Development: Expanding Concepts, Vanishing Borders, , Boone, NC Parkway Publishers, pp. 238256.
Christensen, M. (1989) A view of fungal ecology. Mycologia, 81, 119[CrossRef].
Crawley, M.J. Statistical Computing: An Introduction to Data Analysis Using S-Plus, (2002) , West Sussex, UK John Wiley.
Drmanac, R. (1999) cDNA screening by array hybridization. Methods Enzymol., 303, 165-178.
Drmanac, R., Lennon, G., Drmanac, S., Labat, I., Crkvenjakov, R. and Lehrach, H. (1991) Supercomputing and the human genome. In Cantor, C. and Lim, H. (eds), Proceedings of the First International Conference on Electrophoresis. World Scientific, Singapore, pp. 60-75.
Fox, G.E., et al. (1977) Comparative cataloging of 16S ribosomal ribonucleic acid: molecular approach to prokaryotic systematics. Int. J. Syst. Bacteriol., 27, 44-57.
Giovannoni, S.J., et al. (1990) Genetic diversity in Sargasso Sea bacterioplankton. Nature, 345, 60-63.
Goddard, W., et al. (1994) The agreement metric for labeled binary trees. Math. Biosci., 123, 215-226.
Johnson, R.A. and Wichern, D.W. (1992) Applied Multivariate Statistical Analysis, 3rd edn. Prentice Hall, NJ.
Lennon, G.S. and Lehrach, H. (1991) Hybridization analyses of arrayed complementary DNA libraries. Trends Genet., 7, 314-317.
Liu, W.T., et al. (1997) Characterization of microbial diversity by determining terminal restriction fragment length polymorphisms of genes encoding 16S rRNA. Appl. Environ. Microbiol., 63, 4516-4522.
Maddison, D.R., et al. (1997) Nexus: an extensible file format for systematic information. Syst. Biol., 46, 590-621.
Muyzer, G., et al. (1993) Profiling of complex microbial populations by denaturing gradient gel electrophoresis analysis of polymerase chain reaction amplified genes coding for 16S rRNA. Appl. Environ. Microbiol., 59, 695-700.
Olsen, G.J., et al. (1986) Microbial ecology and evolution: a ribosomal-RNA approach. Annu. Rev. Microbiol., 40, 337-365.
Pace, N.R. (1997) A molecular view of microbial diversity and the biosphere. Science, 276, 734-740.
Pace, N.R., et al. (1986) The analysis of natural microbial-populations by ribosomal-RNA sequences. Adv. Microb. Ecol., 9, 1-55.
Penny, D. and Hendy, M.D. (1985) The use of tree comparison metrics. Syst. Zool., 34, 75-82.
Press, S.J. (1982) Applied Multivariate Analysis: Including Bayesian and Frequentist Methods of Inference. Krieger Publishing Co., Malabar, FL.
Press, S.J. (1989) Bayesian Statistics: Principles, Models and Applications. John Wiley, NY.
Press, S.J. (2003) Subjective and Objective Bayesian Statistics: Principles, Models, and Applications, 2nd edn. John Wiley, NY.
SAS Institute (1994) SAS/STAT User's Guide: Volume 1, 4th edn. SAS Institute Inc., Cary, NC.
Shmulevich, I. and Zhang, W. (2002) Binary analysis and optimization-based normalization of gene expression data. Bioinformatics, 18, 555-565.
Sogin, S.J., et al. (1972) Phylogenetic measurement in prokaryotes by primary structural characterization. J. Mol. Evol., 1, 173-184.
Swofford, D.L. (2001) PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods) Version 4.0. Sinauer Associates, Sunderland, MA.
Tuomela, M., et al. (2000) Biodegradation of lignin in a compost environment: a review. Bioresource Technol., 72, 169-183.
Valinsky, L., et al. (2002a) Oligonucleotide fingerprinting of ribosomal RNA genes for analysis of fungal community composition. Appl. Environ. Microbiol., 68, 5999-6004.
Valinsky, L., et al. (2002b) Analysis of bacterial community composition by oligonucleotide fingerprinting of rRNA genes. Appl. Environ. Microbiol., 68, 3243-3250.
Vaneechoutte, M., et al. (1992) Rapid identification of bacteria of the comamonadaceae with amplified ribosomal DNA-restriction analysis (ARDRA). FEMS Microbiol. Lett., 93, 227-234.
Venables, W.N. and Ripley, B.D. (1999) Modern Applied Statistics with S-Plus, 3rd edn. Springer-Verlag, NY.
Ward, D.M., et al. (1990) 16S Ribosomal RNA sequences reveal numerous uncultured microorganisms in a natural community. Nature, 345, 63-65.
Woese, C.R. and Fox, G.E. (1977) Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc. Natl Acad. Sci. USA, 74, 5088-5090.
Woese, C.R., et al. (1975) Conservation of primary structure in 16S ribosomal RNA. Nature, 254, 83-86.