Bioinformatics Advance Access originally published online on May 3, 2008
Bioinformatics 2008 24(13):1556-1558; doi:10.1093/bioinformatics/btn217
DAnTE: a statistical tool for quantitative analysis of -omics data
Pacific Northwest National Laboratory, Richland, WA 99352, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: Data Analysis Tool Extension (DAnTE) is a statistical tool designed to address challenges associated with quantitative bottom-up, shotgun proteomics data. This tool has also been demonstrated for microarray data and can easily be extended to other high-throughput data types. DAnTE features selected normalization methods, missing value imputation algorithms, peptide-to-protein rollup methods, an extensive array of plotting functions and a comprehensive hypothesis-testing scheme that can handle unbalanced data and random effects. The graphical user interface (GUI) is designed to be very intuitive and user friendly.
Availability: DAnTE may be downloaded free of charge at http://omics.pnl.gov/software/
Contact: rds{at}pnl.gov or proteomics{at}pnl.gov
Supplementary information: An example dataset with instructions on how to perform a series of analysis steps is available at http://omics.pnl.gov/software/
| 1 INTRODUCTION |
|---|
Although a number of tools are available for high-throughput microarray data processing (Gentleman et al., 2004; Saeed et al., 2003), data from LC–MS-based quantitative bottom-up proteomics measurements (i.e. label-free approaches, stable isotope labeling methods, spectral counting approaches and the Accurate Mass and Time Tag method) pose challenges that these tools were not designed to address. A major issue with proteomics data is the extent of missing values, which arises largely because a large number of species lie near the detection threshold and which leads to unbalanced datasets. In addition, proteomics data involve another level of grouping, or rollup, information that maps peptides to proteins: peptide abundances are often used to infer the corresponding protein abundances.
Developed to address the issues common to proteomics data, Data Analysis Tool Extension (DAnTE) is readily extendable. Though the target application is high-throughput proteomics, DAnTE has also been successfully demonstrated for microarray data analysis and can readily be applied to other forms of high-throughput omics data that bear similar characteristics (e.g. metabolomics data). A screenshot of the DAnTE user interface is shown in Figure 1.
| 2 DESCRIPTION |
|---|
2.1 Dependencies
The graphical user interface (GUI) of DAnTE is implemented in C#, and the core algorithms are implemented in the open source R statistical environment (R Development Core Team, 2008). DAnTE runs on the Microsoft Windows XP platform within the .NET 2.0 framework. Connectivity between R and the C#/.NET environment is achieved through the open source R(D)COM server application (Baier and Neuwirth, 2007). This choice of environments makes DAnTE a very user-friendly software tool, although it also means that DAnTE cannot be integrated into the popular Bioconductor project (Gentleman et al., 2004).
2.2 Application features
2.2.1 Data loading
The input data to DAnTE can be any file that stores tabular data, including flat files (either CSV or tab-delimited text files) and Microsoft Excel files. A unique feature of the data loading mechanism is that it preserves peptide-to-protein mapping information for later use, both in plotting the peptides that belong to a particular protein and in the peptide-to-protein rollup methods. In addition, DAnTE can process SEQUEST (Eng et al., 1994) results and create spectral count tables.
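As a rough illustration of what keeping the peptide-to-protein mapping at load time might involve, the sketch below reads a small tab-delimited table into per-peptide abundances plus a protein-to-peptides dictionary. The column layout (`Protein`, `Peptide`, then one column per dataset), the function name `load_peptide_table` and the treatment of empty cells as missing values are assumptions made for this example; they are not DAnTE's actual file format or code.

```python
import csv
import io
from collections import defaultdict


def load_peptide_table(handle, delimiter="\t"):
    """Read a tabular peptide file into abundances plus a protein->peptides map.

    Assumed (hypothetical) layout: first column protein ID, second column
    peptide ID, remaining columns one abundance per dataset. Empty cells are
    treated as missing values (None).
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    samples = header[2:]
    abundances = {}                          # peptide -> list of floats/None
    protein_to_peptides = defaultdict(list)  # protein -> list of peptide IDs
    for row in reader:
        protein, peptide, values = row[0], row[1], row[2:]
        abundances[peptide] = [float(v) if v.strip() else None for v in values]
        protein_to_peptides[protein].append(peptide)
    return samples, abundances, dict(protein_to_peptides)


# Small in-memory example standing in for a tab-delimited file on disk.
example = io.StringIO(
    "Protein\tPeptide\tRun1\tRun2\n"
    "P1\tAAK\t10.5\t11.0\n"
    "P1\tLLR\t\t9.2\n"
    "P2\tGGK\t7.1\t\n"
)
samples, abundances, mapping = load_peptide_table(example)
```

Keeping the mapping as a separate dictionary means that later steps, such as plotting the peptides of one protein or rolling them up, can look peptides up by protein without re-reading the file.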
2.2.2 Factor definitions
Factors are used to capture the fixed and random effects in experimental design. For example, the biological condition is a fixed effect factor, while a list of liquid chromatography (LC) columns used to separate the samples can be treated as a random effect. This information is vital in normalization, imputation and hypothesis testing methods in DAnTE. Factors can either be declared once the data is loaded or be loaded from a flat file.
2.2.3 Investigative plots
Various statistical plots, including histograms, box plots, correlation diagrams and MA (or R-I: ratio-intensity) plots, can be generated in DAnTE. These plots help the user evaluate reproducibility within the study set and single out problematic datasets so that they can be excluded from further analysis.
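For readers unfamiliar with MA plots, the quantities plotted can be sketched as follows: for a pair of datasets, M is the log ratio and A is the average log intensity of each feature. The function name `ma_values`, the use of base-2 logarithms and the choice to skip features missing in either dataset are assumptions for this illustration rather than a description of DAnTE's internals.

```python
import math


def ma_values(x, y):
    """Return (M, A) lists for two datasets of positive abundances.

    M = log2(x) - log2(y) is the log ratio; A = (log2(x) + log2(y)) / 2 is the
    average log intensity. Features missing (None) in either dataset are skipped.
    """
    m_vals, a_vals = [], []
    for xi, yi in zip(x, y):
        if xi is None or yi is None:
            continue
        lx, ly = math.log2(xi), math.log2(yi)
        m_vals.append(lx - ly)
        a_vals.append((lx + ly) / 2)
    return m_vals, a_vals


# Third feature is missing in the first dataset, so it is dropped.
m, a = ma_values([8.0, 4.0, None, 16.0], [2.0, 4.0, 1.0, 4.0])
```

Plotting M against A for two replicate datasets should give a cloud centered on M = 0; a systematic trend away from zero suggests an intensity-dependent bias.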
2.2.4 Data normalization
As normalization is arguably the most important step in downstream data analysis, DAnTE employs several normalization methods that have been successfully tested for both proteomics data (Callister et al., 2006) and microarray genomics data (Quackenbush, 2002; Smyth et al., 2003). Among them are a robust linear regression method, a lowess method and a quantile normalization method. In addition, a global intensity adjustment based on median absolute deviation (MAD) and central tendency adjustment methods are available.
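The two simplest adjustments named above can be sketched in a few lines. The example assumes log-scale abundances, one list per dataset, with `None` marking missing values. Taking the mean of the per-dataset MADs as the common target spread is an assumption of this sketch, and the function names are illustrative; DAnTE's own implementation may differ.

```python
from statistics import median


def median_center(column):
    """Central tendency adjustment: subtract the median of observed values."""
    observed = [v for v in column if v is not None]
    center = median(observed)
    return [None if v is None else v - center for v in column]


def mad(column):
    """Median absolute deviation of the observed (non-None) values."""
    observed = [v for v in column if v is not None]
    center = median(observed)
    return median(abs(v - center) for v in observed)


def mad_scale(columns):
    """Median-center each dataset, then rescale so that all share one MAD.

    The target spread is the mean of the per-dataset MADs (an assumption made
    for this sketch). A dataset whose MAD is zero would need special handling.
    """
    mads = [mad(col) for col in columns]
    target = sum(mads) / len(mads)
    scaled = []
    for col, m in zip(columns, mads):
        centered = median_center(col)
        scaled.append([None if v is None else v * target / m for v in centered])
    return scaled


runs = [[1.0, 2.0, 3.0, None], [10.0, 14.0, 18.0, 22.0]]
normalized = mad_scale(runs)
```

After the adjustment both datasets are centered on zero and have the same MAD, so their distributions are directly comparable in box plots.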
2.2.5 Missing value imputation
Incomplete datasets due to missing values are common in high-throughput proteomics. As imputing these values is a much-debated topic (Troyanskaya et al., 2001), DAnTE offers several simple methods as well as some advanced algorithms to choose from. The simple methods allow the user to fill in missing values with either the dataset mean/median or a pre-chosen constant. Advanced methods include filling in with a row mean based on a user-defined factor, K-nearest neighbor imputation (KNNimpute) and singular value decomposition-based imputation (SVDimpute).
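The factor-based row mean method can be illustrated with a minimal sketch: each missing value in a peptide's row is replaced by the mean of that peptide's observed values in datasets sharing the same factor level. The function name and the decision to leave a value missing when its factor level has no observations are assumptions for this example.

```python
def impute_by_factor(row, factor_levels):
    """Fill missing values (None) with the mean of the observed values that
    share the same factor level; leave them missing if that level has no
    observations at all.
    """
    sums, counts = {}, {}
    for value, level in zip(row, factor_levels):
        if value is not None:
            sums[level] = sums.get(level, 0.0) + value
            counts[level] = counts.get(level, 0) + 1
    filled = []
    for value, level in zip(row, factor_levels):
        if value is None and level in counts:
            filled.append(sums[level] / counts[level])
        else:
            filled.append(value)
    return filled


# One peptide measured in six datasets under two biological conditions.
condition = ["control", "control", "control", "treated", "treated", "treated"]
peptide = [10.0, None, 12.0, 20.0, 22.0, None]
imputed = impute_by_factor(peptide, condition)
```

Here the missing control value is filled with the control mean and the missing treated value with the treated mean, which avoids pulling the two conditions toward each other as a plain row mean would.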
2.2.6 Peptide-to-protein rollup
In most proteomics methods, peptide measurements are rolled up to corresponding protein abundances. Ideally, all peptides from a single protein should have similar abundances that manifest as similar signal intensities; in reality, however, many factors, such as digestion efficiency and electrospray ionization efficiency, can affect the identifications and abundances or signal intensities of peptides. In the RRollup method available in DAnTE, peptides that originate from the same protein are first scaled on the basis of a chosen reference peptide in order to bring all peptide profiles across biological conditions to the same level and are then averaged to obtain the protein abundance. During scaling, the peptide with the most observations is chosen as the reference peptide, and its total abundance across datasets is used as a tiebreaker. In the ZRollup method, a scaling method similar to z-scores (except that medians instead of means from peptide profiles across biological conditions are used) is applied first to peptides that originate from a single protein, and the scaled peptides are then averaged to obtain the relative protein abundance. In both the RRollup and ZRollup methods, outlying peptide values are excluded from protein abundance calculations using Grubbs' outlier test (Grubbs, 1969). In the third method, QRollup, peptides are selected on the basis of a user-selected abundance cutoff value, and protein abundance is calculated as the average of these selected peptides.
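The reference-peptide idea behind RRollup can be sketched for a single protein as follows. The reference selection follows the description above (most observations, total abundance as tiebreaker). The scaling rule used here, shifting each log-scale peptide profile by its median difference from the reference over the datasets where both were observed, is an assumption of this sketch, as is dropping peptides that never overlap with the reference. The Grubbs' outlier step is omitted for brevity, and the function name `rrollup` is only illustrative.

```python
from statistics import mean, median


def rrollup(peptides):
    """Reference-peptide rollup sketch for one protein.

    `peptides` maps peptide name -> list of log abundances (None = missing),
    one entry per dataset. Returns the reference peptide name and one protein
    abundance per dataset.
    """
    def n_observed(values):
        return sum(v is not None for v in values)

    def total(values):
        return sum(v for v in values if v is not None)

    # Reference: most observations, total abundance as the tiebreaker.
    ref_name = max(
        peptides, key=lambda p: (n_observed(peptides[p]), total(peptides[p]))
    )
    ref = peptides[ref_name]

    scaled = [ref]
    for name, values in peptides.items():
        if name == ref_name:
            continue
        # Assumed scaling rule: shift by the median difference to the
        # reference over datasets where both peptides were observed.
        diffs = [r - v for r, v in zip(ref, values)
                 if r is not None and v is not None]
        if not diffs:
            continue  # no overlap with the reference; peptide is left out
        shift = median(diffs)
        scaled.append([None if v is None else v + shift for v in values])

    # Average the scaled peptides dataset by dataset, ignoring missing values.
    protein = []
    for column in zip(*scaled):
        observed = [v for v in column if v is not None]
        protein.append(mean(observed) if observed else None)
    return ref_name, protein


peptides = {
    "pepA": [10.0, 11.0, 12.0, 13.0],
    "pepB": [8.0, None, 10.0, 11.0],
    "pepC": [None, 14.0, 15.0, None],
}
ref_name, protein = rrollup(peptides)
```

In this toy example the three peptides share the same profile shape at different absolute levels, so after scaling they coincide and the protein profile equals that of the reference peptide.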
2.2.7 Analytical algorithms
DAnTE offers several well-characterized algorithms to further explore patterns in the data. Traditional principal component analysis (Jolliffe, 2002) and associated scores and loadings plots can be useful as an unsupervised way of finding the principal variation in the data. In contrast, the partial least squares method (Wold et al., 1984) available in DAnTE can be used as a discrimination procedure whereby the grouping information is assigned using factors. Hierarchical and k-means clustering methods on features/samples are also available as part of the heat map plotting function.
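To make the scores and loadings terminology concrete, the sketch below extracts the first principal component of a small complete data matrix. Power iteration on the covariance matrix is used here only to keep the example dependency-free; it is a stand-in for the eigen- or SVD-based routines that a statistical environment would normally provide, and it is not taken from DAnTE. The function name and the starting vector are arbitrary choices.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) of the first principal component.

    `rows` is a list of samples, each a list of feature values with no
    missing entries. Assumes the data have non-zero variance and that the
    starting vector is not orthogonal to the leading component.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the features.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]

    vec = [1.0 / (j + 1) for j in range(p)]  # arbitrary non-uniform start
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]

    # Scores are the projections of the centered samples onto the loadings.
    scores = [sum(c[j] * vec[j] for j in range(p)) for c in centered]
    return vec, scores


samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
loadings, scores = first_principal_component(samples)
```

The loadings describe how each feature contributes to the component, while the scores place each sample along it; these are the quantities shown in loadings and scores plots.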
2.2.8 Hypothesis testing
A comprehensive analysis of variance (ANOVA) scheme for unbalanced studies that uses marginal sums of squares (Fox, 1997) and mixed models (Pinheiro and Bates, 2000) is included in DAnTE. The user can also test for interactions among factors in a multi-way ANOVA. The q-values are calculated along with the p-values in order to control the false discovery rate in multiple testing (Storey, 2002). In addition, DAnTE can check whether the data follow a normal distribution by employing the Shapiro–Wilk test and features two non-parametric hypothesis tests (the Wilcoxon rank sum test and the Kruskal–Wallis test) for use when the normality assumption fails to hold.
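How q-values relate to p-values can be sketched with a simplified estimator in the spirit of Storey (2002). The proportion of true null hypotheses, pi0, is estimated from the p-values above a tuning parameter, and each q-value is the smallest estimated false discovery rate at which that test would be called significant. Fixing the tuning parameter at 0.5 instead of choosing it adaptively is a simplification made for this example, and the function name is illustrative; the sketch should not be read as DAnTE's implementation.

```python
def storey_qvalues(pvalues, lam=0.5):
    """Estimate q-values from p-values with a fixed tuning parameter `lam`.

    pi0 (the proportion of true nulls) is estimated as
    #{p > lam} / (m * (1 - lam)), capped at 1. The q-values are the running
    minimum of pi0 * m * p_(i) / i, taken from the largest p-value downward.
    """
    m = len(pvalues)
    pi0 = min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # largest p-value first
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[idx] / rank)
        qvalues[idx] = running_min
    return pi0, qvalues


pvals = [0.001, 0.01, 0.02, 0.04, 0.3, 0.6, 0.8, 0.9]
pi0, qvals = storey_qvalues(pvals)
```

With pi0 fixed at 1 the same recursion reduces to the Benjamini–Hochberg adjusted p-values, so the pi0 estimate is what makes the q-value approach less conservative.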
| 3 SUMMARY |
|---|
DAnTE is designed as a complete downstream analysis tool that incorporates a host of algorithms for large-scale bottom-up proteomics data. The tool features an interactive GUI and harnesses the power of the R statistical environment; its uniqueness lies in its ability to handle incomplete data and to roll peptides up to proteins. Though designed specifically for analyzing proteomics data, DAnTE has also been successfully demonstrated on genomics microarray data.
| ACKNOWLEDGEMENTS |
|---|
The authors thank Joel Pounds, Susan Varnum, and Kim Hixson for their many suggestions and extensive testing; and Thomas O. Metz for data and support for early methods development.
Funding: Portions of this research were supported by the National Institute of General Medical Sciences (NIGMS, Large Scale Collaborative Research Grants U54 GM-62119-02), the NIH National Center for Research Resources (RR18522), Laboratory Directed Research and Development (LDRD) program (W.-J.Q.) at Pacific Northwest National Laboratory (PNNL) and the National Institute of Allergy and Infectious Diseases NIH/DHHS (through interagency agreement Y1-AI-4894-01). Work was performed at PNNL in the Environmental Molecular Sciences Laboratory, a national scientific user facility sponsored by the US Department of Energy (DOE) Office of Biological and Environmental Research. PNNL is operated by Battelle for the DOE under contract DE-AC05-76RLO-1830.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: John Quackenbush
Received on February 22, 2008; revised on April 9, 2008; accepted on April 29, 2008
| REFERENCES |
|---|
Baier T, Neuwirth E. R(D)COM Server V2.01. (2007) Available at http://sunsite.univie.ac.at/rcom/ (last accessed date May 23, 2008).
Callister SJ, et al. Normalization approaches for removing systematic biases associated with mass spectrometry and label-free proteomics. J. Proteome Res. (2006) 5:277–286.
Eng JK, et al. An approach to correlate tandem mass-spectral data of peptides with amino-acid-sequences in a protein database. J. Am. Soc. Mass Spectrom. (1994) 5:976–989.
Fox J. Applied Regression Analysis, Linear Models, and Related Methods. (1997) Thousand Oaks, CA: Sage Publications.
Gentleman RC, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. (2004) 5:R80.
Grubbs F. Procedures for detecting outlying observations in samples. Technometrics (1969) 11:1–21.
Jolliffe IT. Principal Component Analysis. (2002) New York: Springer.
Pinheiro JC, Bates DM. Mixed-Effects Models in S and S-PLUS. (2000) New York: Springer.
Quackenbush J. Microarray data normalization and transformation. Nat. Genet. (2002) 32:496–501 (Suppl.).
R Development Core Team. R: a language and environment for statistical computing. (2008) Vienna, Austria: R Foundation for Statistical Computing. Available at http://www.R-project.org.
Saeed AI, et al. TM4: a free, open-source system for microarray data management and analysis. Biotechniques (2003) 34:374–378.
Smyth GK, et al. Statistical issues in cDNA microarray data analysis. Methods Mol. Biol. (2003) 224:111–136.
Storey JD. A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B Stat. Methodol. (2002) 64:479–498.
Troyanskaya O, et al. Missing value estimation methods for DNA microarrays. Bioinformatics (2001) 17:520–525.
Wold S, et al. Modeling data tables by principal components and pls – class patterns and quantitative predictive relations. Analusis (1984) 12:477–485.