

Bioinformatics Advance Access originally published online on April 26, 2007
Bioinformatics 2007 23(12):1562-1564; doi:10.1093/bioinformatics/btm127

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Libaffy: software for processing Affymetrix(R) GeneChip(R) data

Steven A. Eschrich 1,* and Andrew M. Hoerter 1

1H. Lee Moffitt Cancer Center and Research Institute, University of South Florida, Tampa, FL 33612, USA

*To whom correspondence should be addressed.


    ABSTRACT
 

Summary: Affymetrix(R) GeneChip(R) microarrays are used in a growing number of gene expression studies, and individual studies use increasingly large numbers of chips. A software library was developed that supports Affymetrix file formats and implements two popular summary algorithms (MAS5.0 and RMA). The library is modular in design for integration into larger systems and processing pipelines. Additionally, a graphical interface (GENE) was developed to allow end-user access to the functionality within the library.

Availability: libaffy is free to use under the GNU GPL. The source code and Windows binaries can be freely accessed from the website http://src.moffitt.usf.edu/libaffy. API documentation and a user manual are also available.

Contact: Steven.Eschrich{at}moffitt.org


    1 INTRODUCTION
 
Affymetrix(R) GeneChip(R) microarrays have become increasingly popular as a platform for gene expression experiments. This popularity has led to more studies using these chips and also to studies with larger numbers of chips. The number of probe-sets within a single chip has also increased significantly, from 12 600 on the HG-U95A to over 54 000 on the HG-U133Plus chip (www.affymetrix.com). Efficient, open-source methods for calculating gene expression values from these data, and for integrating those methods into larger software systems, are both clear needs.

One of the most popular Affymetrix expression algorithms is the MAS5.0 algorithm (Affymetrix, 2002). An efficient, programmatic interface to this algorithm would be useful in large-scale batch processing of microarrays. In addition, model-based techniques for generating gene expression indices have been shown to have better sensitivity than MAS5.0. In these techniques, probe-set expression values are modeled as weighted combinations of individual probes. However, the parameters for these models are determined through model-fitting techniques that require all available chips simultaneously. The RMA algorithm (Bolstad et al., 2003; Irizarry et al., 2003a, b) uses the median polish approach to obtain estimates of probe-binding affinity, which can then be used to compute gene expression indices for a probe-set.
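
To make the median polish step concrete, the following is a minimal C99 sketch for a single probe-set, assuming the input is a small matrix of log-scale probe intensities with one row per probe and one column per chip. The function names (`median_polish`, `median`) and the fixed iteration count are illustrative assumptions and are not taken from libaffy or Bioconductor. Row effects play the role of probe affinities; the overall level plus each column effect gives the per-chip expression estimate.

```c
#include <stdlib.h>
#include <stddef.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n values taken every `stride` elements; scratch holds n doubles. */
static double median(const double *v, size_t n, size_t stride, double *scratch)
{
    size_t i;
    for (i = 0; i < n; i++)
        scratch[i] = v[i * stride];
    qsort(scratch, n, sizeof *scratch, cmp_double);
    return (n % 2) ? scratch[n / 2]
                   : 0.5 * (scratch[n / 2 - 1] + scratch[n / 2]);
}

/* Hypothetical helper: z is an nrows x ncols row-major matrix
 * (rows = probes, columns = chips). On return z holds the residuals and
 * expr[j] = overall level + column effect for chip j.
 * Returns 0 on success, -1 on bad input or allocation failure. */
int median_polish(double *z, size_t nrows, size_t ncols,
                  int iterations, double *expr)
{
    size_t big = nrows > ncols ? nrows : ncols;
    double *scratch, *row_eff, *col_eff;
    double overall = 0.0, m;
    size_t i, j;
    int it;

    if (nrows == 0 || ncols == 0)
        return -1;
    scratch = malloc(big * sizeof *scratch);
    row_eff = malloc(nrows * sizeof *row_eff);
    col_eff = malloc(ncols * sizeof *col_eff);
    if (!scratch || !row_eff || !col_eff) {
        free(scratch); free(row_eff); free(col_eff);
        return -1;
    }
    for (i = 0; i < nrows; i++) row_eff[i] = 0.0;
    for (j = 0; j < ncols; j++) col_eff[j] = 0.0;

    for (it = 0; it < iterations; it++) {
        /* Sweep row medians out of the residuals into the row effects. */
        for (i = 0; i < nrows; i++) {
            m = median(z + i * ncols, ncols, 1, scratch);
            for (j = 0; j < ncols; j++) z[i * ncols + j] -= m;
            row_eff[i] += m;
        }
        /* Re-centre the column effects, moving their median to overall. */
        m = median(col_eff, ncols, 1, scratch);
        for (j = 0; j < ncols; j++) col_eff[j] -= m;
        overall += m;
        /* Sweep column medians out of the residuals. */
        for (j = 0; j < ncols; j++) {
            m = median(z + j, nrows, ncols, scratch);
            for (i = 0; i < nrows; i++) z[i * ncols + j] -= m;
            col_eff[j] += m;
        }
        /* Re-centre the row effects likewise. */
        m = median(row_eff, nrows, 1, scratch);
        for (i = 0; i < nrows; i++) row_eff[i] -= m;
        overall += m;
    }
    for (j = 0; j < ncols; j++)
        expr[j] = overall + col_eff[j];

    free(scratch); free(row_eff); free(col_eff);
    return 0;
}
```

A production implementation would typically iterate until the residuals stop changing rather than for a fixed count; the fixed count here only keeps the sketch short.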

Although model-fitting approaches can be more sensitive at detecting small changes in expression, they require far more computational resources. Since all chips must be accessed simultaneously, a significant amount of memory is required. This requirement, combined with increasing numbers of chips and of probe-sets per chip, motivated our emphasis on efficiency within this implementation, particularly memory efficiency.

The Bioconductor project (http://www.bioconductor.org) provides implementations of both the MAS5.0 and RMA algorithms (Gautier et al., 2004). In fact, several of the RMA components in libaffy are based on these implementations. However, integration of Bioconductor routines within C applications can be difficult. Additionally, there are several key efficiency steps implemented within the libaffy software. RMAExpress (Bolstad, 2003a) is another effort to create a stand-alone RMA implementation that integrates Bioconductor code outside of the R environment. We focus on a full C implementation of these algorithms (including Affymetrix file processing) without Bioconductor dependencies. As a result, this software library can also be embedded within various applications. A Windows-based R package is available at our website as an example of this integration.


    2 DESCRIPTION
 
The software was designed to be modular for easy modification and integration with other applications. It consists of three different components: libaffy (GeneChip(R)-based processing software), GENE (applications for end-user access) and libutils (a utility library). All of the code for processing GeneChip(R) data is located within libaffy, with many different entry points to this code possible. Full documentation for the functions and corresponding arguments is available at the software website listed earlier. A graphical user interface called GENE (Gene Expression and Normalization Engine) and several command line programs were built for end-user access to the MAS5.0 and RMA algorithms.

The libaffy software includes file access routines and MAS5.0 normalization, based on documented file structures and algorithms from Affymetrix (http://www.affymetrix.com). Support exists to read DAT, CDF and CEL files in both version 3 (text) and version 4 (binary) formats. RMA was implemented based on published descriptions and the Bioconductor source code. Both RMA and MAS5.0 algorithms were implemented in a configurable way such that each step (background correction, normalization and summarization) can be called individually or disabled at runtime. Thus, parts or all of these algorithms can be used without modification of the library code. Various algorithm options are controlled by a flag structure defined within the library.
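
As an illustration of this kind of runtime configuration, the C99 sketch below shows how a flag structure could switch individual pipeline steps on or off. All names (`pipeline_flags`, `run_pipeline`, the `stage_*` functions) are hypothetical and do not reflect the actual libaffy definitions, and the stage bodies are simple placeholders rather than the real RMA or MAS5.0 computations.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag structure; NOT the structure defined by libaffy. */
typedef struct {
    bool background;  /* run background correction? */
    bool normalize;   /* run normalization?         */
    bool summarize;   /* run summarization?         */
} pipeline_flags;

/* Placeholder stage: shift values so that the minimum becomes zero. */
static void stage_background(double *x, size_t n)
{
    size_t i;
    double min;
    if (n == 0) return;
    min = x[0];
    for (i = 1; i < n; i++)
        if (x[i] < min) min = x[i];
    for (i = 0; i < n; i++)
        x[i] -= min;
}

/* Placeholder stage: rescale values so that their mean equals target. */
static void stage_normalize(double *x, size_t n, double target)
{
    size_t i;
    double sum = 0.0, scale;
    for (i = 0; i < n; i++) sum += x[i];
    if (n == 0 || sum == 0.0) return;
    scale = target * (double)n / sum;
    for (i = 0; i < n; i++) x[i] *= scale;
}

/* Placeholder stage: collapse all values to their mean, stored in x[0].
 * Returns the new number of values (1, or 0 for empty input). */
static size_t stage_summarize(double *x, size_t n)
{
    size_t i;
    double sum = 0.0;
    if (n == 0) return 0;
    for (i = 0; i < n; i++) sum += x[i];
    x[0] = sum / (double)n;
    return 1;
}

/* Run only the stages enabled in f; returns the number of output values. */
size_t run_pipeline(double *x, size_t n, const pipeline_flags *f)
{
    if (f->background) stage_background(x, n);
    if (f->normalize)  stage_normalize(x, n, 100.0);
    if (f->summarize)  n = stage_summarize(x, n);
    return n;
}
```

With this layout a caller can, for example, apply only the background step to data that will be normalized elsewhere, without recompiling the library.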

Applications were developed from the library of normalization routines for easy end-user access to the software. Command line programs allow low-overhead access to the library and provide implementation examples. The graphical interface, GENE, provides access to both the RMA and MAS5.0 algorithms and includes graphical controls over many of the configurable parameters. These applications are thin software layers that wrap the functionality of libaffy but do not extend it. The library therefore serves both as the back end for end-user applications and as embeddable functionality for microarray processing pipelines.

2.1 Implementation
The software was developed in ANSI C and has been compiled on various platforms, including Microsoft Windows, OS X, Solaris and Linux. The graphical interface (GENE) was designed using wxWidgets, a multiplatform GUI component library.

File access routines support reading DAT, CDF and CEL files. Binary file access is supported across platforms by a series of macros determined at compile time. Affymetrix scanner image (DAT) file support is designed so that the entire image is stored in memory. Regions are defined as pointers into this image, allowing flexible and efficient access to pixels. Pixel regions are accessible by probe-set identifier or by pixel coordinates. Probe-level data is stored within a two-dimensional matrix according to the CEL file location. Both the outlier and masked probe lists are loaded and stored as individual bits within a bit-string for maximum memory efficiency (one bit per list, i.e. two bits per cell). The CDF structure provides mappings from the raw intensity matrix to perfect match (PM) and mismatch (MM) probes, mappings to probe-sets and direct coordinate mapping. Pointers are used to link probes to probe-sets, thereby providing efficient traversal of data from physical location or logical grouping.
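
The bit-string storage described above can be sketched in C as follows. This is a minimal illustration, assuming one bit-string per flag type (outlier, masked) indexed by cell position; the type and function names (`bitset`, `bitset_set`, `cell_index`, and so on) are hypothetical and are not the libaffy data structures.

```c
#include <limits.h>
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical bit-string type: one bit per chip cell. */
typedef struct {
    unsigned char *bits;
    size_t nbits;
} bitset;

/* Allocate a zeroed bit-string of nbits bits. Returns 0 on success. */
int bitset_init(bitset *b, size_t nbits)
{
    size_t nbytes = (nbits + CHAR_BIT - 1) / CHAR_BIT;
    if (nbytes == 0)
        nbytes = 1;               /* avoid a zero-sized allocation */
    b->bits = calloc(nbytes, 1);
    b->nbits = (b->bits != NULL) ? nbits : 0;
    return (b->bits != NULL) ? 0 : -1;
}

/* Set bit i (caller guarantees i < nbits). */
void bitset_set(bitset *b, size_t i)
{
    b->bits[i / CHAR_BIT] |= (unsigned char)(1u << (i % CHAR_BIT));
}

/* Return 1 if bit i is set, 0 otherwise (caller guarantees i < nbits). */
int bitset_test(const bitset *b, size_t i)
{
    return (b->bits[i / CHAR_BIT] >> (i % CHAR_BIT)) & 1u;
}

void bitset_free(bitset *b)
{
    free(b->bits);
    b->bits = NULL;
    b->nbits = 0;
}

/* Map cell (x, y) on a chip with `cols` columns to a bit index. */
size_t cell_index(size_t x, size_t y, size_t cols)
{
    return y * cols + x;
}
```

Keeping one such bit-string for outliers and one for masked cells costs two bits per cell, compared with at least two bytes per cell if each flag were stored as a separate `char`.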

The MAS5.0 implementation was developed according to the algorithm description provided by Affymetrix (Affymetrix, 2002). Discrepancies in results compared to the Affymetrix implementation have been noted elsewhere (Bolstad, 2003b) and are also observed with libaffy. Compatibility with the Bioconductor software is supported using a configuration option; the libaffy website contains more detail on the issue of correspondence to Affymetrix results. The efficiency of the MAS5.0 implementation derives from processing each chip individually, which keeps memory overhead low. By explicitly managing allocated memory, we can summarize a chip and free the memory holding its probe-level data before processing the next chip.
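
The chip-at-a-time memory pattern can be sketched as below. This is an illustration only: `load_chip`, `summarize_chip` and `process_chips` are hypothetical names, the loader fills in synthetic values instead of parsing a CEL file, and the summary is a plain mean rather than the robust estimator that MAS5.0 specifies.

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical stand-in for reading one chip's probe-level data.
 * A real loader would parse a CEL file; here every probe on chip k
 * simply gets the synthetic value k + 1. Caller frees the result. */
double *load_chip(size_t chip, size_t nprobes)
{
    size_t i;
    double *p = malloc((nprobes ? nprobes : 1) * sizeof *p);
    if (p == NULL)
        return NULL;
    for (i = 0; i < nprobes; i++)
        p[i] = (double)(chip + 1);
    return p;
}

/* Placeholder summary: plain mean of the probe values. */
double summarize_chip(const double *p, size_t n)
{
    size_t i;
    double sum = 0.0;
    if (n == 0)
        return 0.0;
    for (i = 0; i < n; i++)
        sum += p[i];
    return sum / (double)n;
}

/* Process chips one at a time so that only one chip's probe-level data
 * is allocated at any moment. out must hold nchips values.
 * Returns 0 on success, -1 if a chip could not be loaded. */
int process_chips(size_t nchips, size_t nprobes, double *out)
{
    size_t chip;
    for (chip = 0; chip < nchips; chip++) {
        double *probes = load_chip(chip, nprobes);
        if (probes == NULL)
            return -1;
        out[chip] = summarize_chip(probes, nprobes);
        free(probes);  /* release probe data before loading the next chip */
    }
    return 0;
}
```

Peak memory in this pattern is one chip's probe data plus the small array of summaries, regardless of how many chips are processed.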

The RMA implementation was derived from the published literature (Bolstad et al., 2003; Irizarry et al., 2003a, b) and the open-source code provided within the Bioconductor software. An advantage of the RMA implementation within libaffy is the stand-alone nature of the code, making it easy to integrate into processing pipelines without requiring the invocation of R. The RMA implementation contains several modifications that make it computationally efficient while producing identical results. Model-based approaches such as RMA require all samples to be processed at the same time. Because RMA uses only perfect-match (PM) values, these can be extracted from the CEL file and the remaining chip information can be released.

The quantile normalization step of RMA was redesigned to first sort each PM array and accumulate partial sums (for the mean calculation across all chips). This does not require additional storage other than the partial sums (one array for the entire set). Once a chip's contribution to the partial sums has been computed, the ranks of its probes are stored within the PM array in place of the values. Once all chips have been processed, the mean is computed from the accumulated sums and distributed back to the individual chips (using the stored ranks). This method eliminates the need to revisit the in-memory chip values when computing the mean. With many chips, much of this memory would be swapped to disk by the operating system while the chips are being loaded. Therefore, minimizing the number of passes through the individual arrays minimizes the associated page faults, improving overall efficiency.
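
The partial-sum scheme for quantile normalization can be sketched in C as follows. This is a simplified illustration under stated assumptions, not the libaffy code: the names (`accumulate_chip`, `distribute_chip`, `quantile_normalize`) are hypothetical, ranks are stored as doubles in the PM array, tied values receive distinct ranks instead of special handling, and a temporary sort buffer is allocated per chip for clarity.

```c
#include <stdlib.h>
#include <stddef.h>

typedef struct {
    double value;   /* PM intensity                 */
    size_t index;   /* original position on the chip */
} pm_entry;

static int cmp_entry(const void *a, const void *b)
{
    const pm_entry *x = a, *y = b;
    if (x->value < y->value) return -1;
    if (x->value > y->value) return 1;
    return (x->index > y->index) - (x->index < y->index);
}

/* Pass 1 for one chip: add its sorted values into sums[], then overwrite
 * each PM value with its rank. Returns 0 on success, -1 on failure. */
int accumulate_chip(double *pm, size_t n, double *sums)
{
    size_t i;
    pm_entry *tmp;
    if (n == 0)
        return 0;
    tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    for (i = 0; i < n; i++) {
        tmp[i].value = pm[i];
        tmp[i].index = i;
    }
    qsort(tmp, n, sizeof *tmp, cmp_entry);
    for (i = 0; i < n; i++) {
        sums[i] += tmp[i].value;        /* partial sum for rank i   */
        pm[tmp[i].index] = (double)i;   /* store rank in place      */
    }
    free(tmp);
    return 0;
}

/* Pass 2 for one chip: replace each stored rank by the mean at that rank. */
void distribute_chip(double *pm, size_t n, const double *sums, size_t nchips)
{
    size_t i;
    for (i = 0; i < n; i++)
        pm[i] = sums[(size_t)pm[i]] / (double)nchips;
}

/* Quantile-normalize nchips arrays of n PM values each, in place. */
int quantile_normalize(double **chips, size_t nchips, size_t n)
{
    size_t c, i;
    double *sums;
    if (nchips == 0 || n == 0)
        return 0;
    sums = malloc(n * sizeof *sums);
    if (sums == NULL)
        return -1;
    for (i = 0; i < n; i++)
        sums[i] = 0.0;
    for (c = 0; c < nchips; c++) {
        if (accumulate_chip(chips[c], n, sums) != 0) {
            free(sums);
            return -1;
        }
    }
    for (c = 0; c < nchips; c++)
        distribute_chip(chips[c], n, sums, nchips);
    free(sums);
    return 0;
}
```

Each chip is touched once to accumulate and once to distribute; the only storage shared across chips is the single `sums` array.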

Figure 1 provides timing results for libaffy as compared to the Bioconductor implementations using 36 HG-U133A chips (Ge et al., 2005) on a 3 GHz Pentium 4 PC with 2 GB of RAM. The libaffy implementation of RMA runs slightly longer than Bioconductor, since libaffy loads and parses a text CDF as opposed to the binary format used in Bioconductor. Figure 2 shows the significantly lower memory usage in libaffy and Table 1 details the high level of agreement in results between implementations.


Fig. 1. Timing of libaffy versus Bioconductor on 36 HG-U133A arrays.

 

Fig. 2. Memory usage for libaffy versus Bioconductor on 36 HG-U133A arrays.

 

Table 1. Expression differences between implementations

 

    3 CONCLUSION
 
The libaffy software applications and library were developed to provide a modular, efficient platform for several Affymetrix(R) GeneChip(R) processing algorithms. The design focused on modularity to facilitate integration into microarray processing pipelines and to provide a platform for further GeneChip expression research. Efficient implementation of both MAS5.0 and RMA algorithms allows modest PC workstations to process large numbers of CEL files using a graphical user-interface.


    ACKNOWLEDGEMENTS
 
This research was supported by the DoD, National Functional Genomics Center project, award number DAMD17-02-2-0051. Views and opinions of, and endorsements by, the author(s) do not reflect those of the US Army or the Department of Defense.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Olga Troyanskaya

Received on December 13, 2006; revised on February 23, 2007; accepted on March 25, 2007

    REFERENCES
 

    Affymetrix. Statistical Algorithms Description Document (2002).

    Bolstad B. RMAExpress (2003a) http://rmaexpress.bmbolstad.com/.

    Bolstad B. Why do my MAS 5.0 values differ? (2003b) http://128.32.135.2/users/bolstad/MAS5diff/Mas5difference.html.

    Bolstad BM, et al. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics (2003) 19:185–193.

    Gautier L, et al. Affy–analysis of Affymetrix GeneChip data at the probe level. Bioinformatics (2004) 20:307–315.

    Ge X, et al. Interpreting expression profiles of cancers by genome-wide survey of breadth of expression in normal tissues. Genomics (2005) 86:127–141.

    Irizarry RA, et al. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res (2003a) 31:e15.

    Irizarry RA, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics (2003b) 4:249–264.

