

Bioinformatics Advance Access originally published online on June 6, 2007
Bioinformatics 2007 23(16):2088-2095; doi:10.1093/bioinformatics/btm306

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Characterization of mismatch and high-signal intensity probes associated with Affymetrix genechips

Yonghong Wang 1,3,*, Ze-Hong Miao 2,4, Yves Pommier 2, Ernest S. Kawasaki 3 and Audrey Player 3

1SAIC-Frederick, Inc., NCI-Frederick, Frederick, Maryland 21702, 2Laboratory of Molecular Pharmacology, Center for Cancer Research, National Cancer Institute, Bethesda, MD 20892, 3Microarray Core Facility, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA and 4Current address: Shanghai Institute of Materia Medica, Chinese Academy of Science, Shanghai, 201203, P.R. China

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 Acknowledgements
 References
 

Motivation: For Affymetrix microarray platforms, gene expression is determined by computing the difference in signal intensities between perfect match (PM) and mismatch (MM) probesets. Although the use of PM is not controversial, MM probesets have been associated with variance and ultimately inaccurate gene expression calls. A principal focus of this study was to investigate the nature of the MM signal intensities and demonstrate their contribution to the experimental results.

Results: While most MM intensities were likely associated with random noise, a subset of ~20% (99 485) of the MM probes displayed high signal intensities relative to the corresponding PM probes (MM > PM) in a non-random fashion; 13 440 of these probes demonstrated exceptionally high ‘outlier’ intensities. About 15 938 PM probes also demonstrated exceptionally high outlier intensities consistently across all hybridizations. About 92% of the MM > PM probes had either a dThymidine (dT) or a dCytidine (dC) at the 13th position of the probe sequence. MM and PM probes displaying extremely high outlier intensities had high dC and low dA contents at other nucleotide positions along the 25mer probe sequence. Differentially expressed genes generated using Genechip Operating System (GCOS) or modified PM-only methods were also examined. Of those candidate genes identified in the PM-only method, 157 were designated by GCOS as absent across all datasets and many others contained probes with MM > PM signal intensities. Our data suggest that subtracting MM intensity from PM signal can be a major source of error in the analysis, leading to fewer potentially biologically important candidate genes.

Contact: wangyong{at}mail.nih.gov

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 
Microarrays are a powerful tool used for the high-throughput assessment of gene expression of thousands of genes in a single experiment. Since the first microarray paper was published in Science more than 10 years ago (Schena et al., 1995), the technology has become the leading method for transcript expression profiling in all fields of biological research. In the year 2006, more than 15 000 papers were published related to the topic of microarrays (PubMed search using key words ‘microarray’ and ‘2006’). Many of these studies relied on the Affymetrix platform for expression profiling. Unique to the Affymetrix platform, the absence or presence of a particular gene is based on a discrimination score (R), which is calculated from the difference between perfect match (PM) and mismatch (MM) signal intensities (Liu et al., 2002) as computed using the GCOS method (Affymetrix); PM represents the hybridization intensity of the perfect complement, and MM represents non-specific hybridization. Studies (2003; Li and Wong, 2001; Naef et al., 2002), however, show that MM signal intensities do not accurately represent non-specific binding signals, and that analysis methods which consider PM-only signal intensities, such as robust multi-array average (RMA) (Irizarry et al., 2003) or the model-based expression index (MBEI) implemented in dChip (Li and Hung Wong, 2001; Li and Wong, 2001), lead to less variance and more potential candidate genes.

This study began as an effort to examine MM probe intensities and their contribution to variance. The approach was to examine MM signal intensities on the Affymetrix U133 plus 2.0 Genechip™ by comparing the data using a method that considered MM (i.e. GCOS) to one that did not (i.e. PM-only). Most of the gene expression data on the U133 plus 2.0 Genechip™ is computed using 11 pairs of tiled, 25mer oligonucleotides (i.e. probes) designed to correspond to the same gene or target, designated as the probeset. Complete analysis of MM probes involved analyzing individual probes and probesets on the U133 plus 2.0 Genechip™ platform. The Human Genome U133 Plus 2.0 Genechip™ utilized for this study was designed several years ago using the now outdated database of Unigene build #133 and many papers have since demonstrated inconsistencies in probe sequences on this and other microarray platforms due to errors in annotation (Handley et al., 2004; 2005; Kothapalli et al., 2002; 2004a, b; Perez-Iratxeta and Andrade, 2005). There has been great progress in cataloguing of human genome sequences and their annotation since then, with a more current Unigene release in May 2007 (build #202). Many ESTs that were thought to correspond to one specific target in the original Unigene assembly were later found to belong to another target or have become obsolete. Gene or target assignments derived from U133 plus 2.0 probesets are in some cases inaccurate (Mecham et al., 2004a, b) due to the errors in the original databases. To circumvent problems resulting from inaccuracies in annotation, for this current study, we chose to use the updated annotation, which would allow for more accurate gene assignment. The pre-existing PM-only methods did not allow for incorporation of the corrected information; as a result, we devised our modified PM-only method (see Methods section). 
Results using the modified PM-only method were compared to data generated using the GCOS method, allowing for assessment of the contribution of MM probes. As the main emphasis of this study was to analyze the nature of MM signal intensities, we interrogated sequences of individual probes and compared the GCOS method to our modified PM-only analysis method, which contains corrected annotation, consistently making reference to these two methods. The effects of remapping (Harbig et al., 2005; Mecham et al., 2004a, b) and normalization (Bolstad et al., 2003) have been elegantly demonstrated by other investigators, so we did not find it necessary to validate the performance of our modified PM-only method following the changes.


    2 METHODS
 
2.1 Samples used for dataset analysis
Preparation of RNA and its subsequent hybridization to U133 plus 2.0 Affymetrix Genechip™ (Affymetrix, Santa Clara, CA, USA) are being published as part of another study (Miao et al., Cancer Research 2007 manuscript in press). Microarray data entry followed MIAME guidelines, and the data were deposited in NCBI's Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/), accessible through GEO Series accession number GSE7161. In summary, total RNA was purified from control (HCT116) and test (HCT116-Top1 transfected) cell lines. Each sample contained four technical replicates. Eight hybridizations were performed on Affymetrix U133 plus 2.0 Genechip™. Biotin-CTP and/or biotin-UTP, obtained from Enzo Biochemicals (Farmingdale, NY, USA) and Roche Molecular Biochemicals (Indianapolis, IN, USA), respectively, were incorporated during in vitro transcription, as recommended in the amplification protocol. The Affymetrix U133 plus 2.0 Genechip™ contains 54 613 unique probesets, as assigned by Affymetrix, corresponding to 594 534 unique probe sequences. The high density expression microarray is reported to represent complete coverage of the human genome, including sequences derived from GenBank®, Refseq and expressed sequence tags (ESTs) allowing for analysis of unknown targets. Genechip™ expression microarrays were washed, stained and scanned to determine the target signal intensities, as recommended in the Affymetrix expression manual (www.Affymetrix.com).

2.2 Probe sequence analysis—determination of sequences used for the modified PM-only method analysis
Only correctly assigned probe sequences were utilized in the modified PM-only method. Accuracy of the annotation of individual probe sequences was determined using BLAST (ftp://ftp.ncbi.nlm.nih.gov/blast/executables). Two databases were used: the human Refseq sequence database downloaded from NCBI (ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/), and the mRNA sequence database from UCSC Genome Center (http://hgdownload.cse.ucsc.edu/downloads.html#human). Briefly, all PM probe sequences downloaded from the Affymetrix website were blasted against the human Refseq sequence database, and those with exact matching sequences were assigned the corresponding Refseq IDs. Probesets not mapping to Refseq were blasted against the mRNA sequence database and assigned mRNA IDs based on the exact mapping results. For our modified PM-only method, probesets and individual probes not mapping to either Refseq or mRNA sequences were excluded from data analysis.
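The remapping logic described above can be illustrated with a minimal sketch. It is not the authors' pipeline (which used BLAST with Perl scripts); it replaces BLAST with a plain exact-substring search, ignores strand orientation, and applies the mRNA fallback per probe rather than per probeset. The function name, the toy sequences and the IDs are all hypothetical.

```python
def map_probes(probes, refseq, mrna):
    """Assign each probe the IDs of the sequences that contain it exactly.

    probes: dict of probe_id -> probe sequence
    refseq, mrna: dict of sequence_id -> transcript sequence
    Returns dict of probe_id -> (source, [ids]); unmapped probes are left out,
    mirroring their exclusion from the modified PM-only analysis.
    """
    assigned = {}
    for probe_id, seq in probes.items():
        # First choice: exact match against the Refseq collection.
        hits = [sid for sid, target in refseq.items() if seq in target]
        source = "refseq"
        if not hits:
            # Fallback: exact match against the mRNA collection.
            hits = [sid for sid, target in mrna.items() if seq in target]
            source = "mrna"
        if hits:
            assigned[probe_id] = (source, sorted(hits))
    return assigned


# Toy data (hypothetical, shortened sequences for readability).
toy_probes = {"p1": "ACGTAC", "p2": "TTTTGG", "p3": "GGGCCC"}
toy_refseq = {"REFSEQ_TOY_1": "AAACGTACAAA"}
toy_mrna = {"MRNA_TOY_1": "CCTTTTGGCC"}
toy_mapping = map_probes(toy_probes, toy_refseq, toy_mrna)
```

In this toy case p1 maps to the Refseq entry, p2 only to the mRNA entry, and p3 maps to neither and is dropped.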

2.3 Analysis methods
GCOS analysis software is supplied by Affymetrix. Signal intensities of the MM probes are adjusted using an algorithm resulting in an idealized mismatch when required, and gene expression is computed using the Tukey Biweight method (Hubbell et al., 2002). Intensities are scaled by a factor obtained by dividing an arbitrarily assigned target value (i.e. 500) by the average signal intensity calculated for all probesets. Discrimination between PM and MM signal intensities is used to determine if a target is designated present or absent (Liu et al., 2002). GCOS was used to determine the difference in gene expression between pairwise comparisons of four test samples versus four control RNA samples, generating 16 comparison files. Differences in signal intensity between probesets are designated as no change, increased, decreased or some marginal variation.
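As an illustration of two of the steps named above, the following sketch shows a one-step Tukey biweight estimate and scaling to a target value. It is not the GCOS implementation: the tuning constant c = 5 and the small epsilon are assumptions based on the commonly described form of the estimator, the idealized-mismatch adjustment is omitted, and a plain mean is used for scaling as stated in the text.

```python
from statistics import mean, median


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that down-weights outliers.

    c and eps are assumed tuning constants, not values taken from this paper.
    """
    m = median(values)
    mad = median([abs(v - m) for v in values])  # median absolute deviation
    weights = []
    for v in values:
        u = (v - m) / (c * mad + eps)
        # Bisquare weight; points far from the median get weight 0.
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def scale_to_target(signals, target=500.0):
    """Multiply all signals by target / average so their mean equals target."""
    factor = target / mean(signals)
    return [s * factor for s in signals]


# Hypothetical probe-level values for one probeset, with one extreme probe.
robust_signal = tukey_biweight([10, 11, 9, 10, 100])
scaled = scale_to_target([100.0, 300.0])
```

The extreme value of 100 receives zero weight, so the estimate stays at the bulk of the data, which is the property that makes the estimator attractive for probesets containing a few aberrant probes.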

Gene expression data were also generated using our modified PM-only method. Our method is similar to RMA and dChip, in that only PM intensities, as obtained from the particular Affymetrix CEL file, were considered. However, for our modified method, all probe level annotations were verified, and only those which correctly mapped to Refseq or mRNA sequences were used; subsequently, MAS background subtraction was performed (using the BioConductor MAS function). For each probeset, the final gene expression intensity is computed using the Tukey Biweight method. Extreme signal intensities (outliers) across all correctly assigned probes of each probeset are identified based on the calculation of 1.5-fold of IQR [inter-quartile range, the difference between Quartile 3 (Q3) and Quartile 1 (Q1)]. Signal intensities greater than the upper fence (Q3 + 1.5IQR) or smaller than the lower fence (Q1 – 1.5IQR) were designated as outliers. In this study, we only discuss the outlier probes above the upper fence as there are approximately 30 000 probes in this category on each Genechip™, while only about 200 probes have outlier intensities below the lower fence. All data manipulations and statistical calculations were performed using Perl scripts and R. Similar to GCOS, 16 files were generated from pairwise combinations of four test and four control hybridizations; LOESS normalization (Bolstad et al., 2003) was performed and the log2 ratio of gene expression for each probeset was computed within each file. A one-sample t-test was performed based on the comparisons of all gene expression values (log2 ratio) of the same probeset among the 16 files. A P-value of 0.05 was used as a cut off value and only probesets that demonstrated statistically significant gene expression across all files (P ≤ 0.05) regardless of the magnitudes of the values were used for further data analysis. 
As only those genes with consistent gene expression values in all replicates could meet the P-value cut off, this was designated the reliable gene list. Candidate genes were identified using z-score test by calculating the 95% cut off interval.
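The outlier fences and the 16-file comparison described in this section can be sketched as follows. This is an illustrative simplification, not the authors' Perl/R code: the quartile convention (linear interpolation) is an assumption, LOESS normalization is omitted, and instead of computing a P-value the t statistic is compared with 2.131, the standard two-sided 5% critical value for 15 degrees of freedom.

```python
import math
from itertools import product
from statistics import mean, stdev


def quartiles(values):
    """Q1 and Q3 by linear interpolation (an assumed convention)."""
    xs = sorted(values)

    def q(p):
        pos = p * (len(xs) - 1)
        lo, hi = math.floor(pos), math.ceil(pos)
        return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

    return q(0.25), q(0.75)


def upper_outliers(values, k=1.5):
    """Values above the upper fence Q3 + k * IQR."""
    q1, q3 = quartiles(values)
    fence = q3 + k * (q3 - q1)
    return [v for v in values if v > fence]


def pairwise_log2_ratios(test, control):
    """All test/control pairings: 4 x 4 replicates give 16 log2 ratios."""
    return [math.log2(t / c) for t, c in product(test, control)]


def one_sample_t(ratios, mu=0.0):
    """t statistic for the null hypothesis that the mean ratio equals mu.

    Raises ZeroDivisionError if all ratios are identical (zero spread).
    """
    n = len(ratios)
    return (mean(ratios) - mu) / (stdev(ratios) / math.sqrt(n))


# Hypothetical signals for one probeset (four test, four control replicates).
ratios = pairwise_log2_ratios([8, 16, 8, 16], [4, 4, 4, 4])
t_stat = one_sample_t(ratios)
T_CRIT_DF15 = 2.131  # two-sided alpha = 0.05, 15 degrees of freedom
is_reliable = abs(t_stat) > T_CRIT_DF15
```

Note that the 16 ratios share samples and are therefore not independent observations; the sketch simply mirrors the comparison scheme described in the text.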

2.4 Quantitative real-time PCR
PCR primers were designed using the Primer3 web-based primer design software (http://cbr-rbc.nrc-cnrc.gc.ca/cgi-bin/primer3_www.cgi). Primer sequences are listed in Supplementary Table 1. Amplicon sizes ranged from 100 to 200 nt. Quantitative RT-PCR was performed as recommended by Applied Biosystems (ABI) using the SYBR Green PCR protocol and the ABI 7900HT amplification system. Transcript expression levels were determined by using a limited dilution approach. A standard curve was generated by limited dilution analysis of the Glyceraldehyde-3-phosphate dehydrogenase (GAPDH) house-keeping gene in a cDNA control template. Relative differences between transcript levels in test versus control samples were determined based on the standard curve.


    3 RESULTS
 
3.1 Probe annotation analysis to determine correct sequences for use in modified PM-only method
Prior to our analyses, we found it necessary to characterize the accuracy of the probesets on the U133 plus 2.0 Genechip™, as we were designing a PM-only analysis program that would eliminate the incorrectly assigned probes. This modified method was to be used for this and subsequent microarray studies. We envisioned the results of such an endeavor would be most beneficial when considering gene assignment of a particular probeset. For most of the targets on the U133 plus 2.0 Genechip™, eleven probes are designed to represent one gene or EST target, for a combined total of 54 613 probesets. Of the 54 613 probesets, 31 468 contain at least one probe mapping to the designated Refseq sequence (Fig. 1). Of these, 23 275 probesets mapped to a single unique Refseq ID, and the remaining 8193 probesets matched at least two different Refseq IDs, many of which appear to represent splice variants of a particular gene or different IDs with the same annotation. Eliminating redundancy, these probesets represented 25 815 unique Refseq IDs (out of a total of about 29 000 different Refseq IDs in the Refseq database as cited in July 2006).


 
Fig. 1. Probeset sequence remapping result. All of the 54 613 probesets on the U133 plus 2.0 Genechip™ were analyzed for accuracy. Probe sequences were blasted against Refseq and mRNA databases.

 
Some 23 145 probesets did not map to RefSeq sequences. Many, however, did map to the mRNA sequence database. When blasted against the mRNA sequence database, we found that 15 431 of the probesets could be assigned to at least one mRNA ID. The remaining 7714 probesets could not be assigned to any human Refseq or mRNA ID, as determined by Blast analysis. Many of these sequences included prokaryotic control sequences or simply lacked annotation information. They were removed from analyses when using the modified PM-only method. Even within the 46 899 probesets that mapped to Refseq or mRNA IDs, approximately 9880 probesets contained inaccuracies, where at least one of the eleven probes did not map to either Refseq or mRNA sequences. Supplementary Table 2 shows one typical probeset that contains 6 probes out of total 11 that could not map to the assigned genes. So, in addition to the 7714 non-mapping probesets, the individual probes associated with the 9880 probesets that could not map to any Refseq or mRNA sequences were also removed from analyses when using our modified PM-only method. Targets corresponding to the 9880 probesets remained available for subsequent analysis; only incorrect, individual probes were removed. Supplementary Figure 1 summarizes the distribution of the number of probes correctly mapping to a particular probeset. Incorrectly assigned probes were removed only for the modified PM-only method. All probesets were used for analyses in the GCOS. GCOS analysis method was compared to our modified PM-only method, representing analyses with and without MM information.
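The probeset counts reported in this section partition the chip consistently, which can be checked with a few lines of arithmetic. The dictionary keys are labels chosen here for illustration; the numbers are those given in the text.

```python
# Probeset counts as reported in Section 3.1.
counts = {
    "total_probesets": 54613,
    "refseq_mapped": 31468,
    "refseq_single_id": 23275,
    "refseq_multiple_ids": 8193,
    "not_refseq": 23145,
    "mrna_mapped": 15431,
    "unmapped": 7714,
    "refseq_or_mrna": 46899,
}


def remapping_is_consistent(c):
    """True if every reported subtotal adds up to its parent total."""
    return (
        c["refseq_mapped"] + c["not_refseq"] == c["total_probesets"]
        and c["refseq_single_id"] + c["refseq_multiple_ids"] == c["refseq_mapped"]
        and c["mrna_mapped"] + c["unmapped"] == c["not_refseq"]
        and c["refseq_mapped"] + c["mrna_mapped"] == c["refseq_or_mrna"]
    )


# Share of probesets removed entirely from the modified PM-only analysis.
unmapped_percent = 100.0 * counts["unmapped"] / counts["total_probesets"]
```

By this arithmetic, about 14% of the probesets on the chip could not be assigned to any Refseq or mRNA ID.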

3.2 Assessment of MM probes
Mismatched probes are unique to the Affymetrix chips. They are designed by substituting the 13th nucleotide of each PM probe sequence with its complementary nucleotide. Although MM intensities are designed to represent non-specific hybridization, studies have shown that for many, the signal intensities are actually higher than their corresponding PM probes (designated: MM > PM) (Irizarry et al., 2003; Naef et al., 2002). In this current study, we examined all MM probes and identified those MM > PM probes. Of this group, we noticed a subset of probes (99 485; ~20%) which, across all experimental datasets, regardless of the sample type, consistently demonstrated MM > PM. The overall intensity distribution of the corresponding PM probes of these 99 485 sequences was similar to that of all PM probes on the chip (Supplementary Fig. 2A, B), representing a range of intensities, not necessarily low intensity. This phenomenon is therefore probably not due to random noise associated with low signal intensities. Upon analysis of their nucleotide content, we found that 92% (91 369) of these probes had either a dThymidine (dT) or a dCytidine (dC) at the 13th position (Fig. 2A), suggesting that MM > PM was more closely related to the presence of dT or dC at the 13th position of the mismatch sequences. About 13 440 of these probes displayed exceptionally high intensities (outliers based on IQR analysis) and were designated as probes with ‘exceptionally high outlier signal intensities’. For these probes, nucleotides surrounding the 13th position of the MM appeared to contribute to exceptionally high outlier signal intensities (Fig. 2B). The exceptionally high outlier signal intensities appeared to be accompanied by regions with high dC and low dA nucleotide contents, compared to the nucleotide content of all MM probes on the chip, which did not demonstrate this pattern (Fig. 2C). 
This condition was independent of the sample type as the same pattern was observed when other datasets and Affymetrix microarray platforms were examined (data not shown). Following a search of the literature, we found that 2005) made a similar observation with respect to the nucleotide at the 13th position of the MM. Binder and Preibisch (2005) showed that high MM signal intensities were dependent on dT or dC at the 13th position of MM probes. They referred to this as the ‘middle-base related biases’. Our analysis shows that regions surrounding the 13th nucleotide also affect the magnitude of MM intensities, leading to exceptionally high outlier signal intensities.
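To make the probe design and the position-specific analysis concrete, here is a minimal sketch that derives an MM sequence from a PM sequence and tabulates nucleotide percentages per position, the quantity plotted in Fig. 2. The function names and the example sequence are hypothetical; the sketch assumes all probes have the same length.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm, position=13):
    """Replace the nucleotide at the given 1-based position by its complement."""
    i = position - 1
    return pm[:i] + COMPLEMENT[pm[i]] + pm[i + 1:]


def positional_percentages(probes):
    """Percentage of A, C, G and T at each position across a list of probes."""
    table = []
    for i in range(len(probes[0])):
        column = Counter(p[i] for p in probes)
        table.append({b: 100.0 * column[b] / len(probes) for b in "ACGT"})
    return table


# Hypothetical 25mer with dA at position 13.
pm_seq = "ACGT" * 6 + "A"
mm_seq = mismatch_probe(pm_seq)
profile = positional_percentages([pm_seq, mm_seq])
```

One consequence of the complement rule is that an MM probe carrying dT or dC at position 13 always pairs with a PM probe carrying dA or dG there, so the middle-base effect can be read from either member of the pair.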


 
Fig. 2. Position-specific nucleotide usage for assessment of high signal intensity MM probes: Effect of representation of particular nucleotides along positions 1–25 of the probe sequence. (A) All probe sequences with higher MM intensities than the corresponding PM intensities (MM > PM). (B) High outlier signal intensity MM probes. (C) All MM probes (604 258) on the chip. The x-axis represents nucleotide positions and y-axis represents percentage of the specific nucleotide at each position. Under random circumstances, we assume that each nucleotide has a 25% chance of occurrence.

 
Table 1 shows a typical probeset containing probes with MM > PM. For many probes, the higher MM signals relative to the corresponding PM probe sequences appeared to be dependent on the particular nucleotide at the 13th position, rather than the mismatch per se. This was observed for both high and low intensity probes, designated as absent and present; however, we did find that 77% (70 744) of these probes were associated with probesets designated as absent. It is not surprising that MM > PM probes are associated with absent calls; however, signals from these particular probes appear to be related to the non-random occurrence of dT or dC nucleotides. This did not appear to be affected by incorporation of the biotin analog, as the pattern was unchanged whether biotin-CTP or UTP was incorporated into antisense RNA (Fig. 3A and B), or by incorporation of the much smaller amino allyl-CTP molecule (data not shown). We did not examine the effect of using biotin analogs other than CTP or UTP. We conclude that the MM signal may not always represent non-specific binding; for some probes, it represents some unknown, non-random binding behavior, so the use and subtraction of MM intensities may not be appropriate.
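The link between MM > PM probes and absent calls follows from the form of the discrimination score. The sketch below assumes the commonly described definition R = (PM − MM)/(PM + MM) and a threshold tau of 0.015; both are assumptions about the Affymetrix detection algorithm rather than values stated in this paper, and the rank test used for the actual call is omitted. The intensities are hypothetical.

```python
def discrimination_score(pm, mm):
    """Assumed form of the per-probe-pair discrimination score R."""
    return (pm - mm) / (pm + mm)


def fraction_at_or_below_tau(pairs, tau=0.015):
    """Share of probe pairs whose score does not exceed the threshold tau."""
    scores = [discrimination_score(pm, mm) for pm, mm in pairs]
    return sum(1 for r in scores if r <= tau) / len(scores)


# Hypothetical (PM, MM) intensities for four probe pairs of one probeset.
probe_pairs = [(100, 300), (300, 100), (200, 200), (500, 100)]
low_fraction = fraction_at_or_below_tau(probe_pairs)
```

Any pair with MM > PM yields a negative score, so a probeset dominated by such pairs is pushed toward an absent call regardless of how strong its PM signal is.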



 
Table 1. Examination of nucleotides at the mismatch position of high signal intensity probes

 

 
Fig. 3. Position-specific nucleotides usage distribution of MM following biotin-CTP (A) or biotin-UTP (B) incorporation. The x-axis represents nucleotide positions and y-axis represents percentage of nucleotide usage at each specific position. Under random circumstances, we assume that each nucleotide has a 25% chance of occurrence.

 
3.3 Variance and accuracy of calls associated with use of the MM
General assessment of MM was accomplished by examining the reproducibility of the signal intensities (i.e. variance) associated with a method that utilizes MM signal intensities, compared to an analysis method that does not. Replicate samples were compared using GCOS and a modified PM-only method. We observed increased variance associated with signal intensities generated using GCOS, compared to the modified PM-only method, as demonstrated by scatter plots (Fig. 4A and B) and Box plots (Fig. 4C). Other studies (Irizarry et al., 2003) using PM-only methods, like RMA, show a similar pattern. As methods like RMA do not use corrected annotations, it appears that the MM signal intensities are the principal source of the variation. Increased variance of the GCOS method was particularly apparent at the lower signal intensities (Fig. 4A).
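A per-probeset measure of replicate variability, of the kind summarized in the box plots, could be computed as in the following sketch. Working on log2 intensities is an assumption made here, and the probeset ID and signal values are hypothetical.

```python
import math
from statistics import stdev


def replicate_sd(signals_by_probeset):
    """Standard deviation of log2 signal across replicates, per probeset."""
    return {
        probeset: stdev([math.log2(v) for v in values])
        for probeset, values in signals_by_probeset.items()
    }


# Hypothetical signals for the same probeset from four technical replicates.
gcos_signals = {"ps1": [100, 400, 100, 400]}
pm_only_signals = {"ps1": [200, 200, 200, 200]}

gcos_sd = replicate_sd(gcos_signals)
pm_only_sd = replicate_sd(pm_only_signals)
```

Comparing the two resulting distributions of standard deviations across all probesets gives a direct view of which method yields more reproducible signals.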


 
Fig. 4. Scatter plot of signal intensities of all probesets following use of the different analysis methods. Signal intensities from technical replicates were compared following use of the GCOS and modified PM-only methods. Variance associated with signal intensities of two control or two test samples was determined. (A) Plot of the signal intensities of the test samples following use of the GCOS method. (B) Plot of the signal intensities following use of the modified PM-only method with MAS background subtraction. (C) Box plot of SD distribution of probesets for test and control replicates. From left to right are: test sample (PM-only analysis), control sample (PM-only analysis), test sample (GCOS analysis) and control sample (GCOS analysis).

 
We found that over 70 744 probes, 77% of the 91 369 MM > PM, were associated with absent calls. It could be that these probes contribute significantly to the variance, as they represent discrepant signal intensities. Under the current GCOS design, it is impossible to specifically remove the high MM intensity probes and determine their contribution to variance. However, because most are associated with the absent calls, we suspect a significant contribution to variance.

Based on the reliability of differential expression, not fold change, candidate gene lists were generated using both the modified PM-only and GCOS methods; 908 potential candidate genes were identified for the PM-only method and 364 for GCOS. There are 239 genes common between the two methods, with 669 unique to PM-only and 125 unique to GCOS (see Supplementary Fig. 3). Although a list of 125 reliable genes was unique to GCOS, nearly all either (a) demonstrated <2 SD change in PM-only method analysis or (b) did not map to either reference or mRNA sequences. As a result, they were not examined further, as would be the criteria when selecting genes for down-stream analysis.
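The overlap between the two candidate lists is a plain set comparison, and the reported counts are internally consistent. In the sketch below the function name and the toy gene IDs are hypothetical; only the counts come from the text.

```python
def compare_gene_lists(pm_only, gcos):
    """Split two candidate gene lists into shared and method-specific sets."""
    pm_only, gcos = set(pm_only), set(gcos)
    return {
        "common": pm_only & gcos,
        "pm_only_unique": pm_only - gcos,
        "gcos_unique": gcos - pm_only,
    }


# Toy example with made-up gene IDs.
overlap = compare_gene_lists(["g1", "g2", "g3"], ["g2", "g3", "g4"])

# Counts reported in the text: common, unique to PM-only, unique to GCOS.
COMMON, PM_ONLY_UNIQUE, GCOS_UNIQUE = 239, 669, 125
pm_only_total = COMMON + PM_ONLY_UNIQUE
gcos_total = COMMON + GCOS_UNIQUE
```

The sums reproduce the 908 and 364 candidates reported for the PM-only and GCOS methods, respectively.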

A list of 669 reliable genes was unique to the PM-only method and not identified as differentially expressed using GCOS; we suspected that this list consisted of genes previously determined to be either absent or to contain high MM intensity probes relative to the corresponding PM probes. Detailed analysis of the data shows that, of the 669 probesets, 157 are designated as absent on all 8 chips using GCOS. More than 100 others include MM > PM probes. A further 98 probesets, different from those above, contain only 1–3 present calls across the 8 chips; these probesets are likely to be removed from GCOS analysis. In a separate study (data not shown), we examined targets designated as absent and present in 182 hybridizations on the U133 plus 2.0 Genechip™, using diverse biological sample types. We found that 15 485 probesets (~28% of the total probesets on the chip) were ‘absent’ in 90% of the chips (164 out of 182 chips). In total, 82% of these targets had valid annotations. These data lead us to suspect that the ‘absent’ designation may, in part, be related to the GCOS analysis, as these targets are consistently absent regardless of the biological sample types examined. Some 100 of the 669 probesets were among this group of 28% absent probesets. Variance introduced by the MM likely contributes to our failure to detect reliable candidate genes when using GCOS. We cannot determine the precise contribution of all MM probes compared to high intensity MM probes as this would require complete interrogation of the GCOS algorithm, to which we do not have access.

For validation of the performance of our analysis method, we performed Quantitative RT-PCR of the genes associated with the modified PM-only method. We randomly chose six genes unique to this method, and performed Quantitative RT-PCR validation. Figure 5 shows gene expression information based on both Quantitative RT-PCR and microarray. Quantitative RT-PCR results were similar to microarray, validating the reliability of the modified PM-only method. These six genes were not considered reliably differentially expressed using the GCOS analysis method.


 
Fig. 5. Validation of candidate genes identified by modified PM-only method. Six genes were randomly chosen for validation of modified PM-only method. Quantitative RT-PCR results are plotted versus microarray analysis. Relative intensities of both test and control samples are computed from the standard curve generated using limited dilution of GAPDH. Ratios are generated by comparing relative intensity of test divided by intensity of control samples.

 
3.4 Nucleotide content analysis of PM probe sequence
The influence of dT or dC nucleotides was not limited to MM sequences. Following examination of the PM signal intensities, we found that 15 938 probes (~2% of unique probes) demonstrated extremely high outlier signal intensities consistently through all eight hybridizations based on the 1.5 x IQR analyses (see Methods section). This phenomenon was independent of the datasets, as a similar pattern was observed from different studies (hybridizations). Nucleotide content analyses of these high outlier signal intensity probes showed that high dC (and to a lesser degree, dT) and low dA contents were associated with this phenomenon (Fig. 6A), compared to that of most other PM probes (Fig. 6B). Again, this pattern appears unchanged with biotin labeled CTP or UTP (Supplementary Fig. 4A–D), suggesting that the high intensity signals are at least partially independent of biotinylated analogs, and associated with nucleotide contents of the probe sequences which favor higher dC and lower dA nucleotide contents (and to a lesser degree, dT). Since this (a) represents a limited number of probes (per probeset) and (b) Tukey biweight estimator rather than mean is used to perform the analysis, it probably does not negatively affect variance and the analysis results for most of the probesets. Comparing the extremely high outlier intensity probes for both the PM (15 938) and MM (13 440), we found 2849 MM common to PM probe list, representing 18% concordance (i.e. 2849/15 938). Difference in the selection criteria most likely explains the incomplete concordance.
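Identifying probes that are outliers in every hybridization, and the concordance figure quoted above, reduce to a set intersection and a ratio. The sketch below uses hypothetical probe IDs; only the counts 2849 and 15 938 come from the text.

```python
def consistent_outliers(outliers_per_chip):
    """Probes flagged as outliers on every chip (intersection of all sets)."""
    sets = [set(chip) for chip in outliers_per_chip]
    return set.intersection(*sets)


def concordance_percent(shared, reference_total):
    """Shared probes as a percentage of the reference list."""
    return 100.0 * shared / reference_total


# Toy example: outlier probe IDs from three hypothetical hybridizations.
always_high = consistent_outliers([["a", "b", "c"], ["b", "c"], ["c", "b", "d"]])

# Counts reported in the text: MM outliers also present in the PM outlier list.
pm_mm_concordance = concordance_percent(2849, 15938)
```

The ratio evaluates to roughly 17.9%, which rounds to the 18% concordance reported.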


 
Fig. 6. Position-specific nucleotide usage for assessment of PM probes. Representation of particular nucleotides along positions 1–25 of the probe. (A) Position-specific nucleotide usage of PM probes demonstrating extreme high outlier signal intensities compared to (B) all PM probes (483 218) that have validated annotations. The x-axis represents the nucleotide positions and y-axis represents percentage of the specific nucleotide at each position.

 

    4 DISCUSSION
 
A major objective of this study was to thoroughly examine and characterize the Affymetrix microarray platform. Principally, we find that (a) many probes within a given probeset are incorrectly assigned to the particular gene and (b) signal intensities are affected by the nucleotide content, most obviously affecting MM probes. Similar to previous studies, we find that the MM probes are a source of variation for the Affymetrix platform (Irizarry et al., 2003; Li and Hung Wong, 2001; Li and Wong, 2001). We show that 99 485 of the MM probes (~20%) have high signal intensity relative to the corresponding PM probes, mostly due to the dT or dC nucleotide content at the 13th position, and 77% of these are associated with the absent calls. This is a substantial number of probes which may contribute to absent calls, and also possibly contribute to variance and affect the number of differentially expressed candidate genes. We find that by utilizing a PM-only method, which excludes MM probe information, we can avoid these problems, allowing for detection of a significant number of differentially expressed genes not detected using the GCOS method. The additional candidate genes detected using the PM-only method could have important biological function, so it is important that PM-only methods be considered. Sequence dependent outlier signal intensities were also associated with PM probes, although they appear to ultimately contribute less to variance.

It has long been established that MM probes are a source of variation; however, there are fewer studies examining the precise nature of the variation and its effect on experimental data. The principal goal of this study was to make such an examination. The MM probe differs from the corresponding PM probe by only 1 nt, at the 13th position. A surface plasmon resonance study of the binding behavior of some sequences suggests that at least some of the MM probe sequences can form a perfect double-stranded helix structure with the complementary sequence of the corresponding PM sequences (unpublished results), suggesting that substituting one nucleotide at the 13th position of the probe does not necessarily alter Watson–Crick binding. Naef and colleagues (Naef and Magnasco, 2003; Naef et al., 2002) proposed a hybridization model based on site-specific affinities, which suggested that the base chosen for the 13th position was indeed important. In their model, adenines in the middle of the probe sequence tend to have low binding affinities, while cytosines lead to high binding affinities. A somewhat similar model was proposed by Binder (Binder and Preibisch, 2005). Naef (Naef and Magnasco, 2003) proposed that the larger purines dA and dG might dramatically distort the binding ability of MM to aRNA, resulting in a low non-specific binding signal. On the other hand, it could be that the small pyrimidine nucleotides dT and dC do not necessarily alter the double helix structure, but may relax the steric tension caused by the large biotin analogs in the PM probe, leading to higher binding affinities to aRNA for MM probes than for PM probes. We suspect that biotin analogs can affect the binding, and ultimately the signal intensity, of probes, but we did not observe a dependence on biotin-UTP compared to biotin-CTP. We used datasets generated as part of previous studies, so we did not examine the effects of using other biotin analogs.
Naef's model suggested differences between adenines and cytosines at the 13th position, while our results also demonstrate that high dC and low dA contents at additional regions of the probe sequences could contribute to the high binding affinities between aRNA and DNA probe sequences. Co-occurrence of this phenomenon with the high intensity probes suggests a possible contribution to the signal intensity.
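
To make the role of the 13th position concrete, the sketch below derives an MM sequence from a PM sequence and tallies the MM middle base among probe pairs with MM > PM. It is illustrative only and is not the analysis code of this study. It assumes the usual Affymetrix convention that the MM middle base is the complement of the PM base, which the text above does not state explicitly. The names `mismatch_probe` and `bright_mismatch_by_base`, and the input layout of (PM sequence, PM intensity, MM intensity) tuples, are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
MIDDLE = 12  # 0-based index of the 13th position in a 25 nt probe


def mismatch_probe(pm_seq):
    """Return the MM sequence for a 25 nt PM sequence.

    Assumption: the MM probe carries the complement of the PM base at the
    13th position and is otherwise identical.
    """
    pm_seq = pm_seq.upper()
    if len(pm_seq) != 25:
        raise ValueError("expected a 25 nt probe sequence")
    return pm_seq[:MIDDLE] + COMPLEMENT[pm_seq[MIDDLE]] + pm_seq[MIDDLE + 1:]


def bright_mismatch_by_base(pairs):
    """Count the MM middle base among probe pairs where MM exceeds PM.

    `pairs` yields (pm_sequence, pm_intensity, mm_intensity) tuples.
    """
    tally = Counter()
    for pm_seq, pm_intensity, mm_intensity in pairs:
        if mm_intensity > pm_intensity:
            tally[mismatch_probe(pm_seq)[MIDDLE]] += 1
    return tally
```

Under this convention, a dC or dT at the middle of the MM probe corresponds to a dG or dA at the middle of the PM probe, so the same tally can be read from either sequence.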

Overall, there are cases where MM does appear to represent non-specific signal intensity, so one might argue that it suits the objective. However, for a significant number of probes the MM intensity is non-random and biased, so it is probably more appropriate not to subtract the MM signal intensity, as this can be a source of variance and hence error. In addition, we propose that the high signal intensity probes contribute to inaccurate gene expression calls, ultimately affecting the number of reliable targets and candidate genes, many of which might be biologically relevant.
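
The effect of subtracting a biased MM signal can be seen in a toy calculation. The sketch below uses plain averages as a deliberately simplified stand-in; it is not the GCOS signal algorithm nor any published PM-only method, and the function names and intensity values are hypothetical.

```python
from statistics import mean


def pm_minus_mm_summary(pm, mm):
    """Average PM - MM difference over the probe pairs of one probeset."""
    if len(pm) != len(mm):
        raise ValueError("PM and MM lists must have the same length")
    return mean(p - m for p, m in zip(pm, mm))


def pm_only_summary(pm):
    """Average PM intensity, ignoring the MM probes entirely."""
    return mean(pm)


# Hypothetical probeset: the third pair has a bright mismatch (MM > PM).
pm = [200, 220, 210, 190]
mm = [50, 60, 400, 55]
print(pm_minus_mm_summary(pm, mm))  # 63.75
print(pm_only_summary(pm))          # 205
```

In this example a single bright MM probe pulls the PM - MM summary down to less than a third of the PM-only value, which illustrates how a sequence-dependent MM signal can propagate into the probeset-level estimate.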

In conclusion, we observed two groups of probes displaying high intensities: (a) MM > PM probes, many of which were associated with ‘absent’ calls in GCOS and had dT or dC at the 13th position, and (b) MM and PM probes with exceptionally high outlier intensities, defined by regions of high dC (and to a lesser degree, dT) and low dA nucleotide contents. This sequence-related bias most likely contributes to abnormally high signal intensities. Given that a significant number of the MM > PM probes are associated with absent calls in GCOS, we suggest they contribute significantly to variance and ultimately to unreliable gene expression results, all of which can be avoided with use of a PM-only analysis method. Although the PM sequences do not seem to be an issue, the high intensity MM probes do, and they can easily be eliminated by utilizing a PM-only data analysis method. In addition, our results provide guidance for probe design algorithms: by properly controlling the overall nucleotide content of both PM and MM probe sequences and the middle nucleotide content of MM probes, we can potentially reduce the errors generated from sequence-related sources.


    Acknowledgements
 
This project was funded in whole or in part with federal funds from the National Cancer Institute, National Institutes of Health, under contract N01-CO-12400. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the US Government.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Chris Stoeckert

1 dChip considers either PM-MM or PM-only signal intensities in its analysis of datasets.

Received on January 18, 2007; revised on May 30, 2007; accepted on June 1, 2007

    References
 

    Binder H, Preibisch S. Specific and nonspecific hybridization of oligonucleotide probes on microarrays. Biophys. J (2005) 89:337–352.

    Bolstad BM, et al. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics (2003) 19:185–193.

    Handley D, et al. Evidence of systematic expressed sequence tag IMAGE clone cross-hybridization on cDNA microarrays. Genomics (2004) 83:1169–1175.

    Harbig J, et al. A sequence-based identification of the genes detected by probesets on the Affymetrix U133 plus 2.0 array. Nucleic Acids Res (2005) 33:e31.

    Hubbell E, et al. Robust estimators for expression analysis. Bioinformatics (2002) 18:1585–1592.

    Irizarry RA, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics (2003) 4:249–264.

    Kothapalli R, et al. Microarray results: how accurate are they? BMC Bioinformatics (2002) 3:22.

    Li C, Hung Wong W. Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol (2001) 2. Research0032.1–Research0032.11.

    Li C, Wong WH. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc. Natl Acad. Sci. USA (2001) 98:31–36.

    Liu WM, et al. Analysis of high density expression microarrays with signed-rank call algorithms. Bioinformatics (2002) 18:1593–1599.

    Mecham BH, et al. Sequence-matched probes produce increased cross-platform consistency and more reproducible biological results in microarray-based gene expression measurements. Nucleic Acids Res (2004a) 32:e74.

    Mecham BH, et al. Increased measurement accuracy for sequence-verified microarray probes. Physiol. Genomics (2004b) 18:308–315.

    Miao, et al. Cancer Research. (2007) (in press).

    Naef F, Magnasco MO. Solving the riddle of the bright mismatches: labeling and effective binding in oligonucleotide arrays. Phys. Rev. E Stat. Nonlin. Soft Matter Phys (2003) 68:011906.

    Naef F, et al. DNA hybridization to mismatched templates: a chip study. Phys. Rev. E Stat. Nonlin. Soft Matter Phys (2002) 65:040902.

    Perez-Iratxeta C, Andrade MA. Inconsistencies over time in 5% of NetAffx probe-to-gene annotations. BMC Bioinformatics (2005) 6:183.

    Schena M, et al. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science (1995) 270:467–470.





