Skip Navigation


Bioinformatics Advance Access originally published online on December 12, 2006
Bioinformatics 2007 23(4):458-465; doi:10.1093/bioinformatics/btl630
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/4/458    most recent
btl630v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Baek, J.
Right arrow Articles by McLachlan, G. J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Baek, J.
Right arrow Articles by McLachlan, G. J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Segmentation and intensity estimation of microarray images using a gamma-t mixture model

Jangsun Baek 1,*, Young Sook Son 1 and Geoffrey J. McLachlan 2

1 Department of Statistics, Chonnam National University Gwangju 500-757, South Korea
2 Department of Mathematics and Institute for Molecular Bioscience, University of Queensland Brisbane, QLD 4072, Australia

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION AND CONCLUSIONS
 REFERENCES
 

Motivation: We present a new approach to the analysis of images for complementary DNA microarray experiments. The image segmentation and intensity estimation are performed simultaneously by adopting a two-component mixture model. One component of this mixture corresponds to the distribution of the background intensity, while the other corresponds to the distribution of the foreground intensity. The intensity measurement is a bivariate vector consisting of red and green intensities. The background intensity component is modeled by the bivariate gamma distribution, whose marginal densities for the red and green intensities are independent three-parameter gamma distributions with different parameters. The foreground intensity component is taken to be the bivariate t distribution, with the constraint that the mean of the foreground is greater than that of the background for each of the two colors. The degrees of freedom of this t distribution are inferred from the data but they could be specified in advance to reduce the computation time. Also, the covariance matrix is not restricted to being diagonal and so it allows for nonzero correlation between R and G foreground intensities. This gamma-t mixture model is fitted by maximum likelihood via the EM algorithm. A final step is executed whereby nonparametric (kernel) smoothing is undertaken of the posterior probabilities of component membership.

The main advantages of this approach are: (1) it enjoys the well-known strengths of a mixture model, namely flexibility and adaptability to the data; (2) it considers the segmentation and intensity simultaneously and not separately as in commonly used existing software, and it also works with the red and green intensities in a bivariate framework as opposed to their separate estimation via univariate methods; (3) the use of the three-parameter gamma distribution for the background red and green intensities provides a much better fit than the normal (log normal) or t distributions; (4) the use of the bivariate t distribution for the foreground intensity provides a model that is less sensitive to extreme observations; (5) as a consequence of the aforementioned properties, it allows segmentation to be undertaken for a wide range of spot shapes, including doughnut, sickle shape and artifacts.

Results: We apply our method for gridding, segmentation and estimation to cDNA microarray real images and artificial data. Our method provides better segmentation results in spot shapes as well as intensity estimation than Spot and spotSegmentation R language softwares. It detected blank spots as well as bright artifact for the real data, and estimated spot intensities with high-accuracy for the synthetic data.

Availability: The algorithms were implemented in Matlab. The Matlab codes implementing both the gridding and segmentation/estimation are available upon request.

Contact: jbaek{at}chonnam.ac.kr

Supplementary information: Supplementary material is available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION AND CONCLUSIONS
 REFERENCES
 
Microarray technology, such as complementary DNA (cDNA) arrays provides a means for measuring the expression levels of thousands of genes simultaneously. cDNA arrays are mostly used in 2-channel experiments. In such an experiment two different cDNA probes are prepared independently for the analysis; the first probe is prepared from control mRNA and labeled with Cy3 (green) dye, and the second from mRNA isolated from the treated cells or tissue under investigation is labeled with Cy5 (red) dye. Both probes are hybridized together on the same microarray slide. Then the fluorescent signal from Cy3 and Cy5 labels is scanned to provide a 16-bit gray-scale image, respectively. The relative intensities of the dyes in individual spots measure the relative amounts of specific transcripts in each sample, meaning differential gene expression can be examined.

The image analysis for cDNA arrays consists of three steps: (1) the addressing step aligns the spots on the array; (2) the segmentation step partitions the set of pixels into foreground and background; and (3) the intensity estimation step extracts the red (R) and green (G) intensities and assigns the log ratio (log2 R/G) after background correction to represent the (log) relative abundance of each spot.

In this paper, we use a simple addressing procedure similar to the one in Li et al. (2005) and mainly focus on segmentation and intensity estimation. There are many segmentation methods for microarrays. The existing methods can be grouped into two categories according to whether a parametric distribution of the pixel intensity is assumed or not: parametric segmentation and nonparametric segmentation (see Yang et al., 2002 for other categorizations.)

Nonparametric segmentation methods do not assume any particular type of distribution on intensities, but use only the descriptive statistics or some information of the histogram, which is a nonparametric estimator for the true distribution of the pixel intensity. Fixed circle segmentation (Eisen, 1999) and adaptive circle segmentation (Axon Instruments Inc., 2003, http://www.axon.com) use a circle of fixed and adaptive radius for fitting foreground, respectively. Jain et al. (2002) use histogram information within a circle. The seeded region growing method (Adams and Bischof, 1994) selects the pixel whose intensity is nearest to the average of the intensities in the neighboring foreground or background. Chen et al. (1997) used a threshold value based on Mann–Whitney test and GSI Lumonics (1999, http://las.perkinelmer.com/), defined foreground and background as the mean intensities between some predefined percentile values. All these methods do not correctly segment doughnut-shaped spots.

In parametric segmentation, the distribution for intensity is specified up to a vector of unknown parameters. Glasbey and Ghazal (2003) assumed the bivariate normal distribution for the red and green intensities of foreground and background in one of their methods. Steinfath et al. (2001) fitted a scaled bivariate normal distribution to pixel values, and Brändle et al. (2003) used an M-estimator for fitting the normal spot model. Li et al. (2005) described a model-based segmentation method in which the normal distribution for the combined intensity (R + G) is implicitly assumed. Demirkaya et al. (2005) fitted the exponential distribution to both foreground and background intensities and used a Markov random field model for segmentation. Gottardo et al. (2006) also applied a Markov random field approach to segmentation, but assumed an uncorrelated bivariate t distribution as the joint distribution of the red and green intensities for both foreground and background.

The correct assumption on the underlying density of pixel intensities is crucial for segmentation and mean intensity estimation in parametric methods. The distribution of background intensities is usually skewed in various degrees while the foreground intensities are scattered elliptically. Thus it would be more appropriate to assume a general distribution that can be fitted well to as many real microarray images as possible. We propose the bivariate gamma and t as the underlying distribution for background and foreground, respectively, both of which are quite general and reduce to other distributions in special cases.

In Section 2, we describe our image segmentation and intensity estimation method, including automatic gridding, justification of the proposed pixel intensity distributions, and mixture model-based segmentation and intensity estimation. In Section 3, we demonstrate the stable parameter estimation result of the proposed method using synthetic data, and present the segmentation and estimation results from applying our method to two real experimental microarray image datasets. We compare the results with those from other methods. Finally, we summarize the results in Section 4.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION AND CONCLUSIONS
 REFERENCES
 
We have implemented the following for microarray image analysis: automatic gridding, model-based clustering of pixels and intensity estimation. The original image data consist of a pair of unsigned 16-bit images, which are stored as TIFF files. One corresponds to the dye Cy5 and another to Cy3. We transformed these original data by taking square roots or logarithms for gridding and segmentation/estimation. The transformed images (R for Cy5 and G for Cy3) have two advantages compared to the original 16-bit images. Very high pixel intensities are damped in the combined image on which gridding is performed based, to prevent very bright pixels from dominating in both gridding and segmentation steps, and the transformed lower-bit images are computationally efficient (Yang et al., 2002).

2.1 Automatic gridding
A microarray slide consists of several blocks containing hundreds of spots. The gridding procedure is to identify blocks and to position rows and columns of spots within each block. Since the blocks are usually separated clearly on a slide, we can take out each block image and analyze them separately. We use a step similar to that of Li et al. (2005). Our algorithm starts by obtaining the combined image and projecting the intensities by summing up across the pixels in each row and each column. We smooth these projections using robust loess (Cleveland and Devlin, 1988) with the bandwidth approximately equal to the width of a typical spot. Then the smoothed projections form a series of peaks and valleys. The grid is defined by drawing a line in each valley. The two figures in Supplementary material show the results from applying the method to a block containing 18 x 24 spots in an example image data at CSIRO (http://spot.cmis.csiro.au/spot/demodownload/SpotExamples.zip). The first figure shows the projections along with the smoothed projections; the valleys correspond to the grid lines. The second figure shows the resulting grids positioning the spots well.

The input parameters for this procedure are the number of spots in each row or column and the bandwidth of the smoother. A crude estimate for the bandwidth calculated as the average number of row (or column) pixels of a spot sufficed in our analyses.

2.2 Distributions of pixel intensities
As parametric segmentation methods depend on the underlying distributions of the pixel intensities for the background and foreground, it is crucial that these distributions are modeled adequately. Several researchers have suggested normal, exponential, t distributions as the underlying distribution for R and G in both background and foreground (Glasbey and Ghazal, 2003, Li et al., 2005, Demirkaya et al., 2005, Gottardo et al., 2006). In Li et al. (2005), the R and G intensities were not considered in a bivariate framework, but rather a univariate combination of them was used for segmentation of the image. In Gottardo et al. (2006), the R and G intensities were modeled in a bivariate framework through the use of the bivariate t distribution, but they were taken to be uncorrelated, whereas our model allows for correlation between the R and G foreground intensities. Further, instead of also adopting a bivariate t distribution for the R and G background intensities, we postulate a bivariate gamma based model. Previously, Newton et al. (2001) assumed independent gamma distributions for the background-corrected R and G obtained after segmentation.

Figure 1 shows the distributions of the pixel intensities for a spot in the CSIRO example image. In Figure 1, the original spot image is displayed in (a); and segmented by our method into background and foreground shown in (b); (c) shows the scatter plot of the 2D pixel intensities (R, G)'s clustered into the background ({Delta}) and the foreground ({circ}). The intensities are concentrated at low values and gradually become sporadic at higher values on both axes for the background. For the foreground, on the other hand, the intensities are scattered elliptically. We fitted the distribution of the proposed mixture model with the estimates obtained by our method to each clustered dye intensity. We modeled the R and G intensities of the background by the gamma distribution, and plotted the fitted p.d.f. with the histogram in (e) and (f). The background distributions are different in two channels. The distribution of the background R looks like exponential which is a special case of the gamma distribution, and the background G follows the gamma distribution as well. The R and G intensities of the foreground were modeled by the t distribution, and plotted in (g) and (h), respectively. The goodness of fit was significant for each distribution at the significance level of 0.05.


Figure 1
View larger version (21K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 (a) original image; (b) segmented image; (c and d): scatter plot of the bivariate pixel intensities segmented by the gamma-t, gamma-normal mixture model; (e and f): histogram and gamma dist. fit of R, G in background; (g and h): histogram and t dist. fit of R, G in foreground.

 
We propose the bivariate distribution as the underlying pixel intensity distribution for the background, whose marginal densities for the R and G are independent three parameter gamma p.d.f.'s. The family of gamma distributions is supported on the positive line and provides a rich class of distributions as special cases. It is an exponential distribution if the shape parameter is equal to 1, contains a {chi}2 distribution as a special case, and converges to a normal distribution as the shape parameter gets large (Johnson and Kotz, 1970). For the underlying distribution for the foreground, we adopt the bivariate t distribution. Although the normal distribution is usually assumed for elliptical data, the tails of the normal distribution are often shorter than required. Figure 1d shows the pixel segmentation result when the normal distribution is assumed for the foreground pixels in the mixture model. Some of the pixels with high intensities are classified as the background because the tails of the normal distribution are not long enough to contain them as its population elements. Also, the parameter estimates of the model being fitted can be affected by atypical observations. The t distribution provides a longer-tailed alternative to the normal distribution. Hence it provides a more robust approach to the fitting of normal mixture models, as observations that are atypical of a component are given reduced weight in the calculation of its parameters (McLachlan and Peel, 2000).

In order to check the validity of the proposed distributions in the model, we implemented the Kolmogorov-Smirnov goodness-of-fit test on the pixel intensities of the spots in the CSIRO example image segmented by our mixture model method. Each pixel in the rectangle containing a spot was classified as background or foreground. The P-value for this goodness-of-fit statistic for each of the exponential, normal, and gamma distributions for the background red intensity Rb was calculated for each spot rectangle. The P-value for the foreground red intensity Rf was calculated also for the exponential, normal and t distributions fitted separately to the intensities on the pixels classified as foreground. The corresponding P-values for the green intensities Gb and Gf were calculated too. The average of these P-values for all spot rectangles are listed in Table 1. The red intensity (Rb) in the background follows the exponential or the gamma distribution on average since the mean P-value is well above the usual significance level of {alpha} = 0.05. On the other hand, the green intensity (Gb) in the background fits the gamma distribution well, but the fit is just compatible with the normal distribution at the 5% level (P-value = 0.0635). The red and green foreground intensities (Rf and Gf) fit either the normal or the t distribution very well (the t distribution is better). It could be argued that the superior results for the gamma over the exponential or normal distribution for the background intensities (red and green) and for the t distribution over the latter two distributions for the foreground intensities may be due to the adoption of the former distributions to first classify the pixels as background or foreground. We therefore recalculated the average P-values for these goodness-of-fit tests after the pixels were segmented according to the method of Li et al. (2005), which adopts the normal distribution for both background and foreground intensities. The P-values so obtained are listed in parentheses in Table 1. It can be seen of comparing the mean P-values in parentheses that the gamma and t distributions provide a better fit to the background and foreground intensities, respectively, than the exponential or normal distributions. The mean correlation coefficient between Rb and Gb in the background for all spot rectangles is 0.0262, while the mean P-value for the null hypothesis that there is no correlation between Rb and Gb is 0.3929. It suggests that it is reasonable to take Rb and Gb to be uncorrelated.


View this table:
[in this window]
[in a new window]

 
Table 1 Mean P-values for goodness-of-fit test

 
In our mixture model for segmentation, we assume the two parameter gamma distribution with known location parameters {gamma}1 and {gamma}2 for y' = (y1 = R, y2 = G) in the background, which has the probability density function (p.d.f.)


Formula 1

(1)
where {Theta} = ({alpha}1, ß1, {alpha}2, ß2)', {alpha}i > 0, ßi > 0, yi > {gamma}i ≥ 0, i = 1, 2. In practice, the location parameter {gamma}i is estimated from the data. We use the estimate Formula for small positive number {varepsilon}, i = 1, 2. For the foreground pixel intensity distribution, we assume the bivariate t distribution whose p.d.f. is


Formula 2

(2)
where {delta}(y, µ; {Sigma}) = (y – µ)' {Sigma}–1 (y – µ), µ = (µ1, µ2)' and µ1 = {alpha}11 + {gamma}1 + {varphi}1, µ2 = {alpha}22 + {gamma}2 + {varphi}2, {varphi}1 ≥ 0, {varphi}2 ≥ 0. Since the mean of the foreground intensity must be larger than that of the background, the location parameter µi is parameterized as the mean of the background ({alpha}ii + {gamma}i) plus the nonnegative parameter {varphi}i. Thus the background corrected spot intensity is measured by {varphi}i for each channel.

2.3 Image segmentation and intensity estimation
Let y1, ... , yn denote the bivariate pixel intensities to be segmented. We will assume that the observed intensities are independent and identically distributed realizations from the density


Formula

where {Psi} = {{Theta}, µ, {Sigma}, {nu}, {pi}1}, {pi}i (0 ≤ {pi}1, {pi}2 ≤ 1, {pi}1 + {pi}2 = 1) is the probability that a pixel belongs to the ith component, and f1 and f2 are the densities defined by Equations (1) and (2), respectively. We will implement the EM algorithm to obtain the maximum likelihood estimates (m.l.e.) of the parameters in {Psi}. In the EM algorithm framework, the observed data are viewed as being incomplete, as the associated component-label vectors z1, ... , zn are not available; zj = (z1j, z2j)' is defined by z1j = 1 if yj belongs to group 1 (background), z1j = 0 otherwise, and z2j = 1 if yj belongs to group 2 (foreground), z2j = 0 otherwise. The complete-data vector is defined to be Formula , where Formula . It is appropriate to assume that the zi are distributed according to the multinomial distribution with probabilities {pi}1 and {pi}2. Then the complete-data likelihood Lc({Psi}) is Lc({Psi}) = Formula and its log likelihood is


Formula 3

(3)

E-step. The missing data are the zj ( j = 1, ... , n), which represent the component (segment) labels. Given the value {Psi}(k) of {Psi} after the kth iteration, the E-step on the (k + 1)th iteration of the EM algorithm involves replacing zj by E{Psi}(k) (Zj|yj), its conditional expectation given yj, using {Psi}(k) for {Psi}. It can be expressed as


Formula 4

(4)
which is the posterior probability that yj belongs to the ith component of the mixture using the current fit {Psi}(k) for {Psi} (i = 1, 2; j = 1, ... , n).

The multivariate t distribution can be viewed as a weighted average multivariate normal distribution with the weight gamma distribution of random variable U (McLachlan and Peel, 2000). The complete-data likelihood Lc({Psi}) can be factored into the product of the marginal densities of the Zj, the conditional densities of the Uj given the zj, and the conditional densities of the Yj given the uj and the zj. Then Q({Psi}; {Psi}(k)), the current conditional expectation of the complete-data log likelihood function, log Lc({Psi}) plugged with Formula , using the current value {Psi}(k) for {Psi} is given by (see Supplementary material for the derivation)


Formula

where {xi} = {{alpha}1, ß1, {alpha}2, ß2, µ} and


Formula 5

(5)


Formula 6

(6)


Formula 7

(7)
where, on ignoring terms not involving the {nu},


Formula

M-stepQ({Psi}; {Psi}(k)) is then maximized as a function of {Psi}. The estimate for Formula exists in closed form by maximizing Equation (5):


Formula

To obtain the estimates of {xi} = {{alpha}1, ß1, {alpha}2, ß2, µ} and {Sigma}, we need to take the derivative of Q3({xi}, {Sigma}; {Psi}(k)) with respect to {xi}i (ith parameter in {xi}), and set it equal to zero. Set Formula in Q3({xi}, {Sigma}; {Psi}(k)) of Equation (7). From {partial} Q3({xi}, {Sigma}; {Psi}(k))/ {partial}{alpha}i = 0 we get


Formula 8

(8)

On plugging ßi of Equation (8) into the equation {partial}Q3({xi}, {Sigma}; {Psi}(k))/ {partial}{alpha}i = 0, we then get


Formula 9

(9)
where {psi}(s) = {partial} log {Gamma}(s)/{partial}s is the digamma function. We solve Equations (8) and (9) for {alpha}i and ßi until convergence, and denote the estimates as Formula and Formula .

For the foreground mean, by solving the equation {partial}Q3({xi}, {Sigma}; {Psi}(k))/ {partial}µ = 0, we obtain


Formula

The foreground mean should not be less than that of the background, thus we choose the estimate as Formula , where Formula .

From the equation Formula we have


Formula

On calculating the left-hand side of equation Formula it follows that Formula is a solution of the equation


Formula

where Formula .

The estimates of the parameters in {Psi}, Formula are then input into the E-step. These E-step and M-step are alternated until convergence is reached. We refer to this model as the gamma-t mixture model (GTMM); see also Supplementary material for the EM algorithm of the three-parameter gamma-t mixture model where the {gamma}i are unknown and estimated iteratively.

Let Formula be the estimate of the posterior probability obtained after implementing the EM algorithm. The rectangle containing a spot consists of the foreground at its center and the background surrounding the foreground, and the neighboring pixels in each segment must belong to the same class. Therefore, in order to encourage neighboring pixels to have the same class, we obtain the nonparametric kernel estimate Formula by smoothing neighboring Formula in Equation (4) of the final step: Define


Formula

where K(·) is a bivariate nonnegative kernel function. For simplicity we assume that K(·) is a multivariate probability density function with Formula and Formula H is a nonsingular 2 x 2 bandwidth matrix, and |H| denotes its determinant. Then the modified maximum likelihood estimate of the posterior probability with this nonparametric kernel smoothing is given by


Formula

where Formula is the weight, the kernel K(·) is the standard bivariate normal p.d.f., and xl is the 2D coordinate for the lth pixel in the spot rectangle. We used the 2 x 2 matrix H = (h2, 0; 0, h2) as the bandwidth matrix with a constant h. Since the nonparametric kernel smoothing estimator is consistent (Simonoff, 1996) and accounts for the spatial dependency, Formula reflects the posterior probability better than Formula when there exist a few high-intensity pixels in the background region or low-intensity pixels in the foreground region.

The bandwidth h controls the extent to how many neighboring pixels' information is needed to estimate the posterior probability of the pixel to be classified. The proper kernel bandwidth is needed for good estimation in the smoothing step. Though there are many methods to choose the optimal bandwidth (Simonoff, 1996), we select the smoothing parameter by the following simple step. Let x1 and x2 be the coordinates of the pixel to be classified, and let x1j and x2j be the coordinates of the jth neighboring pixel with intensity yj. Let n1 and n2 be the number of pixels in the row and the column of the spot rectangle image, respectively, then the total number of pixels in the spot rectangle is n1 x n2, and xi, xij can take the values from 1, ... , ni, i = 1, 2. As we use the standard Gaussian kernel, we give different weights to the pixels at xij's, which are within xi ± 1.96h approximately according to their closeness to xi under the 95% confidence level, i = 1, 2. Therefore, roughly speaking, if we want to use the information of the nearest pixel (|xij xi| = 1) on each direction of each axis for smoothing, we need h = 0.51. Also, we need h = 1.02, 1.53 if we smooth up to second and third nearest pixel's posterior probabilities on each direction of each axis, respectively. We used h = 1.02 for the segmentation of the experimental image, and it worked satisfactorily.

The pixels are segmented according to the values of the smoothed estimates Formula ; the jth pixel is classified as background if Formula , and foreground if Formula .

Due to system imperfection and microarray image generation process, there are some undesired spots, such as blanks, low-expressed spots and high-expressed artifacts, all of which are common but difficult to segment in typical microarray images. We flag these unusual spots and label them in the output text file. The user can then decide whether to keep or dismiss the spots in the analysis.

We flag blank and low-expressed spots by the Bayesian Information Criterion (BIC). Let m be the number of components in the mixture model, and let BICm be the value of BIC with m components in the mixture model. Then the case m = 1 corresponds to the situation where there is no foreground, i.e. a blank since background pixels would be the only group in the spot rectangle. Thus, if BIC1 < BIC2, we treat all pixels as the background, flag the spot as blank, and do not estimate the spot intensity.

The case m = 2 would correspond to the typical situation where there exist both foreground and background. When a spot is not flagged as blank but BIC2 is not significantly less than BIC1, it may be uncertain whether there are two groups in the spot rectangle. If Formula for Formula , i.e. there is not much difference between BIC1 and BIC2 relative to the absolute magnitude of BIC1, the spot is flagged as having low-expression, which denotes the pixels of uncertain classification. We estimate the intensities for these uncertain spots. These intensity estimates should be used with caution for further analysis. We used {delta} = 0.01 in producing the results in this paper.

We flag a spot as being valid if it is neither blank nor has uncertain classification (low-expression). Finally, the high-intensity artifact region is detected on the foreground in the valid spots. Since the t distribution has long tails, the segmented foreground may contain the very high-expressed pixels, which consists of artifacts. For each channel, we rearrange the intensities of the pixels segmented as foreground in ascending order. Let Q3R and Q3G be the 3rd quartile of the foreground pixel intensities of R and G, respectively, and let I Q RR and I Q RG be the interquartile ranges of the same foreground pixel intensities for each color. Then the pixel in the foreground whose intensities are Ri and Gi is classified as a high-intensity artifact if Formula and Formula . These high-intensity artifacts are excluded from the foreground when the foreground intensity is estimated.

Having effected segmentation, we now estimate the spot intensity using the intensity information of the segmented pixels. Since the zij's in Equation (3) are now specified, we maximize the log likelihood to estimate the parameters in {Psi}. That is, we obtain the m.l.e.'s Formula of the parameters with the constraint Formula using the intensities of the pixels classified into each class. To estimate the background-corrected intensities, we use Formula where Formula is the location parameter estimate from the data, i = 1, 2, and estimate the log-ratio for each spot by Formula Note it is guaranteed that both Formula and Formula are nonnegative. The whole procedure for segmentation and intensity estimation using the GTMM algorithm is described as follows:

  1. Set Formula for small positive number {varepsilon}, i = 1, 2;
  2. apply GTMM algorithm to obtain m.l.e.'s Formula for two-parameter gamma distribution using the shifted data Formula , and Formula for the t distribution with the constraint Formula ;
  3. smooth Formula s which are calculated with the m.l.e.'s in step 2 to obtain the smoothed posterior probability Formula ;
  4. segment each pixel into the background or into the foreground according to Formula ;
  5. obtain Formula and m.l.e.'s Formula using the segmented pixel intensities with the constraint Formula ;
  6. estimate Formula .

We can also estimate the parameters even when the gamma location parameter {gamma}i's are unknown by solving more nonlinear equations iteratively, which requires longer computation time. We refer to this model as the three-parameter gamma-t mixture model (G3TMM); see Supplementary material for details.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION AND CONCLUSIONS
 REFERENCES
 
3.1 Simulated datasets
We generated artificial spot data, which have similar background and foreground locations as in one of the real spots in the CSIRO example image. That is, we assume a spot rectangle consisting of 17 x 17 pixels. The background distributions are assumed to be gamma({alpha}i, ßi, {gamma}i), i = 1, 2, where {alpha}1 = 1, ß1 = 0.1, {gamma}1 = 7, {alpha}2 = 2, ß2 = 0.1, {gamma}2 = 10. Therefore, the background mean is Formula . The two independent t distributions with the location parameters µfi = µbi + 20, degrees of freedom vi = 20 and the dispersion parameters Formula are assumed for the foreground channels, i =1, 2, respectively. Let (x1(ij), x2(ij)) be the coordinate for the (i, j)th pixel in the spot rectangle located at ith column and jth row of the rectangle lattice, where x1(ij), x2(ij) = 1, 2, ... , 17. Then the intensity of the (i, j)th pixel is randomly generated from the foreground distribution if Formula , and from the background distribution otherwise.

Our method uses the kernel smoothing technique to increase the chances that neighboring pixels are assigned to the same class after implementing the EM algorithm. Figure 2 shows the more effective segmentation according to the smoothed posterior probability Formula compared to that with Formula .


Figure 2
View larger version (33K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 (a and b): true image and scatter plot of the intensities indicating their true classes; (c and d): image and scatter plot of the segmented intensities based on non-smoothed posterior probability Figure 2; (e and f): image and scatter plot of the segmented intensities based on Figure 2. The circle in the scatter plot corresponds to the foreground, and * to the background.

 
Figure 2a shows the combined image of the two channel intensity images which were randomly generated according to the above procedure, and Figure 2b shows the scatter plot of the 2D intensities indicating their true classes. Some of the foreground and background pixels are mixed together around the area where the tails of the two distributions overlap, and two particular pixels are located deep inside of the other distribution's domain. Figure 2c and d give the pixel segmentation results according to the nonsmoothed posterior probability Formula The classification boundary can be recognized along the line where the two p.d.f.'s meet each other, as shown in Figure 2d. This, however, misclassifies some foreground pixels into the background as well as misclassifying some background pixels into foreground, which is shown in Figure 2c. On the other hand, the smoothed posterior probability Formula produces the desired segmentation of the spot image as shown in Figure 2e. The high-intensity pixel in the background and the low-intensity pixel in the foreground are classified correctly as well as many pixels located on the border of two distributions (Fig. 2f). The correct classification of each pixel is due to utilizing the neighboring pixels' spatial information on their class.

In order to see how well our method works for estimation, we generated artificial spot data from the distributions with different sets of true parameter values. Then we estimated the parameters for each dataset, and checked the compatibility of the parameter estimates to the true values for each set of parameters.

Each spot rectangle consists of 17 x 17 pixels. We assume the same three-parameter gamma distribution as in Figure 2 for the background intensity. Therefore, the background mean is Formula The foreground intensities are generated from the t distributions with location parameter Formula , degrees of freedom vi = 20, dispersion parameters Formula , i = 1,2, and no correlation, where µf is Formula (Euclidean) distant from µb in an upper right direction. We set {Delta} = 20, 30, 40 and arranged µf1 = µb1 + {varphi}1 and µf2 = µb2 + {varphi}2 to have log2 ({varphi}1/{varphi}2) = –1, 0, 1 for each {Delta}. For each parameter {theta} = (µb1, µb2, µf1, µf2, {varphi}1, {varphi}2)', let {theta}0 be the true parameter value for each combination of {Delta} and log2({varphi}1/{varphi}2). We set {theta} = {theta}0, generated 100 spot rectangles, estimated the parameter Formula using the whole procedure for segmentation and estimation with the GTMM algorithm for each spot, and calculated the mean Formula of the estimates Formula . To check the compatibility of the parameter estimates with the true values, we performed a t-test for the hypothesis Formula versus the two-tailed alternative with 100 parameter estimates and obtained the P-value for each parameter. The results are summarized in Table 2 for the case where log2({varphi}1/{varphi}2) = 1; see Supplementary Table 1 for other cases. The estimate Formula is close to the true parameter value {theta}0 on average for each set. The P-values for the estimates are well above the usual significance level 0.05 except for some estimates in the case where the two class means are relatively close ({Delta} = 20). The estimates become closer to the true parameter values (larger P-value) as the two class central locations grow apart (larger {Delta}). The G3TMM algorithm also gave similar results.


View this table:
[in this window]
[in a new window]

 
Table 2 The mean of the parameter estimates and the P-value for the t-test on the parameter estimate, obtained from 100 synthetic spot data

 
3.2 Real datasets
We compared our proposed method (using GTMM) with the Spot software of CSIRO (http://spot.cmis.csiro.au/spot/) and the spotSegmentation of Li et al. (2005) in its application to two real microarray images. The first image dataset consists of the files spot1cy5.tif and spot1cy3.tif from http://spot.cmis.csiro.au/spot/demodownload/SpotExamples.zip for Spot, the commercial image analysis software of CSIRO Mathematical and Information Sciences. The second dataset is the spotSegTest dataset of spotSegmentation1.4.0.zip from http://www.bioconductor.org/packages/bioc/html/spotSegmentation.html for the spotSegmentation package (Li et al., 2005), which consists of a portion of the first block from the first microarray slide image from van't Wout et al. (2003).

Figure 3 displays the original top left 24 spot images of the first block in the example microarray of Spot software with the segmentation results of the spots using our approach (Fig. 3b); spotSegmentation (spotSeg) of Li et al. (2005) (Fig. 3c); and Spot of CSIRO (Fig. 3d), respectively. In Figure 3a, the original images contain spots with various shapes as well as blank spots. Our method recognizes them correctly as shown in Figure 3b. The white pixels correspond to the foreground, the black to the background, and the gray to the low-expressed uncertain foreground. The irregular shapes including doughnut are detected in the low-expressed spots. In each scatter plot of the original (R, G) intensity points whose spot is classified as blank, there exist no two clusters but one. The bandwidth of the kernel for calculating the smoothed posterior probability was fixed at h = 1.02. The package spotSegmentation detected blank spots, but generated noisy foreground segmentations in many spot rectangles (Fig. 3c). On the other hand, the Spot software segmented every spot rectangle into foreground and background including the blank spots (Fig. 3d).


Figure 3
View larger version (68K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 Segmentation results of spots in the Spot first example dataset. (a) original image; (b) GTMM; (c) spotSegmentation package; (d) Spot software.

 
The spotSegTest dataset consists of 24 spots arranged in 4 x 6 array. Figure 4 shows the segmentation results of four particular spot rectangles obtained by our method (GTMM) and the Spot and spotSegmentation methods. GTMM used the log transformed data. Each row represents a specific spot rectangle, the first two columns show the R and G intensities of the corresponding spot rectangle, respectively, and the remaining three columns contain the segmentations obtained by three different methods. The spots shown in consecutive rows of Figure 4 are (1, 3)th, (2, 1)th, (2, 5)th, (3, 6)th spot rectangles from upper left corner in the spotSegTest dataset image. The light gray region in GTMM (3rd row and 3rd column) represents the bright artifact, and the dark gray region (4th row and 3rd column) represents the low-expressed uncertain spot. The dark gray region in spotSegmentation (4th column) indicates the uncertain classification determined by Li et al. (2005). The result of Spot (5th column) contains the original image overlayed by the segmented image.


Figure 4
View larger version (52K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4 Segmentation results of spots in the spotSegTest dataset. Each row corresponds to a specific spot rectangle. The first two columns represent the R and G intensities, respectively, and the next three columns are the segmentations obtained by GTMM, spotSegmentation and Spot.

 
For the first spot, the three methods produced similar segmentations. As shown in the first and second columns, there is no expression, however, in the red channel, while there is high-expression in the green channel. Thus the background-corrected mean intensity of R, {varphi}1, should be 0 and the value of log2({varphi}1/{varphi}2) for this spot is –{infty}. The estimate Formula by GTMM is 0 since both foreground and background means of R were estimated as the same value of 5.6 (log scaled intensity), and Formula is –{infty}. The estimate of the background R mean by spotSegmentation is 303.0, while that of the foreground mean is 259.3. This produces a negative value for the estimate of the background-corrected mean, which is the common logical error arising when two means are estimated separately. Spot also generates a negative estimate for the background-corrected R mean since the estimates of foreground and background R mean are 256.6 and 347.9, respectively. The third spot (3rd row) contains the bright region on the bottom right corner of both R and G foreground, which indicates the existence of high-expression artifact. GTMM detected the artifact, spotSegmentation classified the region into uncertain area, but Spot did not detect it. The fourth spot (4th row) is blank as shown in the R and G columns. GTMM and spotSegmentation classified it as an uncertain spot, while Spot obtained the segmentation from the blank spot with the negative background-corrected mean intensity estimates for R and G. The rest of the spots in the spotSegTest dataset were segmented well with respect to the original images as the second spot (2nd row) in Figure 4 by all three methods.


    4 DISCUSSION AND CONCLUSIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION AND CONCLUSIONS
 REFERENCES
 
We have described a mixture model approach based on component gamma and t distributions to spot segmentation and intensity estimation for microarray images. The gamma distribution in the model is suited for the background intensity in the sense that it accounts for the intensities with a positive lower bound and is very flexible in its shape including asymmetric exponential type to symmetric normal type. It is bivariate by taking the R and G intensities to be independent in the background. For the underlying distribution for the foreground, we assume the bivariate t distribution. The t distribution provides a longer-tail alternative to the normal distribution, and its location parameter estimate is less affected by atypical observations. We implemented the EM algorithm to estimate the pixels' posterior probabilities, and then applied a nonparametric kernel smoothing technique that utilizes the neighborhood information in forming the posterior probabilities for the final segmentation. Our model constrains the mean intensity for the foreground to be greater than that for the background. Thus we can estimate the two mean intensities avoiding the common logical error that estimates of foreground may be less than those of the corresponding background.

We demonstrated our method on some simulated and real datasets in which it showed superior performance to the existing procedures considered. Our method identified various shapes of spots and blank spots as well as bright artifacts in the real experimental microarray images. In the simulations, it was also shown to estimate the background subtracted mean intensity closely to its true value.


    Acknowledgments
 
The authors are grateful for helpful comments from two referees that greatly improved the paper. The first author appreciates the hospitality of Department of Mathematics, University of Queensland, Australia for providing the facilities for this research during his visit. The work of J.B. and Y.S.S. was supported by Grant no. RTI04-03-03 from the Regional Technology Innovation Program of the Ministry of Commerce, Industry and Energy (MOCIE). The work of G.J.M. was supported by the Australian Research Council.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Satoru Miyano

Received on September 20, 2006; revised on November 30, 2006; accepted on December 6, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION AND CONCLUSIONS
 REFERENCES
 

    Adams, R. and Bischof, L. (1994) Seeded region growing. IEEE Trans. Pattern Anal. Mach. Intell, . 16, 641–647[CrossRef].

    Axon Instruments Inc. GenePix Pro 5.0, (2003) User's Guide & Tutorial.

    Brändle, N., et al. (2003) Robust DNA microarray image analysis. Mach. Vision Appl, . 15, 11–28[CrossRef].

    Chen, Y., et al. (1997) Ratio-based decisions and the quantitative analysis of cDNA microarray images. J. Biomed. Optics, 2, 364–374[CrossRef].

    Cleveland, W.S. and Devlin, S.J. (1988) Locally weighted regression: an approach to regression analysis by local fitting. J. Am. Stat. Assoc, . 83, 596–610[CrossRef][ISI].

    Demirkaya, O., et al. (2005) Segmentation of cDNA microarray spots using markov random field modeling. Bioinformatics, 21, 2994–3000[Abstract/Free Full Text].

    Eisen, M. Scanalyze, User manual, (1999) , Stanford, CA Stanford University.

    Glasbey, C.A. and Ghazal, P. (2003) Combinatorial image analysis of DNA microarray features. Bioinformatics, 19, 194–203[Abstract/Free Full Text].

    Gottardo, R., et al. (2006) Probabilistic segmentation and intensity estimation for microarray images. Biostatistics, 7, 85–99[Abstract/Free Full Text].

    GSI Lumonics. QuantArray Analysis Software, Operator's Manual, (1999) .

    Jain, A.N., et al. (2002) Fully automatic quantification of microarray image data. Genome Res, . 12, 325–332[Abstract/Free Full Text].

    Johnson, N.L. and Kotz, S. Continuous Univariate Distributions-1, (1970) , NY John Wiley & Sons.

    Li, Q., et al. (2005) Donuts, scratches and blanks: robust model-based segmentation of microarray images. Bioinformatics, 21, 2875–2882[Abstract/Free Full Text].

    McLachlan, G.J. and Peel, D. Finite Mixture Models, (2000) , NY John Wiley & Sons.

    Newton, M.A., et al. (2001) On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. J. Comput. Biol, . 8, 37–52[CrossRef][ISI][Medline].

    Simonoff, J.S. Smoothing Methods in Statistics, (1996) , NY Springer.

    Steinfath, M., et al. (2001) Automated image analysis for array hybridization experiments. Bioinformatics, 17, 634–641[Abstract/Free Full Text].

    van't Wout, A.B., et al. (2003) Cellular gene expression upon human immunodeficiency type 1 infection of CD4(+)-T-cell lines. J. Virology, 77, 1392–1402.

    Yang, Y.H., et al. (2002) Comparison of methods for image analysis on cDNA microarray data. J. Comput. Graph. Stat, . 11, 108–136[CrossRef].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
A. Daskalakis, D. Cavouras, P. Bougioukos, S. Kostopoulos, D. Glotsos, I. Kalatzis, G. C. Kagadis, C. Argyropoulos, and G. Nikiforidis
Improving gene quantification by adjustable spot-image restoration
Bioinformatics, September 1, 2007; 23(17): 2265 - 2272.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/4/458    most recent
btl630v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal