

Bioinformatics Advance Access originally published online on August 4, 2008
Bioinformatics 2008 24(19):2254-2255; doi:10.1093/bioinformatics/btn407

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

BLogo: a tool for visualization of bias in biological sequences

Wencheng Li 1,†, Bo Yang 1,†, Shaoguang Liang 2, Yonghua Wang 3, Chris Whiteley 4, Yicheng Cao 1 and Xiaoning Wang 1,*

1 School of Bioscience and Bioengineering, South China University of Technology, Guangzhou 510641, 2 National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, 3 College of Light Industry and Food Sciences, Key Lab of Fermentation and Enzyme Engineering, South China University of Technology, Guangzhou 510641, China and 4 Department of Biochemistry, Microbiology and Biotechnology, Rhodes University, Grahamstown 6139, South Africa

*To whom correspondence should be addressed.


    ABSTRACT
 

Summary: BLogo is a web-based tool that detects and displays statistically significant position-specific sequence bias with reduced background noise. The over-represented and under-represented symbols at a particular position are shown above and below the zero line, respectively. When the sequences are open reading frames, the background frequency of nucleotides can be calculated separately for the three positions of a codon, which greatly reduces the background noise. The χ2-test or Fisher's exact test is used to evaluate the statistical significance of every symbol at every position, and only the significant symbols are highlighted in the resulting logo. The Perl source code of the program is freely available and can be run locally.

Availability: http://acephpx.cropdb.org/blogo/, http://www.bioinformatics.org/blogo/

Contact: lwcbio{at}yahoo.com.cn; xnwang{at}21cn.net

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 
Sequence logos were introduced by Schneider and Stephens (1990), and WebLogo was later developed by Crooks et al. (2004) as a user-friendly logo generator; both are ways to visualize sequence conservation. The traditional sequence logo has two major drawbacks. (i) Every symbol is assumed to be equally distributed. When the sequences come from a biased genome (i.e. high G+C or A+T content), or from open reading frames (ORFs) in which the G+C content differs among the three positions of a codon, the sequence logo has high background noise and the informative signal is not well defined (Hasan and Schreiber, 2006). (ii) Traditional sequence logos were designed to show ‘conservation’ rather than ‘bias’ of sequences. Sometimes one needs to study the over-represented or under-represented symbols in a region of sequences, which cannot be presented in a traditional logo. We have developed a new sequence logo to overcome these two limitations, and have used it to study the bias of nucleotides/amino acids at the 5' end/N-terminus of genes/proteins in prokaryotic genomes (Li et al., 2007). Here, a website and Perl source code make the method freely available both online and for local installation.


    2 METHODS
 
For biased genomes, the formula for information content has been modified in at least two ways. Schreiber and Brown (2002) and Hasan and Schreiber (2006) applied two concepts from information theory, distortion and patterned interference (a type of noise), to correct signals. Gorodkin et al. (1997) and Stormo (1998) calculated the information content relative to a background distribution. The algorithm used here follows that of Stormo.

Information content was calculated for each position of sequences using the formula:


$$H(L) = \sum_i H(i,L) = \sum_i P(i,L)\,\log_2 \frac{P(i,L)}{P_i}$$

where L is the position in the sequences; i is a symbol (A, T, C or G for nucleotides, or one of the 20 amino acids for protein sequences); P(i,L) is the average probability of symbol i at position L; and Pi is the background probability of symbol i. H(i,L) is positive when P(i,L) is greater than Pi, and negative when P(i,L) is smaller than Pi.
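As an illustration of the definitions above, the following Python sketch computes H(i,L) and H(L) for a single alignment column. It assumes the relative-entropy form H(i,L) = P(i,L) log2(P(i,L)/Pi), which matches the sign behaviour just described; the function name `information_content` and the example background values are hypothetical and are not taken from the BLogo Perl source.

```python
from collections import Counter
from math import log2

def information_content(column, background):
    """Return ({symbol: H(i,L)}, H(L)) for one non-empty alignment column.

    column: string of the symbols observed at position L, one per sequence.
    background: dict mapping each symbol to its background probability Pi (> 0).
    Assumed form: H(i,L) = P(i,L) * log2(P(i,L) / Pi).
    """
    counts = Counter(column)
    total = len(column)
    per_symbol = {}
    for symbol, p_bg in background.items():
        p_obs = counts.get(symbol, 0) / total
        # A symbol that never occurs contributes 0 (limit of p*log2(p) as p -> 0).
        per_symbol[symbol] = p_obs * log2(p_obs / p_bg) if p_obs > 0 else 0.0
    return per_symbol, sum(per_symbol.values())

# Hypothetical backgrounds: equal frequencies versus a G+C-rich genome.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
gc_rich = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}

h_uniform, total_uniform = information_content("GGCCGGCC", uniform)
h_gc_rich, total_gc_rich = information_content("GGCCGGCC", gc_rich)
print(total_uniform, total_gc_rich)
```

With the equal-frequency background the column of G and C carries 1 bit, whereas against the G+C-rich background the same column carries about 0.32 bits, which is the noise reduction the background correction is meant to achieve.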

For the type 1 logo, the total height of the symbols at position L equals H(L), and the height of each symbol i is proportional to its observed frequency. For the type 2 logo, the letters in each stack are ordered from the largest H(i,L) to the smallest; letters with a positive H(i,L) are stacked above the zero line and those with a negative H(i,L) below it. The height of symbol i equals H(i,L).
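The two stacking rules can be sketched in Python as follows. This is only an illustration under the assumption that H(i,L) = P(i,L) log2(P(i,L)/Pi); the function names and the example frequencies are hypothetical, and the actual drawing code in BLogo may differ.

```python
from math import log2

def symbol_information(freqs, background):
    """H(i,L) per symbol, assuming H(i,L) = P(i,L) * log2(P(i,L) / Pi)."""
    return {s: (p * log2(p / background[s]) if p > 0 else 0.0)
            for s, p in freqs.items()}

def type1_stack(freqs, background):
    """Type 1: the stack height is H(L); each letter's share equals its frequency."""
    total = sum(symbol_information(freqs, background).values())
    return {s: p * total for s, p in freqs.items() if p > 0}

def type2_stacks(freqs, background):
    """Type 2: each letter's height is its own H(i,L), sorted largest first.

    Returns (above, below): letters with positive H(i,L) go above the zero
    line, letters with negative H(i,L) go below it.
    """
    ranked = sorted(symbol_information(freqs, background).items(),
                    key=lambda item: item[1], reverse=True)
    above = [(s, h) for s, h in ranked if h > 0]
    below = [(s, h) for s, h in ranked if h < 0]
    return above, below

# Hypothetical observed frequencies at one position and a hypothetical background.
freqs = {"A": 0.5, "C": 0.3, "G": 0.1, "T": 0.1}
background = {"A": 0.2, "C": 0.3, "G": 0.4, "T": 0.1}

heights_type1 = type1_stack(freqs, background)
above, below = type2_stacks(freqs, background)
print(heights_type1)
print(above, below)
```

In this example A is over-represented and appears above the zero line, G is under-represented and appears below it, and C and T match the background and so have zero height in the type 2 logo.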

For each symbol at every position, the χ2-test (for large samples) or Fisher's exact test (for small samples) was used to evaluate the statistical significance of the difference between the observed frequency of that symbol at that position and the background (expected) frequency. The difference was considered significant if the P-value was less than the threshold (0.05 by default; this can be changed by the user).
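A per-symbol test of this kind can be sketched in Python with the standard library only. The text does not state how the contingency table is built or where the switch between the two tests lies, so the sketch makes two assumptions: a 2x2 table comparing the symbol's count at the position with its count in a background sample, and the common rule of thumb that the χ2-test is used only when every expected cell count is at least 5. All function and parameter names are hypothetical.

```python
from math import comb, erfc, sqrt

def chi2_p_value(a, b, c, d):
    """P-value of the chi-square test (1 degree of freedom) on the table [[a, b], [c, d]]."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = sum((observed[r][k] - rows[r] * cols[k] / n) ** 2 / (rows[r] * cols[k] / n)
               for r in range(2) for k in range(2))
    # For 1 degree of freedom the upper tail probability is erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))

def fisher_p_value(a, b, c, d):
    """Two-sided Fisher's exact test: sum of all tables no more likely than the observed one."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    p_obs = prob(a)
    lowest, highest = max(0, col1 - (n - row1)), min(row1, col1)
    return min(1.0, sum(prob(x) for x in range(lowest, highest + 1)
                        if prob(x) <= p_obs * (1 + 1e-9)))

def symbol_p_value(k, n, bg_k, bg_n, min_expected=5):
    """k of n sequences carry the symbol at this position; bg_k of bg_n background symbols do.

    Assumption: chi-square when every expected count >= min_expected, otherwise Fisher.
    """
    a, b, c, d = k, n - k, bg_k, bg_n - bg_k
    smallest = min(r * col for r in (n, bg_n) for col in (a + c, b + d)) / (n + bg_n)
    if smallest >= min_expected:
        return chi2_p_value(a, b, c, d)
    return fisher_p_value(a, b, c, d)

threshold = 0.05  # default threshold given in the text
print(symbol_p_value(30, 40, 10, 40) < threshold)  # large sample, chi-square branch
print(symbol_p_value(3, 3, 0, 3) < threshold)      # small sample, Fisher branch
```

A symbol whose P-value is not below the threshold would then be drawn without highlighting.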


    3 INPUT AND GRAPHICAL OUTPUT
 
The input sequences can be DNA, RNA, protein or codon sequences in flat or FASTA format. The background frequency of symbols can be calculated from the input sequences or supplied by the user. When the sequences are codons, the background frequency of nucleotides can be calculated separately for the three positions of a codon. BLogo can create two types of logo: the type 1 logo is similar to WebLogo, and the type 2 logo is designed to show sequence bias, with the over-represented and under-represented symbols stacked above and below the zero line, respectively, and the height of every symbol proportional to its information content. The colors of the symbols are the same as the default setting of WebLogo, except that, when the statistical test is used, a symbol with a P-value larger than the threshold (default 0.05) is colored gray. In the type 1 logo (Fig. 1a), setting equal background frequencies for A, G, C and T produced very high ‘noise’, especially at the third position of the codon, where the nucleotide frequencies were highly unequal. In contrast, the type 2 logo (Fig. 1b) showed that the nucleotides A, C and T were over-represented and C, T and G under-represented, respectively, at the three positions (numbered 1, 2 and 3) of the first codon. This information was not shown by the type 1 logo (Fig. 1a).
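The codon-position-specific background described above can be sketched in Python as follows. The sketch assumes that every input sequence starts in frame, so that sequence index 0 is codon position 1; the function name `codon_position_background` and the two toy sequences are hypothetical and are not part of BLogo.

```python
from collections import Counter

def codon_position_background(orfs):
    """Return three dicts with the nucleotide frequencies at codon positions 1, 2 and 3.

    orfs: iterable of in-frame nucleotide strings (assumption: index 0 is codon position 1).
    """
    counts = [Counter(), Counter(), Counter()]
    for seq in orfs:
        for index, base in enumerate(seq.upper()):
            counts[index % 3][base] += 1
    background = []
    for counter in counts:
        total = sum(counter.values())
        background.append({base: n / total for base, n in counter.items()})
    return background

# Two toy in-frame sequences (hypothetical data).
background = codon_position_background(["ATGGCC", "ATGGCG"])
for position, freqs in enumerate(background, start=1):
    print(position, freqs)
```

The background used for alignment position L is then the dict for codon position ((L - 1) mod 3) + 1, rather than a single genome-wide distribution.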


Fig. 1. Logos created from the 21 nt downstream of the start codon of 642 genes of Streptomyces coelicolor (an organism with high G + C content and unbalanced nucleotide contents at the three positions of the codon). (a) Type 1 logo with the background frequencies of A, G, C and T set to 0.25. (b) Type 2 logo with the background frequency of nucleotides calculated separately for the three positions of the codon.

 

    ACKNOWLEDGEMENTS
 
We thank the Computing Cluster of South China University of Technology.

Funding: China Postdoctoral Science Foundation (20070410239 to W.L.); National Natural Science Foundation of China (20506007, 20706021); the Research Fund for the Doctoral Program of Higher Education (20070561073).

Conflict of Interest: none declared.


    FOOTNOTES
 
†The authors wish it to be known that, in their opinion, the first two authors are regarded as joint First Authors.

Associate Editor: Limsoon Wong

Received on May 21, 2008; revised on July 10, 2008; accepted on July 29, 2008

    REFERENCES
 

    Crooks GE, et al. WebLogo: a sequence logo generator. Genome Res (2004) 14:1188–1190.

    Gorodkin J, et al. Displaying the information contents of structural RNA alignments: the structure logos. Comput. Appl. Biosci (1997) 13:583–586.

    Hasan S, Schreiber M. Recovering motifs from biased genomes: application of signal correction. Nucleic Acids Res (2006) 34:5124–5132.

    Li W, et al. Sequences downstream of the start codon and their relations to G + C content and optimal growth temperature in prokaryotic genomes. Antonie Van Leeuwenhoek (2007) 92:417–427.

    Schneider TD, Stephens MR. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res (1990) 18:6097–6100.

    Schreiber M, Brown C. Compensation for nucleotide bias in a genome by representation as a discrete channel with noise. Bioinformatics (2002) 18:507–512.

    Stormo GD. Information content and free energy in DNA–protein interactions. J. Theor. Biol (1998) 195:135–137.

