nature genetics
    A user's guide to the human genome


volume 32 supplement pp 44 - 48

Question 7
How would an investigator easily find compiled information describing the structure of a gene of interest? Is it possible to obtain the sequence of any putative promoter regions?

One place to initiate this search is UCSC's Genome Browser. For purposes of this example, consider the gene encoding pendrin (PDS), a protein associated with developmental abnormalities of the cochlea, sensorineural hearing loss and diffuse thyroid enlargement (goiter).

From the UCSC home page, choose Human from the pull-down Organism list, and click on Browser. The user is now at the Human Genome Browser Gateway. The search in this case is simple: select Dec. 2001 from the assembly pull-down menu, type pendrin into the position box, and then click Submit. The returned results indicate one known gene and two mRNA sequences; click on the accession number of the mRNA sequence AF030880 to continue. The user will now be presented with a graphic overview of the region containing this mRNA. To gain a better perspective of the region, click on the 1.5× button next to zoom out. Finally, click the reset all button in the middle of the page to reset the tracks to their default settings.

Carrying out these steps will produce an output similar to that shown in Fig. 7.1. For the purpose of this question, however, the default settings are not ideal. Using the Track Controls at the bottom of the figure, and following the example in Fig. 7.2, set some tracks to hide mode (not shown), others to dense (all data condensed onto one line) and some to full (a separate line for each feature, up to 300). Before considering the actual data within these tracks, a brief discussion of the content and representation of these tracks is warranted. Many were provided to UCSC by outside individuals. Further information on the gene prediction methods briefly discussed below can be found elsewhere (ref. 15).
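The three display modes can be pictured with a small sketch. This is only an illustration of the idea, not the Genome Browser's own code: the function names, the half-open (start, end) interval representation and the choice to merge overlapping features in dense mode are all assumptions made for the example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end); an assumed representation


def dense(features: List[Interval]) -> List[Interval]:
    """Condense all features onto one line by merging overlapping intervals."""
    merged: List[Interval] = []
    for start, end in sorted(features):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous block: extend that block.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def full(features: List[Interval], limit: int = 300) -> List[List[Interval]]:
    """Give each feature its own line, up to `limit` lines."""
    return [[feature] for feature in sorted(features)[:limit]]


def render(features: List[Interval], mode: str) -> List[List[Interval]]:
    """Return the lines a track would occupy in the given display mode."""
    if mode == "hide":
        return []                      # track not shown at all
    if mode == "dense":
        return [dense(features)]       # exactly one line
    if mode == "full":
        return full(features)          # one line per feature
    raise ValueError(f"unknown mode: {mode!r}")
```

With three hypothetical features, `render(features, "dense")` yields a single line, whereas `render(features, "full")` yields three.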


The general convention for the Known Genes and predicted gene tracks (Fig. 7.1) is that each coding exon is shown as a tall, vertical bar or block. 5' and 3' untranslated regions are shown as shorter vertical bars or blocks.

Connecting introns are shown as very thin lines. The direction of transcription is indicated by the arrows along that thin line.
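The drawing convention just described can be mimicked in plain text. The sketch below is a hypothetical illustration, not the browser's rendering code: it assumes half-open exon coordinates, a known coding start and end, and arbitrary glyphs (`#` for coding blocks, `=` for untranslated blocks, `<` or `>` for introns).

```python
from typing import List, Tuple


def split_blocks(exons: List[Tuple[int, int]], cds_start: int, cds_end: int):
    """Split each exon into 'utr' and 'coding' segments (half-open coordinates)."""
    segments = []
    for start, end in sorted(exons):
        if start < cds_start:
            # Part of the exon lying before the coding region.
            segments.append((start, min(end, cds_start), "utr"))
        c_start, c_end = max(start, cds_start), min(end, cds_end)
        if c_start < c_end:
            segments.append((c_start, c_end, "coding"))
        if end > cds_end:
            # Part of the exon lying after the coding region.
            segments.append((max(start, cds_end), end, "utr"))
    return segments


def draw(exons: List[Tuple[int, int]], cds_start: int, cds_end: int, strand: str) -> str:
    """Draw one gene model, one character per base."""
    exons = sorted(exons)
    left, right = exons[0][0], exons[-1][1]
    arrow = ">" if strand == "+" else "<"
    line = [arrow] * (right - left)  # introns: thin line with direction arrows
    for start, end, kind in split_blocks(exons, cds_start, cds_end):
        glyph = "#" if kind == "coding" else "="
        for pos in range(start, end):
            line[pos - left] = glyph
    return "".join(line)


# A toy two-exon gene on the minus strand (coordinates are made up).
print(draw([(0, 4), (8, 12)], cds_start=2, cds_end=10, strand="-"))
# prints ==##<<<<##==
```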

Known Genes are taken from mRNA reference sequences within LocusLink (ref. 10). These reference sequences have been aligned against the genome using BLAT.

The Acembly Gene Predictions With Alt-splicing track is derived from the alignment of human mRNA and EST sequence data against the genome, using the program Acembly. This program attempts to find the best alignment of each mRNA against the genome and considers alternative splice models. If more than one gene model with statistical significance can be produced, each of these is shown in the display. Additional information on Acembly can be found on the NCBI web site.

The Ensembl Gene Predictions track (ref. 7) is provided by Ensembl. The Ensembl genes are predicted by a range of methods, including homology to known mRNAs and proteins, ab initio gene prediction using GENSCAN and gene prediction HMMs.

The Fgenesh++ Gene Predictions come from a method that predicts internal exons by looking for structural features such as donor and acceptor splice sites, putative coding regions and intronic regions both 5' and 3' to a putative exon using a dynamic programming algorithm; the method also takes into account protein similarity data (ref. 16).

The Genscan Gene Predictions derive from a method called GENSCAN, through which introns, exons, promoter sites and poly(A) signals can be identified. Here, the method does not expect the query sequence to represent one and only one gene, so it can make accurate predictions for either partial genes or multiple genes separated by intergenic DNA (ref. 11).

The Human mRNAs from Genbank track shows alignments between human mRNAs in GenBank and the genome sequence.

The Spliced ESTs and Human EST tracks show the alignment of ESTs from GenBank against the genome. Because ESTs usually represent fragments of transcribed genes, there is high likelihood that an EST corresponds to an exonic region.


Finally, the Repeating Elements by RepeatMasker track shows, as its name would suggest, repetitive elements such as short and long interspersed nuclear elements (SINEs and LINEs), long terminal repeats (LTRs) and low-complexity regions. It is customary to remove or 'mask' these elements before applying a gene prediction method to a nucleotide sequence.
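What masking means in practice can be shown with a minimal sketch. It assumes the repeat positions are already known as half-open (start, end) intervals, and it illustrates two common conventions: replacing repeats with N ('hard' masking) or lower-casing them ('soft' masking). The function name and arguments are hypothetical and do not correspond to any particular program's interface.

```python
from typing import Iterable, Tuple


def mask(sequence: str, repeats: Iterable[Tuple[int, int]], hard: bool = False) -> str:
    """Mask repeat intervals in a nucleotide sequence.

    hard=True  replaces each repeat base with 'N'.
    hard=False lower-cases each repeat base and leaves the rest unchanged.
    """
    chars = list(sequence)
    for start, end in repeats:
        # Clamp the interval so out-of-range coordinates are ignored.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N" if hard else chars[i].lower()
    return "".join(chars)


print(mask("ACGTACGTAC", [(2, 5)]))             # prints ACgtaCGTAC
print(mask("ACGTACGTAC", [(2, 5)], hard=True))  # prints ACNNNCGTAC
```

Note that soft masking uses letter case to mark repeats, so it should not be confused with sequence output in which case marks exons.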

Returning to the example shown in Fig. 7.2, notice that most of the tracks return a nearly identical gene prediction; as a rule, exons predicted by multiple methods increase the likelihood that the prediction is actually correct and does not represent a 'false positive'. Most of the methods show a 3' untranslated region, indicated by the heavy, shorter block at the left of the predictions. The Acembly track shows three possible alternative splices in addition to the full-length product shown in the third line of that section, a prediction that agrees with those shown in most of the other tracks. The Genscan track extends off to both the right and the left: GENSCAN can be used to predict multiple genes, and this display implies that the method has been applied in this fashion.
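The rule of thumb that agreement between methods raises confidence can be expressed as a simple tally. The sketch below is an illustration under stated assumptions: each track is reduced to a list of (start, end) exon coordinates, exons are compared by exact coordinate match (a deliberately crude criterion), and the track names and coordinates in the example are invented.

```python
from collections import Counter
from typing import Dict, List, Tuple

Exon = Tuple[int, int]


def exon_support(tracks: Dict[str, List[Exon]]) -> Counter:
    """Count the number of tracks in which each exact exon appears."""
    support: Counter = Counter()
    for exons in tracks.values():
        for exon in set(exons):  # count each track at most once per exon
            support[exon] += 1
    return support


def consensus(tracks: Dict[str, List[Exon]], min_tracks: int = 2) -> List[Exon]:
    """Return exons predicted by at least `min_tracks` tracks, sorted by position."""
    support = exon_support(tracks)
    return sorted(exon for exon, n in support.items() if n >= min_tracks)


# Hypothetical predictions from three methods.
tracks = {
    "method_a": [(100, 200), (300, 400), (500, 600)],
    "method_b": [(100, 200), (300, 400)],
    "method_c": [(100, 200), (310, 400), (700, 800)],
}
print(consensus(tracks))                 # prints [(100, 200), (300, 400)]
print(consensus(tracks, min_tracks=3))   # prints [(100, 200)]
```

A more forgiving comparison would treat exons as matching when they overlap or share one boundary, since methods often disagree by a few bases at exon ends.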

Although these graphical overviews are useful, the investigator will more often than not want the actual sequence corresponding to these blocks. For this example, the Fgenesh++ prediction will be used as the basis for obtaining raw sequence data, but the steps will be identical regardless of which track is chosen. Click on the track labeled Fgenesh++ Gene Predictions to go to a summary page describing the prediction (Fig. 7.3). The region has sequence similarity to the pendrin gene (which was already known at the beginning of the example). The size and the beginning- and end-points of the prediction are given, and it is indicated that the prediction lies on the minus strand; this was also indicated in Fig. 7.2 by the left-pointing arrows in the intronic regions. To obtain the sequence, click on Genomic Sequence. The user will be taken to a query page entitled Get Genomic Sequence Near Gene, from which the transcript, coding region, promoter, or both the transcript and promoter can be obtained (Fig. 7.4). For each of the options, the sequence is returned in FASTA format, with the nucleotide coordinates being given in the definition line.

Transcript returns the sequence of the entire transcript, with exons shown in upper-case letters.

Coding Region Only returns just the coding region, with exons shown in upper-case letters.

Transcript + Promoter appends the promoter sequence to the 5' end of the sequence that the user would have obtained by using the Transcript option, with exons shown in upper-case letters. The length of the promoter can be indicated in the text box.

Promoter returns just the promoter region, as shown in Fig. 7.5.
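Once a sequence has been saved, the case convention makes the exons easy to recover programmatically. The following sketch assumes a single FASTA record in which exons are upper case and everything else is lower case; the definition line in the example is invented, and the helper names are hypothetical. The last function illustrates why strand matters for the promoter: upstream of the 5' end means lower coordinates for a plus-strand gene but higher coordinates for a minus-strand gene such as the one in this example.

```python
import re
from typing import List, Tuple


def read_fasta(text: str) -> Tuple[str, str]:
    """Return (definition line, sequence) for the first record in `text`."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input does not start with a FASTA definition line")
    sequence_lines = []
    for ln in lines[1:]:
        if ln.startswith(">"):  # stop at the next record
            break
        sequence_lines.append(ln)
    return lines[0][1:], "".join(sequence_lines)


def upper_case_runs(sequence: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, bases) for each run of upper-case letters (exons)."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"[A-Z]+", sequence)]


def promoter_window(tx_start: int, tx_end: int, strand: str, length: int) -> Tuple[int, int]:
    """Half-open genomic interval of `length` bases upstream of the 5' end."""
    if strand == "+":
        return (max(tx_start - length, 0), tx_start)
    if strand == "-":
        return (tx_end, tx_end + length)
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")


record = """>example_transcript (hypothetical definition line)
ACGTacgtac
GGCCttaa
"""
header, seq = read_fasta(record)
print(upper_case_runs(seq))                   # prints [(0, 4, 'ACGT'), (10, 14, 'GGCC')]
print(promoter_window(100, 200, "-", 50))     # prints (200, 250)
```

The positions returned by `upper_case_runs` are offsets within the retrieved sequence, not genomic coordinates; converting between the two requires the coordinates given in the definition line.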

  1. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860-921 (2001).
  2. Collins, F.S. & McKusick, V.A. Implications of the Human Genome Project for medical science. J. Am. Med. Assoc. 285, 540-544 (2001).
  3. Watson, J.D. & Crick, F.H.C. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 171, 737-738 (1953).
  4. Green, E.D. Strategies for the systematic sequencing of complex genomes. Nature Rev. Genet. 2, 573-583 (2001).
  5. Ouellette, B.F.F. & Boguski, M.S. Database divisions and homology search files: a guide for the perplexed. Genome Res. 7, 952-955 (1997).
  6. Bairoch, A. & Apweiler, R. The SWISS-PROT Protein Sequence Database and its supplement TREMBL in 2000. Nucleic Acids Res. 28, 45-48 (2000).
  7. Hubbard, T. et al. The Ensembl Genome Database Project. Nucleic Acids Res. 30, 38-41 (2002).
  8. Kent, W.J. BLAT--the BLAST-like Alignment Tool. Genome Res. 12, 656-664 (2002).
  9. Stein, L. Genome annotation: from sequence to biology. Nature Rev. Genet. 2, 493-503 (2001).
  10. Pruitt, K.D. & Maglott, D.R. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29, 137-140 (2001).
  11. Burge, C.B. & Karlin, S. Finding the genes in genomic DNA. Curr. Opin. Struct. Biol. 8, 346-354 (1998).
  12. Schuler, G.D. Electronic PCR: bridging the gap between genome mapping and genome sequencing. Trends Biotechnol. 16, 456-459 (1998).
  13. Sherry, S.T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308-311 (2001).
  14. Hamosh, A. et al. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 30, 52-55 (2002).
  15. Baxevanis, A.D. & Ouellette, B.F.F. (eds.) Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins (John Wiley & Sons, New York, 2001).
  16. Solovyev, V.V., Salamov, A.A. & Lawrence, C.B. Identification of human gene structure using linear discriminant functions and dynamic programming. Proc. Int. Conf. Intell. Syst. Mol. Biol. 3, 367-375 (1995).
  17. Yeh, R.F., Lim, L.P. & Burge, C.B. Computational inference of homologous gene structures in the human genome. Genome Res. 11, 803-816 (2001).
  18. Marchler-Bauer, A. et al. CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res. 30, 281-283 (2002).
  19. Apweiler, R. et al. InterPro--an integrated documentation resource for protein families, domains and functional sites. Bioinformatics 16, 1145-1150 (2000).
  20. Rebhan, M., Chalifa-Caspi, V., Prilusky, J. & Lancet, D. GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 14, 656-664 (1998).
  21. Blake, J.A., Richardson, J.E., Bult, C.J., Kadin, J.A. & Eppig, J.T. The Mouse Genome Database (MGD): the model organism database for the laboratory mouse. Nucleic Acids Res. 30, 113-115 (2002).
  22. Hudson, T.J. et al. A radiation hybrid map of mouse genes. Nature Genet. 29, 201-205 (2001).
  23. Bateman, A. et al. The Pfam protein families database. Nucleic Acids Res. 30, 276-280 (2002).
  24. Letunic, I. et al. Recent improvements to the SMART domain-based sequence annotation resource. Nucleic Acids Res. 30, 242-244 (2002).
  25. Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402 (1997).
  26. Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge Univ. Press, Cambridge, 1998).
  27. Peri, S., Ibarrola, N., Blagoev, B., Mann, M. & Pandey, A. Common pitfalls in bioinformatics-based analyses: look before you leap. Trends Genet. 17, 541-545 (2001) [erratum Trends Genet. 18, 218 (2002)].
  28. Ponting, C. Issues in predicting protein function from sequence. Brief. Bioinform. 2, 19-29 (2001).
  29. Aparicio, S.A.J.R. How to count ... human genes. Nature Genet. 25, 129-130 (2000).
  30. Beadle, G.W. & Tatum, E.L. Genetic control of biochemical reactions in Neurospora. Proc. Natl Acad. Sci. USA 27, 499-506 (1941).
  31. Jeffery, C.J., Bahnson, B.J., Chien, W., Ringe, D. & Petsko, G.A. Crystal structure of rabbit phosphoglucose isomerase, a glycolytic enzyme that moonlights as neuroleukin, autocrine motility factor, and differentiation mediator. Biochemistry 39, 955-964 (2000).
  32. Wistow, G. & Piatigorsky, J. Recruitment of enzymes as lens structural proteins. Science 236, 1554-1556 (1987).
  33. Jeffery, C.J. Moonlighting proteins. Trends Biochem. Sci. 24, 8-11 (1999).
  34. Chothia, C. Proteins. One thousand families for the molecular biologist. Nature 357, 543-544 (1992).
  35. Hegyi, H. & Gerstein, M. The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J. Mol. Biol. 288, 147-164 (1999).
  36. Jansen, R. & Gerstein, M. Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed proteins. Nucleic Acids Res. 28, 1481-1488 (2000).
  37. Brenner, S.E. Errors in genome annotation. Trends Genet. 15, 132-133 (1999).
  38. Smith, R.F. Perspectives: sequence data base searching in the era of large-scale genomic sequencing. Genome Res. 6, 653-660 (1996).

Copyright 2002 Nature Publishing