
progress

© 2002 Nature Publishing Group http://genetics.nature.com

A genomic view of alternative splicing

Barmak Modrek & Christopher Lee

Recent genome-wide analyses of alternative splicing indicate that 40–60% of human genes have alternative splice forms, suggesting that alternative splicing is one of the most significant components of the functional complexity of the human genome. Here we review these recent results from bioinformatics studies, assess their reliability and consider the impact of alternative splicing on biological functions. Although the ‘big picture’ of alternative splicing that is emerging from genomics is exciting, there are many challenges. High-throughput experimental verification of alternative splice forms, functional characterization, and regulation of alternative splicing are key directions for research. We recommend a community-based effort to discover and characterize alternative splice forms comprehensively throughout the human genome.


Introduction

The sequencing of the human genome has raised important questions about the nature of genomic complexity. It was widely anticipated that the human genome would contain a much larger number of genes (estimates based on expressed-sequence clustering ran as high as 150,000 genes) than Drosophila (14,000 genes) or Caenorhabditis elegans (19,000 genes) 1–3. The report of only 32,000 human genes thus came as a surprise 4,5. This basic disparity indicated that the number of human expressed-sequence (mRNA) forms was much higher than the number of genes, suggesting a major role for alternative splicing in the production of complexity. Many groups have recently presented genomic analyses of alternative splicing that strongly support this hypothesis, raising intriguing questions about the identification, functional roles and regulation of alternative splice forms across the whole genome.

The study of alternative splicing has long been a valuable subfield of molecular biology, but it has received little attention compared with major fields such as the discovery of new genes or transcriptional regulation. Only several hundred alternatively spliced genes have been identified so far by molecular biologists (see Table 1 for database resources). After the discovery of exons and introns in the Adenovirus hexon gene in 1977 (ref. 6), Walter Gilbert proposed that different combinations of exons could be spliced together (‘alternative splicing’) to produce different mRNA isoforms of a gene 7. By the early 1980s, alternative splicing was well documented in several genes 8,9, and researchers estimated that 5% of genes in higher eukaryotes might have alternative splicing 10. A range of processes from sex determination to apoptosis use alternative splicing 11,12. Its regulatory mechanisms have recently been discovered in several genes 11,13.

Genome-scale analyses of alternative splicing

High-throughput sequencing of the human genome and especially of expressed sequence tag (EST) sequences has enabled a completely different approach based on bioinformatics. Because ESTs are derived from fully processed mRNA (after 5′ capping, splicing and polyadenylation), they provide a broad sample of mRNA diversity. This diversity can be analyzed computationally.

In the last two years, bioinformatics studies have identified an order of magnitude more alternatively spliced genes than were found in the past 20 years and are beginning to provide a global view of alternative splicing in humans. We will first describe these studies and then assess the evidence.

Bioinformatics approaches. Most bioinformatics studies 4,14–18 (Table 2) rely on identifying ESTs that come from the same gene and looking for differences between them that are consistent with alternative splicing, such as a large insertion or deletion in one EST (Fig. 1a). Each candidate splice can be further assessed by aligning the ESTs exactly to their gene sequence in the draft genome (Fig. 1b). This reveals candidate exons (matches to the genomic sequence) separated by candidate splices (large gaps in the EST-genomic alignment; Fig. 1b). As intronic sequences at splice junctions are highly conserved (99.24% of introns have a GT-AG at their 5′ and 3′ ends, respectively), they can be used to verify candidate splices 19.

In the earliest large-scale discovery of new alternative splicing, Mironov et al. 14 aligned ESTs to genomic sequence for 392 known genes and found alternative splicing in 133 of these genes. Croft et al. 15 took a different approach that did not rely on aligning ESTs to the complete genomic sequence: they created a database of individual intron sequences annotated in GenBank and searched for EST sequences that matched intronic sequence. They found such matches, each suggesting an alternative splice, to introns from 582 genes. Brett et al. 16 looked for insertions or deletions in ESTs relative to a set of known mRNAs, indicative of alternative splices, but without EST alignment to the genomic sequence. This work identified 3,011 alternatively spliced genes. The International Human Genome Sequencing Consortium reported 145 alternatively spliced genes from a comprehensive analysis of chromosome 22 based on aligning ESTs to the genomic sequence 4. Modrek et al. 18 aligned available human EST and mRNA sequences (2.1 million) to the whole draft genome, applying strict matching, splice site and alternative splice detection criteria, to identify 6,201 alternative splices in 2,272 genes.

Alternative splicing frequency. These studies have consistently reported a high rate of alternative splicing in the human genome, with 35–59% of human genes showing evidence of at


Table 1 • Description and URLs for some alternative splicing databases

| Resource | Description | URL |
| --- | --- | --- |
| Literature-based alternative splicing databases | | |
| ASDB (ref. 36) | alternative splicing database using GenBank and SWISS-PROT annotation | http://cbcg.nersc.gov/asdb |
| AsMamDB (ref. 37) | database of alternative splices in human, mouse and rat | http://166.111.30.65/ASMAMDB.html |
| Alternative Splicing Database (ref. 30) | database of alternative splices from literature | http://cgsigma.cshl.org/new_alt_exon_db2/ |
| Yeast Intron Database (ref. 38) | database of introns in yeast | http://www.cse.ucsc.edu/research/compbio/yeast_introns.html |
| New alternative splicing discovery databases | | |
| The Intronerator (ref. 39) | alternative splicing in C. elegans based on analysis of EST data | http://www.cse.ucsc.edu/kent/intronerator |
| ISIS (ref. 15) | Intron Sequence Information System; has a section covering detected human alternative splices | http://isis.bit.uq.edu.au/ |
| TAP (ref. 17) | Transcript Assembly Program; results of alternative splicing analysis | http://stl.wustl.edu/~zkan/TAP/ |
| HASDB (ref. 18) | database of alternative splices detected in human EST data | http://www.bioinformatics.ucla.edu/HASDB |


least one alternative splice form 4,14,16–18. Moreover, given that only a few ESTs have been sequenced for most genes, it seems possible that even more alternative splicing exists that is not yet detectable in the available ESTs. These studies indicate that alternative splicing is far more abundant, ubiquitous and functionally important than previously thought. Other types of mRNA isoform are also common. For example, bioinformatics studies have reported that about 25% of genes have alternative polyadenylation forms, that is, mRNAs that are cleaved and polyadenylated at different sites 4,20.

Functional impact. How do these newly discovered alternative mRNA forms affect protein function? Despite an early report that most alternative splices occur within the 5′ untranslated region 14, recent studies indicate that 70–88% of alternative splices change the protein product 4,17,18. The majority of these changes appear to be functionally interesting, such as replacement of the amino or carboxy terminus, or in-frame addition and removal of a functional unit (Fig. 2b) 18. Only 19% of the alternative protein forms were shortened due to frameshift 18. Fig. 2c shows an alternative isoform of a new FCε receptor β-like protein, whose C-terminal transmembrane domain (TM) and cytoplasmic tail (important for signal transduction in this class of receptors) are neatly replaced with a new TM domain and tail by alternative polyadenylation 18.

What is the functional pattern of alternative splicing across the genome? A random sample of 50 alternatively spliced genes showed that over three-quarters were involved in signaling and regulation (such as receptors, signal transduction, transcription factors, and so on). Moreover, the systemic categories most highly represented in this sample were genes specific to the immune and nervous systems 18. This should be interpreted cautiously, as the overall breakdown of gene functions in the whole genome is still unclear. However, alternative splicing may be most important in complex systems where information must be processed differently at different times (such as immune tolerance, or development) or where a very high level of diversity is required (such as axonal guidance). Notable examples of combinatorial alternative splicing of multiple cassettes of exons, generating up to 40,000 isoforms of a single gene, have recently been discovered in the nervous system, including Dscam (axonal guidance receptor in Drosophila) and neurexin (neuropeptide receptor) 21.
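The distinction drawn above between in-frame changes and frameshifts comes down to whether the inserted or deleted segment is a multiple of three nucleotides. The following Python sketch illustrates only that arithmetic. The function name and the example lengths are hypothetical and are not taken from any of the cited studies, and the sketch deliberately ignores the possibility of a stop codon inside an inserted segment.

```python
def classify_coding_change(length_difference: int) -> str:
    """Classify an insertion (+) or deletion (-) inside a coding region.

    Assumption: only the length difference in nucleotides is considered.
    An in-frame insertion that contains a stop codon would still truncate
    the protein; detecting that needs the sequence and reading frame.
    """
    if length_difference == 0:
        return "no change"
    if length_difference % 3 == 0:
        return "in-frame insertion/deletion"
    return "frameshift"


# Hypothetical cassette exons: including a 90-nt exon keeps the frame,
# whereas skipping a 100-nt exon shifts it.
print(classify_coding_change(90))
print(classify_coding_change(-100))
```

A length-only rule of this kind is a first filter; predicting the actual protein change still requires the open reading frame of each isoform.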

Table 2 • Summary of some recent large-scale alternative splicing papers

| Paper | Summary |
| --- | --- |
| Mironov et al. (1999) (ref. 14) | Used TIGR Human Gene Index. ESTs were aligned to genomic sequence, and the genomic sequence was used to create superstructures of EST clusters. Identified 133 alternatively spliced genes. Estimated at least 35% of human genes are alternatively spliced. |
| Brett et al. (2000) (ref. 16) | Aligned ESTs to mRNA to identify insertions and deletions. Identified 3,011 alternatively spliced genes. Verified 16 of the 20 genes with predicted alternative splicing. Estimated at least 38% of human genes are alternatively spliced. |
| Croft et al. (2000) (ref. 15) | Created a database of introns from GenBank. Identified EST sequences that matched regions previously only designated as intronic. Identified 582 alternatively spliced genes. Estimated at least 22% of human genes are alternatively spliced. |
| Human Genome Consortium (2001) (ref. 4) | Reconstructed the alignment of 642 different transcripts (mRNA and EST) covering 245 genes on chromosome 22 genomic sequence. 145 genes had alternative splicing. Estimated at least 59% of human genes are alternatively spliced. |
| Kan et al. (2001) (ref. 17) | Aligned EST and genomic sequence to create transcripts. Analysis of conservation of alternative splicing between human and mouse. Identified 374 alternatively spliced genes. Estimated 55% of human genes are alternatively spliced. |
| Modrek et al. (2001) (ref. 18) | Used UniGene. ESTs were aligned to genomic sequence and alternative splices were detected. Identified 6,201 alternative splice relationships in 2,272 clusters. Provides functional analysis of alternative splicing. Estimated at least 42% of human genes are alternatively spliced. |


Fig. 1 Computational identification of alternative splicing. a, Insertion and deletion in ESTs relative to mRNA are identified as potential alternative splices. b, Splices are identified and intronic splice junction donor and acceptor sites are checked. Alternative splices are detected when two splices are mutually exclusive (intron inclusions are not identified as alternative splices).

[Figure 1 panel labels. a: ESTs aligned to an mRNA; perfect matches; an insert marks a candidate alternative splice. b: ESTs aligned to genomic sequence; perfect matches to genomic exons (short internal exons, long 3′ terminal exon); EST gaps match genomic introns; EST gap boundaries match known splice site patterns (GT donor, AG acceptor); EST poly-A tails confirm the 3′ exon terminus.]
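The check illustrated in Fig. 1b can be sketched in a few lines of Python: take the genomic blocks that an EST aligns to, treat the gaps between consecutive blocks as candidate introns, and test whether each gap begins with GT and ends with AG. This is a minimal illustration under stated assumptions, not the procedure of any cited study. The function names, the `min_gap` threshold and the toy sequence are hypothetical, coordinates are 0-based and half-open, and only the forward strand is handled.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # 0-based, half-open genomic interval of one aligned EST segment


def alignment_gaps(blocks: List[Block]) -> List[Block]:
    """Return the genomic gaps between consecutive aligned blocks (candidate introns)."""
    ordered = sorted(blocks)
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
            if next_start > prev_end]


def is_gt_ag(genome: str, gap: Block) -> bool:
    """True if the gap starts with GT and ends with AG (forward strand only)."""
    intron = genome[gap[0]:gap[1]].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def check_candidate_splices(genome: str, blocks: List[Block],
                            min_gap: int = 20) -> List[Tuple[Block, bool]]:
    """Pair each sufficiently large alignment gap with its GT-AG verdict.

    min_gap is an arbitrary illustrative cutoff separating candidate
    introns from small alignment gaps.
    """
    return [(gap, is_gt_ag(genome, gap))
            for gap in alignment_gaps(blocks)
            if gap[1] - gap[0] >= min_gap]


# Toy genomic sequence: two 6-nt "exons" around a 24-nt GT...AG "intron".
genome = "ACGTAC" + "GT" + "C" * 20 + "AG" + "TTGACA"
print(check_candidate_splices(genome, [(0, 6), (30, 36)]))
```

Because the GT and AG dinucleotides lie in the intron, which the EST itself does not contain, a check of this kind uses evidence that is independent of the EST sequence.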

Bioinformatics evidence for alternative splicing

It is essential that biologists understand the forms of evidence and problems that underlie this new ‘big picture’ view of alternative splicing. Bioinformatics is an automated analysis of high-throughput experimental data and follows a very different process than traditional molecular biology. It can be simultaneously ‘more rigorous’ (much more detailed, mathematical measures of evidence are required for a computer to do this analysis at all) and much less rigorous (bioinformaticists typically cannot order a new set of experimental tests for all the isoforms they detect, as is common in molecular biology labs studying a specific isoform). Two kinds of problems must be distinguished: (i) a false negative, the failure to detect a real splice form, and (ii) a false positive, a reported result that is not a true, functional splice form. Analyzing the causes of these problems during cDNA library construction, EST sequencing and sequence comparison suggests many interesting questions for the next stage of this research (Table 3).

Detection of alternative splicing through bioinformatics depends on finding deviant EST forms within the mass of data produced by undirected EST sequencing. This raises a fundamental concern: when an analysis is used to look for some form of deviation in a very large data set, other causes of deviation, even if infrequent, could add up to a substantial fraction of the result. How can we be sure this is real alternative splicing? The bioinformatics studies have tried carefully to screen out many possible sources of false positives. Simple forms of EST deviation, such as random variation in where a given EST sequence begins or ends within a gene, and potential vector

Fig. 2 Types of alternative splicing and possible effects on protein. a, Alternative splicing can lead to either the inclusion or exclusion of an exon, use of a different 5′ site, or use of a different 3′ site. b, Alternative splicing can lead to use of a different site for translation initiation (alternative initiation), a different translation termination site due to a frameshift (truncation or extension), or the addition or removal of a stop codon in the alternative coding sequence (alternative termination). Alternative splicing can also change the internal region because of an in-frame insertion or deletion. c, Alternative splicing of Hs.11090, a putative FCε receptor β chain homolog: genomic structure and two alternative spliced (and polyadenylated) mRNA forms. The differential RNA processing results in substitution of one transmembrane domain instead of another. However, one form has a different cytoplasmic tail (involved in signaling in this family), whereas the other does not.

[Figure 2 panel labels. a: exon inclusion/exclusion; alternative 5′ site; alternative 3′ site. b: alternative N terminus (alternative initiation); alternative C terminus (truncation/extension); in-frame insertion/deletion. c: genomic structure with exons I, IIa,b, III, IV, Va,b, VI and VII and two poly A sites; inferred mRNA and protein for the form I–IIa–IIb–III–IV–Va–VI–VII and, by alternative polyadenylation, the form I–IIa–IIb–III–IV–Va–Vb; TM marks transmembrane domains; N and C mark the protein termini.]

nature genetics • volume 30 • january 2002


contamination at the ends of ESTs, are excluded. The most important screen is provided by mapping (aligning) ESTs to the draft human genome sequence. Chimeric ESTs can be easily excluded by requiring that each EST align completely to a single genomic locus. The genomic location found by homology search and alignment can often be checked against radiation hybrid mapping data. As the genomic regions that match the ESTs should be exons and the alignment gaps between them should be introns, the putative splice sites at their boundaries can be carefully checked. Because the splice-site motifs (GT-AG, polypyrimidine tract, and so on) are primarily in the intron, this provides a validation that is independent of the EST evidence. Reverse transcriptase artifacts or other problems causing imperfect cDNA construction may be screened out in this way.

Improper inclusion of genomic sequence in ESTs (due to either mRNA purification problems or incomplete splicing) can also be excluded by requiring pairs of mutually exclusive splices in different ESTs. Observing a given splice in one EST but not in a second EST may be insufficient, because the latter could be an unspliced EST rather than a biologically significant intron inclusion. This problem can be eliminated by focusing on mutually exclusive splices: two different splices, seen in different ESTs, that overlap in the genomic sequence. One can make this even stricter by requiring that the two splices share one splice site but differ at the other. This approach detects the classic forms of alternative splicing, such as alternative exon usage and alternative 5′ or 3′ splicing (Fig. 2a). Detection of valid intron inclusions will probably require further statistical analysis.

The presence in the human genome of many pseudogenes and paralogous genes resembling other genes greatly complicates the problem. Correct alternative splice detection depends on clustering the EST data into separate groups representing individual genes. EST clustering (such as UniGene) is well known to have both excessive ‘splitting’ of genes (there are 80,000 UniGene clusters, versus the estimate of 32,000 human genes) and excessive ‘lumping’, in which paralogous gene sequences are mixed together 4,22. This mixing can suggest spurious alternative splices that are actually just differences between similar but distinct genes 23. Methods that map the ESTs onto genomic sequence with a high level of identity (95–98%) probably exclude much of this paralog mixing, but not all. Ultimately, mapping ESTs to their unique gene location in the genomic sequence is the only way to sort out paralogs.

Requiring that the consensus sequence for an EST cluster match completely, over its full length, to its genomic contig can help exclude artifacts where the genomic sequence has been misassembled. Instead of producing false positives (incorrect alternative splices), this requirement may cause false negatives, because the EST cluster is not mapped at all. A high rate of false negatives is the greatest disadvantage of methods that require mapping ESTs to the draft genome sequence.

Despite these sources of uncertainty, the agreement among many studies on a high frequency of alternatively spliced genes (35–60%) suggests that this result is valid. These studies support

Table 3 • Types of problems inherent in high-throughput alternative splice detection

| Experimental factors | Type of errors caused: (+) false positive, (–) false negative |
| --- | --- |
| EST coverage limitations, bias | (–). Most genes have very few ESTs, from even fewer tissues. The main barrier to alternative splice detection. |
| RT / PCR artifacts | (+) in methods that don’t screen for fully valid splice sites (which requires genomic mapping, intronic sequence). |
| Genomic coverage, assembly errors | (–) in methods that map ESTs on the genome. Short contigs may cause >25% false negatives. |
| Chimeric ESTs | (+) in methods that simply compare ESTs. |
| Genomic contamination | (+) in methods that don’t screen for pairs of mutually exclusive splices. |
| EST orientation error, uncertainty | (+/–) in methods that don’t correct misreported orientation, or don’t distinguish overlapping genes on opposite strands. |
| Sequencing error | (+/–). Single-pass EST sequencing error can be very high locally (e.g. >10% at the ends). Need chromatograms. |
| EST fragmentation | Where ESTs end cannot be treated as significant. |

| Bioinformatics factors | Type of errors caused: (+) false positive, (–) false negative |
| --- | --- |
| Mapping ESTs to the genome | (–) in methods that map genomic location for each EST. |
| Paralogous genes | (+) in all current methods, but mostly in those that don’t map genomic location or don’t check all possible locations. |
| Rigorous measures of evidence | (+/–). How can the strength of experimental evidence for a specific splice form be measured rigorously? |
| Arbitrary cutoff thresholds | (+/–) in methods that use cutoffs (such as ‘99% identity’). |
| Alignment size limitations | (–) in methods that can’t align >10², >10³ sequences. |
| ‘Pathological’ assemblies | (+/–). What should assembly programs do when the assembled reads disagree in regions (such as alt. splicing)? Programs vary. |
| Nonstandard splice sites? | (+) in methods that don’t fully check splice sites; (–) in methods that do restrict to standard splice sites. |
| Alignment degeneracy | (+/–). Alignment of ESTs to genomic is frequently degenerate around splice sites. |

| Biological interpretation factors | Comments |
| --- | --- |
| Spliceosome errors? | Is splicing perfect? That is, does it only make correct forms? |
| What is truly functional? | Just because a splice form is real (i.e. present in the cell) doesn’t mean it’s biologically functional. Conversely, even an mRNA isoform that makes a truncated, inactive protein might be a biologically valid form of functional regulation. |
| Defining the coding region | Predicting ORF in newly discovered genes; splicing may change ORF. |
| Predicting impact in protein | Motif, signal, domain prediction, and functional effects. |
| Predicting impact in UTR | Knowledge about effects of mRNA stability, localization, and other possible UTR sequences is incomplete. |
| Assessing and correcting for bias | Our genome-wide view of function is under construction. Until then, we have unknown selection bias. |

We have listed items in each category loosely in descending order of importance.
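The screen for pairs of mutually exclusive splices (listed in Table 3 under genomic contamination) can be illustrated with a short Python sketch. Each splice is represented by the genomic interval of the intron it removes; two distinct splices count as mutually exclusive when those intervals overlap, and the stricter variant additionally requires that they share one splice site. The function names, the `require_shared_site` flag and the coordinates are hypothetical and serve only as an example.

```python
from itertools import combinations
from typing import List, Tuple

Splice = Tuple[int, int]  # (intron start, intron end) in genomic coordinates, half-open


def overlaps(a: Splice, b: Splice) -> bool:
    """True if the two intron intervals overlap in the genomic sequence."""
    return a[0] < b[1] and b[0] < a[1]


def mutually_exclusive_pairs(splices: List[Splice],
                             require_shared_site: bool = False
                             ) -> List[Tuple[Splice, Splice]]:
    """Report pairs of distinct, overlapping splices.

    With require_shared_site=True, a pair is kept only if the two splices
    share one splice site (start or end) and differ at the other.
    """
    pairs = []
    for a, b in combinations(sorted(set(splices)), 2):
        if not overlaps(a, b):
            continue
        shares_one_site = (a[0] == b[0]) != (a[1] == b[1])
        if require_shared_site and not shares_one_site:
            continue
        pairs.append((a, b))
    return pairs


# Invented example of exon skipping: exon 2 lies between positions 200 and 250.
# ESTs including exon 2 give splices (100, 200) and (250, 400);
# an EST skipping it gives the single splice (100, 400).
observed = [(100, 200), (250, 400), (100, 400), (100, 200)]
print(mutually_exclusive_pairs(observed, require_shared_site=True))
```

An unspliced EST contributes no splice to the input at all, which is why an approach of this kind does not report intron inclusions.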


Fig. 3 Experimental analysis of alternative splicing. a, Alternative splicing can be verified by RT–PCR using primers that flank the alternatively spliced region. The relative abundance of different isoforms in various tissue samples can be assessed from the gel. b, High-throughput identification of alternative splicing can be carried out using microarrays. The microarray probes would consist of exon–exon junction sequences, as different alternative splice forms will have different exon–exon junctions. By analyzing the tissue distribution of various splice forms, clues regarding the regulation of alternative splicing can be obtained.

[Figure 3 panel labels. a, PCR-based detection: PCR primers in exons 1 and 3; a 250-nt product containing exons 1, 2 and 3 and a 150-nt product containing exons 1 and 3; gel lanes for brain, lung and kidney. b, microarray-based detection: probes (60 nt) A, B and C on the exon–exon junctions of exons 1, 2 and 3; microarray hybridized with tissue samples from brain, lung and kidney.]
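The probe design suggested by Fig. 3b can be sketched as follows: for every pair of exons that are adjacent in some isoform, build a probe from the end of the upstream exon and the start of the downstream exon. This is a hedged Python illustration, not a published design pipeline. The function names, the exon sequences and the choice to centre the 60-nt probe on the junction are assumptions, and strand and complementarity of the printed probe are ignored.

```python
from typing import Dict, List, Tuple


def junction_probe(upstream: str, downstream: str, length: int = 60) -> str:
    """Return a probe of the given length centred on an exon-exon junction."""
    left = length // 2
    right = length - left
    if len(upstream) < left or len(downstream) < right:
        raise ValueError("exon too short for a probe centred on the junction")
    return upstream[len(upstream) - left:] + downstream[:right]


def probes_for_isoforms(exons: Dict[str, str],
                        isoforms: Dict[str, List[str]],
                        length: int = 60) -> Dict[Tuple[str, str], str]:
    """Build one probe per distinct exon-exon junction seen in any isoform."""
    probes: Dict[Tuple[str, str], str] = {}
    for exon_order in isoforms.values():
        for up, down in zip(exon_order, exon_order[1:]):
            probes[(up, down)] = junction_probe(exons[up], exons[down], length)
    return probes


# Invented 40-nt exons; real designs would use the annotated exon sequences.
exons = {"1": "A" * 40, "2": "C" * 40, "3": "G" * 40}
isoforms = {"inclusion": ["1", "2", "3"], "skipping": ["1", "3"]}
for junction, probe in probes_for_isoforms(exons, isoforms).items():
    print(junction, probe)
```

In this toy case the junction between exons 1 and 3 exists only in the skipping isoform, so hybridization to that probe relative to the other two would be the signal of interest.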

each other persuasively, because they differ not only in the sets of genes sampled (ranging from well-characterized mRNAs, to specific chromosomes, to a whole-genome study), but also in their specific criteria for reporting an alternative splice. It is important, however, to emphasize that there has only been one study so far verifying alternative splices detected by bioinformatics. Twenty genes with putative alternative splices were amplified from a multiple tissue cDNA panel by RT–PCR, with primers flanking the alternative splice (Fig. 3a). Sixteen were confirmed to be alternatively spliced, although thirteen of them were already recognized in the literature 16.

Future challenges

High-throughput validation. Large-scale experimental verification of alternative splicing will be needed to assess the accuracy of the bioinformatics-based analyses. One promising technology is inkjet printing of long probes (up to 60 nt) to make rapidly customizable microarrays. Shoemaker et al. 24 used this technology to monitor the coordinate expression of 8,183 exons annotated on chromosome 22q. This technology could easily be adapted to detect alternative splicing, by designing probes that span specific exon–exon junctions. As alternative splicing of a given gene creates different exon–exon junctions, it can be detected by measuring hybridization of mRNA samples from different tissues to these probes (Fig. 3b). Whereas the hybridization ratios of most exon–exon junction probes for a given gene will be constant, alternative splicing will cause some junctions to be up- or down-regulated in different tissues. Rapid printing of such ‘splicing chips’ will enable cataloging of splice forms for all genes, in different tissues, developmental states and conditions. Combined with the human genome sequence, these data can in turn be used to identify cis elements that regulate these forms.

Recently, the Affymetrix microarray design has also been used to identify potential alternative splices within the rat genome. The Affymetrix array uses 20 probe pairs (25 nt) representing different exons of a gene. Whereas the intensities of most probes for a gene varied together in different tissues, probes for certain exons were anomalously depressed in some tissues, indicating potential alternative splices 25. Other methodologies that use microarray technology to assess alternative splicing have also been developed (X.-D. Fu and M. Ares Jr, personal communication).

Rigorous measures of evidence. It should be emphasized that microarray approaches will not settle the question of identifying alternative splices independent of bioinformatics analysis. If anything, these data are likely to increase the need for bioinformatics, to measure rigorously the strength of the evidence for alternative splices in all the raw experimental data (ESTs, microarrays, and so on). For example, the original inkjet microarray paper treated differences in probe hybridization

Fig. 4 Cooperative roles for bioinformatics and experimentation in an alternative splicing annotation project (ASAP). Individual researchers interested in specific genes can find computationally derived alternative splices in the database, allowing them to characterize and report results back into the database. High-throughput experiments can be designed using the database, and after computational analysis of the experimental data, the database can be updated with new results.

[Figure 4 labels. Bioinformatics feeds the ASAP community annotation DB (gene structures, alternative mRNA isoforms, tissue and disease specificity, predicted functional impact, suggested experimental tests). Validation probe design leads to high-throughput experimental data production (EST and mRNA sequencing, microarrays, mass spectrometry). Isoforms of interest lead to individual experiments (mRNA and protein isoform detection, functional characterization).]


among exons in a gene as indicators that low-expressed probes were not real exons but simply gene prediction errors. By con- trast, the Affymetrix study treated such differences as evidence of alternative splicing. The assessment of both competing interpre- tations is a bioinformatics analysis problem. This will require moving beyond simple ‘rules’ for filtering out potentially mis- leading data to probabilistic measurement of the relative strength of the evidence for the competing interpretations. Cataloguing alternative splice forms. Although the new bioinformatics results are based on data from the whole genome, it is important to understand they are highly incomplete. They detect many new splice forms but miss many known isoforms. This is a result of both the incompleteness and fragmentation of the EST and genomic sequence data, as well as many causes of false negatives in the bioinformatics methods (Table 3). In Mod- rek et al. 18 , at least 50% of the EST data (and their potential alter- native splices) were excluded by these problems. These studies are just the beginning of an accelerating process of mRNA isoform discovery. The EST sequence data are growing rapidly, the draft genome sequence is being completed and new streams of high-throughput data (such as splice-detection microarrays) are beginning. Thus, a worthwhile goal is simply to build a catalog of alternative splice forms, just as the human genome sequence is being used to build a catalog of the genes. The development of new high-throughput technologies for detecting the protein products of alternative splicing will be needed to streamline this process. What is truly functional? Although bioinformatics and high- throughput experiments can have a key role in building a catalog, in our view this can only succeed as a community annotation process involving all molecular biology researchers. 
For example, how can one prove that a particular splice form is actually carry- ing out an important biological function? Even with strong evi- dence that a form is real (that it was actually made by the spliceosome in a living cell), it does not seem safe to assume that it has a biological function. If the spliceosome had a 0.1% rate of mis-splicing, it could produce over 4,000 meaningless ‘alterna- tively spliced’ ESTs among the approximately 4 million ESTs. Bioinformatics can partly address this by discerning that a large subset of alternative splice forms (47%) are observed in multiple ESTs (often from different libraries) and thus are unlikely to be low-frequency error products 18 . At the same time, it is also not safe to dismiss a given form as ‘functionless’ simply because it has no obvious function. For example, even an alternative splice form that causes early translational termination (and an inactive protein product) can act as an important form of regulation of biological activity 13 . Only detailed functional studies can resolve these questions. Bioinformatics can infer likely functional impacts, however, by detecting the addition or removal of known domains, and can predict how experimenters could verify the presence of these forms and their likely disease or tissue specificity. Biologists interested in some of these putative forms could then use a vari- ety of techniques (PCR, northern and western blots) to test these predictions. This process will be best served by a central reposi- tory for both the bioinformatics predictions and subsequent experimental verification and functional studies, which would act as a community annotation database (Fig. 4). We hope this process can evolve rapidly into an active partnership between prediction and experiment. Alternative splicing regulation. One intriguing new area is the study of alternative splicing regulation. 
Regulation of splicing could be involved in 15% of genetic diseases 26 and may contribute to cancer by mis-splicing of exon 18 in BRCA1, which is caused by a polymorphism in an exonic enhancer 27. If alternative splicing is as widespread as bioinformatics studies indicate, how different splice forms are turned on and off may become a major research area, like transcriptional regulation. So far, molecular biology has identified some cis regulatory elements (such as exonic splicing enhancers) and trans factors (SR proteins, PTB, and so on) 11,13. Bioinformatics could make important contributions, for example, in the identification of cis regulatory elements 28–31. Recently, Brudno et al. 31 analyzed intronic sequence upstream and downstream of 25 alternatively spliced brain-specific exons. They detected the motif UGCAUG at a much higher frequency downstream of alternatively spliced exons (relative to constitutive exons), for both brain-specific and muscle-specific alternative splicing 31. This motif had previously been implicated in the alternative splicing of several genes including c-src, fibronectin, calcitonin/CGRP, and nonmuscle myosin II heavy chain-B 32–35, so this result is very suggestive. It bodes well for genome-wide studies that combine the flood of new alternative splicing data with complete genome sequences for multiple organisms.
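
The kind of comparison described for the UGCAUG motif can be illustrated with a small sketch. It is not the procedure of Brudno et al.: it simply counts the hexamer in downstream intronic sequence for two sets of exons and compares occurrences per kilobase. The sequences, the function names (`count_motif`, `motif_rate`, `enrichment`) and the use of a plain rate ratio are all assumptions made for illustration, and a real analysis would also need a statistical test and a much larger exon set.

```python
MOTIF = "UGCAUG"  # hexamer discussed in the text


def count_motif(seq, motif=MOTIF):
    """Count occurrences of motif in seq, allowing overlaps.

    DNA input is accepted and converted to the RNA alphabet (T -> U).
    """
    seq = seq.upper().replace("T", "U")
    k = len(motif)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == motif)


def motif_rate(seqs, motif=MOTIF):
    """Motif occurrences per 1,000 nt, pooled over a set of sequences."""
    total_len = sum(len(s) for s in seqs)
    if total_len == 0:
        return 0.0
    hits = sum(count_motif(s, motif) for s in seqs)
    return 1000.0 * hits / total_len


def enrichment(alt_seqs, const_seqs, motif=MOTIF):
    """Ratio of motif rates (alternative / constitutive).

    Returns None when the motif never occurs in the constitutive set,
    because the ratio is undefined in that case.
    """
    base = motif_rate(const_seqs, motif)
    if base == 0:
        return None
    return motif_rate(alt_seqs, motif) / base


# Hypothetical toy sequences (invented, not real introns): intronic
# sequence immediately downstream of each exon.
alt_downstream = ["GUAAGUUGCAUGCCUGCAUGAA", "GUGAGUAUGCAUGUUUCC"]
const_downstream = ["GUAAGUCCUUCCAGGAUCCUUA", "GUGAGUUGCAUGUUCCAA"]

if __name__ == "__main__":
    print("alternative rate per kb: ", motif_rate(alt_downstream))
    print("constitutive rate per kb:", motif_rate(const_downstream))
    print("enrichment ratio:        ", enrichment(alt_downstream, const_downstream))
```

Pooling counts over each exon set, rather than averaging per-exon rates, keeps short introns from dominating the estimate. Either choice is defensible in a sketch of this size.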

Acknowledgments

We are grateful to D. Black, S. Galbraith and K. Ke for their critical comments and suggestions. C.L. was supported by a grant from the Department of Energy. B.M. was supported by a National Science Foundation Integrative Graduate Education and Research Training award.

Received 16 August; accepted 20 November 2001.

1. Pennisi, E. Human genome project: and the gene number is...? Science 288, 1146–1147 (2000).

2. Adams, M.D. et al. The genome sequence of Drosophila melanogaster. Science 287, 2185–2195 (2000).

3. The C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282, 2012–2018 (1998).

4. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).

5. Venter, J.C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).

6. Sambrook, J. Adenovirus amazes at Cold Spring Harbor. Nature 268, 101–104 (1977).

7. Gilbert, W. Why genes in pieces? Nature 271, 501 (1978).

8. Early, P. et al. Two mRNAs can be produced from a single immunoglobulin μ gene by alternative RNA processing pathways. Cell 20, 313–319 (1980).

9. Rosenfeld, M.G. et al. Calcitonin mRNA polymorphism: peptide switching associated with alternative RNA splicing events. Proc. Natl Acad. Sci. USA 79, 1717–1721 (1982).

10. Sharp, P.A. Split genes and RNA splicing. Cell 77, 805–815 (1994).

11. Lopez, A.J. Alternative splicing of pre-mRNA: developmental consequences and mechanisms of regulation. Annu. Rev. Genet. 32, 279–305 (1998).

12. Boise, L.H. et al. bcl-x, a bcl-2-related gene that functions as a dominant regulator of apoptotic cell death. Cell 74, 597–608 (1993).

13. Smith, C.W.J. & Valcarcel, J. Alternative pre-mRNA splicing: the logic of combinatorial control. Trends Biochem. Sci. 25, 381–388 (2000).

14. Mironov, A.A., Fickett, J.W. & Gelfand, M.S. Frequent alternative splicing of human genes. Genome Res. 9, 1288–1293 (1999).

15. Croft, L. et al. ISIS, the intron information system, reveals the high frequency of alternative splicing in the human genome. Nature Genet. 24, 340–341 (2000).

16. Brett, D. et al. EST comparison indicates 38% of human mRNAs contain possible alternative splice forms. FEBS Lett. 474, 83–86 (2000).

17. Kan, Z., Rouchka, E.C., Gish, W.R. & States, D.J. Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Res. 11, 889–900 (2001).

18. Modrek, B., Resch, A., Grasso, C. & Lee, C. Genome-wide analysis of alternative splicing using human expressed sequence data. Nucleic Acids Res. 29, 2850–2859 (2001).

19. Burset, M., Seledtsov, I.A. & Solovyev, V.V. Analysis of canonical and non-canonical splice sites in mammalian genomes. Nucleic Acids Res. 28, 4364–4375 (2000).

20. Beaudoing, E., Freier, S., Wyatt, J.R., Claverie, J. & Gautheret, D. Patterns of variant polyadenylation signal usage in human genes. Genome Res. 10, 1001–1010 (2000).

21. Graveley, B.R. Alternative splicing: increasing diversity in the proteomic world. Trends Genet. 17, 100–107 (2001).

22. Wheeler, D.L. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 28, 10–14 (2000).

23. Burke, J., Wang, H., Hide, W. & Davison, D.B. Alternative gene form discovery and candidate gene selection from gene indexing projects. Genome Res. 8, 276–290 (1998).

24. Shoemaker, D.D. et al. Experimental annotation of the human genome using microarray technology. Nature 409, 922–927 (2001).

25. Hu, G.K. et al. Predicting splice variant from DNA chip expression data. Genome Res. 11, 1237–1245 (2001).

26. Krawczak, M., Reiss, J. & Cooper, D.N. The mutational spectrum of single base-pair substitutions in mRNA splice junctions of human genes: causes and consequences. Hum. Genet. 90, 41–54 (1992).

27. Liu, H.X., Cartegni, L., Zhang, M.Q. & Krainer, A.R. A mechanism for exon skipping caused by nonsense or missense mutations in BRCA1 and other genes. Nature Genet. 27, 55–58 (2001).

28. Stamm, S., Zhang, M.Q., Marr, T.G. & Helfman, D.M. A sequence compilation and comparison of exons that are alternatively spliced in neurons. Nucleic Acids Res. 22, 1515–1526 (1994).

29. Kent, W.J. & Zahler, A.M. Conservation, regulation, synteny, and introns in a large-scale C. briggsae–C. elegans genomic alignment. Genome Res. 10, 1115–1125 (2000).

30. Stamm, S. et al. An alternative-exon database and its statistical analysis. DNA Cell Biol. 19, 739–756 (2000).

31. Brudno, M. et al. Computational analysis of candidate intron regulatory elements for tissue-specific alternative pre-mRNA splicing. Nucleic Acids Res. 29, 2338–2348 (2001).

32. Modafferi, E.F. & Black, D.L. A complex intronic splicing enhancer from the c-src pre-mRNA activates inclusion of a heterologous exon. Mol. Cell. Biol. 17, 6537–6545 (1997).

33. Huh, G.S. & Hynes, R.O. Regulation of alternative pre-mRNA splicing by a novel repeated hexanucleotide element. Genes Dev. 8, 1561–1574 (1994).

34. Hedjran, F., Yeakley, J.M., Huh, G.S., Hynes, R.O. & Rosenfeld, M.G. Control of alternative pre-mRNA splicing by distributed pentameric repeats. Proc. Natl Acad. Sci. USA 94, 12343–12347 (1997).

35. Kawamoto, S. Neuron-specific alternative splicing of nonmuscle myosin II heavy chain-B pre-mRNA requires a cis-acting intron sequence. J. Biol. Chem. 271, 17613–17616 (1996).

36. Dralyuk, I., Brudno, M., Gelfand, M.S., Zorn, M. & Dubchak, I. ASDB: database of alternatively spliced genes. Nucleic Acids Res. 28, 296–297 (2000).

37. Ji, H. et al. AsMamDB: an alternative splice database of mammals. Nucleic Acids Res. 29, 260–263 (2001).

38. Spingola, M., Grate, L., Haussler, D. & Ares, M.J. Genome-wide bioinformatic and molecular analysis of introns in Saccharomyces cerevisiae. RNA 5, 221–234 (1999).

39. Kent, W.J. & Zahler, A.M. The intronerator: exploring introns and alternative splicing in Caenorhabditis elegans. Nucleic Acids Res. 28, 91–93 (2000).
