

Gene xxx (2008) xxx-xxx

Contents lists available at ScienceDirect

journal homepage: www.elsevier.com/locate/gene


What is a gene? An updated operational definition

Graziano Pesole ⁎
Dipartimento di Biochimica e Biologia Molecolare “E. Quagliariello”, Università di Bari, Via Orabona 4, 70126 Bari, Italy
Istituto Tecnologie Biomediche, Consiglio Nazionale delle Ricerche, via Amendola 122/D, 70125 Bari, Italy


Article history:
Received 17 September 2007
Received in revised form 28 February 2008
Accepted 6 March 2008
Available online xxxx

Keywords:
Genomics
Bioinformatics
Alternative splicing
Alternative transcription start sites
Alternative transcription termination

Abstract

A crucial pre-requisite for large-scale annotation of eukaryotic genomes is the definition of what constitutes a gene. This issue is addressed here in the light of novel and surprising gene features that have recently emerged from large-scale genomic and transcriptomic analyses. The updated operational definition proposed here is: “a gene is a discrete genomic region whose transcription is regulated by one or more promoters and distal regulatory elements and which contains the information for the synthesis of functional proteins or non-coding RNAs, related by the sharing of a portion of genetic information at the level of the ultimate products (proteins or RNAs)”. This definition is specifically designed for eukaryotic chromosomal genes and emphasizes the commonality of the genetic material that gives rise to final, functional products (ncRNAs or proteins) derived from a single gene. It may be useful in several applications and should help in the provision of a comprehensive inventory of the genes of a given organism, finally allowing answers to the basic question of “how many genes” are encoded in its genome.

© 2008 Elsevier B.V. All rights reserved.


1. Problematic issues in eukaryotic chromosomal gene definition
2. An updated operational gene definition
Acknowledgments
References

Abbreviations: AS, alternative splicing; miRNA, micro RNA; ncRNA, non-coding RNA; ORF, open reading frame; TSS, transcription start site; TTS, transcription termination site; TU, transcriptional unit; UTR, untranslated region.
⁎ Dipartimento di Biochimica e Biologia Molecolare, University of Bari, Via Orabona, 4, 70125 Bari, Italy. Tel.: +39 080 5443588; fax: +39 080 5443317.
Please cite this article as: Pesole, G., What is a gene? An updated operational definition, Gene (2008), doi:10.1016/j.gene.2008.03.010

1. Problematic issues in eukaryotic chromosomal gene definition

A major goal of a genome sequencing project for a specific organism is the definition of its entire gene complement. To accomplish this important task, several fairly accurate gene prediction tools are generally used together with large-scale production of expression evidence (e.g. cDNA and EST sequences).

In this review I will deal with the problem of defining what a gene is, a crucial pre-requisite for large-scale annotation of eukaryotic genomes. Indeed, gene assessment in prokaryotic genomes is much simpler owing to their higher gene density (about 80% of a prokaryotic genome is protein coding) and the lack of introns. The identification of significantly large open reading frames (ORFs) is an obvious solution for the identification of the majority of protein-coding prokaryotic genes. Short prokaryotic genes are more problematic but can generally be identified with suitable bioinformatics approaches validated by transcription and translation evidence.

To date, 77 eukaryotic genome projects have been completed (Liolios et al., 2006) but for none of them are we able to answer the simple question of how many genes they contain. This is mostly due to the presence of some gaps in the genome sequences and to the incompleteness of gene annotation. However, even if all gaps were closed and a full gene annotation were available and validated by comprehensive transcriptional evidence, we could still be unable to provide a reliable estimate of the gene number, in part because of the lack of a clear and unambiguous definition of what a gene is.

Several definitions have been proposed, such as this one from one of the most widely used Molecular Biology textbooks: “A gene is the segment of DNA specifying a polypeptide chain; it includes regions preceding and following the coding region (leader and trailer) as well as intervening sequences (introns) between individual coding
segments (exons).” (Lewin, 2007). This exemplar definition, apart from its ambiguous use of the term “exon”, is barely satisfactory as it does not consider some problematic gene features recently highlighted by work carried out at the RIKEN Institute on the transcriptional landscape of the mouse genome (Carninci et al., 2005) and most recently by the international Encyclopedia of DNA Elements (ENCODE) project (Gerstein et al., 2007) that strongly challenge the conventional view of genes. Indeed, the classical “one gene, one protein” definition is no longer acceptable and is also impractical (Pearson, 2006).

In particular:

1) A large fraction of genes do not encode proteins. Indeed, over 50% of the transcriptional units (TUs) identified in mouse do not appear to be coding and the majority of them are alternatively spliced and polyadenylated.
2) The same gene locus may encode a large variety of transcripts and proteins through alternative transcription start sites (TSS), alternative transcription termination sites (TTS) and alternative splicing (AS). In some cases AS may generate mRNAs encoding completely unrelated proteins using different coding frames.
3) Some genes have been found to overlap each other on the same or opposite strands. The discontinuous structure of eukaryotic genes potentially allows “Russian doll” gene models, where one gene can be completely contained inside one or more introns of another gene without sharing any exonic regions.
4) The ligation of two distinct mRNA molecules encoded by separate gene loci through the trans-splicing mechanism is another phenomenon widespread in some eukaryote lineages such as nematodes and ascidians (Hastings, 2005), which may further increase the complexity of the gene expression pattern.
5) Finally, recent computational and experimental analyses point to the existence of chimerical transcripts produced by the co-transcription of tandem gene pairs, and potentially encoding fusion proteins (Parra et al., 2006).

2. An updated operational gene definition

In the light of the above features one might ask if it is still appropriate to maintain a gene-centric view of molecular biology, or whether it is better to just consider functional products (proteins and ncRNAs) that may be in some way related by the molecular processes involved in their expression, such as the sharing of a promoter (or TSS), a transcription termination site (TTS) or one or more splicing sites. Indeed, to understand the relationships between the different cellular components in a systems biology framework, it may be more appropriate to consider functional products rather than genes, in the light of their specific expression in different conditions (i.e. tissue, developmental stage or pathological status).

However, I believe that despite the many problems that have emerged in these last years it would be premature to announce the death of the gene concept, mostly because the tight connection between a functional product and its encoding genetic material cannot be disregarded. Nevertheless, an updated operational definition is needed to allow the unambiguous association between transcripts, proteins, and their encoding genes.

In agreement with Gerstein et al. (2007) this updated definition should adopt a bottom-up criterion, i.e. emphasize the ultimate
Fig. 1. The discrete genomic region depicted here encodes one non-coding and eight protein-coding spliced transcripts (ncRNA in yellow; 5′UTR and 3′UTR in light and dark pink, respectively; protein-coding sequence in green; dotted lines represent RNA removed or spliced out by maturation). Four different genes (numbered 1–4) can be annotated according to the gene definition proposed here. A specific set of transcripts can be clustered and assigned to the same gene if the transcript projections on the genome sequence, limited to the regions encoding the final products (e.g. the green and the yellow boxes for the protein-coding and non-coding RNA genes, respectively), overlap each other. The clustering procedure is iterated and may include in the same gene cluster non-overlapping transcripts. For example, in the case of gene 3, the transcript isoforms encoding products DE and FE are clustered because they overlap through the region E, then the transcript FG is added to this cluster because the region F overlaps with one member of the cluster. The transcript encoding the product AE can be identified as a chimerical transcript originating from the concatenation of two exons belonging to two different genes, as these two exons are prevalently expressed by two unrelated genes (i.e. genes 2 and 3). The gene coordinates, denoted by the arrowed lines, are the leftmost and rightmost mapping positions on the genome of all transcripts belonging to the same gene cluster. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
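The clustering rule described in this legend can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure used to draw Fig. 1: the names `Transcript`, `related` and `cluster_genes`, the half-open coordinate convention and every coordinate in `FIG1` are hypothetical and introduced only for the example. Chimerical transcripts such as AE are passed in by the caller, because recognising them relies on expression evidence that the sketch does not model.

```python
from dataclasses import dataclass

Interval = tuple[int, int]  # half-open [start, end) genomic coordinates (assumed convention)


@dataclass(frozen=True)
class Transcript:
    name: str
    span: Interval                 # precursor transcript extent, UTRs included
    product: tuple[Interval, ...]  # regions encoding the final product only


def overlaps(a: Interval, b: Interval) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def related(t: Transcript, u: Transcript) -> bool:
    """True if the product-encoding regions of t and u share genomic sequence."""
    return any(overlaps(x, y) for x in t.product for y in u.product)


def cluster_genes(transcripts, chimeras=frozenset()):
    """Group transcripts into genes by transitive overlap of product regions.

    Transcripts named in `chimeras` are left out of the clustering.
    Returns a sorted list of (start, end, member_names) tuples, where start
    and end are the leftmost and rightmost positions of the member spans.
    """
    kept = [t for t in transcripts if t.name not in chimeras]
    parent = list(range(len(kept)))  # union-find forest over kept transcripts

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(kept)):
        for j in range(i + 1, len(kept)):
            if related(kept[i], kept[j]):
                parent[find(i)] = find(j)

    clusters = {}
    for i, t in enumerate(kept):
        clusters.setdefault(find(i), []).append(t)

    return sorted(
        (min(t.span[0] for t in m), max(t.span[1] for t in m), sorted(t.name for t in m))
        for m in clusters.values()
    )


# Hypothetical coordinates loosely mimicking the layout of Fig. 1.
A, B, C = (1000, 1100), (1200, 1300), (1400, 1500)
D, F, E, G = (3000, 3100), (3200, 3300), (3400, 3500), (3600, 3700)
FIG1 = [
    Transcript("X", (100, 400), ((200, 280),)),     # ncRNA
    Transcript("ABC", (900, 2000), (A, B, C)),
    Transcript("AC", (900, 2000), (A, C)),
    Transcript("H", (900, 2600), ((2300, 2500),)),  # shares the TSS of ABC and AC
    Transcript("DE", (2900, 3800), (D, E)),
    Transcript("FE", (3150, 3800), (F, E)),
    Transcript("FG", (3150, 4000), (F, G)),
    Transcript("AE", (900, 3800), (A, E)),          # tandem chimera
]

for start, end, names in cluster_genes(FIG1, chimeras={"AE"}):
    print(start, end, names)
```

With AE flagged, the toy data give four genes (X; ABC and AC; H; DE, FE and FG), and DE and FG end up in the same gene only through FE. Leaving AE unflagged merges the second and third of these into a single cluster, which illustrates why the chimerical transcript has to be set aside before clustering.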


functional gene products, either ncRNAs (e.g. miRNA) or proteins, and consider the regulatory regions involved in their expression at both the transcriptional (e.g. promoter, enhancer) and post-transcriptional (i.e. 5′UTR and 3′UTR) level as “gene-related”. Thus, the proposed operational definition can be summarized as: “a gene is a discrete genomic region whose transcription is regulated by one or more promoters and distal regulatory elements and which contains the information for the synthesis of functional proteins or non-coding RNAs, related by the sharing of a portion of genetic information at the level of the ultimate products (proteins or RNAs)”. This definition does not include cis-regulatory regions, as sequence elements controlling the expression of a gene are not necessarily located upstream of it and may be dispersed throughout the genome (Gerstein et al., 2007), making the accurate definition of their boundaries unfeasible. In addition, some of the transcriptional regulatory elements may themselves be transcribed (Zhu et al., 2007).

An example to illustrate the application of this definition is shown in Fig. 1, where a genomic region encoding nine different transcripts, which give rise to one ncRNA and seven functional proteins, is described. According to the above definition: i) ABC, AC and ii) DE, FE, FG, form two clusters of related proteins, generated by alternatively spliced products of genes 2 and 3. I would suggest that two (or more) proteins are related (i.e. belong to the same gene cluster) if their encoding genome sequences overlap each other. Indeed, products with overlapping encoding genome sequences, like DE and FE, have a strict genetic relationship, as a mutation in the shared genomic region (i.e. the E region) would affect both products. It should be noted that the relationships between two products can be indirect, as DE and FG are related through FE (see also the legend of Fig. 1).

Related proteins may also have completely different sequences, as in the case of DE and FG, or when the expressed products use different reading frames.

According to the gene definition proposed here the transcript encoding the product H should be assigned to a different gene (4), even if it shares the same TSS with transcripts encoding ABC and AC, given that H and ABC (or AC) are completely unrelated proteins, i.e. encoded by non-overlapping genomic regions. This is in line with the recent observation that different genes may share distal 5′UTRs, possibly providing a specific expression pattern (Denoeud et al., 2007). Furthermore, the existence of trans-splicing, where exons from two separate transcripts are spliced together to form a mature mRNA molecule, has been shown in some eukaryotes (Hastings, 2005).

In the genomic region drawn in Fig. 1, we are also able to identify an additional gene (1) encoding an ncRNA giving rise to the mature product X. This situation accounts for miRNA genes, often expressed as polycistronic primary-miRNA and located in the introns of coding or non-coding RNAs (Kim and Nam, 2006).

Finally, AE can be identified as a fusion protein originating from the co-transcription of two tandem genes (2 and 3, expressing non-overlapping mature transcripts) through the formation of a chimerical transcript, on the basis that the prevalent expression forms of the genes which provide exons to this product form two unrelated transcript
Fig. 2. (A) Seven alternative mRNAs expressed by the human CDKN2A gene as determined by the ASPIC program (Castrignano et al., 2006) (RefSeq IDs are shown on the right of known isoforms). (B) Alternative proteins encoded by the transcript isoforms shown in (A).


clusters, i.e. with the 3′ end of the transcripts of the first cluster lying upstream of the 5′ end of the transcripts of the second cluster, and encode unrelated and non-overlapping functional products.

Once the related mature products have been defined one can easily go back to the relevant precursor transcripts, and determine the gene coordinates on the genome as their leftmost and rightmost mapping positions (Fig. 1). In this way a single gene locus is defined to encode a set of “related” products and its genomic coordinates are established by the precursor transcripts.

The gene definition proposed here is different from the one proposed by Gerstein et al. (2007): “A gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products” in that in the current proposal: i) each gene is assigned a contiguous genomic region; ii) gene coordinates include the 5′ and 3′ mRNA untranslated (UTR) sequences included in the precursor transcript. Therefore, according to the proposed definition a genomic tract encoding a trans-spliced leader is not included in the genomic region assigned to a given gene, as we assume that a gene is “a contiguous genome region” and furthermore the trans-leader corresponds to an “untranslated” region of the transcript which does not contribute to the final product.

The definition provided in the current paper is not only simpler but also operationally more appropriate as it unambiguously defines the genomic region to be considered in the analysis of alternative splicing, which is usually carried out by aligning gene-related transcripts (typically a UniGene cluster) to the relevant genomic region where alternatively spliced 5′UTRs are frequently observed.

To deal with a real example, Fig. 2 shows the splicing pattern of the gene CDKN2A, as determined by the ASPIC program (Castrignano et al., 2006). It should be noted that the first and second transcripts (CDKN2A.Ref and CDKN2A.Tr2 in Fig. 2A) encode two completely different proteins, 116 and 173 aa long respectively (Fig. 2B), and the corresponding coding sequences use different reading frames. CDKN2A.Tr2, .Tr3 and .Tr4 encode the same product but differ in their 3′UTR. CDKN2A.Tr5, .Tr6 and .Tr7 encode different partially overlapping proteins of 105, 146 and 138 residues, respectively. Note that the products of CDKN2A.Ref and CDKN2A.Tr5 are indirectly related through the product of CDKN2A.Tr6.

This example highlights a possible problem that may arise with the proposed definition. Indeed, in most real gene predictions we know neither the location of the coding sequence, if any, nor the function of the encoded protein. In fact, in this case only CDKN2A.Ref, .Tr2 and .Tr6 correspond to known transcripts included in the RefSeq collection (Pruitt et al., 2007). A pragmatic solution to this problem is to annotate the longest possible open reading frame as a functional product (even in the absence of strong supporting data). In this way all inferred transcripts, CDKN2A.Tr1–.Tr7, will be assigned to the same gene locus.

It is now quite clear that an unequivocal and universal gene definition is not possible and therefore it has been proposed that the operational units of a genome could be better represented by the different expressed transcripts as they actually relate the genome sequence to function and phenotype (Gingeras, 2007). However, the gene concept, with suitable revision and update, still remains a key issue in Molecular Biology, underlying the centrality of the relationship between genotype and phenotype. An operational definition such as that proposed here may be extremely useful for the unambiguous classification of transcripts in discrete gene loci, such as those provided by the UniGene database (Wheeler et al., 2007), and may be more appropriate for computational analysis involving alignment of genome and transcript sequences. By way of contrast, the Gerstein et al. (2007) gene definition, which includes a discontinuous genome region with the exclusion of UTRs, cannot be used to delineate the genome region to be considered in bioinformatics analyses for the detection of novel splicing isoforms and of splicing events located in the non-coding portion of mRNAs.

The simple operational gene definition proposed here is not universal: it is specifically designed for chromosomal eukaryotic genes (e.g. genes of RNA viruses do not fit this definition). It does, however, allow unambiguous definition of gene coordinates and of gene-related transcripts. It may have a wide range of applicability and help in the provision of a comprehensive inventory of the genes of a given organism, finally allowing answers to the basic question of “how many genes” are encoded in its genome.

Acknowledgments

This work was supported by the “Italian Ministry of University and Research” (Fondo Italiano Ricerca di Base: “Laboratorio Internazionale di Bioinformatica”), Associazione Italiana Ricerca sul Cancro and Telethon. I thank David Horner (University of Milano) for stimulating discussions and critical reading of the manuscript.

References

Carninci, P., et al., 2005. The transcriptional landscape of the mammalian genome. Science 309, 1559–1563.
Castrignano, T., et al., 2006. ASPIC: a web resource for alternative splicing prediction and transcript isoforms characterization. Nucleic Acids Res. 34, W440–W443.
Denoeud, F., et al., 2007. Prominent use of distal 5′ transcription start sites and discovery of a large number of additional exons in ENCODE regions. Genome Res. 17, 746–759.
Gerstein, M.B., et al., 2007. What is a gene, post-ENCODE? History and updated definition. Genome Res. 17, 669–681.
Gingeras, T.R., 2007. Origin of phenotypes: genes and transcripts. Genome Res. 17, 682–690.
Hastings, K.E., 2005. SL trans-splicing: easy come or easy go? Trends Genet. 21, 240–247.
Kim, V.N., Nam, J.W., 2006. Genomics of microRNA. Trends Genet. 22, 165–173.
Lewin, B., 2007. Genes IX. Jones and Bartlett, Sudbury, Massachusetts.
Liolios, K., Tavernarakis, N., Hugenholtz, P., Kyrpides, N.C., 2006. The Genomes On Line Database (GOLD) v.2: a monitor of genome projects worldwide. Nucleic Acids Res. 34, D332–D334.
Parra, G., et al., 2006. Tandem chimerism as a means to increase protein complexity in the human genome. Genome Res. 16, 37–44.
Pearson, H., 2006. Genetics: what is a gene? Nature 441, 398–401.
Pruitt, K.D., Tatusova, T., Maglott, D.R., 2007. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 35, D61–D65.
Wheeler, D.L., et al., 2007. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 35, D5–D12.
Zhu, X., Ling, J., Zhang, L., Pi, W., Wu, M., Tuan, D., 2007. A facilitated tracking and transcription mechanism of long-range enhancer function. Nucleic Acids Res. 35, 5532–5544.
