Sie sind auf Seite 1von 30

The Complexity of Eukaryotic Genomes

-genomes of most eukaryotes are larger and more complex than those of prokaryotes - However, the genome size of many eukaryotes does not appear to be related to genetic complexity. - Much of the complexity of eukaryotic genomes thus results from the abundance of several different types of noncoding sequences, which constitute most of the DNA of higher eukaryotic cells.

Figure 4.1. Genome size The range of sizes of the genomes of representative groups of organisms are shown on a logarithmic scale.

Introns and Exons

-large amounts of noncoding DNA are also found within most eukaryotic genes. Such genes have a split structure in which segments of coding sequence (called exons) are separated by noncoding sequences (intervening sequences, or introns) -some of the noncoding DNA in eukaryotes is accounted for by long DNA sequences that lie between genes (spacer sequences)

Figure 4.2. The structure of eukaryotic genes The introns are then removed by splicing to form the mature mRNA.

-On average, introns are estimated to account for about ten times more DNA than exons in the genes of higher eukaryotes. (eg.human gene that encodes the blood clotting protein factor VIII. This gene spans approximately 186 kb of DNA and is divided into 26 exons. The mRNA is only about 9 kb long, so the gene contains introns totaling more than 175 kb). -introns are clearly not required for gene function in eukaryotic cells. -most introns have no known cellular function, although a few have been found to encode functional RNAs or proteins. -generally thought to represent remnants of sequences that were important earlier in evolution. -introns may have helped accelerate evolution by facilitating recombination between protein-coding regions (exons) of different genesa process known as exon shuffling. (recombination between introns of different genes would result in new genes containing novel combinations of protein-coding sequences)

Table 9-1. Classification of Eukaryotic DNA

Protein-coding genes Solitary genes Duplicated and diverged genes (functional gene families and nonfunctional pseudogenes) Tandemly repeated genes encoding rRNA, 5S rRNA, tRNA, and histones Repetitious DNA Simple-sequence DNA Moderately repeated DNA (mobile DNA elements) Transposons Viral retrotransposons Long interspersed elements (LINES; nonviral retrotransposons) Short interspersed elements (SINES; nonviral retrotransposons) Unclassified spacer DNA

Gene Families and Pseudogenes -another factor contributing to the large size of eukaryotic genomes is that some genes are repeated many times. -many eukaryotic genes are present in multiple copies, called gene families while most prokaryotic genes are represented only once in the genome

Figure 4.5. Globin gene families Members of the human - and globin gene families are clustered on chromosomes 16 and 11, respectively. Each family contains genes that are specifically expressed in embryonic, fetal, and adult tissues, in addition to nonfunctional gene copies (pseudogenes).

Repetitive DNA Sequences -3 classes based on the number of times nucleotide sequence is repeated within the population of DNA fragments a. Highly Repeated DNA Sequences - sequences present in at least 105 copies per genome (1-10% of total DNA) - typically short and present in clusters in which the given sequence repeats itself over and over again w/out interruption - the sequence is said to be present in tandem (sequence arranged in end-to-end manner) 1. Satellite DNAs - short sequences about 5 to a few hundred base pairs in length(form very large clusters each containing up to several million base pairs of DNA) - base composition of the said DNA segments is different from the bulk of the DNA that fragments contg. the sequence that can be separated into a distinct satellite band during density gradient centrifugation - localization of satellite DNA within centromeres of chromosomes

-precise function of satellite DNA remains a mystery 2. Minisatellite DNAs -sequences range from about 12 to 100 base pairs in length and are found in clusters contg as many as 3000 repeats. -occupy considerably shorter stretches of the genome than the satellite sequences -tend to be unstable, and the number of copies of a particular sequence often increase or decrease from one generation to the next (the length of a particular minisatellite locus is highly variable in the population even among the same members of the same family -used to identify individuals in criminal or paternity cases thru the technique of DNA fingerprinting

Use of PCR in forensic science. Hypervariable microsatellite sequences (such as variable number of tandem repeats, or VNTR) are identified by PCR.

3. Microsatellite DNAs -shortest sequences (1 to 5 base pairs long) and are typically present in small clusters of about 10 to 40 base pairs in length -scattered quite evenly thru the DNA more than 100,000 different loci are present in the human genome
b. Moderately Repeated DNA Sequence - moderately repeated fraction of the genomes of plants and animals - vary from about 20 to more than 80% of the total DNA, depending on the organism - sequences that are repeated within the genome anywhere from a few times to tens of thousands of times. - sequences that code for known gene products, either RNAs or proteins, and those that lack a coding function 1. Repeated DNA Sequences with Coding Functions -includes genes that code for ribosomal RNA as well as those that code for histones (impt. chromosomal proteins) -the repeated sequences that code for each of these products are typically identical to one another and located in tandem array

2. Repeated DNAs that Lack Coding Functions -the bulk of the moderately repeated DNA fraction does not encode any type of product -repetitive DNA sequences are scattered throughout the genome rather than being clustered as tandem repeats (i.e. interspersed thruout the genome as individual elements). -can be grouped into two classes: a. SINEs (short intesrpersed elements) -major SINEs in mammalian genomes are Alusequences, so-called because they usually contain a single site for the restriction endonuclease AluI. -Alu sequences are approximately 300 base pairs long, and about a million such sequences are dispersed throughout the genome, accounting for nearly 10% of the total cellular DNA. -Although Alu sequences are transcribed into RNA, they do not encode proteins and their function is unknown. b. LINEs (long interspersed elements) -major human LINEs (which belong to the LINE 1, or L1, family) are about 6000 base pairs long and repeat approximately 50,000 times in the genome.

-L1 sequences are transcribed and at least some encode proteins, but like Alu sequences, they have no known function in cell physiology. -both Alu and L1 sequences are examples of transposable elements, which are capable of moving to different sites in genomic DNA. - some of these sequences may help regulate gene expression, but most Alu and L1 sequences appear not to make a useful contribution to the cell. -they may have played important evolutionary roles by contributing to the generation of genetic diversity.
3. Nonrepeated DNA Sequences - are present in a single copy in the genome which includes genes that exhibit Mendelian patterns of inheritance -always localize to a particular site on a particular chromosome -contains by far the greatest amount of genetic information -include DNA sequences that code for virtually all proteins other thanhistones -even if these sequenes are not present in multiple copies, genes that code for polypeptides are usually members of a family of related genes(this is true for globins, actins, myosins, collagens, tubulins, integrins and most other proteins in a eukaryotic cell.

The Number of Genes in Eukaryotic Cells

Table 4.1. The Numbers of Genes in Cellular Genomes


Genome size (Mb)a

Protein-coding sequence (percent)b

Number of genesb

E. coli S. cerevisiae C. elegans Drosophila

4.6 12 97 180

90 70 25 13

4,288 5,885 19,099 13,600




Mb = millions of base pairs. b Determined from the E. coli, S. cerevisiae, C. elegans, and Drosophila genomic sequences and estimated for the human genome.

*Nature 431, 931-945 (21 October 2004) | doi:10.1038/nature03001Finishing the euchromatic sequence of the human genome

Latest information: -2.85 billion nucleotides interrupted by only 341 gaps. -encode only 20,00025,000 protein-coding genes.

Chromosomes and Chromatin

Table 4.2. Chromosome Numbers of Eukaryotic Cells


Genome size (Mb)a

Chromosome numbera

Yeast (Saccharomyces cerevisiae) Slime mold (Dictyostelium) Arabidopsis thaliana Corn Onion Lily Nematode (Caenorhabditis elegans)
Fruit fly (Drosophila) Toad (Xenopus laevis) Lungfish Chicken Mouse Cow Dog Human

12 70 130 5,000 15,000 50,000 97

180 3,000 50,000 1,200 3,000 3,000 3,000 3,000

16 7 5 10 8 12 6
4 18 17 39 20 30 39 23

Both genome size and chromosome number are for haploid cells. Mb = millions of base pairs.

-genomes of prokaryotes are contained in single chromosomes, which are usually circular DNA molecules. -the genomes of eukaryotes are composed of multiple chromosomes, each containing a linear molecule of DNA. -DNA of eukaryotic cells is tightly bound to small basic proteins (histones) that package the DNA in an orderly way in the cell nucleus(total extended length of DNA in a human cell is nearly 2 m, but this DNA must fit into a nucleus with a diameter of only 5 to 10 m).

-complexes between eukaryotic DNA and proteins -major proteins of chromatin are the histones small proteins containing a high proportion of basic amino acids (arginine and lysine) that facilitate binding to the negatively charged DNA molecule. - five major types of histonescalled H1, H2A, H2B, H3, and H4which are very similar among different species of eukaryotes

Table 4.3. The Major Histone Proteins


Molecular weight

Number of amino acids

Percentage Lysine + Arginine

H1 H2A H2B H3 H4

22,500 13,960 13,774 15,273 11,236

244 129 125 135 102

30.8 20.2 22.4 22.9 24.5

Data are for rabbit (H1) and bovine histones.

-histones are extremely abundant proteins in eukaryotic cells; together, their mass is approximately equal to that of the cell's DNA.

-not found in eubacteria (e.g., E. coli), but Archaebacteria, do contain histones that package their DNAs in structures similar to eukaryotic chromatin. -the basic structural unit of chromatin, the nucleosome, consisting of DNA (146-bp segment of DNA )wrapped around a histone core.

Chromosomal DNA and Its Packaging in the Chromatin Fiber

Figure 4.8. The organization of chromatin in nucleosomes (A) The DNA is wrapped around histones in nucleosome core particles and sealed by histone H1. Nonhistone proteins bind to the linker DNA between nucleosome core particles. The linker DNA between the nucleosome core particles is preferentially sensitive, so limited digestion of chromatin yields fragments corresponding to multiples of 200 base pairs.

Solenoid structure

Figure 4.9. Structure of a chromatosome (A) The nucleosome core particle consists of 146 base pairs of DNA wrapped 1.65 turns around a histone octamer consisting of two molecules each of H2A, H2B, H3, and H4. A chromatosome contains two full turns of DNA (166 base pairs) locked in place by one molecule of H1.

Figure 4.10. Chromatin fibers The packaging of DNA into nucleosomes yields a chromatin fiber approximately 10 nm in diameter. The chromatin is further condensed by coiling into a 30-nm fiber, containing about six nucleosomes per turn.

Figure 4.11. Model for the packing of chromatin and the chromosome scaffold in metaphase chromosomes. In interphase chromosomes, long stretches of 30-nm chromatin loop out from extended scaffolds. In metaphase chromosomes, the scaffold is folded into a helix and further packed into a highly compacted structure, whose precise geometry has not been determined.

The chromatin in chromosomal regions that are not being transcribed exists predominantly in the condensed, 30-nm fiber form. The regions of chromatin actively being transcribed are thought to assume the extended beads-on-a-string form Euchromatin -Decondensed, transcriptionally active interphase chromatin. Most of the euchromatin in interphase nuclei appears to be in the form of 30-nm fibers, organized into large loops containing approximately 50 to 100 kb of DNA. About 10% of the euchromatin, containing the genes that are actively transcribed, is in a more decondensed state (the 10-nm conformation) that allows transcription Heterochromatin-Condensed, transcriptionally inactive chromatin. Heterochromatin is transcriptionally inactive and contains highly repeated DNA sequences, such as those present at centromeres and telomeres. As cells enter mitosis, their chromosomes become highly condensed so that they can be distributed to daughter cells. The loops of 30-nm chromatin fibers are thought to fold upon themselves further to form the compact metaphase chromosomes of mitotic cells, in which the DNA has been condensed nearly 10,000-fold (Figure 4.11). Such condensed chromatin can no longer be used as a template for RNA synthesis, so transcription ceases during mitosis. Electron micrographs indicate that the DNA in metaphase chromosomes is organized into large loops attached to a protein scaffold Metaphase chromosomes are so highly condensed that their morphology can be studied using the light microscope

However, euchromatin is interrupted by stretches of heterochromatin, in which 30-nm fibers are subjected to additional levels of packing that usually render it resistant to gene expression. Heterochromatin is commonly found around centromeres and near telomeres, but it is also present at other positions on chromosomes.

Figure 4-21. A comparison of extended interphase chromatin with the chromatin in a mitotic chromosome. (A) An electron micrograph showing an enormous tangle of chromatin spilling out of a lysed interphase nucleus. (B) A scanning electron micrograph of a mitotic chromosome: a condensed duplicated chromosome in which the two new chromosomes are still linked together . The constricted region indicates the position of the centromere. Note the difference in scales.

Each DNA Molecule That Forms a Linear Chromosome Must Contain a Centromere, Two Telomeres, and Replication Origins Figure 4-22. The three DNA
sequences required to produce a eucaryotic chromosome that can be replicated and then segregated at mitosis. Each chromosome has multiple origins of replication, one centromere, and two telomeres. Shown here is the sequence of events a typical chromosome follows during the cell cycle. The DNA replicates in interphase beginning at the origins of replication and proceeding bidirectionally from the origins across the chromosome. In M phase, the centromere attaches the duplicated chromosomes to the mitotic spindle so that one copy is distributed to each daughter cell during mitosis. The centromere also helps to hold the duplicated chromosomes together until they are ready to be moved apart. The telomeres form special caps at each chromosome end.

Figure 4.17. Assay of a centromere in yeast Both plasmids shown contain a selectable marker (LEU2) and DNA sequences that serve as origins of replication in yeast (ARS, which stands for autonomously replicating sequence). However, plasmid I lacks a centromere and is therefore frequently lost as a result of missegregation during mitosis. In contrast, the presence of a centromere (CEN) in plasmid II ensures its regular transmission to daughter cells.

replication origin- Location on a DNA molecule at which duplication of the DNA begins. Centromere- Constricted region of a mitotic chromosome that holds sister chromatids together. It is also the site on the DNA where the kinetochore forms that captures microtubules from the mitotic spindle. Telomere-End of a chromosome, associated with a characteristic DNA sequence that is replicated in a special way. Counteracts the tendency of the chromosome otherwise to shorten with each round of replication. (From Greek telos, end.) -perform another function: the repeated telomere DNA sequences, together with the regions adjoining them, form structures that protect the end of the chromosome from being recognized by the cell as a broken DNA molecule in need of repair.

Figure 4.19. Structure of a telomere Telomere DNA loops back on itself to form a circular structure that protects the ends of chromosomes

Figure 4-11. The banding patterns of human chromosomes. Chromosomes 122 are numbered in approximate order of size. A typical human somatic (non-germ line) cell contains two of each of these chromosomes, plus two sex chromosomestwo X chromosomes in a female, one X and one Y chromosome in a male. The chromosomes used to make these maps were stained at an early stage in mitosis, when the chromosomes are incompletely compacted. The horizontal green line represents the position of the centromere, which appears as a constriction on mitotic chromosomes; the knobs on chromosomes 13, 14, 15, 21, and 22 indicate the positions of genes that code for the large ribosomal RNAs (discussed in Chapter 6). These patterns are obtained by staining chromosomes with Giemsa stain, and they can be observed under the light microscope.

Karyotype- Full set of chromosomes of a cell arranged with respect to size, shape, and number.

The Sequences of Complete Genomes

Figure 1-30. The genome of E. coli. (A) A cluster of E. coli cells. (B) A diagram of the E. coli genome of 4,639,221 nucleotide pairs (for E. coli strain K-12). The diagram is circular because the DNA of E. coli, like that of other procaryotes, forms a single, closed loop. Protein-coding genes are shown as yellow or orange bars, depending on the DNA strand from which they are transcribed; genes encoding only RNA molecules are indicated by green arrows. Some genes are transcribed from one strand of the DNA double helix (in a clockwise direction in this diagram), others from the other strand (counterclockwise).

Figure 4-13. The genome of S. cerevisiae (budding yeast). (A) The genome is distributed over 16 chromosomes, and its complete nucleotide sequence was determined by a cooperative effort involving scientists working in many different locations, as indicated (gray, Canada; orange, European Union; yellow, United Kingdom; blue, Japan; light green, St Louis, Missouri; dark green, Stanford, California). The constriction present on each chromosome represents the position of its centromere . (B) A small region of chromosome 11, highlighted in red in part A, is magnified to show the high density of genes characteristic of this species. As indicated by orange, some genes are transcribed from the lower strand ,while others are transcribed from the upper strand. There are about 6000 genes in the complete genome, which is 12,147,813 nucleotide pairs long.

The Human Genome

Figure 4-16. Scale of the human genome. If each nucleotide pair is drawn as 1 mm as in (A), then the human genome would extend 3200 km (approximately 2000 miles), far enough to stretch across the center of Africa, the site of our human origins (red line in B). At this scale, there would be, on average, a protein-coding gene every 300 m. An average gene would extend for 30 m, but the coding sequences in this gene would add up to only just over a meter.

Figure 4-15. The organization of genes on a human chromosome. (A) Chromosome 22, one of the smallest human chromosomes, contains 48 106 nucleotide pairs and makes up approximately 1.5% of the entire human genome. Most of the left arm of chromosome 22 consists of short repeated sequences of DNA that are packaged in a particularly compact form of chromatin (heterochromatin), which is discussed later in this chapter. (B) A tenfold expansion of a portion of chromosome 22, with about 40 genes indicated. Those in dark brown are known genes and those in light brown are predicted genes. (C) An expanded portion of (B) shows the entire length of several genes. (D) The intron-exon arrangement of a typical gene is shown after a further tenfold expansion. Each exon (red) codes for a portion of the protein, while the DNA sequence of the introns (gray) is relatively unimportant. The entire human genome (3.2 109 nucleotide pairs) is distributed over 22 autosomes and 2 sex chromosomes (see Figures 4-10 and 4-11). The term human genome sequence refers to the complete nucleotide sequence of DNA in these 24 chromosomes. Being diploid, a human somatic cell therefore contains roughly twice this amount of DNA. Humans differ from one another by an average of one nucleotide in every thousand, and a wide variety of humans contributed DNA for the genome sequencing project. The published human genome sequence is therefore a composite of many individual sequences.

Table 4-1. Vital Statistics of Human Chromosome 22 and the Entire Human Genome

DNA length Number of genes Smallest protein-coding gene Largest gene Mean gene size Smallest number of exons per gene Largest number of exons per gene Mean number of exons per gene Smallest exon size Largest exon size Mean exon size Number of pseudogenes** Percentage of DNA sequence in exons (protein coding sequences) Percentage of DNA in high-copy repetitive elements Percentage of total human genome

48 106 nucleotide pairs* approximately 700 1000 nucleotide pairs 583,000 nucleotide pairs 19,000 nucleotide pairs 1 54 5.4 8 nucleotide pairs 7600 nucleotide pairs 266 nucleotide pairs more than 134 3% 42% 1.5%

3.2 109 approximately 30,000 not analyzed 2.4 106 nucleotide pairs 27,000 nucleotide pairs 1 178 8.8 not analyzed 17,106 nucleotide pairs 145 nucleotide pairs not analyzed 1.5% approximately 50% 100%

The nucleotide sequence of 33.8 106 nucleotides is known; the rest of the chromosome consists primarily of very short repeated sequences that do not code for proteins or RNA. ** A pseudogene is a nucleotide sequence of DNA closely resembling that of a functional gene, but containing numerous deletion mutations that prevent its proper expression. Most pseudogenes arise from the duplication of a functional gene

Figure 4-17. Representation of the nucleotide sequence content of the human genome. LINES, SINES, retroviral-like elements, and DNA-only transposons are all mobile genetic elements that have multiplied in our genome by replicating themselves and inserting the new copies in different positions. Mobile genetic elements are discussed in Chapter 5. Simple sequence repeats are short nucleotide sequences (less than 14 nucleotide pairs) that are repeated again and again for long stretches. Segmental duplications are large blocks of the genome (1000200,000 nucleotide pairs) that are present at two or more locations in the genome. Over half of the unique sequence consists of genes and the remainder is probably regulatory DNA. Most of the DNA present in heterochromatin, a specialized type of chromatin (discussed later in this chapter) that contains relatively few genes, has not yet been sequenced.