Browsing Genomes with Ensembl

www.ensembl.org
www.ensemblgenomes.org

Coursebook v74

URL

Place - date

TABLE OF CONTENTS

Introduction to Ensembl
Exploring the Ensembl genome browser
  Demo: Ensembl species
  Exercises: Ensembl species
  Demo: The Region in detail view
  Exercises: The Region in Detail view
Genes and transcripts
  Demo: The gene tab
  Demo: The transcript tab
  Exercises: Genes and transcripts
BioMart
  Demo: BioMart
  Exercises: BioMart
Variation
  Demo: Exploring variants in Ensembl
  Exercises: Exploring variants in Ensembl
  Demo: The Variant Effect Predictor (VEP)
  Exercise: The Variant Effect Predictor (VEP)
Comparative genomics
  Demo: Gene trees and homologues
  Exercises: Gene trees and homologues
  Demo: Whole genome alignments
  Exercises: Whole genome alignments
Regulation
  Demo: Raw ChIPSeq data
  Demo: Regulatory features and segmentation
  Exercises: Regulation
Advanced Access
  Demo: Upload small files
  Demo: Attach URLs of large files
  Demo: REST API
Advanced exercise
Answers - Exploring the Ensembl genome browser
  Ensembl species
  Region in detail
Answers - Genes and Transcripts
Answers - BioMart
Answers - Variation
  Finding variants in Ensembl
  VEP
Answers - Comparative Genomics
  Gene trees and homologues
  Whole genome alignments
Answers - Regulation
Answers - Advanced exercise
Quick Guide to Databases and Projects

Introduction to Ensembl
Getting started with Ensembl
www.ensembl.org

Ensembl is a joint project between the EBI (European Bioinformatics
Institute) and the Wellcome Trust Sanger Institute that annotates
chordate genomes (i.e. vertebrates and closely related invertebrates
with a notochord such as sea squirt). Gene sets from model
organisms such as yeast and worm are also imported for comparative
analysis by the Ensembl Compara team. Most annotation is updated
every two months, leading to increasing Ensembl versions (such as
version 74); however, the gene sets are determined less frequently. A
sister browser at www.ensemblgenomes.org is set up to access
non-chordates, namely bacteria, plants, fungi, metazoa, and protists.

Ensembl provides genes and other annotation such as regulatory
regions, conserved base pairs across species, and sequence
variations. The Ensembl gene set is based on protein and mRNA
evidence in UniProtKB and NCBI RefSeq databases, along with
manual annotation from the VEGA/Havana group. All the data are
freely available and can be accessed via the web browser at
www.ensembl.org. Perl programmers can directly access Ensembl
databases through Application Programming Interfaces (Perl
APIs). Gene sequences can be downloaded from the Ensembl
browser itself, or through the BioMart web interface, which can
extract information from the Ensembl databases without requiring
any programming knowledge.


Synopsis - What can I do with Ensembl?

- View genes with other annotation along the chromosome.
- View alternative transcripts (i.e. splice variants) for a given
  gene.
- Explore homologues and phylogenetic trees across more than
  60 species for any gene.
- Compare whole genome alignments and conserved regions
  across species.
- View microarray sequences that match to Ensembl genes.
- View ESTs, clones, mRNA and proteins for any chromosomal
  region.
- Examine single nucleotide polymorphisms (SNPs) for a gene or
  chromosomal region.
- View SNPs across strains (rat, mouse), populations (human), or
  breeds (dog).
- View positions and sequence of mRNAs and proteins that align
  with Ensembl genes.
- Upload your own data.
- Use BLAST or BLAT against any Ensembl genome.
- Export sequence or create a table of gene information with
  BioMart.
- Determine how your variants affect genes and transcripts using
  the Variant Effect Predictor.
- Share Ensembl views with your colleagues and collaborators.

Need more help?

Check Ensembl documentation
Watch video tutorials on YouTube
View the FAQs
Try some exercises
Read some publications
Go to our online course

Stay in touch!

Email the team with comments or questions at
helpdesk@ensembl.org
Follow the Ensembl blog
Sign up to a mailing list

Further reading

Flicek, P. et al
Ensembl 2013
Nucleic Acids Res. Advanced Access (Database Issue)
http://www.ncbi.nlm.nih.gov/pubmed/23203987

Ensembl Methods Series
http://www.biomedcentral.com/series/ENSEMBL2010

Xosé M. Fernández-Suárez and Michael K. Schuster
Using the Ensembl Genome Server to Browse Genomic Sequence Data.
UNIT 1.15 in Current Protocols in Bioinformatics, Jun 2010.

Giulietta M. Spudich and Xosé M. Fernández-Suárez
Touring Ensembl: A practical guide to genome browsing
BMC Genomics 2010, 11:295 (11 May 2010)

Exploring the Ensembl genome browser

Demo: Ensembl species

The front page of Ensembl is found at ensembl.org. It contains lots of
information and links to help you navigate Ensembl:

[Figure: Ensembl front page. Labels: link back to homepage; Ensembl tools; blue bar remains visible on every Ensembl page; search; news; drop-down list of species; how-tos for commonly used Ensembl features.]

Click on View full list of all Ensembl species.

Click on the common name of your species of interest to go to the
species homepage. We'll click on Human.

[Figure: human species homepage. Labels: search; news; information and statistics; links to example features in Ensembl.]

To find out more about the genome assembly and genebuild, click on
More information and statistics.

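[Figure: More information and statistics page. Labels: information; tables of statistics.]

Let's take a look at the Ensembl Genomes homepage at
ensemblgenomes.org.

[Figure: Ensembl Genomes homepage. Labels: links to the taxa-specific sites; link back to Ensembl; news.]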

Click on the different taxa to see their homepages. Each one is colour-
coded.

[Figure: taxon homepages. Labels: Protists; Fungi; Metazoa; Plants; Bacteria.]

You can navigate most of the taxa in the same way as you would with
Ensembl, but Ensembl Bacteria has a large number of genomes, so it
needs slightly different methods. Let's look at it in more detail.

[Figure: Ensembl Bacteria homepage. Labels: search for a gene; search for a species; information on Ensembl Bacteria.]

There's no full species list for bacteria as it would be hard to navigate
with the number of species. To find a species, start to type the species
name into the species search box. A drop-down list will appear with
possible species.

For example, to find a substrain of Clostridium difficile, type in
Clostridium d.

The drop-down list contains various strains of Clostridium difficile.
Let's choose Clostridium difficile 630. This will take us to another
species homepage, where we can explore various features.

Exercises: Ensembl species

Exercise 1 - Panda

(a) Go to the species homepage for Panda. What is the name of the
genome assembly for Panda?

(b) Click on More information and statistics. How long is the Panda
genome (in bp)? How many genes have been annotated?

Exercise 2 - Zebrafish

(a) What's new in release 74 for zebrafish?

(b) What previous assembly is available for zebrafish?

Exercise 3 - Mosquitos

(a) Go to Ensembl Metazoa. How many species of the genus Anopheles
are there?

(b) Who published the genome sequence for Anopheles gambiae?

Exercise 4 - Bacteria

Go to Ensembl Bacteria and find the species Belliella baltica. How
many coding and non-coding genes does it have?

Demo: The Region in detail view

Start at the Ensembl front page, ensembl.org. You can search for a
region by typing it into a search box, but you have to specify the
species.

Type (or copy and paste) human 4:123792818-123867893 into
either search box.


Press Enter or click Go to jump directly to the Region in detail page.
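
As a minimal sketch (not part of Ensembl itself), the location string typed into the search box can be split into its parts with a few lines of Python. The function name parse_location and the regular expression are illustrative assumptions. Ensembl coordinates are 1-based and inclusive, so the length of a region is end - start + 1.

```python
import re

# Optional species word, then "chromosome:start-end"; commas in numbers are tolerated.
LOCATION_RE = re.compile(r"^\s*(?:(\w+)\s+)?(\w+):([\d,]+)-([\d,]+)\s*$")


def parse_location(text):
    """Split a location such as 'human 4:123792818-123867893' into its parts."""
    match = LOCATION_RE.match(text)
    if match is None:
        raise ValueError(f"not a location: {text!r}")
    species, chromosome, start, end = match.groups()
    start = int(start.replace(",", ""))
    end = int(end.replace(",", ""))
    if start > end:
        raise ValueError("start must not exceed end")
    return {
        "species": species,        # None when no species was given
        "chromosome": chromosome,
        "start": start,
        "end": end,
        "length": end - start + 1,  # 1-based, inclusive coordinates
    }


region = parse_location("human 4:123792818-123867893")
print(region)
```

The same function accepts the comma-separated style used in the exercises, for example parse_location("13:32,448,000-33,198,000").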

Click on the button to view page-specific help.

The help pages provide links to Frequently Asked Questions, a
Glossary, Video Tutorials, and a form to Contact HelpDesk.

There is a help video on this page at http://youtu.be/tTKEvgPUq94.

[Figure: Region in detail page. Labels: Location views; page-specific help; chromosome; scrollable 1Mb view; tool buttons; region of interest in detail.]

The Region in detail page is made up of three images. Let's look at
each one in detail.

The first image shows the chromosome:

[Figure: chromosome image. Labels: haplotypes and patches; our position; chromosome bands.]

You can jump to a different region by dragging out a box in this
image. Drag out a box on the chromosome and a pop-up menu will
appear.

[Figure: chromosome image with a box dragged out.]

If you wanted to move to the region, you could click on Jump to
region (###bp). For now, we'll close the pop-up by clicking on the X
in the corner.

The second image shows a 1Mb region around our selected region.
This view allows you to scroll back and forth along the chromosome.

[Figure: scrollable 1Mb view. Labels: region of interest; scrolling buttons; blocks represent genes, with names shown bottom left.]

At the moment the gene track is set to a fixed height. Click on the
Automatic track height button to expand the image to include all
possible data in the track.

Scroll along the chromosome by clicking and dragging within the
image. As you do this you'll see the image below grey out and two
blue buttons appear. Clicking on Update this image would jump the
lower image to the region central to the scrollable image. We want to
go back to where we started, so we'll click on Reset scrollable image.

You can also drag out and jump to a region. Either hold down shift
and drag in the image, or click on the Drag/Select button to
change the action of your mouse click, and drag out a box.

Click on the X to close the pop-up menu.

The third image is a detailed, configurable view of the region.

[Figure: detailed region image. Labels: forward-stranded transcripts; blue bar is the genome; reverse-stranded transcripts; track names; click and drag to change the position of tracks; legends.]

We can edit what we see on this page by clicking on the blue
Configure this page menu at the left.

This will open a menu that allows you to change the image.

You can put some tracks on in different styles; more details are in this
FAQ: http://www.ensembl.org/Help/Faq?id=335.

[Figure: configuration menu. Labels: configuration tabs; search for tracks; track categories; track information; track names; turn tracks on/off and change style.]

Let's add some tracks to this image. Add:
- Human proteins: Labels
- dbSNP variants: Normal
- 1000 Genomes AMR: Collapsed

Now click on the tick in the top left-hand corner to save and close the
menu. Alternatively, click anywhere outside of the menu. We can now see
the tracks in the image.

We can also change the way the tracks appear by hovering over the
track name then the cog wheel to open a menu. We can move tracks
around by clicking and dragging on the bar to the left of the track
name.

Now that you've got the view how you want it, you might like to show
something you've found to a colleague or collaborator. Click on the
Share this page button to generate a link. Email the link to someone
else, so that they can see the same view as you, including all the
tracks you've added. These links contain the Ensembl release
number, so if a new release or even assembly comes out, your link
will just take you to the archive site for the release it was made on.

To return this to the default view, go to Configure this page and select
Reset configuration at the bottom of the menu.

Exercises: The Region in Detail view

Exercise 5 - Exploring a genomic region in human

(a) Go to the region from 32,448,000 to 33,198,000 bp on human
chromosome 13. On which cytogenetic band is this region located?
How many contigs make up this portion of the assembly (contigs are
contiguous stretches of DNA sequence that have been assembled
solely based on direct sequencing information)?

(b) Zoom in on the BRCA2 gene.

(c) Turn on the Tilepath track in this view. What is this track? Are
there any Tilepath clones that contain the complete BRCA2 gene?

(d) Create a Share link for this display. Email it to yourself and open
the link.

(e) Export the genomic sequence of the region you are looking at in
FASTA format.

(f) Turn off all tracks you added to the Region in detail page.

Exercise 6 - Exploring patches and haplotypes in human

(a) Go to the region 6:112294691-112624977 in human. What is the
green highlighted region? (Tip: if you see a word or phrase you don't
know in Ensembl, search for it to see help pages.)

(b) Can you see the patches in the chromosome view? Drag out a box
to jump to a region containing the leftmost patch on this
chromosome, named HG27_patch (note: you must drag out a region
smaller than 1Mb). What are the coordinates of the patch?

(c) Can you compare this patch with the reference? What has
changed between this patch and the sequence it replaced?

(d) Go back to the Region in detail and scroll to the right in the 1Mb
view until you reach a red highlighted region. What is this?

Genes and transcripts

Demo: The gene tab

If you click on any one of the transcripts in the Region in detail image,
a pop-up menu will appear, allowing you to jump directly to that gene
or transcript.

[Figure: pop-up menu on a transcript. Label: links.]

Another way to go to a gene of interest is to search directly for it.

We're going to look at the human ESPN gene. This gene encodes a
multifunctional actin-bundling protein with a major role in mediating
sensory transduction in various mechanosensory and chemosensory
cells. Mutations in this gene are associated with deafness
(http://tinyurl.com/espn-ncbi-gene).

From ensembl.org, type ESPN into the search bar and click the Go
button. You will get a list of hits with the human gene at the top.

Where you search for something without specifying the species, or
where the ID is not restricted to a single species, the most popular
species will appear first; in this case, these are human, mouse and
zebrafish. You can restrict your query to species or features of
interest using the options on the left.

[Figure: search results. Label: links.]

Click on the gene name or Ensembl ID. The Gene tab should open:

[Figure: Gene tab. Labels: Gene tab; option to open the table of transcripts; ESPN-001 transcript, click for info; Gene views; forward-stranded transcripts; blue bar is the genome; reverse-stranded transcripts.]

Let's walk through some of the links in the left-hand navigation
column. How can we view the genomic sequence? Click Sequence at
the left of the page.

[Figure: Gene sequence view. Labels: most recent human genome assembly (GRCh37 = hg19); click Sequence; upstream sequence; exon of an overlapping gene; ESPN exon.]

The sequence is shown in FASTA format. Take a look at the FASTA
header:

[Figure: annotated FASTA header. Labels: name of genome assembly; chromosome; base pair start; base pair end; forward strand (-1 is reverse).]
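
The header fields labelled above can also be read programmatically. The sketch below assumes the location part of the header is a colon-separated token of the form coord_system:assembly:name:start:end:strand; the example header and its coordinates are made up for illustration, and the helper names are hypothetical.

```python
from typing import NamedTuple


class Location(NamedTuple):
    coord_system: str  # e.g. "chromosome"
    assembly: str      # name of the genome assembly, e.g. "GRCh37"
    name: str          # chromosome name
    start: int         # base pair start
    end: int           # base pair end
    strand: int        # 1 is forward, -1 is reverse


def find_location(header):
    """Return the first token in a FASTA header that has six colon-separated fields."""
    for token in header.lstrip(">").split():
        fields = token.split(":")
        if len(fields) == 6:
            coord_system, assembly, name, start, end, strand = fields
            return Location(coord_system, assembly, name,
                            int(start), int(end), int(strand))
    raise ValueError("no location field found in header")


# Hypothetical header with invented coordinates, for illustration only.
loc = find_location(">chromosome:GRCh37:1:1000:2000:-1")
print(loc)
print("reverse strand" if loc.strand == -1 else "forward strand")
```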

Exons are highlighted within the genomic sequence. Variations can
be added with the Configure this page link found at the left. Click on it
now.
[Figure: configuration menu. Labels: show variants; turn on line numbers.]

Once you have selected changes (in this example, Show variations
and Line numbering) click at the top right.

[Figure: sequence with variants. Label: links to the Variation tab.]

Let's look at where our gene is expressed. Click on Expression in the
left-hand menu.


Hover over the column titles for a pop-up definition.

Can our gene be found in other databases? Go up the left-hand menu
to External references:

This contains links to the gene in other projects, such as EntrezGene.

To find out more about the individual transcripts of this gene, click
on Transcript comparison in the left-hand menu.

You must now choose the transcripts you'd like to see. Click on the
blue Select transcripts button.

[Figure: Select transcripts menu. Labels: click on the + to add a transcript; select all transcripts of a particular biotype.]

Let's select all the protein-coding transcripts, then close the menu.

[Figure: Transcript comparison view. Labels: legend; gene sequence; transcript sequences.]

Demo: The transcript tab

Let's now explore one splice isoform. Click on Show transcript table
at the top.

Click on the ID for the largest one, ESPN-001 (ENST00000377828).

You are now in the Transcript tab for ESPN-001. The left-hand
navigation column provides several options for the transcript
ESPN-001. Click on the Exons link.

[Figure: Exons view. Colour key: green, flanking sequence; grey, coding sequence; purple, UTR; blue, introns.]

You may want to change the display (for example, to show more
flanking sequence, or to show full introns). To do so, click on
Configure this page and change the display options accordingly.

If you would like to export the sequence, including the colours, click
Download view as RTF. A Rich Text Format document will be
generated that can be opened in a word processor such as MS Word.

Now click on the cDNA link to see the spliced transcript sequence.


UnTranslated Regions (UTRs) are highlighted in dark yellow, codons
are highlighted in light yellow, and exon sequence is shown in black
or blue letters to show exon divides. Sequence variants are
represented by highlighted nucleotides and clickable IUPAC codes
are above the sequence.

Next, follow the General identifiers link at the left.

This page shows information from other databases such as RefSeq,
UniProtKB, CCDS and others, that match to the Ensembl transcript
and protein.

Click on Ontology table to see GO terms from the Gene Ontology
consortium (www.geneontology.org).

Click on the to see a guide to the three-letter Evidence codes.

Now click on Protein summary to view domains from Pfam, PROSITE,
Superfamily, InterPro, and more.

[Figure: Protein summary view. Labels: Ensembl ESPN protein; protein domains.]

Clicking on Domains & features shows a table of this information.


Exercises: Genes and transcripts

Exercise 7 - Exploring the human MYH9 gene

(a) Find the human MYH9 (myosin, heavy chain 9, non-muscle) gene,
and go to the Gene tab.

- On which chromosome and which strand of the genome is this
  gene located?
- How many transcripts (splice variants) are there?
- How many of these transcripts are protein coding?
- What is the longest transcript, and how long is the protein it
  encodes?
- Which transcript has a CCDS record associated with it?
- Why is the CCDS important? What does it tell us?

(b) Click on Phenotype at the left side of the page. Are there any
diseases associated with this gene, according to OMIM (Online
Mendelian Inheritance in Man)?

(c) In the transcript table, click on the transcript ID for MYH9-001,
and go to the Transcript tab.

- How many exons does it have?
- Are any of the exons completely or partially untranslated?
- Is there an associated sequence in UniProtKB/Swiss-Prot? Have a look
  at the General identifiers for this transcript.
- What are some functions of MYH9-001 according to the Gene Ontology
  consortium? Have a look at the Ontology table for this transcript.

(d) Are there microarray (oligo) probes that can be used to monitor
ENST00000216181 expression?

Exercise 8 - Finding a gene associated with a phenotype

Phenylketonuria is a genetic disorder caused by an inability to
metabolise phenylalanine in any body tissue. This results in an
accumulation of phenylalanine causing seizures and mental
retardation.

(a) Search for phenylketonuria from the Ensembl homepage. What
gene is associated with this disorder?

(b) What tissues is this gene expressed in? Is this surprising, given
the gene's role in disease? What is meant by Intron-spanning reads
and RNASeq alignments?

(c) How many protein coding transcripts does this gene have? View
all of these in the transcript comparison view.

(d) What is the MIM disease identifier for this gene?

Exercise 9 - Exploring a plant gene (Vitis vinifera, grape)

Start in http://plants.ensembl.org/index.html and select the Vitis
vinifera genome.

(a) What GO: biological process terms are associated with the MADS4
gene?

(b) Go to the transcript tab for the only transcript,
Vv01s0010g03900.t01. How many exons does it have? Which one is
the longest? How much of that is coding?

(c) What domains can be found in the protein product of this
transcript? How many different domain prediction methods agree
with each of these domains?

BioMart

Demo: BioMart

Follow these instructions to guide you through BioMart to answer the
following query:

You have three questions about a set of human genes:
ESPN, MYH9, USH1C, CISD2, THRB, DFNB31
(these are HGNC gene symbols. More details on the HUGO
Gene Nomenclature Committee can be found on
http://www.genenames.org)

1) What are the EntrezGene IDs for these genes?

2) Are there associated functions from the GO (gene
ontology) project that might help describe their function?

3) What are their cDNA sequences?

STEP 1: Click on BioMart in the top header of a www.ensembl.org
page to go to: www.ensembl.org/biomart/martview

NOTE: These answers were determined using BioMart Ensembl 74.

STEP 2:
Choose Ensembl Genes 74 as the primary database.

STEP 3:
Choose Homo sapiens genes as the dataset.

STEP 4:
Click Filters at the left.
Expand the GENE panel.

STEP 5:
In ID List Limit, paste in your gene symbols. Change the heading to
read HGNC symbol(s) [e.g. ZFY].

STEP 6:
Click Count to see that BioMart is reading 6 genes out of 64,138
possible H. sapiens genes. Since we entered 6 gene symbols, this
confirms that our filters have worked correctly.

STEP 7:
Click on Attributes to select output options (e.g. GO terms).

STEP 8:
Expand the EXTERNAL panel.

STEP 9:
Scroll down to select EntrezGene ID (to answer question 1).

STEP 10:
Also select HGNC symbol to see the input gene symbols we started
with.

STEP 11:
Scroll back up to select GO term fields (to answer question 2).

STEP 12:
Click Results.

Why are there multiple rows for one gene ID? For example, look at the
first few rows.
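
The filter and attribute choices made in steps 2 to 11 can also be written down as a BioMart XML query, which is the form BioMart uses for scripted access. The sketch below only builds the query text and does not contact any server. The internal names used here (hsapiens_gene_ensembl, hgnc_symbol, entrezgene, go_id) are assumptions: BioMart's internal dataset, filter and attribute names differ between releases, so check them against the release you are using.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Build a BioMart XML query string from filters and attributes."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # One Filter element per filter; multiple values are comma-separated.
    for name, values in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": ",".join(values)})
    # One Attribute element per output column.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    prolog = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
    return prolog + ET.tostring(query, encoding="unicode")


genes = ["ESPN", "MYH9", "USH1C", "CISD2", "THRB", "DFNB31"]
xml_query = build_biomart_query(
    "hsapiens_gene_ensembl",                 # assumed internal dataset name
    {"hgnc_symbol": genes},                  # assumed internal filter name
    ["hgnc_symbol", "entrezgene", "go_id"],  # assumed internal attribute names
)
print(xml_query)
```

Writing the query this way makes the structure of the demo explicit: one dataset, one filter holding the six gene symbols, and one attribute per output column.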

STEP 13:
Click Attributes again.

STEP 14:
Select Sequences at the top, then expand SEQUENCES and choose the
option cDNA sequences (to answer question 3).

STEP 15:
Expand Header Information to select the Associated Gene Name (this is
the official gene name; for human it is the HGNC symbol, which was our
original input).

STEP 16:
Click Results to see the cDNA sequences in FASTA format.

STEP 17:
Change View 10 rows to View All rows so that you see the full table.

Note: Pop-up blocking must be
switched off in your browser.

Note: you can use the Go button to export a file.
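
Once the cDNA sequences have been exported as a FASTA file, they can be read back with a short script. This is a minimal sketch with a hypothetical function name (read_fasta) and invented example records; it assumes each record starts with a ">" header line followed by one or more sequence lines.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping each header to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[header].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {name: "".join(parts) for name, parts in records.items()}


# Invented records, for illustration only.
example = """>GENE_A
ATGGCC
TTTAA
>GENE_B
ATGCCC
"""

sequences = read_fasta(example)
for name, seq in sequences.items():
    print(name, len(seq))
```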

What did you learn about the human genes in this exercise?
Could you learn these things from the Ensembl browser? Would it take
longer?

For more details on BioMart, have a look at these publications:

Smedley, D. et al
BioMart - biological queries made easy
BMC Genomics 2009 Jan 14;10:22

Kinsella, R.J. et al
Ensembl BioMarts: a hub for data retrieval across taxonomic
space.
Database (Oxford) 2011:bar030

Exercises: BioMart

Exercise 10 - Finding genes by protein domain

Find mouse proteins with transmembrane domains located on
chromosome 9.

Exercise 11 - Convert IDs

BioMart is a very handy tool when you want to convert IDs from
different databases. The following is a list of 29 IDs of human
proteins from the NCBI RefSeq database
(http://www.ncbi.nlm.nih.gov/projects/RefSeq/):

NP_001218 NP_001220
NP_203125 NP_004338
NP_203124 NP_004337
NP_203126 NP_116786
NP_001007233 NP_036246
NP_150636 NP_116756
NP_150635 NP_116759
NP_001214 NP_001221
NP_150637 NP_203519
NP_150634 NP_001073594
NP_150649 NP_001219
NP_001216 NP_001073593
NP_116787 NP_203520
NP_001217 NP_203522
NP_127463

Generate a list that shows to which Ensembl Gene IDs and to which
HGNC symbols these RefSeq IDs correspond. Do these 29 proteins
correspond to 29 genes?

Hint: For this exercise, it's easier to copy and paste the IDs from the
online exercise booklet (copy one column, then the other). See the
front cover for the URL.

Exercise 12 - Export homologues

For a list of Ciona savignyi Ensembl genes, export the human
orthologues.

ENSCSAVG00000000002
ENSCSAVG00000000003
ENSCSAVG00000000006
ENSCSAVG00000000007
ENSCSAVG00000000009
ENSCSAVG00000000011

Exercise 13 - Export structural variants

You can use BioMart to query variants, not just genes. (Make sure you
use the right Datasets.)

(a) Export the study accession, source name, chromosome, sequence
region start and end (in bp) of human structural variations (SV) on
chromosome 1, starting at 130,408 and ending at 210,597.

(b) In a new BioMart query, find the alleles, phenotype descriptions,
and associated genes for the human SNPs rs1801500 and rs1801368.
Can you view this same information in the Ensembl browser?

Exercise 14 - Find genes associated with array probes

Forrest et al performed a microarray analysis of peripheral blood
mononuclear cell gene expression in benzene-exposed workers
(Environ Health Perspect. 2005 June; 113(6): 801-807). The
microarray used was the human Affymetrix U133A/B (also called
U133 plus 2) GeneChip. The top 25 up-regulated probe-sets were:

207630_s_at 221641_s_at
221840_at 202055_at
219228_at 226743_at
204924_at 228393_s_at
227613_at 225120_at
223454_at 218515_at
228962_at 202224_at
214696_at 200614_at
210732_s_at 212014_x_at
212370_at 223461_at
225390_s_at 209835_x_at
227645_at 213315_x_at
226652_at

(a) For the genes corresponding to these probe-sets, retrieve the
Ensembl Gene and Transcript IDs as well as their HGNC symbols and
descriptions.

(b) In order to analyse these genes for possible promoter/enhancer
elements, retrieve the 2000 bp upstream of the transcripts of these
genes.

(c) In order to be able to study these human genes in mouse, identify
their mouse orthologues. Also retrieve the genomic coordinates of
these orthologues.
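
BioMart works out upstream flanks for you, but the arithmetic behind part (b) is worth seeing once. The sketch below assumes Ensembl-style coordinates (1-based, inclusive, with start less than or equal to end on both strands), so the 5' end of a reverse-strand transcript is its end coordinate. The function name and the example coordinates are hypothetical.

```python
def upstream_region(start, end, strand, flank=2000):
    """Return (start, end) of the flank upstream of a transcript's 5' end."""
    if strand == 1:
        # Forward strand: upstream lies at lower coordinates than the start.
        return max(1, start - flank), start - 1
    if strand == -1:
        # Reverse strand: upstream lies at higher coordinates than the end.
        return end + 1, end + flank
    raise ValueError("strand must be 1 or -1")


# Invented transcript coordinates, for illustration only.
print(upstream_region(10000, 12000, 1))
print(upstream_region(10000, 12000, -1))
```

The forward-strand flank is clamped at position 1, so a transcript very close to the start of a chromosome gets a shorter flank. The end of the chromosome is not checked here because the sketch does not know chromosome lengths.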

Variation

Demo: Exploring variants in Ensembl

In any of the sequence views shown in the Gene and Transcript tabs,
you can view variants on the sequence. You can do this by clicking on
Configure this page from any of these views.

Let's take a look at the Gene sequence view for MCM6 in human.
Search for MCM6 and go to the Sequence view.

If you can't see variants marked on this view, click on Configure this
page and select Show variations: Yes and show links.

[Figure: Gene sequence view with variants. Labels: legend of variant consequence types; links to variants; variants on sequences shown as IUPAC codes.]

Find out more about a variant by clicking on it.

You can add variants to all other sequence views in the same way.
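
The IUPAC codes shown at variant positions are the standard nucleotide ambiguity codes: each letter stands for a set of possible bases. As a small illustration (the function name iupac_code is hypothetical, and this handles single-base alleles only), the code for a set of alleles can be looked up like this:

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def iupac_code(alleles):
    """Return the IUPAC ambiguity code for single-base alleles given as 'G/A'."""
    bases = frozenset(allele.upper() for allele in alleles.split("/"))
    if bases not in IUPAC:
        raise ValueError(f"not a set of single-base alleles: {alleles!r}")
    return IUPAC[bases]


print(iupac_code("G/A"))
print(iupac_code("C/T"))
```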
You can go to the Variation tab by clicking on the variant ID. For now,
we'll explore more ways of finding variants.

To view all the sequence variations in table form, click the Variation
table link at the left of the gene tab.

The table is divided into consequence types.

Click on Show to expand a detailed table for any of the consequence
types available.

Let's expand Missense variants.

[Figure: Missense variants table. Labels: variant IDs; transcript affected; SIFT and PolyPhen scores.]

The table contains lots of information about the variants. You can
click on the IDs here to go to the Variation tab too.

Let's look at Structural Variation in the Gene tab. You'll find it in the
left-hand menu.

[Figure: Structural variation view. Labels: all larger SVs are condensed into a single bar; smaller SVs are shown individually; table of all SVs.]

You can click on the structural variants (SVs) in the image, or on their
IDs in the table to go to the SV tab.
You can also see the phenotypes associated with a gene. Click on
Phenotypes in the left-hand menu.

[Figure: Phenotypes view. Labels: phenotypes associated with the gene; phenotypes associated with variants in the gene; click to see a list of variants.]

Let's have a look at variants in the Location tab. Click on the Location
tab in the top bar.

Click on Configure this page and open Variation from the left-hand menu.

There are various options for turning on variants. You can turn on
variants by source, by frequency, by presence of a phenotype or by the
individual genome they were isolated from. Turn on the following
sequence variants in Expanded with name:
- 1000 genomes All
- 1000 genomes All common
- All phenotype-associated variants
- ENSEMBL:Venter

Also turn on Larger and Smaller Structural variants (all sources) in
Expanded.

[Figure: Region in detail with variation tracks. Labels: SNPs and indels; SVs; variation legends.]

Click on a variant to find out more information. It may be easier to
see the individual variants if you zoom in.

Let's zoom in on the region 2:136607850-136609811 by typing it
into the Location box.

Now that we are zoomed in, we can see the variant names. Click on
the variant rs4988235 to open a pop-up, then click on rs4988235
properties to open the Variation tab.

[Figure: Variation tab. Labels: variant information; Variation views; variation icons, which go to the same places as the links on the left.]

The icons show you what information is available for this variant.
Click on Genes and regulation, or follow the link at the left.

This variant is found in three transcripts of the MCM6 gene. It has not
been associated with any regulatory features or motifs.

Let's look at population genetics. Either click on Explore this variant
in the left-hand menu then click on the Population genetics icon, or
click on Population genetics in the left-hand menu.

[Figure: Population genetics view. Labels: pie charts of allele frequencies; expand subpopulations; table of more detailed data.]

These data are mostly from the 1000 Genomes and HapMap
projects in human.
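
Allele frequencies like those in the pie charts follow directly from genotype counts: each individual contributes two alleles, and an allele's frequency is its count divided by the total number of alleles. A minimal sketch, using invented genotype counts and a hypothetical function name:

```python
from collections import Counter


def allele_frequencies(genotype_counts):
    """Compute allele frequencies from counts of genotypes written as 'G|A'."""
    allele_counts = Counter()
    for genotype, n_individuals in genotype_counts.items():
        # Each individual with this genotype contributes both of its alleles.
        for allele in genotype.split("|"):
            allele_counts[allele] += n_individuals
    total = sum(allele_counts.values())
    return {allele: count / total for allele, count in allele_counts.items()}


# Invented counts for one population, for illustration only.
counts = {"G|G": 10, "G|A": 20, "A|A": 70}
frequencies = allele_frequencies(counts)
print(frequencies)
```

With these invented counts there are 200 alleles in total, 40 of them G and 160 of them A.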

There are big differences in allele frequencies between populations.
Let's have a look at the phenotypes associated with this variant to see
if they are known to be specific to certain human populations. Either
click on Explore this variant in the left-hand menu then click on the
Phenotype data icon, or click on Phenotype Data in the left-hand
menu.

This variant is associated with lactase persistence, which is known to
be common in European populations, and rare in Asian populations,
exactly as we saw in the allele frequencies in these populations.

Are there other variants in the genome that also cause lactase
persistence? Click on [View on Karyotype] to find out.

[Screenshot callouts: Hits on the karyotype; Legend showing hit significance; Table of variants]

Two variants are known to be associated with this phenotype. Both
are found within the MCM6 gene.

Click back to the Variation Tab. Click on Phylogenetic context to see
the variant in other species.

[Screenshot callouts: Choose your alignment; Aligned regions; SNP of interest; Alignment between species]

The variant is not marked in the other species. This means that the
variant arose in humans.

Exercises: Exploring variants in Ensembl

Exercise 15 Human population genetics and phenotype data

The SNP rs1738074 in the 5' UTR of the human TAGAP gene has been
identified as a genetic risk factor for a few diseases.

(a) In which transcripts is this SNP found?

(b) What is the least frequent genotype for this SNP in the Yoruba
(YRI) population from the HapMap set?

(c) What is the ancestral allele? Is it conserved in the 37 eutherian
mammals?

(d) With which diseases is this SNP associated? Are there any known
risk (or associated) alleles?

Exercise 16 Exploring a SNP in human

The missense variation rs1801133 in the human MTHFR gene has
been linked to elevated levels of homocysteine, an amino acid whose
plasma concentration seems to be associated with the risk of
cardiovascular diseases, neural tube defects, and loss of cognitive
function. This SNP is also referred to as A222V, Ala222Val as well
as other HGVS names.

(a) Find the page with information for rs1801133.

(b) Is rs1801133 a Missense variation in all transcripts of the MTHFR
gene?

(c) Why are the alleles for this variation in Ensembl given as G/A and
not as C/T, as in dbSNP and literature?
(http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1801133)

(d) What is the major allele in rs1801133?

(e) In which paper is the association between rs1801133 and
homocysteine levels described?

(f) According to the data imported from dbSNP, the ancestral allele
for rs1801133 is G. Ancestral alleles in dbSNP are based on a
comparison between human and chimp. Does the sequence at this
same position in four other primates, i.e. gorilla, orangutan, macaque
and marmoset, confirm that the ancestral allele is G?

(g) Were both alleles of rs1801133 already present in Neanderthal?
To answer this question, have a look at the individual reads at its
genomic position in the Neanderthal Genome Browser
(http://neandertal.ensemblgenomes.org/).

Exercise 17 Structural variation in human

In the paper The influence of CCL3L1 gene-containing segmental
duplications on HIV-1/AIDS susceptibility (Gonzalez et al Science.
2005 Mar 4; 307(5704):1434-40) it is shown that a higher copy
number of the CCL3L1 (Chemokine (C-C motif) ligand 3-like 1) gene is
associated with lower susceptibility to HIV infection.

(a) Find the human CCL3L1 gene.

(b) Have any CNVs been annotated for this gene? Note: In Ensembl,
CNVs are classified as structural variants.

Exercise 18 Exploring a SNP in mouse

Madsen et al in the paper Altered metabolic signature in pre-diabetic
NOD mice (PloS One. 2012; 7(4): e35445) have described several
regulatory and coding SNPs, some of them in genes residing within
the previously defined insulin dependent diabetes (IDD) regions. The
authors describe that one of the identified SNPs in the murine Xdh
gene (rs29522348) would lead to an amino acid substitution and
could be damaging, as predicted by SIFT (http://sift.jcvi.org/).

(a) Where is the SNP located (chromosome and coordinates)?

(b) What is the HGVS recommendation nomenclature for this SNP?

(c) Why does Ensembl put the C allele first (C/T)?

(d) Are there differences between the genotypes reported in
NOD/LTJ and BALB/cByJ?

Demo: The Variant Effect Predictor (VEP)

We have analysed a sample from a patient with a genetic disorder.
The patient presents with facial and limb deformities, mental
retardation and gastrointestinal reflux. Our genotyping has identified
a mutation that may be responsible for the phenotype:
An A->G mutation on chromosome 5 at 37,017,205 on the + strand.

We will use the Ensembl VEP to determine:

Has my variant already been annotated in Ensembl?
What genes are affected by my variant?
Does my variant result in a protein change?

Go to the front page of Ensembl and click on the VEP button.

This page contains information about the VEP, including links to
download the script version of the tool. Click on Launch the online
VEP tool!

This will open up a dialogue box. This allows us to input data on our
variant.

[Screenshot callouts: Give your data a name; Put your data in here (you can also upload a file)]

The data is in the format:
Chromosome Start End alleles (reference/mutation) strand

Delete the writing already in the Paste data box and type in:
5 37017205 37017205 A/G +
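
When you have more than a handful of variants, it can be easier to generate the input lines than to type them. The following is a minimal sketch that writes one line in the column order described above (chromosome, start, end, alleles, strand). The function name vep_input_line is hypothetical and is not part of any Ensembl tool.

```python
def vep_input_line(chromosome, start, end, reference, alternate, strand="+"):
    """Format one variant as 'chromosome start end ref/alt strand'."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return f"{chromosome} {int(start)} {int(end)} {reference}/{alternate} {strand}"


# The A->G change from the demo: chromosome 5, position 37,017,205, + strand.
line = vep_input_line("5", 37017205, 37017205, "A", "G", "+")
print(line)
```

For a single-base substitution the start and end are the same position, as in the demo line.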

Scroll down to see some of the options we can also choose.

[Screenshot callouts: Choose which database to map your variant to; Find out if variants already exist in our database; Choose to see scores for protein changes; Choose to only see common or rare variants]

Select Prediction and Score for SIFT predictions and PolyPhen
predictions. These are algorithms that predict how deleterious a
mutation will be on a protein.

When you've selected everything you need, scroll right to the bottom
and click Next.

Click HTML to view your results with clickable links.

[Screenshot callouts: Our mutation affects two transcripts of one gene; Our mutation causes an amino acid change; Our mutation is already in the Ensembl database]

Exercise: The Variant Effect Predictor (VEP)

Exercise 19 VEP

Resequencing of the genomic region of the human CFTR (cystic
fibrosis transmembrane conductance regulator (ATP-binding
cassette sub-family C, member 7)) gene (ENSG00000001626) has
revealed the following variants (alleles defined in the forward
strand):
G/A at 7:117,171,039
T/C at 7:117,171,092
T/C at 7:117,171,122

(a) Use the VEP tool in Ensembl and choose the options to see SIFT
and PolyPhen predictions. Do these variants result in a change in the
proteins encoded by any of the Ensembl genes? Which gene? Have
the variants already been found?

(b) Go to Region in detail for CFTR. Do you see the VEP track?

Comparative genomics

Demo: Gene trees and homologues

Let's look at the homologues of human BRCA2. Search for the gene
and go to the Gene tab.

Click on Gene tree (image), which will display the current gene in the
context of a phylogenetic tree used to determine orthologues and
paralogues.

[Screenshot callouts: Collapsed nodes; Protein alignments; Gene of interest; Legend]

Funnels indicate collapsed nodes. We can expand them by clicking on
the node and selecting Expand this sub-tree from the pop-up menu.

[Screenshot callout: Expand this sub-tree]

We can look at homologues in the Orthologues and Paralogues pages,
which can be accessed from the left-hand menu. The numbers of
orthologues or paralogues available are indicated in brackets
alongside the name. If there are none, then the name will be greyed
out. Paralogues is greyed out for BRCA2 indicating that there are no
paralogues available.

Click on Orthologues to see the 61 orthologues available.

[Screenshot callouts: Orthologue types; Choose a taxon of interest; Information on orthologues]

Choose to see only Rodent orthologues by selecting the box. The
table below will now only show details of rodent orthologues. Let's
look at mouse.

Links from the orthologue allow you to go to alignments of the
orthologous proteins and cDNAs. Click on Alignment (protein) for the
mouse orthologue.

[Screenshot callouts: Information on orthologue pair; Alignment in Clustal W format; Protein IDs]

Exercises: Gene trees and homologues

Exercise 20 Orthologues, paralogues and gene trees for the
human BRAF gene.

(a) How many orthologues are predicted for this gene in primates?
Note the Target %id and Query %id.
How much sequence identity does the Tarsius syrichta protein have
to the human one? Click on the Alignment link next to the Ensembl
identifier column to view a protein alignment in Clustal format.

(b) Go to the orthologue in marmoset. Is there a genomic alignment
between marmoset and human? Is there a gene for both species in
this region?

Demo: Whole genome alignments

Let's look at some of the comparative genomics views in the Location
tab. Go to the region 2:176914144-177094980 in human, which
contains the HoxD cluster. This cluster is involved in limb development
and is highly conserved between species.

In the Region in detail view, we can already see the Constrained
elements for 37 eutherian mammals EPO_LOW_COVERAGE track by
default. This track indicates regions of high conservation between
species, considered to be constrained by evolution.

This track has a matching conservation score track. Click on
Configure this page, then Comparative genomics and turn on the
track for Conservation score for 37 eutherian mammals
EPO_LOW_COVERAGE. Save and close the menu.

You can now see the conservation scores that were used to
determine the peaks indicated in the constrained elements track.

We can also look at individual species comparative genomics tracks
in this view by clicking on Configure this page.

Select BLASTz/LASTz alignments from the left-hand menu to choose
alignments between closely related species. Turn on the alignments
for Mouse and Chimpanzee in Normal. Go to Translated blat
alignments and turn on alignments with Zebrafish and Xenopus in
Normal. Save and close the menu.

[Screenshot callouts: Nucleotide alignments in baby pink; Protein alignments in magenta; Filled boxes are aligned sequences, empty boxes are no alignments]

The alignment is greatest between closely related species.

We can also look at the alignment between species or groups of
species as text. Click on Alignments (text) in the left hand menu.

Select Mouse from the alignments list then click Go.

[Screenshot callouts: Choose an alignment from the drop-down; Multiple alignments; Pairwise alignments]

You will see a list of the regions aligned, followed by the sequence
alignment. Exons are shown in red.

This can also be viewed graphically. Click on Alignments (image) in
the left-hand menu.

[Screenshot callouts: Mouse is already selected (from text view); Human region; Mouse region, rearranged to align with human]

In both alignment views, the contig in the compared species is
rearranged to align to the species of interest. To compare with both
contigs in their natural order, go to Region comparison.

To add species to this view, click on the blue Select species or regions
button. Choose Mouse from the list then close the menu.

[Screenshot callouts: Human region; Aligned regions are linked up; Mouse region]

We can view large scale syntenic regions from our chromosome of
interest. Click on Synteny in the left hand menu.

[Screenshot callouts: Human chromosome; Choose another species or chromosome; Mouse chromosome with syntenic region; Syntenic regions; Region of interest; Table of syntenic genes]

Exercises: Whole genome alignments
Exercise 21 Zebrafish orthologues

Go to www.ensembl.org to find the dbh gene on the zebrafish
genome.
(a) Go to the Location page for this gene. View the Alignments
(image) and Alignments (text) for the 5 teleost fish. Which fish
genomes are represented in the alignment? Do all the fish show a
gene in these alignments?

(b) Export the alignments (as Clustal).

(c) Click on the Region in detail link at the left and turn on the tracks
for multiple alignments and conservation score for the 5 teleost fish
EPO by configuring the page.

What is the difference between the 5 teleost fish EPO multiple
alignment track and the Constrained elements track? Which regions
of the gene do most of the constrained element blocks match up to?

Can you find more information on how the constrained elements
track was generated?

Exercise 22 Synteny

Go to www.ensembl.org
Find the Rhodopsin (RHO) gene for Human. Go to the Location tab.

(a) Click Synteny at the left. Are there any syntenic regions in dog? If
so, which chromosomes are shown in this view?

(b) Stay in the Synteny view. Is there a homologue in dog for human
RHO? Are there more genes in this syntenic block with homologues?

Exercise 23 Whole genome alignments

(a) Find the Ensembl BRCA2 (Breast cancer type 2 susceptibility
protein) gene for human and go to the Region in detail page.

(b) Turn on the BLASTZ or LASTZ-net alignment tracks for chicken,
chimp, mouse and platypus and the Translated BLAT alignment
tracks for anole lizard and zebrafish. Does the degree of conservation
between human and the various other species reflect their
evolutionary relationship? Which parts of the BRCA2 gene seem to be
the most conserved? Did you expect this?

(c) Have a look at the Conservation score and Constrained elements
tracks for the set of 37 mammals and the set of 21 amniota
vertebrates. Do these tracks confirm what you already saw in the
tracks with pairwise alignment data?

(d) Retrieve the genomic alignment for a constrained element.
Highlight the bases that match in >50% of the species in the
alignment.

(e) Retrieve the genomic alignment for the BRCA2 gene for primates.
Highlight the bases that match in >50% of the species in the
alignment.
Regulation

Demo: Raw ChIPSeq data

We're going to add some regulation data to the Region in detail
view. We'll start at the human region 11:2012486-2030153, which
contains the imprinted H19 gene.

Add regulation tracks using Configure this page. First, we're going to
add ChIP-seq data for histone modifications and polymerase binding.
Click on Histones & polymerases under Regulation in the left-hand
menu.

[Screenshot callouts: Add tutorial labels to help use this view; Legend; Cell lines; Choose track styles; Histone modifications; Select boxes]

You can turn on a single track by clicking on the box in the matrix.
Note that certain tracks are selected for all cell lines by default (PolII,
PolIII, H3K27me3, H3K36me3, H3K4me3, H3K9me3). These will
appear in the Region in detail view only if you specify a track style for
the cell lines.

Turn on all the tracks for GM12878. Hover over the cell line name
then select All.

Now choose the track style for the tracks you've switched on. Click on
the track style box for GM12878 and select Both.

There is a similar matrix for Open chromatin & TFBS. Use this to turn
on all tracks for GM12878 in Both.

Close the menu to see the tracks in the browser.

[Screenshot callouts: Peaks of histone modifications; Histograms of histone modifications; Click for legend of histogram colours]

Demo: Regulatory features and segmentation

These data are used to construct the Reg-feats and Segmentation
features. The merged Reg-feats are switched on in the Region in
detail view by default.

Click on Configure this page. Then select Regulatory features. Turn on
the Reg. Feats: GM12878 and Reg. Segs: GM12878 tracks.

Save and close the menu.

[Screenshot callouts: Reg feats are shown as bar and whisker plots; A single coloured bar represents the segmentation; Legends of reg feats and segmentation colour codes]

Can you see correlations between the different kinds of regulatory
data representation?

You can also add methylation data using Configure this page. Find it
under DNA methylation and turn on GM12878 RRBS ENCODE and
GM12878 WGBS ENCODE.

Our regulatory data incorporates the ENCODE data. To see the raw
ENCODE data and the ENCODE segmentation, you need to add the
ENCODE hub.

From ensembl.org, click on the ENCODE icon.

This page contains information about the ENCODE data and how it is
incorporated into Ensembl.

Add the ENCODE hub by clicking on the Link to add the ENCODE
track hub.

This will take you directly to the matrices for adding ENCODE data to
the Region in detail view. The ENCODE matrices work in the same
way as the Open chromatin & TFBS and Histones & polymerases
matrices, except that some have multiple options (indicated by
numbers within the boxes).

Exercises: Regulation

Exercise 24 Gene regulation: Human STX7

(a) Find the Location tab (Region in detail page) for the STX7 gene.
Are there regulatory features in this gene region? If so, where in the
gene do they appear?

(b) Click Configure this page and on the Regulatory features menu in
the left hand side. Turn on Segmentation features for HUVEC,
HeLa-S3, and HepG2 cell types. Do any of these cells show predicted
enhancer regions in the STX7 region?

(c) Use Configure this page to add supporting data indicating open
chromatin for HeLa-S3 cells. Are there sites enriched for marks of
open chromatin (DNase1 and FAIRE) in HeLa cells at the 5' end of
STX7?

(d) Configure this page once again to add histone modification
supporting data for the same cell type as above (e.g. HeLa-S3). Which
ones are present at the 5' end of STX7?

(e) Is there any data to support methylated CpG sites in this region
(5' end) of STX7 in B-cells?

(f) Create a Share link for this display. Email it to yourself then open
the link.

Exercise 25 Regulatory features in human

The HLA-DRB1 and HLA-DQA1 genes are part of the human major
histocompatibility complex class II (MHC-II) region and are located
about 44 kb from each other on chromosome 6. In the paper The
human major histocompatibility complex class II HLA-DRB1 and
HLA-DQA1 genes are separated by a CTCF-binding enhancer-blocking
element (Majumder et al J Biol Chem. 2006 Jul 7;281(27):18435-43)
a region of high acetylation located in the intergenic sequences
between HLA-DRB1 and HLA-DQA1 is described. This region, termed
XL9, coincided with sequences that bound the insulator protein
CCCTC-binding factor (CTCF). Majumder et al hypothesise that the
XL9 region may have evolved to separate the transcriptional units of
the HLA-DR and HLA-DQ genes.

(a) Go to the region from 32,540,000 to 32,620,000 bp on human
chromosome 6

(b) Is there a regulatory feature annotated in the intergenic region
between the HLA-DRB1 and HLA-DQA1 genes that has CTCF binding
supporting data as (part of) its core evidence?

(c) Has the CTCF binding detected at this position been observed in
all cell/tissue types analysed?

(d) Have a look at the Regulatory supporting evidence - Histones &
Polymerases configuration matrix. For which cell/tissue type are the
most histone acetylation data sets available? In this cell/tissue type,
is the region that shows CTCF binding also a region of high
acetylation, as found by Majumder et al?
Advanced Access

Demo: Upload small files

We have some patients that present with microcephaly and
developmental delay. They all have large scale deletions on
chromosome five:

Patient Chromosome Start End
P1 5 36821632 37091234
P2 5 36731476 36978306
P3 5 36908552 37108671

We can turn them into a BED file and view them in the genome
browser.

To find out about BED format, click on Help & Documentation in the
top bar from any page in Ensembl:

Click on BED File Format to find out more:

This page describes the BED file format.

For our data, we have chromosome coordinates and a name for each
feature. Following the instructions on this page, we can put our data
into BED format as follows:

chr5 36821632 37091234 P1
chr5 36731476 36978306 P2
chr5 36908552 37108671 P3
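
The same conversion can be scripted for longer patient lists. The sketch below reproduces the three lines above from the patient table; the names PATIENT_DELETIONS and to_bed_line are invented for illustration. One caveat: the BED specification counts start positions from zero, whereas the demo copies the coordinates across unchanged. The sketch follows the demo by default and offers the strict conversion as an optional flag.

```python
# Patient deletions from the table in this demo: (name, chromosome, start, end).
PATIENT_DELETIONS = [
    ("P1", "5", 36821632, 37091234),
    ("P2", "5", 36731476, 36978306),
    ("P3", "5", 36908552, 37108671),
]


def to_bed_line(name, chromosome, start, end, zero_based_start=False):
    """Format one feature as a four-column BED line: chrom start end name."""
    if zero_based_start:
        # Strict BED: starts are 0-based, ends are unchanged.
        start -= 1
    return f"chr{chromosome} {start} {end} {name}"


bed_lines = [to_bed_line(*row) for row in PATIENT_DELETIONS]
print("\n".join(bed_lines))
```

For deletions spanning hundreds of kilobases, a one-base shift in the start makes no visible difference in the browser, but it matters when the file is passed to tools that compare exact coordinates.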

To see this data in Ensembl, we need to go to a region of interest.
We'll go to the region of these data. Put human
5:36700000-37110000 into the top right search box to jump to the
Region in detail page.

Click on the Add your data button at the left. If you've previously
added data to Ensembl, this button will say Manage your data
instead.

A menu will appear:
[Screenshot callouts: Choose a name for the data; Species is human; Select BED]

More options will now appear in the menu. Since upload is allowed
for BED, this option appears. You are still able to attach a URL if you
want to.


Paste the BED data into the box then click Upload.

You should get to a dialogue box telling you your upload has been
successful. Close the menu to go back to your region of interest.

[Screenshot callouts: The data in the browser; Hover over the track name to change its appearance]

To have a look at the file, click on Manage your data.

Save, share or
delete this data.

If you've got an Ensembl account, you can save this data to your
account. Accounts are free to set up and allow you to save
configurations and data, and share with groups.

Demo: Attach URLs of large files

Larger files, such as BAM files generated by NGS, need to be attached
by URL. I've put a BAM file of human chromosome 20 RNASeq data
online at:
http://www.ebi.ac.uk/~emily/Workshops/BAM

Let's take a look at that URL.

Here you can see two files, Illumina_reads_test.bam and
Illumina_reads_test.bam.bai (the files beginning with ._ are artefacts
of creating this folder on a Mac; ignore them). These files are the
BAM file and the index file respectively. When attaching a BAM file to
Ensembl, there must be an index file in the same folder.
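
Because the index sits next to the BAM file and carries the same name plus .bai, its address can be derived from the BAM URL. The sketch below does only that string manipulation, using a hypothetical helper called bam_index_url. It does not contact the server, so it cannot confirm that either file is actually there.

```python
def bam_index_url(bam_url):
    """Return the URL where the .bai index is expected for a BAM URL."""
    if not bam_url.endswith(".bam"):
        raise ValueError("expected a URL ending in .bam")
    # Follows the naming seen in this demo: <file>.bam -> <file>.bam.bai
    return bam_url + ".bai"


bam = "http://www.ebi.ac.uk/~emily/Workshops/BAM/Illumina_reads_test.bam"
index = bam_index_url(bam)
print(index)
```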

To attach the file, click on Manage your data, then click on Add your
data to add a new track.

We get to the same dialogue box as before. This time we'll name our
data Illumina reads and choose BAM as the data format.

Paste in the URL of the BAM file itself
(http://www.ebi.ac.uk/~emily/Workshops/BAM/Illumina_reads_test.bam),
then click Attach.


Close the menu.

To see this data, jump to a region on chromosome 20. Let's go to the
region of the CDH22 gene. Search for the gene and click on the
location.

[Screenshot callouts: BAM read intensity; BAM reads]

We can zoom in to see the sequence itself. Drag out boxes in the view
to zoom in, until you see a view like this.

[Screenshot callouts: Consensus BAM read sequence; Sequence of individual BAM reads; Genomic sequence]

Demo: REST API

I have the coordinates of a particular protein motif with respect to
the protein that it's in. I would like to find out where this motif lies on
the genome.

I'm interested in a coiled-coil domain at position 116-216 in the
protein ENSP00000386200.

To do this I want to use the REST API. I'll start at the REST homepage
at http://beta.rest.ensembl.org/.

Here you can see a list of all the possible REST endpoints, with names
and short descriptions. Scroll down to find the section Mapping. The
endpoint GET map/translation/:id/:region does what we want. Click
on the link.

[Screenshot callouts: Description of the endpoint; Required parameters (what the endpoint NEEDS to work); Optional parameters (allow you to choose your output format); Example requests; Code examples in different languages for accessing this endpoint; The example output shown by default]

If you wish to extract this data using a language such as Perl, Python,
Ruby or Java, or to get the data using command line tools such as Curl
or Wget, you can click on them to see code examples. We're just going
to do a simple lookup using a URL.

The top of the page shows us that the method is
map/translation/:id/:region. That means that we can get our data
using a URL in the format
beta.rest.ensembl.org/map/translation/:id/:region.

For our data we can use the URL
http://beta.rest.ensembl.org/map/translation/ENSP00000386200/116..216
Put this into your internet browser.
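
The URL follows directly from the endpoint pattern: the protein ID replaces :id, and the start and end positions joined by two dots replace :region. A minimal sketch of that substitution is below. The function name translation_map_url is made up, and the host is simply the one quoted in this coursebook, which may have changed since it was written.

```python
from urllib.parse import quote

# Host as quoted in this coursebook; treat it as an assumption that may be out of date.
REST_HOST = "http://beta.rest.ensembl.org"


def translation_map_url(protein_id, start, end, host=REST_HOST):
    """Fill in the map/translation/:id/:region pattern for a protein region."""
    if start > end:
        raise ValueError("start must not be greater than end")
    region = f"{start}..{end}"
    return f"{host}/map/translation/{quote(protein_id)}/{region}"


url = translation_map_url("ENSP00000386200", 116, 216)
print(url)
```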

This will take you to a text page:

From this we can see that our coiled-coil domain covers two different
regions, which will be two different exons of the transcript. They are
on chromosome 7 and span 114268607-114268732 and
114269860-114270036.

If we were accessing this data programmatically, the standard output
format would allow us to extract the data.
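
As a rough illustration of programmatic access, the sketch below parses a JSON response and lists the mapped regions. It does not contact the server. The sample text is built by hand from the coordinates reported in this demo, and the field names (mappings, seq_region_name, start, end) are assumptions about the response layout that you should check against the endpoint's documentation page.

```python
import json

# Hand-written sample: coordinates are from this demo, field names are assumed.
SAMPLE_RESPONSE = """
{"mappings": [
  {"seq_region_name": "7", "start": 114268607, "end": 114268732},
  {"seq_region_name": "7", "start": 114269860, "end": 114270036}
]}
"""


def mapped_regions(response_text):
    """Return (chromosome, start, end) tuples from a mapping response."""
    data = json.loads(response_text)
    return [(m["seq_region_name"], m["start"], m["end"]) for m in data["mappings"]]


def total_bases(regions):
    """Sum the inclusive lengths of the mapped genomic blocks."""
    return sum(end - start + 1 for _, start, end in regions)


regions = mapped_regions(SAMPLE_RESPONSE)
for chrom, start, end in regions:
    print(f"{chrom}:{start}-{end}")

# Residues 116..216 are 101 amino acids, so the blocks should cover 303 bases.
residues = 216 - 116 + 1
print(total_bases(regions) == 3 * residues)
```

The final check is a useful sanity test: the two genomic blocks are 126 and 177 bases long, which together make the 303 bases needed to encode 101 residues.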

Advanced exercise

This exercise requires you to combine the knowledge you have
gained about different aspects of Ensembl. It is designed to be
challenging and force you to come up with solutions yourself.

Methylation data in human

The human PDHA2 gene, which encodes a subunit of the pyruvate
dehydrogenase complex, is exclusively expressed in spermatogenic
cells. In the paper Human testis-specific PDHA2 gene: Methylation
status of a CpG island in the open reading frame correlates with
transcriptional Activity (Pinheiro et al Mol Genet Metab. 2010
Apr;99(4):425-30), two CpG islands in the PDHA2 gene are reported,
one encompassing the core promoter region and extending into the
open reading frame, the other exclusively located in the coding
region. The latter CpG island was shown to be methylated in somatic
tissues but demethylated in testicular germ cells and has therefore
been proposed to play an important role in the tissue-specific
expression of the PDHA2 gene.

(a) Find the PDHA2 gene for human and go to the Region in detail
page. Zoom out one step, so that 5 kb around the PDHA2 gene is
shown.

(b) Turn on the CpG islands track. Two CpG islands are reported in
the PDHA2 gene by Pinheiro et al (2010). Do they appear in this
track? If not, why not? (Tip: turn on Display empty tracks to confirm
that a track is on but has no data.)

(c) Confirm the existence of the two CpG islands using the EMBOSS
program CpGPlot
(http://www.ebi.ac.uk/Tools/emboss/cpgplot/index.html) on the
sequence around the PDHA2 gene.

(d) Upload the CpG islands found by CpGPlot using Manage your data.
Use BED format, which in its simplest form just consists of the
chromosome and the start and end coordinates, separated by spaces
(as an optional fourth field, you can add a name/description). The
genomic start and end coordinates of the CpG islands can be
calculated from the genomic start coordinate of the sequence on
which the CpGPlot program was run and the relative location of the
CpG islands on this sequence as given by the CpGPlot output.
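
The calculation described in (d) is a simple offset. If the exported sequence starts at genomic position S and an island runs from position a to b within it (counting from 1), the island spans S + a - 1 to S + b - 1 on the genome. The sketch below shows this with invented numbers: the chromosome, the start coordinate and the island positions are placeholders, not the answer to the exercise. It also assumes the sequence was exported from the forward strand.

```python
def island_to_genomic(sequence_start, island_start, island_end):
    """Convert 1-based positions within a sequence to genomic coordinates."""
    if island_start < 1 or island_end < island_start:
        raise ValueError("island coordinates must satisfy 1 <= start <= end")
    offset = sequence_start - 1
    return offset + island_start, offset + island_end


def bed_line(chromosome, start, end, name=None):
    """Format a BED line, with the name as an optional fourth field."""
    fields = [f"chr{chromosome}", str(start), str(end)]
    if name:
        fields.append(name)
    return " ".join(fields)


# Placeholder values for illustration only.
genomic_start, genomic_end = island_to_genomic(1000001, 201, 450)
print(bed_line("1", genomic_start, genomic_end, "island1"))
```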

(e) Create a link to allow you to show your new BED track to
colleagues, compared to the %GC track.

(f) What is the methylation status of the two CpG islands in different
tissues? Is there any tissue in particular which is different to other
tissues?

(g) Turn on the RNASeq tracks for different tissues. Is there evidence
that PDHA2 is expressed in one tissue more than others? How does
this relate to the DNA methylation data you saw? What does this
suggest about the way this gene is regulated?

(h) How well conserved is the region of the PDHA2 gene amongst the
37 eutherian mammals? Are the CpG islands conserved?

(i) How many GO terms are associated with PDHA2? Can you export
the sequences of all human genes that are also associated with the
first of these terms?

(j) Can you fetch the gene sequence for PDHA2 in FASTA using the
Ensembl REST API?
Answers Exploring the Ensembl genome browser

Ensembl species

Exercise 1 Panda

(a) Select Panda from the drop down species list, or click on View full
list of all Ensembl species, then choose Panda from the list.
The assembly is ailMel1 or GCA_000004335.1

(b) Click on More information and statistics. Statistics are shown in
the tables on the left.
The length of the genome is 2,245,312,831 bp.
There are 19,343 coding genes.

Exercise 2 Zebrafish

(a) Click on Zebrafish on the front page of Ensembl to go to the
species homepage. News is in the top right.
What's new in Zebrafish release 73:
Splicing events
Structural variations
Zebrafish knockout data

(b) Assembly Zv8 is available in the archived release 59.

Exercise 3 Mosquitos

(a) Go to metazoa.ensembl.org. Open the drop down list or click on
View full list of all Ensembl Metazoa species.
There are two Anopheles species: Anopheles gambiae and
Anopheles darlingi.

(b) Click on Anopheles gambiae, then on More information and
statistics.
The genome was published in 2002 by Holt et al and
updated in 2007 by Sharakhova et al.

Exercise 4 Bacteria

Go to bacteria.ensembl.org and start to type the name Belliella baltica
into the search species box. It will autocomplete, allowing you to
select Belliella baltica DSM 15883, (TaxID 866536) from the
drop-down list. Click on More information and statistics.
Belliella baltica has 3,680 coding genes and 53 non-coding.

Region in detail

Exercise 5 Exploring a genomic region in human

(a) Go to the Ensembl homepage (http://www.ensembl.org/).

Select Search: Human and type 13:32448000-33198000 in the text
box (or alternatively leave the Search drop-down list like it is and
type human 13:32448000-33198000 in the text box).
Click Go.
This genomic region is located on cytogenetic band q13.1. It is
made up of seven contigs, indicated by the alternating light and
dark blue coloured bars in the Contigs track.

(b) Draw with your mouse a box encompassing the BRCA2
transcripts. Click on Jump to region in the pop-up menu.

(c) Click Configure this page in the side menu (or on the cog wheel
icon in the top left hand side of the bottom image).

Type tilepath in the Find a track text box.
Select Tilepath.
Click on the (i) button to find out more
The tilepath track shows the BAC clones that the assembly was
based upon.
Save and close the new configuration by clicking on (or anywhere
outside the pop-up window).
There is not just one clone that contains the complete BRCA2
gene. The BAC clone RP11-37E23 contains most of the gene,
but not its very 3' end (contained in RP11-298P3). This is
reflected in the two contigs that make up the entire BRCA2
gene (the Contigs track is on by default).

(d) Click Share this page in the side menu.

Select the link and copy.
Compose an email to yourself, paste the link in and send the message.
Open the email and click on your link. You should be able to view the
page with the new configuration and data tracks you added in
the Location tab.

(e) Click Export data in the side menu. Leave the default parameters
as they are.
Click Next>.
Click on Text.

Note that the sequence has a header that provides information about
the genome assembly (GRCh37), the chromosome, the start and end
coordinates and the strand. For example:

>13 dna:chromosome
chromosome:GRCh37:13:32883613:32978196:1
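
The last field of this header packs the location into colon-separated parts, so it is easy to pull apart in a script. The sketch below assumes the header is on a single line and that the final field has six parts in the order seen here (coordinate system, assembly, sequence name, start, end, strand). The function name parse_header_location is made up for illustration.

```python
def parse_header_location(header):
    """Parse the last field of an exported FASTA header into its parts."""
    location = header.lstrip(">").split()[-1]
    parts = location.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected six colon-separated parts, got {len(parts)}")
    coord_system, assembly, name, start, end, strand = parts
    return {
        "coord_system": coord_system,
        "assembly": assembly,
        "name": name,
        "start": int(start),
        "end": int(end),
        "strand": int(strand),
    }


header = ">13 dna:chromosome chromosome:GRCh37:13:32883613:32978196:1"
info = parse_header_location(header)
print(info["assembly"], info["name"], info["start"], info["end"], info["strand"])
print(info["end"] - info["start"] + 1)
```

The second line printed is the length of the exported region in base pairs, which should match the number of bases in the sequence that follows the header.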

(f) Click Configure this page in the side menu.
Click Reset configuration.
Click .

Exercise 6 Exploring patches and haplotypes in human


(a) Select Search: Human and type 6:112294691-112624977 in the text
box (or alternatively leave the Search drop-down list like it is and
type human 6:112294691-112624977 in the text box).
Click Go.

You will see a green highlighted region in the middle of this region.
Click on the thin dark green bar in any of the three views to see the
label HG1304_PATCH. To learn about patches, open a new tab in
your internet browser, go to the Ensembl homepage and put patch
into the search box.

Choose Help & Docs from the left hand side. There are glossary terms
(Patch and Alternative sequence) and an FAQ (What haplotypes and
assembly patches can I see for human?) that explain patches.

(b) Patches are marked in green in the chromosome view at the top.
Click on the leftmost patch to confirm that it is definitely HG27_patch.
Drag a box around it (less than 1Mb) then click on Jump to region.

Scroll down to the Region in detail view and click on the thin dark
green bar at the top of the patch. A drop-down containing the
coordinates of the patch will appear.
6: 26585843-26859228

(c) Another option in this drop-down is Compare with reference.
Click on this.

Scroll down the page to see the comparison between the patch and
reference. Aligned sequences are highlighted in pink and linked
together in green.
The sequences in this region have been rearranged.

(d) Click the back button in your browser to return to the Region in
detail page. Using your mouse, click and drag within the 1Mb view to
move right. The red highlighted regions are all labelled HSCHR6_MHC
etc, which is the MHC haplotypic region. Search help again to
understand what haplotypes are, in the same way as you did for
patches.

Answers Genes and Transcripts

Exercise 7 - Exploring the human MYH9 gene

(a) Go to the Ensembl homepage (http://www.ensembl.org).

Select Search: Human and type MYH9. Click Go, then Human on the
results page. Click on Gene.

Click on either the Ensembl ID ENSG00000100345 or the HGNC
official gene name MYH9.

Chromosome 22 on the reverse strand.
Ensembl has 11 transcripts annotated for this gene.
Three transcripts are protein coding.
The longest transcript is MYH9-001 and it codes for a protein
of 1,960 amino acids.
MYH9-001 has a CCDS record. CCDS is the consensus coding
sequence set, in which coding sequences (CDS) are agreed upon
by Ensembl, Havana, NCBI and UCSC.
The CCDS set is a collection of reviewed, agreed-upon coding
sequences (for human and mouse). These sequences are of high
confidence, and unlikely to change in the future.

(b) These are some of the phenotypes associated with MYH9
according to MIM: autosomal dominant deafness, Epstein
syndrome, and Fechtner syndrome. Click on the records for
more information.

(c) Click on ENST00000216181

It has 41 exons. This is shown in the Transcript summary or in
the left hand side menu Exons.
Click on the Exons link in this side menu. Exon 1 is completely
untranslated, and exons 2 and 41 are partially untranslated
(UTR sequence is shown in purple). You can also see this in the
cDNA view if you click on the cDNA link in the left side menu.
MYH9_HUMAN from UniProt/Swiss-Prot matches the
translation of the Ensembl transcript. Click on MYH9_HUMAN
to go to UniProtKB, or click align for the alignment.
The Gene Ontology project (http://www.geneontology.org/)
maps terms to a protein in three classes: biological process,
cellular component, and molecular function. Meiotic spindle
organisation, cell morphogenesis, and cytokinesis are some of
the roles associated with MYH9-001.

(d) Click on Oligo probes in the side menu.
Probesets from Affymetrix, Agilent, Codelink, Illumina, and
Phalanx match to this transcript sequence. Expression analysis
with any of these probesets would reveal information about the
transcript. Hint: this information can sometimes be found in the
ArrayExpress Atlas: www.ebi.ac.uk/arrayexpress/

Exercise 8 - Finding a gene associated with a phenotype

(a) Start at the Ensembl homepage (http://www.ensembl.org).

Type phenylketonuria into the search box then click Go. Choose Gene
from the left hand menu.
The gene associated with this disorder is PAH, phenylalanine
hydroxylase, ENSG00000171759.

(b) Click on the gene symbol to go to the Gene tab. Click on
Expression in the left hand menu.
The gene is expressed in all tissues listed. This is unsurprising
for a metabolic gene.

Hover over the column titles to view definitions.
Intron spanning reads are RNASeq reads that cover exon
junctions.
RNASeq alignments are RNASeq reads that align to the genome.

(c) If the transcript table is hidden, click on Show transcript table to
see it.
There are four protein coding transcripts.

Click on Transcript comparison in the left hand menu. Click on Select
transcripts. Either select all the transcripts labelled protein coding
one-by-one, or click on the drop down and select Protein coding.
Close the menu.

(d) Click on External references.
The MIM disease ID is 261600.

Exercise 9 - Exploring a plant gene (Vitis vinifera, grape)

(a) Go to http://plants.ensembl.org/index.html

Select Vitis vinifera from the drop down menu All genomes select a
species or click on View full list of all Ensembl Plants species and then
choose V. vinifera.

Type MADS4 and click on the gene name link MADS4
[VIT_01s0010g03900 ].
Click on GO: biological process in the side menu.
There are nine terms listed including GO:0006351,
transcription, DNA-dependent, and GO:0006355, regulation of
transcription, DNA-dependent.

(b) Click on the transcript tab named Vv01s0010g03900.t01 (or on
the Transcript tab). Click on Exons in the left hand menu.
There are eight exons. Exon 8 is the longest at 303 bp,
of which 13 bp are coding.

(c) Click on either Protein Summary or Domains & features in the left
hand menu to see the domains graphically or as a table, respectively.
A TF_MADSbox is identified by six domain prediction methods.
A TF_Kbox domain is identified by two. Two coiled-coils are
identified by one.

Answers BioMart

Exercise 10 - Finding genes by protein domain

As with all BioMart queries you must select the dataset, set your
filters (input) and define your attributes (desired output). For this
exercise:
Dataset: Ensembl genes in mouse
Filters: Transmembrane proteins on chromosome 9
Attributes: Ensembl gene and transcript IDs and Associated gene
names
Go to the Ensembl homepage (http://www.ensembl.org) and click on
BioMart at the top of the page.
Select Ensembl genes as your database and Mus musculus genes as
the dataset.
Click on Filters on the left of the screen and expand REGION. Change
the chromosome to 9.
Now expand PROTEIN DOMAINS, also under filters, and select
Transmembrane domains and then Only. Clicking on Count should
reveal that you have filtered the dataset down to 420 genes.
Click on Attributes and expand GENE. Select Associated gene name.
Now click on Results. The first 10 results are displayed by default;
display all results by selecting ALL from the drop down menu.

The output will display the Ensembl gene ID, Ensembl Transcript ID
and Associated gene names of all proteins with a transmembrane
domain on mouse chromosome 9. If you prefer, you can also export
as an Excel sheet by using the Export all results to XLS option.
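
The dataset, filter and attribute choices above can also be written down as a BioMart XML query (the BioMart interface has an XML button that shows the query for the current selection). The sketch below only builds such a query string; it does not contact any server. The internal names used here (mmusculus_gene_ensembl, chromosome_name, with_transmembrane_domain, ensembl_gene_id, ensembl_transcript_id, external_gene_id) are assumptions that vary between releases and should be checked against the XML shown by BioMart itself. The function name build_biomart_query is hypothetical.

```python
import xml.etree.ElementTree as ET

def build_biomart_query(dataset, value_filters, boolean_filters, attributes):
    """Build a BioMart-style XML query string.

    dataset         -- internal dataset name
    value_filters   -- dict of internal filter name -> value
    boolean_filters -- list of internal filter names to apply as 'Only'
    attributes      -- list of internal attribute names (output columns)
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    dataset_el = ET.SubElement(
        query, "Dataset", {"name": dataset, "interface": "default"}
    )
    for name, value in value_filters.items():
        ET.SubElement(dataset_el, "Filter", {"name": name, "value": str(value)})
    for name in boolean_filters:
        # excluded="0" keeps only entries that have the feature ("Only").
        ET.SubElement(dataset_el, "Filter", {"name": name, "excluded": "0"})
    for name in attributes:
        ET.SubElement(dataset_el, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")

# ASSUMED internal names; confirm them with the XML button in BioMart.
xml_query = build_biomart_query(
    dataset="mmusculus_gene_ensembl",
    value_filters={"chromosome_name": "9"},
    boolean_filters=["with_transmembrane_domain"],
    attributes=["ensembl_gene_id", "ensembl_transcript_id", "external_gene_id"],
)
print(xml_query)
```

Keeping the three parts as separate arguments mirrors the dataset, filters and attributes steps of the web interface, so the same helper can describe the other BioMart exercises.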

Exercise 11 - Convert IDs

Click New.
Choose the ENSEMBL Genes 73 database.
Choose the Homo sapiens genes (GRCh37) dataset.

Click on Filters in the left panel.
Expand the GENE section by clicking on the + box.
Select ID list limit - RefSeq protein ID(s) and enter the list of IDs in
the text box (either comma separated or as a list).
HINT: You may have to scroll down the menu to see these.

Count shows 11 genes. Remember that one gene may have multiple splice
variants coding for different proteins, which is why these 29
proteins do not correspond to 29 genes.

Click on Attributes in the left panel.
Select the Features attributes page.
Expand the External section by clicking on the + box.
Select HGNC symbol and RefSeq Protein ID from the External
References section.

Click the Results button on the toolbar.
Select View All rows as HTML or export all results to a file. Tick the
box Unique results only.
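
To see why the protein count and the gene count differ, and what Unique results only does, here is a toy illustration in Python. The IDs and symbols below are invented placeholders, not results from BioMart.

```python
from collections import defaultdict

# HYPOTHETICAL (protein ID, gene symbol) rows, invented for illustration.
rows = [
    ("NP_EXAMPLE1", "GENE_A"),
    ("NP_EXAMPLE2", "GENE_A"),   # second splice variant of GENE_A
    ("NP_EXAMPLE3", "GENE_B"),
    ("NP_EXAMPLE3", "GENE_B"),   # repeated row, removed by de-duplication
    ("NP_EXAMPLE4", "GENE_C"),
    ("NP_EXAMPLE5", "GENE_C"),
]

# Equivalent in spirit to ticking 'Unique results only'.
unique_rows = sorted(set(rows))

# Group proteins by gene: several proteins can share one gene.
proteins_by_gene = defaultdict(set)
for protein_id, gene in unique_rows:
    proteins_by_gene[gene].add(protein_id)

n_proteins = len({protein_id for protein_id, _ in unique_rows})
print(n_proteins, "proteins map to", len(proteins_by_gene), "genes")
```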

Exercise 12 - Export homologues

Click New.
Choose the Ciona savignyi genes (CSAV2.0) dataset.

Enter the gene list in the ID List Limit box.

Select the Homologs attributes page.
Expand the Orthologs section by clicking on the + box.
Select Human Ensembl Gene ID.
Click Results (remember to tick the unique results only box).

Exercise 13 - Export structural variants

(a) Choose Ensembl Variation 74 and Homo sapiens Structural
Variation.
Filters: Region: Chromosome 1, Base pair start: 130408, Base pair
end: 210597
Count shows 6 out of 3,577,025 structural variants.
Attributes: Structural Variation (SV) Information: DGVa Study
Accession and Source Name
Structural Variation (SV) Location: Chromosome name, Sequence
region start (bp) and Sequence region end (bp).

(b) Choose Ensembl Variation 74 and Homo sapiens Short Variation
(SNPs and indels).
Filters: Filter by Variation ID enter: rs1801500, rs1801368
Attributes: Variation Name, Variant Alleles, Phenotype description,
and Associated gene.
You can view this same information in the Ensembl browser.
Click on one of the variation IDs (names) in the result table. The
variation tab should open in the Ensembl browser. Click
Phenotype Data.

Exercise 14 - Find genes associated with array probes

(a) Click New.
Choose the Homo sapiens genes (GRCh37) dataset.

Select ID list limit - Affy hg u133 plus 2 probeset ID(s) and enter the
list of probeset IDs in the text box (either comma separated or as a
list).

Count shows 25 genes match this list of probesets.

Select the Features attributes page.
In addition to the default selected attributes, select Description.
Expand the External section by clicking on the + box.
Select HGNC symbol from the External References section and AFFY
HG U133-PLUS-2 from the Microarray Attributes section.

Select View All rows as HTML or export all results to a file. Tick the
box Unique results only.
Your results should show that the 25 probes map to 25
Ensembl genes.

(b) Don't change the Dataset and Filters; simply click on Attributes.

Select the Sequences attributes page.
Expand the SEQUENCES section by clicking on the + box.
Select Flank (Transcript) and enter 2000 in the Upstream flank text
box.
Expand the Header information section by clicking on the + box.
Select, in addition to the default selected attributes, Description and
Associated Gene Name.

Note: Flank (Transcript) will give the flanks for all transcripts of a
gene with multiple transcripts. Flank (Gene) will give the flanks for
one possible transcript in a gene (the most 5' coordinates for
upstream flanking).
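
The difference between the two flank options can be sketched in Python. This is an illustration of the idea only, not BioMart's implementation: the function name upstream_flank and the transcript coordinates are hypothetical, and coordinates are assumed to be 1-based and inclusive on the forward strand.

```python
def upstream_flank(start, end, strand, flank=2000):
    """Return (flank_start, flank_end) for the region 5' of a feature.

    'Upstream' is relative to the feature's own strand (1 or -1).
    The result is not clipped to the chromosome length.
    """
    if strand == 1:
        return max(1, start - flank), start - 1
    if strand == -1:
        return end + 1, end + flank
    raise ValueError("strand must be 1 or -1")

# HYPOTHETICAL transcripts of one forward-strand gene: (start, end).
transcripts = [(10_500, 18_000), (10_000, 18_000), (12_000, 17_500)]

# Flank (Transcript): one flank per transcript.
per_transcript = [upstream_flank(s, e, 1) for s, e in transcripts]

# Flank (Gene): a single flank from the most 5' coordinate of the gene.
gene_start = min(s for s, _ in transcripts)
gene_end = max(e for _, e in transcripts)
per_gene = upstream_flank(gene_start, gene_end, 1)

print(per_transcript)
print(per_gene)
```

For a reverse-strand feature the upstream flank lies at higher coordinates, which is why the strand has to be passed in explicitly.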


(c) You can leave the Dataset and Filters the same, and go directly to
the Attributes section:

Select the Homologs attributes page.
Select Associated Gene Name.
Deselect Ensembl Transcript ID.
Expand the ORTHOLOGS section by clicking on the + box.
Select Mouse Ensembl Gene ID, Mouse Chromosome Name, Mouse
Chr Start (bp) and Mouse Chr End (bp).

Check the box Unique results only. Select View All rows as HTML or
export all results to a file.
Your results should show that for most of the human genes at
least one mouse orthologue has been identified.

Answers Variation

Finding variants in Ensembl

Exercise 15 - Human population genetics and phenotype data

(a) Please note there is more than one way to get this answer. Either
go to the Variation Table for the human TAGAP gene, and Show
variants in the 5' UTR, or search Ensembl for rs1738074 directly.

Once you're in the Variation tab, click on the Genes and regulation
link or icon. This SNP is found in three transcripts
(ENST00000326965, ENST00000338313, and ENST00000367066).

(b) Click on Population genetics at the left of the variation tab. (Or,
click on Explore this variation at the left and click the Population
genetics icon.)
In Yoruba (CSHL-HAPMAP:HapMap-YRI population), the least
frequent genotype is CC at a frequency of 9.7%. This is also
the least frequent genotype in other populations (to find out
what the three-letter population codes are, have a look at our FAQ:
http://www.ensembl.org/Help/Faq?id=328).

(c) Click on phylogenetic context.
The ancestral allele is T and it is inferred from the alignment in
primates.

Select the 37 eutherian mammals EPO LOW COVERAGE alignment
and click on Go.
A region containing the SNP (highlighted in red and placed in
the centre) and its flanking sequence are displayed. The T allele
is conserved in all but three of the 37 eutherian mammals
displayed. Note that one species has no alignment in that region
and many other species have no variation database.

(d) Click Phenotype Data at the left of the Variation page.
This variation is associated with diabetes, multiple sclerosis
and coeliac disease. There are known risk alleles for both multiple
sclerosis and coeliac disease, and the corresponding P values are
provided. The allele A is associated with coeliac disease. Note
that the alleles reported by Ensembl are T/C. Ensembl reports
alleles on the forward strand. This suggests that A was
reported on the reverse strand in the PubMed article.

You can view External Data sources that mirror data from
SNPedia and LOVD. We share information about the effects of
variations in DNA, citing peer-reviewed scientific publications.
Click on SNPedia and LOVD in the left hand menu to explore
further. No LOVD data was found for this variant so far.

Exercise 16 - Exploring a SNP in human

(a) Type rs1801133 in the Search box, then click Go.
Click on rs1801133.

(b) Click on Genes and Regulation in the side menu (or the Genes and
Regulation icon).
No, rs1801133 is a missense variant in four MTHFR transcripts.
It is a downstream gene variant of ENST00000418034.

(c) In Ensembl, the alleles of rs1801133 are given as G/A because
these are the alleles in the forward strand of the genome. In the
literature and in dbSNP, the alleles are given as C/T because the
MTHFR gene is located on the reverse strand. The alleles in the
actual gene and transcript sequences are C/T.
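
The relationship between forward-strand alleles and the alleles quoted for a reverse-strand gene is a base-by-base complement. A minimal Python sketch (the function name complement_alleles is made up for this example):

```python
# Translation table for DNA complements (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def complement_alleles(allele_string):
    """Convert an allele string such as 'G/A' to the opposite strand.

    Each allele is reverse-complemented; for single bases this is
    simply the complement. A deletion marker '-' is left unchanged.
    """
    return "/".join(
        allele.translate(COMPLEMENT)[::-1]
        for allele in allele_string.split("/")
    )

print(complement_alleles("G/A"))  # forward-strand alleles of rs1801133
```

Applied to the forward-strand alleles G/A this gives C/T, matching the alleles quoted in the literature for MTHFR.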

(d) Click on Population genetics in the side menu.
In all populations but two (from the 1000 Genomes and
HapMap projects), the allele G is the major one. The two
exceptions are: CLM (Colombian in Medellin; 1000 Genomes),
HCB (Han Chinese in Beijing, China; HapMap).

(e) Click on Phenotype Data in the left hand side menu.
The specific study where the association was originally
described is given in the Phenotype Data table. Click on
pubmed/20031578 for more details.

The association between rs1801133 and homocysteine levels is
described in the paper Novel associations of CPS1, MUT, NOX4
and DPEP1 with plasma homocysteine in a healthy population:
a genome-wide evaluation of 13,974 participants in the
Women's Genome Health Study (Pare et al, Circ Cardiovasc
Genet. 2009 Apr;2(2):142-50).

(f) Click on Phylogenetic Context in the side menu.

Select Alignment: 6 primates EPO and click Go.
Gorilla, orangutan, chimp, macaque and marmoset all have a G
in this position. Please note that there is no variation database
for gorilla and marmoset though.

(g) Go to http://neandertal.ensemblgenomes.org/ and type
rs1801133 in the Search Neandertal text box.
Click Go.
Click on rs1801133 on the results page.
Click on Jump to region in detail.
Click on Configure this page in the side menu.
Click on Variation features.
Select All variations Normal.
SAVE and close.
Draw a box of about 50 bp around rs1801133 (shown in yellow in the
centre of the display).
Click on Jump to region on the pop-up menu.
The Sequences track shows that there are four reads for
Neanderthal at the position of rs1801133, all with a G, so based
on these (very limited) data there is no evidence that both
alleles were already present in Neanderthal.

Exercise 17 - Structural variation in human

(a) Select Search: Human and type ccl3l1 in the search box.
Click Go.
Click on CCL3L1 (Human Gene) at the top.

(b) Click on Structural Variation in the side menu.
Yes, CNVs have been annotated for this gene by multiple
studies, as indicated by the many bars in the larger and smaller
structural variants tracks in the display. Details are given in the
table below the display.

Note: Can you do this with BioMart?

Exercise 18 - Exploring a SNP in mouse

(a) Go to www.ensembl.org, type rs29522348 in the search box. Click
on rs29522348 (Mouse Variation).
SNP rs29522348 is located on 17:73924993. In Ensembl, its
alleles are provided as in the forward strand.

(b) Click on HGVS names to reveal information about HGVS
nomenclature.
This SNP has got three HGVS names, one at the genomic DNA
level (17:g.73924993C>T), one at the transcript level
(c.721G>A) and one at the protein level (p.Val241Ile).

(c) In Ensembl, the allele that is present in the reference genome
assembly is always put first (C is the allele for the reference
mouse genome, strain C57BL/6J).

(d) Click on Individual genotypes in the left hand side menu. In the
summary of genotypes by population, click on Show for
PERLEGEN:MM_PANEL2, or search for the two strain names.
There are indeed differences between the genotypes reported
in those two different strains. The genotype reported in
NOD/LTJ is TT whereas in BALB/cByJ the genotype is CC.

VEP

Exercise 19 - VEP (Variant Effect Predictor tool)

(a) Go to www.ensembl.org and click on the Tools link at the top of
the page. Currently there are 5 tools listed on that page. Click on
Variant Effect Predictor and enter the three variants as below:
7 117171039 117171039 G/A
7 117171092 117171092 T/C
7 117171122 117171122 T/C

Note: Variation data input can be done in a variety of formats. See
more details here:
http://www.ensembl.org/info/docs/variation/vep/vep_formats.html
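
The three input lines above follow a whitespace-separated layout of chromosome, start, end and alleles. As a hedged sketch (not part of VEP itself; parse_vep_line and is_snv are hypothetical helper names), the following Python checks that each line is a well-formed single-nucleotide variant before you paste it into the tool. It assumes that a missing strand column means the forward strand and that the alleles are written as reference/alternative.

```python
def parse_vep_line(line):
    """Parse one 'chromosome start end alleles [strand]' input line."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"Expected at least 4 fields: {line!r}")
    chrom, start, end, alleles = fields[:4]
    # ASSUMPTION: no strand column means the forward strand.
    strand = fields[4] if len(fields) > 4 else "+"
    # ASSUMPTION: alleles are written as reference/alternative.
    ref, alt = alleles.split("/", 1)
    return {
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "ref": ref,
        "alt": alt,
        "strand": strand,
    }

def is_snv(variant):
    """True if the variant is a single-base substitution."""
    return (
        variant["start"] == variant["end"]
        and len(variant["ref"]) == 1
        and len(variant["alt"]) == 1
        and "-" not in (variant["ref"], variant["alt"])
    )

vep_input = """7 117171039 117171039 G/A
7 117171092 117171092 T/C
7 117171122 117171122 T/C"""

variants = [parse_vep_line(line) for line in vep_input.splitlines()]
for variant in variants:
    print(variant["chrom"], variant["start"], variant["ref"], ">",
          variant["alt"], "SNV" if is_snv(variant) else "not an SNV")
```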

Under the non-synonymous SNP predictions option, select prediction
only for SIFT and PolyPhen, then click Next.
The output format is either HTML or text. You will get a table
with the consequence terms from the Sequence Ontology
project (http://www.sequenceontology.org/) (i.e. synonymous,
missense, downstream, intronic, 5' UTR, 3' UTR, etc.) provided
by VEP for the listed SNPs. You can also upload the VEP results
as a track and view them on Location pages in Ensembl. SIFT
and PolyPhen are available for missense SNPs only. For two of
the entered positions, the variations have been predicted to be
probably damaging/deleterious (coordinate 117171092) and
benign/tolerated (coordinate 117171122). All three
variations have already been described and are known as
rs1800078, rs1800077 and rs35516286 in dbSNP and other
sources (databases, literature, etc.).

(b) In order to see your uploaded SNPs as a track in Region in detail,
you will need to choose a name for this upload (e.g. VEP) when
entering the data into the VEP tool. So you may need to enter the data
again. Once you have done that and given a name to the upload, click
on any link under the location column (in the VEP results table) to
see your newly added VEP track with the three variations in the
Location tab (or Region in detail view) in Ensembl.

Answers Comparative Genomics

Gene trees and homologues

Exercise 20 - Orthologues, paralogues and gene trees for the
human BRAF gene.

(a) Go to www.ensembl.org, choose human and search for BRAF. Click
through to the Gene tab view.

On the gene tab, click on Orthologues at the left side of the page to see
all the 63 orthologous genes.
There are orthologues in 8 primates.

The Target %ID is the percentage of amino acids in the Tarsier
protein (the orthologue) that are identical to the gene of
interest, human BRAF: 69%. The Query %ID is the percentage of
amino acids in the gene of interest (human BRAF) that are
identical to the orthologue (Tarsier BRAF): 62%.

Note that the difference between the Target %ID and Query %ID
values reflects the different protein lengths for the human and
tarsier BRAF genes.
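
A small worked example shows how one alignment can give two different percentages. The numbers below are invented round figures chosen only to make the arithmetic obvious; they are not the real BRAF values, and the function name percent_identity is hypothetical.

```python
def percent_identity(identical_positions, protein_length):
    """Percentage of a protein's residues that are identical in the alignment."""
    return 100.0 * identical_positions / protein_length

# HYPOTHETICAL values for illustration only.
identical = 300          # identical residues in the pairwise alignment
orthologue_length = 400  # length of the (shorter) orthologue protein
query_length = 500       # length of the protein for the gene of interest

# Same numerator, different denominators.
target_id = percent_identity(identical, orthologue_length)
query_id = percent_identity(identical, query_length)

print(f"Target %ID: {target_id:.1f}")
print(f"Query %ID:  {query_id:.1f}")
```

Because the count of identical residues is shared, the shorter protein always ends up with the higher percentage.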

(b) There is more than one way to get to the answer.
Option 1: Go to the orthologues page and click on the marmoset
orthologue to open the gene tab.
Click Genomic alignments at the left. Then select Alignment: Human
(Homo sapiens) lastz and click Go.
The red sequence is present in exons, so there is a gene in both
species in this region. You can find where the start and stop codons
are located if you configure this page and select START/STOP codons.

Option 2: Go to location tab of the marmoset BRAF gene and then
click on Region Comparison view at the left. Click on Select species or
regions at the left and click on the + to select Human (Homo sapiens)
lastz then save and close. You should see an alignment between the
human BRAF gene region and the BRAF gene region for the
marmoset.

(Note: To see a blue line connecting homologous genes in the
Region Comparison view page, click on configure this page and
under Comparative features select join genes. Zoom out on the
location view to see blue lines connecting all the homologous
genes between marmoset and human genes in that region).

Whole genome alignments

Exercise 21 - Zebrafish orthologues

(a) Start in the Location tab (region in detail) for dbh
(ENSDARG00000069446). Click on Alignments (Image) at the left,
and select the 5 teleost fish EPO alignment in the pull-down menu in
the view. The zebrafish, stickleback, medaka, fugu, and tetraodon are
shown in this region. All the species show a gene in the aligned
region. This can also be seen in the Alignments (text) page (the exons
are highlighted in red).

(b) You can export the alignments from either Alignments (images)
or Alignments (text) menus in the Location tab. Click on the blue
Export data button at the left, and choose Clustal from the list.

(c) Click on Region in detail in the left hand menu. Turn on the
multiple alignment, constrained elements and conservation score
tracks for 5 teleost fish EPO, all under the Comparative genomics
menu, by configuring the page.

The 5 teleost fish EPO track just shows that the whole region for the
dbh gene can be aligned among those five species of fish. The
Constrained elements and Conservation score tracks show where the
conserved sequence is located in the alignment.
Higher conservation regions match up with exonic regions
(exons tend to be highly conserved) of the gene. Note that there
are intronic regions that seem to be fairly conserved across the
species available.

Click on the Track name and the (information button) to read
more about constrained elements (or any other data track).

Exercise 22 - Synteny

(a) Change the species to dog next to the image.
Yes, there are multiple syntenic regions in dog to human
chromosome 3, which is in the centre of this view. Dog
chromosomes 6, 20, 23, 31, 33, and 34 have syntenic regions to
human chromosome 3.

(b) Scroll down to the bottom of the page.
There is a homologue in dog of human RHO. Click Centre on
gene RHO to compare the genes between human and dog in this
syntenic block.

Exercise 23 - Whole genome alignments

(a) Select Search: Human and type brca2 in the search box.
Click Go.
Click on 13:32889611-32973805:1 below BRCA2 (Human Gene).

You may want to turn off all tracks that you added to the display in
the previous exercises as follows:
Click Configure this page in the side menu.
SAVE and close.

(b) Click Configure this page in the side menu
Click on BLASTZ/LASTz alignments under the Comparative genomics
menu. Select Chicken (Gallus gallus) - BLASTZ_NET Normal,
Chimpanzee (Pan troglodytes) BLASTZ_NET Normal, Mouse (Mus
musculus) BLASTZ_NET - Normal and Platypus (Ornithorhynchus
anatinus) - BLASTZ_NET - Normal.
Click on Translated blat alignments. Select Anole Lizard (Anolis
carolinensis) - TRANSLATED_BLAT_NET - Normal and Zebrafish
(Danio rerio) - TRANSLATED_BLAT_NET Normal.
SAVE and close.
Yes, the degree of conservation does reflect the evolutionary
relationship between human and the other species; the highest
degree of conservation is found in chimp, followed by mouse,
platypus, chicken, lizard and zebrafish, respectively. Especially
the exonic sequences of BRCA2 seem to be highly conserved
between the various species, which is what is to be expected
because these are supposed to be under higher selection
pressure than intronic and intergenic sequences.

(c) Click Configure this page in the side menu.
Click on Conservation regions under the Comparative genomics
menu.
Select Conservation score for 37 eutherian mammals
EPO_LOW_COVERAGE, Conservation score for 21 amniota
vertebrates Pecan and Constrained elements for 21 amniota
vertebrates Pecan.
SAVE and close.
Both the Conservation score and Constrained elements tracks
largely correspond with the data seen in the pairwise
alignment tracks; all exons of the BRCA2 gene show a high
degree of conservation (note that the UTRs are not
conserved).

(d) Click on a constrained element (brown block).
Click on View alignments (text) in the pop-up menu.
Select Conservation regions: All conserved regions.
SAVE and close.

The conserved regions will be shown in light blue.

(e) Click on the Gene: BRCA2 tab.
Click on Genomic alignments under Comparative Genomics in the
side menu.
Select Alignment: 6 primates EPO.
Click Go.
Select Conservation regions: All conserved regions.
SAVE and close.

The conserved regions will be shown in light blue.
Answers Regulation

Exercise 24 - Gene regulation: Human STX7

(a) Search for human gene STX7 from the home page. Click on
Location in the search results.
Regulatory features from the Ensembl regulatory build are
based on indicators of open chromatin such as CTCF binding
sites, DNase I hypersensitive sites, and Transcription Factor
binding sites. The Regulatory features are turned on by default
in the Region in detail view.

There are many regulatory features mapping to the STX7
transcripts, including the 5' end.

Click on the Reg. Feats track name to jump to an article
explaining the underlying data. Click and drag the Reg. Feats
track next to the Genes (Merged Ensembl/Havana) track to
better compare where the Regulatory features (grey boxes) are
in the gene.

(b) See the legend below the Region in detail view to find that the
predicted enhancer segments are coloured in yellow. Two
appear in the HUVEC cell type only (out of the three cell types
chosen).

(c) Configure this page and click on Open chromatin & TFBS. Turn on
both peaks and signal for DNase 1 and FAIRE in HeLa-S3 cells (the
boxes in this configure this page window will turn blue; for more
information on how to select and view the supporting data, click on
Show tutorial in the pop up window). Close the menu.
There are two DNase 1 hypersensitive sites in the 5' exon of
STX7. Click on the coloured block to find out that the DNase 1
enriched sites in HeLa-S3 cells come from the ENCODE project.
There is no FAIRE site known in this region.

(d) Configure this page and click on Histones & polymerases. Change
the Filter by menu from All classes to Histone. Select all the
histone modifications available for HeLa cells (some of them might be
on by default). Save and close the menu.
H3K4me3, H3K9ac and H3K27ac sites have been found in the 5'
region of STX7 in HeLa-S3 cells.

(e) Click on configure this page and choose the DNA Methylation
menu. Scroll down to Enable/disable all External data then turn on
the first track in the list (MeDIP-chip B-cells). Save and close the
menu.
The CpG sites at the 5' end of STX7 are not highly methylated
(note the yellow/green bars). Yellow, green, and blue bars
represent unmethylated, intermediately methylated, and
methylated regions, respectively. For more information on
human DNA methylation DAS tracks, see:
www.ensembl.org/info/docs/funcgen/index.html

(f) Click Share this page in the side menu.
Go into your email account and compose an email to yourself.
Paste the link in, then send.
Open the email and click on your link.

Exercise 25 - Regulatory features in human

(a) Select Search: Human and type 6:32540000-32620000 in the search
box.
Click Go.


SAVE and close.

(b) You can click on all the regulatory features shown in the Reg.
Feats track that are located in the intergenic region of those genes.
The resulting pop-up window for each of those will show the core
attributes underlying the regulatory features.
Yes, there is one regulatory feature around coordinates
32589947-32591273 that has CTCF binding data as part of its
core evidence. Its ID is ENSR00000488025.

(c) Click Configure this page in the side menu.
Click on Regulation Open chromatin & TFBS.
Select MultiCell - Track style: Peaks.
SAVE and close.
CTCF binding has been detected at this position in eleven of the
cell/tissue types analysed. (CD4, GM06990, GM12878, H1ESC,
HMEC, HSMM, HUVEC, HeLa-S3, HepG2, NH-A, NHEK)

(d) Click Configure this page in the side menu.
Click on Regulation Histones & polymerases.
According to the Histones & Polymerases configuration matrix
the most information on histone acetylation is available for CD4
cells.

Hover over CD4 in the Histones & Polymerases configuration matrix.
Select Select features for CD4 - All.
SAVE and close.
Yes, the region that shows CTCF binding is also a region of high
acetylation of histone 2A, 2B, 3 and 4 in CD4 cells.

Answers Advanced exercise

Methylation data in human

(a) Select Search: Human and type PDHA2 in the for text box.
Click Go.
Click on 4:96761239-96762625:1.
Zoom out one step, so that the 5kb region around the PDHA2 gene is
shown.


SAVE and close.

(b) Click Configure this page in the side menu.
Type cpg in the Find a track box.
Select CpG islands.
SAVE and close.
No CpG islands are shown. Because a minimum length of 400 bp
is required for CpG islands to be included in the Ensembl
database for human, the reason could be that the CpG islands
in the PDHA2 gene are shorter than 400 bp. However, there is a
%GC track, which shows that the region comprising the 5'
part of the PDHA2 gene and the region directly upstream of the
gene has a high %GC (the red line in the %GC track indicates
50% GC). It is difficult or impossible to distinguish individual
CpG islands in this track, though.

(c) Click Export data in the side menu.
Click Next>.
Click on Text.
Select and copy the sequence.
Go to http://www.ebi.ac.uk/Tools/emboss/cpgplot/index.html.
Paste the sequence into the text box.
Click Run.
CpGPlot does confirm the existence of two CpG islands in the
PDHA2 gene region of lengths 200 and 263 bp, respectively. So,
it is indeed because of their length being less than 400 bp that
these CpG islands are not present in the Ensembl database.
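
The two quantities that CpG island finders typically report, %GC and the observed/expected CpG ratio, are easy to compute by hand. The Python sketch below is not CpGPlot's algorithm (which scans the sequence in windows); it only computes both values for a whole sequence, using the usual definition observed/expected = (number of CG dinucleotides x sequence length) / (number of C x number of G). The function name cpg_stats and the test sequences are made up. Thresholds often quoted for calling an island (such as %GC above 50 and a ratio above 0.6) are assumptions here and are not taken from this coursebook.

```python
def cpg_stats(sequence):
    """Return (gc_percent, observed_expected_cpg) for a DNA sequence."""
    seq = sequence.upper()
    length = len(seq)
    if length == 0:
        return 0.0, 0.0
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # CG cannot overlap itself, so count() is exact
    gc_percent = 100.0 * (c_count + g_count) / length
    if c_count == 0 or g_count == 0:
        return gc_percent, 0.0
    observed_expected = (cpg_count * length) / (c_count * g_count)
    return gc_percent, observed_expected

# Toy sequences, not taken from PDHA2.
for name, seq in [("cpg_rich", "CGCGCGCG"), ("at_rich", "ATATATAT")]:
    gc, ratio = cpg_stats(seq)
    print(f"{name}: %GC={gc:.1f} obs/exp={ratio:.2f}")
```

To reproduce the exercise you would paste the sequence exported in step (c) into cpg_stats, bearing in mind that a whole-region value will smooth over short islands.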

(d) Click Add your data in the side menu (Note that if you have
previously uploaded data to Ensembl, this box will say Manage your
data instead).
Click on Upload Data.
Type CpG islands in the Name for this upload (optional) box.
Select Data format: BED.
Copy the following into the Paste file box:

chr4 96761176 96761375 cpg_island_1
chr4 96761500 96761762 cpg_island_2

Click Upload.
Click on Go to nearest region with data: 4:96701276-96811276.
The two CpG islands should now be shown on the Region in
detail page. They should coincide with the regions of high %GC.

Zoom in on the two CpG islands.

To display the names of the CpG islands:

Hover over the CpG islands track name.
Hover over the icon of the cog-wheel.
Select Labels.
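
As a sanity check on the pasted coordinates, the short Python sketch below computes the length of each feature under two readings. The BED specification defines the start as 0-based and the end as exclusive, which gives 199 and 262 bp. Reading both numbers as 1-based inclusive positions gives 200 and 263 bp, the lengths reported by CpGPlot above. How the browser interprets the pasted values is not stated in this coursebook, so treat the comparison as an illustration. The function names are hypothetical.

```python
bed_text = """chr4 96761176 96761375 cpg_island_1
chr4 96761500 96761762 cpg_island_2"""

def parse_bed_line(line):
    """Split a 4-column BED line into (chrom, start, end, name)."""
    chrom, start, end, name = line.split()[:4]
    return chrom, int(start), int(end), name

def to_bed_interval(start_1based, end_1based):
    """Convert 1-based inclusive coordinates to BED (0-based, end-exclusive)."""
    return start_1based - 1, end_1based

features = [parse_bed_line(line) for line in bed_text.splitlines()]
for chrom, start, end, name in features:
    half_open_length = end - start        # strict BED reading
    inclusive_length = end - start + 1    # 1-based inclusive reading
    print(name, half_open_length, inclusive_length)
```

If the CpGPlot positions are 1-based, to_bed_interval shows the adjustment (subtract one from the start) that a strict BED file would need.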

(e) Drag your CpG islands track so that it is next to the %GC track.
Click Share this page in the side menu.
Paste into your internet browser to view.

(f) Click Configure this page in the side menu.
Click on Regulation DNA Methylation.
Select all MeDIP tracks in Normal mode.
SAVE and close.
Yellow, green and blue represent unmethylated, intermediately
methylated and methylated regions, respectively (see the
Methylation Legend at the bottom of the page). It can be seen
that the region around the 5' part of the PDHA2 gene is
methylated in all assayed tissues and cell lines, except in sperm.
The MeDIP-seq track for sperm shows that the unmethylated
regions coincide with the CpG islands found by CpGPlot.

(g) Click on Configure this page, then select RNASeq models. Turn on
the BAM files for all the tissues in Coverage only.
You will see histograms of RNASeq coverage for each of the
tissues. All of these histograms appear to be the same height,
but the numbers at the left indicate the peak. The largest
number is for the merged reads, 10,048. For the tissue-specific
reads, Testes have a peak of 1850, higher than all the other
tissues. There are also more, and wider, peaks in the Testes track.
The unmethylated CpG islands in sperm suggest that this gene is
negatively regulated by CpG island methylation.

(h) Click on Configure this page, then select Comparative genomics.
Turn on the tracks for the Constrained elements for 37 eutherian
mammals and Conservation score for 37 eutherian mammals.
The region of the gene itself has high GERP scores, indicated by
constrained elements over most of the gene. There is no
apparent difference in the conservation score between the CpG
islands and their flanking regions.

(i) Click on the Transcript Tab, Transcript: PDHA2-001 and select
Ontology table.
There are ten terms in the table, the first being GO:0006090,
pyruvate metabolic process.

To export the list use BioMart.
Click on BioMart in the top bar.
Choose Ensembl Genes 73 and Homo sapiens genes (GRCh37).

Click on Filters.
Open the menu for GENE ONTOLOGY.
Select GO Term Accession and put GO:0006090 into the box.

Click on Attributes.
Choose Sequences.
Expand SEQUENCES and select Unspliced (Gene).
Expand Header information and deselect Ensembl Transcript ID.

Click Results.
You can export these results if you wish.

(j) Go to the REST API documentation page at
http://beta.rest.ensembl.org/documentation.
Click on GET sequence/id/:id to get the documentation for this
command.

You will need the stable ID of PDHA2. Go to the browser page to find
that it is ENSG00000163114.

Use the documentation to construct a URL in the correct form, i.e.:
http://beta.rest.ensembl.org/sequence/id/:id?format=fasta

Add the ID to the URL to create:
http://beta.rest.ensembl.org/sequence/id/ENSG00000163114?format=fasta

This URL will give you the sequence.
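
The same URL can be assembled and fetched from a script. The Python sketch below uses only the standard library. The base URL is the one given in this coursebook and may have moved since; the function names sequence_url and fetch_sequence are hypothetical. The fetch helper is defined but not called, so the example runs without network access.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Base URL as given in this coursebook; ASSUMPTION: it may have changed.
BASE_URL = "http://beta.rest.ensembl.org"

def sequence_url(stable_id, fmt="fasta", base_url=BASE_URL):
    """Build the sequence/id/:id URL for a stable ID."""
    query = urlencode({"format": fmt})
    return f"{base_url}/sequence/id/{quote(stable_id)}?{query}"

def fetch_sequence(stable_id, timeout=30):
    """Download the sequence text (requires network access)."""
    with urlopen(sequence_url(stable_id), timeout=timeout) as response:
        return response.read().decode("utf-8")

url = sequence_url("ENSG00000163114")
print(url)
```

Calling fetch_sequence("ENSG00000163114") would then return the FASTA text, provided the server is reachable at that address.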

Quick Guide to Databases and Projects

Here is a list of databases and projects you will come across in these
exercises. Google any of these to learn more. Projects include many
species, unless otherwise noted.

Other help:
The Ensembl Glossary: http://www.ensembl.org/Help/Glossary
Ensembl FAQs: http://www.ensembl.org/Help/Faq

SEQUENCES
EMBL-Bank, NCBI GenBank, DDBJ - Contain nucleic acid sequences
deposited by submitters such as wet-lab biologists and gene
sequencing projects. These three databases are synchronised with
each other every day, so the same sequences should be found in each.

CCDS - Coding sequences that are agreed upon by Ensembl,
VEGA-Havana, UCSC, and NCBI (human and mouse).

NCBI Entrez Gene - NCBI's gene collection.

NCBI RefSeq - NCBI's collection of reference sequences; includes
genomic DNA, transcripts and proteins. The prefix NM denotes known
mRNAs (e.g. NM_005476) and NP denotes known proteins
(e.g. NP_005467).

UniProtKB - The Protein knowledgebase, a comprehensive set of
protein sequences. Divided into two parts: Swiss-Prot and TrEMBL.

UniProt Swiss-Prot - The manually annotated, reviewed protein
sequences in the UniProtKB. High quality.

UniProt TrEMBL - The automatically annotated, unreviewed set of
proteins (EMBL-Bank translated). Varying quality.

VEGA - Vertebrate Genome Annotation, a selection of manually
curated genes, transcripts, and proteins (human, mouse, zebrafish,
gorilla, wallaby, pig, and dog).

VEGA-HAVANA - The main contributor to the VEGA project, located
at the Wellcome Trust Sanger Institute, Hinxton, UK.

GENE NAMES

HGNC - HUGO Gene Nomenclature Committee, a project assigning a
unique and meaningful name and symbol to every human gene
(human).

ZFIN - The Zebrafish Model Organism Database. Gene names are only
one part of this project (zebrafish).

PROTEIN SIGNATURES
InterPro - A collection of domains, motifs, and other protein
signatures. Protein signature records are extensive, and combine
information from individual projects such as UniProt, along with
other databases such as SMART, PFAM and PROSITE (explained
below).

PFAM - A collection of protein families.

PROSITE - A collection of protein domains, families, and functional
sites.

SMART - A collection of evolutionarily conserved protein domains.

OTHER PROJECTS
NCBI dbSNP - A collection of sequence polymorphisms; mainly
single nucleotide polymorphisms, along with insertion-deletions.

NCBI OMIM - Online Mendelian Inheritance in Man, a resource
showing phenotypes and diseases related to genes (human).
