Genomes with Ensembl
www.ensembl.org
www.ensemblgenomes.org
Coursebook v74
URL
Place - date
TABLE OF CONTENTS

BioMart
  Demo: BioMart
  Exercises: BioMart
Variation
  Demo: Exploring variants in Ensembl
  Exercises: Exploring variants in Ensembl
  Demo: The Variant Effect Predictor (VEP)
  Exercise: The Variant Effect Predictor (VEP)
Regulation
  Demo: Raw ChIPSeq data
  Demo: Regulatory features and segmentation
  Exercises: Regulation
Introduction to Ensembl

Getting started with Ensembl
www.ensembl.org

Ensembl is a joint project between the EBI (European Bioinformatics Institute) and the Wellcome Trust Sanger Institute that annotates chordate genomes (i.e. vertebrates and closely related invertebrates with a notochord, such as sea squirt). Gene sets from model organisms such as yeast and worm are also imported for comparative analysis by the Ensembl Compara team. Most annotation is updated every two months, leading to increasing Ensembl versions (such as version 74); the gene sets, however, are determined less frequently. A sister browser at www.ensemblgenomes.org is set up to access non-chordates, namely bacteria, plants, fungi, metazoa, and protists.

Ensembl provides genes and other annotation such as regulatory regions, conserved base pairs across species, and sequence variations. The Ensembl gene set is based on protein and mRNA evidence in the UniProtKB and NCBI RefSeq databases, along with manual annotation from the VEGA/Havana group. All the data are freely available and can be accessed via the web browser at www.ensembl.org. Perl programmers can directly access Ensembl databases through Application Programming Interfaces (Perl APIs). Gene sequences can be downloaded from the Ensembl browser itself, or through the BioMart web interface, which can extract information from the Ensembl databases without the user needing any programming knowledge.
Synopsis

What can I do with Ensembl?

View genes with other annotation along the chromosome.
View alternative transcripts (i.e. splice variants) for a given gene.
Explore homologues and phylogenetic trees across more than 60 species for any gene.
Compare whole genome alignments and conserved regions across species.
View microarray sequences that match to Ensembl genes.
View ESTs, clones, mRNA and proteins for any chromosomal region.
Examine single nucleotide polymorphisms (SNPs) for a gene or chromosomal region.
View SNPs across strains (rat, mouse), populations (human), or breeds (dog).
View positions and sequence of mRNAs and proteins that align with Ensembl genes.
Upload your own data.
Use BLAST or BLAT against any Ensembl genome.
Export sequence or create a table of gene information with BioMart.
Determine how your variants affect genes and transcripts using the Variant Effect Predictor.
Share Ensembl views with your colleagues and collaborators.
Need more help?

Check Ensembl documentation.
Stay in touch! Email the team with comments or questions at helpdesk@ensembl.org

Further reading

Flicek, P. et al. Ensembl 2013. Nucleic Acids Res. Advanced Access (Database Issue). http://www.ncbi.nlm.nih.gov/pubmed/23203987
Ensembl Methods Series. http://www.biomedcentral.com/series/ENSEMBL2010
Exploring the Ensembl genome browser

Demo: Ensembl species

The front page of Ensembl is found at ensembl.org. It contains lots of information and links to help you navigate Ensembl:

[Figure: the Ensembl front page. Labels: link back to homepage; Ensembl tools; blue bar remains visible on every Ensembl page; search boxes; news; drop-down list of species; how-tos for commonly used Ensembl features.]

Click on View full list of all Ensembl species.

Click on the common name of your species of interest to go to the species homepage. We'll click on Human.
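[Figure: the Human species homepage. Labels: news; search; information and statistics; links to example features in Ensembl.]

To find out more about the genome assembly and genebuild, click on More information and statistics.

[Figure: the More information and statistics page. Labels: information; tables of statistics.]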
Let's take a look at the Ensembl Genomes homepage at ensemblgenomes.org.

[Figure: the Ensembl Genomes homepage. Labels: links to the taxa-specific sites; link back to Ensembl; news.]

Click on the different taxa to see their homepages. Each one is colour-coded.

[Figures: the Protists, Fungi, Metazoa, Plants and Bacteria homepages.]

You can navigate most of the taxa in the same way as you would with Ensembl, but Ensembl Bacteria has a large number of genomes, so it needs slightly different methods. Let's look at it in more detail.
There's no full species list for bacteria, as it would be hard to navigate with the number of species. To find a species, start to type the species name into the species search box. A drop-down list will appear with possible species.

For example, to find a substrain of Clostridium difficile, type in Clostridium d.

The drop-down contains various strains of Clostridium difficile. Let's choose Clostridium difficile 630. This will take us to another species homepage, where we can explore various features.
Exercises: Ensembl species

Exercise 1 - Panda

(a) Go to the species homepage for Panda. What is the name of the genome assembly for Panda?

(b) Click on More information and statistics. How long is the Panda genome (in bp)? How many genes have been annotated?

Exercise 2 - Zebrafish

(a) What's new in release 74 for zebrafish?

(b) What previous assembly is available for zebrafish?

Exercise 3 - Mosquitos

(a) Go to Ensembl Metazoa. How many species of the genus Anopheles are there?

(b) Who published the genome sequence for Anopheles gambiae?

Exercise 4 - Bacteria

Go to Ensembl Bacteria and find the species Belliella baltica. How many coding and non-coding genes does it have?
Demo: The Region in detail view

Start at the Ensembl front page, ensembl.org. You can search for a region by typing it into a search box, but you have to specify the species.

Type (or copy and paste) human 4:123792818-123867893 into either search box.
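
The location string typed above follows a chromosome:start-end pattern. As a minimal sketch (not part of Ensembl itself), the Python below splits such a string into its parts and reports the region length. The `Region` type, the `parse_region` helper and the regular expression are hypothetical names introduced for illustration, and coordinates are assumed to be 1-based and inclusive.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    """A genomic region; coordinates are assumed 1-based and inclusive."""
    chromosome: str
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start + 1


# Hypothetical pattern: name, colon, start, hyphen, end (commas allowed in numbers).
_REGION_RE = re.compile(r"^\s*(\w+)\s*:\s*([\d,]+)\s*-\s*([\d,]+)\s*$")


def parse_region(text: str) -> Region:
    """Turn a string such as '4:123792818-123867893' into a Region."""
    match = _REGION_RE.match(text)
    if match is None:
        raise ValueError(f"not a region string: {text!r}")
    chromosome, start, end = match.groups()
    region = Region(chromosome, int(start.replace(",", "")), int(end.replace(",", "")))
    if region.start > region.end:
        raise ValueError("start must not exceed end")
    return region


region = parse_region("4:123792818-123867893")
print(region.chromosome, region.start, region.end, region.length)
```

The species name ("human") is left out of the parsed string here; in the browser it is still needed so that the search knows which genome to use.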
[Figure: the two search boxes on the Ensembl front page.]

Press Enter or click Go to jump directly to the Region in detail page.

Click on the help button to view page-specific help. The help pages provide links to Frequently Asked Questions, a Glossary, Video Tutorials, and a form to Contact HelpDesk. There is a help video on this page at http://youtu.be/tTKEvgPUq94.

[Figure: the Region in detail page. Labels: Location views; chromosome; page-specific help; scrollable 1Mb view; tool buttons; region of interest in detail.]
The Region in detail page is made up of three images. Let's look at each one in detail.

The first image shows the chromosome:

[Figure: the chromosome image. Labels: haplotypes and patches; our position; chromosome bands.]

You can jump to a different region by dragging out a box in this image. Drag out a box on the chromosome and a pop-up menu will appear.

[Figure: the chromosome image with a box dragged out.]

If you wanted to move to the region, you could click on Jump to region (###bp). For now, we'll close the pop-up by clicking on the X in the corner.

The second image shows a 1Mb region around our selected region. This view allows you to scroll back and forth along the chromosome.

[Figure: the 1Mb view. Labels: region of interest; scrolling buttons; blocks represent genes, names are shown bottom left.]
At the moment the gene track is set to a fixed height. Click on the Automatic track height button to expand the image to include all possible data in the track.

Scroll along the chromosome by clicking and dragging within the image. As you do this you'll see the image below grey out and two blue buttons appear. Clicking on Update this image would jump the lower image to the region central to the scrollable image. We want to go back to where we started, so we'll click on Reset scrollable image.

You can also drag out and jump to a region. Either hold down shift and drag in the image, or click on the Drag/Select button to change the action of your mouse click, and drag out a box.

Click on the X to close the pop-up menu.

The third image is a detailed, configurable view of the region.
[Figure: the detailed view. Labels: forward-stranded transcripts; blue bar is the genome; reverse-stranded transcripts; track names; click and drag the position of tracks; legends.]

We can edit what we see on this page by clicking on the blue Configure this page menu at the left.

This will open a menu that allows you to change the image. You can put some tracks on in different styles; more details are in this FAQ: http://www.ensembl.org/Help/Faq?id=335.
[Figure: the configuration menu. Labels: configuration tabs; search for tracks; track categories; track information; track names; turn tracks on/off and change style.]

Let's add some tracks to this image. Add:
Human proteins - Labels

Now click on the tick in the top left-hand corner to save and close the menu. Alternatively, click anywhere outside of the menu.

We can now see the tracks in the image.

We can also change the way the tracks appear by hovering over the track name and then clicking the cog wheel to open a menu. We can move tracks around by clicking and dragging on the bar to the left of the track name.
Now that you've got the view how you want it, you might like to show something you've found to a colleague or collaborator. Click on the Share this page button to generate a link. Email the link to someone else, so that they can see the same view as you, including all the tracks you've added. These links contain the Ensembl release number, so if a new release or even a new assembly comes out, your link will just take you to the archive site for the release it was made on.

To return this to the default view, go to Configure this page and select Reset configuration at the bottom of the menu.
Exercises: The Region in detail view

Exercise 5 - Exploring a genomic region in human

(a) Go to the region from 32,448,000 to 33,198,000 bp on human chromosome 13. On which cytogenetic band is this region located? How many contigs make up this portion of the assembly (contigs are contiguous stretches of DNA sequence that have been assembled solely based on direct sequencing information)?

(b) Zoom in on the BRCA2 gene.

(c) Turn on the Tilepath track in this view. What is this track? Are there any Tilepath clones that contain the complete BRCA2 gene?

(d) Create a Share link for this display. Email it to yourself and open the link.

(e) Export the genomic sequence of the region you are looking at in FASTA format.

(f) Turn off all tracks you added to the Region in detail page.
Exercise 6 - Exploring patches and haplotypes in human

(a) Go to the region 6:112294691-112624977 in human. What is the green highlighted region? (Tip: if you see a word or phrase you don't know in Ensembl, search for it to see help pages.)

(b) Can you see the patches in the chromosome view? Drag out a box to jump to a region containing the leftmost patch on this chromosome, named HG27_patch (note: you must drag out a region smaller than 1Mb). What are the coordinates of the patch?

(c) Can you compare this patch with the reference? What has changed between this patch and the sequence it replaced?

(d) Go back to the Region in detail and scroll to the right in the 1Mb view until you reach a red highlighted region. What is this?
Genes and transcripts

Demo: The gene tab

If you click on any one of the transcripts in the Region in detail image, a pop-up menu will appear, allowing you to jump directly to that gene or transcript.

[Figure: the transcript pop-up menu. Label: links.]

Another way to go to a gene of interest is to search directly for it.

We're going to look at the human ESPN gene. This gene encodes a multifunctional actin-bundling protein with a major role in mediating sensory transduction in various mechanosensory and chemosensory cells. Mutations in this gene are associated with deafness (http://tinyurl.com/espn-ncbi-gene).

From ensembl.org, type ESPN into the search bar and click the Go button. You will get a list of hits with the human gene at the top.

When you search for something without specifying the species, or where the ID is not restricted to a single species, the most popular species will appear first; in this case, human, mouse and zebrafish appear first. You can restrict your query to species or features of interest using the options on the left.
[Figure: the search results page. Label: links.]

Click on the gene name or Ensembl ID. The Gene tab should open:

[Figure: the Gene tab. Labels: Gene tab; option to open the table of transcripts; ESPN-001 transcript, click for info; Gene views; forward-stranded transcripts; blue bar is the genome; reverse-stranded transcripts.]
Let's walk through some of the links in the left-hand navigation column. How can we view the genomic sequence? Click Sequence at the left of the page.

[Figure: the Gene tab. Labels: most recent human genome assembly, GRCh37 = hg19; click Sequence.]

[Figure: the Sequence view. Labels: upstream sequence; exon of an overlapping gene; ESPN exon.]

The sequence is shown in FASTA format. Take a look at the FASTA header:

[Figure: the FASTA header, annotated with its fields: chromosome; name of genome assembly; base pair start; base pair end; forward strand (-1 is reverse).]
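
The fields named in the FASTA header can be pulled apart programmatically. The sketch below assumes the location is written as one colon-separated token in the order coordinate system, assembly, chromosome name, start, end, strand; that layout, the `FastaLocation` class, the `parse_location_token` helper and the example header (with made-up coordinates) are all assumptions for illustration rather than a statement of the exact export format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FastaLocation:
    coord_system: str  # e.g. "chromosome"
    assembly: str      # name of the genome assembly
    name: str          # chromosome name
    start: int         # base pair start
    end: int           # base pair end
    strand: int        # 1 = forward strand, -1 = reverse strand


def parse_location_token(token: str) -> FastaLocation:
    """Parse an assumed 'coord_system:assembly:name:start:end:strand' token."""
    parts = token.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 colon-separated fields, got {len(parts)}")
    coord_system, assembly, name, start, end, strand = parts
    strand_value = int(strand)
    if strand_value not in (1, -1):
        raise ValueError("strand must be 1 or -1")
    return FastaLocation(coord_system, assembly, name, int(start), int(end), strand_value)


# Hypothetical header line; the coordinates are invented for illustration.
header = ">chromosome:GRCh37:1:1000:2000:1"
location = parse_location_token(header.lstrip(">").split()[-1])
print(location)
```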
Exons are highlighted within the genomic sequence. Variations can be added with the Configure this page link found at the left. Click on it now.

[Figure: the configuration menu. Labels: show variants; turn on line numbers.]

Once you have selected changes (in this example, Show variations and Line numbering), click the tick at the top right.

[Figure: the Sequence view with variants. Label: links to the Variation tab.]

Let's look at where our gene is expressed. Click on Expression in the left-hand menu.
Hover over the column titles for a pop-up definition.

Can our gene be found in other databases? Go up the left-hand menu to External references:

This contains links to the gene in other projects, such as EntrezGene.

To find out more about the individual transcripts of this gene, click on Transcript comparison in the left-hand menu.
You must now choose the transcripts you'd like to see. Click on the blue Select transcripts button.

[Figure: the Select transcripts menu. Label: click on the + to add a transcript.]

Let's select all the protein-coding transcripts, then close the menu.

[Figure: the Transcript comparison view. Labels: legend; gene sequence; transcript sequences.]
Demo: The transcript tab

Let's now explore one splice isoform. Click on Show transcript table at the top.

Click on the ID for the largest one, ESPN-001 (ENST00000377828).
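
Identifiers such as ENST00000377828 (a transcript), ENSG00000001626 (a gene) and ENSCSAVG00000000002 (a Ciona savignyi gene) appear throughout this coursebook. As a rough sketch, assuming the common layout of "ENS", an optional species prefix, one feature letter and an 11-digit number, the Python below splits an ID into those parts. The `parse_stable_id` helper and the `FEATURE_TYPES` table are hypothetical and cover only gene, transcript, protein and exon IDs.

```python
import re
from typing import NamedTuple

# Assumed feature letters; other ID types exist and are not handled here.
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "protein", "E": "exon"}

_STABLE_ID_RE = re.compile(r"^ENS([A-Z]*)([GTPE])(\d{11})$")


class StableId(NamedTuple):
    species_prefix: str  # empty for human
    feature: str
    number: int


def parse_stable_id(stable_id: str) -> StableId:
    """Split an Ensembl-style stable ID into prefix, feature type and number."""
    match = _STABLE_ID_RE.match(stable_id)
    if match is None:
        raise ValueError(f"unrecognised stable ID: {stable_id!r}")
    prefix, letter, digits = match.groups()
    return StableId(prefix, FEATURE_TYPES[letter], int(digits))


for example in ("ENST00000377828", "ENSG00000001626", "ENSCSAVG00000000002"):
    print(example, parse_stable_id(example))
```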
You are now in the Transcript tab for ESPN-001. The left-hand navigation column provides several options for the transcript ESPN-001. Click on the Exons link.

[Figure: the Exons view. Colour key: green, flanking sequence; grey, coding sequence; purple, UTR; blue, introns.]

You may want to change the display (for example, to show more flanking sequence, or to show full introns). To do so, click on Configure this page and change the display options accordingly.
If you would like to export the sequence, including the colours, click Download view as RTF. A Rich Text Format document will be generated that can be opened in a word processor such as MS Word.

Now click on the cDNA link to see the spliced transcript sequence.

[Figure: the left-hand menu. Label: click cDNA.]
UnTranslated Regions (UTRs) are highlighted in dark yellow, codons are highlighted in light yellow, and exon sequence is shown in black or blue letters to show exon divides. Sequence variants are represented by highlighted nucleotides, and clickable IUPAC codes are shown above the sequence.
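
An IUPAC code is a single letter that stands for a set of possible bases, which is how a variant with two or more alleles can be drawn as one character above the sequence. The sketch below maps a slash-separated allele string to its standard IUPAC nucleotide ambiguity code; the `iupac_code` helper is a hypothetical name, and only single-base alleles are handled (insertions and deletions are out of scope).

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the sorted set of bases.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V",
    "ACGT": "N",
}


def iupac_code(alleles: str) -> str:
    """Return the IUPAC code for alleles written like 'A/G'."""
    bases = {allele.strip().upper() for allele in alleles.split("/")}
    if not bases or not all(base in ("A", "C", "G", "T") for base in bases):
        raise ValueError(f"only single-base alleles are supported: {alleles!r}")
    return IUPAC_CODES["".join(sorted(bases))]


print(iupac_code("A/G"))  # R (a purine: A or G)
print(iupac_code("T/C"))  # Y (a pyrimidine: C or T)
```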
Next, follow the General identifiers link at the left.

This page shows information from other databases such as RefSeq, UniProtKB, CCDS and others, that match to the Ensembl transcript and protein.
Click on Ontology table to see GO terms from the Gene Ontology consortium (www.geneontology.org).

Click on the icon to see a guide to the three-letter Evidence codes.

Now click on Protein summary to view domains from Pfam, PROSITE, Superfamily, InterPro, and more.

[Figure: the Protein summary view. Labels: Ensembl ESPN protein; protein domains.]

Clicking on Domains & features shows a table of this information.
Exercises: Genes and transcripts

Exercise 7 - Exploring the human MYH9 gene

(a) Find the human MYH9 (myosin, heavy chain 9, non-muscle) gene, and go to the Gene tab. On which chromosome and which strand of the genome is this gene located?

(d) Are there microarray (oligo) probes that can be used to monitor ENST00000216181 expression?
Exercise 8 - Finding a gene associated with a phenotype

Phenylketonuria is a genetic disorder caused by an inability to metabolise phenylalanine in any body tissue. This results in an accumulation of phenylalanine, causing seizures and mental retardation.

(a) Search for phenylketonuria from the Ensembl homepage. What gene is associated with this disorder?

(b) What tissues is this gene expressed in? Is this surprising, given the gene's role in disease? What is meant by Intron-spanning reads and RNASeq alignments?

(c) How many protein coding transcripts does this gene have? View all of these in the transcript comparison view.

(d) What is the MIM disease identifier for this gene?
Exercise 9 - Exploring a plant gene (Vitis vinifera, grape)

Start in http://plants.ensembl.org/index.html and select the Vitis vinifera genome.

(a) What GO: biological process terms are associated with the MADS4 gene?

(b) Go to the transcript tab for the only transcript, Vv01s0010g03900.t01. How many exons does it have? Which one is the longest? How much of that is coding?

(c) What domains can be found in the protein product of this transcript? How many different domain prediction methods agree with each of these domains?
BioMart

Demo: BioMart

Follow these instructions to guide you through BioMart to answer the following query:

You have three questions about a set of human genes: ESPN, MYH9, USH1C, CISD2, THRB, DFNB31 (these are HGNC gene symbols; more details on the HUGO Gene Nomenclature Committee can be found at http://www.genenames.org).

1) What are the EntrezGene IDs for these genes?
2) Are there associated functions from the GO (gene ontology) project that might help describe their function?
3) What are their cDNA sequences?

STEP 1: Click on BioMart in the top header of a www.ensembl.org page to go to www.ensembl.org/biomart/martview

NOTE: These answers were determined using BioMart Ensembl 74.

STEP 2: Choose Ensembl Genes 74 as the primary database.

STEP 3: Choose Homo sapiens genes as the dataset.
STEP 4: Click Filters at the left. Expand the GENE panel.

STEP 5: In ID List Limit, paste in your gene symbols. Change the heading to read HGNC symbol(s) [e.g. ZFY].

STEP 6: Click Count to see that BioMart is reading 6 genes out of 64,138 possible H. sapiens genes. Since we entered 6 gene symbols, this confirms that our filters have worked correctly.
STEP 7: Click on Attributes to select output options (e.g. GO terms).

STEP 8: Expand the EXTERNAL panel.

STEP 9: Scroll down to select EntrezGene ID (to answer question 1).

STEP 10: Also select HGNC symbol to see the input gene symbols we started with.

STEP 11: Scroll back up to select GO term fields (to answer question 2).

STEP 12: Click Results.
Why are there multiple rows for one gene ID? For example, look at the first few rows.
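
One way to think about the repeated rows: a results table of this kind has one row per combination of the selected attributes, so a gene annotated with several GO terms appears once per term. The Python sketch below collapses such rows back to one entry per gene. The rows are invented placeholders, not real BioMart output, and `terms_by_gene` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def terms_by_gene(rows: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (gene, term) rows into one list of distinct terms per gene."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for gene, term in rows:
        terms = grouped[gene]  # creates the entry even if the term is empty
        if term and term not in terms:
            terms.append(term)
    return dict(grouped)


# Placeholder rows for illustration only: one row per gene/term pair.
rows = [
    ("GENE_1", "term A"),
    ("GENE_1", "term B"),
    ("GENE_2", "term A"),
    ("GENE_1", "term B"),  # an exact repeat is kept only once
]

for gene, terms in terms_by_gene(rows).items():
    print(gene, len(terms), terms)
```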
STEP 13: Click Attributes again.

STEP 14: Select Sequences at the top, then expand SEQUENCES and choose the option cDNA sequences (to answer question 3).

STEP 15: Expand Header Information to select the Associated Gene Name (this is the official gene name; for human it is the HGNC symbol, which was our original input).
STEP 16: Click Results to see the cDNA sequences in FASTA format.

STEP 17: Change View 10 rows to View All rows so that you see the full table.

Note: Pop-up blocking must be switched off in your browser.

Note: you can use the Go button to export a file.
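
The same dataset, filter and attribute choices made through the web form in the steps above can also be written down as a BioMart XML query. The sketch below only builds the XML text and does not contact any server. The element and attribute layout follows the BioMart XML query format as best understood here, while the internal names used in the usage lines (`hsapiens_gene_ensembl`, `hgnc_symbol`, `entrezgene`, `go_id`) are assumptions that should be checked against the BioMart release in use; `build_biomart_query` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def build_biomart_query(dataset: str,
                        filters: Dict[str, List[str]],
                        attributes: List[str]) -> str:
    """Return a BioMart-style XML query as a string (nothing is sent)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    dataset_node = ET.SubElement(query, "Dataset",
                                 {"name": dataset, "interface": "default"})
    for name, values in filters.items():
        # Multiple values for one filter are passed as a comma-separated list.
        ET.SubElement(dataset_node, "Filter",
                      {"name": name, "value": ",".join(values)})
    for name in attributes:
        ET.SubElement(dataset_node, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Internal names below are assumed, not taken from the coursebook.
symbols = ["ESPN", "MYH9", "USH1C", "CISD2", "THRB", "DFNB31"]
xml_text = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    filters={"hgnc_symbol": symbols},
    attributes=["hgnc_symbol", "entrezgene", "go_id"],
)
print(xml_text)
```

Keeping the query as text makes a BioMart search easy to save and repeat, which is harder to do with a sequence of clicks.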
What did you learn about the human genes in this exercise? Could you learn these things from the Ensembl browser? Would it take longer?

For more details on BioMart, have a look at these publications:

Smedley, D. et al. BioMart - biological queries made easy. BMC Genomics 2009 Jan 14;10:22
Kinsella, R.J. et al. Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database (Oxford) 2011:bar030
Exercises: BioMart

Exercise 10 - Finding genes by protein domain

Find mouse proteins with transmembrane domains located on chromosome 9.
Exercise 11 - Convert IDs

BioMart is a very handy tool when you want to convert IDs from different databases. The following is a list of 29 IDs of human proteins from the NCBI RefSeq database (http://www.ncbi.nlm.nih.gov/projects/RefSeq/):
NP_001218
NP_001220
NP_203125
NP_004338
NP_203124
NP_004337
NP_203126
NP_116786
NP_001007233
NP_036246
NP_150636
NP_116756
NP_150635
NP_116759
NP_001214
NP_001221
NP_150637
NP_203519
NP_150634
NP_001073594
NP_150649
NP_001219
NP_001216
NP_001073593
NP_116787
NP_203520
NP_001217
NP_203522
NP_127463
Generate a list that shows to which Ensembl Gene IDs and to which HGNC symbols these RefSeq IDs correspond. Do these 29 proteins correspond to 29 genes?

Hint: For this exercise, it's easier to copy and paste the IDs from the online exercise booklet (copy one column, then the other). See the front cover for the URL.
Exercise 12 - Export homologues

For a list of Ciona savignyi Ensembl genes, export the human orthologues.
ENSCSAVG00000000002
ENSCSAVG00000000003
ENSCSAVG00000000006
ENSCSAVG00000000007
ENSCSAVG00000000009
ENSCSAVG00000000011
Exercise 13 - Export structural variants

You can use BioMart to query variants, not just genes. (Make sure you use the right Datasets.)

(a) Export the study accession, source name, chromosome, and sequence region start and end (in bp) of human structural variations (SV) on chromosome 1, starting at 130,408 and ending at 210,597.

(b) In a new BioMart query, find the alleles, phenotype descriptions, and associated genes for the human SNPs rs1801500 and rs1801368. Can you view this same information in the Ensembl browser?
Exercise 14 - Find genes associated with array probes

Forrest et al performed a microarray analysis of peripheral blood mononuclear cell gene expression in benzene-exposed workers (Environ Health Perspect. 2005 June; 113(6): 801-807). The microarray used was the human Affymetrix U133A/B (also called U133 plus 2) GeneChip. The top 25 up-regulated probe-sets were:
207630_s_at
221641_s_at
221840_at
202055_at
219228_at
226743_at
204924_at
228393_s_at
227613_at
225120_at
223454_at
218515_at
228962_at
202224_at
214696_at
200614_at
210732_s_at
212014_x_at
212370_at
223461_at
225390_s_at
209835_x_at
227645_at
213315_x_at
226652_at
(a) For the genes corresponding to these probe-sets, retrieve the Ensembl Gene and Transcript IDs as well as their HGNC symbols and descriptions.

(b) In order to analyse these genes for possible promoter/enhancer elements, retrieve the 2000 bp upstream of the transcripts of these genes.

(c) In order to be able to study these human genes in mouse, identify their mouse orthologues. Also retrieve the genomic coordinates of these orthologues.
Variation

Demo: Exploring variants in Ensembl

In any of the sequence views shown in the Gene and Transcript tabs, you can view variants on the sequence. You can do this by clicking on Configure this page from any of these views.

Let's take a look at the Gene sequence view for MCM6 in human. Search for MCM6 and go to the Sequence view.

If you can't see variants marked on this view, click on Configure this page and select Show variations: Yes and show links.

[Figure: the Sequence view with variants. Labels: legend of variant consequence types; links to variants; variants on sequences shown as IUPAC codes.]

Find out more about a variant by clicking on it.

You can add variants to all other sequence views in the same way.
You can go to the Variation tab by clicking on the variant ID. For now, we'll explore more ways of finding variants.

To view all the sequence variations in table form, click the Variation table link at the left of the Gene tab.

The table is divided into consequence types. Click on Show to expand a detailed table for any of the consequence types available. Let's expand Missense variants.

[Figure: the Missense variants table. Labels: variant IDs; transcript affected; SIFT and PolyPhen scores.]

The table contains lots of information about the variants. You can click on the IDs here to go to the Variation tab too.

Let's look at Structural Variation in the Gene tab. You'll find it in the left-hand menu.
[Figure: the Structural Variation view. Labels: all larger SVs are condensed into a single bar; smaller SVs are shown individually; table of all SVs.]

You can click on the structural variants (SVs) in the image, or on their IDs in the table, to go to the SV tab.

You can also see the phenotypes associated with a gene. Click on Phenotypes in the left-hand menu.

[Figure: the Phenotypes view. Labels: phenotypes associated with the gene; phenotypes associated with variants in the gene; click to see a list of variants.]
Let's have a look at variants in the Location tab. Click on the Location tab in the top bar.

Configure this page and open Variation from the left-hand menu.

There are various options for turning on variants. You can turn on variants by source, by frequency, by presence of a phenotype or by the individual genome they were isolated from. Turn on the following sequence variants in Expanded with name:

1000 genomes All
1000 genomes All common
All phenotype-associated variants
ENSEMBL:Venter

Also turn on Larger and Smaller Structural variants (all sources) in Expanded.
[Figure: the Region in detail view with variation tracks. Labels: SNPs and indels; SVs; variation legends.]

Click on a variant to find out more information. It may be easier to see the individual variants if you zoom in. Let's zoom in on the region 2:136607850-136609811 by typing it into the Location box.

Now that we are zoomed in, we can see the variant names. Click on the variant rs4988235 to open a pop-up, then click on rs4988235 properties to open the Variation tab.
[Figure: the Variation tab. Labels: variant information; Variation views; Variation icons, which go to the same places as the links on the left.]

The icons show you what information is available for this variant. Click on Genes and regulation, or follow the link at the left.

This variant is found in three transcripts of the MCM6 gene. It has not been associated with any regulatory features or motifs.

Let's look at population genetics. Either click on Explore this variant in the left-hand menu then click on the Population genetics icon, or click on Population genetics in the left-hand menu.
[Figure: the Population genetics view. Labels: pie charts of allele frequencies; expand subpopulations; table of more detailed data.]

These data are mostly from the 1000 Genomes and HapMap projects in human. There are big differences in allele frequencies between populations.
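
Allele frequencies of the kind shown in the pie charts can be derived from genotype counts: each individual carries two alleles, so every genotype contributes two allele observations. The sketch below illustrates that arithmetic. The genotype counts are invented, the "A|G" notation for a genotype is an assumption, and `allele_frequencies` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Mapping


def allele_frequencies(genotype_counts: Mapping[str, int]) -> Dict[str, float]:
    """Convert counts of diploid genotypes ('A|G') into allele frequencies."""
    allele_counts: Counter = Counter()
    for genotype, individuals in genotype_counts.items():
        for allele in genotype.split("|"):
            allele_counts[allele] += individuals
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no genotypes supplied")
    return {allele: count / total for allele, count in allele_counts.items()}


# Invented counts for one hypothetical population of 100 individuals.
population = {"G|G": 50, "A|G": 40, "A|A": 10}
for allele, frequency in sorted(allele_frequencies(population).items()):
    print(allele, round(frequency, 2))
```

With these made-up numbers, G is seen 140 times out of 200 alleles and A 60 times, giving frequencies of 0.7 and 0.3. Running the same calculation on two populations side by side is a simple way to quantify the differences mentioned above.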
Let's have a look at the phenotypes associated with this variant to see if they are known to be specific to certain human populations. Either click on Explore this variant in the left-hand menu then click on the Phenotype data icon, or click on Phenotype Data in the left-hand menu.

This variant is associated with lactase persistence, which is known to be common in European populations and rare in Asian populations, exactly as we saw in the allele frequencies in these populations. Are there other variants in the genome that also cause lactase persistence? Click on [View on Karyotype] to find out.
[Figure: the karyotype view. Labels: hits on the karyotype; legend showing hit significance; table of variants.]

Two variants are known to be associated with this phenotype. Both are found within the MCM6 gene.

Click back to the Variation tab. Click on Phylogenetic context to see the variant in other species.

[Figure: the Phylogenetic context view. Labels: choose your alignment; aligned regions; SNP of interest; alignment between species.]
The variant is not marked in the other species. This means that the variant arose in humans.
Exercises: Exploring variants in Ensembl

Exercise 15 - Human population genetics and phenotype data

The SNP rs1738074 in the 5' UTR of the human TAGAP gene has been identified as a genetic risk factor for a few diseases.

(a) In which transcripts is this SNP found?

(b) What is the least frequent genotype for this SNP in the Yoruba (YRI) population from the HapMap set?

(c) What is the ancestral allele? Is it conserved in the 37 eutherian mammals?

(d) With which diseases is this SNP associated? Are there any known risk (or associated) alleles?
Exercise 16 - Exploring a SNP in human

The missense variation rs1801133 in the human MTHFR gene has been linked to elevated levels of homocysteine, an amino acid whose plasma concentration seems to be associated with the risk of cardiovascular diseases, neural tube defects, and loss of cognitive function. This SNP is also referred to as A222V or Ala222Val, as well as by other HGVS names.

(a) Find the page with information for rs1801133.

(b) Is rs1801133 a Missense variation in all transcripts of the MTHFR gene?

(c) Why are the alleles for this variation in Ensembl given as G/A and not as C/T, as in dbSNP and literature? (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1801133)
(d) What is the major allele in rs1801133?

(e) In which paper is the association between rs1801133 and homocysteine levels described?

(f) According to the data imported from dbSNP, the ancestral allele for rs1801133 is G. Ancestral alleles in dbSNP are based on a comparison between human and chimp. Does the sequence at this same position in four other primates, i.e. gorilla, orangutan, macaque and marmoset, confirm that the ancestral allele is G?

(g) Were both alleles of rs1801133 already present in Neanderthal? To answer this question, have a look at the individual reads at its genomic position in the Neanderthal Genome Browser (http://neandertal.ensemblgenomes.org/).
Exercise 17 - Structural variation in human

In the paper "The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility" (Gonzalez et al Science. 2005 Mar 4; 307(5704):1434-40) it is shown that a higher copy number of the CCL3L1 (Chemokine (C-C motif) ligand 3-like 1) gene is associated with lower susceptibility to HIV infection.

(a) Find the human CCL3L1 gene.

(b) Have any CNVs been annotated for this gene? Note: In Ensembl, CNVs are classified as structural variants.
Exercise 18 - Exploring a SNP in mouse

Madsen et al, in the paper "Altered metabolic signature in pre-diabetic NOD mice" (PLoS One. 2012; 7(4): e35445), have described several regulatory and coding SNPs, some of them in genes residing within the previously defined insulin dependent diabetes (IDD) regions. The authors describe that one of the identified SNPs in the murine Xdh gene (rs29522348) would lead to an amino acid substitution and could be damaging, as predicted by SIFT (http://sift.jcvi.org/).
(a) Where is the SNP located (chromosome and coordinates)?

(b) What is the HGVS recommendation nomenclature for this SNP?

(c) Why does Ensembl put the C allele first (C/T)?

(d) Are there differences between the genotypes reported in NOD/LTJ and BALB/cByJ?
Demo: The Variant Effect Predictor (VEP)

We have analysed a sample from a patient with a genetic disorder. The patient presents with facial and limb deformities, mental retardation and gastrointestinal reflux. Our genotyping has identified a mutation that may be responsible for the phenotype: an A->G mutation on chromosome 5 at 37,017,205 on the + strand.

We will use the Ensembl VEP to determine:

Has my variant already been annotated in Ensembl?
What genes are affected by my variant?
Does my variant result in a protein change?

Go to the front page of Ensembl and click on the VEP button.

This page contains information about the VEP, including links to download the script version of the tool. Click on Launch the online VEP tool! This will open up a dialogue box, which allows us to input data on our variant.
[Figure: the VEP input form. Labels: give your data a name; you can also upload a file.]

The data is in the format:

Chromosome Start End alleles (reference/mutation) strand

Delete the writing already in the Paste data box and type in:

5 37017205 37017205 A/G +
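
When there are many variants to check, lines in this format can be generated rather than typed. The sketch below formats a single-base substitution as one whitespace-separated line in the field order given above; `vep_input_line` is a hypothetical helper and not part of the VEP itself, and it only covers the simple case where start and end are the same position.

```python
def vep_input_line(chromosome: str, position: int,
                   reference: str, mutation: str, strand: str = "+") -> str:
    """Format one substitution as 'chromosome start end ref/alt strand'."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if position < 1:
        raise ValueError("position must be a positive integer")
    for allele in (reference, mutation):
        if len(allele) != 1 or allele.upper() not in "ACGT":
            raise ValueError(f"expected a single base, got {allele!r}")
    # For a single-base change the start and end coordinates are identical.
    return f"{chromosome} {position} {position} {reference.upper()}/{mutation.upper()} {strand}"


# The demo variant: an A->G change on chromosome 5 at 37,017,205, + strand.
print(vep_input_line("5", 37017205, "A", "G"))
```

The printed line matches the one entered in the Paste data box above.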
Scroll
down
to
see
some
of
the
options
we
can
also
choose.
Choose
which
database
to
map
your
variant
to.
Find
out
if
variants
already
exist
in
our
database.
Choose
to
see
scores
for
protein
changes.
Choose
to
only
see
common
or
rare
variants
Select
Prediction
and
Score
for
SIFT
predictions
and
PolyPhen
predictions.
These
are
algorithms
that
predict
how
deleterious
a
mutation
will
be
on
a
protein.
When
you've
selected
everything
you
need,
scroll
right
to
the
bottom
and
click
Next.
Click
HTML
to
view
your
results
with
clickable
links.
Our mutation affects two transcripts of one gene
Our mutation causes an amino acid change
Our mutation is already in the Ensembl database
Exercise:
The
Variant
Effect
Predictor
(VEP)
Exercise
19
VEP
Resequencing
of
the
genomic
region
of
the
human
CFTR
(cystic
fibrosis
transmembrane
conductance
regulator
(ATP-binding
cassette
sub-family
C,
member
7)
gene
(ENSG00000001626)
has
revealed
the
following
variants
(alleles
defined
in
the
forward
strand):
G/A
at
7:117,171,039
T/C
at
7:117,171,092
T/C
at
7:117,171,122
(a)
Use
the
VEP
tool
in
Ensembl
and
choose
the
options
to
see
SIFT
and
PolyPhen
predictions.
Do
these
variants
result
in
a
change
in
the
proteins
encoded
by
any
of
the
Ensembl
genes?
Which
gene?
Have
the
variants
already
been
found?
(b)
Go
to
Region
in
detail
for
CFTR.
Do
you
see
the
VEP
track?
Comparative
genomics
Demo:
Gene
trees
and
homologues
Let's
look
at
the
homologues
of
human
BRCA2.
Search
for
the
gene
and
go
to
the
Gene
tab.
Click
on
Gene
tree
(image),
which
will
display
the
current
gene
in
the
context
of
a
phylogenetic
tree
used
to
determine
orthologues
and
paralogues.
Protein alignments
Collapsed nodes
Gene of interest
Legend
Funnels
indicate
collapsed
nodes.
We
can
expand
them
by
clicking
on
the
node
and
selecting
Expand
this
sub-tree
from
the
pop-up
menu.
Expand
this
sub-tree
We
can
look
at
homologues
in
the
Orthologues
and
Paralogues
pages,
which
can
be
accessed
from
the
left-hand
menu.
The
numbers
of
orthologues
or
paralogues
available
are
indicated
in
brackets
alongside
the
name.
If
there
are
none,
then
the
name
will
be
greyed
out.
Paralogues
is
greyed
out
for
BRCA2
indicating
that
there
are
no
paralogues
available.
Click
on
Orthologues
to
see
the
61
orthologues
available.
Orthologue types
Choose a taxon of interest
Information on orthologues
Choose
to
see
only
Rodent
orthologues
by
selecting
the
box.
The
table
below
will
now
only
show
details
of
rodent
orthologues.
Let's
look
at
mouse.
Links
from
the
orthologue
allow
you
to
go
to
alignments
of
the
orthologous
proteins
and
cDNAs.
Click
on
Alignment
(protein)
for
the
mouse
orthologue.
Information
on
orthologue
pair
Alignment
in
Clustal
W
format
Protein
IDs
Exercises:
Gene
trees
and
homologues
Exercise
20
Orthologues,
paralogues
and
gene
trees
for
the
human
BRAF
gene.
(a)
How
many
orthologues
are
predicted
for
this
gene
in
primates?
Note
the
Target
%id
and
Query
%id.
How
much
sequence
identity
does
the
Tarsius
syrichta
protein
have
to
the
human
one?
Click
on
the
Alignment
link
next
to
the
Ensembl
identifier
column
to
view
a
protein
alignment
in
Clustal
format.
(b)
Go
to
the
orthologue
in
marmoset.
Is
there
a
genomic
alignment
between
marmoset
and
human?
Is
there
a
gene
for
both
species
in
this
region?
Demo:
Whole
genome
alignments
Let's
look
at
some
of
the
comparative
genomics
views
in
the
Location
tab.
Go
to
the
region
2:176914144-177094980
in
human,
which
contains
the
HoxD
cluster
which
is
involved
in
limb
development
and
is
highly
conserved
between
species.
In
the
Region
in
detail
view,
we
can
already
see
the
Constrained
elements
for
37
eutherian
mammals
EPO_LOW_COVERAGE
track
by
default.
This
track
indicates
regions
of
high
conservation
between
species,
considered
to
be
constrained
by
evolution.
This
track
has
a
matching
conservation
score
track.
Click
on
Configure
this
page,
then
Comparative
genomics
and
turn
on
the
track
for
Conservation
score
for
37
eutherian
mammals
EPO_LOW_COVERAGE.
Save
and
close
the
menu.
You
can
now
see
the
conservation
scores
that
were
used
to
determine
the
peaks
indicated
in
the
constrained
elements
track.
We
can
also
look
at
individual
species
comparative
genomics
tracks
in
this
view
by
clicking
on
Configure
this
page.
Select
BLASTz/LASTz
alignments
from
the
left-hand
menu
to
choose
alignments
between
closely
related
species.
Turn
on
the
alignments
for
Mouse
and
Chimpanzee
in
Normal.
Go
to
Translated
blat
alignments
and
turn
on
alignments
with
Zebrafish
and
Xenopus
in
Normal.
Save
and
close
the
menu.
Nucleotide alignments in baby pink
Protein alignments in magenta
Filled boxes are aligned sequences. Empty boxes indicate no alignment.
The
alignment
is
greatest
between
closely
related
species.
We
can
also
look
at
the
alignment
between
species
or
groups
of
species
as
text.
Click
on
Alignments
(text)
in
the
left
hand
menu.
Select
Mouse
from
the
alignments
list
then
click
Go.
Choose
an
alignment
from
the
drop-down
Multiple
alignments
Pairwise
alignments
You
will
see
a
list
of
the
regions
aligned,
followed
by
the
sequence
alignment.
Exons
are
shown
in
red.
This
can
also
be
viewed
graphically.
Click
on
Alignments
(image)
in
the
left-hand
menu.
Mouse
is
already
selected
(from
text
view)
Human
region
Mouse
region,
rearranged
to
align
with
human
In
both
alignment
views
the
contig
in
the
compared
species
is
rearranged
to
align
to
the
species
of
interest.
To
compare
with
both
contigs
in
their
natural
order,
go
to
Region
comparison.
To
add
species
to
this
view,
click
on
the
blue
Select
species
or
regions
button.
Choose
Mouse
from
the
list
then
close
the
menu.
Human
region
Mouse
region
We
can
view
large
scale
syntenic
regions
from
our
chromosome
of
interest.
Click
on
Synteny
in
the
left
hand
menu.
Human
chromosome
Choose
another
species
or
chromosome
Mouse
chromosome
with
syntenic
region
Syntenic
regions
Region
of
interest
Table
of
syntenic
genes
Exercises:
Whole
genome
alignments
(c)
Click
on
the
Region
in
detail
link
at
the
left
and
turn
on
the
tracks
for
multiple
alignments
and
conservation
score
for
the
5
teleost
fish
EPO
by
configuring
the
page.
What
is
the
difference
between
the
5
teleost
fish
EPO
multiple
alignment
track
and
the
Constrained
elements
track?
Which
regions
of
the
gene
do
most
of
the
constrained
element
blocks
match
up
to?
Can
you
find
more
information
on
how
the
constrained
elements
track
was
generated?
Exercise
22
Synteny
Go
to
www.ensembl.org
Find
the
Rhodopsin
(RHO)
gene
for
Human.
Go
to
the
Location
tab.
(a)
Click
Synteny
at
the
left.
Are
there
any
syntenic
regions
in
dog?
If
so,
which
chromosomes
are
shown
in
this
view?
(b)
Stay
in
the
Synteny
view.
Is
there
a
homologue
in
dog
for
human
RHO?
Are
there
more
genes
in
this
syntenic
block
with
homologues?
Exercise
23
Whole
genome
alignments
(a)
Find
the
Ensembl
BRCA2
(Breast
cancer
type
2
susceptibility
protein)
gene
for
human
and
go
to
the
Region
in
detail
page.
(b)
Turn
on
the
BLASTZ
or
LASTZ-net
alignment
tracks
for
chicken,
chimp,
mouse
and
platypus
and
the
Translated
BLAT
alignment
tracks
for
anole
lizard
and
zebrafish.
Does
the
degree
of
conservation
between
human
and
the
various
other
species
reflect
their
evolutionary
relationship?
Which
parts
of
the
BRCA2
gene
seem
to
be
the
most
conserved?
Did
you
expect
this?
(c)
Have
a
look
at
the
Conservation
score
and
Constrained
elements
tracks
for
the
set
of
37
mammals
and
the
set
of
21
amniota
vertebrates.
Do
these
tracks
confirm
what
you
already
saw
in
the
tracks
with
pairwise
alignment
data?
(d)
Retrieve
the
genomic
alignment
for
a
constrained
element.
Highlight
the
bases
that
match
in
>50%
of
the
species
in
the
alignment.
(e)
Retrieve
the
genomic
alignment
for
the
BRCA2
gene
for
primates.
Highlight
the
bases
that
match
in
>50%
of
the
species
in
the
alignment.
Regulation
Demo:
Raw
ChIPSeq
data
We're
going
to
add
some
regulation
data
to
the
Region
in
detail
view.
We'll
start
at
the
human
region
11:2012486-2030153,
which
contains
the
imprinted
H19
gene.
Add
regulation
tracks
using
Configure
this
page.
First,
we're
going
to
add
ChIP-seq
data
for
histone
modifications
and
polymerase
binding.
Click
on
Histones
&
polymerases
under
Regulation
in
the
left-hand
menu.
Add
tutorial
labels
to
help
use
this
view
Legend
Cell lines
Histone modifications
Select boxes
You
can
turn
on
a
single
track
by
clicking
on
the
box
in
the
matrix.
Note
that
certain
tracks
are
selected
for
all
cell
lines
by
default
(PolII,
PolIII,
H3K27me3,
H3K36me3,
H3K4me3,
H3K9me3).
These
will
appear
in
the
Region
in
detail
view
only
if
you
specify
a
track
style
for
the
cell
lines.
Turn
on
all
the
tracks
for
GM12878.
Hover
over
the
cell
line
name
then
select
All.
Now
choose
the
track
style
for
the
tracks
you've
switched
on.
Click
on
the
track
style
box
for
GM12878
and
select
Both.
There
is
a
similar
matrix
for
Open
chromatin & TFBS.
Use
this
to
turn
on
all
tracks
for
GM12878
in
Both.
Close
the
menu
to
see
the
tracks
in
the
browser.
Peaks
of
histone
modifications
Histograms of histone modifications
Click for legend of histogram colours
Demo:
Regulatory
features
and
segmentation
These
data
are
used
to
construct
the
Reg-feats
and
Segmentation
features.
The
merged
Reg-feats
are
switched
on
in
the
Region
in
detail
view
by
default.
Click
on
Configure
this
page.
Then
select
Regulatory
features.
Turn
on
the
Reg.
Feats:
GM12878
and
Reg.
Segs:
GM12878
tracks.
Save
and
close
the
menu.
Reg
feats
are
shown
as
bar
and
whisker
plots
A
single
coloured
bar
represents
the
segmentation
Legends
of
reg
feats
and
segmentation
colour
codes
Can
you
see
correlations
between
the
different
kinds
of
regulatory
data
representation?
You
can
also
add
methylation
data
using
Configure
this
page.
Find
it
under
DNA
methylation
and
turn
on
GM12878
RRBS
ENCODE
and
GM12878
WGBS
ENCODE.
Our
regulatory
data
incorporates
the
ENCODE
data.
To
see
the
raw
ENCODE
data
and
the
ENCODE
segmentation,
you
need
to
add
the
ENCODE
hub.
From
ensembl.org,
click
on
the
ENCODE
icon.
This
page
contains
information
about
the
ENCODE
data
and
how
it
is
incorporated
into
Ensembl.
Add
the
ENCODE
hub
by
clicking
on
the
Link
to
add
the
ENCODE
track
hub.
This
will
take
you
directly
to
the
matrices
for
adding
ENCODE
data
to
the
Region
in
detail
view.
The
ENCODE
matrices
work
in
the
same
way
as
the
Open
chromatin & TFBS
and
Histones
&
polymerases
matrices,
except
that
some
have
multiple
options
(indicated
by
numbers
within
the
boxes).
Exercises:
Regulation
Exercise
24
Gene
regulation:
Human
STX7
(a)
Find
the
Location
tab
(Region
in
detail
page)
for
the
STX7
gene.
Are
there
regulatory
features
in
this
gene
region?
If
so,
where
in
the
gene
do
they
appear?
(b)
Click
Configure
this
page
and
on
the
Regulatory
features
menu
in
the
left
hand
side.
Turn
on
Segmentation
features
for
HUVEC,
HeLa-S3,
and
HepG2
cell
types.
Do
any
of
these
cells
show
predicted
enhancer
regions
in
the
STX7
region?
(c)
Use
Configure
this
page
to
add
supporting
data
indicating
open
chromatin
for
HeLa-S3
cells.
Are
there
sites
enriched
for
marks
of
open
chromatin
(DNase1
and
FAIRE)
in
HeLa
cells
at
the
5′
end
of
STX7?
(d)
Configure
this
page
once
again
to
add
histone
modification
supporting
data
for
the
same
cell
type
as
above
(e.g. HeLa-S3).
Which
ones
are
present
at
the
5′
end
of
STX7?
(e)
Is
there
any
data
to
support
methylated
CpG
sites
in
this
region
(5′
end)
of
STX7
in
B-cells?
(f)
Create
a
Share
link
for
this
display.
Email
it
to
yourself
then
open
the
link.
Exercise
25
Regulatory
features
in
human
The
HLA-DRB1
and
HLA-DQA1
genes
are
part
of
the
human
major
histocompatibility
complex
class
II
(MHC-II)
region
and
are
located
about
44
kb
from
each
other
on
chromosome
6.
In
the
paper
The
human
major
histocompatibility
complex
class
II
HLA-DRB1
and
HLA-DQA1
genes
are
separated
by
a
CTCF-binding
enhancer-blocking
element
(Majumder
et
al
J
Biol
Chem.
2006
Jul
7;281(27):18435-43)
a
region
of
high
acetylation
located
in
the
intergenic
sequences
between
HLA-DRB1
and
HLA-DQA1
is
described.
This
region,
termed
XL9,
coincided
with
sequences
that
bound
the
insulator
protein
CCCTC-binding
factor
(CTCF).
Majumder
et
al
hypothesise
that
the
XL9
region
may
have
evolved
to
separate
the
transcriptional
units
of
the
HLA-DR
and
HLA-DQ
genes.
(a)
Go
to
the
region
from
32,540,000
to
32,620,000
bp
on
human
chromosome
6
(b)
Is
there
a
regulatory
feature
annotated
in
the
intergenic
region
between
the
HLA-DRB1
and
HLA-DQA1
genes
that
has
CTCF
binding
supporting
data
as
(part
of)
its
core
evidence?
(c)
Has
the
CTCF
binding
detected
at
this
position
been
observed
in
all
cell/tissue
types
analysed?
(d)
Have
a
look
at
the
Regulatory
supporting
evidence
-
Histones
&
Polymerases
configuration
matrix.
For
which
cell/tissue
type
are
the
most
histone
acetylation
data
sets
available?
In
this
cell/tissue
type,
is
the
region
that
shows
CTCF
binding
also
a
region
of
high
acetylation,
as
found
by
Majumder
et
al?
Advanced
Access
Demo:
Upload
small
files
We
have
some
patients
that
present
with
microcephaly
and
developmental
delay.
They
all
have
large
scale
deletions
on
chromosome
five:
Patient  Chromosome  Start     End
P1       5           36821632  37091234
P2       5           36731476  36978306
P3       5           36908552  37108671
We
can
turn
them
into
a
BED
file
and
view
them
in
the
genome
browser.
To
find
out
about
BED
format,
click
on
Help
&
Documentation
in
the
top
bar
from
any
page
in
Ensembl:
Click
on
BED
File
Format
to
find
out
more:
This
page
describes
the
BED
file
format.
For
our
data,
we
have
chromosome
coordinates
and
a
name
for
each
feature.
Following
the
instructions
on
this
page,
we
can
put
our
data
into
BED
format
as
follows:
chr5 36821632 37091234 P1
chr5 36731476 36978306 P2
chr5 36908552 37108671 P3
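The same conversion can be scripted when there are more than a handful of features. This is a minimal sketch, assuming the four-column layout used above (chromosome with a chr prefix, start, end, name); the helper name `to_bed_line` is hypothetical, and the coordinates are copied across unchanged, exactly as in the demo.

```python
# Patient deletions from the table above: (name, chromosome, start, end).
patients = [
    ("P1", "5", 36821632, 37091234),
    ("P2", "5", 36731476, 36978306),
    ("P3", "5", 36908552, 37108671),
]


def to_bed_line(name, chromosome, start, end):
    """Return one BED line: chr<chromosome> start end name.

    Hypothetical helper for illustration; it only rearranges the columns.
    """
    return f"chr{chromosome} {start} {end} {name}"


bed_lines = [to_bed_line(*row) for row in patients]
print("\n".join(bed_lines))
```

The printed text can be pasted straight into the upload box described below.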
To
see
this
data
in
Ensembl,
we
need
to
go
to
a
region
of
interest.
We'll
go
to
the
region
of
these
data.
Put
human
5:36700000-37110000
into
the
top
right
search
box
to
jump
to
the
Region
in
detail
page.
Click
on
the
Add
your
data
button
at
the
left.
If
you've
previously
added
data
to
Ensembl,
this
button
will
say
Manage
your
data
instead.
A
menu
will
appear:
Choose a name for the data
Species is human
Select BED
More
options
will
now
appear
in
the
menu.
Since
upload
is
allowed
for
BED,
this
option
appears.
You
are
still
able
to
attach
a
URL
if
you
want
to.
Paste
the
BED
data
into
the
box
then
click
Upload.
You
should
get
to
a
dialogue
box
telling
you
your
upload
has
been
successful.
Close
the
menu
to
go
back
to
your
region
of
interest.
Save,
share
or
delete
this
data.
If
you've
got
an
Ensembl
account,
you
can
save
this
data
to
your
account.
Accounts
are
free
to
set
up
and
allow
you
to
save
configurations
and
data,
and
share
with
groups.
Demo:
Attach
URLs
of
large
files
Larger
files,
such
as
BAM
files
generated
by
NGS,
need
to
be
attached
by
URL.
I've
put
a
BAM
file
of
human
chromosome
20
RNASeq
data
online
at:
http://www.ebi.ac.uk/~emily/Workshops/BAM
Let's
take
a
look
at
that
URL.
Here
you
can
see
two
files
Illumina_reads_test.bam
and
Illumina_reads_test.bam.bai
(the
files
beginning
with
._
are
artefacts
of
creating
this
folder
on
a
Mac;
ignore
them).
These
files
are
the
BAM
file
and
the
index
file
respectively.
When
attaching
a
BAM
file
to
Ensembl,
there
must
be
an
index
file
in
the
same
folder.
To
attach
the
file,
click
on
Manage
your
data,
then
click
on
Add
your
data
to
add
a
new
track.
We
get
to
the
same
dialogue
box
as
before.
This
time
we'll
name
our
data
Illumina
reads
and
choose
BAM
as
the
data
format.
Paste
in
the
URL
of
the
BAM
file
itself
(http://www.ebi.ac.uk/~emily/Workshops/BAM/Illumina_reads_test.bam),
then
click
Attach.
Close
the
menu.
To
see
this
data,
jump
to
a
region
on
chromosome
20.
Let's
go
to
the
region
of
the
CDH22
gene.
Search
for
the
gene
and
click
on
the
location.
BAM
read
intensity
BAM
reads
We
can
zoom
in
to
see
the
sequence
itself.
Drag
out
boxes
in
the
view
to
zoom
in,
until
you
see
a
view
like
this.
Consensus
BAM
read
sequence
Sequence
of
individual
BAM
reads
Genomic
sequence
Demo:
REST
API
I
have
the
coordinates
of
a
particular
protein
motif
with
respect
to
the
protein
that
it's
in.
I
would
like
to
find
out
where
this
motif
lies
on
the
genome.
I'm
interested
in
a
coiled-coil
domain
at
position
116-216
in
the
protein
ENSP00000386200.
To
do
this
I
want
to
use
the
REST
API.
I'll
start
at
the
REST
homepage
at
http://beta.rest.ensembl.org/.
Here
you
can
see
a
list
of
all
the
possible
REST
endpoints,
with
names
and
short
descriptions.
Scroll
down
to
find
the
section
Mapping.
The
endpoint
GET
map/translation/:id/:region
does
what
we
want.
Click
on
the
link.
Description of the endpoint
Required parameters: what the endpoint NEEDS to work
Optional parameters: allow you to choose your output format
Example requests
Code examples in different languages for accessing this endpoint
The example output shown by default
If
you
wish
to
extract
this
data
using
a
language
such
as
Perl,
Python,
Ruby
or
Java,
or
to
get
the
data
using
command
line
tools
such
as
Curl
or
Wget,
you
can
click
on
them
to
see
code
examples.
We're
just
going
to
do
a
simple
lookup
using
a
URL.
The
top
of
the
page
shows
us
that
the
method
is
map/translation/:id/:region.
That
means
that
we
can
get
our
data
using
a
URL
in
the
format
beta.rest.ensembl.org/map/translation/:id/:region.
For
our
data
we
can
use
the
URL
http://beta.rest.ensembl.org/map/translation/ENSP00000386200/116..216.
Put
this
into
your
internet
browser.
This
will
take
you
to
a
text
page:
From
this
we
can
see
that
our
coiled-coil
domain
covers
two
different
regions,
which
will
be
two
different
exons
of
the
transcript.
They
are
on
chromosome
7
and
span
114268607-114268732
and
114269860-114270036.
If
we
were
accessing
this
data
programmatically,
the
standard
output
format
would
allow
us
to
extract
the
data.
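As a rough illustration of programmatic access, the sketch below builds the request URL for this lookup and pulls the genomic regions out of a JSON response. Several things here are assumptions rather than confirmed details: the JSON key names (`mappings`, `seq_region_name`, `start`, `end`), the helper names, and the sample payload, which is hand-written from the coordinates quoted above rather than fetched from the server. No network request is made.

```python
import json

# Host as used in this coursebook; it may differ for other releases.
SERVER = "http://beta.rest.ensembl.org"


def map_translation_url(protein_id, start, end, server=SERVER):
    """Build a URL in the format <server>/map/translation/:id/:region."""
    return f"{server}/map/translation/{protein_id}/{start}..{end}"


def genomic_regions(response_text):
    """Return (chromosome, start, end, length) for each mapped region.

    Assumes the response is JSON with a top-level 'mappings' list;
    this layout is an assumption made for the sketch.
    """
    data = json.loads(response_text)
    return [
        (m["seq_region_name"], m["start"], m["end"], m["end"] - m["start"] + 1)
        for m in data["mappings"]
    ]


# Hand-written sample payload based on the coordinates quoted in the text.
sample_response = json.dumps({
    "mappings": [
        {"seq_region_name": "7", "start": 114268607, "end": 114268732},
        {"seq_region_name": "7", "start": 114269860, "end": 114270036},
    ]
})

url = map_translation_url("ENSP00000386200", 116, 216)
regions = genomic_regions(sample_response)
total_bp = sum(length for _, _, _, length in regions)
print(url)
print(regions, total_bp)
```

A useful sanity check is that the two regions together span three bases for every amino acid in the motif (positions 116 to 216 inclusive).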
Advanced
exercise
This
exercise
requires
you
to
combine
the
knowledge
you
have
gained
about
different
aspects
of
Ensembl.
It
is
designed
to
be
challenging
and
force
you
to
come
up
with
solutions
yourself.
Methylation
data
in
human
The
human
PDHA2
gene,
which
encodes
a
subunit
of
the
pyruvate
dehydrogenase
complex,
is
exclusively
expressed
in
spermatogenic
cells.
In
the
paper
Human
testis-specific
PDHA2
gene:
Methylation
status
of
a
CpG
island
in
the
open
reading
frame
correlates
with
transcriptional
Activity
(Pinheiro
et
al
Mol
Genet
Metab.
2010
Apr;99(4):425-30),
two
CpG
islands
in
the
PDHA2
gene
are
reported,
one
encompassing
the
core
promoter
region
and
extending
into
the
open
reading
frame,
the
other
exclusively
located
in
the
coding
region.
The
latter
CpG
island
was
shown
to
be
methylated
in
somatic
tissues
but
demethylated
in
testicular
germ
cells
and
has
therefore
been
proposed
to
play
an
important
role
in
the
tissue-specific
expression
of
the
PDHA2
gene.
(a)
Find
the
PDHA2
gene
for
human
and
go
to
the
Region
in
detail
page.
Zoom
out
one
step,
so
that
5
kb
around
the
PDHA2
gene
is
shown.
(b)
Turn
on
the
CpG
islands
track.
Two
CpG
islands
are
reported
in
the
PDHA2
gene
by
Pinheiro
et
al
(2010).
Do
they
appear
in
this
track?
If
not,
why
not?
(Tip:
turn
on
Display
empty
tracks
to
confirm
that
a
track
is
on
but
has
no
data.)
(c)
Confirm
the
existence
of
the
two
CpG
islands
using
the
EMBOSS
program
CpGPlot
(http://www.ebi.ac.uk/Tools/emboss/cpgplot/index.html)
on
the
sequence
around
the
PDHA2
gene.
(d)
Upload
the
CpG
islands
found
by
CpGPlot
using
Manage
your
data.
Use
BED
format,
which
in
its
simplest
form
just
consists
of
the
chromosome
and
the
start
and
end
coordinates,
separated
by
spaces
(as
an
optional
fourth
field,
you
can
add
a
name/description).
The
genomic
start
and
end
coordinates
of
the
CpG
islands
can
be
calculated
from
the
genomic
start
coordinate
of
the
sequence
on
which
the
CpGPlot
program
was
run
and
the
relative
location
of
the
CpG
islands
on
this
sequence
as
given
by
the
CpGPlot
output.
(e)
Create
a
link
to
allow
you
to
show
your
new
BED
track
to
colleagues,
compared
to
the
%GC
track.
(f)
What
is
the
methylation
status
of
the
two
CpG
islands
in
different
tissues?
Is
there
any
tissue
in
particular
which
is
different
to
other
tissues?
(g)
Turn
on
the
RNASeq
tracks
for
different
tissues.
Is
there
evidence
that
PDHA2
is
expressed
in
one
tissue
more
than
others?
How
does
this
relate
to
the
DNA
methylation
data
you
saw?
What
does
this
suggest
about
the
way
this
gene
is
regulated?
(h)
How
well
conserved
is
the
region
of
the
PDHA2
gene
amongst
the
37
eutherian
mammals?
Are
the
CpG
islands
conserved?
(i)
How
many
GO
terms
are
associated
with
PDHA2?
Can
you
export
the
sequences
of
all
human
genes
that
are
also
associated
with
the
first
of
these
terms?
(j)
Can
you
fetch
the
gene
sequence
for
PDHA2
in
FASTA
using
the
Ensembl
REST
API?
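For part (d) of this exercise, the coordinate arithmetic can be sketched as follows. It assumes that both the genomic start of the submitted sequence and the positions reported by CpGPlot are 1-based and inclusive, and that the sequence was exported from the forward strand. The function names are hypothetical, and the numbers and chromosome name in the example are made up for illustration; they are not the PDHA2 answer.

```python
def relative_to_genomic(sequence_genomic_start, relative_start, relative_end):
    """Convert positions relative to a sequence into genomic coordinates.

    Assumes 1-based inclusive positions on the forward strand, so
    relative position 1 corresponds to sequence_genomic_start itself.
    """
    genomic_start = sequence_genomic_start + relative_start - 1
    genomic_end = sequence_genomic_start + relative_end - 1
    return genomic_start, genomic_end


def bed_line(chromosome, start, end, name=None):
    """Return a space-separated BED line with an optional fourth name field."""
    fields = [chromosome, str(start), str(end)]
    if name is not None:
        fields.append(name)
    return " ".join(fields)


# Made-up example: a sequence starting at genomic position 1000 with an
# island reported at relative positions 201..450.
example_start, example_end = relative_to_genomic(1000, 201, 450)
example_bed = bed_line("chr1", example_start, example_end, "island_1")
print(example_bed)
```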
Answers
Exploring
the
Ensembl
genome
browser
Ensembl
species
Exercise
1
Panda
(a)
Select
Panda
from
the
drop
down
species
list,
or
click
on
View
full
list
of
all
Ensembl
species,
then
choose
Panda
from
the
list.
The
assembly
is
ailMel1
or
GCA_000004335.1
(b)
Click
on
More
information
and
statistics.
Statistics
are
shown
in
the
tables
on
the
left.
The
length
of
the
genome
is
2,245,312,831
bp.
There
are
19,343
coding
genes.
Exercise
2
Zebrafish
(a)
Click
on
Zebrafish
on
the
front
page
of
Ensembl
to
go
to
the
species
homepage.
News
is
in
the
top
right.
What's
new
in
Zebrafish
release
73:
Splicing
events
Structural
variations
Zebrafish
knockout
data
(b)
Assembly
Zv8
is
available
in
the
archived
release
59.
Exercise
3
Mosquitos
(a)
Go
to
metazoa.ensembl.org.
Open
the
drop
down
list
or
click
on
View
full
list
of
all
Ensembl
Metazoa
species.
There
are
two
Anopheles
species:
Anopheles
gambiae
and
Anopheles
darlingi.
(b)
Click
on
Anopheles
gambiae,
then
on
More
information
and
statistics.
The
genome
was
published
in
2002
by
Holt
et
al
and
updated
in
2007
by
Sharakhova
et
al.
Exercise
4
Bacteria
Go
to
bacteria.ensembl.org
and
start
to
type
the
name
Belliella
baltica
into
the
search
species
box.
It
will
autocomplete,
allowing
you
to
select
Belliella
baltica
DSM
15883,
(TaxID
866536)
from
the
drop-down
list.
Click
on
More
information
and
statistics.
Belliella
baltica
has
3,680
coding
genes
and
53
non-coding.
Region
in
detail
Exercise
5
Exploring
a
genomic
region
in
human
(a)
Go
to
the
Ensembl
homepage
(http://www.ensembl.org/).
Select
Search:
Human
and
type
13:32448000-33198000
in
the
text
box
(or
alternatively
leave
the
Search
drop-down
list
like
it
is
and
type
human
13:32448000-33198000
in
the
text
box).
Click
Go.
This
genomic
region
is
located
on
cytogenetic
band
q13.1.
It
is
made
up
of
seven
contigs,
indicated
by
the
alternating
light
and
dark
blue
coloured
bars
in
the
Contigs
track.
(b)
Draw a box with your mouse
encompassing
the
BRCA2
transcripts.
Click
on
Jump
to
region
in
the
pop-up
menu.
(c)
Click
Configure
this
page
in
the
side
menu
(or
on
the
cog
wheel
icon
in
the
top
left
hand
side
of
the
bottom
image).
Type
tilepath
in
the
Find
a
track
text
box.
Select
Tilepath.
Click
on
the
(i)
button
to
find
out
more.
The
tilepath
track
shows
the
BAC
clones
that
the
assembly
was
based
upon.
Save
and
close
the
new
configuration
by
clicking
on
(or
anywhere
outside
the
pop-up
window).
There
is
not
just
one
clone
that
contains
the
complete
BRCA2
gene.
The
BAC
clone
RP11-37E23
contains
most
of
the
gene,
but
not
its
very
3′
end
(contained
in
RP11-298P3).
This
is reflected in
the
two
contigs
that
make
up
the
entire
BRCA2
gene
(the
Contigs
track
is
on
by
default).
(d)
Click
Share
this
page
in
the
side
menu.
Select
the
link
and
copy.
Compose
an
email
to
yourself,
paste
the
link
in
and
send
the
message.
Open
the
email
and
click
on
your
link.
You
should
be
able
to
view
the
page
with
the
new
configuration
and
data
tracks
you
had
added
in
the
Location
tab.
(e)
Click
Export
data
in
the
side
menu.
Leave
the
default
parameters
as
they
are.
Click
Next>.
Click
on
Text.
Note
that
the
sequence
has
a
header
that
provides
information
about
the
genome
assembly
(GRCh37),
the
chromosome,
the
start
and
end
coordinates
and
the
strand.
For
example:
>13 dna:chromosome chromosome:GRCh37:13:32883613:32978196:1
(f)
Click
Configure
this
page
in
the
side
menu.
Click
Reset
configuration.
Click
.
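The exported header shown in part (e) packs the assembly, chromosome, coordinates and strand into one colon-separated field, which is easy to unpack in a script. The sketch below is illustrative only: the function name `parse_export_header` and the dictionary keys are hypothetical, and it assumes the location is always the last whitespace-separated field with exactly six colon-separated parts, as in the example header.

```python
def parse_export_header(header):
    """Split an exported FASTA header into its location components.

    Assumes the last whitespace-separated field has the form
    coord_system:assembly:region:start:end:strand.
    """
    location = header.lstrip(">").split()[-1]
    coord_system, assembly, region, start, end, strand = location.split(":")
    return {
        "coord_system": coord_system,
        "assembly": assembly,
        "region": region,
        "start": int(start),
        "end": int(end),
        "strand": int(strand),
    }


example_header = ">13 dna:chromosome chromosome:GRCh37:13:32883613:32978196:1"
parsed = parse_export_header(example_header)
region_length = parsed["end"] - parsed["start"] + 1
print(parsed, region_length)
```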
Exercise
6
Exploring
patches
and
haplotypes
in
human
(a)
Go
to
the
Ensembl
homepage
(http://www.ensembl.org/).
Select
Search:
Human
and
type
6:112294691-112624977
in
the
text
box
(or
alternatively
leave
the
Search
drop-down
list
like
it
is
and
type
human
6:112294691-112624977
in
the
text
box).
Click
Go.
You
will
see
a
green
highlighted
region
in
the
middle
of
this
region.
Click
on
the
thin
dark
green
bar
in
any
of
the
three
views
to
see
the
label
HG1304_PATCH.
To
learn
about
patches,
open
a
new
tab
in
your
internet
browser,
go
to
the
Ensembl
homepage
and
put
patch
into
the
search
box.
Choose
Help
&
Docs
from
the
left
hand
side.
There
are
glossary
terms
(Patch
and
Alternative
sequence)
and
an
FAQ
(What
haplotypes
and
assembly
patches
can
I
see
for
human?)
that
explain
patches.
(b)
Patches
are
marked
in
green
in
the
chromosome
view
at
the
top.
Click
on
the
leftmost
patch
to
confirm
that
it
is
definitely
HG27_patch.
Drag
a
box
around
it
(less
than
1Mb)
then
click
on
Jump
to
region.
Scroll
down
to
the
Region
in
detail
view
and
click
on
the
thin
dark
green
bar
at
the
top
of
the
patch.
A
drop-down
containing
the
coordinates
of
the
patch
will
appear.
6: 26585843-26859228
(c)
Another
option
in
this
drop-down
is
Compare
with
reference.
Click
on
this.
Scroll
down
the
page
to
see
the
comparison
between
the
patch
and
reference.
Aligned
sequences
are
highlighted
in
pink
and
linked
together
in
green.
The
sequences
in
this
region
have
been
rearranged.
(d)
Click
the
back
button
in
your
browser
to
return
to
the
Region
in
detail
page.
Using
your
mouse,
click
and
drag
within
the
1Mb
view
to
move
right.
The
red
highlighted
regions
are
all
labelled
HSCHR6_MHC
etc,
which
is
the
MHC
haplotypic
region.
Search
help
again
to
understand
what
haplotypes
are,
in
the
same
way
as
you
did
for
patches.
Answers
Genes
and
Transcripts
Exercise
7
-
Exploring
the
human
MYH9
gene
(a)
Go
to
the
Ensembl
homepage
(http://www.ensembl.org).
Select
Search:
Human
and
type
MYH9.
Click
Go,
then
Human
on
the
results
page.
Click
on
Gene.
Click
on
either
the
Ensembl
ID
ENSG00000100345
or
the
HGNC
official
gene
name
MYH9.
Chromosome
22
on
the
reverse
strand.
(UTR
sequence
is
shown
in
purple).
You
can
also
see
this
in
the
cDNA
view
if
you
click
on
the
cDNA
link
in
the
left
side
menu.
(d)
Click
on
Oligo
probes
in
the
side
menu.
Probesets
from
Affymetrix,
Agilent,
Codelink,
Illumina,
and
Phalanx
match
to
this
transcript
sequence.
Expression
analysis
with
any
of
these
probesets
would
reveal
information
about
the
transcript.
Hint:
this
information
can
sometimes
be
found
in
the
ArrayExpress
Atlas:
www.ebi.ac.uk/arrayexpress/
Exercise
8
Finding
a
gene
associated
with
a
phenotype
(a)
Start
at
the
Ensembl
homepage
(http://www.ensembl.org).
Type
phenylketonuria
into
the
search
box
then
click
Go.
Choose
Gene
from
the
left
hand
menu.
The
gene
associated
with
this
disorder
is
PAH,
phenylalanine
hydroxylase,
ENSG00000171759.
(b)
Click
on
the
gene
symbol
to
go
to
the
Gene
tab.
Click
on
Expression
in
the
left
hand
menu.
The
gene
is
expressed
in
all
tissues
listed.
This
is
unsurprising
for
a
metabolic
gene.
Hover
over
the
column
titles
to
view
definitions.
Intron
spanning
reads
are
RNASeq
reads
that
cover
exon
junctions.
RNASeq
alignments
are
RNASeq
reads
that
align
to
the
genome.
(c)
If
the
transcript
table
is
hidden,
click
on
Show
transcript
table
to
see
it.
There
are
four
protein
coding
transcripts.
Click
on
Transcript
comparison
in
the
left
hand
menu.
Click
on
Select
transcripts.
Either
select
all
the
transcripts
labelled
protein
coding
one-by-one,
or
click
on
the
drop
down
and
select
Protein
coding.
Close
the
menu.
(d)
Click
on
External
references.
The
MIM
disease
ID
is
261600.
Exercise
9
Exploring
a
plant
gene
(Vitis
vinifera,
grape)
(a)
Go
to
http://plants.ensembl.org/index.html
Select
Vitis
vinifera
from
the
drop
down
menu
All
genomes
select
a
species
or
click
on
View
full
list
of
all
Ensembl
Plants
species
and
then
choose
V.
vinifera.
Type
MADS4
and
click
on
the
gene
name
link
MADS4
[VIT_01s0010g03900].
Click
on
GO:
biological
process
in
the
side
menu.
There
are
nine
terms
listed
including
GO:0006351,
transcription,
DNA-dependent,
and
GO:0006355,
regulation
of
transcription,
DNA-dependent.
(b)
Click
on
the
transcript
tab
named
Vv01s0010g03900.t01
(or
on
the
Transcript
tab).
Click
on
Exons
in
the
left
hand
menu.
There
are
eight
exons,
of
which
exon
8
is
longest
with
303
bp,
of
which
13
are
coding.
(c)
Click
on
either
Protein
Summary
or
Domains
&
features
in
the
left
hand
menu
to
see
graphically
or
as
a
table
respectively.
A
TF_MADSbox
is
identified
by
six
domain
prediction
methods.
A
TF_Kbox
domain
is
identified
by
two.
Two
coiled-coils
are
identified
by
one.
Answers
BioMart
Exercise
10
Finding
genes
by
protein
domain
As
with
all
BioMart
queries
you
must
select
the
dataset,
set
your
filters
(input)
and
define
your
attributes
(desired
output).
For
this
exercise:
Dataset:
Ensembl
genes
in
mouse
Filters:
Transmembrane
proteins
on
chromosome
9
Attributes:
Ensembl
gene
and
transcript
IDs
and
Associated
gene
names
Go
to
the
Ensembl
homepage
(http://www.ensembl.org)
and
click
on
BioMart
at
the
top
of
the
page.
Select
Ensembl
genes
as
your
database
and
Mus
musculus
genes
as
the
dataset.
Click
on
Filters
on
the
left
of
the
screen
and
expand
REGION.
Change
the
chromosome
to
9.
Now
expand
PROTEIN
DOMAINS,
also
under
filters,
and
select
Transmembrane
domains
and
then
Only.
Clicking
on
Count
should
reveal
that
you
have
filtered
the
dataset
down
to
420
genes.
Click
on
Attributes
and
expand
GENE.
Select
Associated
gene
name.
Now
click
on
Results.
The
first
10
results
are
displayed
by
default;
display
all
results
by
selecting
ALL
from
the
drop
down
menu.
The
output
will
display
the
Ensembl
gene
ID,
Ensembl
Transcript
ID
and
Associated
gene
names
of
all
proteins
with
a
transmembrane
domain
on
mouse
chromosome
9.
If
you
prefer,
you
can
also
export
as
an
Excel
sheet
by
using
the
Export
all
results
to
XLS
option.
Exercise
11
Convert
IDs
Click
New.
Choose
the
ENSEMBL
Genes
73
database.
Choose
the
Homo
sapiens
genes
(GRCh37)
dataset.
Click
on
Filters
in
the
left
panel.
Expand
the
GENE
section
by
clicking
on
the
+
box.
Select
ID
list
limit
-
RefSeq
protein
ID(s)
and
enter
the
list
of
IDs
in
the
text
box
(either
comma
separated
or
as
a
list).
HINT:
You
may
have
to
scroll
down
the
menu
to
see
these.
Count
shows
11
genes
(remember
one
gene
may
have
multiple
splice
variants
coding
for
different
proteins,
that
is
the
reason
why
these
29
proteins
do
not
correspond
to
29
genes).
Click
on
Attributes
in
the
left
panel.
Select
the
Features
attributes
page.
Expand
the
External
section
by
clicking
on
the
+
box.
Select
HGNC
symbol
and
RefSeq
Protein
ID
from
the
External
References
section.
Click
the
Results
button
on
the
toolbar.
Select
View
All
rows
as
HTML
or
export
all
results
to
a
file.
Tick
the
box
Unique
results
only.
Exercise
12
Export
homologues
Click
New.
Choose
the
ENSEMBL
Genes
74
database.
Choose
the
Ciona
savignyi
genes
(CSAV2.0)
dataset.
Click
on
Filters
in
the
left
panel.
Expand
the
GENE
section
by
clicking
on
the
+
box.
Enter
the
gene
list
in
the
ID
List
Limit
box.
Click
on
Attributes
in
the
left
panel.
Select
the
Homologs
attributes
page.
Expand
the
Orthologs
section
by
clicking
on
the
+
box.
Select
Human
Ensembl
Gene
ID.
Click
Results
(remember
to
tick
the
unique
results
only
box).
Exercise
13
Export
structural
variants
(a)
Choose
Ensembl
Variation
74
and
Homo
sapiens
Structural
Variation.
Filters:
Region:
Chromosome
1,
Base
pair
start:
130408,
Base
pair
end:
210597
Count
shows
6
out
of
3,577,025
structural
variants.
Attributes:
Structural
Variation
(SV)
Information:
DGVa
Study
Accession
and
Source
Name
Structural
Variation
(SV)
Location:
Chromosome
name,
Sequence
region
start
(bp)
and
Sequence
region
end
(bp).
(b)
Choose
Ensembl
Variation
74
and
Homo
sapiens
Short
Variation
(SNPs
and
indels).
Filters:
Filter
by
Variation
ID
enter:
rs1801500,
rs1801368
Attributes:
Variation
Name,
Variant
Alleles,
Phenotype
description,
and
Associated
gene.
You
can
view
this
same
information
in
the
Ensembl
browser.
Click
on
one
of
the
variation
IDs
(names)
in
the
result
table.
The
variation
tab
should
open
in
the
Ensembl
browser.
Click
Phenotype
Data.
Exercise
14
Find
genes
associated
with
array
probes
(a)
Click
New.
Choose
the
ENSEMBL
Genes
74
database.
Choose
the
Homo
sapiens
genes
(GRCh37)
dataset.
Click
on
Filters
in
the
left
panel.
Expand
the
GENE
section
by
clicking
on
the
+
box.
Select
ID
list
limit
-
Affy
hg
u133
plus
2
probeset
ID(s)
and
enter
the
list
of
probeset
IDs
in
the
text
box
(either
comma
separated
or
as
a
list).
Count
shows
25
genes
match
this
list
of
probesets.
Click
on
Attributes
in
the
left
panel.
Select
the
Features
attributes
page.
Expand
the
GENE
section
by
clicking
on
the
+
box.
In
addition
to
the
default
selected
attributes,
select
Description.
Expand
the
External
section
by
clicking
on
the
+
box.
Select
HGNC
symbol
from
the
External
References
section
and
AFFY
HG
U133-PLUS-2
from
the
Microarray
Attributes
section.
Click
the
Results
button
on
the
toolbar.
Select
View
All
rows
as
HTML
or
export
all
results
to
a
file.
Tick
the
box
Unique
results
only.
Your
results
should
show
that
the
25
probes
map
to
25
Ensembl
genes.
(b)
Don't
change
Dataset
and
Filters;
simply
click
on
Attributes.
Select
the
Sequences
attributes
page.
Expand
the
SEQUENCES
section
by
clicking
on
the
+
box.
Select
Flank
(Transcript)
and
enter
2000
in
the
Upstream
flank
text
box.
Expand
the
Header
information
section
by
clicking
on
the
+
box.
Select,
in
addition
to
the
default
selected
attributes,
Description
and
Associated
Gene
Name.
Note:
Flank
(Transcript)
will
give
the
flanks
for
all
transcripts
of
a
gene
with
multiple
transcripts.
Flank
(Gene)
will
give
the
flanks
for
one
possible
transcript
in
a
gene
(the
most
5'
coordinates
for
upstream
flanking).
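Which coordinates count as "upstream" depends on the strand of the feature. The following Python sketch shows one way to compute a 2000 bp upstream flank, assuming 1-based inclusive coordinates and a strand encoded as 1 or -1. The function name and the example coordinates are hypothetical and are not taken from BioMart.

```python
def upstream_flank(start, end, strand, flank=2000):
    """Return (flank_start, flank_end), 1-based inclusive, for the region
    immediately upstream (5') of a feature spanning start..end."""
    if strand == 1:
        # Forward strand: the 5' end is the smaller coordinate.
        return max(1, start - flank), start - 1
    if strand == -1:
        # Reverse strand: the 5' end is the larger coordinate.
        return end + 1, end + flank
    raise ValueError("strand must be 1 or -1")


# Hypothetical transcript coordinates, for illustration only.
print(upstream_flank(10_000, 15_000, 1))   # (8000, 9999)
print(upstream_flank(10_000, 15_000, -1))  # (15001, 17000)
```

For a gene with several transcripts, applying this to each transcript corresponds to Flank (Transcript), while applying it once to the most 5' coordinate corresponds to Flank (Gene).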
Click
the
Results
button
on
the
toolbar.
(c)
You
can
leave
the
Dataset
and
Filters
the
same,
and
go
directly
to
the
Attributes
section:
Click
on
Attributes
in
the
left
panel.
Select
the
Homologs
attributes
page.
Expand
the
GENE
section
by
clicking
on
the
+
box.
Select
Associated
Gene
Name.
Deselect
Ensembl
Transcript
ID.
Expand
the
ORTHOLOGS
section
by
clicking
on
the
+
box.
Select
Mouse
Ensembl
Gene
ID,
Mouse
Chromosome
Name,
Mouse
Chr
Start
(bp)
and
Mouse
Chr
End
(bp).
Click
the
Results
button
on
the
toolbar.
Check
the
box
Unique
results
only.
Select
View
All
rows
as
HTML
or
export
all
results
to
a
file.
Your
results
should
show
that
for
most
of
the
human
genes
at
least
one
mouse
orthologue
has
been
identified.
Answers
Variation
Finding
variants
in
Ensembl
Exercise
15
Human
population
genetics
and
phenotype
data
(a)
Please
note
there
is
more
than
one
way
to
get
this
answer.
Either
go
to
the
Variation
Table
for
the
human
TAGAP
gene,
and
Show
variants
in
the
5' UTR,
or
search
Ensembl
for
rs1738074
directly.
Once
you're
in
the
Variation
tab,
click
on
the
Genes
and
regulation
link
or
icon.
This
SNP
is
found
in
three
transcripts
(ENST00000326965,
ENST00000338313,
and
ENST00000367066).
(b)
Click
on
Population
genetics
at
the
left
of
the
variation
tab.
(Or,
click
on
Explore
this
variation
at
the
left
and
click
the
Population
genetics
icon.)
In
Yoruba
(CSHL-HAPMAP:HapMap-YRI
population),
the
least
frequent
genotype
is
CC
at
the
frequency
of
9.7%.
This
is
also
the
least
frequent
genotype
in
other
populations
(to
find
out
what
the
three-letter
population
codes
are,
have
a
look
at
our
FAQ:
http://www.ensembl.org/Help/Faq?id=328).
(c)
Click
on
phylogenetic
context.
The
ancestral
allele
is
T
and
it is
inferred
from
the
alignment
in
primates.
Select
the
37
eutherian
mammals
EPO
LOW
COVERAGE
alignment
and
click
on
Go.
A
region
containing
the
SNP
(highlighted
in
red
and
placed
in
the
centre)
and
its
flanking
sequence
are
displayed.
The
T
allele
is
conserved
in
all
but
three
of
the
37
eutherian
mammals
displayed.
Note
that
one
species
has
no
alignment
in
that
region
and
many
other
species
have
no
variation
database.
(d)
Click
Phenotype
Data
at
the
left
of
the
Variation
page.
This
variation
is
associated
with
diabetes,
multiple
sclerosis
and
coeliac disease.
There
are
known
risk
alleles
for
both
multiple
sclerosis
and
coeliac disease
and
the
corresponding
P
values
are
provided.
The
allele
A
is
associated
with
coeliac
disease.
Note
that
the
alleles
reported
by
Ensembl
are
T/C.
Ensembl
reports
alleles
on
the
forward
strand.
This
suggests
that
A
was
reported
on
the
reverse
strand
in
the
PubMed
article.
You
can
view
External
Data
sources
that
mirror
data
from
SNPedia
and
LOVD.
We
share
information
about
the
effects
of
variations
in
DNA,
citing
peer-reviewed
scientific
publications.
Click
on
SNPedia
and
LOVD
in
the
left
hand
menu
to
explore
further.
No
LOVD
data
was
found
for
this
variant
so
far.
Exercise
16
Exploring
a
SNP
in
human
(a)
Go
to
the
Ensembl
homepage
(http://www.ensembl.org/).
Type
rs1801133
in
the
Search
box,
then
click
Go.
Click
on
rs1801133.
(b)
Click
on
Genes
and
Regulation
in
the
side
menu
(or
the
Genes
and
Regulation
icon).
No,
rs1801133
is
a missense
variant
in
four
MTHFR
transcripts.
It's
a
downstream
gene
variant
of
ENST00000418034.
(c)
In
Ensembl,
the
alleles
of
rs1801133
are
given
as
G/A
because
these
are
the
alleles
in
the
forward
strand
of
the
genome.
In
the
literature
and
in
dbSNP,
the
alleles
are
given
as
C/T
because
the
MTHFR
gene
is
located
on
the
reverse
strand.
The
alleles
in
the
actual
gene
and
transcript
sequences
are
C/T.
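The relationship between the two ways of writing the alleles is a simple base complement. As a minimal Python sketch (the function name is hypothetical, and it handles single-base alleles only; longer alleles would also need reversing):

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def alleles_on_other_strand(allele_string):
    """Complement each single-base allele in a string such as 'G/A'."""
    return "/".join(COMPLEMENT[a] for a in allele_string.upper().split("/"))


print(alleles_on_other_strand("G/A"))  # C/T  (rs1801133, forward strand -> gene strand)
print(alleles_on_other_strand("T/C"))  # A/G  (rs1738074 from Exercise 15)
```

The second call illustrates why an A risk allele in a publication can correspond to the T/C alleles shown in Ensembl.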
(d)
Click
on
Population
genetics
in
the
side
menu.
In
all
populations
but
two
(from
the
1000
genomes
and
HapMap
projects),
the
allele
G
is
the
major
one.
The
two
exceptions
are:
CLM
(Colombian
in
Medellin;
1000
Genomes),
HCB
(Han
Chinese
in
Beijing,
China;
HapMap).
(e)
Click
on
Phenotype
Data
in
the
left
hand
side
menu.
The
specific
study
where
the
association
was
originally
described
is
given
in
the
Phenotype
Data
table.
Click
on
pubmed/20031578
for
more
details.
The
association
between
rs1801133
and
homocysteine
levels
is
described
in
the
paper
Novel
associations
of
CPS1,
MUT,
NOX4
and
DPEP1
with
plasma
homocysteine
in
a
healthy
population:
a
genome-wide
evaluation
of
13,974
participants
in
the
Women's
Genome
Health
Study
(Pare
et
al,
Circ
Cardiovasc
Genet.
2009
Apr;2(2):142-50).
(f)
Click
on
Phylogenetic
Context
in
the
side
menu.
Select
Alignment:
6
primates
EPO
and
click
Go.
Gorilla,
orangutan,
chimp,
macaque
and
marmoset
all
have
a
G
in
this
position.
Please
note
that
there
is
no
variation
database
for
gorilla
and
marmoset
though.
(g)
Go
to
http://neandertal.ensemblgenomes.org/
and
type
rs1801133
in
the
Search
Neandertal
text
box.
Click
Go.
Click
on
rs1801133
on
the
results
page.
Click
on
Jump
to
region
in
detail.
Click
on
Configure
this
page
in
the
side
menu.
Click
on
Variation
features.
Select
All
variations
Normal.
SAVE
and
close.
Draw
a
box
of
about
50
bp
around
rs1801133
(shown
in
yellow
in
the
centre
of
the
display).
Click
on
Jump
to
region
on
the
pop-up
menu.
The
Sequences
track
shows
that
there
are
four
reads
for
Neanderthal
at
the
position
of
rs1801133,
all
with
a
G,
so
based
on
these
(very
limited)
data
there
is
no
evidence
that
both
alleles
were
already
present
in
Neanderthal.
Exercise
17
Structural
variation
in
human
(a)
Go
to
the
Ensembl
homepage
(http://www.ensembl.org/).
Select
Search:
Human
and
type
ccl3l1
in
the
search
box.
Click
Go.
Click
on
CCL3L1
(Human
Gene)
at
the
top.
(b)
Click
on
Structural
Variation
in
the
side
menu.
Yes,
CNVs
have
been
annotated
for
this
gene
by
multiple
studies,
as
indicated
by
the
many
bars
in
the
larger
and
smaller
structural
variants
tracks
in
the
display.
Details
are
given
in
the
table
below
the
display.
Note:
Can
you
do
this
with
BioMart?
Exercise
18
Exploring
a
SNP
in
mouse
(a)
Go
to
www.ensembl.org,
type
rs29522348
in
the
search
box.
Click
on
rs29522348
(Mouse
Variation).
SNP
rs29522348
is
located
on
17:73924993.
In
Ensembl,
its
alleles
are
provided
as
in
the
forward
strand.
(b)
Click
on
HGVS
names
to
reveal
information
about
HGVS
nomenclature.
This
SNP
has
got
three
HGVS
names,
one
at
the
genomic
DNA
level
(17:g.73924993C>T),
one
at
the
transcript
level
(c.721G>A)
and
one
at
the
protein
level
(p.Val241Ile).
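The three HGVS names can be checked against each other. The Python sketch below parses simple substitution names of the kind shown above and confirms that coding position 721 falls in codon 241. The regular expressions and function names are hypothetical helpers that cover only this simple pattern, not the full HGVS nomenclature.

```python
import re

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Optional "reference:" prefix, then g. or c., a position and REF>ALT.
SUBSTITUTION = re.compile(
    r"^(?:[^:]+:)?(?P<kind>[gc])\.(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)
# Three-letter amino acid codes around a residue number, e.g. p.Val241Ile.
PROTEIN = re.compile(r"^p\.(?P<ref>[A-Z][a-z]{2})(?P<pos>\d+)(?P<alt>[A-Z][a-z]{2})$")


def parse_substitution(name):
    m = SUBSTITUTION.match(name)
    if m is None:
        raise ValueError(f"not a simple substitution: {name!r}")
    return m.group("kind"), int(m.group("pos")), m.group("ref"), m.group("alt")


def parse_protein(name):
    m = PROTEIN.match(name)
    if m is None:
        raise ValueError(f"not a simple protein substitution: {name!r}")
    return m.group("ref"), int(m.group("pos")), m.group("alt")


def codon_number(cds_position):
    """1-based codon index containing a 1-based coding-sequence position."""
    return (cds_position - 1) // 3 + 1


_, _, g_ref, g_alt = parse_substitution("17:g.73924993C>T")
_, c_pos, c_ref, c_alt = parse_substitution("c.721G>A")
_, p_pos, _ = parse_protein("p.Val241Ile")

print(codon_number(c_pos) == p_pos)
print(COMPLEMENT[g_ref] == c_ref and COMPLEMENT[g_alt] == c_alt)
```

The second check shows that the transcript-level alleles are the complement of the genomic ones, which is what one would expect if the transcript lies on the reverse strand.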
(c)
In
Ensembl,
the
allele
that
is
present
in
the
reference
genome
assembly
is
always
put
first
(C
is
the
allele
for
the
reference
mouse
genome,
strain
C57BL/6J).
(d)
Click
on
Individual
genotypes
in
the
left
hand
side
menu.
In
the
summary
of
genotypes
by
population,
click
on
Show
for
PERLEGEN:MM_PANEL2,
or
search
for
the
two
strain
names.
There
are
indeed
differences
between
the
genotypes
reported
in
those
two
different
strains.
The
genotype
reported
in
NOD/LTJ
is
TT
whereas
in
BALB/cByJ
the
genotype
is
CC.
VEP
Exercise
19
VEP
(Variant
Effect
Predictor
tool)
(a)
Go
to
www.ensembl.org
and
click
on
the
link
tools
at
the
top
of
the
page.
Currently
there
are
5
tools
listed
in
that
page.
Click
on
Variant
Effect
Predictor
and
enter
the
three
variants
as
below:
7
117171039
117171039
G/A
7
117171092
117171092
T/C
7
117171122
117171122
T/C
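Each input line above has four whitespace-separated fields: chromosome, start, end and an allele string. As a minimal Python sketch of how such lines can be read before submitting them (the class and function names are hypothetical, and treating the first allele as the reference allele is an assumption):

```python
from typing import NamedTuple


class VepInputLine(NamedTuple):
    chrom: str
    start: int
    end: int
    ref: str
    alt: str


def parse_vep_line(line):
    """Parse 'chrom start end ref/alt'; any extra columns are ignored."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 columns, got {len(fields)}: {line!r}")
    chrom, start, end, alleles = fields[:4]
    ref, alt = alleles.split("/", 1)
    return VepInputLine(chrom, int(start), int(end), ref, alt)


variants_text = """\
7 117171039 117171039 G/A
7 117171092 117171092 T/C
7 117171122 117171122 T/C
"""
variants = [parse_vep_line(l) for l in variants_text.splitlines() if l.strip()]
for v in variants:
    print(v)
```

For a single-nucleotide variant the start and end coordinates are identical, as in all three lines here.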
Note:
Variation
data
input
can
be
done
in
a
variety
of
formats.
See
more
details
here
http://www.ensembl.org/info/docs/variation/vep/vep_formats.html
Under
the
non-synonymous
SNP
predictions
option,
select
prediction
only
for
SIFT
and
PolyPhen,
then
click
Next.
The
output
format
is
either
in
HTML
or
text.
You
will
get
a
table
with
the
consequence
terms
from
the
Sequence
Ontology
project
(http://www.sequenceontology.org/)
(i.e.
synonymous,
missense,
downstream,
intronic,
5'
UTR,
3'
UTR,
etc)
provided
by
VEP
for
the
listed
SNPs.
You
can
also
upload
the
VEP
results
as
a
track
and
view
them
on
Location
pages
in
Ensembl.
SIFT
and
PolyPhen
are
available
for
missense
SNPs
only.
For
two
of
the
entered
positions,
the
variations
have
been
predicted
to
be
probably
damaging/deleterious
(coordinate
117171092)
and
benign/tolerated
(coordinate
117171122).
All
three
variations
have
already
been
described
and
are
known
as
rs1800078,
rs1800077
and
rs35516286
in
dbSNP
and
other
sources
(databases,
literature,
etc).
(b)
In
order
to
see
your
uploaded
SNPs
as
a
track
in
Region
in
detail,
you
will
need
to
choose
a
name
for
this
upload
(e.g.
VEP)
when
entering
the
data
into
the
VEP
tool.
So
you
may
need
to
enter
the
data
again.
Once
you
have
done
that
and
given
a
name
to
the
upload,
click
on
any
link
under
the
location
column
(in
the
VEP
results
table)
to
see
your
newly
added
VEP
track
with
the
three
variations
in
the
Location
tab
(or
Region
in
detail
view)
in
Ensembl.
Answers
Comparative
Genomics
Gene
trees
and
homologues
Exercise
20
Orthologues,
paralogues
and
gene
trees
for
the
human
BRAF
gene.
(a)
Go
to
www.ensembl.org,
choose
human
and
search
for
BRAF.
Click
through
to
the
Gene
tab
view.
On
the
gene
tab,
click
on
Orthologues
at
the
left
side
of
the
page
to
see
all
the
63
orthologous
genes.
There
are
orthologues
in
8
primates.
The
percentage
of
identical
amino
acids
in
the
Tarsier
protein
(the
orthologue)
compared
with
the
gene
of
interest,
i.e.
human
BRAF
(the
target
species/gene)
is
69%.
This
is
known
as
the
Target
%ID.
The
identity
of
the
gene
of
interest
(human
BRAF)
when
compared
with
the
orthologue
(Tarsier
BRAF,
the
query
species/gene)
is
62%
(the
query
%ID).
Note
the
difference
in
the
values
of
the
Target
and
Query
%
ID
reflects
the
different
protein
lengths
for
the
human
and
tarsier
BRAF
genes.
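The effect of protein length on the two percentages can be illustrated with a short Python sketch. It assumes that each %ID is the number of identical aligned residues divided by the length of the respective protein. The residue count and both protein lengths below are made-up values, chosen only so that the result reproduces the 69% and 62% quoted above; they are not read from Ensembl.

```python
def percent_identity(n_identical, protein_length):
    """Identical residues as a percentage of one protein's length."""
    return 100.0 * n_identical / protein_length


# Made-up values for illustration only (not taken from Ensembl).
n_identical = 475      # identical positions in the pairwise alignment
tarsier_length = 688   # hypothetical length of the shorter orthologue
human_length = 766     # hypothetical length of the longer protein

target_pct = percent_identity(n_identical, tarsier_length)
query_pct = percent_identity(n_identical, human_length)
print(round(target_pct), round(query_pct))  # 69 62
```

The same number of identical residues gives a higher percentage for the shorter protein, which is why the two values differ.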
(b)
There
is
more
than
one
way
to
get
to
the
answer.
Option
1:
Go
to
the
orthologues
page
and
click
on
the
marmoset
orthologue
to
open
the
gene
tab.
Click
Genomic
alignments
at
the
left.
Then
select
Alignment:
Human
(Homo
sapiens)
lastz
and
click
Go.
The
red
sequence
is
present
in
exons,
so
there
is
a
gene
in
both
species
in
this
region.
You
can
find
where
the
start
and
stop
codons
are
located
if
you
configure
this
page
and
select
START/STOP
codons.
Option
2:
Go
to
location
tab
of
the
marmoset
BRAF
gene
and
then
click
on
Region
Comparison
view
at
the
left.
Click
on
Select
species
or
regions
at
the
left
and
click
on
the
+
to
select
Human
(Homo
sapiens)
lastz
then
save
and
close.
You
should
see
an
alignment
between
the
human
BRAF
gene
region
and
the
BRAF
gene
region
for
the
marmoset.
(Note:
To
see
a
blue
line
connecting
homologous
genes
in
the
Region
Comparison
view
page,
click
on
configure
this
page
and
under
Comparative
features
select
join
genes.
Zoom
out
on
the
location
view
to
see
blue
lines
connecting
all
the
homologous
genes
between
marmoset
and
human
genes
in
that
region).
Whole
genome
alignments
Exercise
21
Zebrafish
orthologues
(a)
Start
in
the
Location
tab
(region
in
detail)
for
dbh
(ENSDARG00000069446).
Click
on
Alignments
(Image)
at
the
left,
and
select
the
5
teleost
fish
EPO
alignment
in
the
pull-down
menu
in
the
view.
The
zebrafish,
stickleback,
medaka,
fugu,
and
tetraodon
are
shown
in
this
region.
All
the
species
show
a
gene
in
the
aligned
region.
This
can
also
be
seen
in
the
Alignments
(text)
page
(the
exons
are
highlighted
in
red).
(b)
You
can
export
the
alignments
from
either
Alignments
(images)
or
Alignments
(text)
menus
in
the
Location
tab.
Click
on
the
blue
Export
data
button
at
the
left,
and
choose
Clustal
from
the
list.
(c)
Click
on
Region
in
detail
in
the
left
hand
menu.
Turn
on
the
multiple
alignment,
constrained
elements
and
conservation
score
for
5
teleost
fish
EPO
tracks,
all
under
the
Comparative
genomics
menu
by
configuring
the
page.
The
5
teleost
fish
EPO
track
just
shows
that
the
whole
region
for
the
dbh
gene
can
be
aligned
among
those
five
species
of
fish.
The
Constrained
elements
and
Conservation
score
tracks
show
where
the
conserved
sequence
is
located
in
the
alignment.
Higher
conservation
regions
match
up
with
exonic
regions
(exons
tend
to
be
highly
conserved)
of
the
gene.
Note
that
there
are
intronic
regions
that
seem
to
be
fairly
conserved
across
the
species
available.
Click
on
the
Track
name
and
the
(information
button)
to
read
more
about
constrained
elements
(or
any
other
data
track).
Exercise
22
Synteny
(a)
Change
the
species
to
dog
next
to
the
image.
Yes,
there
are
multiple
syntenic
regions
in
dog
to
human
chromosome
3,
which
is
in
the
centre
of
this
view.
Dog
chromosomes
6,
20,
23,
31,
33,
and
34
have
syntenic
regions
to
human
chromosome
3.
(b)
Scroll
down
to
the
bottom
of
the
page.
There
is
a
homologue
in
dog
of
human
RHO.
Click
Centre
on
gene
RHO
to
compare
the
genes
between
human
and
dog
in
this
syntenic
block.
Exercise
23
Whole
genome
alignments
(a)
Go
to
the
Ensembl
homepage
(http://www.ensembl.org/).
Select
Search:
Human
and
type
brca2
in
the
search
box.
Click
Go.
Click
on
13:32889611-32973805:1
below
BRCA2
(Human
Gene).
You
may
want
to
turn
off
all
tracks
that
you
added
to
the
display
in
the
previous
exercises
as
follows:
Click
Configure
this
page
in
the
side
menu.
Click
Reset
configuration.
SAVE
and
close.
(b)
Click
Configure
this
page
in
the
side
menu.
Click
on
BLASTZ/LASTz
alignments
under
the
Comparative
genomics
menu.
Select
Chicken
(Gallus
gallus)
-
BLASTZ_NET
Normal,
Chimpanzee
(Pan
troglodytes)
BLASTZ_NET
Normal,
Mouse
(Mus
musculus)
BLASTZ_NET
-
Normal
and
Platypus
(Ornithorhynchus
anatinus)
-
BLASTZ_NET
-
Normal.
Click
on
Translated
blat
alignments.
Select
Anole
Lizard
(Anolis
carolinensis)
-
TRANSLATED_BLAT_NET
-
Normal
and
Zebrafish
(Danio
rerio)
-
TRANSLATED_BLAT_NET
Normal.
SAVE
and
close.
Yes,
the
degree
of
conservation
does
reflect
the
evolutionary
relationship
between
human
and
the
other
species;
the
highest
degree
of
conservation
is
found
in
chimp,
followed
by
mouse,
platypus,
chicken,
lizard
and
zebrafish,
respectively.
Especially
the
exonic
sequences
of
BRCA2
seem
to
be
highly
conserved
between
the
various
species,
which
is
what
is
to
be
expected
because
these
are
supposed
to
be
under
higher
selection
pressure
than
intronic
and
intergenic
sequences.
(c)
Click
Configure
this
page
in
the
side
menu.
Click
on
Conservation
regions
under
the
Comparative
genomics
menu.
Select
Conservation
score
for
37
eutherian
mammals
EPO_LOW_COVERAGE,
Conservation
score
for
21
amniota
vertebrates
Pecan
and
Constrained
elements
for
21
amniota
vertebrates
Pecan.
SAVE
and
close.
Both
the
Conservation
score
and
Constrained
elements
tracks
largely
correspond
with
the
data
seen
in
the
pairwise
alignment
tracks;
all
exons
of
the
BRCA2
gene
show
a
high
degree
of
conservation
(Note
the
UTRs
which
are
not
conserved).
(d)
Click
on
a
constrained
element
(brown
block).
Click
on
View
alignments
(text)
in
the
pop-up
menu.
Click
Configure
this
page
in
the
side
menu.
Select
Conservation
regions:
All
conserved
regions.
SAVE
and
close.
The
conserved
regions
will
be
shown
in
light
blue.
(e)
Click
on
the
Gene:
BRCA2
tab.
Click
on
Genomic
alignments
under
Comparative
Genomics
in
the
side
menu.
Select
Alignment:
6
primates
EPO.
Click
Go.
Click
Configure
this
page
in
the
side
menu.
Select
Conservation
regions:
All
conserved
regions.
SAVE
and
close.
The
conserved
regions
will
be
shown
in
light
blue.
Answers
Regulation
Exercise
24
Gene
regulation:
Human
STX7
(a)
Search
for
human
gene
STX7
from
the
home
page.
Click
on
Location
in
the
search
results.
Regulatory
features
from
the
Ensembl
regulatory
build
are
based
on
indicators
of
open
chromatin
such
as
CTCF
binding
sites,
DNase
I
hypersensitive
sites,
and
Transcription
Factor
binding
sites.
The
Regulatory
features
are
turned
on
by
default
in
the
Region
in
detail
view.
There
are
many
regulatory
features
mapping
to
the
STX7
transcripts,
including
the
5'
end.
Click
on
the
Reg.
Feats
track
name
to
jump
to
an
article
explaining
the
underlying
data.
Click
and
drag
the
Reg.
Feats
track
next
to
the
Genes
(Merged
Ensembl/Havana)
track
to
better
compare
where
the
Regulatory
features
(grey
boxes)
are
in
the
gene.
(b)
See
the
legend
below
the
Region
in
detail
view
to
find that
the
predicted
enhancer
segments
are
coloured
in
yellow.
Two
appear
in
the
HUVEC
cell
type
only
(out
of
the
three
cells
chosen).
(c)
Configure
this
page
and
click
on
Open
chromatin
& TFBS.
Turn
on
both
peaks
and
signal
for
DNase
I
and
FAIRE
in
HeLa-S3
cells
(the
boxes
in
this
configure
this
page
window
will
turn
blue.
For
more
information
on
how
to
select
and
view
the
supporting
data,
click
on
Show
tutorial
in
the
pop
up
window).
Close
the
menu.
There
are
two
DNase
I
hypersensitive
sites
in
the
5'
exon
of
STX7.
Click
on
the
coloured
block
to
find
out
that
the
DNase I
enriched
sites
in
HeLa-S3
cells
come
from
the
ENCODE
project.
There
is
no
FAIRE
site
known
in
this
region.
(d)
Configure
this
page
and
click
on
Histones
&
polymerases.
Change
the
Filter
by
menu
from
All
classes
to
Histone.
Select
all
the
histone
modifications
available
for
HeLa
cells
(some
of
them
might
be
on
by
default).
Save
and
close
the
menu.
H3K4me3,
H3K9ac
and
H3K27ac
sites
have
been
found
in
the
5'
region
of
STX7
in
HeLa-S3
cells.
(e)
Click
on
configure
this
page
and
choose
the
DNA
Methylation
menu.
Scroll
down
to
Enable/disable
all
External
data
then
turn
on
the
first
track
in
the
list
(MeDIP-chip
B-cells).
Save
and
close
the
menu.
The
CpG
sites
at
the
5'
end
of
STX7
are
not
highly
methylated
(note
the
yellow/green
bars).
Yellow,
green,
and
blue
bars
represent
unmethylated,
intermediately
methylated,
and
methylated
regions,
respectively.
For
more
information
on
human
DNA
methylation
DAS
tracks,
see:
www.ensembl.org/info/docs/funcgen/index.html
(f)
Click
Share
this
page
in
the
side
menu.
Select
the
link
and
copy.
Go
into
your
email
account
and
compose
an
email
to
yourself.
Paste
the
link
in,
then
send.
Open
the
email
and
click
on
your
link.
Exercise
25
Regulatory
features
in
human
(a)
Go
to
the
Ensembl
homepage
(http://www.ensembl.org/).
Select
Search:
Human
and
type
6:32540000-32620000
in
the
search
box.
Click
Go.
You
may
want
to
turn
off
all
tracks
that
you
added
to
the
display
in
the
previous
exercises
as
follows:
Click
Configure
this
page
in
the
side
menu.
Click
Reset
configuration.
SAVE
and
close.
(b)
You
can
click
on
all
the
regulatory
features
shown
in
the
Reg.
Feats
track
that
are
located
in
the
intergenic
region
of
those
genes.
The
resulting
pop-up
window
for
each
of
those
will
show
the
core
attributes
underlying
the
regulatory
features.
Yes,
there
is
one
regulatory
feature
around
coordinates
32589947-32591273
that
has
CTCF
binding
data
as
part
of
its
core
evidence.
Its
ID
is
ENSR00000488025.
(c)
Click
Configure
this
page
in
the
side
menu.
Click
on
Regulation
Open
chromatin
&
TFBS.
Select
MultiCell
-
Track
style:
Peaks.
SAVE
and
close.
CTCF
binding
has
been
detected
at
this
position
in
eleven
of
the
cell/tissue
types
analysed.
(CD4,
GM06990,
GM12878,
H1ESC,
HMEC,
HSMM,
HUVEC,
HeLa-S3,
HepG2,
NH-A,
NHEK)
(d)
Click
Configure
this
page
in
the
side
menu.
Click
on
Regulation
Histones
&
polymerases.
According
to
the
Histones
&
Polymerases
configuration
matrix
the
most
information
on
histone
acetylation
is
available
for
CD4
cells.
Hover
over
CD4
in
the
Histones
&
Polymerases
configuration
matrix.
Select
Select
features
for
CD4
-
All.
SAVE
and
close.
Yes,
the
region
that
shows
CTCF
binding
is
also
a
region
of
high
acetylation
of
histone
2A,
2B,
3
and
4
in
CD4
cells.
Answers
Advanced
exercise
Methylation
data
in
human
(a)
Go
to
the
Ensembl
homepage
(http://www.ensembl.org/).
Select
Search:
Human
and
type
PDHA2
in
the
for
text
box.
Click
Go.
Click
on
4:96761239-96762625:1.
Zoom
out
one
step,
so
that
the
5kb
region
around
the
PDHA2
gene
is
shown.
You
may
want
to
turn
off
all
tracks
that
you
added
to
the
display
in
the
previous
exercises
as
follows:
Click
Configure
this
page
in
the
side
menu.
Click
Reset
configuration.
SAVE
and
close.
(b)
Click
Configure
this
page
in
the
side
menu.
Type
cpg
in
the
Find
a
track
box.
Select
CpG
islands.
SAVE
and
close.
No
CpG
islands
are
shown.
Because
inclusion
of
a
CpG
island
in
the
Ensembl
database
for
human
requires
a
minimum
length
of
400
bp,
the
reason
for
this
could
be
that
the
CpG
islands
in
the
PDHA2
gene
are
shorter
than
400
bp.
However,
there
is
a
%GC
track,
which
shows
that
the
region
that
comprises
the
5'
part
of
the
PDHA2
gene
and
the
region
directly
upstream
of
the
gene
has
a
high
%GC
(the
red
line
in
the
%GC
track
indicates
50%
GC).
It
is
difficult
/
impossible
to
distinguish
individual
CpG
islands
in
this
track,
though.
(c)
Click
Export
data
in
the
side
menu.
Click
Next>.
Click
on
Text.
Select
and
copy
the
sequence.
Go
to
http://www.ebi.ac.uk/Tools/emboss/cpgplot/index.html.
Paste
the
sequence
into
the
text
box.
Click
Run.
CpGPlot
does
confirm
the
existence
of
two
CpG
islands
in
the
PDHA2
gene
region
of
lengths
200
and
263
bp,
respectively.
So,
it
is
indeed
because
of
their
length
being
less
than
400
bp
that
these
CpG
islands
are
not
present
in
the
Ensembl
database.
(d)
Click
Add
your
data
in
the
side
menu
(Note
that
if
you
have
previously
uploaded
data
to
Ensembl,
this
box
will
say
Manage
your
data
instead).
Click
on
Upload
Data.
Type
CpG
islands
in
the
Name
for
this
upload
(optional)
box.
Select
Data
format:
BED.
Copy
the
following
into
the
Paste
file
box:
chr4
96761176
96761375
cpg_island_1
chr4
96761500
96761762
cpg_island_2
Click
Upload.
Click
on
Go
to
nearest
region
with
data:
4:96701276-96811276.
The
two
CpG
islands
should
now
be
shown
on
the
Region
in
detail
page.
They
should
coincide
with
the
regions
of
high
%GC.
Zoom
in
on
the
two
CpG
islands.
To
display
the
names
of
the
CpG
islands:
Hover
over
the
CpG
islands
track
name.
Hover
over
the
icon
of
the
cog-wheel.
Select
Labels.
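The two pasted lines can be checked against the island lengths reported by CpGPlot in (c) and the 400 bp minimum mentioned in (b). The Python sketch below does this with a hypothetical helper, `parse_bed_line`. The lengths of 200 and 263 bp are reproduced when the coordinates are read as 1-based and inclusive. A strict BED reading, with a 0-based start, would give `end - start`, which is one base less.

```python
MIN_CPG_ISLAND_LENGTH = 400  # bp, the minimum length quoted in answer (b)

bed_text = """\
chr4 96761176 96761375 cpg_island_1
chr4 96761500 96761762 cpg_island_2
"""


def parse_bed_line(line):
    """Split a four-column BED-style line into typed fields."""
    chrom, start, end, name = line.split()[:4]
    return chrom, int(start), int(end), name


islands = [parse_bed_line(l) for l in bed_text.splitlines() if l.strip()]
for chrom, start, end, name in islands:
    inclusive_length = end - start + 1  # coordinates read as 1-based inclusive
    half_open_length = end - start      # strict BED reading (0-based start)
    too_short = inclusive_length < MIN_CPG_ISLAND_LENGTH
    print(name, inclusive_length, half_open_length, too_short)
```

Both islands fall below the 400 bp minimum under either reading, which is consistent with their absence from the CpG islands track.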
(e)
Drag
your
CpG
islands
track
so
that
it
is
next
to
the
%GC
track.
Click
Share
this
page
in
the
side
menu.
Select
the
link
and
copy.
Paste
into
your
internet
browser
to
view.
(f)
Click
Configure
this
page
in
the
side
menu.
Click
on
Regulation
DNA
Methylation.
Select
all
MeDIP
tracks
in
Normal
mode.
SAVE
and
close.
Yellow,
green
and
blue
represent
unmethylated,
intermediately
methylated
and
methylated
regions,
respectively
(see
the
Methylation
Legend
at
the
bottom
of
the
page).
It
can
be
seen
that
the
region
around
the
5'
part
of
the
PDHA2
gene
is
methylated
in
all
assayed
tissues
and
cell
lines,
except
in
sperm.
The
MeDIP-seq
track
for
sperm
shows
that
the
unmethylated
regions
coincide
with
the
CpG
islands
found
by
CpGPlot.
(g)
Click
on
Configure
this
page,
then
select
RNASeq
models.
Turn
on
the
BAM
files
for
all
the
tissues
in
Coverage
only.
You
will
see
histograms
of
RNASeq
coverage
for
each
of
the
tissues.
All
of
these
histograms
appear
to
be
the
same
height,
but
the
numbers
at
the
left
indicate
the
peak.
The
largest
number
is
for
the
merged
read,
10,048.
For
the
tissue-specific
read,
Testes
have
a
peak
of
1850,
higher
than
all
the
other
tissues.
There
are
also
more
and
wider
peaks
in
the
Testes
track.
The
unmethylated
CpG
islands
in
sperm
suggest
that
this
gene
is
negatively
regulated
by
CpG
island
methylation.
(h)
Click
on
Configure
this
page,
then
select
Comparative
genomics.
Turn
on
the
tracks
for
the
Constrained
elements
for
37
eutherian
mammals
and
Conservation
score
for
37
eutherian
mammals.
The
region
of
the
gene
itself
has
high
GERP
scores,
indicated
by
constrained
elements
over
most
of
the
gene.
There
is
no
apparent
difference
in
the
conservation
score
between
the
CpG
islands
and
their
flanking
regions.
(i)
Click
on
the
Transcript
Tab,
Transcript:
PDHA2-001
and
select
Ontology
table.
There
are
ten
terms
in
the
table,
the
first
being
GO:0006090,
pyruvate
metabolic
process.
To
export
the
list
use
BioMart.
Click
on
BioMart
in
the
top
bar.
Choose
Ensembl
Genes
74
and
Homo
sapiens
genes
(GRCh37).
Click
on
Filters.
Open
the
menu
for
GENE
ONTOLOGY.
Select
GO
Term
Accession
and
put
GO:0006090
into
the
box.
Click
on
Attributes.
Choose
Sequences.
Expand
SEQUENCES
and
select
Unspliced
(Gene).
Expand
Header
information
and
deselect
Ensembl
Transcript
ID.
Click
Results.
You
can
export
these
results
if
you
wish.
(j)
Go
to
the
REST
API
documentation
page
at
http://beta.rest.ensembl.org/documentation.
Click
on
GET
sequence/id/:id
to
get
the
documentation
for
this
command.
You
will
need
the
stable
ID
of
PDHA2;
go
to
the
browser
page
to
find
that
it
is
ENSG00000163114.
Use
the
documentation
to
construct
a
URL
in
the
correct
form,
ie:
http://beta.rest.ensembl.org/sequence/id/:id?format=fasta
Add
the
ID
to
the
URL
to
create:
http://beta.rest.ensembl.org/sequence/id/ENSG00000163114?format=fasta
This
URL
will
give
you
the
sequence.
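The URL construction described in (j) can be scripted. The following Python sketch builds the same URL from a stable ID. The server address and the `format=fasta` parameter are copied from the text above and may differ on the current REST server. The function names are hypothetical, and `fetch_sequence` needs network access, so it is defined but not called here.

```python
from urllib.parse import quote
from urllib.request import urlopen

REST_SERVER = "http://beta.rest.ensembl.org"  # server named in the text; may have moved


def sequence_url(stable_id, fmt="fasta"):
    """Fill the :id placeholder of the sequence/id/:id endpoint."""
    return f"{REST_SERVER}/sequence/id/{quote(stable_id)}?format={fmt}"


def fetch_sequence(stable_id):
    """Download the sequence as text (requires network access; not called here)."""
    with urlopen(sequence_url(stable_id)) as response:
        return response.read().decode("utf-8")


print(sequence_url("ENSG00000163114"))
```

Keeping the URL builder separate from the download step makes it easy to inspect the address in a browser before fetching anything.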
Quick
Guide
to
Databases
and
Projects
Here
is
a
list
of
databases
and
projects
you
will
come
across
in
these
exercises.
Google
any
of
these
to
learn
more.
Projects
include
many
species,
unless
otherwise
noted.
Other
help:
The
Ensembl
Glossary:
http://www.ensembl.org/Help/Glossary
Ensembl
FAQs:
http://www.ensembl.org/Help/Faq
SEQUENCES
EMBL-Bank,
NCBI
GenBank,
DDBJ
Contain
nucleic
acid
sequences
deposited
by
submitters
such
as
wet-lab
biologists
and
gene
sequencing
projects.
These
three
databases
are
synchronised
with
each
other
every
day,
so
the
same
sequences
should
be
found
in
each.
CCDS
coding
sequences
that
are
agreed
upon
by
Ensembl,
VEGA-Havana,
UCSC,
and
NCBI.
(human
and
mouse).
NCBI
Entrez
Gene
NCBI's
gene
collection.
NCBI
RefSeq
NCBI's
collection
of
reference
sequences,
includes
genomic
DNA,
transcripts
and
proteins.
NM
denotes
known
mRNAs
(e.g.
NM_005476)
and
NP
denotes
known
proteins
(e.g.
NP_005467).
UniProtKB
the
Protein
knowledgebase,
a
comprehensive
set
of
protein
sequences.
Divided
into
two
parts:
Swiss-Prot
and
TrEMBL.
UniProt
Swiss-Prot
the
manually
annotated,
reviewed
protein
sequences
in
the
UniProtKB.
High
quality.
UniProt
TrEMBL
the
automatically
annotated,
unreviewed
set
of
proteins
(EMBL-Bank
translated).
Varying
quality.
VEGA
Vertebrate
Genome
Annotation,
a
selection
of
manually
curated
genes,
transcripts,
and
proteins.
(human,
mouse,
zebrafish,
gorilla,
wallaby,
pig,
and
dog).
VEGA-HAVANA
The
main
contributor
to
the
VEGA
project,
located
at
the
Wellcome
Trust
Sanger
Institute,
Hinxton,
UK.
GENE
NAMES
HGNC
HUGO
Gene
Nomenclature
Committee,
a
project
assigning
a
unique
and
meaningful
name
and
symbol
to
every
human
gene.
(Human).
ZFIN
The
Zebrafish
Model
Organism
Database.
Gene
names
are
only
one
part
of
this
project.
(zebrafish).
PROTEIN
SIGNATURES
InterPro
A
collection
of
domains,
motifs,
and
other
protein
signatures.
Protein
signature
records
are
extensive,
and
combine
information
from
individual
projects
such
as
UniProt,
along
with
other
databases
such
as
SMART,
PFAM
and
PROSITE
(explained
below).
PFAM
A
collection
of
protein
families.
PROSITE
A
collection
of
protein
domains,
families,
and
functional
sites.
SMART
A
collection
of
evolutionarily
conserved
protein
domains.
OTHER
PROJECTS
NCBI
dbSNP
A
collection
of
sequence
polymorphisms;
mainly
single
nucleotide
polymorphisms,
along
with
insertion-deletions.
NCBI
OMIM
Online
Mendelian
Inheritance
in
Man
a
resource
showing
phenotypes
and
diseases
related
to
genes
(human).