Paving roads through data mountains, consortia are developing workflows and tools for widespread use.
Cancer geneticist Matthew Meyerson, who is at the Dana-Farber Cancer Institute and the Broad Institute of MIT and Harvard, tracks the many ways tumors wreak chaos in orderly cells. He wants to squeeze into his schedule a dedicated time period in Gad Getz's lab at the Broad Institute to hone his computational skills for analyzing data about cancer genomes.

Such collaborations could become more common as scientists dive into data sets generated by large consortia including The Cancer Genome Atlas (TCGA) Research Network and the International Cancer Genome Consortium (ICGC).

To shape an experiment, Getz suggests that scientists first look at existing data. However, this shift in habits is not an easy sell, and doubts about tools and computational approaches abound. To make choosing among the options easier for the

Those data, along with other sequencing results such as exome or mRNA sequence data, are held at the Cancer Genomics Hub at the University of California, Santa Cruz (UCSC), with controlled access for data that could allow individuals to be identified. Nonsequence data are kept at the TCGA data portal.

Recently, a researcher in Getz's group at the Broad downloaded genome sequences of patients' tumor and normal tissue from the Cancer Genomics Hub. "We downloaded something like 20 whole genomes, tumor-normal pairs, in 3 days," Getz says. "That's quite fast."

To create data packets that are easier to handle for analysis than the gigantic raw sequence files, the TCGA centers have created tiers. "Higher-level data, for example, the list of somatic mutations in exome data or copy-number changes along the genome, or expression levels of different genes, all of these are public data," Getz says. Those are much smaller in size than raw sequence files, he says, a difference that can help scientists shopping for more manageable files.

Since its launch in 2008, the ICGC has amassed around 250 terabytes of data from approximately 1,300 donors, in Lincoln Stein's rough estimation. He directs bioinformatics and computational biology at the Ontario Institute for Cancer Research (OICR), which is also the ICGC's data coordination center. ICGC scientists in Asia,

2013 Nature America, Inc. All rights reserved.
izing over 24,000 tumor genomes from 50 tumor types, comparing tumor and normal tissue (ref. 2).

ICGC data are deposited in the European Genome Phenome Archive. Somatic variant data are openly accessible at the ICGC Data Portal, but scientists must apply to access data such as raw sequence, germline mutations or clinical data.

In the past, each ICGC country housed its own data, but that strategy is changing. "The federated model, as we've discovered, has an Achilles heel," Stein says. Network connectivity issues have on occasion made data inaccessible. All the interpreted data are now being copied into a centralized database administered by OICR. This transfer will be completed this autumn. The year-long project is worth the effort because the new system scales well, Stein says. The database uses the distributed MongoDB architecture, which also offers high data availability, he says.

Which tools do I use?

A full toolbox is evidence of a vibrant developer community. Cancer genome analysis tools number "easily in the hundreds," says Stein, and every conference poster session brings more. "It's daunting for experts in the field as well."

"It's good to have many tools, but there is no systematic comparison of these tools," says Getz. Stein and his team find that many published tools have issues beyond a lack of documentation. "They don't install; they crash; they don't pass their own internal tests," Stein says. Although many tool builders test their

run, the more restricted environment of Galaxy's virtual machine offers a predictable version of the operating system with preinstalled libraries, says Stein. "People can star a tool that they like and dislike, so if it doesn't work, it will get low ratings, and we would probably not even bother with it in our benchmarking," he says.

One popular tool in use at OICR is the Broad's sequence-variant caller, GATK. It teases out the alterations between a person's tumor and normal tissue as well as variations from the human reference genome, says Quang Trinh, a computational biologist at OICR. Careful testing precedes the addition of any tool to the OICR production pipeline, an approach he hopes others can follow, too. "Each time you pick a tool, you have to run through

that are still in flux, Getz says. "But we're experimenting with building our compute pipeline on the cloud."

The cloud's best feature, he says, is elasticity. Researchers pay for the amount of compute time used, not for maintaining their own hardware. "You could say, 'OK, now I need 1,000 computers,'" he says. "And then the next day you only need two." This solution works for both big genome centers and small labs, which can use clouds run by Amazon, Google, Microsoft, IBM or other providers.

Once the analysis is done, the virtual computers are released, and the data have to travel, which costs time and money. To address data transfer issues, Getz and his colleagues are exploring ways to keep data in the cloud. He says more help will come from increased access to the high-speed academic Internet backbone called Internet2, which includes 100-gigabit-per-second connections and is being set up by a consortium of universities, government agencies and companies.

Aren't pipelines for oil?

Genomics analysis pipelines cannot get oil from point A to point B, but they can transform data from A to Z. Every 2 weeks, the Broad's Genome Data Analysis Center (GDAC; http://gdac.broadinstitute.org/), with team members from the Broad, MD Anderson Cancer Center and Harvard Medical School, swoops up all the generated TCGA data, normalizes them and makes them available.

In a separate automated analysis pipeline series, these data sets are run through many

Photo caption: Cancer genome analysis tools number "easily in the hundreds," says Lincoln Stein. Credit: OICR.
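To put the data-transfer concern in rough numbers, here is a minimal Python sketch that estimates how long a data set would take to cross a network link. The 250-terabyte size and the 100-gigabit-per-second rate are figures quoted in this article; the function name, the decimal unit conversions and the `efficiency` parameter are assumptions made for illustration, and real transfers rarely sustain the full line rate.

```python
def transfer_time_hours(size_terabytes: float,
                        link_gbit_per_s: float,
                        efficiency: float = 1.0) -> float:
    """Idealized hours needed to move `size_terabytes` over one link.

    Assumptions (not from the article): decimal units (1 TB = 1e12 bytes,
    1 Gbit = 1e9 bits) and a constant fraction `efficiency` of the nominal
    line rate that is actually achieved.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    total_bits = size_terabytes * 1e12 * 8
    bits_per_second = link_gbit_per_s * 1e9 * efficiency
    return total_bits / bits_per_second / 3600


if __name__ == "__main__":
    # Compare a few nominal link speeds for a 250-terabyte data set.
    for gbit in (1, 10, 100):
        hours = transfer_time_hours(250, gbit)
        print(f"{gbit:>3} Gbit/s: {hours:.1f} h")
```

Under these idealized assumptions, 250 terabytes would need roughly 5.6 hours at 100 gigabits per second and roughly 23 days at 1 gigabit per second, which is one way to see why keeping data next to the compute resources is attractive.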
Stein. The team plans to tally their findings into a series of best practices, which stand to help researchers use pipelines.

Can I buy the data analysis?

Beyond open-source tools, many commercial offerings exist. As the Broad widens

Figure: Virtual data factory. Analysis summary for TCGA data, 2013_01_16 analyses run. Credit: M. Noble/Broad Institute.

Analysis report   # Pipelines   % Successful   Download
BLCA              49            100%           Open, Protected
BRCA              66            100%           Open, Protected
CESC              46            100%           Open, Protected
COADREAD          66            100%           Open, Protected
COAD              66            100%           Open, Protected
DLBC              8             100%           Open, Protected
GBM               68            100%           Open, Protected
HNSC              49            100%           Open, Protected
KICH              23            100%           Open, Protected
KIRC              66            100%           Open, Protected
KIRP              63            100%           Open, Protected
LAML              31            100%           Open, Protected
LGG               63            100%           Open, Protected
LIHC              15            100%           Open, Protected
LUAD              66            100%           Open, Protected
LUSC              66            100%           Open, Protected
UCEC              66            100%           Open, Protected
software tools: for example, to detect significant copy-number alterations, correlate methylation status with clinical features or find significantly mutated genes, Getz says. The pipelines run in a computational framework called Firehose, which also generates analysis reports.

Soon the Broad will open Firehose to all TCGA scientists and, eventually, the wider research community. "We want to make the system available so people can install their own tools and run more tools," Getz says. "The future aim is to generate something that looks like a publication automatically, with figures, supplementary information and figure legends," he says. The pipeline report still requires interpretation by scientists, but it jump-starts analysis.

One analysis challenge has been the "Babel Problem," as Broad software engineer Michael Noble calls it. Scientists were not able to precisely refer to TCGA data slices, which reduced reproducibility. "They did not speak the same language," says Getz. To resolve this issue, Noble created Version Stamp to tag each data set and analysis run. Scientists can now identify the specific data they use for a particular analysis.

Firehose has a cousin called SeqWare developed by computational biologist Brian O'Connor during his postdoctoral fellowship at the University of California,

institution or type of computational infrastructure.

Whereas Firehose handles a substantial portion of analytical workflows for TCGA, SeqWare currently handles just the ICGC variant annotation pipeline, says O'Connor.

With SeqWare, data coming off the sequencer flow into a database that is monitored by a software-based "decider" that triggers predetermined workflows for assembly, alignment and analysis. "This type of system has allowed us to automatically analyze thousands of samples with very little human interaction," he says. "Our plans are to release these workflows to the public, which would allow people to replicate our work at their own organization or on the Amazon cloud." There is also a portal for "nontechies" to interact with the system and get analyzed data back.

Not all labs need platforms for large-scale automated analysis of terabases of sequence data. However, that's changing, says O'Connor. As sequencing technologies evolve, individual labs increasingly produce data hills similar to the output of small genome centers from a few years ago.

SeqWare-based workflows are among the many ICGC pipelines. As Stein explains, the consortium is currently addressing this multipipeline situation

vices and Life Technologies analyze the Iceman genome (http://icemangenome.net/), a mummy dating back to 3,300 BC.

Some companies focus on sequence data analysis for drug discovery or clinical uses. Cancer research right now is not unlike the phase when whole-genome sequencing took off, says Thomas Knudsen, CEO of the bioinformatics firm CLC bio, which has customers in academia, biotech and pharma. First, the early adopters in large genome centers built their own tools, and then companies such as his offered theirs. Similarly, large-scale cancer research will

Figure labels: Sequencers; Tracking database; Workflows; Files; Cluster engine; Local and cloud-based computing. Credit: B. O'Connor/OICR.
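The "decider" idea described for SeqWare, a program that watches a tracking database and launches predetermined workflows for new data, can be sketched in a few lines of Python. This is only an illustration of the pattern and not SeqWare's actual API: the `Sample` record, the `WORKFLOWS` table, the data-type labels and the `decide` function are all hypothetical names, and the tracking database is reduced to an in-memory list.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One row of a (hypothetical) tracking database."""
    sample_id: str
    data_type: str          # illustrative labels, e.g. "whole_genome"
    processed: bool = False


# Hypothetical table of predetermined workflows per data type. The step
# names echo the article (assembly, alignment, analysis); the mapping
# itself is an assumption.
WORKFLOWS = {
    "whole_genome": ["assembly", "alignment", "analysis"],
    "exome": ["alignment", "analysis"],
}


def decide(tracking_db: list[Sample]) -> list[tuple[str, str]]:
    """Return (sample_id, workflow) jobs for samples not yet processed.

    Samples whose data type has no registered workflows are left
    untouched so that a later pass can pick them up.
    """
    jobs = []
    for sample in tracking_db:
        if sample.processed:
            continue
        workflows = WORKFLOWS.get(sample.data_type)
        if not workflows:
            continue
        jobs.extend((sample.sample_id, wf) for wf in workflows)
        sample.processed = True   # avoid scheduling the same sample twice
    return jobs


if __name__ == "__main__":
    db = [Sample("tumor_01", "whole_genome"), Sample("normal_01", "exome")]
    for job in decide(db):
        print(job)
```

In a real system the returned jobs would be handed to a cluster or cloud scheduler, and the decider would run repeatedly as new sequencer output is registered; both of those parts are left out here.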
includes scientists seeking additional computational know-how. Customers can approach Knome to find genomic variants in data by using the company's platform, which integrates public data sources and analysis tools.

Early versions of academically produced

Figure: Data packaged into standardized form. Example stamp: awg_lgg__2013_01_16. Packages with the same date are guaranteed to contain the same data subset (for example, custom analyses of lower-grade glioma data). Credit: M. Noble/Broad Institute.
Caption: The Broad Institute confronts the "Babel Problem" that emerged when scientists used TCGA data but could not readily identify data sets. Version Stamp makes each automated analysis identifiable.
scientists to scale up genomic data analysis and include large public-domain data sets such as TCGA, and to view them across genotype and phenotype, says Jonathan Sheldon, global senior director of translational medicine in Oracle's health sciences business unit. "Frankly, bioinformaticians have to spend way too much time doing the mundane but necessary formatting and reformatting work to load these public data into systems ready for analysis; we are productizing this step so they can focus on working with the disease scientists."

To this end, the company built an omics data model, which involves such tasks as defining data structures and how they relate to one another, and a platform that can analyze data from different sequencers and analysis pipelines, either locally or in a secured cloud-based computing environment or a combination of both, Sheldon says.

to build pipelines for you, says Elaine Mardis, who codirects The Genome Institute at Washington University School of Medicine. She also advises DNAnexus, a company offering these types of genome analysis services.

Beyond doubts, questions await

Researchers can use the available data and tools on their own hardware and the cloud to pursue their questions of interest. There are plenty of open questions to plumb because large genome centers do not have time for in-depth analysis, says Meyerson. Besides working to better understand the mix of normal and cancerous cells in a tumor, scientists seek to discern mutations that drive cancer progression. "The identification of drivers and passengers either computationally or experimentally remains a challenge," he says.

Photo caption: The momentum in cancer genomics and analysis stands to help cancer patients, says Elaine Mardis. "That's really at the end of the day why we are doing all of this." Credit: R. Boston/Washington Univ. St. Louis.
Photo caption: "We want to make the system available so people can install their own tools and run more tools," says Gad Getz. Credit: M. Nemchuk/Broad Institute.

The genes most commonly mutated in cancer are turning out to be ones that had been identified on either the gene or pathway level prior to the advent of second-generation sequencing, says Meyerson. But the data are also delivering unexpected results. He and TCGA colleagues discovered previously unreported loss-of-function mutations in the HLA-A gene in over 170 squamous cell lung cancers (ref. 4). The team noted that this discovery speaks to cancer's ability to evade the immune system. Such

is more deeply disordered, especially in terms of rearrangements and all sort of unexpected structural events than we had ever anticipated.

He believes genome-based diagnosis will become common for cancer patients, and his commercial ventures reflect this view, including a licensed patent to LabCorp of America and the launch of Foundation Medicine, which offers sequencing-based cancer diagnosis.

Though individual scientists lacking computational expertise cannot yet take raw whole-genome sequence reads and find the important variants in their samples, they can stitch together open-source tools from the genome centers to analyze their own large data sets, says Mardis.

A little over 5 years ago, her team published the first whole-genome sequence comparison of tumor and normal tissue

ments and then, almost inevitably, relapse into therapy-resistant disease, she says. Scientists do not yet understand what fundamental changes in the genome explain such events, but the momentum in cancer genomics and analysis can address such conundrums, which stands to help cancer patients, she says. "That's really at the end of the day why we are doing all of this."

1. Chin, L., Hahn, W.C., Getz, G. & Meyerson, M. Genes Dev. 25, 534–555 (2011).
2. The International Cancer Genome Consortium et al. Nature 464, 993–998 (2010).
3. Gymrek, M., McGuire, A.L., Golan, D., Halperin, E. & Erlich, Y. Science 339, 321–324 (2013).
4. The Cancer Genome Atlas Research Network. Nature 489, 519–525 (2012).
5. Ley, T.J. et al. Nature 456, 66–72 (2008).

Vivien Marx is technology editor for Nature and Nature Methods (v.marx@us.nature.com).