
Reproducible Research and the Cloud
Dr Kenji Takeda (Kenji.Takeda@Microsoft.com)
Microsoft Research
@azure4research
@ktakeda1
Microsoft Research
Scientific Discovery

The Research Lifecycle
Data acquisition & modelling
Collaboration and visualisation
Analysis & data mining
Dissemination & sharing
Archiving and preserving
fourthparadigm.org
X-Info
The evolution of X-Info and Comp-X for each discipline X
How to codify and represent our knowledge
Data ingest
Managing a petabyte
Common schema
How to organize it
How to reorganize it
How to share with others
Query and Vis tools
Building and executing models
Integrating data and Literature
Documenting experiments
Curation and long-term preservation
The Generic Problems
Experiments & Instruments
Simulations
Literature
Other Archives
facts
facts
facts
facts
Questions
Answers
Data-Intensive Research
Believe it or not: how much can we rely on published data on potential drug targets?
at least 50% of published studies, even those in top-tier academic journals, can't be repeated with the same conclusions by an industrial lab
Osherovich, L. Hedging against academic risk. SciBX 14 Apr 2011 (doi:10.1038/scibx.2011.416).
Cold fusion
Faster than light
Mar-Sep 2011 - OPERA
March 2012 - ICARUS
July 2012 - Corrected paper
Science 2.0 EU Consultation
http://www.consultation-science20.eu/
CLOUD COMPUTING
On-demand services, delivered over the network
Cloud computing provides
Getting what you need, when you need it
Cloud computing is good for
Focussing on your research
Cloud computing is good for
The Cloud democratizes access to scale & economies of scale
Cloud Computing Patterns
[Figure: Compute plotted against time t for each workload pattern, with an "Inactivity Period" label]
On and Off
On & off workloads (e.g. batch job)
Over-provisioned capacity is wasted
Time to market can be cumbersome
Unpredictable Bursting
Unexpected/unplanned peak in demand
Sudden spike impacts performance
Can't over-provision for extreme cases
Growing Fast
Successful services need to grow/scale
Keeping up w/ growth is a big IT challenge
Cannot provision hardware fast enough
Predictable Bursting
Services with micro seasonality trends
Peaks due to periodic increased demand
IT complexity and wasted capacity
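The "wasted capacity" point in the patterns above can be illustrated with a little arithmetic. The following is a minimal sketch, not a cost model: the hourly demand figures and the function names are hypothetical, and it counts machine-hours only, ignoring any price difference between owned and rented machines.

```python
# Hypothetical hourly demand (in machines) for an on-and-off batch workload:
# idle for most of the day, with one 4-hour job that needs 100 machines.
demand = [0] * 10 + [100] * 4 + [0] * 10


def fixed_capacity_usage(demand, capacity):
    """Return (machine-hours paid for, machine-hours wasted) with fixed capacity."""
    paid = capacity * len(demand)
    used = sum(min(d, capacity) for d in demand)
    return paid, paid - used


def on_demand_usage(demand):
    """Return machine-hours paid for when capacity follows demand hour by hour."""
    return sum(demand)


# Fixed capacity sized for the peak, versus capacity rented only when needed.
paid_fixed, wasted = fixed_capacity_usage(demand, capacity=max(demand))
paid_cloud = on_demand_usage(demand)
print(f"fixed: {paid_fixed} machine-hours paid, {wasted} wasted")
print(f"on demand: {paid_cloud} machine-hours paid")
```

Under these assumed numbers, provisioning for the peak pays for the inactivity period as well as the job, which is the waste the On and Off pattern describes.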
Global presence
Datacenter
Edge point
The Microsoft Cloud
Cloud Computing
Choose from multiple runtimes and languages for your applications: Python, Java, PHP, .NET, Node.js
Run Linux on Microsoft Azure Virtual Machines (VHD)
Support multiple frameworks and popular open source applications with Microsoft Azure Web Sites
HDInsight Hadoop for Big Data analysis
Microsoft Azure
http://github.com/windowsazure
Research Cloud Ecosystem
REPRODUCIBLE RESEARCH
http://www.phdcomics.com/comics.php?f=1689
Computational experiments should be recomputable for all time
Recomputation of recomputable experiments should be very easy
It should be easier to make experiments recomputable than not to
Tools and repositories can help recomputation become standard
The only way to ensure recomputability is to provide virtual machines
Runtime performance is a secondary issue
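The statements above argue that only a virtual machine fully ensures recomputability. As a lighter-weight complement (not a substitute for a VM image), an experiment can at least record the software environment it ran in. The sketch below uses only the Python standard library; the function name `environment_manifest` and the manifest layout are illustrative assumptions, not part of any tool named in this talk.

```python
import json
import platform
import sys
from importlib import metadata


def environment_manifest():
    """Collect interpreter, OS, and installed-package versions as a dict."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions whose metadata lacks a name
            packages[name] = dist.version
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


manifest = environment_manifest()
# Print the start of the manifest; in practice it would be saved with the results.
print(json.dumps(manifest, indent=2)[:300])
```

Storing such a manifest next to the results does not make an experiment recomputable on its own, but it documents what a later recomputation would need to match.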
Ian Gent, Alexander Konovalov and Lars Kotthoff
Steven Crouch, Devasena Inupakutika
Recomputation.org
Zanadu.IO
Patrick Henaff and Claude Martini
khmer-protocols:
Effort to provide standard cheap assembly protocols for cloud machines.
Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. Est. ~$150 per data set.
Open, versioned, forkable, citable.
Open Science
C. Titus Brown, @ctitusbrown
http://ged.cse.msu.edu/
http://ivory.idyll.org/
Explicitly a protocol: explicit steps, copy-paste, customizable, versioned; not black box.
No requirement for computational expertise or significant computational hardware.
~1-5 days to teach a bench biologist to use.
$100-150 of rental compute (cloud computing) for a $1000 data set.
Now adding in quality control and internal validation steps.
Some thoughts
Reproducible computing environment (Azure)
Publicly available data (MMETSP)
Open and versioned protocol
Provenance tracking and registration (Synapse?)
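The four ingredients above can be tied together in a single provenance record. The following is a minimal sketch under stated assumptions: the field names, the example byte strings, the protocol version string, and the environment label are all hypothetical, and it does not use the Synapse API or any other registration service.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(input_bytes, output_bytes, protocol_version, environment):
    """Link input data, output data, protocol version, and environment."""
    return {
        "input_sha256": sha256_of(input_bytes),
        "output_sha256": sha256_of(output_bytes),
        "protocol_version": protocol_version,
        "environment": environment,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical stand-ins for a real data set, result, protocol tag, and VM image.
record = provenance_record(
    b"ACGT\n", b"assembly result\n", "v0.1-hypothetical", "example-vm-image"
)
print(json.dumps(record, indent=2))
```

Content hashes let a later reader confirm that they hold the same data and results the record refers to, whichever service ends up storing the record.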
Computing Cancer
http://biomodelanalyzer.research.microsoft.com/
Troubling Trends in Scientific Software
Azure Machine Learning
Azure Machine Learning Awards 15 Sep '14
Azure Machine Learning - Sharing
www.tryfsharp.org
NOTES FROM THE FIELD
http://www.rigb.org/docs/faraday_notebooks__induction_0.pdf
21st Century Log Notebooks
Verification versus Validation
Are you building it right?
Are you building the right thing?
Reproducing my own results
Replicating other people's results
Reproducing other people's results
Repeatability, Replicability, Reproducibility, Reuse
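The first of these, reproducing my own results, is the easiest to check automatically. The sketch below is one possible reading of repeatability, not a definition taken from the slides: it reruns a seeded stand-in computation and compares fingerprints of the two results. The function names, the seed, and the computation itself are hypothetical.

```python
import hashlib
import random


def run_experiment(seed, n=1000):
    """Stand-in computation: mean of n draws from a seeded generator."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(n)) / n


def fingerprint(value):
    """Hash the printed form of a result so two runs can be compared."""
    return hashlib.sha256(repr(value).encode("utf-8")).hexdigest()


# Same code, same seed, same machine: the two fingerprints should match.
first = fingerprint(run_experiment(seed=42))
second = fingerprint(run_experiment(seed=42))
print("repeatable:", first == second)
```

A matching fingerprint only shows that the same code gives the same answer twice in the same environment; replicating or reproducing other people's results still needs their data, protocol, and environment.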
reviewers have no time and no resources to reproduce data and to dig deeply into the presented work.
Life Sci VC: Academic bias & biotech failures: http://lifescivc.com/2011/03/academic-bias-biotech-failures/
Photo: leechantmcarthur, CC-BY
Enabling Science 2.0
www.azure4research.com
Use laptops & desktop computers
Overwhelmed by data
Finding analysis ever more difficult; sharing even harder
Enabling Science 2.0
Microsoft Azure for Research
Azure Research Awards
General next 15 Aug
Machine Learning next 15 Sep
Microsoft Azure for Research
Online Training
Webinars
Technical papers & walkthroughs
Research community engagements
www.azure4research.com
THANK YOU
Kenji.Takeda@Microsoft.com
www.azure4research.com
Microsoft Azure for Research Group
@azure4research