
Data Driven Cancer (DDC) or Cancer Driven Data (CDD)?

An omics puzzle to be solved for better prognosis in the disease

Alokkumar Jha
PhD Student
Insight Centre for Data Analytics
NUI Galway

Cancer is like the Mafia


Treatments have variable effect
Resistance can evolve
Doesn't work for all people
Doesn't hit the progenitor

Looks ordinary (almost)
Doesn't play by the rules
Has a competitive advantage
Eludes detection

Current Scenario in cancer research and data science

Flood of Data
NextGen Biology Mantra: More data is good.

Structural Variation
Exome Sequences
DNA Methylation
Copy Number Alterations
Expression

Data Generation Mechanisms


"There are approximately 500 petabytes of healthcare data in existence today, and that number is expected to skyrocket to more than 25,000 petabytes within the next seven years."
Groves, P., Kayyali, B., Knott, D. & Van Kuiken, S. (2013). The Big-Data Revolution in Healthcare. McKinsey & Company Report.

Data Driven Cancer (DDC)


Biomarkers are known for certain cancer types, but it is difficult to understand gene behaviour and disease alteration starting from the gene (Gene -> Data).
Examples: biomarker research, targeted therapy

Cancer Driven Data (CDD)
Molecular-level information about cancer is not well known, so existing open data is used to discover cancer behaviour (Data -> Cancer).
Examples: 1000 Genomes, GWAS

Analysis is like investigating a plane crash

Patient Sample 1
Patient Sample 2
Patient Sample 3
Patient Sample N

Data Driven Cancer


Indication of novel cancer types based on their signature and targets (Genes/ Proteins) or alternate indications

MET,ITGA2,CAV1,ASPH,LGALS3,F2RL1,SERPINE2,EGFR,CAV2
SDC4,LMNA,TPM1,DAB2,GNG12,FN1,PTPRM,MYLK,KRT18
LAMB1,ADAM9,TIMP1,ITGA3,CD44,MIR21,ITGA5,IGFBP3,NRP1
S100A6,ACTN1,ANXA2,TGFB2,THBS1,FOSL1,YAP1,TJP1,EREG,PTPRF
TIMP2,EPHA2,KRT8,SNAI2,CTTN,SERPINE1,LAMC2,IGFBP6
F2RL2,MMP2,TGFBR2,LAMA4,TIMP3,DKK1,JAG1,AXL,AREG,PTN
KRT7,LAMB3,CDH1,COL4A2,SDC1,PKP2,CLDN1,TGFA,CXCL2
ITGB4,APP,KRT19,TGFB1I1,PTGS2,LAMA3,COL4A1,EDN1,PLAU
LOXL2,PPL,CALD1,KLF5,ITGB6,MMP1,PLAT,LOX,CCND1,CTGF
TGIF1,TFPI2,TUBB6,COL1A1,CLDN7,TACSTD2,CDH2,GJA1
NID1,DSP,SPARC,CDH3,GNG11,EFNA5,IL1A,RHOB,EPCAM,F11R

MLN7243

Signature Genes

Data Driven Cancer


PPI (Protein-Protein Interaction)

PPI-based Disease Enrichment

PPIs

PPI databases: HPRD, BIND, IntAct, Vidal, MINT, PID, BioGRID

Graph Statistics
Number of genes from your seed list: 100
Number of intermediate components: 90
Number of interactions in subnetwork created from seed list: 351
Total components in the background network: 2086
Total interactions in the background network: 11429
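
The graph statistics above describe a subnetwork grown from a seed gene list inside a background PPI network. A minimal sketch of how such counts could be produced follows. It assumes that an "intermediate component" is a non-seed protein that directly links at least two seed genes; the slide does not define the term, so this is an assumption. The function name and the toy edge list are illustrative and are not taken from HPRD, BioGRID or any other database named above.

```python
from collections import defaultdict


def seed_subnetwork(edges, seeds):
    """Grow a subnetwork from seed genes inside a background PPI network.

    Assumption (not from the slides): an intermediate component is a
    non-seed node that interacts directly with at least two seed genes.
    Returns (nodes, sub_edges, intermediates).
    """
    seeds = set(seeds)
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    intermediates = {
        node for node, nbrs in neighbours.items()
        if node not in seeds and len(nbrs & seeds) >= 2
    }
    # Keep only seeds that actually occur in the background network.
    nodes = (seeds & set(neighbours)) | intermediates
    sub_edges = {
        tuple(sorted((a, b)))
        for a, b in edges
        if a in nodes and b in nodes and a != b
    }
    return nodes, sub_edges, intermediates


# Illustrative toy background network (hypothetical edges).
background = [
    ("EGFR", "GRB2"), ("MET", "GRB2"), ("EGFR", "CAV1"),
    ("CD44", "X1"), ("GRB2", "SOS1"),
]
seeds = ["EGFR", "MET", "CAV1", "CD44"]

nodes, sub_edges, intermediates = seed_subnetwork(background, seeds)
print("seed genes found:", len(nodes - intermediates))
print("intermediate components:", sorted(intermediates))
print("interactions in subnetwork:", len(sub_edges))
```

With a real background network, the three printed numbers correspond to the first three lines of the graph statistics.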

TOP Gene
COBAS2.0
BioMyndb
DAVID

DDC

Top ten diseases from this list based on p-value < 0.05

LinkedMDBWOR (22 databases)

Background-gene-based disease enrichment

Background genes from LinkedCanDB
LinkedCanDB: OMIM, TTD, CTD, ClinVar, COSMIC, KEGG, WikiPathways, Reactome etc. (32 databases)

Algorithm-defined background genes
TOP Gene, COBAS2.0, BioMyndb, DAVID, Disent, GSEA

Idiopathic intracranial hypertension with papilledema
Galactorrhea-hyperprolactinemia
Chromosome 13q trisomy
Intrahepatic cholangiocarcinoma
Isotretinoin embryopathy-like syndrome
Familial primary gastric lymphoma
GTP cyclohydrolase I deficiency
Dystonia, dopa-responsive; DRD
Epidermolysis bullosa letalis
Cirrhosis, familial
Epidermolysis bullosa, generalized atrophic benign; GABEB
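
Disease enrichment against a background gene set is commonly scored with a one-sided hypergeometric test. The slides do not say which statistic the listed tools apply, so the sketch below is an assumption about the general technique, not a description of any of them. The function name and the example counts for the disease gene set are hypothetical; only the background size (2086) and the list size (100 seeds plus 90 intermediates) come from the graph statistics slide.

```python
from math import comb


def enrichment_p_value(background_size, disease_genes, list_size, overlap):
    """One-sided hypergeometric p-value: P(X >= overlap).

    background_size: genes in the background set
    disease_genes:   background genes annotated to the disease
    list_size:       genes in the query list (drawn from the background)
    overlap:         query genes annotated to the disease
    """
    upper = min(disease_genes, list_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(disease_genes, i) * comb(background_size - disease_genes, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 40 disease genes in the background, 12 of them in the list.
p = enrichment_p_value(background_size=2086, disease_genes=40,
                       list_size=190, overlap=12)
print(f"p = {p:.3g}", "(enriched at 0.05)" if p < 0.05 else "(not enriched at 0.05)")
```

In practice the p-values for many diseases would also be corrected for multiple testing before ranking a top ten.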

Summary: DDC

Requirements
Integrated dataset for downstream analysis
Inferred activities reflect neighbourhood of influence around a gene
Can boost signal for survival analysis and assessment of mutation impact

Cancer Driven Data


Proteasome Subunit PSMD9

LinkedSeq (ENCODE, TCGA, SR, GWAS, GRO-seq, 1000 Genomes etc.)

NGS (ChIP + RNA-seq approach)

Cancer Driven Data

Proteasome Subunit PSMD9
Tissue data
Cell line data
Microarray Approach
LinkedArray (U133Plus2, U133): GEO, EBI Express

Samples per tissue on U133A and U133plus2 arrays (Cancer and Normal), with row totals

Tissue            Counts                        Total
Abdomen           13                            13
Adipose           59, 12                        72
Adrenal gland     39, 14, 87, 15                155
Bladder           14                            19
Blood             4693, 639, 3130, 1099         8974
Brain             785, 568, 592, 1627           3572
Breast            1954, 251, 2635, 91           4931
Cervix            74, 12, 64, 34                184
Colon             1294, 206, 256, 27            1783
Endometrium       72, 61                        142
Esophagus         48, 24, 28                    109
GIST              64                            64
Head and neck     202, 14, 21                   239
Heart             41                            41
Kidney            573, 105, 366, 66             1110
Liver             182, 25, 156, 52              415
Lung              441, 225, 582, 364            1612
Muscle            177, 331                      508
Myometrium        24                            24
Ovary             859, 21, 341                  1230
Pancreas          132, 55, 13                   208
Prostate          308, 45, 244, 83              680
Sarcoma           493                           493
Skin              290, 28, 499, 59              876
Small intestine   13, 22                        41
Stomach           268, 57, 46, 18               389
Testis            184, 13                       207
Thyroid           62, 25, 44, 25                156
Tongue            11                            15
Uterus            155, 12, 24                   191
Vagina
Vulva             21, 14                        35
Total             13057, 2655, 9284, 4087       29083

LinkedTheraputics: A linked data approach towards connected omics healthcare

Probes
U133Plus2: 54,613
U133A: 22,215

Protein synonym problem (PSMD9 = RPN4 = P27)
LinkedMDBWOR (22 databases)

Normal tissue network (U133plus2-N + U133A-N)
Cancer network (U133plus2-C + U133A-C)
Weighted network with PCC

Linked Pathways
LinkedPathway: KEGG, REACTOME

Centrality Measures
Closeness
Betweenness
Eccentricity
Degree
Eigenvector
Radiality
Shortest path length
Longest path length

Topological stability based on triangle counts (normal vs cancer): measure of LOSS/GAIN
Clustering of both networks (community clustering)
Leading disease by each cluster / indirect indications

Linked Visualization & Reporting
LinkedVIZ
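
The workflow on this slide builds a PCC-weighted co-expression network for normal and for cancer samples and then compares triangle counts between them. A minimal standard-library sketch of that idea follows. The correlation cut-off of 0.8, the function names, the partner probes G1 and G2 and the toy expression values are all assumptions made for illustration; the slides do not state how edges were selected.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient (PCC) of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def pcc_network(expression, threshold=0.8):
    """Return {(gene_a, gene_b): pcc} for pairs with |pcc| >= threshold.

    The threshold is an assumed cut-off, not a value from the slides.
    """
    edges = {}
    for a, b in combinations(sorted(expression), 2):
        r = pearson(expression[a], expression[b])
        if abs(r) >= threshold:
            edges[(a, b)] = r
    return edges


def triangle_count(edges):
    """Count triangles; each one is seen once per edge, so divide by 3."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return sum(len(adjacency[a] & adjacency[b]) for a, b in edges) // 3


# Toy expression profiles (hypothetical values, four samples per condition).
normal = {"PSMD9": [1, 2, 3, 4], "G1": [2, 4, 6, 8], "G2": [1, 3, 5, 7]}
cancer = {"PSMD9": [1, 2, 3, 4], "G1": [2, 4, 6, 8], "G2": [4, 1, 3, 2]}

normal_net = pcc_network(normal)
cancer_net = pcc_network(cancer)
change = triangle_count(cancer_net) - triangle_count(normal_net)
print("triangles normal:", triangle_count(normal_net))
print("triangles cancer:", triangle_count(cancer_net))
print("LOSS" if change < 0 else "GAIN" if change > 0 else "STABLE", abs(change))
```

The same edge dictionaries could feed the centrality measures listed above, for example degree centrality as the number of edges touching each gene.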

LinkedTheraputics: Results

CDD
25* forms of cancer
glioblastoma multiforme (brain)
squamous carcinoma (lung)
serous cystadenocarcinoma (ovarian)
etc.

Biospecimen Core Resource with more than 150 Tissue Source Sites
6 Cancer Genomic Characterization Centers
3 Genome Sequencing Centers
7 Genome Data Analysis Centers
Data Coordinating Center

Multiple data types
Clinical diagnosis
Treatment history
Histologic diagnosis
Pathologic report/images
Tissue anatomic site
Surgical history
Gene expression/RNA sequence
Chromosomal copy number
Loss of heterozygosity
Methylation patterns
miRNA expression
DNA sequence
RPPA (protein)
Subset for Mass Spec

Future Medicine Practice in cancer research

Chin et al. 2014, Cell

Motivation
TCGA has many high-quality primary tumor samples, but metastasis kills.
Which primaries will metastasize?

Image courtesy of wikimedia commons


Overview of pathway-guided approach
Integrate many data sources to gain an accurate view of how genes are functioning in pathways
Predict the functional consequences of mutations by quantifying the effect on the surrounding pathway
Use pathway signatures to implicate mutations in novel genes to (re-)focus targeting
Identify critical Achilles' heels in the pathways that distinguish a particular sub-type

Schema

Assembly
FASTA seq
Ensembl ID
Cell-lines
KEGG pathway
SNP Id
SGPX
GO: gene ontology
Molecular Mass
equivalent class
interaction
Peroxiredoxin 6
Chr location: start-end
COSMIC mutation
Proteomes
Cytogenetic band
Protein abundance cross organisms
Mutation type
Disease
ID: PharmGKB

Data

GRCh38.p2
ENSG00000117592
rs761610936 dbSNP
GO:0016021 Integral membrane component
SwissProt
X:139,955,72139,965,520
c.17G>T
Ensembl
MCF7, HeLa
MNLVICVLLLSIWKNN
CMTTNQTNGSSTTGD
KPVESMQTKLNY
LRRNLLILVGIIIMVFV
FICFCYLHYNCLSDDA
SKAGMVKKKGIAA
KSSKTSFSEAKTASQC
SPETQPMLSTADKSS
DSSSPERASAQS
NCBI
COSMIC
COSM1249516
SGPX
HGNC
same as
hsa:347487
KEGG
UniProt
GeneCards
PA164718516
9606.ENSP00000359571
equivalent to
CXORF66
UniProt
PaxDb
Xq27.1
39944 Da
HPA
SPD
see also
PharmGKB
Modified residue
Q5JRM2
EMBL-EBI
chromosome X open reading frame 66
UP000005640
Glycosylation
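
The schema links records across databases with relations such as "same as", "equivalent to" and "equivalent class", which is one way to address the protein synonym problem (PSMD9 = RPN4 = P27) raised earlier in the deck. A minimal sketch of resolving such links with a union-find structure follows. The class name `SynonymIndex` is hypothetical, and pairing CXORF66 with Q5JRM2 and hsa:347487 is an assumption read off the slide layout rather than a checked mapping.

```python
class SynonymIndex:
    """Group identifiers connected by 'same as' / 'equivalent to' links."""

    def __init__(self):
        self._parent = {}

    def _find(self, identifier):
        # Unknown identifiers start as their own group.
        self._parent.setdefault(identifier, identifier)
        while self._parent[identifier] != identifier:
            # Path halving keeps later lookups short.
            self._parent[identifier] = self._parent[self._parent[identifier]]
            identifier = self._parent[identifier]
        return identifier

    def link(self, a, b):
        """Record that identifiers a and b denote the same entity."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same(self, a, b):
        """True if a and b are connected through any chain of links."""
        return self._find(a) == self._find(b)


index = SynonymIndex()
# Synonyms stated on the "Protein synonym problem" slide.
index.link("PSMD9", "RPN4")
index.link("RPN4", "P27")
# Assumed cross-database links for the schema example.
index.link("CXORF66", "Q5JRM2")
index.link("CXORF66", "hsa:347487")

print(index.same("PSMD9", "P27"))          # linked through RPN4
print(index.same("Q5JRM2", "hsa:347487"))  # linked through CXORF66
print(index.same("PSMD9", "CXORF66"))      # different entities
```

Because links are transitive, two expression probes annotated with different synonyms end up in the same group before the networks are merged.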

Acknowledgements

Dr. Ratnesh Sahay
Group Leader, eHealth and Life Sciences, Insight Centre for Data Analytics @ NUI Galway

Dr. Prasanna Venkatraman
Principal Investigator, Advanced Centre for Treatment, Research and Education in Cancer, Mumbai, India
Dr. Rangapriya Sundarajan
Sr. Research Associate, Advanced Centre for Treatment, Research and Education in Cancer, Mumbai, India
