Schedule
Date       Time           Room     Session
4-Nov-05   13.45 - 15.30  174      Lecture: Introduction
18-Nov-05  13.45 - 15.30  174      Lecture: Predictive Data Mining
18-Nov-05  15.45 - 17.30  306/308  Practical Assignments
25-Nov-05  13.45 - 15.30  403      Lecture: Descriptive Data Mining & Search
2-Dec-05   13.45 - 15.30  174      Lecture: Bioinformatics Data Mining Cases
2-Dec-05   15.45 - 17.30  306/308  Practical Assignments
Evaluation
Practical assignment (2nd) plus take-home exercise
Agenda Today
What is data mining?
A short summary of life
Data mining revisited
Problem:
Leukemia (different types of leukemia cells look very similar)
Given data for a number of samples (patients), can we
  Accurately diagnose the disease?
  Predict outcome for given treatment?
  Recommend best treatment?
Solution
Data mining on micro-array data
ALL
AML
Some working definitions.
Data Mining and Knowledge Discovery in Databases (KDD) are used interchangeably
Data mining =
The process of discovery of interesting, meaningful and actionable patterns hidden in large amounts of data
Multidisciplinary field originating from artificial intelligence, pattern recognition, statistics, machine learning, bioinformatics, econometrics, ...
A short summary of life
Bio Building Blocks
Biotech
Data Mining Applications
The Promise.
DNA Trivia
DNA stores instructions for the cell to perform its functions
Double helix, two interwoven strands
Each strand is a sequence of so-called nucleotides
Deoxyribonucleic acid (DNA) comprises 4 different types of nucleotides (bases): adenine (A), thymine (T), cytosine (C) and guanine (G)
The nucleotide uracil (U) doesn't occur in DNA
DNA Trivia
Each nucleus contains 3 x 10^9 nucleotides
Human body contains 3 x 10^12 cells
Human DNA contains 26k expressed genes, each gene codes for a protein in principle
DNA of different persons varies 0.2% or less
Human DNA contains 3.2 x 10^9 base pairs
For comparison: φX-174 virus: 5,386; salamander: 100 x 10^9; Amoeba dubia: 670 x 10^9
Proteins: 3D Structure
A representation of the 3D structure of myoglobin, showing coloured alpha helices. Myoglobin was the first protein to have its structure solved by X-ray crystallography, by Sir John Cowdery Kendrew and colleagues in 1958; Kendrew shared the 1962 Nobel Prize in Chemistry with Max Perutz. Source: http://en.wikipedia.org/wiki/Protein
Proteins: 3D Structure
Molecular surface of several proteins showing their comparative sizes. From left to right are: Antibody (IgG), Hemoglobin, Insulin (a hormone), Adenylate Kinase (an enzyme), and Glutamine Synthetase (an enzyme).
Proteins: 3D Structure
G Protein-Coupled Receptors (GPCR) represent more than half the current drug targets
DNA Codes for Proteins but Proteins also Control Gene Expression
Combinations of a few gene regulatory proteins can generate many different cell types during development
Function prediction
Predicting function from structure Protein localization
Expression analysis
Genes: micro array data analysis etc. Proteins
Regulation analysis
Automated recognition of sick yeast cells in images (with Prof. Verbeek)
Recommender systems in bioinformatics
Amazon.com-style recommendations
Multidisciplinary field originating from artificial intelligence, pattern recognition, statistics, machine learning, bioinformatics, econometrics, ...
Finding best matching instances
Every instance is a point in pattern space.
Attributes are the dimensions of an instance, e.g. age, weight, gender etc.
Pattern spaces may be high-dimensional (10 to thousands of dimensions)
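Finding the best matching instances can be sketched as a nearest-neighbour search in pattern space. The example below is a minimal illustration, not a method taken from the slides: the Euclidean distance, the function names and the patient values are all assumptions, and in practice attributes on different scales would usually be rescaled first.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points in pattern space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_matches(query, instances, k=1):
    """Return the k instances closest to the query point."""
    return sorted(instances, key=lambda inst: euclidean(query, inst))[:k]

# Hypothetical instances: (age in years, weight in kg)
patients = [(25, 60), (47, 82), (52, 90), (70, 68)]

# The two patients most similar to a 50-year-old weighing 85 kg
nearest = best_matches((50, 85), patients, k=2)
print(nearest)
```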
Clustering is the discovery of groups in a set of instances
Groups are different, instances in a group are similar
In a 2- to 3-dimensional pattern space you could just visualise the data and leave the recognition to a human end user
[Figure: scatter plot of instances; axes e.g. age and weight]
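When the pattern space has too many dimensions to visualise, the grouping has to be found by an algorithm. The slides do not name one; the sketch below uses k-means purely as an assumed example, with hypothetical (age, weight) points and hand-picked starting centroids so that the result is reproducible.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its group."""
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            groups[i].append(p)
        # An empty group keeps its previous centroid
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids, groups

# Hypothetical instances: (age in years, weight in kg)
points = [(22, 55), (25, 60), (24, 58), (61, 88), (65, 92), (63, 85)]
centroids, groups = k_means(points, centroids=[points[0], points[3]])
print(groups)
```

The number of groups is fixed in advance by the number of starting centroids, which is one of the main design choices when using this algorithm.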
The goal of a classifier is to separate classes on the basis of known attributes
The classifier can be applied to an instance with unknown class
For instance, classes are healthy (circle) and sick (square); attributes are age and weight
[Figure: scatter plot of both classes; axes: weight, age]
20000 patients: age > 67?
  yes: 1200 patients: weight > 85kg?
    yes: 400 patients, diabetic (50%)
    no: 800 patients, diabetic (10%)
  no: 18800 patients: gender = male?
    no: etc.
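Applying such a decision tree to a new patient amounts to following one path of nested tests from the root to a leaf. The sketch below hard-codes only the branches shown on the slide; the function name is hypothetical, and the branch for age <= 67 returns None because the slide does not show how the tree continues.

```python
def diabetic_fraction(age, weight_kg):
    """Return the fraction of diabetic patients in the leaf reached,
    or None where the slide's tree is not shown any further."""
    if age > 67:
        if weight_kg > 85:
            return 0.50  # leaf with 400 patients, 50% diabetic
        return 0.10      # leaf with 800 patients, 10% diabetic
    # age <= 67: the tree continues with a split on gender (not shown)
    return None

print(diabetic_fraction(70, 90))
print(diabetic_fraction(70, 80))
```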
The goal of the classifier is to separate classes (circle, square) on the basis of the attributes age and income
Input (attributes) is coded as activation on the input layer neurons; activation feeds forward through a network of weighted links between neurons and causes activations on the output neurons (for instance diabetic yes/no)
The algorithm learns to find optimal weights using the training instances and a general learning rule.
Neural Networks
Example simple network (2 layers)
Inputs: age, body_mass_index
Each input has a weighted link to the output: weight_age, weight_body_mass_index
Probability of being diabetic = f(age * weight_age + body_mass_index * weight_body_mass_index)
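The formula above can be written out as a single output neuron. This is a minimal sketch under stated assumptions: the slides leave f unspecified, so the logistic (sigmoid) function is assumed here; the bias term is an addition that defaults to zero; and the weight values are hypothetical rather than learned from data.

```python
import math

def sigmoid(x):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def diabetic_probability(age, body_mass_index, weight_age, weight_bmi, bias=0.0):
    """Weighted sum of the inputs, passed through f (assumed to be a sigmoid)."""
    return sigmoid(age * weight_age + body_mass_index * weight_bmi + bias)

# Hypothetical weights and bias; a learning rule would normally set these
p = diabetic_probability(50, 30, weight_age=0.04, weight_bmi=0.1, bias=-5.0)
print(p)
```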
Classification
Simple network: only a line is available (why?) to separate classes
Multilayer network:
[Figure: class boundaries in pattern space; axes e.g. weight and age]
Important measures
Support of the condition: how often do potatoes and sauerkraut occur together (A,B)
Confidence of the rule: how often do sausages then occur / support of the condition (is A,B → C always true?)
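Both measures can be computed directly from a list of shopping baskets. The sketch below assumes support is expressed as a fraction of all baskets (a raw count would work equally well); the function names and the basket contents are hypothetical.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= set(basket)) / len(baskets)

def confidence(baskets, condition, consequence):
    """Support of condition and consequence together,
    divided by the support of the condition alone."""
    both = set(condition) | set(consequence)
    return support(baskets, both) / support(baskets, condition)

# Hypothetical shopping baskets
baskets = [
    {"potatoes", "sauerkraut", "sausages"},
    {"potatoes", "sauerkraut", "sausages"},
    {"potatoes", "sauerkraut"},
    {"potatoes", "bread"},
    {"bread", "milk"},
]

s = support(baskets, {"potatoes", "sauerkraut"})                   # 3 of 5 baskets
c = confidence(baskets, {"potatoes", "sauerkraut"}, {"sausages"})  # 2 of those 3
print(s, c)
```

A confidence of 1 would mean the rule always holds in the data; here one basket with potatoes and sauerkraut has no sausages, so the confidence is two thirds.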
Quiz Question
What have we learned today
An introduction to applying data mining in bioinformatics
A short history of life
Basic data mining concepts