
An Introduction to Machine Learning Techniques
By Clara M. Pennock (Keele University, UK)
Outline
• Aim: to give a basic understanding of machine learning and how to apply it.
• Start with basic explanation of machine learning.
• Give examples of machine learning I’ve used.
• End with tutorial on using machine learning.
Basics

$f_W(\vec{x}) = \vec{y}$

• $f_W$: a non-linear function with some parameter set $W$. Its form depends on the algorithm chosen.
• $\vec{x}$: the features, e.g. U-V, V-J colours.
• $\vec{y}$: the label, e.g. Quiescent (0), Star-forming (1).
Basics – The types

Supervised Learning
• Input: a list of objects with measured properties and labels.
• The algorithm optimizes a score (cost function) that depends on the input labels and the predicted labels.

Unsupervised Learning
• Input: a list of objects with measured properties.
• The algorithm detects clusters, complex relations or outliers.
Supervised Learning
• Start with a list of objects (X) with measurements (features) and
known outputs (y) in the form of a label or value.
• The Training Set:
  Measurements (fluxes, colours, etc.): $\vec{x}_1, \vec{x}_2, \vec{x}_3, \ldots, \vec{x}_n$
  Labels (morphology, object type, etc.): $\vec{y}_1, \vec{y}_2, \vec{y}_3, \ldots, \vec{y}_n$

Classification -> Label/Category
Regression -> Continuous variable
Supervised Learning
• Aim: to find a non-linear function $f_W(\vec{x})$ that outputs the correct label/measurements for an unlabelled dataset.
• The form of the function is determined by which machine learning algorithm you use.
• It is translated into a minimization problem: find $W$ such that the prediction error is minimal over all unseen vectors.
• You need two things: a loss function and a minimization/optimization algorithm.
Supervised Learning
• 1. Define a loss function (to measure the error for each input and output):

  $\mathrm{loss}(f_W(\vec{x}_i), \vec{y}_i)$

  e.g. the quadratic loss function $(f_W(\vec{x}_i) - \vec{y}_i)^2$ used in regression.

• 2. Minimize the empirical risk (the average error over the $N$ observed examples):

  $\mathcal{R}_{\mathrm{empirical}}(f_W) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{loss}(f_W(\vec{x}_i), \vec{y}_i)$
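As an illustration (not code from the talk), a minimal NumPy sketch of the quadratic loss and the empirical risk; the linear model `f_w` and the toy data are hypothetical stand-ins for whichever algorithm and dataset you actually use:

```python
import numpy as np

def f_w(W, x):
    # Hypothetical stand-in model: a simple linear function of the features.
    # In practice the form of f_W is fixed by the algorithm you choose.
    return W @ x

def quadratic_loss(prediction, y):
    # Quadratic loss for one example: (f_W(x_i) - y_i)^2
    return (prediction - y) ** 2

def empirical_risk(W, X, Y):
    # Average the per-example loss over all N observed examples.
    return np.mean([quadratic_loss(f_w(W, x), y) for x, y in zip(X, Y)])

# Toy data: 5 examples, 2 features each, one continuous target each.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
Y = X @ np.array([1.0, -2.0])  # targets generated by a "true" W
print(empirical_risk(np.array([0.5, -1.5]), X, Y))  # risk of a candidate W
```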
Supervised Learning: Checking accuracy
• Training set: used to train the classifier.
• Test set: a sample of the labelled data set aside to test the trained classifier on.
• NEVER use the test set sample in the training!
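A minimal sketch of this split using scikit-learn's `train_test_split`; the toy features and labels are placeholders for your own data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labelled sample: 100 objects, 2 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)

# Hold out 30% as a test set; the classifier never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(len(X_train), len(X_test))  # 70 30
```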
Unsupervised Learning
• Input: start with a list of objects (X) with measurements (features).
• Depending on the algorithm, you can detect clusters, complex relations or outliers.
• Can also be used to reduce the dimensionality of a dataset down to dimensions we can visualize.
• Input: $\vec{x}_1, \vec{x}_2, \vec{x}_3, \ldots, \vec{x}_n$ Measurements (fluxes, colours, etc.)
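For illustration, a minimal clustering sketch with scikit-learn's `KMeans` on made-up data; note that only the measurements go in, with no labels:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy feature matrix: 100 objects with 2 measured properties each,
# drawn from two well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Ask for 2 clusters; the structure is found from the measurements alone.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:5])       # cluster assigned to each object
print(kmeans.cluster_centers_)  # centre of each cluster in feature space
```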
Unsupervised Learning

Classifying galaxies using spectra.

Example taken from astroML, a website that provides astronomy examples for using machine learning.

From: https://www.astroml.org/book_figures/chapter7/fig_PCA_LLE.html
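The astroML figure uses PCA to project high-dimensional spectra down to a few dimensions we can plot. A minimal sketch of the same idea with scikit-learn, on a made-up `spectra` array standing in for real galaxy spectra:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for real spectra: 200 galaxies x 1000 wavelength bins.
rng = np.random.default_rng(0)
spectra = rng.normal(size=(200, 1000))

# Project each 1000-dimensional spectrum onto its first 2 principal components.
pca = PCA(n_components=2)
projected = pca.fit_transform(spectra)
print(projected.shape)  # (200, 2): one 2D point per galaxy, ready to plot
```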
Python: Scikit-Learn

Supervised Learning
• Generalized Linear Models
• Linear and Quadratic Discriminant Analysis
• Kernel ridge regression
• Support Vector Machines
• Stochastic Gradient Descent
• Nearest Neighbours
• Gaussian Processes
• Naïve Bayes
• Decision Trees
• Ensemble Methods
• Semi-supervised
• Neural Networks (supervised)
• …

Unsupervised Learning
• Gaussian mixture models
• Manifold learning
• Clustering
• Biclustering
• Decomposing signals in components
• Covariance estimation
• Novelty and Outlier Detection
• Density Estimation
• Neural Networks (unsupervised)
• …
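Whichever algorithm you pick from these lists, scikit-learn exposes the same basic fit/predict interface, so swapping one for another is cheap. A minimal sketch (the choice of `SVC` here is just an example, and the toy data are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data, as in the earlier train/test split sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Every scikit-learn estimator follows the same pattern:
clf = SVC(kernel="rbf")           # 1. choose an algorithm and its settings
clf.fit(X_train, y_train)         # 2. train on the labelled training set
y_pred = clf.predict(X_test)      # 3. predict labels for unseen objects
print(clf.score(X_test, y_test))  # 4. mean accuracy on the test set
```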
How to choose?
• Depends on what you want to do. Classify/Cluster/Look for Outliers?
• What type of data do you have? Labelled/Unlabelled, Images/Survey
catalogues?
• Once you know what type of problem and data you have, try out a few algorithms to see how well they work for you.

• There’s a good flow chart here to help you get started: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
What I’ve personally used

Convolutional Neural Networks (CNNs)
• Supervised learning technique that is based on neural pathways in the brain.
• Used to detect strong gravitational lenses in large survey images.
• Done with Matlab. (A Python equivalent is Keras.)

Random Forests
• Supervised learning technique that uses Boolean logic to separate samples to determine classes/values.
• Used to find AGN behind the Magellanic Clouds using survey data from radio to X-rays.
• Done with Python scikit-learn.
Auto-detection of strong gravitational lenses using convolutional neural networks
Master's project done at the University of Nottingham
Link to paper: https://doi.org/10.1051/emsci/2017010
Neural Networks
• Based on neural pathways in the human brain and used to recognise patterns.
• Consists of layers of neurons, where each neuron in one layer is connected to every neuron in the next layer.
• Connections have weights that determine importance.
$f_1[W_{1,1} x_1 + W_{1,2} x_2 + W_{1,3} x_3 + W_{1,4} x_4] = \text{output}$
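A minimal NumPy sketch of that single-neuron computation; the sigmoid is an assumed choice for the activation function $f_1$, which the slide does not specify:

```python
import numpy as np

def sigmoid(z):
    # One common choice for the activation f_1; the slide does not fix it.
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, -0.5, 1.0, 0.3])   # four inputs x_1 .. x_4
W = np.array([0.4, 0.1, -0.7, 0.9])   # four connection weights W_1,1 .. W_1,4

output = sigmoid(W @ x)  # f_1[W_1,1*x_1 + ... + W_1,4*x_4]
print(output)
```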
Convolutional Neural Networks

[Figure: CNN architecture, from the input image through feature maps and pooled feature maps, with a non-linearity stage, to the classification layer.]

• Convolves the image with randomised kernels and pools the results.
• The output of the convolution and pooling stages is the input for the neural network (the classification layer).

CNN

[Figure: illustration of the convolution and pooling operations.]
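The lens finder in this project was built in Matlab; as a rough illustration of the Python equivalent mentioned earlier, here is a minimal convolve-pool-classify stack in Keras. The layer sizes and image shape are illustrative guesses, not the architecture from the paper:

```python
from tensorflow import keras
from tensorflow.keras import layers

# A minimal two-class CNN for hypothetical 64x64 single-band images.
model = keras.Sequential([
    keras.Input(shape=(64, 64, 1)),
    layers.Conv2D(16, (3, 3), activation="relu"),  # convolution stage
    layers.MaxPooling2D((2, 2)),                   # pooling stage
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),           # classification layers
    layers.Dense(2, activation="softmax"),         # lensed vs non-lensed
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=10, validation_split=0.1)
```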
Convolutional Neural Networks
CNN training set

• Two classes: non-lensed and lensed.
• 420,000 simulated images.
• Trained on 210,000 of them.
• Tested on the other 210,000, with accuracy >98%.
Testing
• Tested on images that contain
lensed sources and other
sources.
• False labelling and incorrectly-
sized boxes were the most
prominent issues across all
images tested.
Re-training
• To improve this result, we added
several real images to the
training set.
• Positive training: natural lenses
• Negative training: stars with
large diffraction spikes, spiral
galaxies and nebulae
• This solved many of the issues
across the majority of the test
images. ≥ 90% accuracy.
Random Forests to find AGN
behind the Magellanic Clouds
Current PhD project
Decision Trees
• Simple model.
• Based on Boolean logic.
• Easy to interpret.
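A minimal scikit-learn sketch of such a tree; the data and the two colour feature names are hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training set: two hypothetical colour features, binary labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# A shallow tree stays easy to read.
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# Because a tree is a chain of Boolean tests, it can be printed and inspected.
print(export_text(tree, feature_names=["U-V colour", "V-J colour"]))
```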
Random Forests

• Forest – created
from multiple
decision trees.
• Random –
dataset is
randomly split
into subsets,
each decision
tree is trained on
a different
subset.
Random Forest to find AGN behind the Magellanic Clouds

Training Set:
• AGN: 292 spectroscopically identified sources.
• Galaxies (no identified AGN): 512 spectroscopically identified sources.
• Stars: ~40,000 from Simbad.
• Other galactic sources (e.g. YSOs, PNe, SN, SNRs): ~3,000 from Simbad.

We have survey catalogue data ranging from radio to X-rays.

Reasons I use it (see the sketch below):
• Requires very little data preparation.
• Performs well on imbalanced datasets.
• Can handle both numerical and categorical data.
• Can handle multi-output problems.
• Outputs feature importances.
• Can handle numerous features and objects.
• Produces classification probabilities.
• Generalises well to unseen datasets.
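A minimal sketch of the last two outputs (probabilities and feature importances) with scikit-learn's `RandomForestClassifier`; the classes and features are stand-ins for the real catalogue:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the catalogue: 300 objects, 4 "survey band" features,
# 3 hypothetical classes (e.g. 0 = AGN, 1 = galaxy, 2 = star).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 4))
y_train = rng.integers(0, 3, size=300)

# 100 trees, each trained on a random bootstrap sample of the data.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Classification probabilities: one probability per class for each object.
print(rf.predict_proba(X_train[:3]))

# Feature importances: how much each feature contributed to the splits.
print(rf.feature_importances_)
```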
Quick tutorials
• I’ve created some simple Python scripts for you to try out. You can use the provided example data or data of your own.

Dropbox link: https://www.dropbox.com/sh/6tbzg2pc5kk50mf/AADBi3_CYOixCmpLA3tdnWOua?dl=0

You can do these now, or later.

If you have any questions, my email address is c.m.pennock@keele.ac.uk


More information
Last year I attended the XXX Canary Islands Winter School of Astrophysics, which was about Big Data Analysis in Astronomy using machine learning. They have recently released the content given by each of the lecturers, along with the recorded talks and the slides used for them. References, tutorials, and suggestions for additional reading are also included.
http://www.iac.es/winterschool/2018/pages/book-ws2018.php

Machine Learning in Astronomy: a practical overview (Dalya Baron 2019): https://arxiv.org/abs/1904.07248
