Written by
Romain Fonteyne
I am using this opportunity to express my gratitude to my supervisor, Mrs. Joana Maria Frontera-
Pons, who supported me throughout the course of this end-of-studies Master's project. I am thankful
for her inspiring guidance and invaluable constructive criticism.
Abbreviations
AE Autoencoder
AI Artificial Intelligence
ML Machine Learning
NN Neural Network
tf TensorFlow
Symbols
A Matrix
b, bT , b−1 Bias matrix
d Distance
E Set of positive real numbers
f (xcorr , θ) Input model (function)
g(y, θ−1 ) Output model (function)
h(x) Logistic function
k Optimal number of principal components
L(x̂, x), LH (x̂, x) Loss function
n Number of variables
p Number of individuals
pt Pattern
q(x) Corrupted data function
r Correlation coefficient
S(t) Sigmoid function
W, W T , W −1 Weight matrix
y, y(x) Encoded data
x Input data
xcorr Corrupted data
X̂, x̂ Reconstructed data
x̄, ȳ Mean value
α Slope
θ Weight matrix symbol
σ Nonlinearity
σ̂X Standard deviation of X
σ̂Y Standard deviation of Y
σ̂XY Estimator of the covariance
Abstract
This Master's thesis focuses on the estimation of galaxy red shifts and its application in machine
learning. This work is conducted within the context of a classical neural network model, for which
we explain the principle of this project as well as its limitations. In particular, we recall that red shift
is an increase in the wavelength within the studied galaxy spectrum. Considering that photometric
red shifts can be measured for millions of galaxies, whereas spectroscopic data are limited, we
demonstrate why they are essential to modern cosmology and how to estimate such precise values
through Deep Learning. We apply this learning method to data collected by the Commissariat à
l'énergie atomique et aux énergies alternatives (CEA) of Saclay, and we pay particular attention to
the errors and accuracy of the data returned by our denoising autoencoder (DAE). For the
success of our mission, we use Unsupervised Learning techniques in order to train our algorithm
to determine the spectral red shift in an autonomous way. The large number of spectra present
in the training data set allowed us to build a robust algorithm with adequate parameter values.
Hence, we adjusted the parameters in such a way that the DAE optimally reconstructs the data,
thus obtaining spectral values (output) in very good agreement with the input data. Lastly, we
observed that optimizing the algorithm of a DAE is necessary for an optimal functioning.
Résumé
This end-of-studies project focuses on the estimation of the spectral shifts of galaxies toward
the red (commonly called "red shifts") and their applications within machine learning. This study
is set in the context of a standard neural network model, for which we explain the principle of the
study, its limits, and the way the work was carried out. In particular, we recall that the red shift
is an increase in the wavelength within the spectrum of the studied galaxy. Given that photometric
red shifts can be measured for millions of galaxies, while the number of spectroscopic measurements
is limited, we show why they are essential to modern cosmology and how to estimate such precise
values through Deep Learning. We applied this method to data collected by the Commissariat à
l'énergie atomique et aux énergies alternatives (CEA) of Saclay and paid particular attention to the
errors and accuracy of the data returned by our Denoising Autoencoder. For the success of our
mission, we used unsupervised learning to train our algorithm to determine the spectral shift
autonomously. The large number of spectra present in the training data allowed us to build a robust
algorithm with adequate parameters. We adjusted the parameters so that the DAE reconstructs
the data as well as possible, thus obtaining spectral values in very good agreement with the initial
data. We observed that optimizing the algorithm of a DAE is necessary for optimal functioning.
Contents
Acknowledgement
Abbreviations
Symbols
Project Synthesis
1 Introduction
1.1 The Universe and Galaxies' Red Shift
1.2 Big data treatment
1.3 Project objectives
3 Representation Learning
3.1 Machine learning
3.2 Deep Learning
Autoencoders
Denoising autoencoders
3.3 Introduction to machine learning algorithms
MNIST example
Results
Conclusion
3.4 Spectra representation
5 Further discussion
Bibliography
Webography
Appendices
Appendix 1: MNIST data decoding with 1 layer
Appendix 2: MNIST data decoding with 2 layers
Appendix 3: Principal Component Analysis programs
Appendix 4: Denoising autoencoder for galaxy spectra red shift estimation
Project Synthesis
Representation learning for galaxy spectra: to investigate the use of recent advances in
deep learning to design new representations for galaxy spectra.
Realized studies
- Investigation of deep learning techniques.
- Study of the denoising autoencoder architecture to derive robust representations for the
continuum component of the galaxy spectra.
- Study of Artificial Neural Networks (ANNs).
Additionally, our work will deal with cross-correlation methods, which use a discrete Fourier trans-
form to correlate a template spectrum with a galaxy spectrum, allowing the shift of the template
spectrum to become a free parameter. These methods reduce to a simple multiplication in Fourier
space between the template and galaxy spectra, which makes them easier and faster to compute
than the same procedure in real space, and therefore very convenient. However, they require the
spectra to be free of continuum.
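As an illustrative sketch of this idea (hypothetical, NumPy-based; the array names and the 30-pixel shift are not from the project), cross-correlating a galaxy spectrum with a template via a product in Fourier space could look like:

```python
import numpy as np

# Synthetic example: the "galaxy" is a shifted, noisy copy of the template,
# and the shift is recovered as the peak of the circular cross-correlation.
rng = np.random.default_rng(0)
template = rng.normal(size=512)
galaxy = np.roll(template, 30) + 0.1 * rng.normal(size=512)

# Cross-correlation theorem: corr = IFFT( FFT(galaxy) * conj(FFT(template)) )
corr = np.fft.ifft(np.fft.fft(galaxy) * np.conj(np.fft.fft(template))).real
best_shift = int(np.argmax(corr))
print(best_shift)  # the lag that best aligns template and galaxy: 30
```

A single FFT-space product replaces a sum over all lags, which is exactly why these methods are faster than the same procedure in real space.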
The following assumption is made within the algorithm: any test spectrum Sλ may be represented as
a linear combination of normalized template spectra Tiλ . The DARTH FADER algorithm (Denoised
and Automatic Red shifts Thresholded with a False Detection Rate), a new, largely automated
wavelet-based method for estimating the red shifts of galaxy spectra, shows that, if many line
features (either absorption or emission) are present in the spectrum, it is easy to determine the red
shift via cross-correlation with a representative set of eigentemplates.
FDR = Vf / Va
where Vf is the number of inactive pixels, and Va is the number of active pixels.
Thus an FDR ratio of 0.05 allows on average one false feature for every 20 features detected, even
though it is usually not possible to reach this statistical accuracy. Increasing the FDR parameter
does provide a significant improvement in the efficacy of the denoising. The False Discovery
Rate denoising procedure denoises the positive and negative halves of the spectrum independently,
with positivity and negativity constraints respectively.
Nevertheless, for a large test catalogue that includes a variety of galaxy types, a large number of
templates is needed to ensure the best match-up between template and test spectra. Using all of
them in the cross-correlation would be excessively time-consuming. If the number of templates
could be reduced whilst still retaining most of their information content, the method would become
more practical. Principal Component Analysis is a simple tool that allows us to do just that: to
reduce the dimensionality of this problem, which may be described as a neural network, by extracting
the most important features from our set of template spectra, the principal components.
Definition
Principal Component Analysis (abbreviated PCA) is a statistical procedure that uses an orthogonal
transformation to convert a set of observations of possibly correlated variables (entities each of
which takes on various numerical values) into a set of values of linearly uncorrelated variables
called principal components. Therefore, the principle is to build a representation system of a reduced
dimension that keeps the distance between individuals, where the distance d, used in a set E such
that d : E × E → R+ , is written as follows:

d = sqrt( sum_{i=1}^{n} (x_i − y_i)^2 )
One of the main problems of such an analysis is the loss of information, which may be controlled
or uncontrolled.
As a mean to analyze the relationships between variables, PCA utilizes a correlation coefficient that
measures the linear link between two variables X and Y . This coefficient is defined by the following
expression:
r = σ̂_XY / (σ̂_X · σ̂_Y)

with

σ̂_XY = (1/N) sum_{i=1}^{N} (x_i − x̄)(y_i − ȳ)
σ̂_X = sqrt( (1/N) sum_{i=1}^{N} (x_i − x̄)^2 )
σ̂_Y = sqrt( (1/N) sum_{i=1}^{N} (y_i − ȳ)^2 )
x̄ = (1/N) sum_{i=1}^{N} x_i
ȳ = (1/N) sum_{i=1}^{N} y_i

which are respectively the estimator of the covariance, the standard deviations, and the means.
r ranges between −1 and 1, as a direct consequence of the Cauchy–Schwarz inequality.
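The coefficient above can be sketched in a few lines of NumPy (the data values here are illustrative, chosen so that y depends almost linearly on x):

```python
import numpy as np

# Correlation coefficient r from the estimators defined above.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # roughly y = 2x, so r should be near 1

x_bar, y_bar = x.mean(), y.mean()
cov_xy = np.mean((x - x_bar) * (y - y_bar))       # estimator of the covariance
sigma_x = np.sqrt(np.mean((x - x_bar) ** 2))      # standard deviation of X
sigma_y = np.sqrt(np.mean((y - y_bar) ** 2))      # standard deviation of Y

r = cov_xy / (sigma_x * sigma_y)
print(r)  # close to 1: strong positive linear link
```

The common 1/N factors cancel in the ratio, so this matches the usual Pearson coefficient.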
A variable is an element of Rn . For instance, the j th variable corresponds to the column of elements
ak,j . Hence, in this case a variable is an element of R15 .
Considering randomly chosen values for the components of the matrix A, it is possible to obtain a
graph (called a "scree plot") showing the eigenvalues of A as a function of the number of principal
components. Here is an example of a scree plot obtained using the Python script available in
Appendix 3:
This graph allows the user to determine the optimal number of principal components k (thus the
optimal number of principal vectors) needed to downsize the input-dataset dimension while obtaining
a new set of values whose variables are linearly uncorrelated. In order to get k, it is necessary to
read the scree plot carefully. On the previous output graph, the first part of the curve is a decreasing
line of slope α = 4. This line stops before becoming a slightly rounded curve. The area connecting
both parts of the curve is called the "elbow". Finally, the optimal k is the value where the elbow
is the most concentrated. In other words, the number of principal components that maximizes the
rate of non-correlation is the intersection between the line and the rounded part of the curve.
Therefore, according to the above results, we can conclude that in this case the best k is k = 3.
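A minimal sketch of the scree-plot computation (the actual script is in Appendix 3; the synthetic data, matrix sizes, and the choice of 3 underlying factors below are illustrative assumptions):

```python
import numpy as np

# Synthetic data with 3 true underlying factors: the sorted eigenvalues of
# the covariance matrix should show a sharp "elbow" after the 3rd one.
rng = np.random.default_rng(1)
latent = rng.normal(size=(100, 3))          # 3 hidden factors
mixing = rng.normal(size=(3, 10))           # mixed into 10 observed variables
A = latent @ mixing + 0.05 * rng.normal(size=(100, 10))  # plus small noise

cov = np.cov(A, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(cov))[::-1]  # decreasing order
print(eigenvalues)  # three large values, then a steep drop: the elbow at k = 3
```

Plotting `eigenvalues` against the component index reproduces the scree plot described above.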
• Mathematical simplicity: this factorial method only uses eigenvalue and eigenvector calcula-
tions.
• Simple results: thanks to the graphs it provides, Principal Component Analysis makes it
possible to apprehend a large part of its results quickly.
• Flexibility: The PCA is a very flexible method, since it is applied to a set of data of any size
and content, as long as it is quantitative data organized in individuals/variables.
• Power of computation: although PCA is a simple mathematical approach, it is still a really
powerful method. In only a few operations, it offers a summary and an overall view of the
relationships between the quantitative variables, results which could not have been obtained
otherwise, or only at the cost of tedious manipulations.
As a method of data analysis, PCA does not really have disadvantages per se. It is simply applied to
specific cases in order to generate a particular type of result. It would therefore make no sense to
call it a disadvantage that the method does not apply outside of this context. Similarly, since it is
a data summarization technique, the loss of information it necessarily generates is not a drawback
but a condition for obtaining the result, even if it may obscure important characteristics in particular
pre-defined cases.
To counteract these two major problems, our case study will focus on a specific method whose
results will be more accurate than those coming from a principal component analysis, as the loss
of information will be minimized.
The specific task to be accomplished is the problem we are trying to solve by modeling the phe-
nomenon. In the next section we will see that our problem deals with layers, hence let us consider
neural networks only. In particular, the notion of neural networks has a huge impact on deep learn-
ing, which exploits this concept by its very nature using unsupervised methods. An Artificial Neural
Network (ANN) is a computational model inspired by the way biological neural networks in
the human brain process information. Neural networks have generated a lot of excitement in Ma-
chine Learning research and industry, thanks to many breakthrough results in speech recognition,
computer vision and text processing.
Most machine learning practitioners are first exposed to feature extraction techniques through unsu-
pervised learning, and the same holds for Deep Learning. In unsupervised learning, an algorithm
attempts to discover the latent features that describe a data set's "structure" under certain (either
explicit or implicit) assumptions. For example, low-rank singular value decomposition (of which
principal component analysis is a specific example) factors a data matrix into three reduced-rank
matrices that minimize the squared error L(x̂, x) = kx̂ − xk2 of the reconstructed data matrix x̂.
The error is therefore zero (the best result achievable) if x̂ = x.
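A small NumPy sketch of this factorization (the synthetic rank-8 data matrix and the choice of k are illustrative): keeping the k largest singular values gives the rank-k reconstruction x̂ that minimizes the squared error L(x̂, x).

```python
import numpy as np

# Synthetic rank-8 data matrix.
rng = np.random.default_rng(2)
x = rng.normal(size=(50, 8)) @ rng.normal(size=(8, 20))

U, s, Vt = np.linalg.svd(x, full_matrices=False)

def reconstruct(k):
    # Rank-k reconstruction from the k largest singular values.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

loss_4 = np.sum((reconstruct(4) - x) ** 2)  # some information is lost
loss_8 = np.sum((reconstruct(8) - x) ** 2)  # essentially zero: x_hat = x
print(loss_4 > loss_8)  # True
```

Once all the rank is kept, x̂ = x and the error vanishes, exactly as stated above.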
Although traditional unsupervised learning techniques will always be staples of machine learning
pipelines, representation learning has emerged as an alternative approach to feature extraction with
the continued success of deep learning. In representation learning, features are extracted from
unlabeled data by training a neural network on a secondary, supervised learning task. This layer-wise
representation learning model dates back to the 1950s. As previously described, there are numerous
ways of learning (supervised, unsupervised, regression, reinforcement, . . . ), but all of them
require large training sets due to the number of parameters to optimize. In order to follow these
steps, a special type of neural network called an "autoencoder" must be used, whose goal is to
interpret training sets of data in order to predict corrupted data in a real data set.
These techniques have enabled significant progress in the fields of sound and image processing,
including facial recognition, speech recognition, computer vision, automated language processing,
text classification (for example spam recognition). Potential applications are very numerous. A
spectacular example is the AlphaGo program, which learned to play the game of Go by the deep
learning method and defeated the world champion in 2016. As a matter of fact, we can add that
vision recognition is extremely challenging: for a small 256×256 resolution and 256 possible pixel
values, a total of 2^524288 images are possible. This number is far larger than the number of stars
in the visible universe (10^24). In the meantime, there are at least 10^10 possible Go games. However,
in order to generate results coming from thousands of learnt data, algorithms must use autoencoders.
An autoencoder is an artificial neural network typically used for the purpose of dimensionality
reduction. It is a neural network that has three layers: an input layer, a hidden (encoding) layer,
and a decoding layer. What makes an autoencoder special is that the output neurons are directly
connected to the input neurons and the goal is to get the output values to match the input values.
Therefore, the network is trained to reconstruct its inputs, which forces the hidden layer to try to
learn good representations of the inputs.
When we input an image, it is a vector in an n-dimensional space which is sent to the hidden layer
after some activation function is applied to it, in order to reduce it to an m-dimensional space. This
process, happening in every neural network, is called dimensionality reduction. Let us consider the
input data x ∈ Rm . We can describe the layers as follows:
• The input data is mapped onto the hidden layer (layer L2 ). This means that the autoencoder
tries to learn a function such that the encoded data y is:

y(x) = σ(W^T x)

where W is the weight matrix such that W^T = W^{-1}, and σ is the nonlinearity. If there is a bias,
this expression is modified in such a way that

y(x) = σ(W^T x + b)

with tied weights θ' = θ^T.

• The hidden layer is mapped onto the output layer (layer L3 ). The mapping is an affine trans-
formation optionally followed by a nonlinearity, and the estimated output x̂ is

x̂ = σ(W^{-1} y + b^{-1})
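A toy forward pass through these two mappings might look as follows (a NumPy sketch, not the project's code; W is taken square and orthogonal so that W^T = W^{-1} holds exactly, as assumed above, and the bias is set to zero):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Orthogonal weight matrix, so W.T equals the inverse of W.
rng = np.random.default_rng(3)
n = 8
W, _ = np.linalg.qr(rng.normal(size=(n, n)))
b = np.zeros(n)

x = rng.normal(size=n)                     # input data
y = sigmoid(W.T @ x + b)                   # hidden (encoding) layer
x_hat = sigmoid(np.linalg.inv(W) @ y + b)  # output (decoding) layer
print(x_hat.shape)  # (8,)
```

Training an actual autoencoder then amounts to adjusting W and b so that x̂ matches x as closely as possible.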
One use case of autoencoders is data compression, similar to the creation of zip files for some data
set that can be unzipped. Obviously, there exist some data losses that do not allow users to recover
the exact same quality of data in output as it was in input. In order to quantify the amount of data
lost in the process we must compute the reconstruction error. This error can be measured in many
ways, depending on the appropriate distributional assumptions on the input given the code. The
traditional mean squared error (mse) L(x̂, x) = kx̂ − xk22 , can be used. If the input is interpreted as
either bit vectors or vectors of bit probabilities, cross-entropy of the reconstruction can be used:
LH (x̂, x) = − sum_{k=1}^{m} [ x_k log x̂_k + (1 − x_k) log(1 − x̂_k) ]
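Both reconstruction errors can be sketched directly (an illustrative NumPy snippet; the vectors x, good and bad are made-up examples of a bit vector and two reconstructions of bit probabilities):

```python
import numpy as np

def squared_error(x_hat, x):
    # Traditional mean-squared-style error L(x_hat, x) = ||x_hat - x||^2
    return np.sum((x_hat - x) ** 2)

def cross_entropy(x_hat, x, eps=1e-12):
    # Cross-entropy of the reconstruction for bit-probability inputs
    x_hat = np.clip(x_hat, eps, 1 - eps)  # avoid log(0)
    return -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

x = np.array([1.0, 0.0, 1.0, 1.0])       # input bit vector
good = np.array([0.9, 0.1, 0.8, 0.9])    # close reconstruction: small losses
bad = np.array([0.5, 0.5, 0.5, 0.5])     # uninformative reconstruction
print(cross_entropy(good, x) < cross_entropy(bad, x))  # True
```

Either loss ranks the good reconstruction above the uninformative one; the choice between them depends on the distributional assumptions on the input.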
If the hidden layer is linear, the autoencoder behaves like Principal Component Analysis. When
the outputs of each layer are wired to the inputs of the successive layer, the autoencoder is called a
“stacked autoencoder”. This kind of tool is a special class of autoencoders used to train a deep net-
work. Once the stacked autoencoder is trained on some data set, it is possible to use those weights to
initialize the deep network, instead of randomly initialized weights. One of the more recent appli-
cations of autoencoders is generating novel yet similar outputs to inputs. Despite the large number
of existing autoencoder types, scientists use a newer type of autoencoder called a “variational au-
toencoder” (VAE), which learns a distribution around the data, so it can generate similar but different
outputs. This type of neural network is much more flexible and customizable in its generation
behavior, which leads us to the conclusion that a VAE is suitable for art generation of any kind: it
might be able to generate three-dimensional models of game characters, paintings, drawings, and
so forth.
There are many techniques used to prevent autoencoders from merely copying the input image,
such as denoising, where the input is partially corrupted on purpose. Those autoencoders are called
“denoising autoencoders”.
A denoising autoencoder is a basic autoencoder which takes randomly and partially corrupted
inputs, which it has to recover or denoise, in order to address the identity-function risk. This
technique has been introduced with a specific notion of a good representation: a good representation
is one that can be obtained robustly from a corrupted input and that will be useful for recovering the
corresponding clean input. The idea is that if the network can rebuild a data set despite its being
corrupted, it will be a more robust decoder.
An approximate scheme is the following:
• The input x gets corrupted by a function q(x) such that q(x) = x_corr .
• The corrupted input is then encoded and decoded:

y = f(x_corr , θ) = σ(W x_corr + b)
x̂ = g(y, θ^{-1}) = σ(W^{-1} y + b^{-1})
• The error computation is exactly the same as the previous model (use of the original x).
We remind the reader that when there are more nodes in the hidden layer than there are inputs, the
network risks learning the so-called “Identity Function”, also called the “Null Function”, meaning
that the output equals the input, rendering the autoencoder useless. Denoising autoencoders solve
this problem by corrupting the data on purpose, randomly turning some of the input values to zero
using some noise. The amount of noise to apply to the input takes the form of a percentage.
Typically 30 percent, or 0.3, is fine, but if we have very little data, we may want to consider adding more.
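The corruption function q(x) described above can be sketched as masking noise (an illustrative NumPy snippet; the vector size and seed are arbitrary):

```python
import numpy as np

# Corrupt the input by randomly zeroing ~30 percent of its values.
rng = np.random.default_rng(4)
x = rng.normal(size=1000)

noise_level = 0.3
mask = rng.random(x.shape) >= noise_level  # keep roughly 70% of the values
x_corr = x * mask                          # q(x) = x_corr

print(1 - mask.mean())  # fraction of zeroed entries, close to 0.3
```

During training, x_corr is fed to the encoder while the loss is still computed against the original, uncorrupted x.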
When calculating the loss function L(x̂, x), it is important to compare the output values with the
original input, not with the corrupted input. In this way, the risk of learning the identity
function instead of extracting features is eliminated. Therefore, denoising autoencoders are an
important tool for feature selection and extraction.
MNIST example
The MNIST dataset comprises 60,000 training examples and 10,000 test examples of the handwritten
digits 0–9, formatted as 28x28-pixel monochrome images.
To start working with MNIST let us include some necessary imports:
The code uses built-in capabilities of TensorFlow to download the dataset locally and load it into
a Python variable. As a result (if not specified otherwise), the data will be downloaded into the
MNIST_data/ folder. We also define some of the values that will be used further in the code:
image_size = 28
input_layer_size = 10
learning_rate = 0.05
steps_number = 1000  # Number of epochs
batch_size = 100
Our task is to build a classifying neural network with TensorFlow. First, we need to set up the archi-
tecture, train the network (using the training set) and then evaluate the result on the test set.
To feed the network with the training data, we need to flatten the digit images. Depending on the
phase (training or testing), different examples will be pushed through the classifier. The training
process will be based on the labels, comparing them to the current predictions. This is why
we need to define these two placeholders:
# Define placeholders
training_data = tf.placeholder(tf.float32, [None, image_size * image_size])
input_layer = tf.placeholder(tf.float32, [None, input_layer_size])

# Variables to be tuned
W = tf.Variable(tf.random_normal([image_size * image_size, input_layer_size],
                                 stddev=0.1))
b = tf.Variable(tf.constant(0.1, shape=[input_layer_size]))

# Build the network (only output layer)
output = tf.matmul(training_data, W) + b
No matter the neural network, the training process works by optimizing (either maximizing or minimiz-
ing) the loss function. In our case, we would like to minimize the difference between the network
predictions and the actual values of the nodes (input_layer). In deep learning, a technique
called cross-entropy is often used to define the loss; however, we will only consider the squared error
L(x̂, x). TensorFlow provides the function tf.nn.sigmoid, which allows us to use the sigmoid func-
tion as the nonlinearity of our layer. Besides, the tf.reduce_mean function takes the average over
these sums. This way we get a function that can be further optimised. In our example, we use the
Adam descent method from the tf.train API, though many other descent algorithms exist:
The Adam optimizer will work in several steps, adjusting the values of the W and b variables. In
particular, we would also like to have a way of evaluating the performance, so that we know whether
the parameters are optimal. First, we want to check which nodes were predicted correctly by using
the tf.argmax function. Then, the TensorFlow function tf.equal returns a list of booleans, so by
casting the values to float and then calculating the average we finally get the accuracy of the model:
# Accuracy calculation
correct_prediction = tf.equal(tf.argmax(output, 1), tf.argmax(input_layer, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Run the training
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
As mentioned before, the optimizer works in steps. In our case, we run the train_step inside the loop,
feeding it with the batch data: images and the corresponding labels. The placeholders are
filled in using the feed_dict parameter of the run function:
for i in range(steps_number):
    # Get the next batch
    input_batch, input_layer_batch = mnist.train.next_batch(batch_size)
    feed_dict = {training_data: input_batch, input_layer: input_layer_batch}

    # Run the training step
    train_step.run(feed_dict=feed_dict)
We can make use of the accuracy defined previously to monitor the performance on the batches
during the training process. By adding the following code inside the training loop, we will print out
the value every 50 steps; this range of steps can be modified by the user. Thus we obtain:

# Print the accuracy progress on the batch every 50 steps
if i % 50 == 0:
    train_accuracy = accuracy.eval(feed_dict=feed_dict)
    print("Step %d, training batch accuracy %g %%" % (i, train_accuracy * 100))
After the training is finished, the goal is to check the network performance on data it has not
previously seen: the test set. We can reuse accuracy and feed it with the test data instead of
the training batch. Therefore, we can write the following lines of code:

# Evaluate on the test set
test_accuracy = accuracy.eval(feed_dict={training_data: mnist.test.images,
                                         input_layer: mnist.test.labels})
print("Test accuracy: %g %%" % (test_accuracy * 100))
We remind the reader that, in simple terms, the eval() method evaluates a tensor within the current
session and returns its value.
for i in range(10):
    plt.subplot(1, 10, i + 1)
    plt.imshow(mnist.test.images[i].reshape(28, 28))
    plt.axis('off')
plt.show()
Now that the program is complete, we will gather the results in a table.
In order to compare useful results, some parameters will vary. More precisely, the batch size and the
learning rate will be the two main parameters whose values change, so that conclusions can be
drawn. Finally, we will consider a bias equal to 0.1 and then 0.5 to check for potential changes.
We remind the reader that the learning rate α is a hyper-parameter that controls how much we adjust
the weights of our network with respect to the loss gradient. The lower the value, the slower we travel
along the downward slope. While a low learning rate might be a good idea in terms of
making sure that we do not miss any local minimum, it could also mean that we will take a long
time to converge, especially if we get stuck on a plateau region.
Here is the formula showing the relationship:

W_new = W_1 − α ∂J(W_1)/∂W_1
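A small numeric sketch of this update rule (illustrative, on the toy objective J(W) = (W − 3)^2, whose gradient is 2(W − 3); the step counts and rates are arbitrary):

```python
def descend(alpha, steps=100):
    # Repeatedly apply W_new = W - alpha * dJ/dW for J(W) = (W - 3)**2.
    W = 0.0
    for _ in range(steps):
        W = W - alpha * 2.0 * (W - 3.0)
    return W

print(descend(0.1))          # converges close to the minimum at W = 3
print(abs(descend(1.5)) > 1e6)  # True: too large a rate makes it diverge
```

A small α converges (slowly) to the minimum, while a too-large α makes each step overshoot and the iteration diverges, exactly the trade-off described above.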
Thus, if the learning rate is too large, gradient descent may fail to converge, or even diverge.
Typically, learning rates must be configured by the user. At best, the user leverages past
experience (training and test sets) to gain intuition about the best value to use when setting the
learning rate. Furthermore, the learning rate affects how quickly our model can converge to a local
minimum, so getting it right from the start means less time spent training the model.
Number of layers Hidden (encoder) Batch size Epochs Learning rate Accuracy (%)
1 784 10 100 0.1 74.15
1 784 100 1,000 0.1 88.80
1 784 10 100 0.05 77.63
1 784 100 1,000 0.05 88.79
1 784 10 100 0.01 82.67
1 784 100 1,000 0.01 89.96
1 784 10 100 0.001 80.52
1 784 100 1,000 0.001 91.84
1 784 10 100 0.00001 7.46
1 784 100 1,000 0.00001 36.43
2 784 + 32 10 100 0.1 46.99
2 784 + 32 100 1,000 0.1 13.99
2 784 + 32 10 100 0.05 57.15
2 784 + 32 100 1,000 0.05 15.00
2 784 + 32 10 100 0.01 4.74
2 784 + 32 100 1,000 0.01 12.00
2 784 + 32 10 100 0.001 47.08
2 784 + 32 100 1,000 0.001 10.99
2 784 + 32 10 100 0.00001 0.6
2 784 + 32 100 1,000 0.00001 7.99
- The number of layers has been arbitrarily chosen for an example observation. This number may be
increased by the user so that the algorithm proceeds with 3 layers or more.
- The number of hidden nodes corresponds to the product of the number of rows and the
number of columns of an image. Indeed, as an MNIST image is a picture of 28 × 28 pixels, the first
hidden layer has 784 weights.
- The batch size has been chosen in such a way that the epochs-to-batch-size ratio is equal to 10.
However, this value could be multiplied by 10 or 100 for better results.
Though it is often hard to get the optimal α, the diagram below demonstrates the different scenarios
one can fall into when configuring the learning rate.
According to this scheme, gathering the different cases of learning-rate convergence, we can showcase
several scenarios which would end in a bad reconstruction. As an example, with low learning
rates the loss improves slowly; then training accelerates until the learning rate becomes too large
and the loss goes up. In the end, the training process diverges. Therefore, choosing a learning rate
greater than 10^-4 is an appropriate choice.
The results below show the reconstruction evolution for the different chosen learning rates.
The obtained results are quite outstanding: for a constant number of steps equal to 4,000,
the reconstruction accuracy evolves in an exponential way as the learning rate becomes 10 times
smaller. As a result, using a learning rate of 0.00001, the denoising autoencoder is able to distinguish
the numbers 0, 1, 2, 3, 6, 7, 8 and 9. Meanwhile, it is possible to see that the loss difference between
α = 0.001, α = 0.0001 and α = 0.00001 is really small. Indeed, the graph below shows the amount
of loss over time, with a batch size of 100, for the different learning rates:
Thus we conclude that the higher the number of epochs, the faster the loss of the best-optimized
learning rate converges to zero. In particular, the lowest tolerated learning rate α here being
0.001, the squared error for such a learning rate will be minimized when the number of steps for each
layer is maximized. Theoretically, 2,000 epochs would be sufficient to achieve good convergence.
S = L + N + C = 0 + 0 + C = C
where L contains the spectral line information, N is the noise, and C the continuum.
These spectra are also an incredible source of information for those who know how to decipher them.
In particular, they can give an indication of the speed of the studied body. Everything people know
about galaxies and the whole universe comes from the light we perceive; the only exception is the
direct exploration of some objects of the solar system such as the Moon, comets and asteroids.
The spectra reconstruction presented in the next section will aim to reconstruct the absorption
lines, but also the emission lines of any spectra. Suppose an excitation of a pure gas, for example
hydrogen. Bright lines will appear on the spectrum, without a continuous background, because only
the hydrogen element is present. Those lines are called emission lines. Suppose a galaxy such as
Andromeda, made of billions of stars emitting a continuous spectrum across all wavelengths. If a gas,
for example hydrogen, is interposed between the source and the observer, this hydrogen will "absorb"
the light at the wavelengths corresponding to this gas. In the spectrum, this light will disappear,
and we will obtain an absorption spectrum with dark lines. These lines are called absorption lines.
This process is the same for any galaxy and for other celestial objects emitting light.
The general appearance of the spectrum of a galaxy, and especially of a star, makes it possible to
place it in one of the spectral classes, whose members share common characteristics. Indeed, stars
are classified according to their surface temperature. The main classes, from the warmest to the
coldest, are O, B, A, F, G, K and M (with decimal subdivisions such as G0, G2 and G5); the Sun is
a G2 star.
Finally, we note that determining the class allows scientists to determine the absolute luminosity
(which, combined with a measurement of the magnitude, gives the distance), the mass, the radius
(and thus the average density), and the lifetime of the studied galaxy or star.
This instrumentation, this technique and the software that accompanies them allow amateur
astronomers to access many scientific activities formerly reserved for professional astronomers only.
Given the quality of the equipment accessible to amateurs, this fosters numerous and active
collaborations between amateurs and professionals.
v = H × d
where v is the galaxy’s radial outward velocity, d is the galaxy’s distance from Earth, and H is the
constant of proportionality called the Hubble constant such that H = 71 ± 4 km/s/Mpc. This means
that a galaxy 1 megaparsec away will be moving away from us at a speed of 71 km/sec, while another
galaxy 100 megaparsecs away will be receding at 100 times this speed. So essentially, the Hubble
constant reflects the rate at which the universe is expanding.
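As a quick numerical illustration of this proportionality, using the value of H quoted above:

```python
# Hubble's law v = H * d, with H = 71 km/s/Mpc as quoted in the text.
H = 71.0  # km/s per megaparsec

def recession_velocity(d_mpc):
    """Radial outward velocity (km/s) of a galaxy at distance d_mpc megaparsecs."""
    return H * d_mpc

v1 = recession_velocity(1)      # a galaxy 1 Mpc away: 71 km/s
v100 = recession_velocity(100)  # a galaxy 100 Mpc away: 100 times faster
```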
Red shift is an increase in wavelength within the galaxy spectra which means a decrease in wave
frequency, and thus an increasing distance between Earth and the targeted galaxy. Indeed, for a
frequency ν the wavelength λ is defined by the following expression:
λ = c / ν
with c the speed of light.
Hence, we can say that a red shift is also the result of a decrease of energy, and thus a decrease in
the energy carried by each photon, as

E = hν

with h the Planck constant.
Finally, to determine an object's distance, we only need to know its velocity. Velocity is measurable
thanks to the Doppler shift. However, it should be noted that, on very large scales, Einstein's theory
predicts departures from a strictly linear Hubble law. The amount and type of departure depend on
the value of the total mass of the universe. In this way a plot of velocity (or red shift) as a function
of distance, which is a straight line at small distances, can tell us about the total amount of matter
in the universe and may provide crucial information about the mysterious dark matter.
Christian Doppler, Austrian physicist (1803-1853) and Hippolyte Fizeau, French physicist (1818-
1896), independently discovered the variation of the frequency of a perceived sound when a sound
source moves with respect to an observer. When applied to light, this Doppler-Fizeau effect gener-
ates an offset of the frequencies emitted by a moving source relative to an observer.
By taking the spectrum of a distant object, such as a galaxy, astronomers can see a shift in the
lines of its spectrum and from this shift determine its velocity. Putting this velocity into the Hubble
equation, they determine the distance. Note that this method of determining distances is based on
observation (the shift in the spectrum) and on a theory (Hubble’s Law).
Indeed, let us consider a point light source O, emitting a monochromatic light towards an observer
A. The source and the observer are located at a distance d.
Hence, at time t1 a first light wave is sent by O towards A, and A receives this signal at time t′1
such that

t′1 = t1 + d/c
At t2 = t1 + ∆t, this source is moving away from the observer with a distance equal to v∆t with a
considered velocity v.
As a result, A receives the signal from O′ at:

t′2 = t2 + v∆t/c + d/c
We can now determine the delay between the reception of the two signals by the observer. This
delay is given by:

t′2 − t′1 = t2 + v∆t/c + d/c − t1 − d/c
Hence

t′2 − t′1 = t2 − t1 + v∆t/c = ∆t + v∆t/c

Factoring out ∆t = t2 − t1, we obtain

t′2 − t′1 = ∆t (1 + v/c)
Interpreting ∆t as the period T of the emitted wave and ∆t′ = t′2 − t′1 as the received period T′,
we have

T = λ/c = ∆t

and

T′ = λ′/c = ∆t′
Therefore:

λ′ = λ (1 + v/c)
Considering the offset of the emitted wavelengths at reception, ∆λ = λ′ − λ, we thus obtain:

(λ′ − λ)/λ = ∆λ/λ = v/c

and the red shift z is defined as

z = ∆λ/λ
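As a worked example of this relation (the Hα rest wavelength of 656.3 nm and the observed value are illustrative numbers, not data from this project):

```python
def redshift(lam_observed, lam_rest):
    """z = (lambda' - lambda) / lambda, the non-relativistic Doppler shift."""
    return (lam_observed - lam_rest) / lam_rest

c = 3.0e5  # speed of light in km/s (approximate)

# Illustrative values: the H-alpha line, at rest at 656.3 nm,
# observed at 662.9 nm in a galaxy spectrum.
z = redshift(662.9, 656.3)  # about 0.01
v = z * c                   # radial velocity in km/s, valid only for v << c
```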
If the theory is not correct, the distances determined in this way are all nonsense. Most astronomers
believe that Hubble’s Law does, however, hold true for a large range of distances in the universe.
We can observe that if the speed of the galaxy is positive - meaning that it moves away from us -
∆λ is positive (as there is a red shift). The shift is made towards the greatest wavelengths, towards
the red for visible light.
Meanwhile, a galaxy approaching us (meaning that v is negative) will see its light shifted towards
the blue for visible light. This is the case, for example, of the Andromeda galaxy, which is close to
ours. Finally, we should not forget that the measured velocity is not the actual velocity of the galaxy:
red shift measurements estimate only the radial velocity, the component of the true velocity along
the observer's line of sight.
When the velocity measured by the Doppler-Fizeau effect is no longer negligible compared with the
speed of light, Newtonian classical mechanics is no longer appropriate to describe the phenomenon
properly. We must introduce relativity.
To understand this process, note that the classical relation above allows speeds greater than that of
light and is therefore unsuitable. The relation to be applied is then the following one:

λ′/λ = (1 + v/c) / √(1 − v²/c²)
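A short numerical comparison of the classical and relativistic shift factors (the velocities are chosen purely for illustration) makes the difference concrete:

```python
import math

def shift_classical(v_over_c):
    """lambda'/lambda in the Newtonian limit: 1 + v/c."""
    return 1 + v_over_c

def shift_relativistic(v_over_c):
    """lambda'/lambda = (1 + v/c) / sqrt(1 - v^2/c^2)."""
    return (1 + v_over_c) / math.sqrt(1 - v_over_c**2)

# At v = 0.01c the two expressions agree to about 0.005%;
# at v = 0.9c they differ strongly and only the relativistic one is valid.
low = (shift_classical(0.01), shift_relativistic(0.01))
high = (shift_classical(0.9), shift_relativistic(0.9))
```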
Finally, physicists confirmed several years ago that a "gravitational red shift" also exists. Relativity
predicts a dilation of time near large concentrations of matter. This affects the wavelengths of the
light passing nearby, shifting them towards the red.
The final objective is to implement the previous mathematical expressions so that the algorithm
estimates the red shift z for each observable galaxy spectrum within the data given in the test set.
This would allow the estimation of the red shift for any input galaxy spectrum of the same range
and, in the end, make researchers' work easier.
In the matrix representation the input, hidden and output nodes are represented by three vectors i
(input nodes vector), h (hidden nodes vector) and o (output nodes vector) respectively. The weights
connecting each layer are represented by a matrix. W (weight matrix) connects the input layer with
the hidden layer, and W T (weight matrix from the hidden layer to the output) connects the hidden
layer with the output layer. See figure below:
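A minimal NumPy sketch of this matrix form (the sizes and the tanh nonlinearity are illustrative; the decoder reuses the tied weights W^T):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 6, 3  # illustrative layer sizes

i = rng.random(n_in)               # input nodes vector
W = rng.random((n_in, n_hidden))   # weight matrix, input layer -> hidden layer

h = np.tanh(i @ W)    # hidden nodes vector
o = np.tanh(h @ W.T)  # output nodes vector, using the tied weights W^T
```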
In this thesis, we decided to use the Python language to implement our algorithms. Python has two
important advantages for this project: it is executable without compiling and has a very light syntax
(which allows programming and debugging very quickly), and its libraries can be used with ease. In
this case, we will be using the TensorFlow library, which provides powerful mathematical
computation.
We recall that TensorFlow is an open-source machine learning library created by Google, whose
goal is to let users streamline the development and execution of advanced analytical applications.
This library has better, natively integrated computational-graph visualizations than libraries like
Torch and Theano, and offers the advantages of seamless performance, quick updates and frequent
releases with new features.
1. Compute the three layers: input layer, hidden layer, and decoding layer.
2. Compute the amount of noise: compute the corruption level, in percentage, chosen by the
user.
W_init_max = 4 * np.sqrt(6. / (nb_visible + nb_hidden))
W_init = tf.random_uniform(shape=[nb_visible, nb_hidden],
                           minval=-W_init_max,
                           maxval=W_init_max)
This can also be done with a second approach, by using the uniform distribution. In probability
theory, a law of probability is uniform on an interval [a; b] if its probability density is the function
f defined on that interval by:

f(x) = 1/(b − a)

This can also be written as:

f(x) = 1/(b − a) for all x ∈ [a; b], and f(x) = 0 for all x ∉ [a; b]

with the normalization condition

∫_a^b f(x) dx = 1
Hence, writing f(x) = k constant on [a; b]:

∫_a^b k dx = 1

k × [x]_a^b = k(b − a) = 1

Finally

k = 1/(b − a) = f(x)

in such a way that:

W0 = ± 1 / (number of visible units)
This may be implemented within the autoencoder program using the following lines of code (in
Python) with the help of the Tensorflow library:
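Since the listing itself is not reproduced in this copy, here is a NumPy sketch of the same initialization (the names nb_visible and nb_hidden and the layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
nb_visible, nb_hidden = 784, 100  # illustrative layer sizes

# Bound derived above: W0 = +/- 1 / (number of visible units)
W0 = 1.0 / nb_visible
W = rng.uniform(-W0, W0, size=(nb_visible, nb_hidden))
```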
The squared error is a measure of the correctness of the network. It is calculated for each input
pattern pt . Therefore, the better the reconstruction, the lower the squared error.
Every activation function (or nonlinearity) takes a single number and performs a certain fixed
mathematical operation on it. Below are some activation functions often used:

S(t) = 1 / (1 + e^(−t))

Let us consider the logistic function h such that:

h(x) = L / (1 + e^(−k(x − x0)))
Hence, we can conclude that S is an increasing function of t, and that h is likewise increasing, with
its steepest slope at x = x0 . Therefore, the sigmoid function is the special case of the logistic
function where:
• L = 1 and x0 = 0;
• k = 1, where k is the parameter controlling how steep the change from the minimum to the
maximum value is.
dS/dt = e^(−t) / (1 + e^(−t))² = S(t) (1 − S(t))
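The identity above is easy to check numerically, comparing a central finite difference with the closed form:

```python
import math

def S(t):
    """Sigmoid function S(t) = 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

# Numerical check of dS/dt = S(t) * (1 - S(t)) at a few points.
for t in (-2.0, 0.0, 1.5):
    h = 1e-6
    numeric = (S(t + h) - S(t - h)) / (2 * h)  # central difference
    analytic = S(t) * (1 - S(t))
    assert abs(numeric - analytic) < 1e-8
```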
Although sigmoid functions have a higher run time than ReLU and hyperbolic tangent functions,
they are simple to use and easy to train. Moreover, sigmoid functions are continuous, bounded and
real-valued (see the previous results), and sums of sigmoids are dense in L∞ on compact sets.
Finally, the hyperbolic tangent function will be chosen as the nonlinearity to be used for our layers.
Using the following Matlab commands

t = linspace(-6,6,100);
x = 1./(1+exp(-t));
y = tanh(t);
plot(t,x,'r');
hold on;
plot(t,y,'b');
title('Sigmoid and hyperbolic tangent functions');
xlabel('t');
ylabel('y');
legend('Sigmoid','Tanh')

we can draw the sigmoid and the hyperbolic tangent functions, obtaining the graph below:
Figure 18 – Curve of the sigmoid and hyperbolic tangent functions using Matlab.
where tf.matmul(Estimated_X, W) computes the matrix product of Estimated_X (the matrix
gathering the corrupted data) and W.
tanh x = sinh x / cosh x = (e^(2x) − 1) / (e^(2x) + 1)

so that

tanh x = 2 · S(2x) − 1
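This identity can also be verified numerically at a handful of points:

```python
import math

def S(t):
    """Sigmoid function S(t) = 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

# Check tanh(x) = 2*S(2x) - 1 at several points.
for x in (-3.0, -0.5, 0.0, 0.7, 2.0):
    assert abs(math.tanh(x) - (2 * S(2 * x) - 1)) < 1e-12
```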
Hence, we conclude that the hyperbolic tangent function is a shifted or rescaled version of the
sigmoid function. The sigmoid activation function has the potential problem that it saturates at
zero and one while tanh saturates at plus and minus one. Therefore, if the activity in our neural
network during training is close to zero then the gradient for the sigmoid activation function may
go to zero. This is called "the vanishing gradient problem".
Conclusions
Resulting from the above conclusions, we now know that we must use at least 2 input, hidden and
decoding layers to decode data from a given test set, with the highest number of epochs possible.
The best ways to improve the neural network's accuracy are data optimization, algorithm tuning
and hyper-parameter optimization. We can also conclude that the neural network should not overfit
if it is to perform well. One of the main parameters acting on overfitting is the choice of the
activation function. Activation functions map inputs non-linearly to outputs; hence they are highly
important, and choosing the right activation function helps our model learn better.
Consider the signal-to-noise ratio SNR and the flux Φ such that SN R ∝ Φ. Plotting the recon-
structed spectra we obtain:
Figure 19 – True data-Reconstructed data comparison for a S/N ratio of 10 and 20.
The above results show the reconstruction of the same spectrum using different values:
- On the top left corner, the number of hidden values nhidden was 20 for a SNR of 10.
- On the top right corner, nhidden = 100 with a SNR of 10.
- On the middle-left, nhidden = 1, 000 with a SNR of 10.
- On the middle-right, nhidden = 100 with a SNR of 20.
- On the bottom left corner, nhidden = 20 with a SNR of 20.
- On the bottom right corner, nhidden = 1, 000 with a SNR of 20.
However, as observed, there is still a distance (i.e. a non-zero squared error) between the true signal
and the reconstructed signal. This is due to the model's accuracy remaining constant.
In addition, here are tables gathering several results for an unchanged SNR equal to 38:
Figure 20 – Reconstructed input after 1,000 epochs for a signal-to-noise ratio of 38.
We can clearly observe that too high an SNR gives completely incorrect curves. This implies that
in the future we will revert to an SNR equal to 25.
Finally, we could observe that the optimal learning rate is α = 0.0001, thanks to the following curve
obtained for a small number of epochs:
Indeed, the curve clearly shows that below this learning rate the cost increases again.
In the end, we have gathered enough data to conclude that the optimal number of epochs is equal
to 400, the learning rate shall not exceed 10−4 , and the SNR must be approximately equal to 25.
Despite the optimization of all those parameters, the neural network still has difficulty reaching an
excellent reconstruction. The lack of improvement in accuracy may come from overfitting. As
explained, overfitting can be a serious problem in deep learning. Dropout is a technique developed
to solve this exact problem; the idea is to randomly drop units in a deep neural network. Moreover,
it is one of the biggest advancements in deep learning to come out in the last few years.
Learning the relationship between the inputs and the outputs of a dataset is a very complicated
procedure. When using a very small dataset, the relationship may be a result of noise in the input
sample. As the training dataset used here contains thousands of examples, this assumption no
longer holds.
Dropout refers to randomly and temporarily removing a unit, in either a hidden or a visible layer,
together with all of its incoming and outgoing connections. Using TensorFlow, we can initialize
dropout as below:
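The TensorFlow snippet is not reproduced in this copy; in TensorFlow 1.x it amounts to a single call such as tf.nn.dropout(layer, 0.90). The mechanism itself can be sketched in NumPy (the keep probability and shapes are illustrative), using the usual inverted-dropout scaling:

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Randomly zero units; scale survivors by 1/keep_prob so the expected value is unchanged."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((4, 5))
dropped = dropout(a, 0.9, rng)  # roughly 10% of entries become 0
```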
Using all the results from above, we can understand that the batch size has a real impact on the
learning accuracy. To demonstrate this, let us plot the cost function for two different batch sizes:
Figure 23 – Cost function for a batch size = 20 and SNR = 25 (left), and for a batch size = 40 and
SNR = 23 (right).
Those two curves were obtained for a number of epochs equal to 400. For a doubled batch size, the
cost value is twice as high, which suggests that a good batch size would be equal to 10. However,
the graph below also shows that the signal-to-noise ratio is decisive for reaching the lowest losses
possible. Indeed, for the same number of epochs but with an SNR of 30 we get:
This confirms that the best SNR value is 25, and thus the best batch size is 20.
As a result, the reconstructed signals look like the following.
To conclude, the results are almost perfect, making the output nearly identical to the input. A small
optimization problem is still present; changing the activation function to ReLU or Leaky ReLU may
solve it. See the results below.
Figure 27 – Accurate reconstruction with ReLU activation function of a galaxy spectrum (2).
Figure 29 – Accurate reconstruction with ReLU activation function of a galaxy spectrum (4).
w(t) = u(t) ⊗ v(t) = ∫_{−∞}^{+∞} u*(τ − t) v(τ) dτ

Unlike convolution, the integration variable τ has the same sign in the arguments of u(·) and v(·),
so the arguments have a constant difference instead of a constant sum.
Cross-correlation is used to find where two signals match: here u(t) is the test waveform, which
would be the initial input spectrum, and v(t) is the signal that contains u(t) with added noise in
our model. Therefore, we would implement this inside the code as follows:
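The listing itself is not reproduced in this copy; a NumPy sketch of the idea (the signal lengths, noise level and offset are illustrative) is:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative stand-ins: x plays the role of the input spectrum (the test
# waveform u), y the signal containing it with added noise (v).
x = np.sin(np.linspace(0, 4 * np.pi, 50))
y = rng.normal(0.0, 0.1, 300)
offset = 120
y[offset:offset + x.size] += x  # bury the template at a known shift

corr = np.correlate(y, x, mode='valid')  # cross-correlation at each lag
best = int(np.argmax(corr))              # lag where the two signals match best
```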
where x represents the array containing the variables and observations as the input, and y the
array containing the variables and observations as the output once reconstruction is done. When a
galaxy spectrum is cross-correlated with the given templates, its found emission lines are masked
out, implying that the red shift is derived from the absorption features.
It must be remembered that the reconstruction process was too slow to experiment with higher
values of the main parameters such as the batch size: a deeper evaluation could measure the time
spent on each iteration and thus find the best time-to-cost trade-off for slow machines. Only with a
powerful machine can the DAE's current reconstruction time be fully appreciated.
At this point the denoising autoencoder does not yet work at full power on the reconstruction of
spectra. Users should be aware of the potential enhancement that can be achieved with the current
algorithm.
# -*- coding: utf-8 -*-
"""
Created on Sat Nov 24 16:14:05 2018

@author: Romain Fonteyne

This program runs MNIST images with 1 layer
"""

###############
### IMPORTS ###
###############

import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.examples.tutorials.mnist import input_data

# Read data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

#################################
### PARAMETERS INITIALIZATION ###
#################################

image_size = 28
input_layer_size = 10
learning_rate = [0.1, 0.05, 0.01, 0.001, 0.00001]
steps_number = 1000
batch_size = 100
stddev = 0.0
mean = 0.0

accuracy_list = []
loss_list = []

####################
### MAIN PROGRAM ###
####################

for i in range(len(learning_rate)):
    # Define placeholders
    # ...

    # Variables to be tuned
    W = tf.Variable(tf.random_normal([image_size * image_size, input_layer_size],
                                     stddev=0.1))
    b = tf.Variable(tf.constant(0.0, shape=[input_layer_size]))

    # Build the network (only output layer)
    output = tf.matmul(training_data, W) + b

    # ...

    # Training step
    train_step = tf.train.AdamOptimizer(learning_rate[i]).minimize(loss)

    # Accuracy calculation
    correct_prediction = tf.equal(tf.argmax(output, 1), tf.argmax(input_layer, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    # Run the training
    sess = tf.InteractiveSession()
    sess.run(tf.global_variables_initializer())

    for i in range(steps_number):
        # Get the next batch
        epoch_loss = 0
        input_batch, input_layer_batch = mnist.train.next_batch(batch_size)
        feed_dict = {training_data: input_batch, input_layer: input_layer_batch}

        _, l = sess.run([train_step, loss], feed_dict)

        epoch_loss += l

        # Run the training step
        train_step.run(feed_dict=feed_dict)

    # ...

    # Evaluate on the test set
    test_accuracy = accuracy.eval(feed_dict={training_data: mnist.test.images,
                                             input_layer: mnist.test.labels})
    print("Test accuracy: %g %%" % (test_accuracy * 100))

    accuracy_list.append(test_accuracy)
    # loss_list.append(loss_float)

#######################
### OUTPUT DISPLAYS ###
#######################

for i in range(10):
    plt.subplot(1, 10, i + 1)
    plt.imshow(mnist.test.images[i].reshape(28, 28))
    plt.axis('off')
plt.show()
MNIST_1_layer.py
# -*- coding: utf-8 -*-
"""
Created on Tue Nov 27 22:00:27 2018

@author: Romain Fonteyne

This program runs MNIST images with 2 layers
"""

###############
### IMPORTS ###
###############

# Import of Tensorflow
import tensorflow as tf

# Import of the MNIST data
from tensorflow.examples.tutorials.mnist import input_data

# Import of math libraries
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.legend_handler import HandlerLine2D
from numpy import loadtxt
import numpy as np
from pylab import rcParams

#################################
### PARAMETERS INITIALIZATION ###
#################################

# ...

# Size of the print wanted
rcParams['figure.figsize'] = 20, 20

# Initialization of lists for future accuracy and loss values
accuracy_list = [[], [], [], [], []]
# ...

# Learning rate initialization
learn_rate = [0.1, 0.05, 0.01, 0.001, 0.00001]  # How fast the model should learn

# Looping through the first 10 test images and display
"""for i in range(10):
    plt.subplot(1, 10, i+1)
    plt.imshow(X_test[i].reshape(28, 28), cmap='Greys')
    plt.axis('off')
plt.show()"""

# ...

# Initialization of the nodes by layer
n_nodes_inpl = 784  # Encoder
n_nodes_hl1 = 32    # Encoder
n_nodes_hl2 = 32    # Decoder
n_nodes_outl = 784  # Decoder

# Definition of the batch size and number of epochs
batch_size = 100   # How many images to use together for training
hm_epochs = 1000   # How many times to go through the entire dataset
tot_images = X_train.shape[1]  # Total number of images

####################
### MAIN PROGRAM ###
####################

# ...
hidden_1_layer_vals = {
    'weights': tf.Variable(tf.random_normal([n_nodes_inpl, n_nodes_hl1])),
    'biases': tf.constant(0.0, shape=[n_nodes_hl1])}

# Second hidden layer has 32*32 weights and 0 biases
"""hidden_2_layer_vals = {
    'weights': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])),
    'biases': tf.Variable(tf.random_normal([n_nodes_hl2]))}"""

hidden_2_layer_vals = {
    'weights': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])),
    'biases': tf.constant(0.0, shape=[n_nodes_hl2])}

# ...
output_layer_vals = {
    'weights': tf.Variable(tf.random_normal([n_nodes_hl2, n_nodes_outl])),
    'biases': tf.constant(0.0, shape=[n_nodes_outl])}

# Creation of the placeholder (image with shape 784 goes in)
input_layer = tf.placeholder('float', [None, 784])

# ...

# Cost function definition
meansq = tf.reduce_mean(tf.square(output_layer - output_true))

# Optimizer definition
optimizer = tf.train.AdagradOptimizer(learn_rate[j]).minimize(meansq)

# ...

# Accuracy calculation
correct_prediction = tf.equal(tf.argmax(output_layer, 1), tf.argmax(input_layer, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# ...
if i % 100 == 0:
    train_accuracy = accuracy.eval(feed_dict={input_layer: epoch_x,
                                              output_true: epoch_x})

# ...
# Add of the current accuracy to the list
accuracy_list[j].append(test_accuracy * 100)

#######################
### OUTPUT DISPLAYS ###
#######################

# ...
first_legend = plt.legend(handles=[a0], loc=1, frameon=False)
plt.legend(handler_map={a0: HandlerLine2D(numpoints=4)})

# ...
ax_legend = plt.legend(handles=[l0], loc=1, frameon=False)
plt.legend(handler_map={l0: HandlerLine2D(numpoints=4)})
MNIST_2_layers.py
%% Authors: Romain Fonteyne
%% Date: 04/12/2018

% This script calculates the principal components and vectors
% using the Principal Component Analysis

rng 'default'

M = 15; % Number of observations
N = 8;  % Number of variables observed

% Made-up data
X = rand(M, N);

% ...
% Do the PCA
[coeff, score, latent, ~, explained] = pca(X);

% Calculate eigenvalues and eigenvectors of the covariance matrix
covarianceMatrix = cov(X);
[V, D] = eig(covarianceMatrix);

% ...
% Multiply the original data by the principal component vectors to get the
% projections of the original data on the principal component vector space.
% This is also the output "score". Compare ...
dataInPrincipalComponentSpace = X * coeff
score
% The columns of X*coeff are orthogonal to each other. This is shown with ...
corrcoef(dataInPrincipalComponentSpace)

% The variances of these vectors are the eigenvalues of the covariance matrix,
% and are also the output "latent". Compare these three outputs

var(dataInPrincipalComponentSpace)'

% ...
sort(diag(D), 'descend')
PCA.m
@author: Romain Fonteyne
"""

import numpy as np, matplotlib
import matplotlib.pyplot as plt

# ...
U, S, V = np.linalg.svd(A)
eigvals = S**2 / np.cumsum(S)[-1]  # Calculate eigenvalues

######################
### OUTPUT RESULTS ###
######################

# ...
leg = plt.legend(['Eigenvalues from SVD'], loc='best', borderpad=0.3,
                 shadow=False,
                 prop=matplotlib.font_manager.FontProperties(size='small'),
                 markerscale=0.4)
leg.get_frame().set_alpha(0.4)
leg.draggable(state=True)

plt.show()
Scree_Plot–PCA.py
1 # −∗− c o d i n g : u t f −8 −∗−
2 """
3 C r e a t e d on Thu J a n 03 1 5 : 1 3 : 5 4 2 0 1 8
4
10
11 # ################
12 # ## IMPORTS # # #
13 # ################
14
17 # Import of Tensorflow
18 import tensorflow as t f
19 # I m p o r t o f numpy
20 i m p o r t numpy a s np
21 import os
22 # Import of m a t p l o t p l i b
23 import m a t p l o t l i b . pyplot as p l t
24 import m a t p l o t l i b . p a t c h e s as mpatches
25 from m a t p l o t l i b . l e g e n d _ h a n d l e r i m p o r t H a n d l e r L i n e 2 D
26
27 # ##################################
28 # ## PARAMETERS INITIALIZATION # # #
29 # ##################################
30
31 # I n i t i a l i z a t i o n o f l i s t f o r f u t u r e a c c u r a c y and l o s s v a l u e s
32 accuracy_list = [[] ,[] ,[] ,[] ,[]]
33 cost_list = [[] ,[] ,[] ,[] ,[]]
34
35
36 # ######################
37 # ## MAIN FUNCTION # # #
38 # ######################
39
43 Parameters
44 −−−−−−−−−−
48 Returns
49 −−−−−−−
50 x : Tensor
51 I n p u t p l a c e h o l d e r t o t h e network
52 z : Tensor
53 I n n e r −most l a t e n t r e p r e s e n t a t i o n
54 y : Tensor
55 Output r e c o n s t r u c t i o n o f t h e i n p u t
56 c o s t : Tensor
57 O v e r a l l c o s t to use f o r t r a i n i n g
58 """
x = tf.placeholder(tf.float32, [None, dimensions[0]], name='x')
# x = tf.reshape(x, [1000, 4258])
corrupt_prob = tf.placeholder(tf.float32, [None, dimensions[0]])
# Corruption process: additive Gaussian noise scaled by corrupt_prob
current_input = (corrupt_prob * tf.random_normal(shape=tf.shape(x), mean=0.0,
                 stddev=1, dtype=tf.float32, seed=None, name=None)) + x
# corrupt_prob = tf.reshape(corrupt_prob, [1000, 4258])
# current_input = tf.reshape(current_input, [1000, 4258])
print(corrupt_prob)

encoder = []
for layer_i, n_output in enumerate(dimensions[1:]):
    n_input = int(current_input.get_shape()[1])
    W = tf.Variable(tf.random_uniform([n_input, n_output],
                                      -1.0 / np.sqrt(n_input),
                                      1.0 / np.sqrt(n_input)), name="W")
    encoder.append(W)
    output = tf.nn.relu(tf.matmul(current_input, W))
    current_input = output

z = current_input
encoder.reverse()

# Cost function calculation
cost = tf.sqrt(tf.reduce_mean(tf.square(y - x)))
# Dropout function
drop_out = tf.nn.dropout(y, 0.90)  # 90% probability that each element is kept


# Load of data
training = (np.load(
    'lsst_cosmossnap_spectra_restframe_training_set_10000_bsl_t7_8642.npy')).T
wavelength_training = np.load(
    'lsst_cosmossnap_spectra_restframe_training_set_wavelength.npy')
# print(training)

# Flux calculation
flux = np.median(training[:, 3000:3377], axis=1)


num_samples = 2000  # 1700
n_input = 4258
n_hid = 1200  # 800

ae = denoising_autoencoder(dimensions=[n_input, n_hid])


for i in range(0, num_samples):
    corruption[i, :] = sigma[i] * np.ones(training.shape[1])


optimizer = tf.train.AdamOptimizer(learning_rate).minimize(ae['cost'])


print('Decoding...\n')


# if (epoch_i % 100 == 0):
print("Epoch", epoch_i, "/", n_epochs, "Cost:",
      sess.run(ae['cost'],
               feed_dict={ae['x']: batch_xs, ae['corrupt_prob']: corrupt_xs}))
# print("W:\n", sess.run(ae['weight'],
#       feed_dict={ae['x']: batch_xs, ae['corrupt_prob']: corrupt_xs}))
np.save('Weight.npy',
        sess.run(ae['weight'],
                 feed_dict={ae['x']: batch_xs, ae['corrupt_prob']: corrupt_xs}))
# Append the current loss to the list
cost_list[0].append(
    sess.run(ae['cost'],
             feed_dict={ae['x']: batch_xs, ae['corrupt_prob']: corrupt_xs}))


# Plot example reconstructions
n_examples = num_samples
corrupt_test = 0 * np.random.randn(n_examples, training.shape[1])


recon_dropout = sess.run(ae['dropout'],
                         feed_dict={ae['x']: training[0:n_examples, :],
                                    ae['corrupt_prob']: corrupt_test})

##########################
### OUTPUT DISPLAYS ###
##########################


for p in range(0, 4):
    plt.figure()
    plt.plot(wavelength_training, training[p, :])
    plt.plot(wavelength_training, recon[p, :])
    plt.show()
DAE.py
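The two key operations of the listing above, the corruption process and the RMSE reconstruction cost, can be illustrated without TensorFlow. The following is a minimal NumPy sketch with synthetic data, for illustration only and not part of the thesis code: `corrupt` mirrors the additive Gaussian noise scaled by `corrupt_prob`, and `rmse_cost` mirrors the `cost` tensor `tf.sqrt(tf.reduce_mean(tf.square(y - x)))`.

```python
import numpy as np

def corrupt(x, corrupt_prob, rng):
    """Add zero-mean, unit-variance Gaussian noise scaled per element,
    as in the corruption step of the denoising autoencoder."""
    return x + corrupt_prob * rng.normal(size=x.shape)

def rmse_cost(x_hat, x):
    """Root-mean-square reconstruction error between the reconstruction
    x_hat and the clean input x."""
    return np.sqrt(np.mean((x_hat - x) ** 2))

# Synthetic "spectra" for illustration: 4 samples, 10 wavelength bins
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 10))
x_corr = corrupt(x, corrupt_prob=0.1, rng=rng)
```

Training the DAE then amounts to minimizing `rmse_cost` between the network's reconstruction of `x_corr` and the clean input `x`, which is exactly what the AdamOptimizer step in the listing does.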