Written by
Romain Fonteyne
I am using this opportunity to express my gratitude to my supervisor, Mrs. Joana Maria Frontera-
Pons, who supported me throughout the course of this end-of-studies Master's project. I am thankful
for her inspiring guidance and invaluable constructive criticism.
Abbreviations
AE Autoencoder
AI Artificial Intelligence
ML Machine Learning
NN Neural Network
tf TensorFlow
Symbols
A Matrix
b, bT , b−1 Bias matrix
d Distance
E Set of positive real numbers
f (xcorr , θ) Input model (function)
g(y, θ−1 ) Output model (function)
h(x) Logistic function
k Optimal number of principal components
L(x̂, x), LH (x̂, x) Loss function
n Number of variables
p Number of individuals
pt Pattern
q(x) Corrupted data function
r Correlation coefficient
S(t) Sigmoid function
W, W T , W −1 Weight matrix
y, y(x) Encoded data
x Input data
xcorr Corrupted data
X̂, x̂ Reconstructed data
x̄, ȳ Mean value
α Slope
θ Weight matrix symbol
σ Nonlinearity
σ̂X Standard deviation of X
σ̂Y Standard deviation of Y
σ̂XY Estimator of the covariance
Abstract
This Master's thesis focuses on the estimation of galaxy red shifts and its application in machine
learning. This work is conducted within the context of a classical neural network model, for which
we explain the principle of this project as well as its limitations. In particular, we recall that red shift
is an increase in the wavelength within the studied galaxy spectrum. Considering that photometric
red shifts can be measured for millions of galaxies, whereas spectroscopic data are limited, we
demonstrate why they are essential to modern cosmology and how to estimate such precise values
through Deep Learning. We apply this learning method to data collected by the Commissariat à
l'énergie atomique et aux énergies alternatives (CEA) of Saclay, and we pay particular attention to
the errors and accuracy of the data returned by our denoising autoencoder (DAE). For the
success of our mission, we use Unsupervised Learning techniques in order to train our algorithm
to determine the spectral red shift in an autonomous way. The large number of spectra present
in the training data set allowed us to build a robust algorithm with adequate parameter values.
Hence, we adjusted the parameters in such a way that the DAE optimally reconstructs the data,
thus obtaining spectral values (output) in very good agreement with the input data. Lastly, we
observed that optimizing the algorithm of a DAE is necessary for an optimal functioning.
Résumé
This end-of-studies project focuses on the estimation of the spectral shifts of galaxies toward
the red (commonly called "red shifts") and their applications within machine learning. This study
is set in the context of a standard neural network model, for which we explain the principle of the
study, its limits, and the way the work was carried out. In particular, we recall that the red shift
is an increase in the wavelength within the spectrum of the studied galaxy. Given that photometric
red shifts can be measured for millions of galaxies, while the number of spectroscopic measurements
is limited, we show why they are essential to modern cosmology and how to estimate such precise
values through Deep Learning. We applied this method to data collected by the Commissariat à
l'énergie atomique et aux énergies alternatives (CEA) of Saclay and paid particular attention to the
errors and accuracy of the data returned by our Denoising Autoencoder. For the success of our
mission, we used unsupervised learning to train our algorithm to determine the spectral shift
autonomously. The large number of spectra present in the training data allowed us to build a robust
algorithm with adequate parameters. We adjusted the parameters so that the DAE reconstructs
the data as well as possible, thus obtaining spectral values in very good agreement with the initial
data. We observed that optimizing the algorithm of a DAE is necessary for optimal functioning.
Contents
Acknowledgement
Abbreviations
Symbols
Project Synthesis
1 Introduction
1.1 The Universe and Galaxies' Red Shift
1.2 Big data treatment
1.3 Project objectives
3 Representation Learning
3.1 Machine learning
3.2 Deep Learning
Autoencoders
Denoising autoencoders
3.3 Introduction to machine learning algorithms
MNIST example
Results
Conclusion
3.4 Spectra representation
5 Further discussion
Bibliography
Webography
Appendices
Appendix 1: MNIST data decoding with 1 layer
Appendix 2: MNIST data decoding with 2 layers
Appendix 3: Principal Component Analysis programs
Appendix 4: Denoising autoencoder for galaxy spectra red shift estimation
Project Synthesis
Representation learning for galaxy spectra: to investigate the use of recent advances in
deep learning to design new representations for galaxy spectra.
Realized studies
- Investigation of deep learning techniques.
- Study of the denoising autoencoder architecture to derive robust representations for the
continuum component of the galaxy spectra.
- Study of Artificial Neural Networks (ANNs).
Additionally, our work will deal with cross-correlation methods, which use a discrete Fourier trans-
form to correlate a template spectrum with a galaxy spectrum, allowing the shift of the template
spectrum to become a free parameter. These methods reduce to a simple multiplication in Fourier
space between the template and galaxy spectra, which makes them easier and faster to compute
than the same procedure in real space, and therefore very convenient. However, they require the
spectra to be free of continuum.
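As an illustrative sketch of this idea (hypothetical, NumPy-based; the array names and the 30-pixel shift are not from the project), cross-correlating a galaxy spectrum with a template via a product in Fourier space could look like:

```python
import numpy as np

# Synthetic example: the "galaxy" is a shifted, noisy copy of the template,
# and the shift is recovered as the peak of the circular cross-correlation.
rng = np.random.default_rng(0)
template = rng.normal(size=512)
galaxy = np.roll(template, 30) + 0.1 * rng.normal(size=512)

# Cross-correlation theorem: corr = IFFT( FFT(galaxy) * conj(FFT(template)) )
corr = np.fft.ifft(np.fft.fft(galaxy) * np.conj(np.fft.fft(template))).real
best_shift = int(np.argmax(corr))
print(best_shift)  # the lag that best aligns template and galaxy: 30
```

A single FFT-space product replaces a sum over all lags, which is exactly why these methods are faster than the same procedure in real space.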
The following assumption is made within the algorithm: any test spectrum Sλ may be represented as
a linear combination of normalized template spectra Tiλ . The DARTH FADER algorithm (Denoised
and Automatic Red shifts Thresholded with a False Detection Rate), a new, largely automated
wavelet-based method for estimating the red shifts of galaxy spectra, shows that, if many line
features (either absorption or emission) are present in the spectrum, it is easy to determine the red
shift via cross-correlation with a representative set of eigentemplates.
FDR = Vf / Va
where Vf is the number of inactive pixels, and Va is the number of active pixels.
Thus an FDR ratio of 0.05 allows on average one false feature for every 20 features detected, even
though it is usually not possible to reach this statistical accuracy. Increasing the FDR parameter
does provide a significant improvement in the efficacy of the denoising. The False Discovery
Rate denoising procedure denoises the positive and negative halves of the spectrum independently,
with positivity and negativity constraints respectively.
Nevertheless, for a large test catalogue that includes a variety of galaxy types, a large number of
templates is needed to ensure the best match-up between template and test spectra. Using all of
them in the cross-correlation would be excessively time-consuming. If the number of templates
could be reduced whilst still retaining most of their information content, the method would become
more practical. Principal Component Analysis is a simple tool that allows us to do just that: to
reduce the dimensionality of this problem, which may be described as a neural network, by extracting
the most important features from our set of template spectra, the principal components.
Definition
Principal Component Analysis (abbreviated PCA) is a statistical procedure that uses an orthogonal
transformation to convert a set of observations of possibly correlated variables (entities each of
which takes on various numerical values) into a set of values of linearly uncorrelated variables
called principal components. Therefore, the principle is to build a representation system of a reduced
dimension that keeps the distance between individuals, where the distance d, used in a set E such
that d : E × E → R+ , is written as follows:

d = sqrt( sum_{i=1}^{n} (x_i − y_i)^2 )
One of the main problems of such an analysis is the loss of information, which may be controlled
or uncontrolled.
As a mean to analyze the relationships between variables, PCA utilizes a correlation coefficient that
measures the linear link between two variables X and Y . This coefficient is defined by the following
expression:
r = σ̂_XY / (σ̂_X · σ̂_Y)

with

σ̂_XY = (1/N) sum_{i=1}^{N} (x_i − x̄)(y_i − ȳ)
σ̂_X = sqrt( (1/N) sum_{i=1}^{N} (x_i − x̄)^2 )
σ̂_Y = sqrt( (1/N) sum_{i=1}^{N} (y_i − ȳ)^2 )
x̄ = (1/N) sum_{i=1}^{N} x_i
ȳ = (1/N) sum_{i=1}^{N} y_i

which are respectively the estimator of the covariance, the standard deviations, and the means.
r ranges between −1 and 1, as a direct consequence of the Cauchy–Schwarz inequality.
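The coefficient above can be sketched in a few lines of NumPy (the data values here are illustrative, chosen so that y depends almost linearly on x):

```python
import numpy as np

# Correlation coefficient r from the estimators defined above.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # roughly y = 2x, so r should be near 1

x_bar, y_bar = x.mean(), y.mean()
cov_xy = np.mean((x - x_bar) * (y - y_bar))       # estimator of the covariance
sigma_x = np.sqrt(np.mean((x - x_bar) ** 2))      # standard deviation of X
sigma_y = np.sqrt(np.mean((y - y_bar) ** 2))      # standard deviation of Y

r = cov_xy / (sigma_x * sigma_y)
print(r)  # close to 1: strong positive linear link
```

The common 1/N factors cancel in the ratio, so this matches the usual Pearson coefficient.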
A variable is an element of Rn . For instance, the j th variable corresponds to the column of elements
ak,j . Hence, in this case a variable is an element of R15 .
Considering randomly chosen values for the components of the matrix A, it is possible to obtain a
graph (called a "scree plot") showing the eigenvalues of A as a function of the number of principal
components. Here is an example of a scree plot obtained using the Python script available in
Appendix 3:
This graph allows the user to determine the optimal number of principal components k (thus the
optimal number of principal vectors) needed to downsize the input-dataset dimension while obtaining
a new set of values whose variables are linearly uncorrelated. In order to get k, it is necessary to
read the scree plot carefully. On the previous output graph, the first part of the curve is a decreasing
line of slope α = 4. This line stops before becoming a slightly rounded curve. The area connecting
both parts of the curve is called the "elbow". Finally, the optimal k is the value where the elbow
is the most concentrated. In other words, the number of principal components that maximizes the
rate of non-correlation is the intersection between the line and the rounded part of the curve.
Therefore, according to the above results, we can conclude that in this case the best k is k = 3.
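A minimal sketch of the scree-plot computation (the actual script is in Appendix 3; the synthetic data, matrix sizes, and the choice of 3 underlying factors below are illustrative assumptions):

```python
import numpy as np

# Synthetic data with 3 true underlying factors: the sorted eigenvalues of
# the covariance matrix should show a sharp "elbow" after the 3rd one.
rng = np.random.default_rng(1)
latent = rng.normal(size=(100, 3))          # 3 hidden factors
mixing = rng.normal(size=(3, 10))           # mixed into 10 observed variables
A = latent @ mixing + 0.05 * rng.normal(size=(100, 10))  # plus small noise

cov = np.cov(A, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(cov))[::-1]  # decreasing order
print(eigenvalues)  # three large values, then a steep drop: the elbow at k = 3
```

Plotting `eigenvalues` against the component index reproduces the scree plot described above.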
• Mathematical simplicity: this factorial method only uses eigenvalue and eigenvector calcula-
tions.
• Simple results: thanks to the graphs it provides, Principal Component Analysis makes it
possible to apprehend a large part of its results quickly.
• Flexibility: The PCA is a very flexible method, since it is applied to a set of data of any size
and content, as long as it is quantitative data organized in individuals/variables.
• Power of computation: although PCA is a simple mathematical approach, it is still a really
powerful method. In only a few operations, it offers a summary and an overall view of the
relationships between the quantitative variables, results which could not have been obtained
otherwise, or only at the cost of tedious manipulations.
As a method of data analysis, PCA does not really have disadvantages per se. It is simply applied to
specific cases in order to generate a particular type of result. It would therefore make no sense to
call it a disadvantage that the method does not apply outside of this context. Similarly, since it is
a data summarization technique, the loss of information it necessarily generates is not a drawback
but a condition for obtaining the result, even if it may obscure important characteristics in particular
pre-defined cases.
To counteract these two major problems, our case study will focus on a specific method whose
results will be more accurate than those coming from a principal component analysis, as the loss
of information will be minimized.
The specific task to be accomplished is the problem we are trying to solve by modeling the phe-
nomenon. In the next section we will see that our problem deals with layers, hence let us consider
neural networks only. In particular, the notion of neural networks has a huge impact on deep learn-
ing, which exploits this concept by its very nature using unsupervised methods. An Artificial Neural
Network (ANN) is a computational model inspired by the way biological neural networks in
the human brain process information. Neural networks have generated a lot of excitement in Ma-
chine Learning research and industry, thanks to many breakthrough results in speech recognition,
computer vision and text processing.
Most machine learning practitioners are first exposed to feature extraction techniques through unsu-
pervised learning, and the same holds for Deep Learning. In unsupervised learning, an algorithm
attempts to discover the latent features that describe a data set's "structure" under certain (either
explicit or implicit) assumptions. For example, low-rank singular value decomposition (of which
principal component analysis is a specific example) factors a data matrix into three reduced-rank
matrices that minimize the squared error L(x̂, x) = kx̂ − xk2 of the reconstructed data matrix x̂.
The error is therefore zero (the best result achievable) if x̂ = x.
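A small NumPy sketch of this factorization (the synthetic rank-8 data matrix and the choice of k are illustrative): keeping the k largest singular values gives the rank-k reconstruction x̂ that minimizes the squared error L(x̂, x).

```python
import numpy as np

# Synthetic rank-8 data matrix.
rng = np.random.default_rng(2)
x = rng.normal(size=(50, 8)) @ rng.normal(size=(8, 20))

U, s, Vt = np.linalg.svd(x, full_matrices=False)

def reconstruct(k):
    # Rank-k reconstruction from the k largest singular values.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

loss_4 = np.sum((reconstruct(4) - x) ** 2)  # some information is lost
loss_8 = np.sum((reconstruct(8) - x) ** 2)  # essentially zero: x_hat = x
print(loss_4 > loss_8)  # True
```

Once all the rank is kept, x̂ = x and the error vanishes, exactly as stated above.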
Although traditional unsupervised learning techniques will always be staples of machine learning
pipelines, representation learning has emerged as an alternative approach to feature extraction with
the continued success of deep learning. In representation learning, features are extracted from
unlabeled data by training a neural network on a secondary, supervised learning task. This layer-wise
representation learning model dates back to the 1950s. As previously described, there are numerous
ways of learning (supervised, unsupervised, regression, reinforcement, . . . ), but all of them
require large training sets due to the number of parameters to optimize. In order to follow these
steps, a special type of neural network called an "autoencoder" must be used, whose goal is to
interpret training sets of data in order to predict corrupted data in a real data set.
These techniques have enabled significant progress in the fields of sound and image processing,
including facial recognition, speech recognition, computer vision, automated language processing,
text classification (for example spam recognition). Potential applications are very numerous. A
spectacular example is the AlphaGo program, which learned to play the game of Go by the deep
learning method and defeated the world champion in 2016. As a matter of fact, we can add that
vision recognition is extremely challenging: for a small 256×256 resolution and 256 possible pixel
values, a total of 2^524288 images are possible. This number is far larger than the number of stars
in the visible universe (10^24). In the meantime, there are at least 10^10 possible Go games. However,
in order to generate results coming from thousands of learnt data, algorithms must use autoencoders.
An autoencoder is an artificial neural network typically used for the purpose of dimensionality
reduction. It is a neural network that has three layers: an input layer, a hidden (encoding) layer,
and a decoding layer. What makes an autoencoder special is that the output neurons are directly
connected to the input neurons and the goal is to get the output values to match the input values.
Therefore, the network is trained to reconstruct its inputs, which forces the hidden layer to try to
learn good representations of the inputs.
When we input an image, it is a vector in an n-dimensional space which is sent to the hidden layer
after some activation function is applied to it, in order to reduce it to an m-dimensional space. This
process, happening in every neural network, is called dimensionality reduction. Let us consider the
input data x ∈ Rm . We can describe the layers as follows:
• The input data is mapped onto the hidden layer (layer L2 ). This means that the autoencoder
tries to learn a function such that the encoded data y is:

y(x) = σ(W^T x)

where W is the weight matrix such that W^T = W^{-1}, and σ is the nonlinearity. If there is a bias,
this expression is modified in such a way that

y(x) = σ(W^T x + b)

with tied weights θ' = θ^T.

• The hidden layer is mapped onto the output layer (layer L3 ). The mapping is an affine trans-
formation optionally followed by a nonlinearity, and the estimated output x̂ is

x̂ = σ(W^{-1} y + b^{-1})
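A toy forward pass through these two mappings might look as follows (a NumPy sketch, not the project's code; W is taken square and orthogonal so that W^T = W^{-1} holds exactly, as assumed above, and the bias is set to zero):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Orthogonal weight matrix, so W.T equals the inverse of W.
rng = np.random.default_rng(3)
n = 8
W, _ = np.linalg.qr(rng.normal(size=(n, n)))
b = np.zeros(n)

x = rng.normal(size=n)                     # input data
y = sigmoid(W.T @ x + b)                   # hidden (encoding) layer
x_hat = sigmoid(np.linalg.inv(W) @ y + b)  # output (decoding) layer
print(x_hat.shape)  # (8,)
```

Training an actual autoencoder then amounts to adjusting W and b so that x̂ matches x as closely as possible.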
One use case of autoencoders is data compression, similar to the creation of zip files for some data
set that can be unzipped. Obviously, there exist some data losses that do not allow users to recover
the exact same quality of data in output as it was in input. In order to quantify the amount of data
lost in the process we must compute the reconstruction error. This error can be measured in many
ways, depending on the appropriate distributional assumptions on the input given the code. The
traditional mean squared error (mse) L(x̂, x) = kx̂ − xk22 , can be used. If the input is interpreted as
either bit vectors or vectors of bit probabilities, cross-entropy of the reconstruction can be used:
LH (x̂, x) = − sum_{k=1}^{m} [ x_k log x̂_k + (1 − x_k) log(1 − x̂_k) ]
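Both reconstruction errors can be sketched directly (an illustrative NumPy snippet; the vectors x, good and bad are made-up examples of a bit vector and two reconstructions of bit probabilities):

```python
import numpy as np

def squared_error(x_hat, x):
    # Traditional mean-squared-style error L(x_hat, x) = ||x_hat - x||^2
    return np.sum((x_hat - x) ** 2)

def cross_entropy(x_hat, x, eps=1e-12):
    # Cross-entropy of the reconstruction for bit-probability inputs
    x_hat = np.clip(x_hat, eps, 1 - eps)  # avoid log(0)
    return -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

x = np.array([1.0, 0.0, 1.0, 1.0])       # input bit vector
good = np.array([0.9, 0.1, 0.8, 0.9])    # close reconstruction: small losses
bad = np.array([0.5, 0.5, 0.5, 0.5])     # uninformative reconstruction
print(cross_entropy(good, x) < cross_entropy(bad, x))  # True
```

Either loss ranks the good reconstruction above the uninformative one; the choice between them depends on the distributional assumptions on the input.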
If the hidden layer is linear, the autoencoder behaves like Principal Component Analysis. When
the outputs of each layer are wired to the inputs of the successive layer, the autoencoder is called a
“stacked autoencoder”. This kind of tool is a special class of autoencoders used to train a deep net-
work. Once the stacked autoencoder is trained on some data set, it is possible to use those weights to
initialize the deep network, instead of randomly initialized weights. One of the more recent appli-
cations of autoencoders is generating novel yet similar outputs to inputs. Despite the large number
of existing autoencoder types, scientists use a newer type of autoencoder called a “variational au-
toencoder” (VAE), which learns a distribution around the data, so it can generate similar but different
outputs. This type of neural network is much more flexible and customizable in its generation
behavior, which leads us to the conclusion that a VAE is suitable for art generation of any kind: it
might be able to generate three-dimensional models of game characters, paintings, drawings, and
so forth.
There are many techniques used to prevent autoencoders from merely copying the input image,
such as denoising, where the input is partially corrupted on purpose. Those autoencoders are called
“denoising autoencoders”.
A denoising autoencoder is a basic autoencoder which takes randomly and partially corrupted
inputs, which it has to recover or denoise, in order to address the identity-function risk. This
technique has been introduced with a specific notion of a good representation: a good representation
is one that can be obtained robustly from a corrupted input and that will be useful for recovering the
corresponding clean input. The idea is that if the network can rebuild a data set despite its being
corrupted, it will be a more robust decoder.
An approximate scheme is the following:
• The input x gets corrupted by a function q(x) such that q(x) = x_corr .
• The corrupted input is then encoded and decoded:

y = f(x_corr , θ) = σ(W x_corr + b)
x̂ = g(y, θ^{-1}) = σ(W^{-1} y + b^{-1})
• The error computation is exactly the same as the previous model (use of the original x).
We remind the reader that when there are more nodes in the hidden layer than there are inputs, the
network risks learning the so-called “Identity Function”, also called the “Null Function”, meaning
that the output equals the input, rendering the autoencoder useless. Denoising autoencoders solve
this problem by corrupting the data on purpose, randomly turning some of the input values to zero
using some noise. The amount of noise to apply to the input takes the form of a percentage.
Typically 30 percent, or 0.3, is fine, but if we have very little data, we may want to consider adding more.
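The corruption function q(x) described above can be sketched as masking noise (an illustrative NumPy snippet; the vector size and seed are arbitrary):

```python
import numpy as np

# Corrupt the input by randomly zeroing ~30 percent of its values.
rng = np.random.default_rng(4)
x = rng.normal(size=1000)

noise_level = 0.3
mask = rng.random(x.shape) >= noise_level  # keep roughly 70% of the values
x_corr = x * mask                          # q(x) = x_corr

print(1 - mask.mean())  # fraction of zeroed entries, close to 0.3
```

During training, x_corr is fed to the encoder while the loss is still computed against the original, uncorrupted x.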
When calculating the loss function L(x̂, x), it is important to compare the output values with the
original input, not with the corrupted input. In this way, the risk of learning the identity
function instead of extracting features is eliminated. Therefore, denoising autoencoders are an
important tool for feature selection and extraction.
MNIST example
The MNIST dataset comprises 60,000 training examples and 10,000 test examples of the handwritten
digits 0–9, formatted as 28x28-pixel monochrome images.
To start working with MNIST let us include some necessary imports:
The code uses built-in capabilities of TensorFlow to download the dataset locally and load it into
a Python variable. As a result (if not specified otherwise), the data will be downloaded into the
MNIST_data/ folder. We also define some of the values that will be used further in the code:
image_size = 28
input_layer_size = 10
learning_rate = 0.05
steps_number = 1000  # Number of epochs
batch_size = 100
Our task is to build a classifying neural network with TensorFlow. First, we need to set up the archi-
tecture, train the network (using the training set) and then evaluate the result on the test set.
To feed the network with the training data, we need to flatten the digit images. Depending on the
phase (training or testing), different examples will be pushed through the classifier. The training
process will be based on the labels, comparing them to the current predictions. This is why
we need to define these two placeholders:
# Define placeholders
training_data = tf.placeholder(tf.float32, [None, image_size * image_size])
input_layer = tf.placeholder(tf.float32, [None, input_layer_size])

# Variables to be tuned
W = tf.Variable(tf.random_normal([image_size * image_size, input_layer_size],
                                 stddev=0.1))
b = tf.Variable(tf.constant(0.1, shape=[input_layer_size]))

# Build the network (only output layer)
output = tf.matmul(training_data, W) + b
No matter the neural network, the training process works by optimizing (either maximizing or minimiz-
ing) the loss function. In our case, we would like to minimize the difference between the network
predictions and the actual values of the nodes (input_layer). In deep learning, a technique
called cross-entropy is often used to define the loss; however, we will only consider the squared error
L(x̂, x). TensorFlow provides the function tf.nn.sigmoid, which allows us to use the sigmoid func-
tion as the nonlinearity of our layer. Besides, the tf.reduce_mean function takes the average over
these sums. This way we get a function that can be further optimised. In our example, we use the
Adam descent method from the tf.train API, though many other descent algorithms exist:
The Adam optimizer will work in several steps, adjusting the values of the W and b variables. In
particular, we would also like to have a way of evaluating the performance, so that we know whether
the parameters are optimal. First, we want to check which nodes were predicted correctly by using
the tf.argmax function. Then, the TensorFlow function tf.equal returns a list of booleans, so by
casting the values to float and then calculating the average we finally get the accuracy of the model:
# Accuracy calculation
correct_prediction = tf.equal(tf.argmax(output, 1), tf.argmax(input_layer, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Run the training
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
As mentioned before, the optimizer works in steps. In our case, we run the train_step inside the loop,
feeding it with the batch data: images and the corresponding labels. The placeholders are
filled in using the feed_dict parameter of the run function:
for i in range(steps_number):
    # Get the next batch
    input_batch, input_layer_batch = mnist.train.next_batch(batch_size)
    feed_dict = {training_data: input_batch, input_layer: input_layer_batch}

    # Run the training step
    train_step.run(feed_dict=feed_dict)
We can make use of the accuracy defined previously to monitor the performance on the batches
during the training process. By adding the following code inside the training loop, we will print out
the value every 50 steps; this range of steps can be modified by the user. Thus we obtain:

# Print the accuracy progress on the batch every 50 steps
if i % 50 == 0:
    train_accuracy = accuracy.eval(feed_dict=feed_dict)
    print("Step %d, training batch accuracy %g %%" % (i, train_accuracy * 100))
After the training is finished, the goal is to check the network performance on data it has not
previously seen: the test set. We can reuse accuracy and feed it with the test data instead of
the training batch. Therefore, we can write the following lines of code:

# Evaluate on the test set
test_accuracy = accuracy.eval(feed_dict={training_data: mnist.test.images,
                                         input_layer: mnist.test.labels})
print("Test accuracy: %g %%" % (test_accuracy * 100))
We remind the reader that, in simple terms, the eval() method evaluates a tensor within the current
session and returns its value.
for i in range(10):
    plt.subplot(1, 10, i + 1)
    plt.imshow(mnist.test.images[i].reshape(28, 28))
    plt.axis('off')
plt.show()
Now that the program is complete, we will gather the results in a table.
In order to compare useful results, some parameters will vary. More precisely, the batch size and the
learning rate will be the two main parameters whose values change, so that conclusions can be
drawn. Finally, we will consider a bias equal to 0.1 and then 0.5 to check for potential changes.
We remind the reader that the learning rate α is a hyper-parameter that controls how much we adjust
the weights of our network with respect to the loss gradient. The lower the value, the slower we travel
along the downward slope. While a low learning rate might be a good idea in terms of
making sure that we do not miss any local minimum, it could also mean that we will take a long
time to converge, especially if we get stuck on a plateau region.
Here is the formula showing the relationship:

W_new = W_1 − α ∂J(W_1)/∂W_1
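A small numeric sketch of this update rule (illustrative, on the toy objective J(W) = (W − 3)^2, whose gradient is 2(W − 3); the step counts and rates are arbitrary):

```python
def descend(alpha, steps=100):
    # Repeatedly apply W_new = W - alpha * dJ/dW for J(W) = (W - 3)**2.
    W = 0.0
    for _ in range(steps):
        W = W - alpha * 2.0 * (W - 3.0)
    return W

print(descend(0.1))          # converges close to the minimum at W = 3
print(abs(descend(1.5)) > 1e6)  # True: too large a rate makes it diverge
```

A small α converges (slowly) to the minimum, while a too-large α makes each step overshoot and the iteration diverges, exactly the trade-off described above.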
Thus, if the learning rate is too large, gradient descent may fail to converge, or even diverge.
Typically, learning rates must be configured by the user. At best, the user leverages past
experience (training and test sets) to gain intuition about the best value to use when setting the
learning rate. Furthermore, the learning rate affects how quickly our model can converge to a local
minimum, so getting it right from the start means less time spent training the model.
Number of layers Hidden (encoder) Batch size Epochs Learning rate Accuracy (%)
1 784 10 100 0.1 74.15
1 784 100 1,000 0.1 88.80
1 784 10 100 0.05 77.63
1 784 100 1,000 0.05 88.79
1 784 10 100 0.01 82.67
1 784 100 1,000 0.01 89.96
1 784 10 100 0.001 80.52
1 784 100 1,000 0.001 91.84
1 784 10 100 0.00001 7.46
1 784 100 1,000 0.00001 36.43
2 784 + 32 10 100 0.1 46.99
2 784 + 32 100 1,000 0.1 13.99
2 784 + 32 10 100 0.05 57.15
2 784 + 32 100 1,000 0.05 15.00
2 784 + 32 10 100 0.01 4.74
2 784 + 32 100 1,000 0.01 12.00
2 784 + 32 10 100 0.001 47.08
2 784 + 32 100 1,000 0.001 10.99
2 784 + 32 10 100 0.00001 0.6
2 784 + 32 100 1,000 0.00001 7.99
- The number of layers has been arbitrarily chosen for an example observation. This number may be
increased by the user so that the algorithm proceeds with 3 layers or more.
- The number of hidden nodes corresponds to the product of the number of rows and the
number of columns of an image. Indeed, as an MNIST image is a picture of 28 × 28 pixels, the first
hidden layer has 784 weights.
- The batch size has been chosen in such a way that the epochs-to-batch-size ratio is equal to 10.
However, this value could be multiplied by 10 or 100 for better results.
Though it is often hard to get the optimal α, the diagram below demonstrates the different scenarios
one can fall into when configuring the learning rate.
According to this scheme, gathering the different cases of learning-rate convergence, we can showcase
several scenarios which would end in a bad reconstruction. As an example, with low learning
rates the loss improves slowly; then training accelerates until the learning rate becomes too large
and the loss goes up. In the end, the training process diverges. Therefore, choosing a learning rate
greater than 10^-4 is an appropriate choice.
The results below show the reconstruction evolution for the different chosen learning rates.
The obtained results are quite outstanding: for a constant number of steps equal to 4,000,
the reconstruction accuracy evolves in an exponential way as the learning rate becomes 10 times
smaller. As a result, using a learning rate of 0.00001, the denoising autoencoder is able to distinguish
the numbers 0, 1, 2, 3, 6, 7, 8 and 9. Meanwhile, it is possible to see that the loss difference between
α = 0.001, α = 0.0001 and α = 0.00001 is really small. Indeed, the graph below shows the amount
of loss over time, with a batch size of 100, for the different learning rates:
Thus we conclude that the higher the number of epochs, the faster the loss of the best-optimized
learning rate converges to zero. In particular, the lowest tolerated learning rate α here being
0.001, the squared error for such a learning rate will be minimized when the number of steps for each
layer is maximized. Theoretically, 2,000 epochs would be sufficient to achieve good convergence.
S = L + N + C = 0 + 0 + C = C
where L contains the spectral line information, N is the noise, and C the continuum.
These spectra are also an incredible source of information for those who know how to decipher them.
In particular, they can give an indication of the speed of the studied body. Everything people know
about galaxies and the whole universe comes from the light we perceive; the only exception is the
direct exploration of some objects of the solar system such as the Moon, comets and asteroids.
The spectra reconstruction presented in the next section will aim to reconstruct the absorption
lines, but also the emission lines of any spectra. Suppose an excitation of a pure gas, for example
hydrogen. Bright lines will appear on the spectrum, without a continuous background, because only
the hydrogen element is present. Those lines are called emission lines. Suppose a galaxy such as
Andromeda, made of billions of stars emitting a continuous spectrum across all wavelengths. If a gas,
for example hydrogen, is interposed between the source and the observer, this hydrogen will "absorb"
the light at the wavelengths corresponding to this gas. In the spectrum, this light will disappear,
and we will obtain an absorption spectrum with dark lines. These lines are called absorption lines.
This process is the same for any galaxy and for other celestial objects emitting light.
The general appearance of the spectrum of a galaxy, and especially of a star, makes it possible to
place it in one of the spectral classes, whose members share common characteristics. Indeed, stars
are classified according to their surface temperature. The main classes, from the warmest to the
coldest, are O, B, A, F, G, K and M (with decimal subdivisions such as G0, G2 and G5); the Sun is
a G2 star.
Finally, we note that determining the class allows scientists to determine the absolute luminosity
(which, combined with a measurement of the magnitude, gives the distance), the mass, the radius
(and thus the average density), and the lifetime of the studied galaxy or star.
This instrumentation, this technique and the software that accompanies them allow amateur
astronomers to access many scientific activities formerly reserved for professional astronomers only.
Given the quality of the equipment accessible to amateurs, this fosters numerous and active
collaborations between amateurs and professionals.
v = H × d
where v is the galaxy’s radial outward velocity, d is the galaxy’s distance from Earth, and H is the
constant of proportionality called the Hubble constant such that H = 71 ± 4 km/s/Mpc. This means
that a galaxy 1 megaparsec away will be moving away from us at a speed of 71 km/sec, while another
galaxy 100 megaparsecs away will be receding at 100 times this speed. So essentially, the Hubble
constant reflects the rate at which the universe is expanding.
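As a quick numerical illustration of this proportionality, using the value of H quoted above:

```python
# Hubble's law v = H * d, with H = 71 km/s/Mpc as quoted in the text.
H = 71.0  # km/s per megaparsec

def recession_velocity(d_mpc):
    """Radial outward velocity (km/s) of a galaxy at distance d_mpc megaparsecs."""
    return H * d_mpc

v1 = recession_velocity(1)      # a galaxy 1 Mpc away: 71 km/s
v100 = recession_velocity(100)  # a galaxy 100 Mpc away: 100 times faster
```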
Red shift is an increase in wavelength within the galaxy spectra which means a decrease in wave
frequency, and thus an increasing distance between Earth and the targeted galaxy. Indeed, for a
frequency ν the wavelength λ is defined by the following expression:
λ = c / ν
with c the speed of light.
Hence, we can say that a red shift is also the result of a decrease of energy, and thus a decrease in
the energy carried by each photon, as

E = hν

with h the Planck constant.
Finally, to determine an object's distance, we only need to know its velocity. Velocity is measurable
thanks to the Doppler shift. However, it should be noted that, on very large scales, Einstein's theory
predicts departures from a strictly linear Hubble law. The amount and type of departure depend on
the value of the total mass of the universe. In this way a plot of velocity (or red shift) as a function
of distance, which is a straight line at small distances, can tell us about the total amount of matter
in the universe and may provide crucial information about the mysterious dark matter.
Christian Doppler, Austrian physicist (1803-1853) and Hippolyte Fizeau, French physicist (1818-
1896), independently discovered the variation of the frequency of a perceived sound when a sound
source moves with respect to an observer. When applied to light, this Doppler-Fizeau effect gener-
ates an offset of the frequencies emitted by a moving source relative to an observer.
By taking the spectrum of a distant object, such as a galaxy, astronomers can see a shift in the
lines of its spectrum and from this shift determine its velocity. Putting this velocity into the Hubble
equation, they determine the distance. Note that this method of determining distances is based on
observation (the shift in the spectrum) and on a theory (Hubble’s Law).
Indeed, let us consider a point light source O, emitting a monochromatic light towards an observer
A. The source and the observer are located at a distance d.
Hence, at time t1 a first light wave is sent by O towards A, and A receives this signal at time t′1
such that

t′1 = t1 + d/c
At t2 = t1 + ∆t, this source is moving away from the observer with a distance equal to v∆t with a
considered velocity v.
As a result, A receives the signal from O′ at:

t′2 = t2 + v∆t/c + d/c
We can now determine the delay between the reception of the two signals by the observer. This
delay is given by:

t′2 − t′1 = t2 + v∆t/c + d/c − t1 − d/c
Hence

t′2 − t′1 = t2 − t1 + v∆t/c = ∆t + v∆t/c

Factoring out ∆t = t2 − t1, we obtain

t′2 − t′1 = ∆t (1 + v/c)
Interpreting ∆t as the period T of the emitted wave and ∆t′ = t′2 − t′1 as the received period T′,
we have

T = λ/c = ∆t

and

T′ = λ′/c = ∆t′
Therefore:

λ′ = λ (1 + v/c)
Considering the offset of the emitted wavelengths at reception, ∆λ = λ′ − λ, we thus obtain:

(λ′ − λ)/λ = ∆λ/λ = v/c

and the red shift z is defined as

z = ∆λ/λ
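As a worked example of this relation (the Hα rest wavelength of 656.3 nm and the observed value are illustrative numbers, not data from this project):

```python
def redshift(lam_observed, lam_rest):
    """z = (lambda' - lambda) / lambda, the non-relativistic Doppler shift."""
    return (lam_observed - lam_rest) / lam_rest

c = 3.0e5  # speed of light in km/s (approximate)

# Illustrative values: the H-alpha line, at rest at 656.3 nm,
# observed at 662.9 nm in a galaxy spectrum.
z = redshift(662.9, 656.3)  # about 0.01
v = z * c                   # radial velocity in km/s, valid only for v << c
```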
If the theory is not correct, the distances determined in this way are all nonsense. Most astronomers
believe that Hubble’s Law does, however, hold true for a large range of distances in the universe.
We can observe that if the speed of the galaxy is positive - meaning that it moves away from us -
∆λ is positive (as there is a red shift). The shift is made towards the greatest wavelengths, towards
the red for visible light.
Meanwhile, a galaxy approaching us (meaning that v is negative) will see its light shifted towards
the blue for visible light. This is the case, for example, of the Andromeda galaxy, which is close to
ours. Finally, we should not forget that the measured velocity is not the actual velocity of the galaxy:
red shift measurements estimate only the radial velocity, the component of the true velocity along
the observer's line of sight.
When the velocity measured by the Doppler-Fizeau effect is no longer negligible compared with the
speed of light, Newtonian classical mechanics is no longer appropriate to describe the phenomenon
properly. We must introduce relativity.
To understand this process, note that the classical relation above allows speeds greater than that of
light and is therefore unsuitable. The relation to be applied is then the following one:

λ′/λ = (1 + v/c) / √(1 − v²/c²)
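A short numerical comparison of the classical and relativistic shift factors (the velocities are chosen purely for illustration) makes the difference concrete:

```python
import math

def shift_classical(v_over_c):
    """lambda'/lambda in the Newtonian limit: 1 + v/c."""
    return 1 + v_over_c

def shift_relativistic(v_over_c):
    """lambda'/lambda = (1 + v/c) / sqrt(1 - v^2/c^2)."""
    return (1 + v_over_c) / math.sqrt(1 - v_over_c**2)

# At v = 0.01c the two expressions agree to about 0.005%;
# at v = 0.9c they differ strongly and only the relativistic one is valid.
low = (shift_classical(0.01), shift_relativistic(0.01))
high = (shift_classical(0.9), shift_relativistic(0.9))
```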
Finally, physicists confirmed several years ago that a "gravitational red shift" also exists. Relativity
predicts a dilation of time near large concentrations of matter. This affects the wavelengths of the
light passing nearby, shifting them towards the red.
The final objective is to implement the previous mathematical expressions so that the algorithm
estimates the red shift z for each observable galaxy spectrum within the data given in the test set.
This would allow the estimation of the red shift for any input galaxy spectrum of the same range
and, in the end, make researchers' work easier.
In the matrix representation the input, hidden and output nodes are represented by three vectors i
(input nodes vector), h (hidden nodes vector) and o (output nodes vector) respectively. The weights
connecting each layer are represented by a matrix. W (weight matrix) connects the input layer with
the hidden layer, and W T (weight matrix from the hidden layer to the output) connects the hidden
layer with the output layer. See figure below:
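A minimal NumPy sketch of this matrix form (the sizes and the tanh nonlinearity are illustrative; the decoder reuses the tied weights W^T):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 6, 3  # illustrative layer sizes

i = rng.random(n_in)               # input nodes vector
W = rng.random((n_in, n_hidden))   # weight matrix, input layer -> hidden layer

h = np.tanh(i @ W)    # hidden nodes vector
o = np.tanh(h @ W.T)  # output nodes vector, using the tied weights W^T
```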
In this thesis, we decided to use the Python language to implement our algorithms. Python has two
important advantages for this project: it is executable without compiling and has a very light syntax
(which allows programming and debugging very quickly), and its libraries can be used with ease. In
this case, we will be using the TensorFlow library, which provides powerful mathematical
computation.
We recall that TensorFlow is an open-source machine learning library created by Google, whose
goal is to let users streamline the development and execution of advanced analytical applications.
This library has better, natively integrated computational-graph visualizations than libraries like
Torch and Theano, and offers the advantages of seamless performance, quick updates and frequent
releases with new features.
1. Compute the three layers: input layer, hidden layer, and decoding layer.
2. Compute the amount of noise: compute the corruption level, in percentage, chosen by the
user.
W_init_max = 4 * np.sqrt(6. / (nb_visible + nb_hidden))
W_init = tf.random_uniform(shape=[nb_visible, nb_hidden],
                           minval=-W_init_max,
                           maxval=W_init_max)
This can also be done with a second approach, by using the uniform distribution. In probability
theory, a law of probability is uniform on an interval [a; b] if its probability density is the function
f defined on that interval by:

f(x) = 1/(b − a)

This can also be written as:

f(x) = 1/(b − a) for all x ∈ [a; b], and f(x) = 0 for all x ∉ [a; b]

with the normalization condition

∫_a^b f(x) dx = 1
Hence, writing f(x) = k constant on [a; b]:

∫_a^b k dx = 1

k × [x]_a^b = k(b − a) = 1

Finally

k = 1/(b − a) = f(x)

in such a way that:

W0 = ± 1 / (number of visible units)
This may be implemented within the autoencoder program using the following lines of code (in
Python) with the help of the Tensorflow library:
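Since the listing itself is not reproduced in this copy, here is a NumPy sketch of the same initialization (the names nb_visible and nb_hidden and the layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
nb_visible, nb_hidden = 784, 100  # illustrative layer sizes

# Bound derived above: W0 = +/- 1 / (number of visible units)
W0 = 1.0 / nb_visible
W = rng.uniform(-W0, W0, size=(nb_visible, nb_hidden))
```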
The squared error is a measure of the correctness of the network. It is calculated for each input
pattern pt . Therefore, the better the reconstruction, the lower the squared error.
Every activation function (or nonlinearity) takes a single number and performs a certain fixed
mathematical operation on it. Below are some activation functions often used:

S(t) = 1 / (1 + e^(−t))

Let us consider the logistic function h such that:

h(x) = L / (1 + e^(−k(x − x0)))
Hence, we can conclude that S is an increasing function of t, and that h is likewise increasing, with
its steepest slope at x = x0 . Therefore, the sigmoid function is the special case of the logistic
function where:
• L = 1 and x0 = 0;
• k = 1, where k is the parameter controlling how steep the change from the minimum to the
maximum value is.
dS/dt = e^(−t) / (1 + e^(−t))² = S(t) (1 − S(t))
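The identity above is easy to check numerically, comparing a central finite difference with the closed form:

```python
import math

def S(t):
    """Sigmoid function S(t) = 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

# Numerical check of dS/dt = S(t) * (1 - S(t)) at a few points.
for t in (-2.0, 0.0, 1.5):
    h = 1e-6
    numeric = (S(t + h) - S(t - h)) / (2 * h)  # central difference
    analytic = S(t) * (1 - S(t))
    assert abs(numeric - analytic) < 1e-8
```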
Although sigmoid functions have a higher run time than ReLU and hyperbolic tangent functions,
they are simple to use and easy to train. Moreover, sigmoid functions are continuous, bounded and
real-valued (see the previous results), and sums of sigmoids are dense in L∞ on compact sets.
Finally, the hyperbolic tangent function will be chosen as the nonlinearity to be used for our layers.
Using the following Matlab commands

t = linspace(-6,6,100);
x = 1./(1+exp(-t));
y = tanh(t);
plot(t,x,'r');
hold on;
plot(t,y,'b');
title('Sigmoid and hyperbolic tangent functions');
xlabel('t');
ylabel('y');
legend('Sigmoid','Tanh')

we can draw the sigmoid and the hyperbolic tangent functions, obtaining the graph below:
Figure 18 – Curve of the sigmoid and hyperbolic tangent functions using Matlab.
where tf.matmul(Estimated_X, W) computes the matrix product of Estimated_X (the matrix
gathering the corrupted data) and W.
tanh x = sinh x / cosh x = (e^(2x) − 1) / (e^(2x) + 1)

so that

tanh x = 2 · S(2x) − 1
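This identity can also be verified numerically at a handful of points:

```python
import math

def S(t):
    """Sigmoid function S(t) = 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

# Check tanh(x) = 2*S(2x) - 1 at several points.
for x in (-3.0, -0.5, 0.0, 0.7, 2.0):
    assert abs(math.tanh(x) - (2 * S(2 * x) - 1)) < 1e-12
```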
Hence, we conclude that the hyperbolic tangent function is a shifted or rescaled version of the
sigmoid function. The sigmoid activation function has the potential problem that it saturates at
zero and one while tanh saturates at plus and minus one. Therefore, if the activity in our neural
network during training is close to zero then the gradient for the sigmoid activation function may
go to zero. This is called "the vanishing gradient problem".
Conclusions
Resulting from the above conclusions, we now know that we must use at least 2 input, hidden and
decoding layers to decode data from a given test set, with the highest number of epochs possible.
The best ways to improve the neural network's accuracy are data optimization, algorithm tuning
and hyper-parameter optimization. We can also conclude that the neural network should not overfit
if it is to perform well. One of the main parameters acting on overfitting is the choice of the
activation function. Activation functions map inputs non-linearly to outputs; hence they are highly
important, and choosing the right activation function helps our model learn better.
Consider the signal-to-noise ratio SNR and the flux Φ such that SN R ∝ Φ. Plotting the recon-
structed spectra we obtain:
Figure 19 – True data-Reconstructed data comparison for a S/N ratio of 10 and 20.
The above results show the reconstruction of the same spectrum using different values:
- On the top left corner, the number of hidden values nhidden was 20 for a SNR of 10.
- On the top right corner, nhidden = 100 with a SNR of 10.
- On the middle-left, nhidden = 1, 000 with a SNR of 10.
- On the middle-right, nhidden = 100 with a SNR of 20.
- On the bottom left corner, nhidden = 20 with a SNR of 20.
- On the bottom right corner, nhidden = 1, 000 with a SNR of 20.
However, as observed, there is still a distance (i.e. a non-zero squared error) between the true signal
and the reconstructed signal. This is due to the model's accuracy remaining constant.
In addition, here are tables gathering several results for an unchanged SNR equal to 38:
Figure 20 – Reconstructed input after 1,000 epochs for a signal-to-noise ratio of 38.
We can clearly observe that too high an SNR gives completely incorrect curves. This implies that
in the future we will revert to an SNR equal to 25.
Finally, we could observe that the optimal learning rate is α = 0.0001, thanks to the following curve
obtained for a small number of epochs:
Indeed, the curve clearly shows that below this learning rate the cost increases again.
In the end, we have gathered enough data to conclude that the optimal number of epochs is equal
to 400, the learning rate shall not exceed 10−4 , and the SNR must be approximately equal to 25.
Despite the optimization of all those parameters, the neural network still has difficulty reaching an
excellent reconstruction. The lack of improvement in accuracy may come from overfitting. As
explained, overfitting can be a serious problem in deep learning. Dropout is a technique developed
to solve this exact problem; the idea is to randomly drop units in a deep neural network. Moreover,
it is one of the biggest advancements in deep learning to come out in the last few years.
Learning the relationship between the inputs and the outputs of a dataset is a very complicated
procedure. When using a very small dataset, the relationship may be a result of noise in the input
sample. As the training dataset used here contains thousands of examples, this assumption no
longer holds.
Dropout refers to randomly and temporarily removing a unit, in either a hidden or a visible layer,
together with all of its incoming and outgoing connections. Using TensorFlow, we can initialize
dropout as below:
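The TensorFlow snippet is not reproduced in this copy; in TensorFlow 1.x it amounts to a single call such as tf.nn.dropout(layer, 0.90). The mechanism itself can be sketched in NumPy (the keep probability and shapes are illustrative), using the usual inverted-dropout scaling:

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Randomly zero units; scale survivors by 1/keep_prob so the expected value is unchanged."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((4, 5))
dropped = dropout(a, 0.9, rng)  # roughly 10% of entries become 0
```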
Using all the results from above, we can understand that the batch size has a real impact on the
learning accuracy. To demonstrate this, let us plot the cost function for two different batch sizes:
Figure 23 – Cost function for a batch size = 20 and SNR = 25 (left), and for a batch size = 40 and
SNR = 23 (right).
Those two curves were obtained for a number of epochs equal to 400. For a doubled batch size, the
cost value is twice as high, which suggests that a good batch size would be equal to 10. However,
the graph below also shows that the signal-to-noise ratio is decisive for reaching the lowest losses
possible. Indeed, for the same number of epochs but with an SNR of 30 we get:
This confirms that the best SNR value is 25, and thus the best batch size is 20.
As a result, the reconstructed signals look like the following.
To conclude, the results are almost perfect, making the output nearly identical to the input. A small
optimization problem is still present; changing the activation function to ReLU or Leaky ReLU may
solve it. See the results below.
Figure 27 – Accurate reconstruction with ReLU activation function of a galaxy spectrum (2).
Figure 29 – Accurate reconstruction with ReLU activation function of a galaxy spectrum (4).
w(t) = u(t) ⊗ v(t) = ∫_{−∞}^{+∞} u*(τ − t) v(τ) dτ

Unlike convolution, the integration variable τ has the same sign in the arguments of u(·) and v(·),
so the arguments have a constant difference instead of a constant sum.
Cross-correlation is used to find where two signals match: here u(t) is the test waveform, which
would be the initial input spectrum, and v(t) is the signal that contains u(t) with added noise in
our model. Therefore, we would implement this inside the code as follows:
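The listing itself is not reproduced in this copy; a NumPy sketch of the idea (the signal lengths, noise level and offset are illustrative) is:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative stand-ins: x plays the role of the input spectrum (the test
# waveform u), y the signal containing it with added noise (v).
x = np.sin(np.linspace(0, 4 * np.pi, 50))
y = rng.normal(0.0, 0.1, 300)
offset = 120
y[offset:offset + x.size] += x  # bury the template at a known shift

corr = np.correlate(y, x, mode='valid')  # cross-correlation at each lag
best = int(np.argmax(corr))              # lag where the two signals match best
```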
where x represents the array containing the variables and observations as the input, and y the
array containing the variables and observations as the output once reconstruction is done. When a
galaxy spectrum is cross-correlated with the given templates, its found emission lines are masked
out, implying that the red shift is derived from the absorption features.
It must be remembered that the reconstruction process was too slow to experiment with higher
values of the main parameters such as the batch size: a deeper evaluation could measure the time
spent on each iteration and thus find the best time-to-cost trade-off for slow machines. Only with a
powerful machine can the DAE's current reconstruction time be fully appreciated.
At this point the denoising autoencoder does not yet work at full power on the reconstruction of
spectra. Users should be aware of the potential enhancement that can be achieved with the current
algorithm.
# -*- coding: utf-8 -*-
"""
Created on Sat Nov 24 16:14:05 2018

@author: Romain Fonteyne

This program runs MNIST images with 1 layer
"""

###############
### IMPORTS ###
###############

import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.examples.tutorials.mnist import input_data

# Read data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

#################################
### PARAMETERS INITIALIZATION ###
#################################

image_size = 28
input_layer_size = 10
learning_rate = [0.1, 0.05, 0.01, 0.001, 0.00001]
steps_number = 1000
batch_size = 100
stddev = 0.0
mean = 0.0

accuracy_list = []
loss_list = []

####################
### MAIN PROGRAM ###
####################

for i in range(len(learning_rate)):
    # Define placeholders
    # ...

    # Variables to be tuned
    W = tf.Variable(tf.random_normal([image_size * image_size, input_layer_size],
                                     stddev=0.1))
    b = tf.Variable(tf.constant(0.0, shape=[input_layer_size]))

    # Build the network (only output layer)
    output = tf.matmul(training_data, W) + b

    # ...

    # Training step
    train_step = tf.train.AdamOptimizer(learning_rate[i]).minimize(loss)

    # Accuracy calculation
    correct_prediction = tf.equal(tf.argmax(output, 1), tf.argmax(input_layer, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    # Run the training
    sess = tf.InteractiveSession()
    sess.run(tf.global_variables_initializer())

    for i in range(steps_number):
        # Get the next batch
        epoch_loss = 0
        input_batch, input_layer_batch = mnist.train.next_batch(batch_size)
        feed_dict = {training_data: input_batch, input_layer: input_layer_batch}

        _, l = sess.run([train_step, loss], feed_dict)

        epoch_loss += l

        # Run the training step
        train_step.run(feed_dict=feed_dict)

    # ...

    # Evaluate on the test set
    test_accuracy = accuracy.eval(feed_dict={training_data: mnist.test.images,
                                             input_layer: mnist.test.labels})
    print("Test accuracy: %g %%" % (test_accuracy * 100))

    accuracy_list.append(test_accuracy)
    # loss_list.append(loss_float)

#######################
### OUTPUT DISPLAYS ###
#######################

for i in range(10):
    plt.subplot(1, 10, i + 1)
    plt.imshow(mnist.test.images[i].reshape(28, 28))
    plt.axis('off')
plt.show()
MNIST_1_layer.py
# -*- coding: utf-8 -*-
"""
Created on Tue Nov 27 22:00:27 2018

@author: Romain Fonteyne

This program runs MNIST images with 2 layers
"""

###############
### IMPORTS ###
###############

# Import of Tensorflow
import tensorflow as tf

# Import of the MNIST data
from tensorflow.examples.tutorials.mnist import input_data

# Import of math libraries
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.legend_handler import HandlerLine2D
from numpy import loadtxt
import numpy as np
from pylab import rcParams

#################################
### PARAMETERS INITIALIZATION ###
#################################

# ...

# Size of the print wanted
rcParams['figure.figsize'] = 20, 20

# Initialization of lists for future accuracy and loss values
accuracy_list = [[], [], [], [], []]
# ...

# Learning rate initialization
learn_rate = [0.1, 0.05, 0.01, 0.001, 0.00001]  # How fast the model should learn

# Looping through the first 10 test images and display
"""for i in range(10):
    plt.subplot(1, 10, i+1)
    plt.imshow(X_test[i].reshape(28, 28), cmap='Greys')
    plt.axis('off')
plt.show()"""

# ...

# Initialization of the nodes by layer
n_nodes_inpl = 784  # Encoder
n_nodes_hl1 = 32    # Encoder
n_nodes_hl2 = 32    # Decoder
n_nodes_outl = 784  # Decoder

# Definition of the batch size and number of epochs
batch_size = 100   # How many images to use together for training
hm_epochs = 1000   # How many times to go through the entire dataset
tot_images = X_train.shape[1]  # Total number of images

####################
### MAIN PROGRAM ###
####################

# ...
hidden_1_layer_vals = {
    'weights': tf.Variable(tf.random_normal([n_nodes_inpl, n_nodes_hl1])),
    'biases': tf.constant(0.0, shape=[n_nodes_hl1])}

# Second hidden layer has 32*32 weights and 0 biases
"""hidden_2_layer_vals = {
    'weights': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])),
    'biases': tf.Variable(tf.random_normal([n_nodes_hl2]))}"""

hidden_2_layer_vals = {
    'weights': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])),
    'biases': tf.constant(0.0, shape=[n_nodes_hl2])}

# ...
output_layer_vals = {
    'weights': tf.Variable(tf.random_normal([n_nodes_hl2, n_nodes_outl])),
    'biases': tf.constant(0.0, shape=[n_nodes_outl])}

# Creation of the placeholder (image with shape 784 goes in)
input_layer = tf.placeholder('float', [None, 784])

# ...

# Cost function definition
meansq = tf.reduce_mean(tf.square(output_layer - output_true))

# Optimizer definition
optimizer = tf.train.AdagradOptimizer(learn_rate[j]).minimize(meansq)

# ...

# Accuracy calculation
correct_prediction = tf.equal(tf.argmax(output_layer, 1), tf.argmax(input_layer, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# ...
if i % 100 == 0:
    train_accuracy = accuracy.eval(feed_dict={input_layer: epoch_x,
                                              output_true: epoch_x})

# ...
# Add of the current accuracy to the list
accuracy_list[j].append(test_accuracy * 100)

#######################
### OUTPUT DISPLAYS ###
#######################

# ...
first_legend = plt.legend(handles=[a0], loc=1, frameon=False)
plt.legend(handler_map={a0: HandlerLine2D(numpoints=4)})

# ...
ax_legend = plt.legend(handles=[l0], loc=1, frameon=False)
plt.legend(handler_map={l0: HandlerLine2D(numpoints=4)})
MNIST_2_layers.py
%% Authors: Romain Fonteyne
%% Date: 04/12/2018

% This script calculates the principal components and vectors
% using the Principal Component Analysis

rng 'default'

M = 15; % Number of observations
N = 8;  % Number of variables observed

% Made-up data
X = rand(M, N);

% ...
% Do the PCA
[coeff, score, latent, ~, explained] = pca(X);

% Calculate eigenvalues and eigenvectors of the covariance matrix
covarianceMatrix = cov(X);
[V, D] = eig(covarianceMatrix);

% ...
% Multiply the original data by the principal component vectors to get the
% projections of the original data on the principal component vector space.
% This is also the output "score". Compare ...
dataInPrincipalComponentSpace = X * coeff
score
% The columns of X*coeff are orthogonal to each other. This is shown with ...
corrcoef(dataInPrincipalComponentSpace)

% The variances of these vectors are the eigenvalues of the covariance matrix,
% and are also the output "latent". Compare these three outputs

var(dataInPrincipalComponentSpace)'

% ...
sort(diag(D), 'descend')
PCA.m
@author: Romain Fonteyne
"""

import numpy as np, matplotlib
import matplotlib.pyplot as plt

# ...
U, S, V = np.linalg.svd(A)
eigvals = S**2 / np.cumsum(S)[-1]  # Calculate eigenvalues

######################
### OUTPUT RESULTS ###
######################

# ...
leg = plt.legend(['Eigenvalues from SVD'], loc='best', borderpad=0.3,
                 shadow=False,
                 prop=matplotlib.font_manager.FontProperties(size='small'),
                 markerscale=0.4)
leg.get_frame().set_alpha(0.4)
leg.draggable(state=True)

plt.show()
Scree_Plot–PCA.py
1 # −∗− c o d i n g : u t f −8 −∗−
2 """
3 C r e a t e d on Thu J a n 03 1 5 : 1 3 : 5 4 2 0 1 8
4
10
11 # ################
12 # ## IMPORTS # # #
13 # ################
14
17 # Import of Tensorflow
18 import tensorflow as t f
19 # I m p o r t o f numpy
20 i m p o r t numpy a s np
21 import os
22 # Import of m a t p l o t p l i b
23 import m a t p l o t l i b . pyplot as p l t
24 import m a t p l o t l i b . p a t c h e s as mpatches
25 from m a t p l o t l i b . l e g e n d _ h a n d l e r i m p o r t H a n d l e r L i n e 2 D
26
27 # ##################################
28 # ## PARAMETERS INITIALIZATION # # #
29 # ##################################
30
31 # I n i t i a l i z a t i o n o f l i s t f o r f u t u r e a c c u r a c y and l o s s v a l u e s
32 accuracy_list = [[] ,[] ,[] ,[] ,[]]
33 cost_list = [[] ,[] ,[] ,[] ,[]]
34
35
36 # ######################
37 # ## MAIN FUNCTION # # #
38 # ######################
39
43 Parameters
44 −−−−−−−−−−
48 Returns
49 −−−−−−−
50 x : Tensor
51 I n p u t p l a c e h o l d e r t o t h e network
52 z : Tensor
53 I n n e r −most l a t e n t r e p r e s e n t a t i o n
54 y : Tensor
55 Output r e c o n s t r u c t i o n o f t h e i n p u t
56 c o s t : Tensor
57 O v e r a l l c o s t to use f o r t r a i n i n g
58 """
x = tf.placeholder(tf.float32, [None, dimensions[0]], name='x')
# x = tf.reshape(x, [1000, 4258])
corrupt_prob = tf.placeholder(tf.float32, [None, dimensions[0]])
# Corruption process: additive Gaussian noise scaled by corrupt_prob
current_input = (corrupt_prob * tf.random_normal(shape=tf.shape(x), mean=0.0,
                 stddev=1, dtype=tf.float32, seed=None, name=None)) + x
# corrupt_prob = tf.reshape(corrupt_prob, [1000, 4258])
# current_input = tf.reshape(current_input, [1000, 4258])
print(corrupt_prob)

encoder = []
for layer_i, n_output in enumerate(dimensions[1:]):
    n_input = int(current_input.get_shape()[1])
    W = tf.Variable(tf.random_uniform([n_input, n_output],
                                      -1.0 / np.sqrt(n_input),
                                      1.0 / np.sqrt(n_input)), name="W")
    encoder.append(W)
    output = tf.nn.relu(tf.matmul(current_input, W))
    current_input = output

z = current_input
encoder.reverse()

# Cost function calculation
cost = tf.sqrt(tf.reduce_mean(tf.square(y - x)))
# Dropout function
drop_out = tf.nn.dropout(y, 0.90)  # 90% probability that each element is kept


# Load of data
training = (np.load(
    'lsst_cosmossnap_spectra_restframe_training_set_10000_bsl_t7_8642.npy')).T
wavelength_training = np.load(
    'lsst_cosmossnap_spectra_restframe_training_set_wavelength.npy')
# print(training)

# Flux calculation
flux = np.median(training[:, 3000:3377], axis=1)


num_samples = 2000  # 1700
n_input = 4258
n_hid = 1200  # 800

ae = denoising_autoencoder(dimensions=[n_input, n_hid])


for i in range(0, num_samples):
    corruption[i, :] = sigma[i] * np.ones(training.shape[1])


optimizer = tf.train.AdamOptimizer(learning_rate).minimize(ae['cost'])


print('Decoding...\n')


# if (epoch_i % 100 == 0):
print("Epoch", epoch_i, "/", n_epochs, "Cost:",
      sess.run(ae['cost'],
               feed_dict={ae['x']: batch_xs, ae['corrupt_prob']: corrupt_xs}))
# print("W:\n", sess.run(ae['weight'],
#       feed_dict={ae['x']: batch_xs, ae['corrupt_prob']: corrupt_xs}))
np.save('Weight.npy',
        sess.run(ae['weight'],
                 feed_dict={ae['x']: batch_xs, ae['corrupt_prob']: corrupt_xs}))
# Append the current loss to the list
cost_list[0].append(
    sess.run(ae['cost'],
             feed_dict={ae['x']: batch_xs, ae['corrupt_prob']: corrupt_xs}))


# Plot example reconstructions
n_examples = num_samples
corrupt_test = 0 * np.random.randn(n_examples, training.shape[1])


recon_dropout = sess.run(ae['dropout'],
                         feed_dict={ae['x']: training[0:n_examples, :],
                                    ae['corrupt_prob']: corrupt_test})

##########################
### OUTPUT DISPLAYS ###
##########################


for p in range(0, 4):
    plt.figure()
    plt.plot(wavelength_training, training[p, :])
    plt.plot(wavelength_training, recon[p, :])
    plt.show()
DAE.py
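The two key operations of the listing above, the corruption process and the RMSE reconstruction cost, can be illustrated without TensorFlow. The following is a minimal NumPy sketch with synthetic data, for illustration only and not part of the thesis code: `corrupt` mirrors the additive Gaussian noise scaled by `corrupt_prob`, and `rmse_cost` mirrors the `cost` tensor `tf.sqrt(tf.reduce_mean(tf.square(y - x)))`.

```python
import numpy as np

def corrupt(x, corrupt_prob, rng):
    """Add zero-mean, unit-variance Gaussian noise scaled per element,
    as in the corruption step of the denoising autoencoder."""
    return x + corrupt_prob * rng.normal(size=x.shape)

def rmse_cost(x_hat, x):
    """Root-mean-square reconstruction error between the reconstruction
    x_hat and the clean input x."""
    return np.sqrt(np.mean((x_hat - x) ** 2))

# Synthetic "spectra" for illustration: 4 samples, 10 wavelength bins
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 10))
x_corr = corrupt(x, corrupt_prob=0.1, rng=rng)
```

Training the DAE then amounts to minimizing `rmse_cost` between the network's reconstruction of `x_corr` and the clean input `x`, which is exactly what the AdamOptimizer step in the listing does.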