
Applied Mathematical Modelling 70 (2019) 365–377

Using recurrent neural networks with attention for detecting problematic slab shapes in steel rolling
Niclas Ståhl∗, Gunnar Mathiason, Göran Falkman, Alexander Karlsson
School of Informatics, University of Skövde, Box 408, Skövde 541 28, Sweden

Article history: Received 29 June 2018; Revised 27 November 2018; Accepted 17 January 2019; Available online 23 January 2019

Keywords: Attention mechanism; Recurrent neural networks; Interpretable AI; Steel rolling

Abstract

The competitiveness in the manufacturing industry raises demands for using recent data analysis algorithms for manufacturing process development. Data-driven analysis enables extraction of novel knowledge from already existing sensors and data, which is necessary for advanced manufacturing process refinement involving aged machinery. Improved data analysis enables factories to stay competitive against newer factories, but without any hefty investment. In large manufacturing operations, the dependencies between data are highly complex and therefore very difficult to analyse manually. This paper applies a deep learning approach, using a recurrent neural network with long short term memory cells together with an attention mechanism, to model the dependencies between the measured product shape, as measured before the most critical manufacturing operation, and the final product quality. Our approach predicts the ratio of flawed products already before the critical operation with an AUC-ROC score of 0.85, i.e., we can detect more than 80% of all flawed products while having less than 25% false positive predictions (false alarms). In contrast to previous deep learning approaches, our method shows how the recurrent neural network reasons about the input shape, using the attention mechanism to point out which parts of the product shape have the highest influence on the predictions. Such information is crucial for both process developers, in order to understand and improve the process, and for process operators, who can use the information to learn how to better trust the predictions and control the process.

© 2019 Elsevier Inc. All rights reserved.

1. Introduction

1.1. Background and motivation

The manufacturing industry is a highly competitive sector. The investment costs for machinery and equipment are very
large and, due to these high costs, a production line is fixed to a certain machinery and configuration for a very long time.
This poses a challenge for older factories, since they have to stay competitive with newer factories built using more recent technologies. In such a setting, it is of great importance to keep refining and improving the existing production lines. To this
end, knowledge about the production processes is crucial. Such knowledge can be extracted from collected data with the
help of machine learning (ML), which allows for automatic knowledge detection and the discovery of patterns in collected
data.


Corresponding author.
E-mail address: niclas.stahl@his.se (N. Ståhl).

https://doi.org/10.1016/j.apm.2019.01.027

While there has been a lot of theoretical and conceptual work on how ML methods can be included in production
processes (see for instance Lee et al. [1] and Wang et al. [2]), there has so far been very little focus on how to implement
and utilize such systems in practice. One reason is that the implementation of these methods is non-trivial and it is often
difficult to select the adequate methods to use and correctly evaluate and understand the results of these methods. This,
together with a general lack of ML expertise and knowledge within the manufacturing industry, constitutes a great obstacle
for adopting these techniques in the continuous development of this sector. Thus, as a sought-after demonstration of how ML methods can be used to improve a manufacturing process, this paper shows how the shape of slabs, i.e., steel blocks, is analysed before the slabs are rolled into thin sheets and subsequently reeled into coils, with the objective to detect, as early as possible, deformations in the slabs that would increase the risk of the sheets being obliquely reeled, causing the resulting coil to be telescoped (i.e., skewed). When telescoping occurs, a costly procedure to straighten out the steel sheet is required. It is therefore of great importance to the manufacturer to know when the risk of telescoping is high, so that suitable countermeasures can be taken in order to minimize the total number of telescoped coils. It is also of great value to know the causes of telescoping, in order to decide how to update the manufacturing process and where future investments are motivated.
The data used for this study consists of time series of measurements over slabs. Such data is generally considered to
be difficult to analyse, mainly due to non-stationary and non-linear dependencies over time [3]. Due to this fact, currently,
when process developers study the process, only basic aggregated summary statistics are used without any further analysis
of the sequential data. Another challenge with the data is that multiple kinds of products, with different sizes, are manufac-
tured and, thus, the number of measurements differ between different slabs. One family of methods that has been shown
to be efficient for dealing with such complex data is deep learning (DL) [4]. In this paper, we use one such method, namely
a recurrent neural network (RNN) [5] with long short term memory cells (LSTM-cells) [6] to analyse the shape of the slabs.
One drawback of using such a method is that it is difficult to interpret and understand the reasoning behind the predictions made and, thus, it is infeasible to gain a deeper understanding of either the inner workings of the method or the
production process itself. To overcome this drawback, an attention mechanism [7] is added to the network. This allows for
the highlighting of areas of the shape of the slab from which the network draws its final conclusion, and, hence, allows
the operator to better understand and therefore take a more suitable action in order to correct the highlighted flaws and
possibly minimize the risk of telescoping.
When applying the RNN with LSTM cells, combined with an attention mechanism, in order to predict the ratio of flawed
products before the critical operation, we achieved an area under the receiver operating characteristic curve (AUC-ROC)
score of 0.85. As an example, this allows for the detection of more than 80% of all flawed steel coils while having less than
25% false positive predictions (false alarms). In addition to resulting in a high AUC-ROC score, the presented method also
shows how the recurrent neural network reasons, in contrast to most other deep learning methods. Using the attention mechanism, the reasoning behind the prediction can be visualized in order to extract knowledge about the process and to further understand which shapes may cause telescoping. Using this visualization, it can be further investigated what types of slab shape patterns may increase the risk of telescoping. Such information is useful for both process developers, for improving the overall process, and for process operators, who can use the information to better understand the process and thereby learn how to better control it.

1.2. Related work

Many authors, such as Lee et al. [1] and Wang et al. [2], have identified and described the ongoing transformation towards factories with embedded intelligent software and decision making based on gathered process data. While such general, but abstract, methods and ideas can be applied to most manufacturing processes, there is a lack of work that studies the practical implementation of the methods in question, especially as applied to the manufacturing industry. One example
where the latter aspects are considered is the work by Wuest et al. [8], in which both supervised and unsupervised learning
are used to monitor and predict different possible states of the production process. A similar approach was presented by
Laha et al. [9], who tried to model the complex process of steel making, using several ML methods, such as random forests
and artificial neural networks. Both Wuest et al. [8] and Laha et al. [9] came to the conclusion that ML methods can be used
to improve the steel making process. However, these methods are not utilized due to lack of relevant data and knowledge
about ML. The suggestion is therefore that more work are needed to highlight the advantage of using ML in the steel making
process at the same time as suitable and easy to use softwares are developed and introduced to the industry. Beside ML,
there are also several authors that apply novel mathematical models in order to model the steel making process, one exam-
ple is Bambach et al. [10] who is using fluid dynamics to model the behavior of the slab during rolling. There are also some
works that combine methods from both fields, such as Santos et al. [11] who use AI methods to find the best parameters to
a mathematical model of the spray cooling in a continuous casting process.
Lately, there has been some development towards using DL and, more specifically, convolutional neural networks (CNNs)
for various computer vision tasks in the steel production process. One such task, which has been studied by several authors,
is the detection of impurities in the steel visible to a computer through a camera [12–15]. Another application of CNNs in
steel rolling factories is presented by Lee and Kim [16], which shows that a CNN can be used to visually identify, localize
and track a given slab along the production line. Most of these works aim to develop systems that can be deployed in the
line of production. The drawback of such systems is that only defects and flaws that can be visually detected from an image

Fig. 1. A schematic of the steps in the steel rolling process, showing where in the process the data are collected. The final label that is used for classification is given after the manual inspection of the finished product.

can be identified. Hence, a defect can be missed if it is covered by cinder or grime from the furnace. In contrast, Azimi et al.
[15] focuses on the analysis of finished steel sheets, after the production is completed and after the steel has been polished,
and, hence, the resolution and the quality of the captured images are much higher.
To the best of our knowledge, no previous author has presented any work showing how recurrent neural networks with an attention mechanism can be used in the manufacturing industry. However, such networks are commonly used for machine translation and the generation of image captions [7,17,18]. Even though such problems may seem completely different from the analysis of steel sheets, they have several common characteristics. For example, all these problems have data that consist of series of different lengths. There are also global patterns that emerge from local interactions between the data points in the series. Recurrent neural networks with an attention mechanism are further described in Section 2.4.

1.3. The steel rolling process

The steel rolling process considered in this paper contains four major steps. These steps are shown in Fig. 1 and are as
follows:

1) The first step in the process is the furnace, in which the slab is heated.
2) After the furnace, the slab is transported to the roughing mill where the first rolling is done, reducing the thickness of
the slab to a level where it can be processed in the subsequent step, the steckel mill. There is an operator controlling
the distance between the rolls in the roughing mill, but this adjustment can only be conducted between charges.
3) After the slab has entered the steckel mill it is rolled, back and forth through a pair of rolls, several times until it
reaches its final thickness. This is the part of the process line that is mostly controlled by human operators. Here, an
operator has the opportunity to try to correct observed flaws through the adjustment of two rulers that are positioned
just before the rolls.
4) Finally, after the steckel mill, the final steel sheet is reeled into a coil in the downcoiler. After the coil has cooled
down it is manually inspected by an operator, who determines if it is telescoped or has any other flaws.

1.4. Challenges in the steel rolling process

Steel rolling is a complex and complicated process and there are many challenges that can be addressed in order to improve it. However, in this section we only consider challenges that are believed to be connected to the problem of telescoping. Here, the main challenge is to prevent deformations of the material from occurring, since it is believed that the root cause of telescoping is irregularities in the material. Such irregularities can arise in the roughing mill when the roughing
mill is not optimally configured. In this case, there is a risk that the slab is unevenly rolled, causing some parts of the slab
to contain more mass. However, the roughing mill is hard to configure, especially when multiple slabs of different length,
width and steel quality are processed in sequence. Irregularities can also form in the roughing mill if the slab is unevenly
heated in the furnace, making it stiffer in colder areas. Since the metal sheet is stretched during rolling, small irregularities can be amplified in the steckel mill if they are not corrected. Therefore, it is of utmost importance to detect these flaws and correct them as early as possible.

2. Preliminaries

This section describes the methods used in the conducted experiments to achieve the goal of predicting if a slab will
be telescoped or not and to further understand the reasons for what may cause the telescoping. In the first part of this
section, a basic description of artificial neural networks is given. Secondly, an overview of a traditional recurrent neural
network is presented. In the last part of the section, the long short term memory cell is presented together with the attention mechanism, the latter allowing for the interpretation of what a recurrent neural network focuses on when it makes its predictions.

Fig. 2. A multi layered artificial neural network, having two hidden layers with 5 neurons each and a final output layer.

2.1. Artificial neural networks

An artificial neural network (ANN) is a commonly used general machine learning model that is inspired by how the human
brain is believed to function [19]. Neural network models are also the main building block of most deep learning models,
which have been successfully applied to many domains, such as image recognition and natural language processing [4]. The
reason ANN models are so effective is that they are able to learn internal representations of the raw data and then combine
these representations into more abstract internal representations [4]. This approach allows ANNs to solve a problem by first
splitting it into several, easier, sub-problems, solving these and then combining the sub-results into a final, complete result.
The main building block of ANNs is the artificial neuron, which is defined as a unit that performs the mathematical operation:

$$f_{(w,b)}(x) = f\left(\sum_{i=0}^{n} x_i w_i + b\right), \qquad (1)$$

where f is an almost everywhere differentiable and non-linear function, typically the sigmoid function or the rectified linear function. The non-linear function operates on a weighted sum of n input signals, $x_i$, where the weights, $w_i$, and the bias, b,
are parameters that are learned during the training of the model. Neurons are often organised in a layered fashion, where
all neurons in each layer have the same input, and where the output of each layer is propagated as input to the next layer of
neurons. Such neural networks are often called feed forward neural networks. A schematic representation of a multi-layered
feed forward network is shown in Fig. 2. When organising neurons in a layered fashion, Eq. (1) can be rewritten as:
$$f_{(w,b)}(x) = f(xw + b), \qquad (2)$$

where x is a vector containing all input signals, w is a matrix consisting of the trainable weights of all neurons in that layer and b is a vector of all biases. A feed forward neural network could, hence, be defined as an iterative application of Eq. (2) on the initial input signal. Thus, an n-layered feed forward network is defined as:

$$F(x) = f_{(w^{(n)}, b^{(n)})} \circ f_{(w^{(n-1)}, b^{(n-1)})} \circ \cdots \circ f_{(w^{(1)}, b^{(1)})}(x), \qquad (3)$$
where ◦ is function composition. Since all $f_{(w^{(i)}, b^{(i)})}$ are almost everywhere differentiable, it is possible to differentiate F(x) in almost all cases. Hence, given a dataset of pairs $(x_i, y_i)$, it is possible to use gradient based methods, such as backpropagation [20], to minimize the difference between $F(x_i)$ and the desired output $y_i$. This difference is often denoted as the loss. Common ways of specifying the loss are to use the mean squared error (MSE), when the desired output is continuous, or the cross entropy error, if the output is binary. Since this paper concerns a case where the output is binary, the cross entropy error, as given in Eq. (4), is used in our implementation:

$$\mathrm{loss}(y, \hat{y}) = -\sum_{i} \big( y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \big). \qquad (4)$$

Here $y_i$ is the desired (binary) output and $\hat{y}_i$ is the predicted output, given by $F(x_i)$.
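To make Eqs. (1)–(4) concrete, the following is a minimal NumPy sketch (not part of the original implementation) of a feed forward network and the binary cross entropy loss; the layer sizes follow Fig. 2, and the random weights and toy targets are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense_layer(x, w, b, f=sigmoid):
    # Eq. (2): a non-linearity applied to a weighted sum of the inputs plus a bias.
    return f(x.dot(w) + b)

def feed_forward(x, params):
    # Eq. (3): iterative application of Eq. (2), one (w, b) pair per layer.
    for w, b in params:
        x = dense_layer(x, w, b)
    return x

def cross_entropy(y, y_hat, eps=1e-12):
    # Eq. (4): binary cross entropy between targets y and predictions y_hat.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.sum(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# Toy network with the layer sizes of Fig. 2: 4 inputs, two hidden layers of
# 5 neurons each and one output neuron (weights are random here, not trained).
rng = np.random.RandomState(0)
params = [(rng.randn(4, 5), np.zeros(5)),
          (rng.randn(5, 5), np.zeros(5)),
          (rng.randn(5, 1), np.zeros(1))]
x = rng.randn(3, 4)                      # three example inputs
y = np.array([[0.0], [1.0], [1.0]])      # desired binary outputs
print(cross_entropy(y, feed_forward(x, params)))
```

In a real training loop, the weights would of course be updated by a gradient based method such as backpropagation, rather than left at their random initial values.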

2.2. Recurrent neural networks

A recurrent neural network (RNN) [5] is a special type of ANN where there exist cyclic connections between neurons,
unlike basic feed forward networks that are acyclic. Such cyclic connections enable the network to keep an inner state. This allows the network to act on information from previous steps in the computation sequence and, thus, to exhibit dynamic temporal behaviour. This makes it advantageous to use RNNs for the analysis of sequential data, such as time series [21].
Even though RNNs, in their standard form, have shown many promising results when it comes to the analysis of sequences, they still have difficulties capturing long range dependencies, due to vanishing gradients [22]. However, this

Fig. 3. All parts of the LSTM cell and how these are connected. The functions of all the internal gates and how the state is updated are described in Eqs. (5) to (9).

problem has partially been solved by the introduction of long short term memory (LSTM) cells [6], which are presented in
the next section. Another modification of RNNs that can facilitate the interpretation of how an RNN reasons about the input
and why it draws a given conclusion is the attention mechanism, which is presented in Section 2.4.
A second problem with RNNs, which persists even when using LSTM cells, is that these networks have difficulties remembering information over a long input sequence. Hence, when an RNN reads a long sequence, most information about what happens in the beginning is lost. One solution to this flaw is to use bidirectional networks [23]. A bidirectional network consists of two separate RNNs, in which one network analyses a sequence from the start to the end, while the other network analyses the sequence in the opposite direction. Using two separate networks traversing the sequence in different directions results in one network having the beginning fresh in memory and the other network having the ending fresh in memory. Thus, the risk that important information from the beginning of the sequence will be forgotten is greatly decreased.
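As a hedged illustration of the recurrence and of the bidirectional idea, the following NumPy sketch (not the implementation used in the paper) runs a simple tanh RNN over a sequence in both directions and concatenates the per-step states; the layer sizes are arbitrary.

```python
import numpy as np

def rnn_step(x_t, h_prev, w, u, b):
    # One step of a simple recurrent layer: the new state depends on the
    # current input and on the state of the previous step (the cyclic link).
    return np.tanh(x_t.dot(w) + h_prev.dot(u) + b)

def run_rnn(xs, w, u, b):
    h = np.zeros(u.shape[0])
    states = []
    for x_t in xs:                        # traverse the sequence step by step
        h = rnn_step(x_t, h, w, u, b)
        states.append(h)
    return np.stack(states)

def bidirectional_rnn(xs, fwd, bwd):
    # Two separate RNNs: one reads the sequence forwards, the other backwards;
    # their states are concatenated per step, so both the beginning and the
    # end of the sequence stay "fresh in memory".
    forward = run_rnn(xs, *fwd)
    backward = run_rnn(xs[::-1], *bwd)[::-1]
    return np.concatenate([forward, backward], axis=-1)

def make_params(n_in, n_hidden, rng):
    return (rng.randn(n_in, n_hidden) * 0.1,
            rng.randn(n_hidden, n_hidden) * 0.1,
            np.zeros(n_hidden))

rng = np.random.RandomState(0)
xs = rng.randn(145, 4)                    # one slab: up to 145 readings of 4 features
fwd, bwd = make_params(4, 8, rng), make_params(4, 8, rng)
print(bidirectional_rnn(xs, fwd, bwd).shape)   # (145, 16)
```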

2.3. Long short term memory cells (LSTM)

One disadvantage of the standard RNN is that when such a network is trained with backpropagation, the gradients will
either become zero or go to infinity. This makes it difficult to train these networks, and the consequence is that it is impos-
sible to learn long range dependencies [22]. To solve the problem of the vanishing gradients, Hochreiter and Schmidhuber
[6] introduced the LSTM cell. In contrast to a standard RNN, LSTM cells contain an explicit state. The value stored in this
state is then regulated by gates within the cell. These gates have specific rules defining when to store, update or forget the
value in the internal state. All these rules are described in detail below, where the operations in a layer consisting of n LSTM
cells at time step t are shown. These steps aim to calculate the final output, ht , of the LSTM cells, which have the internal
state ct given the input vector xt of size m. An overview of all parts of the LSTM cell is also presented in Fig. 3.
The first step in the LSTM layer is to calculate the output of the forget gate, which decides how much of the current state should be forgotten when calculating the new value of the inner state. This is done by:

$$f_t = \sigma(w_f x_t + u_f h_{t-1} + b_f), \qquad (5)$$

where σ is the sigmoid function, $b_f$ is the bias term, which is a vector of size n, $w_f$ is an n × m matrix of trainable weights and $u_f$ is an n × n matrix of trainable weights. The same holds for all variables b, w and u with different subscripts. The next step is then to calculate the output of the input gate, which controls how much the input and the output of the last time step should be considered when updating the internal state. This is done by:

$$i_t = \sigma(w_i x_t + u_i h_{t-1} + b_i). \qquad (6)$$

The final gate is the output gate, which controls how much of the internal state should be propagated as the output of the current layer. This is calculated by:

$$o_t = \sigma(w_o x_t + u_o h_{t-1} + b_o). \qquad (7)$$

The internal cell state is then updated by the following equation:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(w_c x_t + u_c h_{t-1} + b_c). \qquad (8)$$



Here $\odot$ denotes element-wise multiplication between the two vectors. The final output of the layer is then given by:

$$h_t = o_t \odot \tanh(c_t). \qquad (9)$$
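The following NumPy sketch illustrates one time step of an LSTM layer as given by Eqs. (5)–(9); it is only a didactic example, and the parameter shapes and toy data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    # One time step of an LSTM layer, following Eqs. (5)-(9). p holds the
    # weight matrices w_* (n x m), u_* (n x n) and bias vectors b_* (n,)
    # for the forget, input and output gates and for the cell update.
    f_t = sigmoid(p['wf'].dot(x_t) + p['uf'].dot(h_prev) + p['bf'])   # Eq. (5)
    i_t = sigmoid(p['wi'].dot(x_t) + p['ui'].dot(h_prev) + p['bi'])   # Eq. (6)
    o_t = sigmoid(p['wo'].dot(x_t) + p['uo'].dot(h_prev) + p['bo'])   # Eq. (7)
    c_t = f_t * c_prev + i_t * np.tanh(
        p['wc'].dot(x_t) + p['uc'].dot(h_prev) + p['bc'])             # Eq. (8)
    h_t = o_t * np.tanh(c_t)                                          # Eq. (9)
    return h_t, c_t

# Toy run: n = 3 LSTM cells reading a sequence with m = 4 input features per step.
rng = np.random.RandomState(0)
n, m = 3, 4
p = {}
for gate in ('f', 'i', 'o', 'c'):
    p['w' + gate] = rng.randn(n, m) * 0.1
    p['u' + gate] = rng.randn(n, n) * 0.1
    p['b' + gate] = np.zeros(n)
h, c = np.zeros(n), np.zeros(n)
for x_t in rng.randn(10, m):
    h, c = lstm_step(x_t, h, c, p)
print(h)
```

Note that the element-wise products in the code correspond to the $\odot$ operator in Eqs. (8) and (9).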

2.4. Attention mechanism

The attention mechanism for RNNs was first introduced by Bahdanau et al. [7] in order to solve the problem of machine translation and was inspired by how humans translate text from one language to another. When it was introduced, it outperformed and distinguished itself from previous approaches, which first read the input sentence into a vector, representing its context, and then used this vector to produce a translated sentence. An RNN with the attention mechanism instead annotates each word with a hidden representation. When the next word in the output sequence is to be produced, the RNN, instead of directly predicting the next word, predicts which of the hidden annotations it should “pay attention to” and then uses these hidden annotations to produce the next word in the output sequence. This approach has the appealing property that it is possible to derive which parts of the input are important in the decision making when selecting the next step in the output sequence. Using this property, Bahdanau et al. [7] showed, for example, that the RNN with the attention mechanism focused on the correct words in the input sentence when selecting the correct semantic form of the corresponding words in the produced translation.
The problem that is solved with an RNN with an attention mechanism is to produce the desired output sequence $y_0, y_1, \ldots, y_m$ given the input sequence x, which consists of $x_0, x_1, \ldots, x_n$. Such an output sequence can be found iteratively by sampling the next step in the output sequence from the conditional probability distribution that depends on the input sequence and the previously drawn sample in the output sequence. To be able to draw such a sample, the conditional probability

$$p(y_i \mid y_{i-1}, x_0, \ldots, x_n), \qquad (10)$$

must be calculated. This is, however, infeasible in most cases. Instead, Eq. (10) can be approximated by the non-linear function:

$$p(y_i \mid y_{i-1}, x_0, \ldots, x_n) \approx g(y_{i-1}, s_i, c_i), \qquad (11)$$

where g is an RNN, $s_i$ is the internal state of that RNN and $c_i$ is the current context, a vector holding information about which inputs are important at the current step. The context is derived from both the current state, $s_i$, and the input sequence x. The first step, in order to find the context, is to input x to an RNN. The RNN will then find an annotation for each step in the input sequence, denoted $h_j$. This annotation captures the essential information about the current step and its vicinity. After the RNN has stepped through the whole input sequence, the attention mechanism of the network decides how much attention should be put on the annotation provided at each step. The attention that should be given to the jth state of the RNN when predicting output $y_i$ is denoted $\alpha_{ij}$ and is calculated as:

$$\alpha_{ij} = \frac{e^{a(s_{i-1}, h_j)}}{\sum_{k=0}^{n} e^{a(s_{i-1}, h_k)}}. \qquad (12)$$

Note that the attention at time step i is normalized in order to sum up to one. Thus, $\alpha_{ij}$ describes how much attention should be put on the annotation $h_j$ at time step i in relation to all other annotations. In Eq. (12), a is a scoring function, deciding the importance of the output $h_j$. There have been several suggestions for which function to use as the scoring function. The most common approaches are to either use a bilinear scoring function, as presented by Luong et al. [17], or take the original approach presented by Bahdanau et al. [7] and use a separate ANN with trainable weights and the tanh function as the activation function in the last layer. The context, derived from the input x and the state of the RNN, $s_{i-1}$, is calculated as the weighted sum:

$$c_i = \sum_{j=0}^{n} \alpha_{ij} h_j. \qquad (13)$$

Hence, the context at step i, $c_i$, is the sum of all the annotations $h_j$ that the RNN has made at each step in the sequence, weighted by how much attention should be put on that particular annotation.
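A minimal NumPy sketch of Eqs. (12) and (13) is given below. The tanh-plus-linear scoring function is only an assumed stand-in for the scoring function a, and the state argument $s_{i-1}$ is omitted, as in the single-output setting used later in Section 3.2.

```python
import numpy as np

def attention_context(annotations, score):
    # annotations: array of shape (T, d) with one annotation h_j per input step;
    # score: callable returning the scalar importance a(h_j) of an annotation
    # (the state argument of a(.) is omitted here, as in the single-output case).
    scores = np.array([score(h_j) for h_j in annotations])
    e = np.exp(scores - scores.max())            # numerically stable softmax
    alpha = e / e.sum()                          # Eq. (12): weights sum to one
    context = (alpha[:, None] * annotations).sum(axis=0)   # Eq. (13)
    return alpha, context

# Toy example: 6 annotations of size 4, scored by a small tanh model.
rng = np.random.RandomState(0)
annotations = rng.randn(6, 4)
v = rng.randn(4)
alpha, context = attention_context(annotations, lambda h: np.tanh(h).dot(v))
print(alpha.round(3))
print(context.round(3))
```

The returned alpha vector is exactly what is visualized later in Fig. 8: a normalized weight per input step that indicates where the network puts its attention.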

3. Approach

This section describes the approach we used in order to predict if a slab will be telescoped, and to further understand
the reasons for what may cause telescoping. This section starts by describing the data collection phase and how data are
collected from the manufacturing process. The next part describes our design choices when selecting the structure of the
artificial neural networks that are used in the experiments presented in the next section.

Fig. 4. Illustration of the profile of a steel slab and of the two features that are collected from it.

3.1. Data

The data considered in this paper are collected from the steel rolling process described in Section 1.3. The data are collected just before the slab enters the steckel mill, as shown in Fig. 1, using multiple sensors. The data that are recorded and used for the presented analysis concern the width of the slab at each reading and how much the outer side deviates from its targeted position. An illustration of what is captured by the sensors is shown in Fig. 4. The data that are utilized in this study originate from six months of production. During this period, around 10,000 coils were produced and between 5 and 10 percent of them were to some extent telescoped. The same sensors were used during this period and the data that are read by the sensors are collected at a fixed sampling rate. This can be problematic for the analysis, since the speed of different slabs may differ. Slabs also have different lengths and, thus, the number of readings differs between each produced coil. The maximum number of readings, i.e., the maximum number of data points, for any slab in the dataset is 145 and the minimum number of data points is 44. In the first baseline experiments, the data are treated in the same way as they are treated in the processing systems that are currently used for analyses. In these systems it is required that the data have a fixed number of values. To this end, each sequence of measurements is first split into three different sets: one for the head, which consists of the first 25 measurements, one for the tail, which consists of the 25 last measurements, and one for the body, which is the measurements in between the head and the tail. In order to analyse these sets with conventional machine learning methods, the minimum and maximum value as well as the mean and standard deviation of each set are extracted. Thus, each slab is always described by 24 values, 12 concerning the width and 12 concerning the center line deviation, independent of the number of measurements that were conducted. This is contrasted with an analysis using an RNN, where such pre-processing is no longer needed. The only pre-processing that is conducted in this case, in order to make it easier for the RNN to detect abrupt changes in slab shapes, is to add two derived features to the original dataset, namely the absolute differences between two consecutive readings of the shape and between two consecutive readings of the outer side’s deviation from its targeted position.
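The following sketch illustrates, under our reading of this section, how one slab can be turned into the 24 baseline summary statistics and into the four-feature per-step input for the RNN; the exact feature ordering and the zero-padding of the first difference are assumptions, not details taken from the paper.

```python
import numpy as np

def summary_features(width, deviation, n_edge=25):
    # Aggregate one slab into the 24 fixed summary statistics used by the
    # baselines: min, max, mean and std of the width and of the center line
    # deviation, computed separately for the head, body and tail of the slab.
    feats = []
    for series in (width, deviation):
        head, body, tail = series[:n_edge], series[n_edge:-n_edge], series[-n_edge:]
        for part in (head, body, tail):
            feats += [part.min(), part.max(), part.mean(), part.std()]
    return np.array(feats)                # 2 signals x 3 parts x 4 statistics = 24

def rnn_features(width, deviation):
    # Per-step input for the RNN: the two raw readings plus the absolute
    # differences between consecutive readings (a zero is used for the first step).
    d_width = np.concatenate(([0.0], np.abs(np.diff(width))))
    d_dev = np.concatenate(([0.0], np.abs(np.diff(deviation))))
    return np.stack([width, deviation, d_width, d_dev], axis=1)   # shape (T, 4)

# Example slab with 120 readings (synthetic values, for illustration only).
rng = np.random.RandomState(0)
width = 1500.0 + rng.randn(120)
deviation = rng.randn(120).cumsum()
print(summary_features(width, deviation).shape)   # (24,)
print(rnn_features(width, deviation).shape)       # (120, 4)
```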

3.2. Network architecture and implementation

As described in greater detail in the next section, several experiments are conducted. The first experiments are baseline experiments that evaluate the performance of a feed forward neural network on aggregated data. In these experiments, three different architectures are evaluated, where the networks have two hidden layers. The first hidden layer has 64 neurons in all experiments, while the second layer consists of either 8, 16 or 32 neurons. In the following experiments, six different RNN architectures are evaluated, three with an attention mechanism, as described in Section 2.4, and three without the attention mechanism. The three experiments without attention are conducted in order to evaluate the impact of the attention mechanism on the predictive power of the model.
The main component of all recurrent neural networks used in the experiments is a bidirectional RNN with LSTM cells [23]. The reason why we select a bidirectional RNN is that the studied sequences consist of many readings, and the risk that important events would have been forgotten when the end is reached would be high if just a traditional RNN were used. The attention mechanism is added in order to be able to visualize which parts of the slab are important when deciding if a slab will be telescoped or not. While such attention mechanisms are mainly used in sequence to sequence learning, there is nothing that prevents them from being used for the prediction of a single value, as in our case. However, since only a single value is predicted, the output cannot depend on the previous step, and Eq. (11) will only depend on the context and the internal state of the network. Only providing a single output also has the consequence that the state of the system ($s_{i-1}$) in Eq. (12) always has the same value as the last state of the RNN that reads the input. A conceptual description of our model is shown in Fig. 5. Here it is shown how the input from the sensors first is read and analysed by a bidirectional RNN. The RNN annotates each step in the input sequence with a hidden representation. The sequence of hidden representations is then first used by the attention mechanism to show how much attention the

Fig. 5. The conceptual model used within the presented research for predicting telescoping. Here the attention is colour-coded so that darker red represents areas with high attention while light yellow represents low attention areas. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

network should put on every single step. Then, in the following step, the hidden annotations are summarized into a single vector, where each step is weighted by the amount of attention the network decided to put on that particular step. The last step in order to achieve the final classification is to give this vector to a single neuron that performs the classification.
As described in Section 3.1, the input data to the RNN models consist of the deviation from the predefined outer border and the width of the slab. The changes of these readings, in comparison to the last time step, are also added to the input, in order to make it easier for the RNN to detect large fluctuations. Hence, the input to the RNN model consists of a vector of four values per time step. Furthermore, there is a maximum of 145 time steps in each sequence. The input from the sensor readings is first propagated through a single hidden layer with 64 neurons. This allows the network to learn a new hidden representation of the input. This representation is then propagated to a bidirectional RNN with LSTM cells. Depending on the experimental setup, this layer has either 8, 16, or 32 LSTM cells. The final annotation for a given step, $h_j$, is then calculated by propagating the signal from the LSTM layer through another hidden layer consisting of 4 neurons. In some of the experiments, an attention mechanism, as described in Section 2.4, is then applied to the output of this layer, yielding the amount of attention that the network should put on each step in the sequence. The distribution of the attention is then used to visualize what the network decides to focus on and, hence, what is important in order to decide if the slab will be telescoped or not.
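A minimal sketch of such an architecture, written with the Keras functional API mentioned in Section 4, is given below. It is not the authors' exact implementation: the activation functions, the form of the attention scoring, the zero-padding of sequences to 145 steps, and all layer choices beyond the sizes stated above are assumptions.

```python
import keras.backend as K
from keras.layers import Bidirectional, Dense, Input, LSTM, Lambda, TimeDistributed
from keras.models import Model

T, F = 145, 4                 # padded sequence length and features per time step

def attention_pooling(args):
    # scores: (batch, T, 1), annotations: (batch, T, d)
    scores, annotations = args
    alpha = K.softmax(K.squeeze(scores, axis=-1))     # Eq. (12), over the time axis
    alpha = K.expand_dims(alpha)                      # (batch, T, 1)
    return K.sum(alpha * annotations, axis=1)         # Eq. (13): context vector

inputs = Input(shape=(T, F))
x = TimeDistributed(Dense(64, activation='relu'))(inputs)      # per-step hidden representation
x = Bidirectional(LSTM(32, return_sequences=True))(x)          # bidirectional LSTM layer
annotations = TimeDistributed(Dense(4, activation='tanh'))(x)  # annotation h_j for each step
scores = TimeDistributed(Dense(1))(annotations)                # scoring function a(.)
context = Lambda(attention_pooling,
                 output_shape=lambda s: (s[1][0], s[1][2]))([scores, annotations])
output = Dense(1, activation='sigmoid')(context)               # probability of telescoping

model = Model(inputs, output)
model.compile(optimizer='adam', loss='binary_crossentropy')
model.summary()
```

In this sketch, the attention weights can be read out by building a second Model that ends at the softmax, which is one straightforward way to obtain the visualizations discussed in Section 5.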

4. Experimental setup

The main purpose of the presented research is to highlight the benefits, in predictive power and interpretability, of using RNNs with an attention mechanism. To this end, three different types of experiments are conducted. The first experiment is a baseline which shows how the data are currently treated and analysed in the process. In this experiment, the sequence of measurements over a given slab is first aggregated into several summary statistics. The ability of several conventional machine learning methods and of several versions of feed forward neural networks to correctly classify slabs as telescoped or not, given the summary statistics of the slab, is then evaluated. The conventional methods that are used as a baseline are three commonly used classification methods, namely random forest, logistic regression and support vector machines (SVM) [24]. The two following experiments show how the analysis can be improved by applying RNNs that operate on the raw sequence data instead of on summary statistics. The first of these two experiments investigates the benefit of using an RNN compared to the baseline. This experiment is also used to check how the attention mechanism affects the predictive power. In the main and final experiment, which assesses the RNNs' ability to provide correct classifications, the results of the RNNs with the attention mechanism are compared to the baseline and to the RNNs using the same architecture but without this mechanism. How the performance of the networks is evaluated and validated is described below.
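As an illustration of how such baselines can be evaluated on the 24 summary statistics per slab, the following scikit-learn sketch can be used; the hyperparameters and the placeholder data are illustrative only and not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: one row of 24 summary statistics per slab; y: 1 if the coil was telescoped.
rng = np.random.RandomState(0)
X = rng.randn(1000, 24)                         # placeholder feature matrix
y = (rng.rand(1000) < 0.1).astype(int)          # roughly 10% positive class

baselines = {
    'random forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'logistic regression': make_pipeline(StandardScaler(), LogisticRegression()),
    'SVM': make_pipeline(StandardScaler(), SVC()),
}
for name, clf in baselines.items():
    scores = cross_val_score(clf, X, y, cv=10, scoring='roc_auc')
    print('%s: AUC-ROC %.3f +/- %.3f' % (name, scores.mean(), scores.std()))
```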

Table 1
Summary of performances (on the test sets) on all evaluated methods. Here it is shown that RNNs with an attention mechanism outperform conventional methods.

Method                       Area under ROC curve    Area under Precision-Recall curve
Random Forest                0.819 ± 0.014           0.459 ± 0.022
Logistic regression          0.749 ± 0.01            0.402 ± 0.031
SVM                          0.813 ± 0.01            0.429 ± 0.02
ANN 8 neurons                0.805 ± 0.009           0.426 ± 0.022
RNN 8 neurons                0.816 ± 0.021           0.484 ± 0.04
Attention RNN 8 neurons      0.85 ± 0.017            0.51 ± 0.059
ANN 16 neurons               0.803 ± 0.012           0.417 ± 0.029
RNN 16 neurons               0.832 ± 0.008           0.474 ± 0.025
Attention RNN 16 neurons     0.85 ± 0.018            0.507 ± 0.053
ANN 32 neurons               0.793 ± 0.013           0.401 ± 0.035
RNN 32 neurons               0.839 ± 0.011           0.483 ± 0.04
Attention RNN 32 neurons     0.851 ± 0.017           0.508 ± 0.05

In addition to the main experiments, a minor investigation of how the number of neurons in the feed forward network, as well as in the RNNs, affects the result is also conducted through the evaluation of multiple network architectures, as described in Section 3.2. The implementation of the networks was done using Python 2.7 and the two deep learning libraries Keras version 2.1.5 [25] and Theano version 1.0.1 [26].

4.1. Training and evaluation of the networks

To validate the results from the experiments, a 10-fold cross validation is used. Hence, the data are randomly divided into ten sets. All networks, described in the previous section, are then trained on 9 of these sets while the remaining set is kept as a test set. This process is repeated ten times until all sets have been used as a test set. The training is conducted by optimizing the values of the internal weights in order to minimize the binary cross entropy between the real classifications of the slabs and the predicted classifications. The optimization of the weights is carried out using the ADAM optimization algorithm with a learning rate of 0.001 [27] and a batch size of 64, and it is run for 100 epochs. This takes less than one hour on a regular desktop computer, equipped with an NVIDIA TITAN Xp GPU card, for the largest RNN. To prevent overfitting of the model to the training data, a separate validation set consisting of 10% of the original data is used for early stopping in the training of all neural network models.

A problematic feature of the data is that the two classes are very unbalanced: non-telescoped samples are much more common than telescoped ones. There are more than nine non-telescoped samples for every telescoped sample. Due to this, the result cannot be evaluated by merely measuring the percentage of correctly classified samples. Instead, the area under the precision-recall curve [28] and the area under the receiver operating characteristic (ROC) curve are used to evaluate how good the models are at separating the two classes from each other [29]. In order to provide a result that is not biased by any outliers, 10-fold cross validation is used and the average values are reported.
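A sketch of this training and evaluation loop is given below, assuming a build_model() function that returns a compiled network such as the one sketched in Section 3.2; the early-stopping patience and the use of average precision as an estimate of the area under the precision-recall curve are assumptions, not details from the paper.

```python
import numpy as np
from keras.callbacks import EarlyStopping
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split

def evaluate(X, y, build_model):
    # X: padded sequences of shape (n_slabs, 145, 4); y: binary telescoping labels;
    # build_model(): returns a freshly compiled network (e.g. the one in Section 3.2).
    roc_scores, pr_scores = [], []
    for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True).split(X, y):
        # Hold out 10% of the training data as a validation set for early stopping.
        X_tr, X_val, y_tr, y_val = train_test_split(
            X[train_idx], y[train_idx], test_size=0.1, stratify=y[train_idx])
        model = build_model()
        model.fit(X_tr, y_tr,
                  validation_data=(X_val, y_val),
                  batch_size=64, epochs=100,
                  callbacks=[EarlyStopping(monitor='val_loss', patience=10)],
                  verbose=0)
        y_hat = model.predict(X[test_idx]).ravel()
        roc_scores.append(roc_auc_score(y[test_idx], y_hat))
        pr_scores.append(average_precision_score(y[test_idx], y_hat))
    return np.mean(roc_scores), np.mean(pr_scores)
```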

5. Result

Using an RNN with an attention mechanism, the area under the ROC curve was 0.85 and the area under the precision-recall curve was 0.51. The area under the ROC curve was 0.03 units greater and the area under the precision-recall curve was 0.05 units greater than for the best baseline model (random forest), which achieved a mean AUC-ROC score of 0.82. Using paired t-tests, these differences were shown to be significant, with p-values less than 0.01.
Both the mean ROC curve and the mean precision-recall curve for the RNN with an attention mechanism and 32 LSTM cells are presented in Fig. 7. In this figure, it is, for example, shown that over 80% of the telescoped samples can be classified correctly, while less than 25% of the non-telescoped samples are misclassified. The mean AUC-ROC scores for all experiments and architectures are presented in Fig. 6. Here, it is shown that there is a performance gain when using an RNN instead of a feed forward neural network, which is limited to the analysis of summary statistics. This figure also shows that all architectures with an attention mechanism achieved approximately the same AUC-ROC score, while the networks without the attention mechanism improved when the number of LSTM cells increased. However, the networks without the attention mechanism could still not perform as well as those with an attention mechanism. These results are summarized in Table 1.
Using the attention mechanism, it is also possible to highlight areas that are important for the classification. In contrast to the classification itself, there is no ground truth regarding which sections of the slab the classification should depend on. However, we can observe that the same sections of the slab are important when the presented model classifies slabs as when domain experts do. Such important characteristics are, for example, jags and curves in the beginning or the end of the slab. This is illustrated in Fig. 8, where it is shown how the RNN decides to focus its attention when presented with three different

Fig. 6. The area under the ROC curve (a), and the area under the precision-recall curve (b), on the test set for the 10-fold cross validation, for all evaluated methods. The methods using recurrent neural networks outperformed the other, more conventional, methods independently of the measure used. The mean value is marked with a cross and the median is marked with a straight line in these plots.

Fig. 7. The mean ROC curve (a) and the mean precision-recall curve (b). These are achieved on the test set in the experiments using an RNN with attention mechanism and 32 LSTM cells. The red line represents the mean, while each grey line represents a given experiment. The blue dashed lines show the curve that can be achieved by randomly guessing the outcome, under the assumption that 10% of the data consist of telescoped samples. In (a) there are three dotted orange lines which show how high the false positive rate is when 50%, 85% and 95%, respectively, of all telescoped samples are correctly classified. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 8. Graphical representations of three different slabs, coloured by how much attention the presented RNN gives to a certain fraction of the slab. Darker red represents high attention while bright yellow represents very little attention. Thus, parts that are coloured in dark red are much more important for the final classification than parts that are coloured in bright yellow. The shown shapes are exaggerated in order to better understand the reasoning of the RNN. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

example slabs. The first of these three slabs, 8(a), is a straight and unproblematic slab without any flaws and, hence, there
is nothing special for the RNN to focus its attention on. The second slab, 8(b), has a small bend in the middle. For this slab,
the network decides to put most of its attention around this bend but also focuses on the beginning of the slab where the
outside slowly drifts away from the desired position. The last slab, 8(c), is greatly curved in the beginning. Such a defect would most likely cause problems in the production line, and the risk of it becoming telescoped is high. The network has indeed decided to focus its attention mainly on the initial part of that slab, which is curved.

6. Discussion

It is shown that the RNNs performed significantly better than the feed forward neural networks that are used as a baseline comparison. This is most likely due to the fact that some information about the shape of the slab is lost when aggregated summary statistics of the shape are considered instead of the raw measurements. As described in the previous section, the networks with the attention mechanism performed better than the ones without. This is not very surprising. The networks without the attention mechanism have the disadvantage of being forced to compress the whole sequence of sensor measurements into a single vector, whose size depends on the number of LSTM cells. The networks with the attention mechanism do, on the other hand, annotate each step of the sequence with a vector and then, after the whole sequence has been traversed, select which of these vectors to focus their attention on. Thus, the risk of losing important information when traversing the sequence is much smaller for such networks.
The use of the areas under the ROC curve and the precision-recall curve does not only enable a fair evaluation of the goodness of the model; such metrics also show the relation between the proportion of detected telescoped samples and the number of false positive warnings. This can further be used in order to fine tune the system, to detect as many telescoped samples as possible while still keeping the false positive warnings at an acceptable level, in order to make a process operator trust the predictions from the system. While there are no previous studies of similar processes with which we can compare the achieved AUC-ROC score, we believe that an AUC-ROC score of 0.85 is good, due to all the unknown factors in the production line after the data collection as well as the uncertainties in the provided labels.
The data that are considered in this study were collected during several months. During this time, several parts of the machinery were worn down and replaced, for example the rolls in the rougher. While such replacements may alter the distribution of slab shapes before the steckel mill, we believe that this would not affect the performance of the algorithm, since the problematic patterns would still look the same. However, changes in the configuration of the steckel mill and the wear of its rolls may be one cause of misclassification. We believe, though, that these cases would be far more rare compared to misclassifications caused by mistakes of the humans operating the steckel mill. Another part of the misclassifications of the presented networks is due to the uncertainty in the binary classification label that is provided by human operators. This uncertainty is due to the fact that different operators may classify the same coil differently. We therefore believe that the presented approach can be even better and achieve a higher AUC-ROC score if this uncertainty is included in the training of the networks. This could, for example, be achieved by multiple operators classifying every coil and making all these classifications available, instead of just a single classification label. There is an imbalance in the number of samples between the two different classes, since only 5 to 10 percent of all samples have some degree of telescoping. This is not considered in the presented approach and, thus, there is an opportunity to further refine the model and the approach. However, such improvements are not likely to have any drastic impact on the results.
One of the goals of the presented research is to develop a model that is suitable for a decision support system for operators in the production line. We believe that the RNN with an attention mechanism is especially suited for this case, since it would allow for a transparent decision support system which would allow operators to understand the reasoning of the algorithm. The implementation of the decision support system is still to be developed, to allow for a full evaluation. Still, our current work has confirmed several hypotheses and heuristics of the domain experts available in the project. For a current evaluation of the effectiveness of our proposal, and to motivate a full implementation, a survey was given to stakeholders of the production line. This included operators, process developers and managers. Basing their responses on the running demonstrator, stakeholders agreed that the approach can pinpoint several production issues. They agreed that this work is actionable, motivating 5–7 concrete proposed production changes or production investments. As a consequence, two follow-up projects have been proposed. Apart from an implementation, a deeper analysis, with high-fidelity modeling, of the production step located before the slab profile measurement was requested. Other concrete outcomes include the confirmation of a saying among operators that "if the beginning of the slab has the shape of a sad smile (negative side bend), the risk of telescoping is high". Our network did indeed put a lot of attention on this particular slab segment, such that it showed a high risk of telescoping for such occurrences. Discussions with domain experts also confirm that the RNN seems to base its classifications on the same data as experienced human operators do. Further, the attention mechanism also highlights slab shapes that have not previously been considered to correlate with telescoping. These are under further investigation to determine more precisely how they affect product quality. We believe that one of the main advantages of this system is that the RNN seems to base its decisions on the same slab segment shapes as the domain experts, and this will probably increase the acceptance and trust by operators in a more automated production system.

7. Summary and conclusions

We present a case where a bidirectional RNN with LSTM cells and an attention mechanism is used to detect steel sheets
that have a high risk of getting telescoped when coiled. This method achieved an area under the ROC curve of 0.85. For
example, this means that the model can detect 80% of all telescoped samples, while the misclassification of non-telescoped
samples is still below 25%. To the best of our knowledge, there have not been any similar approaches to solve this problem.
The performance of the RNN is significantly better than the performance of all baseline methods, which only considered
summary statistics of the sensor readings. This shows the possible gains that can be achieved by analysing raw sequential
data, instead of aggregated summary statistics, which is commonly done today. Other deep learning methods have been used to solve other problems in the steel manufacturing industry and have achieved state of the art results when it comes to predicting the quality of the product. However, it is often very hard to interpret the reasoning of these methods and it is often infeasible to extract any knowledge about the given process. In this paper, we show how an attention mechanism can be
used to overcome this problem and present a way to visualize the inner reasoning of the network. This visualization allows
the algorithm to show what type of knowledge it has learned about the process. This knowledge can in turn be vital for
understanding the process fully, in order to decide and motivate new process improvements.

Acknowledgement

This work was supported by Vinnova and Jernkontoret under the project Dataflow (project number 2017-01531). We
would like to thank Andreas Persson and Joakim Ebervik at Outokumpu Stainless AB for their valuable collaboration.

References

[1] J. Lee, H.-A. Kao, S. Yang, Service innovation and smart analytics for industry 4.0 and big data environment, Procedia CIRP 16 (2014) 3–8.
[2] S. Wang, J. Wan, D. Zhang, D. Li, C. Zhang, Towards smart factory for industry 4.0: a self-organized multi-agent system with big data based feedback
and coordination, Comput. Netw. 101 (2016) 158–168.
[3] C. Li, Y. Yang, S. Liu, A new method to mitigate data fluctuations for time series prediction, Appl. Math. Model. 65 (2019) 390–407.
[4] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444.
[5] F.J. Pineda, Generalization of back-propagation to recurrent neural networks, Phys. Rev. Lett. 59 (1987) 2229–2232, doi:10.1103/PhysRevLett.59.2229.
[6] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (1997) 1735–1780, doi:10.1162/neco.1997.9.8.1735.
[7] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, CoRR (2014). abs/1409.0473
[8] T. Wuest, C. Irgens, K.-D. Thoben, An approach to monitoring quality in manufacturing using supervised machine learning on product state data, J. Intell. Manuf. 25 (2014) 1167–1180, doi:10.1007/s10845-013-0761-y.
[9] D. Laha, Y. Ren, P.N. Suganthan, Modeling of steelmaking process with effective machine learning techniques, Expert Syst. Appl. 42 (2015) 4687–4696,
doi:10.1016/j.eswa.2015.01.030.
[10] M. Bambach, A.-S. Häck, M. Herty, Modeling steel rolling processes by fluid-like differential equations, Appl. Math. Model. 43 (2017) 155–169.
[11] C.A. Santos, J.A. Spim Jr, M.C. Ierardi, A. Garcia, The use of artificial intelligence technique for the optimisation of process parameters used in the
continuous casting of steel, Appl. Math. Model. 26 (2002) 1077–1092.
[12] J. Masci, U. Meier, D. Ciresan, J. Schmidhuber, G. Fricout, Steel defect classification with max-pooling convolutional neural networks, in: Proceedings
of the International Joint Conference on Neural Networks, IEEE, 2012, pp. 1–6, doi:10.1109/IJCNN.2012.6252468.
[13] J. Masci, U. Meier, G. Fricout, J. Schmidhuber, Multi-scale pyramidal pooling network for generic steel defect classification, in: Proceedings of the
International Joint Conference on Neural Networks, IEEE, 2013, pp. 1–8, doi:10.1109/IJCNN.2013.6706920.
[14] D. Soukup, R. Huber-Mörk, Convolutional neural networks for steel surface defect detection from photometric stereo images, in: G. Bebis, R. Boyle, B. Parvin, D. Koracin, R. McMahan, J. Jerald, H. Zhang, S.M. Drucker, C. Kambhamettu, M. El Choubassi, Z. Deng, M. Carlson (Eds.), Proceedings of the International Symposium Vision Computer, LNCS, 8887, Springer International Publishing, Cham, 2014, pp. 668–677, doi:10.1007/978-3-319-14249-4_64.
[15] S.M. Azimi, D. Britz, M. Engstler, M. Fritz, F. Mücklich, Advanced steel microstructural classification by deep learning methods, Sci. Rep. 8 (2018) 1–14, doi:10.1038/s41598-018-20037-5.
[16] S.J. Lee, S.W. Kim, Localization of the slab information in factory scenes using deep convolutional neural networks, Expert Syst. Appl. 77 (2017) 34–43,
doi:10.1016/j.eswa.2017.01.026.
[17] T. Luong, H. Pham, C.D. Manning, Effective approaches to attention-based neural machine translation, in: Proceedings of the Conference Empirical Methods Natural Language Processes, Association for Computational Linguistics, 2015, pp. 1412–1421, doi:10.18653/v1/D15-1166. URL: http://aclweb.org/anthology/D/D15/D15-1166.pdf.
[18] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual
attention, in: F. Bach, D. Blei (Eds.), Proceedings of the International Conference on Machine Learning, Proc. Mach. Learn. Res., 37, PMLR, Lille, France,
2015, pp. 2048–2057. URL: http://proceedings.mlr.press/v37/xuc15.html.
[19] S.B. Kotsiantis, I.D. Zaharakis, P.E. Pintelas, Machine learning: a review of classification and combining techniques, Artif. Intell. Rev. 26 (2006) 159–190, doi:10.1007/s10462-007-9052-3.
[20] D.E. Rumelhart, R. Durbin, R. Golden, Y. Chauvin, Backpropagation: The basic theory, in: Backpropagation: Theory, Architectures and Applications, 1995,
pp. 1–34.
[21] J.T. Connor, R.D. Martin, L.E. Atlas, Recurrent neural networks and robust time series prediction, IEEE Trans. Neural Netw. 5 (1994) 240–254, doi:10.1109/72.279188.
[22] R. Pascanu, T. Mikolov, Y. Bengio, On the difficulty of training recurrent neural networks, in: S. Dasgupta, D. McAllester (Eds.), Proceedings of the International Conference on Machine Learning, Proc. Mach. Learn. Res., 28, PMLR, Atlanta, Georgia, USA, 2013, pp. 1310–1318. URL: http://proceedings.mlr.press/v28/pascanu13.html.
[23] M. Schuster, K.K. Paliwal, Bidirectional recurrent neural networks, IEEE Trans. Signal Process. 45 (1997) 2673–2681, doi:10.1109/78.650093.
[24] N.M. Nasrabadi, Pattern recognition and machine learning, J. Electr. Imaging 16 (2007) 049901.
[25] F. Chollet, et al., Keras, 2015, (https://keras.io).
[26] Theano Development Team, Theano: a Python framework for fast computation of mathematical expressions, arXiv e-prints, arXiv:1605.02688 (2016).
[27] D.P. Kingma, J. Ba, Adam: a method for stochastic optimization, CoRR (2014). abs/1412.6980.
[28] J. Davis, M. Goadrich, The relationship between precision-recall and ROC curves, in: Proceedings of the 23rd International Conference on Machine
Learning, ACM, 2006, pp. 233–240.
[29] A.P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognit. 30 (1997) 1145–1159, doi:10.1016/S0031-3203(96)00142-2.
