
TIME SERIES ANALYTICS

Shallow RNNs

Submitted in partial fulfillment of the requirement for the degree of
MBA-BUSINESS ANALYTICS

Supervisor
Dr. Anurag Chaturvedi

University School of Management & Entrepreneurship

Made By: NAMRIT MEHTA (2K19/BMBA/11)
RITIKA (2K19/BMBA/13)
ACKNOWLEDGEMENT

We, both second-year students of MBA (Business Analytics) at the University School of
Management and Entrepreneurship, Delhi Technological University, take this opportunity
to express our gratitude to everyone who supported us throughout this project.

We are sincerely grateful to Dr. Anurag Chaturvedi, who guided us through the
completion of this report. We would also like to thank our teachers for providing us
with knowledge about the critical aspects of the topics covered in this report and for
helping us whenever needed.

NAMRIT MEHTA
RITIKA
ABSTRACT

Recurrent Neural Networks (RNNs) capture long-term dependencies and are therefore a
key component of standard sequential-data tasks. However, the sequential nature of RNN
state updates imposes significant inference cost for long sequences, even when hardware
parallelism is available. To capture long-term dependencies while still admitting
parallelization, we introduce Shallow RNNs (ShaRNN). In this architecture, the first layer
splits the input sequence into chunks and runs several independent RNNs on them. The
second layer consumes the outputs of the first layer using another RNN, thereby capturing
long-term dependencies. We provide theoretical justification for this construction under
weak assumptions that we validate on real-world benchmarks. In addition, we show that
for time-series classification our approach leads to substantially improved inference time
over standard RNNs without compromising accuracy. For example, we can deploy audio
keyword classification on tiny Cortex M4 devices (100MHz processor, 256KB RAM, no
DSP available), which would not be possible with standard RNN models. Similarly, using
ShaRNN in the popular Listen-Attend-Spell (LAS) architecture for phoneme
classification, we can reduce the lag in phoneme prediction by 10-12x while maintaining
competitive accuracy.
INTRODUCTION
We focus on the challenging task of time-series classification on tiny devices, a problem
arising in several industrial and consumer applications where small edge devices perform
sensing, monitoring, and prediction under tight time and resource constraints. An
illustrative example is an interactive cane for people with visual impairments, which
recognizes gestures made on the cane using its on-board sensors. Time-series or
sequential data naturally exhibits temporal dependence. Sequential models such as RNNs
are well suited in this context because they can capture temporal dependencies by relating
the current state to previous inputs. However, applying RNNs directly to prediction in the
settings mentioned above is challenging. As noted by several authors, the sequential way
in which RNNs process data fundamentally limits parallelization, leading to significant
training and inference costs. In particular, for time-series classification, the inference time
scales with the length T of the receptive window, which is not acceptable in
resource-constrained settings.
A solution proposed in the literature is to replace sequential processing with
feed-forward and convolutional networks. The main insight is that many tasks require
only modestly sized receptive windows, and that the receptive field can be enlarged using
tree-structured networks and dilated convolutions. However, feed-forward and
convolutional networks require large working memory, making them difficult to deploy
on tiny devices. For this reason, such methods do not work in our setting. For example, a
standard audio keyword-spotting network with a modest configuration of 32 filters would
by itself require roughly 500KB of working memory and likely more than 32x the
computation of a basic RNN model.
Shallow RNNs. To address these challenges, we design a novel parallel,
shallow-recurrence RNN architecture that preserves the full receptive-field length T and
the basic RNN model size. Specifically, we propose a simple 2-layer architecture that we
call ShaRNN. Both ShaRNN layers are composed of collections of recurrent neural
networks that operate independently. Each sequential data point (receptive window) is
split into independent chunks, called bricks, of size k, and a shared RNN runs on each
brick independently, thus ensuring a small model size and shallow recurrence. That is, the
lower layer of ShaRNN restarts from its initial state after every k << T steps, so the
recurrence is short. The outputs of the resulting T/k RNNs are fed as a sequence to a
second RNN, which then produces the prediction. This keeps the per-window inference
cost at roughly O(k + T/k) rather than O(T) in the following two settings:
(a) Classification: here we parallelize the independent first-layer RNNs and thus enable
speedups on many-core architectures.
(b) Streaming: here we use sliding windows and reuse computations from older sliding
windows/receptive fields.
We also note that, unlike the feed-forward approaches or restricted-RNN methods
proposed earlier, our proposal uses the full receptive field and therefore does not lose
information. We further improve ShaRNN by combining it with the recent MI-RNN
method, which reduces the size of the receptive window; we call the resulting method
MI-ShaRNN. While a feed-forward layer could be used instead of an RNN in the second
layer, such layers lead to a significant increase in model size and RAM usage, which is
unacceptable for tiny devices. A minimal code sketch of the two-layer scheme is given
below.
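To make the two-layer idea concrete, the following is a minimal, hypothetical sketch of a
ShaRNN-style forward pass. It is not the released EdgeML implementation; it assumes
plain Elman RNN cells (the actual models use LSTM or FastRNN cells), a brick size k
that divides T exactly, and batch-first tensors.

    # Minimal ShaRNN-style sketch (illustrative only, not the authors' EdgeML code).
    import torch
    import torch.nn as nn

    class ShaRNNSketch(nn.Module):
        def __init__(self, input_dim, hidden1, hidden2, num_classes, k):
            super().__init__()
            self.k = k                                                 # brick size
            self.rnn1 = nn.RNN(input_dim, hidden1, batch_first=True)   # shared lower-layer RNN
            self.rnn2 = nn.RNN(hidden1, hidden2, batch_first=True)     # upper-layer RNN over brick summaries
            self.fc = nn.Linear(hidden2, num_classes)

        def forward(self, x):                                          # x: (batch, T, input_dim)
            b, T, d = x.shape
            assert T % self.k == 0, "T must be a multiple of the brick size k"
            # Split the receptive window into T/k independent bricks of length k.
            bricks = x.reshape(b * (T // self.k), self.k, d)
            # The lower RNN restarts from its initial state on every brick, so all
            # bricks can be processed independently (here: as one large batch).
            _, h1 = self.rnn1(bricks)                                  # h1: (1, b*T/k, hidden1)
            summaries = h1.squeeze(0).reshape(b, T // self.k, -1)
            # The upper RNN consumes the T/k brick summaries sequentially.
            _, h2 = self.rnn2(summaries)
            return self.fc(h2.squeeze(0))                              # class logits

    # Example: T = 96, k = 8 -> 12 bricks; sequential depth is 8 + 12 instead of 96.
    model = ShaRNNSketch(input_dim=32, hidden1=16, hidden2=16, num_classes=13, k=8)
    logits = model(torch.randn(4, 96, 32))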
LITERATURE REVIEW

The author proposes a novel and general architecture that, to the best of my knowledge, has not
been described before. Thus the idea of the "shallow" two-layer RNN architecture as well as the
accompanying theoretical analysis and experimental results are all novel. Quality: The claims
appear correct, although I have some confidence in not having missed important issues only for
Claims 1 and 2. The experiments are comprehensive and instill confidence in the proposed
architecture and theoretical guarantees. The code they provide appears about average for this
type of research prototype. Clarity: Most of the paper is clear and easy to follow. There are,
however, a few typos and sentences that could be improved with some additional proofreading.
(See below for some of the typos I spotted.) Significance: The simplicity of the method
combined with the well-motivated use case of embedded devices with constrained resources
means that I see this paper as a useful contribution, from which many are likely to benefit, and
thus worthy of NeurIPS. Questions and comments: When running over 5 random seeds, what
kind of variance is observed? It would be worth mentioning this at least in the supp material, to get
a sense of the statistical relevance of the results. 46: ensuring a small model size -> I believe
the model size would not be smaller than that of a standard RNN; if so, the claim appears a bit
misleading. Claim 1 appears correct as stated, but the formulation is a bit convoluted, in the
sense that one typically would be given T and w, and can decide on a k; whereas in the current
formulation it appears as if you are given T and q and can pick an arbitrary k based on that,
which is not really the case. Line 199: from this sentence it is not very clear how SRNN is
combined with MI-RNN; it would be good to give a little more detail given that all reported
results are based on a Shallow extension of MI-RNN. In the same vein, the empirical
analysis would be a little stronger if the results of SRNN without MI-RNN would be reported
too.

Minor:
37: standard -> a standard
81: main contributions -> our main contributions
90: receptive(sliding) -> [space is missing]
135: it looks like s starts at 0 where all other indices start at 1; including line 171 where s starts at 0
137: fairly small constant -> a fairly small constant
138: that is -> which is
139: tiny-devices -> tiny devices
152: I would find it slightly more readable if the first index v^{(2)} was 1 instead of T/k; if you need an index at all at this point
154: should be v^{(1)} not v^{(2)}
159: tru RNN -> a true RNN
159: principal -> principle
172: for integer -> for some integer
240: it's -> its
267: ablation study -> an ablation study
latency budget of 120ms -> it's not clear to me where this exact limit comes from; is it a limit of the device itself somehow?
318: steps of pionts threby
314: ully -> fully
In the MI-RNN paper [10] they benchmark against GesturePod-6, where the current paper benchmarks against GesturePod-5, are they different? If so in what way?
The authors propose shallow RNNs, an efficient architecture for time series classification.
Shallow RNNs can be parallelized as the time sequences are broken down into subsequences
that can be processed independently from each other by copies of the same RNN. Their outputs
are then passed to a similarly structured second layer. Multi-layer SRNN extends this to more
than two layers. The paper includes both a runtime analysis (claims 1 and 2) and an analysis of
the approximation accuracy of the shallow RNN compared to a traditional RNN. The idea is
straightforward, but the paper scores very low on clarity. The authors opt for symbol definitions
instead of clear descriptions, especially in the claims. The claims are a central contribution of
the paper but unnecessarily hard to parse. The implications of the claims are not
described by the authors. That's why I scored their significance as low. Here are specific points
that are unclear from the paper: l.133-140 Shouldn't the amortized inference cost for each time
step be C1 i.e. O(1)? Why would you rerun the RNN on each sliding window? l. 165 The
heavy use of notation is distracting from getting an understanding of what window size w and
partition size k you usually use. Is k usually larger than w, or the other way around? This makes
it hard to understand how the SRNN architecture interacts with streaming. When the data is
coming in as streams, are the streams partitioned and the partitions distributed, or are the streams
distributed? Claim 1: * You already defined $X^s$. Defining it here again just distracts from the
claim. * q is the ratio between w and k (hence it depends on k). It is weird that your statement
relates k to q, which depends on k. Please explain. Claim 2: * The choice of k in Claim 2 seems
incompatible with Claim 1. In Claim 1, k = O(sqrt(T)); in Claim 2, k = O(1). Claim 3: * What is
M? What is $\nabla^M_h$? Claims 3 and 4: * Are those bounds tight enough to be useful?
Given a specific problem, can we compute how much accuracy we expect to lose by using a
specific SRNN? * Can we use these bounds together with the runtime analysis of claims 1 and 2
to draw a tradeoff between accuracy and inference cost like in Figure 2? To me the strength of
this paper is the proposed model and its implementation on small chips (video in the
supplement) as well as the empirical study. I would have been curious for a discussion on how
the proposed architecture relates to convolutional networks. It seems to me that by setting w
small, k small and L large, you almost have a convolutional network where the filter is a small
RNN instead of a typical filter. In the introduction, it is mentioned that CNNs are considered
impractical. I am curious; could it be that in the regimes for which the accuracy of SRNN is
acceptable (Claims 3 and 4) they are actually also impractical? Complexity similar to CNNs?
Overall this is a well-written paper with proper motivation, clear design, and detailed theoretical
and empirical analysis. The authors attempt to improve the inference efficiency of RNN models
with limited computational resources while keeping the length of its receptive window. This is
achieved by using a 2-layer RNN, whose first layer processes small bricks of the entire time
series in parallel and the second layer gathers the outputs from all bricks. The authors also extend
SRNN in the streaming setting with similar inference complexity. One concern about the bound
in Claim 1 in the streaming setting: In line 137, w is required to be a fairly small constant
independent of T. In line 166, w = k * q (w is a multiple of k, and thus k needs to be a small
constant). In line 173, the bound becomes O(\sqrt{qT} * C_1) iff k = \sqrt{T/q}, which is not
o(1). Therefore, I was expecting analysis in practical applications with large T and small w. In
SRNN, will the O(T/k) extra memory cost be an issue during inference? The extension of
multi-layer SRNN in Section 3.2 provides at least O(log T) inference complexity. The bound
here is too ideal, but it would be great to see empirically how SRNN performs by adding more
shallow layers. The empirical improvements over LSTM and MI-RNN on multiple tasks are
impressive.
Performance and Deployability

We evaluate the two-layer MI-ShaRNN approach by comparing it against state-of-the-art
methods on several benchmark datasets, measuring both accuracy and compute budget.
We show that the proposed 2-layer MI-ShaRNN architecture yields a significant
improvement in inference time while also improving accuracy. For example, on the
Google-13 dataset, MI-ShaRNN achieves about 1% higher accuracy than the baseline
methods while providing a 5-10x improvement in inference cost. A compelling feature of
the architecture is that it allows reuse of computation, which enables deployment on very
small devices. In particular, we demonstrate that this method can be used for real-time
sequence classification on devices based on a tiny ARM Cortex M4 microcontroller with
only 256KB RAM, a 100MHz clock, and no dedicated Digital Signal Processing (DSP)
hardware. Finally, we demonstrate that the bi-LSTM encoder of the LAS encoder-decoder
architecture can be replaced with ShaRNN while maintaining competitive accuracy on the
publicly available TIMIT dataset. This enables deploying the LAS model in a streaming
fashion with roughly a one-second lag in phoneme prediction and O(1) additional cost per
time step; the standard LAS model would instead remain idle for about 8 seconds, since it
processes the entire 8 seconds of audio before generating any prediction.
Theory
We provide theoretical justification for the ShaRNN architecture and show that
significant parallelization can be achieved if the network satisfies certain weak
assumptions. We also point out that additional layers can be stacked to build architectures
with successively shallower recurrence. While we do not pursue this idea here, we
observe that it offers the potential for further improvement in inference time.

In summary, the following are our main contributions:


• We show that, under weak assumptions, the recurrence of RNNs, and consequently the
inference cost, can be reduced significantly.
• We demonstrate this by designing the two-layer ShaRNN (and MI-ShaRNN), which
uses only shallow RNNs with a small amount of recurrence.
• We evaluate MI-ShaRNN (a combination of ShaRNN and MI-RNN) on several
datasets and observe that it learns models nearly as accurate as standard RNNs and
MI-RNN. Owing to its limited recurrence, ShaRNN achieves an inference cost 5-10x
lower than the baseline methods. We deploy the MI-ShaRNN model for real-time audio
keyword detection on a tiny microcontroller, which prior to this work was not feasible
with standard RNNs due to the high cost of inference over sliding audio (receptive)
windows. We also use ShaRNN within the LAS architecture to enable phoneme
classification with less than one second of lag in prediction.
Related Work
Stacked RNNs. Our multi-layer RNN is superficially similar to the stacked RNNs studied
in the literature, but the goals differ. The goal of stacked RNNs is to produce models more
expressive than standard RNNs: each layer is fully recurrent and feeds its output to the
next layer, which is another fully recurrent RNN. Thus, stacking RNNs increases model
size and recurrence, leading to even worse inference time than standard RNNs.

Recurrent nets (training). Typical work on RNNs mainly addresses challenges that arise
during training. Especially with a large receptive window T, RNNs suffer from vanishing
and exploding gradients. Many works propose to mitigate this problem in various ways,
such as gated architectures, adding residual connections to RNNs, or constraining the
learned parameters. Several recent works attempt to reduce the number of gates and
parameters in order to shrink the model, but they still suffer from slow inference because
they remain fully recurrent. In contrast, our focus is on reducing model size and inference
time, and we view these efforts as complementary to our paper.

Recurrent nets (inference time). Some recent work has begun to focus on RNN inference
cost. One line of work learns to skip state updates so that not all hidden states need to be
computed. Another exploits the domain knowledge that the true signature is much shorter
than the trace, in order to reduce the length of the sliding window. Both methods are
complementary to ours, and we in fact use the second one in our system. The recent work
on dilated RNNs is also relevant. While it may serve as a viable solution, we note that, in
its original form, the dilated RNN has a first layer that is fully recurrent and is therefore
impractical in our setting. An alternative is to introduce dilation in the first layer to
improve inference time; however, dilation skips steps and can therefore miss critical local
context. Finally, CNN-based methods allow high parallelization for sequential tasks, but
as discussed in Section 1, they also lead to significantly higher RAM usage compared to
RNNs and so cannot be deployed on tiny devices.

Empirical Results
We conduct an empirical study to evaluate:
a) the accuracy of MI-ShaRNN, with the hidden-state sizes of both the R(1) and R(2)
layers varied, to understand how it compares with the baseline models across different
model sizes,
b) the inference-cost improvements that MI-ShaRNN provides on standard time-series
classification problems relative to the baseline and MI-RNN models,
c) whether MI-ShaRNN enables real-time time-series classification on devices based on a
tiny Cortex M4 with only a 100MHz processor and 256KB RAM.

Note that MI-ShaRNN applies ShaRNN on top of the trimmed receptive windows
provided by MI-RNN. MI-RNN is known to perform better than baseline LSTMs, so
naturally MI-ShaRNN performs better than plain ShaRNN. We therefore present results
for MI-ShaRNN and compare them against MI-RNN to demonstrate the benefit of the
ShaRNN approach.

Datasets: We evaluate our approach on standard datasets from a variety of domains, such
as audio keyword detection (Google-13), wake-word detection (STCI-2), activity
recognition (HAR-6), sports-activity recognition (DSA-19), and gesture recognition
(GesturePod-5). The number after the hyphen in a dataset's name indicates its number of
classes; see the appendix for more details on the datasets. All datasets are publicly
available, with the exception of STCI-2, which is a proprietary wake-word dataset.

Baselines: We compare our MI-ShaRNN (LSTM) algorithm against the baseline LSTM
method and the MI-RNN (LSTM) method. Note that MI-RNN and MI-ShaRNN are
agnostic to the choice of RNN cell. For simplicity and consistency, we select LSTM as the
base cell for all methods, but each of them can be trained with other RNN cells such as
GRU or FastRNN. We implemented all algorithms in TensorFlow and used Adam to train
the models. The prediction code for the Cortex M4 device is written in C and compiled
onto the device. All reported numbers are averaged over 5 independent runs. The
implementation of our algorithm is released as part of the EdgeML library.
Hyperparameter Selection:
The main hyperparameters are:
a) the hidden-state sizes of the two MI-ShaRNN layers, and
b) the MI-ShaRNN brick size k.
In addition, the number of time steps T is fixed for each dataset; MI-RNN reduces T and
works with T1 < T time steps. We report results with various hidden-state dimensions to
indicate the trade-off involved in choosing this hyperparameter. We prefer k ≈ √T, with
some variation depending on the stride ω of each dataset; we also provide an ablation
study showing the impact of different choices of k on accuracy and inference cost.

Deployment on Cortex M4 (classification accuracy vs. processing time in ms on an M4
device with 256KB RAM and a 100MHz processor; for low-latency keyword detection on
Google-13, the total budget per input window is 120 ms):

                         Baseline          MI-RNN            MI-ShaRNN
Hidden size              16      32        16      32        (16, 16)   (32, 16)
Accuracy (%)             86.99   89.84     89.78   92.61     91.42      92.67
Processing time (ms)     456     999       226     494       70.5       117
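As a quick illustration of the k ≈ √T rule of thumb, the following back-of-the-envelope
sketch (our own arithmetic, not the paper's code) compares the sequential depth of a
standard RNN with that of the two-layer split; it assumes each layer performs one RNN
step per input element or brick summary and that the first-layer bricks run in parallel.

    import math

    def sharnn_depth(T, k):
        """Sequential depth for brick size k over a window of length T:
        k steps in the lower layer (bricks run in parallel) + T/k steps in the upper layer."""
        bricks = math.ceil(T / k)
        return k + bricks

    T = 99                      # Google-13 receptive window length used above
    k = round(math.sqrt(T))     # the k ~ sqrt(T) heuristic gives k = 10 (the report quotes 8 for Google-13)
    print("standard RNN depth:", T)                    # 99 sequential steps
    print("ShaRNN depth with k =", k, ":", sharnn_depth(T, k))   # 10 + 10 = 20 sequential steps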
Accuracy comparison: We compare the accuracy of MI-ShaRNN against the baselines
and MI-RNN for different hidden-state sizes of R(1) and R(2). In terms of prediction
accuracy, MI-ShaRNN performs significantly better than the baselines and is competitive
with MI-RNN across all datasets. For example, with a recurrence depth of only 8,
MI-ShaRNN achieves 94% accuracy on the Google-13 dataset, while the MI-RNN model
requires T = 49 and the LSTM requires T = 99 steps. That is, with 8 steps of recurrence,
MI-ShaRNN is able to compete with 49- and 99-step-deep LSTMs.

For inference cost, we study the cost paid per data point in the sliding-window setting.
That is, the baseline and MI-RNN must recompute the prediction for each sliding window
from scratch, whereas MI-ShaRNN can reuse first-layer computations across overlapping
windows, which results in significant savings in inference cost. We report inference cost
as the number of additional floating-point operations (flops) each model must perform for
every new window. For simplicity, we assign the same cost to multiplications and
additions; the number of non-linearity evaluations is small and almost identical across
methods, so we ignore it. Table 1 clearly shows that, to achieve the best accuracy,
MI-ShaRNN requires up to 10x less computation than the baseline and up to 5x less than
MI-RNN, even on a single-core configuration. Figure 2 shows the accuracy-versus-cost
trade-off for three of the datasets; in the range of desirable accuracy values, MI-ShaRNN
is 5-10x faster than the baselines. Next, we compare the accuracy and flops of
MI-ShaRNN for different brick sizes k (see Figure 3 in the Appendix). As expected,
k ≈ √T requires the fewest flops for inference, but the behaviour of accuracy is more
nuanced: on this dataset we do not observe any clear trend, and all accuracy values are
similar regardless of k.
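The first-layer reuse mentioned above can be made concrete with a small sketch (again
our own hypothetical code, not the paper's implementation). It assumes the window
slides by exactly one brick (stride = k) and that lower_rnn and upper_rnn are callables
mapping a sequence to a summary value; only the newest brick is pushed through the first
layer, while earlier brick summaries are served from a cache.

    from collections import deque

    def make_streaming_sharnn(lower_rnn, upper_rnn, T, k):
        summaries = deque(maxlen=T // k)              # cached first-layer outputs, one per brick

        def step(new_brick):
            """Consume the k newest samples; return a prediction once the window is full."""
            summaries.append(lower_rnn(new_brick))    # only the new brick hits layer 1
            if len(summaries) < T // k:
                return None                           # receptive window not yet full
            return upper_rnn(list(summaries))         # layer 2 runs over the T/k cached summaries

        return step

    # Toy usage with stand-in "RNNs" (simple averages) just to show the data flow.
    stream = make_streaming_sharnn(
        lower_rnn=lambda brick: sum(brick) / len(brick),
        upper_rnn=lambda sums: sum(sums) / len(sums),
        T=16, k=4,
    )
    for t in range(0, 32, 4):
        out = stream(list(range(t, t + 4)))
        if out is not None:
            print("window ending at t =", t + 4, "-> layer-2 output", round(out, 2))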
Google-13 deployment on Cortex M4: We use ShaRNN to deploy a real-time
keyword-spotting model (Google-13) on a Cortex M4 device. For time-series
classification, we need to slide a window over the incoming stream and produce a class
prediction for each window. Given the limited RAM of M4 devices (256KB), for real-time
detection the system must complete the following tasks within a budget of 120ms: collect
data from the microphone buffer, featurize it, compute the ML prediction, and smooth the
predictions into a single final output. Typical LSTM models for this task operate on
1-second windows, whose featurization produces a 32 x 99 feature matrix; here T = 99.
So even a small LSTM (hidden size 16) takes 456ms to process a single window,
exceeding the time budget. MI-RNN is faster but still needs about 226ms. Recently, a
number of CNN-based methods have also been designed for small-footprint keyword
spotting. However, with just 40 filters applied to the standard 32 x 99 filter-bank features,
the required working memory balloons to roughly 500KB, which exceeds the total
memory capacity of M4 devices; the compute requirement of such architectures is
likewise far beyond the available budget.
