Shallow RNNs
Submitted in partial fulfillment of the requirements for the degree of
MBA-BUSINESS ANALYTICS
Supervisor
Dr. Anurag Chaturvedi
We are sincerely grateful to Dr. Anurag Chaturvedi, who guided us through the
completion of this report. We would also like to thank our teachers for providing us
with knowledge of the critical aspects of the topics related to this report and for
helping us whenever needed.
NAMRIT MEHTA
RITIKA
ABSTRACT
Recurrent Neural Networks (RNNs) capture long-term dependencies, which is why they
are a key component of standard sequence-processing pipelines. However, the sequential
nature of RNNs dictates a significant inference cost for long sequences, even when the
hardware supports parallelism. To capture long-term dependencies while still admitting
parallelization, we introduce Shallow RNNs. In this architecture, the first layer splits the
input sequence and runs several independent RNNs. The second layer consumes the output
of the first layer using a second RNN, thereby capturing long-range dependence. We provide
theoretical justification for our construction under weak assumptions that we verify on
real-world benchmarks. In addition, we show that for time-series classification, our
technique leads to much-improved RNN inference time without compromising accuracy. For
example, we can deploy audio keyword classification on tiny Cortex M4 devices (100MHz
processor, 256KB RAM, no DSP available), which would not be possible using standard
RNN models. Similarly, using ShaRNN in the popular Listen-Attend-Spell (LAS)
architecture for phoneme classification, we can reduce the inference lag by 10-12x
while maintaining high accuracy.
INTRODUCTION
We focus on the challenging task of time-series classification on tiny devices, a problem
arising in several industrial and consumer applications where small edge devices perform
the sensing, monitoring, and prediction under tight latency and resource constraints. An
illustrative example is an interactive cane for people with visual impairments, which must
recognize gestures from the readings of its on-board sensors. Time-series or sequential
data naturally exhibits temporal dependence. Sequential models such as RNNs are well
suited to this context because they can capture temporal dependencies by relating each
prediction to previous inputs. However, applying RNNs directly to prediction in the
settings mentioned above is challenging. As noted by several authors, the sequential
manner in which RNNs process data fundamentally limits parallelism, leading to
significant training and inference costs. In particular, for time-series classification,
the per-prediction processing time scales with the size, T, of the receptive
window, which is not acceptable in resource-limited settings.
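To make that scaling concrete, the following sketch counts the multiply-accumulate operations a plain RNN spends per window. The cell (a simple tanh RNN) and all sizes are illustrative assumptions, not taken from a specific deployment:

```python
# Rough per-window inference cost of a single-layer RNN with hidden size h
# and input size d: each step costs about 2*h*(h+d) multiply-accumulates
# (for W_h @ h + W_x @ x), and the T steps of a window must be processed
# sequentially, so the per-prediction cost grows linearly in T.
def rnn_window_macs(T, h, d):
    per_step = 2 * h * (h + d)
    return T * per_step

print(rnn_window_macs(99, 16, 32))    # 152064 MACs for a T = 99 window
print(rnn_window_macs(198, 16, 32))   # 304128: doubling T doubles the cost
```

On a tiny device this linear dependence on T, with no opportunity to parallelize across steps, is exactly what makes long receptive windows expensive.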
33rd Conference on Neural Information Processing Systems (NeurIPS 2019),
Vancouver, Canada.
A solution proposed in the literature is to replace sequential processing with
feed-forward and convolutional networks. The main insight exploited here is that many
systems require only modestly sized receptive windows, and that this size can be covered
by deep networks with dilated convolutions. However, feed-forward/convolutional
networks require large working memory, making them difficult to deploy on tiny devices.
For this reason, such methods do not work in our setting. For example, a standard audio
keyword spotting network with a modest setting of 32 filters by itself requires about
500KB of working memory and roughly 32x more computation than the baseline
RNN model.
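A back-of-the-envelope check of that memory figure is easy to do. The sizes below are assumptions for illustration (32 filters over a 32 x 99 filter-bank feature map, 4-byte floats); the architectures cited in the literature vary:

```python
# Working memory for just one convolutional activation map:
# filters x frequency_bins x time_steps x bytes_per_float.
filters, freq_bins, time_steps, bytes_per_float = 32, 32, 99, 4
activation_kb = filters * freq_bins * time_steps * bytes_per_float / 1024
print(round(activation_kb))  # ~396 KB before counting input/output buffers
```

Even this single buffer exceeds the RAM of many Cortex M4 parts; once input buffers and further layers are counted, the footprint approaches the ≈500KB figure above.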
Shallow RNNs. To address these challenges, we propose a novel layered/parallel RNN
architecture that maintains the receptive-field length (T) and the base RNN's model
size. Specifically, we propose a simple 2-layer architecture that we call ShaRNN. Both
ShaRNN layers are composed of a collection of shallow RNNs that operate
independently. Specifically, each data point's receptive window is divided
into independent pieces, called bricks, of size k, and a shared RNN runs on each
brick independently, thus ensuring a small model size and short recurrences. That is,
the lower layer of ShaRNN restarts from its initial state after every k ≪ T steps, so
it has only short recurrences. The outputs of the T/k first-layer RNNs are fed as a
sequence to a second RNN, which then produces the prediction. This construction can be
parallelized during inference in the following two settings:
(a) Classification: here we evaluate the independent RNNs in parallel, thus allowing
acceleration on many hardware architectures;
(b) Streaming: here we use sliding windows and reuse computation from older sliding
windows/receptive fields.
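A minimal sketch of the two-layer construction follows. The tanh cells and all sizes here are illustrative assumptions (the experiments in this report use LSTM cells):

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn(xs, W, U, h0):
    """Run a simple tanh RNN over xs (steps x d); return the final hidden state."""
    h = h0
    for x in xs:
        h = np.tanh(W @ h + U @ x)
    return h

T, k, d, h1, h2 = 32, 8, 4, 6, 5          # k divides T, giving T/k = 4 bricks
W1, U1 = rng.normal(size=(h1, h1)), rng.normal(size=(h1, d))
W2, U2 = rng.normal(size=(h2, h2)), rng.normal(size=(h2, h1))
x = rng.normal(size=(T, d))               # one receptive window of length T

# Layer 1: split the window into T/k bricks; the SAME RNN runs on each brick
# from the initial state, so the bricks are independent and can run in parallel.
bricks = x.reshape(T // k, k, d)
v = np.stack([rnn(b, W1, U1, np.zeros(h1)) for b in bricks])

# Layer 2: a second RNN consumes the T/k brick outputs as a short sequence,
# capturing the long-range dependence across bricks.
out = rnn(v, W2, U2, np.zeros(h2))
print(out.shape)  # (5,)
```

Because the first layer restarts from its initial state on every brick, the T/k calls in the list comprehension have no sequential dependence on each other, which is what setting (a) exploits.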
We also note that, unlike the proposed feed-forward methods or restricted RNN methods,
our approach admits full receptive fields and therefore does not lead to information
loss. We further improve ShaRNN by combining it with the recent MI-RNN method, which
reduces the size of the receptive window; we call the resulting method MI-ShaRNN.
While a feed-forward layer could be used in place of the RNN in the second layer, such
layers lead to growth in model size and RAM requirements that is too significant to be
acceptable for tiny devices.
LITERATURE REVIEW
The author proposes a novel and general architecture that, to the best of my knowledge, has not
been described before. Thus the idea of the "shallow" two layer RNN architecture as well as the
accompanying theoretical analysis and experimental results are all novel. Quality: The claims
appear correct, although I am only confident of not having missed important issues for
Claims 1 and 2. The experiments are comprehensive and instill confidence in the proposed
architecture and theoretical guarantees. The code they provide appears about average for this
type of research prototypes. Clarity. Most of the paper is clear and easy to follow. There are
however a few typos and sentences that could be improved with some additional proof reading.
(See below for some of the typos I spotted) Significance. The simplicity of the method
combined with the well-motivated use case of embedded devices with constrained resources
mean that I see this paper as a useful contribution, from which many are likely to benefit and
thus worthy of NeurIPS. Question and comments: When running over 5 random seeds, what
kind of variance is observed? It would be worth mentioning this at least in the supplementary material, to get
a sense of the statistical relevance of the results. 46: ensuring a small model size -> I believe
the model size would not be smaller than that of a standard RNN, if so the claim appears a bit
misleading. Claim 1 appears correct as stated, but the formulation is a bit convoluted, in the
sense that one typically would be given T and w, and can decide on a k; whereas in the current
formulation it appears as if you are given a T and q and can pick an arbitrary k based on that,
which is not really the case. Line 199: from this sentence it is not very clear how SRNN is
combined with MI-RNN; it would be good to give a little more detail, given that all results
using this model are based on a Shallow extension of MI-RNN. In the same vein, the empirical
analysis would be a little stronger if the results of SRNN without MI-RNN would be reported
too. Minor: 37: standard -> a standard 81: main contributions -> our main contributions 90:
receptive(sliding) -> [space is missing] 135: it looks like s starts at 0 where all other indices start
at 1; including line 171 where s starts at 0 137: fairly small constant -> a fairly small constant
138: that is -> which is 139: tiny-devices -> tiny devices 152: I would find it slightly more
readable if the first index v^{(2)} was 1 instead of T/k; if you need an index at all at this point
154: should be v^{(1)} not v^{(2)} 159: tru RNN -> a true RNN 159: principal -> principle 172:
for integer -> for some integer 240: it's -> its 267: ablation study -> an ablation study. latency
budget of 120ms -> it's not clear to me where this exact limit comes from; is it a limit of the
device itself somehow? 318: steps of pionts threby 314: ully -> fully In the MI-RNN paper
[10] they benchmark against GesturePod-6, where the current paper benchmarks against
GesturePod-5, are they different? If so in what way?
The authors propose shallow RNNs, an efficient architecture for time series classification.
Shallow RNNs can be parallelized because the time sequences are broken down into subsequences
that can be processed independently from each other by copies of the same RNN. Their outputs
are then passed to a similarly structured second layer. Multi-layer SRNN extends this to more
than two layers. The paper includes both a runtime analysis (claims 1 and 2) and an analysis of
the approximation accuracy of the shallow RNN compared to a traditional RNN. The idea is
straightforward, but the paper scores very low on clarity. The authors opt for symbol definitions
instead of clear descriptions, especially in the claims. The claims are a central contribution of
the paper but unnecessarily hard to parse. The implications of the claims are not
described by the authors. That's why I scored their significance as low. Here are specific points
that are unclear from the paper: l.133-140 Shouldn't the amortized inference cost for each time
step be C1 i.e. O(1)? Why would you rerun the RNN on each sliding window? l. 165 The
heavy use of notation is distracting from getting an understanding of what window size w and
partition size k you usually use. Is k usually larger than w or the other way around? This makes
it hard to understand how the SRNN architecture interacts with streaming. When the data is
coming in in streams, are the streams partitioned and the partitions distributed or are the streams
distributed? Claim 1: * You already defined $X^s$. Defining it here again just distracts from the
claim. * q is the ratio between w and k (hence it depends on k). It is weird that your statement
relates k to q, which depends on k. Please explain. Claim 2: * The choice of k in Claim 2 seems
incompatible with Claim 1: in Claim 1, k = O(sqrt(T)), while in Claim 2, k = O(1). Claim 3: * What is
M? What is $\nabla^M_h$? Claims 3 and 4: * Are those bounds tight enough to be useful?
Given a specific problem, can we compute how much accuracy we expect to lose by using a
specific SRNN? * Can we use these bounds together with the runtime analysis of claims 1 and 2
to draw tradeoff between accuracy and inference cost like in Figure 2? To me the strength of
this paper is the proposed model and ist implementation on small chips (video in the
supplement) as well as the empirical study. I would have been curious for a discussion on how
the proposed architecture relates to convolutional networks. It seems to me that by setting w
small, k small and L large, you almost have a convolutional network where the filter is a small
RNN instead of a typical filter. In the introduction, it is mentioned that CNNs are considered
impractical. I am curious; could it be that in the regimes for which the accuracy of SRNN is
acceptable (Claims 3 and 4) they are actually also impractical? Complexity similar to CNNs?
Overall this is a well-written paper with proper motivation, clear design, and detailed theoretical
and empirical analysis. The authors attempt to improve the inference efficiency of RNN models
with limited computational resources while keeping the length of its receptive window. This is
achieved by using a 2-layer RNN, whose first layer processes small bricks of the
entire time series in parallel while the second layer gathers the outputs from all bricks. The authors also extend
SRNN in the streaming setting with similar inference complexity. One concern about the bound
in Claim 1 in the streaming setting: in line 137, w is required to be a fairly small constant
independent of T; in line 166, w = k * q (w is a multiple of k, and thus k needs to be a small
constant); in line 173, the bound becomes O(\sqrt{qT} * C_1) iff k = \sqrt{T/q}, which is not
o(1). Therefore, I was expecting analysis in practical applications with large T and small w. In
SRNN, will the O(T/k) extra memory cost be an issue during inference? The extension of
multi-layer SRNN in Section 3.2 provides at least O(log T) inference complexity. The bound
here is too ideal, but it would be great to see empirically how SRNN performs by adding more
shallow layers. The empirical improvements over LSTM and MI-RNN on multiple tasks are
impressive.
Performance and Deployability
Empirical Results
We conduct experiments to study:
a) the effectiveness of MI-ShaRNN with varying hidden-state sizes in both the R(1) and
R(2) layers, to understand how its accuracy compares with the baseline models at
different model sizes,
b) the inference-cost improvements provided by MI-ShaRNN over the baseline and MI-RNN
models on several common time-series classification problems,
c) whether MI-ShaRNN enables certain time-series classification tasks on devices based
on the tiny Cortex M4, which has only a 100MHz processor and 256KB of RAM. Note
that MI-ShaRNN applies ShaRNN on top of the truncated data points provided by
MI-RNN. MI-RNN is known to perform better than baseline LSTMs, so naturally
MI-ShaRNN performs better than plain ShaRNN. Therefore, we present results for
MI-ShaRNN and compare them against MI-RNN to show the benefits of the
ShaRNN approach. Datasets: We evaluate our approach on standard datasets from a
variety of domains: audio keyword detection (Google-13), wake-word detection
(STCI-2), activity recognition (HAR-6), sports-activity recognition (DSA-19), and gesture
recognition (GesturePod-5). The number following the hyphen in a dataset's name
indicates its number of classes. See the appendix for more information on the
datasets. All datasets are publicly available, with the exception of STCI-2, which is a
proprietary dataset. Baselines: We compare our MI-ShaRNN (LSTM) algorithm
against the baseline LSTM method and the MI-RNN (LSTM) method. Note that
MI-RNN and MI-ShaRNN are agnostic to the choice of RNN cell. For simplicity and
consistency, we selected LSTM as the base cell for all methods, but each of them can be
trained with other RNN cells such as GRU or FastRNN. We implemented all the
algorithms in TensorFlow and used Adam to train the models. The Cortex M4 prediction
code is written in C and deployed on the device. All reported numbers are averaged over
5 independent runs. The implementation of our algorithm is released as part of the EdgeML library.
Hyperparameter Selection:
The main hyperparameters are:
a) the hidden-state sizes of the two MI-ShaRNN layers, and
b) the MI-ShaRNN brick size k. In addition, each dataset has an associated number of
time steps, T. MI-RNN lowers T and works with T1 < T time steps. We provide results:
Table: Deployment on Cortex M4 — keyword-spotting accuracy vs. prediction time (ms) on
an M4 device with 256KB RAM and a 100MHz processor. For low-latency keyword
detection (Google-13), the total budget per input window is 120ms.

Method        Baseline        MI-RNN          MI-ShaRNN
Hidden size   16      32      16      32      (16,16)   (32,16)
Accuracy      86.99   89.84   89.78   92.61   91.42     92.67
Time (ms)     456     999     226     494     70.5      117

Results are shown with various hidden-state dimensions to indicate the trade-off
involved in choosing this hyperparameter. We prefer k ≈ √T, with some variation
w.r.t. the step length ω, for
each dataset; we also provide an ablation study to show the impact of various choices of
k on accuracy and inference cost. Accuracy comparison: Table 1 compares the accuracy of
MI-ShaRNN against the baselines and MI-RNN with different hidden sizes in R(1) and R(2). In
terms of prediction accuracy, MI-ShaRNN performs significantly better than the baselines and
is competitive with MI-RNN across all datasets. For example, with only T = 8, MI-ShaRNN is
able to achieve 94% accuracy on the Google-13 dataset, whereas the MI-RNN model requires
T = 49 and the LSTM T = 99 steps. That is, with recurrences only 8 steps deep, MI-ShaRNN
competes with 49- and 99-step-deep LSTMs. For inference cost, we
study the cost incurred per data point in the sliding-window setting. That is, the baseline and
MI-RNN recompute all predictions from scratch for each sliding window, whereas
MI-ShaRNN can reuse computation in the first layer, which results in
significant savings in inference cost. We report cost as the number of additional
floating-point operations (flops) each model needs to perform for each new window.
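Such flop savings translate directly into wall-clock latency on the device. A quick check (a sketch: the timings are the Cortex M4 numbers from the deployment table above, and the pairing by hidden size is our assumption) against the 120ms real-time budget:

```python
# Prediction time per window (ms) on the Cortex M4, from the table above.
baseline_ms = {16: 456, 32: 999}   # plain LSTM, hidden size 16 / 32
sharnn_ms = {16: 70.5, 32: 117}    # MI-ShaRNN (16,16) / (32,16)
budget_ms = 120                    # real-time budget per window

for h in (16, 32):
    speedup = baseline_ms[h] / sharnn_ms[h]
    fits = sharnn_ms[h] <= budget_ms
    print(f"h={h}: {speedup:.1f}x faster, fits 120ms budget: {fits}")
# Only the MI-ShaRNN variants fit the budget; both baseline times exceed it.
```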
For simplicity, we treat addition and multiplication as having the same cost. The number
of non-linear operations is small and almost equal for every method, so we ignore them. Table 1
clearly shows that, to achieve comparable accuracy, MI-ShaRNN is up to 10x faster than the
baseline and up to 5x faster than MI-RNN, even on single-threaded hardware. Figure 2
shows the accuracy-vs-cost trade-off on three of the datasets. We see that, in the
range of desirable accuracy values, MI-ShaRNN is 5-10x faster than the baselines. Next,
we compare the accuracy and flops of MI-ShaRNN for different brick sizes k (see Figure 3
in the Appendix). As expected, k ≈ √T requires the fewest flops for
inference, but the behavior of accuracy is more complex. On these datasets we do not
observe any particular trend in accuracy; all accuracy values are similar, regardless
of k. Google-13 deployment on Cortex M4: We use ShaRNN to deploy a real-time
keyword detection model (Google-13) on a Cortex M4 device. For time-series classification
(Section 3), we need to slide a window over the input and perform classification on each window.
Due to the small RAM of M4 devices (256KB), for real-time detection the system
needs to complete the following tasks within a budget of 120ms: collect data from the
microphone buffer, featurize it, apply the ML model to produce per-window predictions, and
smooth the predictions into a final single output. Typical LSTM models for this task operate
on 1s windows, whose featurization produces a 32 x 99 feature vector; here T = 99. So, even
with a small LSTM (hidden size 16), processing a single window takes 456ms, exceeding the
time budget. MI-RNN is faster but still needs 225ms. Recently, a number of CNN-based
methods were also designed for low-latency keyword detection. However, with just 40
filters applied to the standard 32 x 99 filter-bank features, the required working memory balloons
up to ≈500KB, more than the total memory capacity of M4 devices. Similarly, the
compute requirement of such architectures