
1 The modelling toolbox

To deal with the various project data in a principled fashion, the project consortium decided to have WP4 develop a data processing toolbox. WP4 will be the primary user, but the toolbox was to be designed such that other project partners can use it after a short introduction. A comprehensive graphical interface was not envisioned, since this level of ease-of-use goes beyond the needs of the project and would divert too much manpower from algorithm development and data analysis. All software libraries and scripts mentioned here are available to project partners through Martin Felder upon request. Note, however, that devising and tuning ML algorithms is a very problem-specific task, so many of the analysis scripts are likely to change as the real data come in. The first part of this Deliverable therefore focuses on the toolbox itself and its usage, while the second part presents initial results obtained by using the toolbox through scripts.

1.1 Overview
The toolbox takes the form of a library written in the Python programming language. It was not developed solely for NBT, but contains algorithm contributions from other projects, especially in the field of reinforcement learning, which is not a focus of NBT. This has the added benefit of allowing us to quickly try out approaches other than the ones originally planned in the project proposal, in case unforeseen insights about the data are gained. Python was chosen as a programming language for several reasons, most notably because it is a very easy and clean language to learn, and can be used to add new features and experiments quickly. Owing to its origins as a scripting language, Python commands can be tried out on the command line and are easy to debug. This functionality of course comes at the price of a certain speed disadvantage compared to compiled languages. Still, the use of optimised math plug-ins from a growing scientific community, and several options for optimising core functions, make our library fast enough for all data processing applications in NBT. In particular, the SciPy package in conjunction with the matplotlib visualisation library provides a completely free and portable alternative to commercial software like Matlab and IDL.

Our machine learning toolbox is called PyBrain (PYthon-Based Reinforcement learning, Artificial Intelligence and Neural networks library). Its general concept is to encapsulate different data processing algorithms in what we call Modules. A minimal Module contains a forward implementation and a collection of free parameters that can be adjusted, usually through some machine learning algorithm. Modules have an input and an output buffer, plus corresponding error buffers which are used in error backpropagation algorithms. Modules are assembled into Networks by connecting them via Connectors, which again contain a number of adjustable parameters, the connection weights. Note that a Network itself is again a Module, such that it is easy to build hierarchical networks as well. Shortcuts exist for building the most common network architectures, but in principle this system allows almost arbitrary connectionist systems to be assembled.

The free parameters of the Network are adjusted by means of a Trainer, which in the supervised case uses a Dataset to learn the optimum parameters from examples. Validation of the parameters usually requires a separate test dataset, unless the ML method chosen is very fast to train, such that cross-validation can be used, as detailed in Section 1.4.

[Figure 1 flowchart boxes: Raw Data; Preprocessing; PyBrain Dataset (train); PyBrain Dataset (test); Model (SVM, FNN, LSTM); Trainer (BP, SVM, Evolution); Classification; Validation; Visualisation; Results]

Figure 1: Data analysis with the PyBrain toolkit.

Figure 1 shows a general overview of the data processing chain. The remaining sections provide details and examples for ingesting data from project partners (Section 1.2), and for analyses using feed-forward neural networks and recurrent neural networks (Section 1.3), support vector machines (Section 1.4) and some novel, experimental algorithms that have been implemented so far (Section 1.5).

Footnote: For reinforcement learning experiments, a simulation environment with an associated optimisation task is used instead of a Dataset. While this might be useful for NBT at some point, it is currently not foreseen and will thus be omitted from discussion.

1.2 Preprocessing
For preprocessing the data from different groups a collection of Matlab scripts is employed. The GU group uses Matlab extensively to store and analyse their data. SSSA also uses Matlab and LabView to record and treat their data. Hence writing a data preprocessor and converter in Matlab can be considered a canonical solution.

1.2.1 Microneurography preprocessing. While the currently available microneurography data are not final, the data format will most likely not change much, so our preprocessing chain can be used for the upcoming WP2 project data as well. Initial calibration, error checking, and spike extraction are performed by GU, as described in Deliverable 4.1. We can then select options to perform any or all of the following steps:
1. Filter by number of spikes and experimental parameters (velocity, force, etc.).
2. Convert spikes into a series of interspike intervals (ISI).
3. Split the series into time windows of different size, number and location.
4. Calculate a selection of common statistics plus a configurable histogram on the time windows.
5. Assemble feature vectors by combining data from one or several windows.
6. Split the data into training and test dataset.
7. Equalise class distribution.
8. Normalise the features.
9. Store as ASCII or netCDF files.
Alternatively, the raw sequences of ISIs can be stored in a file, or converted to a temporal sampling representation more akin to the original measurements, where each time window of, say, 2 ms holds a 1 if a spike occurs in it, and a 0 otherwise. These data can then be processed with sequence learning methods.
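The two spike representations mentioned above (interspike intervals and the binary temporal sampling) can be illustrated with a minimal sketch. It is written in plain Python rather than Matlab, and the function names and the sample spike times are hypothetical, chosen only for this example; the project's actual preprocessing scripts are not reproduced here.

```python
def interspike_intervals(spike_times):
    """Return the differences between consecutive spike times."""
    return [later - earlier
            for earlier, later in zip(spike_times, spike_times[1:])]


def binary_sampling(spike_times_ms, bin_ms, duration_ms):
    """Return one 0/1 flag per time window of width bin_ms.

    A window holds 1 if at least one spike falls into it, else 0.
    Times are whole milliseconds to keep the example free of rounding issues.
    """
    n_bins = duration_ms // bin_ms
    flags = [0] * n_bins
    for t in spike_times_ms:
        index = t // bin_ms
        if 0 <= index < n_bins:
            flags[index] = 1
    return flags


# Hypothetical spike times in milliseconds (not project data).
spikes = [1, 4, 5, 11]
isi = interspike_intervals(spikes)                              # step 2
sampled = binary_sampling(spikes, bin_ms=2, duration_ms=12)     # 2 ms windows
print(isi)
print(sampled)
```

Note that two spikes falling into the same 2 ms window are indistinguishable in the binary representation, which is one reason to keep the ISI series as an alternative.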

1.2.2 Mechanical and simulated data. Data from the artificial finger V1 and simulated data are easier to handle than spike trains, since they are merely continuous streams of floating point numbers. Measurements from WP5 and simulations by WP3 are delivered as ASCII files with metaparameters encoded in the file name. Preprocessing these data involves:
1. Filter by experimental parameters (velocity, surface, etc.).
2. Split the series into time windows of different size.
3. Assemble feature vectors by combining all data channels of one window.
4. Split the data into training and test dataset.
5. Normalise the features.
6. Store as ASCII or netCDF files.
Figure 2 shows a simple example of preprocessing steps 2 to 4. Again, storing the entire time series as a sequence is also possible.
[Figure 2 plot: panels v_7_80 and v_20_80; channels V1 to V4; axes: sensor readout vs. time / s]

Figure 2: Preprocessing preliminary data from a MicroTAF sensor experiment conducted at SSSA by windowing. Here, one input pattern is defined as all data points that fall within a window of size 10 time steps (20 ms). Patterns are alternately placed into the training and test dataset.
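The windowing and alternating split of steps 2 to 4 might look as follows. This is only a sketch under stated assumptions: windows are taken to be non-overlapping, the helper names make_windows and alternate_split are invented for the example, and the two short channels stand in for real sensor readouts.

```python
def make_windows(channels, window_size):
    """Cut equally long channels into non-overlapping windows.

    Each pattern concatenates the samples of all channels in one window.
    A trailing remainder shorter than window_size is dropped.
    """
    n_samples = len(channels[0])
    patterns = []
    for start in range(0, n_samples - window_size + 1, window_size):
        pattern = []
        for channel in channels:
            pattern.extend(channel[start:start + window_size])
        patterns.append(pattern)
    return patterns


def alternate_split(patterns):
    """Place patterns alternately into a training and a test list."""
    return patterns[0::2], patterns[1::2]


# Two hypothetical sensor channels with 8 samples each.
v1 = [0, 1, 2, 3, 4, 5, 6, 7]
v2 = [10, 11, 12, 13, 14, 15, 16, 17]
patterns = make_windows([v1, v2], window_size=2)
train, test = alternate_split(patterns)
print(train)
print(test)
```

With the window size as a parameter, the trade-off discussed for Figure 4 (many short patterns versus few long ones) can be explored by changing a single argument.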

1.2.3 PyBrain datasets. In general, reading any kind of data into SciPy arrays is straightforward in Python. The PyBrain DataSet class and its subclasses SupervisedDataSet, SequentialDataSet and ClassificationDataSet can then be initialised via

    from pybrain.datasets import SupervisedDataSet
    mydata = SupervisedDataSet(inputs, targets)

where inputs is an array whose first dimension is the number of samples, and whose second dimension is the number of features per sample. The same goes for the parameter targets. In the case of classification data, the targets will simply be the class numbers, from zero to $N - 1$, if there are $N$ classes. One complication arises when training neural networks on classification data: it has been found that encoding the classes in a one-of-many representation is advantageous. This means there are as many output neurons as there are classes, with the one being active at a given time encoding the current class. Conversion to and from this representation is facilitated by the ClassificationDataSet class:

    # targets are class numbers
    mydata = ClassificationDataSet(inputs, targets)
    mydata._convertToOneOfMany()
    # targets are now one-of-many. To convert back:
    mydata._convertToClassNb()
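To make the one-of-many representation concrete, the following stand-alone sketch converts class numbers to one-of-many rows and back. It does not reproduce PyBrain's internal implementation; the two function names are hypothetical and exist only for this illustration.

```python
def to_one_of_many(class_numbers, n_classes):
    """Encode each class number as a row with a single 1 at that index."""
    rows = []
    for c in class_numbers:
        row = [0] * n_classes
        row[c] = 1
        rows.append(row)
    return rows


def to_class_numbers(one_of_many):
    """Decode rows by taking the index of the largest entry.

    Using the maximum also works for real-valued network outputs,
    where the most active output neuron decides the class.
    """
    return [row.index(max(row)) for row in one_of_many]


targets = [0, 2, 1]
encoded = to_one_of_many(targets, n_classes=3)
decoded = to_class_numbers(encoded)
print(encoded)
print(decoded)
```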

1.3 Neural networks


Neural networks are one of the fundamental ML techniques implemented in PyBrain. A biological neural network consists of a collection of neurons linked together in a certain way, often (but not always) in layers. In a simplified view, each cell collects incoming action potentials through its dendrites, processes their accumulated effect in a fairly simple manner, and forwards another electrical signal through its axon to a number of follow-on neurons, to which it is connected via synapses of varying conduction efficiency. Let $w_{ji}$ be the weight associated with the synapse connecting neurons $i$ and $j$ of an arbitrary simulated network. Then, the activation $a_j$ of neuron $j$ is given by

$$a_j = \sum_{i=1}^{L} w_{ji}\, o_i + w_{j0}. \tag{1}$$

Here, $o_i$ is the output of neuron $i$, there are $L$ neurons connected to neuron $j$, and $w_{j0}$ is the bias term of neuron $j$. Bias is often included in the sum by defining an on-neuron with $o_0 \equiv 1$, which is assumed connected to all neurons in the network. The response of a simulated neuron to its activation $a_j$ is given by a transfer function

$$o_j = f_j(a_j). \tag{2}$$

In the simplest case, the transfer function can be selected to be the identity function, or another linear function. Input and output layers often use linear transfer functions. However, in order for the network to exhibit nonlinear properties, $f(\cdot)$ has to be nonlinear at least for one hidden layer. Smooth, S-shaped (=sigmoid) functions behave like threshold or linear functions, depending on scaling, and can be differentiated. The PyBrain module SigmoidLayer implements the logistic function

$$f(x) = \frac{1}{1 + \exp(-x)} \tag{3}$$

which is equivalent to the hyperbolic tangent up to a linear rescaling of input and output, since $\tanh(x/2) = 2 f(x) - 1$.
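Equations 1 to 3 translate almost literally into code. The sketch below evaluates a single simulated neuron in plain Python; the weights, bias and incoming outputs are made-up numbers, and the function names are illustrative only, not PyBrain identifiers.

```python
import math


def activation(weights, bias, outputs):
    """Equation 1: weighted sum of the incoming outputs plus the bias term."""
    return sum(w * o for w, o in zip(weights, outputs)) + bias


def logistic(x):
    """Equation 3: the logistic transfer function."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical neuron j with three incoming connections.
w_j = [0.5, -1.0, 2.0]      # w_j1, w_j2, w_j3
w_j0 = 0.25                 # bias term
o = [1.0, 0.5, 0.25]        # outputs of the connected neurons

a_j = activation(w_j, w_j0, o)
o_j = logistic(a_j)         # Equation 2 with f_j chosen as the logistic function
print(a_j, o_j)
```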

1.3.1 Feed-forward neural networks. In PyBrain, the utility function buildNetwork can be used to construct neural networks of the type just described (also called multi-layer perceptrons). By default, each layer of neurons is completely connected to the next one. The result of a sample call is graphically depicted in Figure 3.

[Figure 3 diagram labels: LinearLayer; FullConnection; SigmoidLayer; FullConnection; LinearLayer]

Figure 3: A neural network resulting from the command buildNetwork(3,5,2). Circles with lines and S-shapes denote linear and sigmoid neurons, respectively. Names of the corresponding PyBrain modules making up this network are given.

Training a neural network means finding the best choice of weight parameters $W = (w)_{ij}$ based on some training data set. This cannot be done analytically, therefore some iterative gradient descent algorithm is usually employed. Assume there is a set of paired input and target vectors, $\{x_n, t_n\}$, and we want to construct an FNN with output $y(x_n; W)$ to model the conditional probability function $p(t_n | x_n)$ according to the maximum likelihood principle, i.e. such that the error function

$$E = \frac{1}{2} \sum_n \left| y(x_n; W) - t_n \right|^2 \tag{4}$$

is minimised. This can be achieved by demanding

$$\frac{\partial E}{\partial W} = 0, \quad \text{i.e.} \quad \frac{\partial E}{\partial w_{ij}} = 0, \quad w_{ij} = (W)_{ij}. \tag{5}$$

Finding this minimum is the task of a gradient descent algorithm, the most famous of which in this context is the so-called backpropagation of errors. For the sake of brevity, we will not lay out its full mathematical details here, which are described in the original literature [Rumelhart et al., 1986] as well as in all standard texts on neural network methods [e.g. Bishop, 2006; Richard O. Duda, 2000]. In PyBrain, the algorithm is schematically implemented as follows:
1. Calculate activations of all neurons in the network given one input pattern (Equation 1). This is the forward pass.
2. Calculate the error function (Equation 4) by comparing with the target.
3. Based on this, calculate the error gradient with respect to the weights for each neuron, starting at the output layer. This is the backward pass.
4. Adjust the weights by a step along the gradient, the size of which is determined by the algorithm parameters learnrate and momentum.
5. Go back to step 1, using the next pattern in the training data set. Complete presentation of all patterns in the set is called an epoch. Once an epoch is finished, randomise the order of patterns and start over.
Most of the algorithm's complexity is hidden from the PyBrain user. The parameters learnrate (essentially a scaling factor for the gradient step) and momentum (a heuristic to avoid local minima) may have to be manually adjusted, by specifying them in the trainer creation call. By default, learnrate=0.01 and momentum=0. The optimal learnrate varies around this default by typically an order of magnitude, while the momentum term is usually set to either 0, 0.1, or 0.9. Listing 1 shows a complete example of how to train an FNN in PyBrain. Note that training the network in batches of epochs with test data performance evaluation in between usually serves to prevent overfitting, by stopping the training run once the error on the test set starts increasing again. This procedure is called early stopping regularisation. Figure 4 shows the error development during a typical training run.
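The five steps above can be sketched for a very small network in plain Python. This is an illustrative toy, not PyBrain's BackpropTrainer: the TinyFNN class, the AND data set, the seed and the chosen learnrate of 0.1 are all assumptions made for the example. The network has a sigmoid hidden layer and one linear output.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyFNN:
    """Toy network: sigmoid hidden layer, one linear output neuron."""

    def __init__(self, n_in, n_hidden, rng):
        # Each weight row carries the bias weight as its last entry.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
        # Previous weight changes, needed for the momentum term.
        self.v_hidden = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        self.v_out = [0.0] * (n_hidden + 1)

    def forward(self, x):
        xb = list(x) + [1.0]                      # append the on-neuron
        hidden = [logistic(sum(w * v for w, v in zip(row, xb)))
                  for row in self.w_hidden]
        hb = hidden + [1.0]
        y = sum(w * v for w, v in zip(self.w_out, hb))
        return xb, hb, y

    def train_pattern(self, x, target, learnrate, momentum):
        xb, hb, y = self.forward(x)               # step 1: forward pass
        delta_out = y - target                    # step 2: dE/dy for E = (y - t)^2 / 2
        # Step 3: backward pass, starting at the output layer.
        delta_hidden = [delta_out * self.w_out[h] * hb[h] * (1.0 - hb[h])
                        for h in range(len(self.w_hidden))]
        # Step 4: step against the gradient, plus a share of the last step.
        for h in range(len(self.w_out)):
            self.v_out[h] = momentum * self.v_out[h] - learnrate * delta_out * hb[h]
            self.w_out[h] += self.v_out[h]
        for h, row in enumerate(self.w_hidden):
            for i in range(len(row)):
                self.v_hidden[h][i] = (momentum * self.v_hidden[h][i]
                                       - learnrate * delta_hidden[h] * xb[i])
                row[i] += self.v_hidden[h][i]


def total_error(net, data):
    """Equation 4 summed over all patterns."""
    return 0.5 * sum((net.forward(x)[2] - t) ** 2 for x, t in data)


rng = random.Random(0)
data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 0.0),
        ((1.0, 0.0), 0.0), ((1.0, 1.0), 1.0)]     # logical AND as a toy task
net = TinyFNN(n_in=2, n_hidden=3, rng=rng)
error_before = total_error(net, data)
for epoch in range(500):
    rng.shuffle(data)                             # step 5: new order every epoch
    for x, t in data:
        net.train_pattern(x, t, learnrate=0.1, momentum=0.1)
error_after = total_error(net, data)
print(error_before, error_after)
```

Weights are updated after every single pattern here, which matches the step list; a batch variant would accumulate the gradients over an epoch before changing the weights.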
FNNs can be considered one of the best understood machine learning tools, and have been successfully applied to a great number of problems in different fields. The backpropagation algorithm has undergone a lot of changes and enhancements, among which the above mentioned momentum term was one of the most successful. Another very successful development is the Resilient Propagation algorithm [RPROP; Riedmiller and Braun, 1993]. It has evolved into several subtypes described in [Igel and Hüsken, 2003]. We have implemented a version of RPROP called RPROP- in PyBrain. To use it, the BackpropTrainer in Listing 1 needs to be replaced with RPropMinusTrainer. RPROP adaptively tracks the required update step width for every weight separately, therefore the parameters momentum and learnrate are not necessary. This simplicity, combined with its very stable and fast training performance on most data sets, makes RPROP the method of choice for almost all problems encountered.
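The idea of a separate, adaptive step width per weight can be shown with a short sketch of the sign-based update without weight backtracking. It is not the code of RPropMinusTrainer; the function name, the quadratic test function and the constants (growth factor 1.2, shrink factor 0.5, step limits) are assumptions, the factors being the values customarily quoted in the RPROP literature.

```python
def sign(value):
    return (value > 0) - (value < 0)


def rprop_minus_step(weights, grads, prev_grads, steps,
                     eta_plus=1.2, eta_minus=0.5,
                     step_min=1e-6, step_max=50.0):
    """Update the weights in place using only the signs of the gradients."""
    for i, grad in enumerate(grads):
        agreement = grad * prev_grads[i]
        if agreement > 0:
            # Same direction as last time: enlarge the step.
            steps[i] = min(steps[i] * eta_plus, step_max)
        elif agreement < 0:
            # Sign flipped, so a minimum was jumped over: shrink the step.
            steps[i] = max(steps[i] * eta_minus, step_min)
        weights[i] -= sign(grad) * steps[i]
        prev_grads[i] = grad


# Minimise the hypothetical error E(w) = (w0 - 3)^2 + 10 * (w1 + 1)^2.
weights = [0.0, 0.0]
prev_grads = [0.0, 0.0]
steps = [0.1, 0.1]
for _ in range(100):
    grads = [2.0 * (weights[0] - 3.0), 20.0 * (weights[1] + 1.0)]
    rprop_minus_step(weights, grads, prev_grads, steps)
print(weights)
```

Because only the sign of each partial derivative enters the update, the very different curvatures of the two weights in this example do not require any tuning of a learning rate.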

1.3.2 Recurrent neural networks. The difference between recurrent networks (RNNs) and FNNs is that the former have circular connections feeding the output of certain neurons back to their own or other neurons' input.


    #!/usr/bin/env python
    # Example script for feed-forward network usage in PyBrain.

    # load the necessary components
    from pybrain.datasets import ClassificationDataSet
    from pybrain.utilities import percentError
    from pybrain.tools.shortcuts import buildNetwork
    from pybrain.supervised.trainers import BackpropTrainer

    # load the training dataset
    trndata = ClassificationDataSet.loadFromFile('traindata.svm')
    # neural networks work better if classes are encoded using
    # one output neuron per class
    trndata._convertToOneOfMany()

    # same for the independent test dataset
    tstdata = ClassificationDataSet.loadFromFile('testdata.svm')
    tstdata._convertToOneOfMany()

    # build a feed-forward network with 20 hidden units, plus
    # a corresponding backpropagation trainer
    fnn = buildNetwork(trndata.indim, 20, trndata.outdim)
    trainer = BackpropTrainer(fnn, dataset=trndata, momentum=0.1)

    # repeat 5 times
    for i in range(5):
        # train the network for 10 epochs
        trainer.trainEpochs(10)

        # evaluate the result on the training and test data
        trnresult = percentError(trainer.testOnClassData(),
                                 trndata['class'])
        tstresult = percentError(trainer.testOnClassData(dataset=tstdata),
                                 tstdata['class'])

        # print the result
        print("epoch: %4d  train error: %5.2f%%  test error: %5.2f%%"
              % (trainer.totalepochs, trnresult, tstresult))

Listing 1: Example Python script to train an FNN in PyBrain.


Figure 4: FNN training on two datasets of the type shown in Figure 2, with different window size. The task was to distinguish between different sample classes (sandpaper grades). Since the total number of data points available did not change, a window size of 2 yields 5 times as many patterns to train on as a window size of 10, but each pattern contains less information. Therefore the error decrease is slower, but overfitting does not occur as rapidly.

While there are many different ways to do this [Jordan, 1986; Elman, 1990; Lang et al., 1990], we will restrict the discussion here to the type displayed in Figure 5, where only the hidden layer feeds back into itself. This means patterns are now presented by stepping through time sequences, not in random order. Let $x^t \in \mathbb{R}^N$ be the input pattern vector at time step $t$. For the hidden layer activations $a_h^t$ of an RNN with $M$ hidden neurons, Equation 1 has to be modified to yield

$$a_h^t = \sum_{i=1}^{N} w_{hi}\, x_i^t + \sum_{k=1}^{M} w_{hk}\, o_k^{t-1}. \tag{6}$$
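Equation 6 can be sketched as a loop over the time steps of one sequence. The code below is a plain-Python illustration, not PyBrain code: the function name is hypothetical, the hidden outputs before the first time step are assumed to be zero, and the logistic function is assumed as the hidden transfer function.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def rnn_hidden_sequence(inputs, w_in, w_rec):
    """Apply Equation 6 along a sequence; return hidden outputs per step.

    w_in[h][i]  weights input i into hidden neuron h (w_hi).
    w_rec[h][k] weights the previous output of hidden neuron k (w_hk).
    """
    n_hidden = len(w_in)
    previous = [0.0] * n_hidden       # assumed state before the first step
    history = []
    for x in inputs:
        activations = [
            sum(w * xi for w, xi in zip(w_in[h], x))
            + sum(w * ok for w, ok in zip(w_rec[h], previous))
            for h in range(n_hidden)
        ]
        previous = [logistic(a) for a in activations]
        history.append(previous)
    return history


# One input, one hidden neuron, two time steps with zero input.
history = rnn_hidden_sequence(inputs=[[0.0], [0.0]],
                              w_in=[[1.0]], w_rec=[[2.0]])
print(history)
```

Even with identical inputs at both time steps, the second hidden output differs from the first, because the recurrent weight feeds the earlier output back in.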

With this seemingly small change the properties of the system change significantly. An RNN can provably approximate any measurable sequence-to-sequence mapping to arbitrary accuracy, given enough hidden neurons [Hammer, 2000]. Training algorithms have to take into account that errors occurring at the current pattern may have their source in past patterns. The main developments in this area are real time recurrent learning [RTRL; Robinson and Fallside, 1987] and backpropagation through time [BPTT; Werbos, 1988; Williams and Zipser, 1994]. Again, we refrain from presenting the mathematical details here, and refer to the original literature and further discussions in [Graves, 2008].

Footnote: We disregard bias here, which can easily be implemented by appending a constant $x_0^t = 1$ for all $t$ to the input.

[Figure 5 diagram: input and output layers at time steps t=1, t=2, t=3, ...]

Figure 5: The type of RNNs discussed here has a hidden layer feeding back into itself. Another perspective is to unfold it over time, such that the dependence of current outputs on past inputs becomes clearer.

While generic RNNs can be assembled in PyBrain, there is rarely a reason to do so because of their practical limitations: it was found that error information from previous time steps tends to either exponentially decrease or blow up. This so-called vanishing gradient problem [Hochreiter, 1991; Bengio et al., 1994] was found to limit the number of time steps over which the RNN can remember relevant information to about ten [Hochreiter et al., 2001], which makes it comparable to an FNN using a window of 10 time steps. While there were several more or less successful attempts to overcome this problem, a breakthrough was eventually achieved by Hochreiter and Schmidhuber [1997] with the introduction of the Long Short-Term Memory (LSTM) network. Its hidden layer consists of specialised blocks of neurons called memory cells (Figure 6) which allow the gradient information to be preserved over long time delays. They are hence particularly suited for problems that involve signal correlations over many time steps, like music generation [Eck and Schmidhuber, 2002], speech recognition [Graves and Schmidhuber, 2005] and handwriting recognition [Liwicki et al., 2007; Graves et al., 2008]. It is thus hoped that they will also yield good results at detecting tactile features at high sampling rates. In PyBrain, LSTM networks can be constructed and trained almost as shown in Listing 1, namely by calling the buildNetwork function with option hiddenclass=LSTMLayer, and using a SequentialDataSet for training.

A SupervisedDataSet or ClassificationDataSet can also be used, but in this case each input pattern is treated as a separate sequence with a length of one time step.


[Figure 6 diagram labels: net input; input gate; forget gate; output gate; CEC; peephole; output]

Figure 6: An LSTM cell is built around a central neuron, called the constant error carousel (CEC), which recycles status information from one time step to the next. Small blue circles indicate multiplicative connections. Whether the status is influenced by the input is controlled by the input gate neuron, while the output gate controls passing on status information. More recent additions are the forget gate to reset the status, and a peephole connection to directly access it.
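The interplay of the gates and the constant error carousel can be illustrated with a single scalar memory cell. This is a simplified sketch, not PyBrain's LSTMLayer: peephole connections are left out, the gate activations are passed in directly instead of being computed from weighted inputs, and tanh is assumed as the squashing function for net input and cell output.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_cell_step(net_input, gate_inputs, state):
    """One time step of a single memory cell without peepholes.

    gate_inputs holds the activations of the input, forget and output
    gates; each is squashed to the range (0, 1) and used as a multiplier.
    """
    input_gate = logistic(gate_inputs["input"])
    forget_gate = logistic(gate_inputs["forget"])
    output_gate = logistic(gate_inputs["output"])
    # Constant error carousel: keep the old state (scaled by the forget
    # gate) and add the gated, squashed net input.
    state = forget_gate * state + input_gate * math.tanh(net_input)
    output = output_gate * math.tanh(state)
    return output, state


# With the input gate closed and the forget gate open, the state survives
# any number of steps regardless of the net input.
closed = {"input": -100.0, "forget": 100.0, "output": -100.0}
state = 0.7
for net_input in [5.0, -3.0, 2.0]:
    output, state = lstm_cell_step(net_input, closed, state)
print(output, state)
```

The unchanged state after several steps is the property that lets error information survive long time delays, in contrast to the plain recurrent layer of Equation 6.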

1.4 Support vector machines (SVM)


SVMs [Vapnik, 1995] belong to the class of maximum margin classifiers. They separate two classes of data by a hyperplane, as sketched in Figure 7. A hyperplane can be defined by

$$\langle x, w \rangle + b = 0$$

where $w$ is normal to the hyperplane, $b$ is the offset, $x$ are the points on the hyperplane, and $\langle a, b \rangle = \sum_i a_i b_i$. The classifier

$$\mathrm{class}(x) = \mathrm{sign}\left( \langle w, x \rangle + b \right)$$

then separates the data into the classes $+1$ and $-1$. The goal of a maximum margin classifier is to find the hyperplane with the widest separation between classes. For SVMs, this is done by quadratic programming techniques on the dual Lagrange formulation, as described e.g. in the excellent tutorials of Burges [1998] and Smola and Schölkopf [2004]. Eventually, the hyperplane is described through the support vectors, which basically are the data points at the margin. Classification of unknown data is performed by comparing it against the support vectors only, not against the full training data. This makes SVMs very sparse and thus scalable.

Linear boundaries do of course not always yield a good classifier. In fact, it can be shown that mapping the input data into a higher dimension, the feature space, by means of some transformation $\Phi(\cdot)$, the feature map, enables one to model complex boundaries in the original data space (Figure 8). The kernel trick is a way of avoiding explicit mapping of $x$ into the high-dimensional space, since the SVM algorithm only needs its scalar product, the kernel $K$:

$$K(x, w) = \langle \Phi(x), \Phi(w) \rangle \tag{7}$$
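For the linear case, the classifier and the margin width shown in Figure 7 can be evaluated directly once $w$ and $b$ are known. The following sketch uses a hand-picked hyperplane rather than a trained SVM; the function names are hypothetical, and mapping a point exactly on the hyperplane to class +1 is an arbitrary choice of this example.

```python
import math


def dot(a, b):
    """Scalar product <a, b>."""
    return sum(ai * bi for ai, bi in zip(a, b))


def classify(x, w, b):
    """class(x) = sign(<w, x> + b), with points on the plane assigned to +1."""
    return 1 if dot(w, x) + b >= 0 else -1


def margin_width(w):
    """Separation 2 / ||w|| between the two margin hyperplanes."""
    return 2.0 / math.sqrt(dot(w, w))


# Hand-picked hyperplane, not the result of any training run.
w = [3.0, 4.0]
b = -5.0
print(classify([3.0, 4.0], w, b))
print(classify([0.0, 0.0], w, b))
print(margin_width(w))
```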


[Figure 7 diagram labels: hyperplanes $\langle x, w \rangle + b = +1$, $\langle x, w \rangle + b = 0$ and $\langle x, w \rangle + b = -1$; margin width $2/\|w\|$; offset $|b|/\|w\|$]

Figure 7: A maximum margin classifier, like an SVM, strives to maximise the separation $2/\|w\|$ between two classes (circles and dots). In this linear case, the separating hyperplane (thick red line) is defined via three support vectors (two crosses and one circle on the thin red lines).

The most common kernels used apart from the trivial linear kernel are Gaussian kernels, also called radial basis function (RBF) kernels:

$$K(a, b) = \exp\left( -\frac{\| a - b \|^2}{\sigma^2} \right)$$
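A Gaussian kernel is simple to evaluate, as the sketch below shows. Note the assumption about normalisation: the exponent is taken to be the squared distance divided by $\sigma^2$, whereas other texts divide by $2\sigma^2$ or use a parameter $\gamma$ instead, so the numerical meaning of the width depends on the convention. Function names and sample points are hypothetical.

```python
import math


def rbf_kernel(a, b, sigma):
    """Gaussian kernel exp(-||a - b||^2 / sigma^2) (normalisation assumed)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-squared_distance / sigma ** 2)


def kernel_matrix(points, sigma):
    """Pairwise kernel values between all points."""
    return [[rbf_kernel(p, q, sigma) for q in points] for p in points]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
gram = kernel_matrix(points, sigma=1.0)
for row in gram:
    print(row)
```

The kernel equals 1 for identical points and decays towards 0 with distance, so a small width makes every training point influence only its immediate neighbourhood.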

The RBF-SVM thus has the advantage of requiring only two crucial parameters (plus some numerical parameters of lesser import):

$\sigma$ is the width of the Gaussian kernel function.
$C$ is a regularisation parameter constraining the amount of slack allowed in the solution. A higher $C$ means stronger punishment of misclassification.

Training an SVM is an iterative procedure and usually very fast compared to training a neural network. Also, since quadratic programming is deterministic, there is no need to carry out multiple trials. However, the result depends strongly on the two meta-parameters $C$ and $\sigma$, for which it is difficult to give default values, since they depend heavily on the data set. Therefore, it is advisable to use some of the speed gain to systematically search for the best meta-parameters. Figure 9 shows the graphical representation of a typical classification performance surface over $C$ and $\sigma$. Searching this entire grid at high resolution is very time consuming, therefore we implemented a design-of-experiments search procedure, GridSearchDOE, following recommendations of Staelin [2003], who has found it to be very efficient and robust. Classification performance is evaluated here using stratified N-fold cross-validation: data from each class are randomly split into N parts, then a test data set is formed out of the first part of each class. The rest of the data becomes the training set. Training is carried out until convergence, and SVM performance is evaluated on the test set. This procedure is repeated for parts 2 to N, and the performance results are averaged. The N-fold increase in computation time can only be afforded for fast methods like SVMs. We usually use N = 5, since an 80/20 split is also quite common when constructing training/test sets for single trials.


Figure 8: Separation of two classes of points with an RBF-SVM. The separating hyperplane in the (infinite-dimensional) Gaussian feature space maps to a very involved boundary in data space.

One SVM-specific problem is their inability to discern more than two classes. Multi-class data sets thus need to be somehow reduced to binary problems. There is an ongoing discussion of how to best achieve this [Hsu and Lin, 2002; Rifkin and Klautau, 2004; El-Yaniv et al., 2006]. Common and simple, but still relatively performant solutions are:

one-vs-one: Split the data into pairs of classes and train an SVM on each pair. When faced with unknown data, present it to all such SVMs and calculate the distance from the boundary, $d = \langle w, x \rangle + b$, for each one. Then use a voting mechanism to decide the class it is in. The rationale here is that only the SVMs that have been trained on the correct class will make a sizable contribution, while contributions from the others cancel out. Alternatively, it is possible to derive class membership probabilities from the raw distances [Wu et al., 2004].

one-vs-rest: Separate one class from the rest of the data and train an SVM on this problem. Repeat for each class. This is probably the simplest way of generating binary problems, but may still yield good results due to the sparsity of SVMs.

Our toolkit provides two different options for using SVMs for classification: We have written a native implementation in PyBrain to test different algorithms and multi-class solutions. The software is working, but relatively slow and complex, hence it is used primarily as a power tool for particularly hard problems. As an alternative for regular problems, as encountered so far in NANOBIOTACT, we have also designed a wrapper around the popular and highly optimised LIBSVM library [Chang and Lin, 2001], the use of which is shown in Listing 2. Note that the structure is somewhat different to Listing 1, due to the different nature of the algorithm as compared to FNNs, and some compromises that had to be made to cater to the library interface.

Figure 9: Performance of an SVM classifier on sample microneurography data, conditioned on its metaparameters kernel width $\sigma$ and slack $C$. lg(·) denotes the binary logarithm.
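The one-vs-one scheme can be sketched with a simple majority vote, which is only one of several possible voting mechanisms. In the example below the pairwise "classifiers" are hypothetical stand-ins that compare distances to class centres on a line; in practice each entry would be a trained binary SVM returning its signed distance $d$.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, classes, binary_classifiers):
    """Majority vote among all pairwise classifiers.

    binary_classifiers[(a, b)](x) returns a signed distance d:
    positive values vote for class a, all other values for class b.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        d = binary_classifiers[(a, b)](x)
        votes[a if d > 0 else b] += 1
    return votes.most_common(1)[0][0]


# Hypothetical one-dimensional problem with three class centres.
centres = {"A": 0.0, "B": 5.0, "C": 10.0}
classes = sorted(centres)
classifiers = {
    (a, b): (lambda x, a=a, b=b: abs(x - centres[b]) - abs(x - centres[a]))
    for a, b in combinations(classes, 2)
}

print(one_vs_one_predict(6.0, classes, classifiers))
print(one_vs_one_predict(-1.0, classes, classifiers))
```

For N classes this scheme needs N(N-1)/2 binary classifiers, compared to N for one-vs-rest, but each of them is trained on a smaller subset of the data.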

1.5 Experimental tools and algorithms


Several more or less experimental algorithms that have shown great promise on synthetic benchmarks and artificial data sets have been implemented in PyBrain, to be tested and evaluated on the project data.

1.5.1 Evolino. EVOlution of systems with LINear Outputs [Evolino; Schmidhuber et al., 2007; Wierstra et al., 2005] is a new class of methods that evolve the weights leading to the nonlinear, hidden nodes of RNNs. Since it is very difficult to evolve accurate networks, only the hidden layers are evolved, while the output is calculated from the hidden state by means of an optimal linear mapping. Both pseudo-inverse based linear regression and linear SVMs lend themselves to this second step. The neuroevolution part is performed through enforced subpopulations (ESP), by co-evolving the weights of different neurons, or LSTM cells, separately. Listing 3 sketches how to run an Evolino experiment in PyBrain, showing only the differences to Listing 1. The option outputbias=False is necessary because the weights to the output layer are computed directly and thus do not need a bias to facilitate learning.
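The second Evolino step, computing the output weights from recorded hidden states by linear regression, can be illustrated on its own. The sketch below covers only this readout for a single output unit and leaves out the evolutionary search entirely; it solves the normal equations, which gives the same weights as the pseudo-inverse when the hidden-state matrix has full column rank. Function names and the tiny data set are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def linear_readout(hidden_states, targets):
    """Least-squares output weights w minimising |H w - t|^2.

    hidden_states is the matrix H with one row of hidden outputs per
    time step; targets is the vector t of desired outputs.
    """
    n = len(hidden_states[0])
    hth = [[sum(h[i] * h[j] for h in hidden_states) for j in range(n)]
           for i in range(n)]
    htt = [sum(h[i] * t for h, t in zip(hidden_states, targets))
           for i in range(n)]
    return solve_linear_system(hth, htt)


# Hypothetical hidden states; targets were generated with weights (2, -1).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
targets = [2.0, -1.0, 1.0, 3.0]
weights = linear_readout(hidden_states, targets)
print(weights)
```

Because the readout is obtained in closed form, the evolutionary search only has to find hidden weights whose states make the targets linearly predictable.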


    #!/usr/bin/env python
    # Example script for SVM classification using PyBrain and LIBSVM

    # load the necessary components
    from pybrain.datasets import ClassificationDataSet
    from pybrain.utilities import percentError
    from svmunit import SVMUnit
    from svmtrainer import SVMTrainer

    # load the training and test datasets
    trndata = ClassificationDataSet.loadFromFile('traindata.svm')
    tstdata = ClassificationDataSet.loadFromFile('testdata.svm')

    # initialize the SVM module and a corresponding trainer
    svm = SVMUnit()
    trainer = SVMTrainer(svm, trndata)

    # train the SVM using design-of-experiments grid search
    trainer.train(search="GridSearchDOE")

    # pass datasets through the SVM to get performance
    trnresult = percentError(svm.forwardPass(dataset=trndata),
                             trndata['class'])
    tstresult = percentError(svm.forwardPass(dataset=tstdata),
                             tstdata['class'])
    print("train error: %5.2f%%, test error: %5.2f%%"
          % (trnresult, tstresult))

Listing 2: Example script for SVM classification using PyBrain and the LIBSVM wrapper.


    # load Evolino modules and sequence evaluator
    from pybrain.supervised.trainers.evolino import EvolinoTrainer
    from pybrain.tools.validation import testOnSequenceData
    ...

    # load datasets etc.
    ...

    # build an LSTM network with 20 hidden units, and the trainer
    net = buildNetwork(trndata.indim, 20, trndata.outdim,
                       hiddenclass=LSTMLayer, outputbias=False)
    trainer = EvolinoTrainer(net, dataset=trndata,
                             evalfunc=testOnSequenceData)

    # training loop
    ...
        trnresult = testOnSequenceData(net, trndata) * 100.
    ...

Listing 3: Training an RNN with Evolino. Only the differences to Listing 1 are shown.
1.5.2 Multi-dimensional RNNs. One of the main advantages of RNNs compared to window-based methods is their ability to take context into account. For a one-dimensional time series, context obviously consists of past samples. It has long been known that in some cases, like language processing, taking the future context into account helps considerably [Schuster and Paliwal, 1997; Graves and Schmidhuber, 2005]. In this case, a separate hidden layer processes the (buffered) time series in the reverse direction, and the results of forward and reverse scan are combined to yield the network output. This procedure has recently been generalised to more than one dimension [Graves et al., 2007]. Roughly speaking, multi-dimensional RNNs (MDRNNs) scan a sheet or volume of data from all directions, and combine the results of the corresponding directional hidden layers. PyBrain already contains an implementation of MDRNNs combined with LSTM, in the form of an MDLSTMLayer. Automatic scanning over a data set is realised through the SwipingNetwork class. Training can be carried out using standard backpropagation-type algorithms, as described in Section 1.3.
