Nano-FPGA Processors For BMI

A LOW-POWER IMPLANTABLE NEUROPROCESSOR ON NANO-FPGA FOR BRAIN
MACHINE INTERFACE APPLICATIONS

Fei Zhang¹, Mehdi Aghagolzadeh¹, and Karim Oweiss¹,²
Department of Electrical and Computer Engineering¹ and Neuroscience Program²
Michigan State University, East Lansing, MI 48824, USA
ABSTRACT
This paper presents the implementation of a low-power implantable
neuroprocessor on a low-cost nano-FPGA for data reduction and
on-the-fly spike sorting in Brain Machine Interface applications.
A detailed analysis of hardware resource utilization, power
consumption and design scalability is provided. The prototype
reported here simultaneously processes 32 channels of data sampled
at 25 kHz/channel with 8-bit/sample resolution, consuming less than
5 mW in all modes of operation (monitoring, compression and sensing)
at a 1.2 V core supply on a 5 mm × 5 mm nano-FPGA.
Index Terms: Neuroprocessor, nano-FPGA, brain machine
interfaces, low-power, spike sorting, compression
1. INTRODUCTION
Neural recording using multisite microelectrode arrays has
revolutionized our understanding of functional neural circuits owing
to the high spatial and temporal resolution of the signals these
arrays collect. They are consequently paving the way toward treating
many neurological diseases and disorders such as epilepsy and
Parkinson's disease [1]. Equally important is their ability to
provide subjects with motor and communication deficits a means
to interact naturally with their environment through Brain Machine
Interfaces (BMI) [2].
For these devices to be clinically viable, they must be
embedded into systems that meet the stringent requirements
imposed by the BMI application at hand. First, they should enable
continuous and simultaneous recording from a large number of
electrode channels to actuate neuroprosthetic devices with many
degrees of freedom. Second, they should feature real-time signal
processing to extract critical information early in the data stream
and reduce system latency [3]. Third, they should feature wireless
telemetry of both data and power so that the subject can interact
freely with the surroundings and the risk of infection and
discomfort is minimized. Fourth, they should consume little power to
prevent excessive heating of the surrounding tissue (< 65 mW/cm²).
Fifth, they should be miniaturized to meet implantability
constraints. Finally, they should be programmable and versatile to
accommodate a wide variety of experimental conditions.
Several state-of-the-art wireless neural recording systems have
been designed, but they only partially meet the above
requirements. For example, Stanford's Hermes system [4] and
Brown's neurosensor [5] can wirelessly telemeter neural data from
large subjects (such as monkeys), but they use off-the-shelf
components whose size and weight make them unsuitable for human
applications, and they lack on-chip information extraction. The
Utah integrated neural interface [6] and the Michigan multichannel
system [7], on the other hand, use application specific integrated
circuits (ASICs) that feature small size and low power consumption,
along with some elementary information extraction (spike detection)
to reduce the wireless telemetry bandwidth, but at the expense of
losing spike identities, limited programmability and a costly
fabrication process. These systems, as well as others [8-10], lack
the complete information extraction needed for rapid translation to
clinical applications. Specifically, information extraction here
refers to the ability to reduce the telemetry bandwidth without
compromising the spike identities needed to recover each neuron's
firing pattern in the recorded ensemble. This requires sorting the
spikes on chip before wireless telemetry. Arguably, this is the
most computationally demanding step; as a consequence, all existing
systems telemeter all the data (or a compressed version of it) to a
powerful computing platform to perform this task. While this is
acceptable in a controlled lab environment for basic neuroscience
investigations, it is unacceptable in clinical BMI applications, in
which neuronal firing patterns have to be decoded instantaneously
to actuate external devices.
In this paper, we propose an approach to extract this
information on the fly, thereby shortening the thought-to-action
latency. Specifically, we report on a low-power, small-size
neuroprocessor implemented on a nano-FPGA that is capable of
extracting this information from multiple channels simultaneously.
We demonstrate that this implementation meets all the
aforementioned design requirements, with overall performance
comparable, if not superior, to that of other systems.
2. SYSTEM ARCHITECTURE
2.1 Background
As shown in Fig. 1, the neuroprocessor is one of three main
elements of a fully implantable Neural Interface Node (NIN). The
NIN can be hardwired to a separate electrode array and forms the
front end of a Wireless Intra-cortical Multi-scale Neural Interface
System (WIMNIS) currently under development in our lab [11].
Briefly, the NIN communicates the extracted information to an
external Manager Interface Module (MIM) that is fixed in close
proximity to the implanted NINs. The MIM manages power, clock, data
and control commands to and from possibly multiple NINs. It is also
equipped with a translation algorithm (a decoder) that translates
the neural firing patterns into control commands to actuate an
external device. This is a fundamental design aspect that makes
WIMNIS unique compared to other systems.
This work was supported by the National Institutes of Health under
grant NS062031.
978-1-4577-0539-7/11/$26.00 ©2011 IEEE
ICASSP 2011
Figure 1. NIN functional diagram
The neuroprocessor has three operational modes controlled
by an externally configurable register: 1) Monitoring mode (MM,
red arrow), where neural data are transmitted at full bandwidth
sequentially from each channel to permit estimating the channel
parameters and the compression/sorting thresholds used during
the other two modes; 2) Compression mode (CM, green arrow), in
which a sparse representation of the neural data is obtained
through a discrete wavelet transform (DWT) followed by
thresholding and Run Length Encoding (RLE); and 3) Sensing mode
(SM, blue arrow), where only spike time stamps are transmitted
after DWT-based spike sorting on chip [12].
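To put the three modes in perspective, the load each one places on the wireless link can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not a model of the hardware: the channel count, sampling rate and sample width come from the text, whereas the 20% kept-coefficient fraction (borrowed from the Fig. 5 example), the three units per channel, the 20 Hz mean firing rate and the 18-bit event word are hypothetical values chosen for illustration.

```python
# Back-of-the-envelope telemetry rates for the three operational modes.
# CHANNELS, FS_HZ and BITS_PER_SAMPLE come from the text; every argument
# passed in the example at the bottom is a hypothetical value.

CHANNELS = 32           # electrode channels
FS_HZ = 25_000          # samples per second per channel
BITS_PER_SAMPLE = 8     # ADC resolution


def raw_rate_bps() -> int:
    """Aggregate rate if every sample of every channel is sent (upper bound)."""
    return CHANNELS * FS_HZ * BITS_PER_SAMPLE


def compression_floor_bps(keep_fraction: float,
                          bits_per_coeff: int = BITS_PER_SAMPLE) -> float:
    """Lower bound for CM that counts only the surviving coefficients.

    A critically sampled DWT yields one coefficient per input sample, of
    which keep_fraction survive thresholding. The run-length fields added
    by the RLE block are ignored (simplifying assumption).
    """
    return CHANNELS * FS_HZ * keep_fraction * bits_per_coeff


def sensing_rate_bps(units_per_channel: int, mean_rate_hz: float,
                     bits_per_event: int) -> float:
    """SM: one time stamp (plus unit label) per sorted spike; inputs are hypothetical."""
    return CHANNELS * units_per_channel * mean_rate_hz * bits_per_event


if __name__ == "__main__":
    raw = raw_rate_bps()
    cm = compression_floor_bps(keep_fraction=0.20)
    sm = sensing_rate_bps(units_per_channel=3, mean_rate_hz=20, bits_per_event=18)
    print(f"all samples: {raw / 1e6:.2f} Mbit/s")
    print(f"CM (floor):  {cm / 1e6:.2f} Mbit/s")
    print(f"SM:          {sm / 1e3:.2f} kbit/s")
```

Under these hypothetical numbers, sensing mode needs well under 1% of the uncompressed rate, which illustrates why sorting before telemetry is attractive.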
2.2 VLSI Architecture
As shown in Fig. 2, the neuroprocessor mainly comprises a Finite
State Machine (FSM) based controller that controls the timing and
sequence of operations, a lifting-DWT-based Computation Core (CC)
[13], a Run Length Encoder (RLE), a comparator, and six memory
modules that hold the incoming data, coefficients, intermediate CC
products, thresholds and the intermediate values needed for
multichannel, multilevel, interleaved DWT computation [13].
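The lifting scheme at the heart of the CC replaces filter convolutions with in-place predict and update steps, which suits fixed-point hardware. The sketch below illustrates the idea for many channels and several levels at once. It is a stand-in under stated assumptions: it uses the integer Haar (S-transform) lifting steps for brevity, whereas the wavelet, word lengths and interleaved scheduling of the actual core [13] may differ, and it vectorizes across channels with NumPy instead of time-multiplexing one datapath. The function names are hypothetical.

```python
import numpy as np


def haar_lift_forward(x):
    """One level of integer Haar lifting along the last axis.

    x has shape (channels, n) with n even; returns (approx, detail),
    each of shape (channels, n // 2).
    """
    even = x[..., 0::2].astype(np.int32)
    odd = x[..., 1::2].astype(np.int32)
    detail = odd - even              # predict: each odd sample from its even neighbour
    approx = even + (detail >> 1)    # update: yields the rounded-down mean of the pair
    return approx, detail


def haar_lift_inverse(approx, detail):
    """Exact integer inverse of haar_lift_forward."""
    even = approx - (detail >> 1)    # undo update
    odd = detail + even              # undo predict
    x = np.empty(approx.shape[:-1] + (2 * approx.shape[-1],), dtype=np.int32)
    x[..., 0::2] = even
    x[..., 1::2] = odd
    return x


def multilevel_dwt(x, levels=4):
    """Return [d1, d2, ..., dL, aL] for all channels at once.

    Only the approximation is split again at each level (dyadic DWT);
    the block length must be divisible by 2**levels.
    """
    coeffs, approx = [], np.asarray(x)
    for _ in range(levels):
        approx, detail = haar_lift_forward(approx)
        coeffs.append(detail)
    coeffs.append(approx)
    return coeffs


def multilevel_idwt(coeffs):
    """Rebuild the samples from [d1, ..., dL, aL]."""
    approx = coeffs[-1]
    for detail in reversed(coeffs[:-1]):
        approx = haar_lift_inverse(approx, detail)
    return approx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    block = rng.integers(-128, 128, size=(32, 64))   # 32 channels of 8-bit samples
    coeffs = multilevel_dwt(block, levels=4)
    print([c.shape for c in coeffs])
    print("lossless:", np.array_equal(multilevel_idwt(coeffs), block))
```

Because every lifting step is inverted exactly in integer arithmetic, the transform is lossless before thresholding; any loss in compression mode comes only from the coefficients that are zeroed.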
Figure 2. VLSI architecture of lifting DWT computation [13]

3. IMPLEMENTATION ON NANO FPGA
ASIC and FPGA implementations, which offer different value
propositions, were carefully evaluated using cost,
programmability, power and size as the key decision criteria. The
programmability of the FPGA is a decisive advantage for our
application, because changes to the embedded algorithm design are
much easier, cheaper, faster and less risky than changes to ASIC
hardware, particularly after the system is implanted in the brain.

3.1 Synthesis of the Neuroprocessor
The neuroprocessor was designed in Verilog and fully synthesized
with Libero IDE 9.0. The synthesized design comprised more than
610,000 system gates (about 15,000 D-flip-flops). However, the
memories alone consumed more than 90% of these gates, so the logic
of a general-purpose FPGA could not accommodate the entire
neuroprocessor implementation. The required hardware resources are
summarized in Table I. The other blocks, such as the computation
core (about 15,706 system gates/386 D-flip-flops), required very
few gates. Hence, to fit this implementation on an implantable
FPGA, embedded memory blocks were used to accommodate the memory
demand and avoid the heavy consumption of system gates.

TABLE I. MEMORY HARDWARE DEMANDS
Memory Type          | Size (bit)  | System Gates | D-FFs
Channel&Level Memory | 32 × 4 × 32 | 352579       | 8665
Pairing Memory       | 32 × 3 × 16 | 138265       | 3398
Input FIFO Buffer    | 32 × 8      | 24373        | 599
Threshold Memory     | 32 × 4 × 7  | 80526        | 1979

3.2 Neuroprocessor on Nano-FPGA
The flash-based IGLOO nano-FPGAs [14] with embedded memory blocks
exhibit power characteristics similar to those of an ASIC design,
making them an ideal choice for power-sensitive applications. In
particular, the AGLN250, built on a 130 nm process, has sufficient
resources (250,000 system gates and 36 kbits of configurable memory
blocks) and fits our size constraint (5 mm × 5 mm).
After the memories above were replaced with the embedded memory
blocks of the nano-FPGA, the resource utilization of the
neuroprocessor was reduced to the figures listed in Table II, which
shows that the design fits within the available resources of this
FPGA. Internal clock conditioning circuitry based on an integrated
Phase Locked Loop (PLL) can be used to generate the clock required
for the desired wireless data transmission rates. It is worth noting
that, once programmed, the configuration data become an inherent
part of the FPGA, so no external configuration data need to be
loaded at system power-up (unlike SRAM-based FPGAs).
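Tallying Table I shows why the gate-based memories did not fit and why embedded RAM does. The sketch below assumes that each Table I size entry is the product of the listed factors, which is how we read the flattened table, and it ignores block-RAM width/depth constraints and any extra buffering, so the raw bit count is only a first-order check (Table II reports more embedded RAM in use than this count alone suggests). The 36 kbits are taken as 36 × 1024 bits, an assumption about the kilobit convention.

```python
from math import prod

# Table I as reported: (size factors in bits, system gates, D-flip-flops).
TABLE_I = {
    "Channel&Level Memory": ((32, 4, 32), 352_579, 8_665),
    "Pairing Memory":       ((32, 3, 16), 138_265, 3_398),
    "Input FIFO Buffer":    ((32, 8),      24_373,   599),
    "Threshold Memory":     ((32, 4, 7),   80_526, 1_979),
}
CC_GATES = 15_706              # computation core, from the text
AGLN250_RAM_BITS = 36 * 1024   # 36 kbits of embedded RAM (binary kilobits assumed)


def tally(table):
    """Sum the memory bits, system gates and D-FFs over all memories."""
    bits = sum(prod(dims) for dims, _, _ in table.values())
    gates = sum(g for _, g, _ in table.values())
    dffs = sum(f for _, _, f in table.values())
    return bits, gates, dffs


bits, gates, dffs = tally(TABLE_I)
memory_share = gates / (gates + CC_GATES)   # fraction of gates spent on memory
print(f"memory bits:  {bits} ({bits / AGLN250_RAM_BITS:.1%} of embedded RAM)")
print(f"memory gates: {gates} ({memory_share:.1%} of memory + CC gates)")
print(f"memory D-FFs: {dffs}")
```

The memory gates (595,743) plus the CC (15,706) give 611,449 gates, consistent with the "more than 610,000" quoted in Section 3.1, and the memory D-FF total (14,641) matches the "about 15,000" figure.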
TABLE II. HARDWARE RESOURCES OF FINAL DESIGN ON AGLN 250
Nano-FPGA Core Resource Type | Total | Used | Percentage
Embedded RAM/FIFO (kbits)    | 36    | 20   | 62.50%
VersaTile (D-flip-flops)     | 6144  | 1052 | 17.12%
PLL                          | 1     | 1    | 100%
Flash*Freeze                 | 1     | 1    | 100%

4. RESULTS
To test the system's full-speed operation (6.4 MHz), neural data
were uploaded to the SRAM of a Cyclone III FPGA, used for testing
purposes only, which provided 8-bit formatted data to the
neuroprocessor implemented on the AGLN250 FPGA. Fig. 3 shows the
power consumption and the corresponding equivalent sampling rates
of the 32-channel DWT implementation at different master clock
frequencies of the nano-FPGA. Fig. 4 plots the execution time as a
function of master clock frequency, together with the data
throughput rate. From these measurements, the power consumption was
calculated as a function of the number of channels at a 25 kHz
sampling rate.

[Figures 3 and 4 (plots). Recoverable axis labels: Power Consumption (mW), Sampling Rate (×1000 samples/sec), Execution Time (µs), Data Throughput Rate (Msamples/sec), Number of Channels, Master Clock Frequency (MHz).]
Figure 3. Equivalent sampling rate and measured DWT power
consumption for different master clock frequencies
Figure 4. DWT execution time and data throughput rates as a
function of master clock frequency

The measured power consumption for each of the three modes of
operation is listed in Table III. An idle mode is also provided to
save power when the system is not running, while some system
settings, such as threshold values and the last operational mode,
are retained. In this mode, the analog conditioning circuitry and
the DWT computations are turned off. The idle mode can be
implemented on demand using the ultra-low-power Flash*Freeze mode
[14], with power consumption as low as a few µW; no additional
components are needed to turn off I/Os or clocks, and the design
information, SRAM and register contents are retained.

TABLE III. MEASURED FPGA CORE POWER CONSUMPTION (mW)
FPGA Core Voltage (V) | MM   | CM   | SM
1.2                   | 4.68 | 4.82 | 4.92
1.5                   | 5.97 | 6.11 | 6.23

4.1 Rate-Distortion tradeoff
It has been established that effective data compression can be
achieved by thresholding DWT coefficients [16]: most coefficient
values fall below a given threshold and become long runs of zeros
that the RLE block encodes losslessly, while the few values above
that threshold suffice to reconstruct the neural spikes. Fig. 5
shows the tradeoff between signal integrity and compression rate;
the inset gives an example of the original and reconstructed
waveforms, where only 20% of the coefficients were used for the
reconstruction. This demonstrates that the implementation
substantially compresses neural signals while preserving the
information needed (spike waveform shapes) in case off-chip spike
sorting is required.

[Figure 5 (plot): Mean Square Error versus Compression Rate (%). Inset: Original Signal and Reconstructed Signal, Amplitude versus Time (ms).]
Figure 5. Tradeoff between signal integrity and compression rate
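The thresholding-plus-RLE idea can be sketched in a few lines. This is an illustration under stated assumptions, not the on-chip pipeline: it swaps in a floating-point orthonormal Haar DWT (so the reconstruction error equals the energy of the discarded coefficients), keeps a chosen fraction of the largest-magnitude coefficients instead of applying per-channel thresholds estimated in monitoring mode, and runs on a synthetic signal (Gaussian noise plus three hypothetical spike-shaped deflections). All helper names are hypothetical.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)


def haar_dwt(x, levels):
    """Orthonormal multilevel Haar DWT; returns [d1, ..., dL, aL]."""
    coeffs, a = [], np.asarray(x, dtype=float)
    for _ in range(levels):
        even, odd = a[0::2], a[1::2]
        coeffs.append((even - odd) / SQRT2)   # detail
        a = (even + odd) / SQRT2              # approximation
    coeffs.append(a)
    return coeffs


def haar_idwt(coeffs):
    """Inverse of haar_dwt."""
    a = coeffs[-1]
    for d in reversed(coeffs[:-1]):
        out = np.empty(2 * a.size)
        out[0::2] = (a + d) / SQRT2
        out[1::2] = (a - d) / SQRT2
        a = out
    return a


def keep_largest(c, keep_fraction):
    """Hard threshold: zero all but (about) the largest keep_fraction of |c|."""
    k = max(1, int(round(keep_fraction * c.size)))
    thr = np.sort(np.abs(c))[-k]
    return np.where(np.abs(c) >= thr, c, 0.0)


def rle_encode(c):
    """Lossless (zeros_before, value) pairs; a final (run, 0.0) flushes trailing zeros."""
    pairs, run = [], 0
    for v in c:
        if v == 0.0:
            run += 1
        else:
            pairs.append((run, float(v)))
            run = 0
    if run:
        pairs.append((run, 0.0))
    return pairs


def rle_decode(pairs):
    """Expand the pairs produced by rle_encode back into a coefficient vector."""
    out = []
    for run, v in pairs:
        out.extend([0.0] * run)
        if v != 0.0:
            out.append(v)
    return np.array(out)


def compress(x, keep_fraction, levels=4):
    coeffs = haar_dwt(x, levels)
    sizes = [c.size for c in coeffs]
    return rle_encode(keep_largest(np.concatenate(coeffs), keep_fraction)), sizes


def decompress(pairs, sizes):
    flat = rle_decode(pairs)
    return haar_idwt(np.split(flat, np.cumsum(sizes)[:-1]))


rng = np.random.default_rng(0)
x = 5.0 * rng.standard_normal(1024)          # hypothetical background noise
for t in (100, 400, 700):                     # three hypothetical spikes
    x[t:t + 16] -= 150.0 * np.hanning(16)
for keep in (0.05, 0.10, 0.20, 0.50):
    pairs, sizes = compress(x, keep)
    mse = np.mean((x - decompress(pairs, sizes)) ** 2)
    print(f"keep {keep:.0%} of coefficients: {len(pairs)} RLE pairs, MSE {mse:.2f}")
```

Because the transform is orthonormal, keeping more coefficients can never increase the MSE, which is the monotone behavior summarized in Fig. 5; the RLE stage itself is lossless.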
4.2 Spike Sorting
Thresholding the output DWT coefficients can also be used for
spike sorting [12, 15]. Owing to the distinct time-frequency
characteristics of spike waveforms from different neurons, the
largest coefficients are likely to reside in different subspaces
and can therefore be used for spike sorting. Briefly, this can be
achieved by choosing the single most significant coefficient per
event and transmitting only the time stamp of that coefficient.
Fig. 6 shows an example of sorting three neurons on a single
channel using this strategy. The black trace in the top row of
Fig. 6a shows the actual data, and the following rows show the DWT
expansion coefficients. The thresholds are chosen to pass only the
most significant coefficient per event (an event is ~1.5 ms on any
given electrode channel). Using these surviving coefficients (red
dots), spike waveforms were reconstructed (red trace in the top row
of Fig. 6a). Despite the poor reconstruction, the firing pattern
(spike train) of each of the three neurons can still be extracted
with high fidelity. For example, Fig. 6b shows the 2D feature space
obtained using Principal Component Analysis (PCA) of the spike
waveforms in each node. The three neuronal clusters are labeled
with red, green and blue hollow circles for clarity. Coefficients
surpassing the neuron-specific threshold in each node are
represented by filling the color-labeled circles. It can be seen
that while d4 is useful for detecting almost all spike events from
all three neurons, d2 and d3 are most effective in detecting the
red and the blue labeled neuronal clusters, respectively. The
green-labeled neuronal cluster, on the other hand, can be detected
in d4 once the events detected in d2 or d3 are excluded from that
feature space. Table IV summarizes a system-level comparison of
these features with other state-of-the-art systems.
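The decision logic described for Fig. 6 can be written as a short rule. The sketch below is hypothetical: it assumes each event has already been reduced to the peak coefficient magnitude per node within its ~1.5 ms window, uses made-up neuron-specific thresholds, and hard-codes the d2/d3/d4 assignment observed in this particular example. In practice, the node-to-neuron mapping and thresholds would come from data collected in monitoring mode.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Event:
    time_stamp: int            # sample index where the event was detected
    peak: Dict[str, float]     # node ("d2", "d3", "d4") -> largest |coefficient| in the window


def label_event(ev: Event, thresholds: Dict[str, float]) -> Optional[str]:
    """Hypothetical rule mirroring the Fig. 6 discussion.

    d2 flags the red unit and d3 the blue unit; an event that crosses only
    the d4 threshold (i.e. after excluding d2/d3 detections) is green.
    """
    if ev.peak.get("d2", 0.0) > thresholds["d2"]:
        return "red"
    if ev.peak.get("d3", 0.0) > thresholds["d3"]:
        return "blue"
    if ev.peak.get("d4", 0.0) > thresholds["d4"]:
        return "green"
    return None                # nothing crossed threshold: not treated as a spike


def sensing_mode(events: List[Event],
                 thresholds: Dict[str, float]) -> List[Tuple[int, str]]:
    """Emit only (time stamp, unit label) pairs; waveforms are never transmitted."""
    out = []
    for ev in events:
        label = label_event(ev, thresholds)
        if label is not None:
            out.append((ev.time_stamp, label))
    return out


thresholds = {"d2": 40.0, "d3": 35.0, "d4": 30.0}        # hypothetical values
events = [
    Event(120, {"d2": 55.0, "d3": 10.0, "d4": 60.0}),
    Event(480, {"d2": 5.0, "d3": 48.0, "d4": 52.0}),
    Event(910, {"d2": 8.0, "d3": 12.0, "d4": 45.0}),
    Event(1300, {"d2": 3.0, "d3": 4.0, "d4": 9.0}),       # below all thresholds
]
print(sensing_mode(events, thresholds))
```

Checking d2 and d3 before d4 implements the "exclude d2/d3 events from d4" step directly, so each event yields at most one label and one time stamp.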
Figure 6. A demonstration of the sensing mode. (a) Top row:
actual recording (black) and the reconstruction (red). Following
rows: the wavelet-tree decomposition at nodes d2, d3, d4 and a4,
respectively. (b) The two-dimensional feature space of the spike
waveforms from three neurons (red, green and blue circles).
TABLE IV. SYSTEM LEVEL FEATURE COMPARISON
Refs
Data
Reduction
[4, 5]
[8, 9]
[6, 7]
[10]
WIMNIS
No
Yes
Yes
Yes
Yes
Online Spike Sorting

Spike
Feature
Spike
Detection
Extraction
labeling
No
No
No
No
No
No
Yes
No
No
Yes
Yes
No
Yes
5. CONCLUSION
We presented an efficient implementation of a neuroprocessor for
neural signal compression and spike sorting. The system is
implemented on a 5 mm × 5 mm nano-FPGA that consumes less
than 5 mW of power to process 32 channels of neural data sampled
at 25 kHz with 8-bit resolution. This brings the total power/size
demand to less than 0.122 mW·mm²/channel. The system is highly
scalable, programmable and cost effective, making it well suited
for neuroscience research with small animals as well as for
clinical BMI applications. Work on fully integrating this system
with analog conditioning and wireless telemetry is ongoing.
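The measured core power can be checked against the < 65 mW/cm² tissue-heating bound cited in Section 1. This worked calculation assumes, conservatively, that all of the sensing-mode core power from Table III is dissipated uniformly over the 5 mm × 5 mm die footprint, and it covers the neuroprocessor only; the analog front end and telemetry of the full NIN are not included.

```python
DIE_SIDE_MM = 5.0        # 5 mm x 5 mm nano-FPGA, from the text
CHANNELS = 32
LIMIT_MW_PER_CM2 = 65.0  # tissue-heating bound quoted in Section 1

# Sensing-mode core power from Table III (the highest of the three modes), in mW.
TABLE_III_SM_MW = {1.2: 4.92, 1.5: 6.23}


def power_density_mw_per_cm2(power_mw: float, side_mm: float = DIE_SIDE_MM) -> float:
    """Power spread uniformly over the die footprint (1 cm^2 = 100 mm^2)."""
    area_cm2 = side_mm * side_mm / 100.0
    return power_mw / area_cm2


for volts, p_mw in TABLE_III_SM_MW.items():
    density = power_density_mw_per_cm2(p_mw)
    print(f"{volts} V: {density:.2f} mW/cm^2 "
          f"({density / LIMIT_MW_PER_CM2:.0%} of limit), "
          f"{p_mw / CHANNELS:.3f} mW/channel")
```

Even at the 1.5 V core supply, the die-area estimate stays well below the 65 mW/cm² bound.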
6. REFERENCES
[1] M. A. Lebedev and M. A. L. Nicolelis, "Brain-Machine
Interfaces: Past, Present and Future," Trends Neurosci., vol. 29,
pp. 536-546, 2006.
[2] D. M. Taylor, S. I. H. Tillery, and A. B. Schwartz, "Direct
Cortical Control of 3D Neuroprosthetic Devices," Science, vol.
296, pp. 1829-1832, 2002.
[3] K. G. Oweiss, "A Systems Approach for Data Compression and
Latency Reduction in Cortically Controlled Brain Machine
Interfaces," IEEE Trans. Biomed. Eng., vol. 53, pp. 1364-1377, 2006.
[4] H. Miranda, V. Gilja, C. A. Chestek, K. V. Shenoy, and T. H.
Meng, "HermesD: A High-rate Long-range Wireless Transmission
System for Simultaneous Multichannel Neural Recording
Applications," IEEE Trans. Biomed. Circuits Syst., vol. 4,
pp. 181-191, 2010.
[5] Y. K. Song, D. A. Borton, S. Park, W. R. Patterson, C. W.
Bull, et al., "Active Microelectronic Neurosensor Arrays for
Implantable Brain Communication Interfaces," IEEE Trans. Neural
Syst. Rehabil. Eng., vol. 17, pp. 339-345, 2009.
[6] R. R. Harrison, P. T. Watkins, R. J. Kier, R. O. Lovejoy, D. J.
Black, B. Greger, and F. Solzbacher, "A Low-power Integrated
Circuit for a Wireless 100-electrode Neural Recording System,"
IEEE J. Solid-State Circuits, vol. 42, pp. 123-133, 2007.
[7] A. M. Sodagar, G. E. Perlin, Y. Ying, K. Najafi, and K. D.
Wise, "An Implantable 64-channel Wireless Microsystem for
Single-unit Neural Recording," IEEE J. Solid-State Circuits, vol.
44, pp. 2591-2604, 2009.
[8] M. Rizk, I. Obeid, S. H. Callender, and P. D. Wolf, "A
Single-chip Signal Processing and Telemetry Engine for an
Implantable 96-channel Neural Data Acquisition System," J.
Neural Eng., vol. 4, p. 309, 2007.
[9] Y. Perelman and R. Ginosar, "An Integrated System for
Multichannel Neuronal Recording with Spike/LFP Separation,
Integrated A/D Conversion and Threshold Detection," IEEE Trans.
Biomed. Eng., vol. 54, pp. 130-137, 2007.
[10] M. S. Chae, Z. Yang, M. R. Yuce, L. Hoang, and W. Liu, "A
128-channel 6 mW Wireless Neural Recording IC with Spike
Feature Extraction and UWB Transmitter," IEEE Trans. Neural Syst.
Rehabil. Eng., vol. 17, pp. 312-321, 2009.
[11] F. Zhang, M. Aghagolzadeh, M. Kiani, M. Ghovanloo, and K. G.
Oweiss, "WIMNIS 1.0: Wireless Intracortical Multichannel Neural
Interface System for Neural Recording in Freely Behaving
Subjects," in preparation.
[12] M. Aghagolzadeh and K. G. Oweiss, "Compressed and
Distributed Sensing of Neuronal Activity for Real Time Spike
Train Decoding," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 17,
pp. 116-127, 2009.
[13] K. G. Oweiss, A. Mason, Y. Suhail, A. M. Kamboh, and K. E.
Thomson, "A Scalable Wavelet Transform VLSI Architecture for
Real-time Signal Processing in High-density Intra-cortical
Implants," IEEE Trans. Circuits Syst. I, vol. 54, pp. 1266-1278,
2007.
[14] http://www.actel.com/documents/IGLOO_nano_DS.pdf.
[15] K. G. Oweiss, Statistical Signal Processing for Neuroscience
and Neurotechnology, Academic Press, Elsevier, pp. 15-74, 2010.
