Using Hadoop for storing stock market tick data

I'm having fun learning about Hadoop and the various projects around it, and I currently have two different strategies in mind for
building a system to store a large collection of market tick data. I'm just getting started with both Hadoop/HDFS and HBase, but I'm hoping
someone can help me plant a system seed that I won't have to junk later using these technologies. Below is an outline of my system and
requirements, with some query and data usage use cases, and lastly my current thinking about the best approach from the little documentation
I have read. It is an open-ended question; I'll gladly welcome any answer that is insightful and accept the best one, and feel free to comment on any
or all of the points below. - Duncan Krebs

System Requirements - Be able to leverage the data store for historical back testing of systems, historical data charting and future data
mining. Once stored, the data will always be read-only; fast data access is desired but not a must-have when back testing.

Static Schema - Very simple, I want to capture 3 types of messages from the feed:

1. Timestamp including date, day, time
2. Quote including Symbol, timestamp, ask, askSize, bid, bidSize, volume... (about 40 columns of data)
3. Trade including Symbol, timestamp, price, size, exchange... (about 20 columns of data)

Data Insert Use Cases - Either from a live market stream of data or lookup via broker API

Data Query Use Cases - Below demonstrates how I would like to logically query my data.

1. Get me all Quotes, Trades, Timestamps for GOOG on 9/22/2014

2. Get me all Trades for GOOG, FB BEFORE 9/1/2014 AND AFTER 5/1/2014

3. Get me the number of trades for these 50 symbols for each day over the last 90 days.

The Holy Grail - Can MapReduce be used for use cases like those below?

1. Generate meta-data from the raw market data through distributed agents. For example, write a job that will compute the average trading
volume on a 1-minute interval for all stocks and all sessions stored in the database. Create the job with an agent for each
stock/session that I tell which stock and session it should compute this value for. (Is this what MapReduce can do? A rough sketch of the aggregation I mean follows this list.)
2. Can I add my own util code to the agents' classpath so that, for example, the use case above could publish its value into a central
repo or messaging server? Can I deploy an agent as an OSGi bundle?
3. Can I create different types of agents for different types of metrics and scores that are executed every morning before pre-market trading?
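
To make point 1 concrete, this is roughly the aggregation I have in mind, sketched as SQL; the table and column names (trades, symbol, ts as a unix timestamp in seconds, size) are placeholders, not my actual schema:

select symbol, avg(minute_volume) as avg_1min_volume
from (
    -- total traded size per symbol per 1-minute bucket
    select symbol, floor(ts / 60) as minute_bucket, sum(size) as minute_volume
    from trades
    group by symbol, floor(ts / 60)
) per_minute
group by symbol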

High Frequency Trading


I'm also interested if anyone can share some experience using Hadoop in the context of high-frequency trading systems. Just getting into this
technology, my initial sense is that Hadoop can be great for storing and processing large volumes of historical tick data; if anyone is using this for
real-time trading I'd be interested in learning more! - Duncan Krebs

hadoop hdfs hstore hdfstore hft

asked Sep 23 '14 at 0:08


Duncan Krebs
1,438 13 29

2 Can you share what you mean by HFT - Hadoop is not meant for real-time processing generally. Spark will
perform better due to in-memory processing mostly, but most of HFT (depending on your timescale) will
require you to avoid disk and even the network. Once again, it depends on your timescale for HFT. ali haider
Sep 23 '14 at 0:20

Hi ali haider, that is my sense too - that Hadoop in the HFT ecosystem is best suited for storing tick data
coming from a real-time stream or historical lookup. I was just curious if people are using it for things like
intraday tick data storage that is analyzed in real time across entire market segments, which becomes difficult
in memory. Thanks for the comment - will look at Spark to feed my curiosity. Duncan Krebs Sep 23 '14 at
0:31

1 You can more easily decide between disk and in-memory storage via Spark. With Hadoop, you may have to
look at an in-memory data grid (or preferably off-heap memory on a large-RAM node) to tackle what you're trying
to do. ali haider Sep 23 '14 at 0:34
Do you have a tick data stream? Tomasz Waszczyk Sep 23 '14 at 5:45

Yes, multiple streams and I want to start storing the tick data into a store for future back testing. I'm looking
to use a solution like Hadoop to store this data. Duncan Krebs Sep 24 '14 at 1:02

1 Answer

Based on my understanding of your requirements, Hadoop would be a really good solution to
store your data and run your queries on it using Hive.

Storage: You can store the data in Hadoop in a directory structure like:

~/stock_data/years=2014/months=201409/days=20140925/hours=01/file

Inside the hours folder, the data specific to that hour of the day can reside.

One advantage of using such a structure is that you can create external tables in Hive over this
data with your partitions on years, months, days and hours. Something like this:

Create external table stock_data (schema)
PARTITIONED BY (years bigint, months bigint, days bigint, hours int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '~/stock_data'
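
Note that Hive does not discover the partition directories of an external table on its own; each partition has to be registered once, either explicitly (the values below mirror the example path above) or via a metastore repair:

ALTER TABLE stock_data ADD PARTITION (years=2014, months=201409, days=20140925, hours=1)
LOCATION '~/stock_data/years=2014/months=201409/days=20140925/hours=01'

MSCK REPAIR TABLE stock_data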

Coming to the queries part, once you have the data stored in the format mentioned above, you
can easily run simple queries.

Get me all Quotes, Trades, Timestamps for GOOG on 9/22/2014

select * from stock_data where stock = 'GOOG' and days = 20140922

Get me all Trades for GOOG, FB BEFORE 9/1/2014 AND AFTER 5/1/2014

select * from stock_data where stock in ('GOOG', 'FB') and days > 20140501 and days < 20140901
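
The third query use case (trade counts per symbol per day) is a straightforward aggregation as well. As a sketch, assuming the schema includes a column such as message_type that distinguishes trade rows from quote rows (that column name is a placeholder, not something defined above):

select stock, days, count(*) as trade_count
from stock_data
where stock in ('GOOG', 'FB')    -- extend with your 50 symbols
and days >= 20140626             -- roughly 90 days before the question date
and message_type = 'TRADE'       -- placeholder column marking trade rows
group by stock, days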

You can run any such aggregation queries once a day and use the output to come up with
the metrics before pre-market trading. Since Hive internally runs MapReduce, these queries
won't be very fast.

In order to get faster results, you can use one of the in-memory projects like Impala or Spark.
I have myself used Impala to run queries on my Hive tables and I have seen a major
improvement in the run time for my queries (around 40x). Also, you wouldn't need to make any
changes to the structure of the data.

Data Insert Use Cases: You can use tools like Flume or Kafka to insert data in real time
into Hadoop (and thus into the Hive tables). Flume is linearly scalable and can also help in
processing events on the fly while transferring them.
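
As a rough illustration, a minimal Flume agent writing incoming tick events into the partitioned layout above could be configured along these lines; the agent/source names and the netcat source are placeholders standing in for a real feed adapter, and the HDFS path assumes the directory scheme shown earlier:

# hypothetical Flume agent: tick feed -> memory channel -> partitioned HDFS layout
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# placeholder source; replace with your real feed/broker adapter
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# HDFS sink writing into the years/months/days/hours partition directories
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /stock_data/years=%Y/months=%Y%m/days=%Y%m%d/hours=%H
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1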

Overall, a combination of multiple big data technologies can provide a really decent solution to
the problem you proposed, and these solutions would scale to huge amounts of data.

answered Sep 26 '14 at 20:48


Amar
2,220 10 18
