
Honours Individual Project Dissertation

STOCK PRICE PREDICTION USING


SOCIAL MEDIA DATA

Chun Pang Adrian Wong


March 27, 2019

Abstract

Stock prediction is a highly competitive field and a topic that has been heavily researched.
However, many of the current products on the market utilize only empirical data. This project
aims to investigate whether the movement of stock prices can be predicted solely from the use of
social media sentiment. The system design was split into four main components: data ingestion,
sentiment and feature breakdown, prediction model training, and a user application.
Twitter posts about companies and their products were analyzed and stored in a database,
before being deconstructed and used to train various models. Applying the principle of
Granger causality, each day's data was associated with the price movements of the following day.
The product was able to achieve a 57.31% accuracy in predicting the movement of stock prices
at closing time the next day, with an expected annual return on investment of 50.12%. The data
ingestion process worked well, retrieving consistent information about any topic over
any period of time and giving uniform results reliably, indicating a high precision in converting
public sentiment into data. That said, the performance of the classifier model was the main
limiting factor and requires further investigation. Overall, this product may be viable in the real
world, but as a supplement to be used in combination with other methods of prediction.

Acknowledgements

I would like to express my sincere gratitude to the University of Glasgow for offering me the
opportunity to complete my studies here.
In addition, I offer special thanks to Dr. Iadh Ounis, who has been my project supervisor and
supported me throughout my journey.
I am also hugely grateful to the members of Peak Capital Limited, who took me in and provided
me with invaluable mentorship and experience over the summer.
Finally, I would like to thank Mr. Zachary Hagan for his unending emotional support over the
course of this project.

Education Use Consent

I hereby grant my permission for this project to be stored, distributed and shown to other
University of Glasgow students and staff for educational purposes. Please note that you are
under no obligation to sign this declaration, but doing so would help future students.

Signature: Chun Pang Adrian Wong Date: 27 March 2019



Contents

1 Introduction
1.1 Motivations
1.2 Aims
1.3 Questions
1.4 Hypothesis
1.5 Dissertation Layout

2 Background
2.1 Essential background theory
2.1.1 Efficient Market Hypothesis
2.1.2 Random Walk Theory
2.1.3 Sentiment analysis
2.2 Related research
2.2.1 Twitter mood predicts the stock market
2.2.2 Stock prediction using twitter sentiment analysis
2.2.3 The effects of twitter sentiment on stock price returns
2.2.4 What we have learned
2.3 Similar products
2.3.1 TINO IQ
2.3.2 Market Sensei
2.3.3 I Know First
2.3.4 What we have learned
2.4 Summary

3 Requirements capturing
3.1 User Scenarios
3.2 Functional Requirements
3.2.1 Must have
3.2.2 Should have
3.2.3 Could have
3.2.4 Won't have
3.3 Non-functional requirements

4 Design
4.1 System design and architecture
4.1.1 Initial architecture proposal
4.1.2 Early prototyping architecture
4.1.3 Developmental architecture
4.1.4 Final architecture
4.1.5 Summary
4.2 Technology prototyping, selection and justification
4.2.1 Data retrieval
4.2.2 Twitter
4.2.3 Secondary methods of retrieval
4.2.4 Stock price retrieval
4.2.5 Data storage
4.2.6 Sentiment Analysis
4.2.7 NLTK Text Classification
4.2.8 Azure, Google Cloud Platform, TextBlob, VaderSentiment and SpaCy
4.2.9 Classification
4.2.10 User Application
4.2.11 Summary

5 Implementation
5.1 Data ingestion
5.1.1 Tweepy implementation
5.1.2 Diagram of how related keywords operated
5.1.3 Market data retrieval
5.1.4 Code snippet showing how historical data was collected
5.1.5 Preliminary feature deconstruction
5.1.6 Data storage
5.2 Sentiment analysis
5.2.1 Natural language processing
5.2.2 Collected data
5.3 Prediction
5.3.1 Data retrieval and structuring
5.3.2 Granger causality
5.3.3 Feature scaling
5.3.4 Principal component analysis
5.3.5 Classifier training
5.3.6 Prediction data retrieval
5.4 User application
5.4.1 Model saving
5.4.2 User interface
5.4.3 Standalone application
5.4.4 Web application
5.5 Summary

6 Evaluation
6.1 Performance analysis and refinement
6.1.1 Classifier metrics
6.1.2 Classifier comparison
6.1.3 Classifier configurations
6.1.4 PCA component analysis
6.1.5 Data reduction and filtering
6.1.6 Feature adding
6.1.7 Summary
6.2 Final evaluations
6.2.1 Unit testing
6.2.2 Ablation study
6.2.3 Training with fewer days
6.2.4 Random data split testing
6.2.5 Prediction consistency
6.2.6 Comparison with empirical standards
6.2.7 Summary

7 Conclusion
7.1 Requirements review
7.1.1 Functional requirements
7.1.2 Non-functional requirements
7.1.3 Achievements and failures
7.2 Product viability
7.2.1 Overall accuracy
7.2.2 Expected return
7.2.3 Verdict
7.3 Limitations
7.3.1 The classifier problem
7.4 Future development
7.4.1 Mood analysis
7.4.2 Terrier data ranking
7.4.3 Combination with empirical prediction methods
7.5 Summary

Appendices

A Appendices
A.1 Understanding the source code submission

Bibliography

1 Introduction

In this dissertation, we will explore the journey taken to produce an application that would predict
the movement of stock prices using social media sentiment. We will discuss the challenges, how
they were solved, and the barriers which must be overcome for future development.
In this chapter, I will go through the motivations and goals for the project, my initial hypothesis,
as well as how this dissertation has been laid out.

1.1 Motivations
Stock prediction has been a huge trend for the past decade. Today, almost three quarters of trades
are done by algorithms and computers [1]. Quant funds are some of the most profitable funds in
banks and are present in almost all large institutions. There are many types of automated trading
in the industry, with different frequencies, and a range of methodologies [5].
The use of computers to assist stock trading is not new. Electronic trading platforms have been
used since the 1970s to place orders for financial products over a network through financial
intermediaries and service providers [2]. These were used as a replacement for traditional floor
trading, where brokers had to handle transactions between themselves manually, as electronic
trading could be carried out by users from any location. These platforms provided essential
information, such as live market prices, volumes, and company statistics. As they developed, the
platforms started including tools which helped brokers predict future prices, such as charting
packages, news feeds, and technical analysis. Eventually, they would also allow traders to set up
automatic trading in order for them to trade at a higher frequency than humanly possible, based
on the parameters set out by the traders.
These parameters were based on existing trading models. Strategies for automated trading have
been developed and used since 1949. The most commonly used is trend following, where trends
in moving averages and price level movements are simply followed. Other examples include
volume-weighted average price and mean reversion, but we will not go into detail here, as
this is not a finance paper.
It was inevitable that, sooner or later, institutions would combine automated trading, high
frequency trading, and technical analysis as a strategy in their management of funds. As long
as whatever they used to predict the movement of prices performed slightly better than the 50%
of random chance, they would profit over large volumes of trades. This would become the basis of
automated trading systems (ATS).
An automated trading system is a program that automatically generates orders and submits them
to an exchange, following sets of predefined rules that represent trading strategies. With rules
based on technical analysis, such as theoretical buy and sell prices derived from the current
market price, and with trades and tasks carried out at rates orders of magnitude greater than
their human equivalents, this would soon become the norm for day trading brokers.

The usage of an ATS is also advantageous for a number of other reasons. On top of increased
trade frequency and calculation capacity, emotion is eliminated completely from the process, which is
important while the market is volatile [3]. The delay between a market action and the orders
placed in response is minimized, as trades are conducted almost instantaneously after rules are met.
It is also relatively quick and simple to test and evaluate a strategy before deployment into the
live market. The consistency of an ATS may entice investors who look for low risk investments,
and furthermore, diversifying a portfolio is made easy as an ATS allows for simultaneous
trading on multiple accounts, decreasing risk even more. However, systems often require careful
monitoring, as failures could incur heavy costs for the fund. A system could perform very well
during back testing but poorly when deployed into the live market. In 2012, Knight Capital
Group lost four times its net income in just 30 minutes due to a bug in one of their trading
algorithms [4].
One of the most dangerous scenarios is market disruption and manipulation. In 2010, during an
event known as the ’2010 Flash Crash’, the Dow Jones Industrial Average (DJIA) plummeted 1,000
points then recovered within minutes. New regulations had to be issued to control automated
trading market access.
As the automation of processes and trading was perfected and high frequency trading was no longer
an issue, focus shifted towards the algorithms and rules that drove the trades. Over the past
decade, algorithmic trading has been gaining traction with both retail and professional investors.
It is now widely used by investment banks, pension funds, mutual funds, and hedge funds. In
2014, Virtu Financial, a high frequency algorithmic trading firm, reported that during a five
year period the firm was profitable on 1,277 out of 1,278 trading days.

Figure 1.1: A graph showing the increase in use of algorithmic trading over the past two decades

Most of the strategies in algorithmic trading involve looking at numerical statistics. Quant funds
tend to hire Physics and Mathematics researchers to work on these funds, and attempting to
compete with these institutions as a student who is barely crawling through computing science
would be foolish. There are already methods in place for parsing and analyzing lengthy company
financial reports and generating a verdict on whether there has been an improvement since the last
report. However, it is only recently that algorithms have begun looking at news stories, and this is an
idea that seems promising. There is evidence that large institutions are already using strategies
related to sentiment, as mentions of specific words in the news can be directly correlated to stock
price movements. Social media sentiment, on the other hand, is still at a very hypothetical stage,
as any discussion on this topic lies mostly in academic papers.

1.2 Aims
In this project, I hope to devise a method of predicting stock prices using social media data.
This will be produced in the form of either a web application or a standalone application. The
application will have a way of ingesting social media data, deconstructing it into its bare features,
then using machine learning techniques to attempt to predict stock prices. The social media data
will be stored in a database, and models will be pre-built. The user can simply input a company
stock code, and the social media data related to that company will start being ingested. The user
may decide how much social media data to use, and afterwards, the application will process that
data through a pre-trained classifier model to get a prediction for their stock. The model will be
evaluated and tested.

1.3 Questions
In addition to building the application, I would like to decipher which factors in particular affect
stock prices the most, as well as the optimal conditions for training a classifier model. Will a
product like this work in the real world, or is it completely outclassed by traditional methods of
algorithmic trading?

1.4 Hypothesis
I believe that it is possible to achieve a better than random level of accuracy with predictions
using social media sentiment. As stated previously, and assuming that daily fluctuations in prices
remain consistent over long periods of time, using this product should, in theory, net a profit
over the long run.

1.5 Dissertation Layout


We will begin with some background information on finance, especially current theories on
stock prediction using sentiment. After that, we will explore some previous research done by
academic institutions, as well as some related products on the market right now, to see what we
can learn from them.
We will then go through requirements capturing for the product that is to be made. Software
design and architecture will then be examined, before going through the prototyping stages for
each individual component and why certain tools were used.
We will then explore the implementation of the project itself. An initial evaluation will be
discussed, along with what was learned from it and how the product was further improved
from that point. Lastly, a final evaluation will be carried out and a conclusion for the project will
be made.
For reference, I have named my product Enular, an anagram of 'neural', due to the machine
learning aspects of the project. I hope you enjoy this paper.

2 Background

In this chapter, we will be discussing existing market theories, what other researchers have done,
what companies with similar products are doing, and how we can learn from all of this.

2.1 Essential background theory


In this section, we will be exploring some of the theories when it comes to stock prediction. As
mentioned in section 1, algorithmic trading mainly utilizes numerical statistics to predict prices.
We will not be delving deeper into this, as this is the standard for many quant funds and has no
doubt been researched rigorously by people far smarter than I.
Instead, we will be looking at whether it is viable to predict stock prices using social media
sentiment as the main factor. This is fundamental knowledge for the rest of the paper. We will
explore some current market theories that are related to sentiment analysis in finance.

2.1.1 Efficient Market Hypothesis

Early studies on stock prediction were based on the efficient market hypothesis (EMH) [6]. EMH is
an investment theory suggesting that share prices reflect all available information, such as news,
investor reports, financial statements, and so on [8]. From this theory it was assumed that it
would not be possible to outperform the market on a consistent basis, because neither fundamental
nor technical analysis can produce excess returns relative to the market. Stocks would never
be undervalued or overvalued, always reflecting their true price. Fluctuations and volatility were
caused by new information being released to the public.

2.1.2 Random Walk Theory

EMH was built upon by the Random Walk Theory (RWT). Due to the unpredictability of news,
RWT essentially suggests that stocks take a random and unpredictable path. Changes in stock
prices have the same distribution and are independent of each other. Therefore, future stock
prices cannot be predicted using past movements or trends, and market indexes cannot be
outperformed without assuming additional risk [7]. These two theories remain controversial.
Firstly, there are studies indicating that stock prices may not follow a random walk, and can be
predicted to a certain extent using historical data [9]. We will not look further into this. Instead,
we attack the other pillar on which this theory stands: whether or not news is truly unpredictable.

2.1.3 Sentiment analysis

Numerous studies have shown that social media sentiment can be a very early
indicator of changes in the economic or commercial fields. Analysis of online activity has been
used to predict book sales [10], movie sales [11], product sales [12], and even disease infection
rates and consumer spending [13]. This raises the question: can we not predict the stock market
the same way? Public mood and sentiment may play an equally important role as news in terms
of influence on the stock market. To further understand this concept, we must examine some of
the previous academic studies conducted on this subject.

2.2 Related research


In this section, I will briefly explore some of the most significant research that has been done on
the subject of stock price prediction using tweets.

2.2.1 Twitter mood predicts the stock market

by Johan Bollen, Huina Mao, Xiao-Jun Zeng, 2011 [20]

Likely the most famous paper regarding stock prediction using sentiment. Bollen et al
investigated whether measurements of collective mood states derived from large scale Twitter
feeds are correlated to the value of the Dow Jones Industrial Average (DJIA) over time.
They obtained the text content, date, and post type of approximately 10 million tweets. Tweets
were filtered so that only those mentioning certain keywords related to feelings were accepted.
They analyzed the text content of the daily Twitter feeds with two mood tracking tools:
OpinionFinder, which measured positive versus negative mood, and Google Profile of Mood
States (GPOMS), which measured mood in terms of six dimensions (Calm, Alert, Sure, Vital,
Kind and Happy). They cross validated the resulting mood time series by comparing
their ability to detect the public's response to the presidential election and Thanksgiving Day in
2008.

Figure 2.1: A diagram showing Bollen’s system design

A Granger causality analysis and a Self Organizing Fuzzy Neural Network were then used
to investigate the hypothesis that public mood states, as measured by the OpinionFinder and
GPOMS mood time series, are predictive of changes in DJIA closing values. A DJIA value would
be linked to the sentiment performance of the previous n days. They then combined this model
with existing DJIA prediction models, which they did not alter.
Their results indicate that the accuracy of DJIA predictions can be significantly improved by the
inclusion of specific public mood dimensions, namely 'calm' and 'happiness', but not others. They
found an accuracy of 87.6% in predicting the daily up and down changes in the closing values
of the DJIA and a reduction of the Mean Average Percentage Error by more than 6%. This was
a 14.3% improvement over the baseline DJIA prediction model that they had built upon.
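This kind of lagged-dependence test is straightforward to reproduce at a small scale. The sketch below is illustrative only: statsmodels is my assumption (Bollen et al do not name their tooling), and the two series are random stand-ins.

    import numpy as np
    from statsmodels.tsa.stattools import grangercausalitytests

    # Stand-in daily series: DJIA returns and an aggregated mood score
    rng = np.random.default_rng(0)
    djia_returns = rng.normal(size=200)
    mood_scores = rng.normal(size=200)

    # Tests whether lags of the second column (mood) help predict the
    # first column (returns), for every lag from 1 up to maxlag days
    data = np.column_stack([djia_returns, mood_scores])
    grangercausalitytests(data, maxlag=3)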

2.2.2 Stock prediction using twitter sentiment analysis

by Anshul Mittal, Arpit Goel, 2010 [21]


A paper based on the previous work we examined, written by Bollen et al. In this
paper, Mittal and Goel applied sentiment analysis and machine learning principles to find the
correlation between "public sentiment" and "market sentiment". They used Twitter
data to predict public mood, and used the predicted mood and previous days' DJIA values to predict
stock market movements. These tweets were analyzed to get a degree of membership in
four classes: calm, happy, alert, and kind, similar to what Bollen et al had done. They then used
preceding days' moods and DJIA values to predict future stock movements.

Figure 2.2: A diagram showing Mittal’s system design

They proposed a new cross validation method for financial data and obtained 75.56% accuracy
using Self Organizing Fuzzy Neural Networks (SOFNN) on the Twitter feeds and DJIA values
from the period June 2009 to December 2009. They also implemented a naive portfolio
management strategy based on their predicted values.

2.2.3 The effects of twitter sentiment on stock price returns

by Gabriele Ranco, Darko Aleksovski, Guido Caldarelli, Miha Grčar, Igor Mozetič, 2015
[22]
One of the more recent papers. Ranco et al investigated the relations between the well-known
micro-blogging platform Twitter and financial markets. In particular, they considered, over a period
of 15 months, the Twitter volume and sentiment about the 30 stock companies that form the
Dow Jones Industrial Average (DJIA) index. They found a relatively low Pearson correlation and
Granger causality between the corresponding time series over the entire time period. However,
they found a significant dependence between Twitter sentiment and abnormal returns during
the peaks of Twitter volume. This holds not only for the expected Twitter volume peaks
(e.g., quarterly announcements), but also for peaks corresponding to less obvious events. They
formalized the procedure by adapting the well-known 'event study' methodology from economics and
finance to the analysis of Twitter data. The procedure allows events to be automatically identified as
Twitter volume peaks, the prevailing sentiment (positive or negative) expressed in tweets at
those peaks to be computed, and finally the 'event study' methodology to be applied to relate them
to stock returns. They showed that the sentiment polarity of Twitter peaks implies the direction of
cumulative abnormal returns. The amount of cumulative abnormal returns is relatively low
(about 1-2%), but the dependence is statistically significant for several days after the events.

2.2.4 What we have learned

There are several points we can take away from these papers. From the first two papers we
learn that public sentiment and mood are likely correlated with stock price movements, and
therefore there must be an emphasis on sentiment analysis features in our product. The third
paper contradicts the first two in that it found low correlation between the previous day's
data and the next day's stock price movements. However, it did find significance in large
volume peaks on Twitter, which we may be able to consider as another feature.

2.3 Similar products


In this section, we will examine some similar products that are already on the market.

2.3.1 TINO IQ

TINO IQ is a tool for making precise predictions on stocks showing signs of artificial manipulation.
They boast years of research analyzing various factors impacting the market. Their algorithms are
designed to detect human sentiment and machine trading patterns, converting them into trading
opportunities. However, their take on sentiment seems to be based entirely on pattern analysis,
instead of using text and media data.
Every day, thousands of stocks are scanned by TINO for patterns. These patterns are checked
against models trained with 20 to 30 years of data. The patterns are scored, and ones that show
high probability and effectiveness are recommended to users. Only blue chip stocks are analyzed.

Figure 2.3: A graphic showing the design of TINO IQ



TINO specializes in recognizing artificial manipulation in stocks, whether done intentionally
for profit or unintentionally by faulty algorithms in large quant funds. Stocks are often
purposefully 'pumped' with the intention of selling them off to inexperienced investors looking
to ride the wave, only for the price to plummet soon after, with the price manipulators profiting
greatly. With TINO, stocks are identified for this trait early on, and users are given a
recommendation along with a target sell price.
TINO works on a subscription model. Prices range from 50 USD a month to 3,000 USD a year.

2.3.2 Market Sensei

Market Sensei predicts a stock's most likely low, high, opening and closing prices daily. Users
can view predictions up to 7 days ahead. They offer information such as the best buy-in price
for a stock, expectations for returns, and a stock's range and volatility. They also offer a stock
training game with an educational aspect, as well as several other supplementary features.

Figure 2.4: An example of Market Sensei’s user interface

Accuracy rates for stock predictions are updated on a daily basis. Historical predictions can be
viewed and compared with actual prices. They do not offer any information on how their algorithms
work. They offer a very affordable subscription model, but the product is only available as a mobile
application or an API.

2.3.3 I Know First

I Know First is a fintech company that provides self-learning, AI based algorithmic forecasting
for capital markets to uncover the best investment opportunities. They provide daily investment
forecasts to users.
Their algorithm was developed by a team of researchers led by Dr. Lipa Roitman, who has over
20 years of research and experience in artificial intelligence and machine learning and a long
record in computer modeling of processes.
The algorithm generates daily market predictions for over 10,000 financial assets, including
stocks, commodities, ETFs, interest rates, currencies, and world indices, for short, medium
and long term time horizons.
The system outputs the predicted trend as a number, positive or negative, along with a wave
chart that predicts how the waves will overlap the trend. This helps the trader decide in which
direction to trade, at what point to enter the trade, and when to exit. The algorithm produces a
forecast with a signal and a predictability indicator.
Since the model is 100% empirical, the results are based only on factual data, thereby avoiding
any biases or emotions that may accompany human derived assumptions.

Figure 2.5: The system architecture of I Know First

Their pricing ranges from 170 USD to 439 USD a month, depending on the number of
stock picks the user wants per day.

2.3.4 What we have learned

There are many products available that offer stock price prediction, but few make use of
social media sentiment. It may seem apparent that the use of empirical data offers more accurate
results than sentiment, but this should nonetheless be investigated.

2.4 Summary
In both the research papers and the competitor products, empirical data was used in some way.
The question of whether the analysis of social media data alone could be viable in the market
today must therefore be investigated.

3 Requirements capturing

In this chapter, we will go over the requirements for the project. The requirements were first
created before work on the product began, and have been added to as more research came to light
and through meetings with the client (supervisor). They were updated constantly due to practical
limitations that presented themselves along the way. These revised requirements will attempt
to demonstrate a product that is unique in its own way, having aspects that have not been done
before, while answering questions about the viability of social media data as stock price indicators.

3.1 User Scenarios


Scenarios were created to assist in identifying the needs and requirements of potential users of
the product.
Adam is a professional day trader. He buys and sells hundreds of stocks every day and tries to
make as much money off those trades as possible. Adam ends the day with the same number of
stocks he started with: none. He will never hold a stock for more than a few hours. He has
many expensive day trading tools. He wants something simple that confirms his decision to buy
or short a stock. When Adam has made a prediction for a stock, he uses Enular to confirm his
guess.
Belle is a retail investor. She has some disposable cash saved up and would like to make some
money off the stock market. She has a lot of spare time at work, so trading during stock opening
hours is not an issue. Belle is not looking for long term investments, and will most likely be
holding a stock for several days maximum. Belle does not have the resources of a bank, and
therefore relies on free tools. She also knows very little about financial data and what they might
mean. She browses through popular stocks and uses Enular to decide whether she should invest.
Connor is retired. He has a good deal of retirement money saved up, and would like to make
some low risk investments. He will mainly look at blue chip stocks with minimal risks. He also
has nothing but time on his hands, so he will be trading throughout the day. His mind is not
what it once was, and he has problems learning new skills or processing complex data. He uses
Enular to determine what stocks to buy.

3.2 Functional Requirements


Here, we use the MoSCoW method to evaluate the functional requirements for the project,
exploring 'must have', 'should have', 'could have' and 'won't have'.

3.2.1 Must have

We examine functionality that is essential to the product.



M1 User interface - the product must have a user interface so that the user can easily do what
they want to do without navigating through code. GUI libraries such as tkinter can be
used.
M2 Stock choice input - the user must be able to input the stock that they wish to have predicted,
and this should include any company that is publicly traded.
M3 Twitter data ingestion - the system must have a method of gathering large amounts of
social media data in an efficient way. The data must include tweet content, tweet date,
tweet type (retweets), information about the author (such as number of friends, followers),
how well the tweet was received, such as number of likes. Tools for this include the official
Twitter API, Twitter4j, and tweepy.
M4 Topic filtering - must be able to retrieve data specifically for a particular topic, while
filtering out unrelated tweets.
M5 Sentiment analysis - using the contents of a tweet, the application must be able to generate
a numerical sentiment score for each tweet. This can be done with a number of libraries
such as OpinionFinder.
M6 Stock code dictionary - given a company, the stock code should be linked to the company
name so that even if the user does not have the full name of the company, which might
be long and include company types (Ltd), other components of the product will still be
functional.
M7 Financial data retrieval - with the stock code, the system should be able to retrieve that
stock's price information, specifically the change in price during that day. On top of that,
we should be able to get the price change for any day in the past, within reason.
M8 Feature break down - with the available twitter data mentioned in M3, the system must
disassemble it into a number of individual features for classification.
M9 Database storage and retrieval - data from the social media ingestion must be stored and
retrieved efficiently in order to be used by the classifier to build a model. The database will
most likely be a noSQL database.
M10 Feature scaling - as we deconstruct the features, more likely than not they will not be on
the same scale. We must have a way to scale them so that it is fair when we train models.
M11 Classification - using collected data, we must train a classifier model such as Decision Tree
or Support Vector Machine that will be used to make stock predictions.
M12 Immediate prediction data retrieval - as the user inputs a company or stock of their choice,
the system must immediately and automatically retrieve social media data for that exact
company.
M13 Stock prediction - as the main functionality of the project, the application must be able to
offer the user a prediction for their selected stock.
M14 Evaluation - the product will be evaluated and tested so that improvements can be made.

3.2.2 Should have

We look at functionality that is not essential, but highly desirable.


S1 Related words input - the company and stock code are not enough to gather the information
of a company. Public feeling towards words related to products or services related to a
company should also be taken into consideration. Users should be allowed to input their
own related keywords to more accurately gather opinion on a company and possibly
increase the accuracy of their prediction.
S2 Related words training - as we train the classifier with many different companies, we want a
way to link each of those companies with their respective keywords so that we can identify
which topics relate to which company.

S3 Standalone application - although a user interface will exist, we do not want the users to be
compiling and executing code every time they wish to run the product. On top of that,
there are many packages that will be required, meaning that the user must install them
on their own device. A standalone application will eliminate this problem as it can be run
without any dependencies.
S4 Web application - to make access even easier, a web application should be built so that
users will not even have to download the product.
S5 Empirical data comparison - to accurately assess the viability of stock prediction using
sentiment, we should compare it to a basic technical analysis baseline.
S6 Data clean up - tweets are messy by nature, causing problems such as inaccurate features
due to clutter. We should have a method of cleaning up tweets so that they only contain
information that we desire.
S7 Model saving - we do not want to train the classifier every time we wish to make a
prediction. That would also mean dependence on a database, which we would rather not open
up to everyone. Trained models should be saved and easily retrievable (see the sketch after
this list).
S8 Scaler saving - scaling features is an essential part of the project, and we want to apply the
same scaling to the data that the user retrieves. Again, we do not wish to reuse the training
data, so saving the scaler model will be optimal.
S9 Profit testing - we should have a method of obtaining the expected returns of a stock in
addition to the accuracy. This way we can ensure that it will not net our users a loss, even
if the evaluation metrics of the classifier are high.
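As a hedged sketch of how S7 and S8 could be satisfied (scikit-learn objects and the joblib library are assumptions at this stage, not final design decisions):

    from joblib import dump, load
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical trained objects standing in for the real model and scaler
    scaler = StandardScaler().fit([[0.0, 10.0], [1.0, 30.0]])
    clf = DecisionTreeClassifier().fit([[0.0, 1.0], [1.0, 0.0]], [0, 1])

    # Persist the classifier (S7) and the scaler (S8) so that predictions
    # never need to touch the training database again
    dump(clf, "classifier.joblib")
    dump(scaler, "scaler.joblib")

    # At prediction time, load them back without retraining
    clf = load("classifier.joblib")
    scaler = load("scaler.joblib")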

3.2.3 Could have

We discuss functionality that is desirable and would enhance user experience, but may be difficult
to implement or time consuming.
C1 Long term prediction - although the product is meant for short term prediction, we
could possibly adjust the classifier to output an aggregated prediction over a longer period,
although the accuracy of such prediction may be questionable.
C2 Exact price prediction - on top of up/down prediction, we could create a regression model
that predicts the exact closing price of a stock. We have the data, but again, the reliability
of such a product is in question.
C3 Text classification - we may implement our own text classification method for sentiment
analysis, so that we can favor certain finance related words over others. The problem with
this task is the sheer workload it involves, as it is a project in itself.
C4 Mobile application - a downloadable mobile app could be created for even greater outreach
and accessibility. Again, this would take a lot of work.
C5 Financial model comparison - on top of using empirical data for comparison, we could
implement some basic financial modeling scripts and compare those with our classifier.
These are highly accurate, well thought out equations used in professional settings, and
would probably blow my project out of the water.
C6 Tweet ranking - using the University of Glasgow’s own Terrier application, we could rank
the tweets and weigh certain samples more than others.
C7 Mood analysis - to even further develop our sentiment analysis, we could look into Google’s
famous GPOMS.
C8 Twitter volume analysis - as data is collected upon request of the user, the volume of
ingested tweets could be used as an additional feature.

3.2.4 Won’t have

For each of these features, we will discuss the reasons that they will not be included.
W1 Empirical data predictor - although the product will be tested against a regression model
that I wrote, this model will not be used in combination with the sentiment model to
improve accuracy, as the point of this project is to solely focus on sentiment.
W2 Financial models - Discounted Cash Flow and Three Statement Model are commonly used
models for calculating the value of a company, however, we will not be using these in our
project either.
W3 Macro sentiment analysis - linking macro-economic news such as political policies, country
GDPs, and trade wars, with changes in price for individual stocks. This was something
that I really wanted to include in my project, but I soon found out it was completely out of
my scope.
W4 Statistical data viewing - to keep the product flexible, automated retrieval of financial
statistics for the user will not be included. There are many stock exchanges, such as the
NASDAQ, the LSE, the NYSE, and countless more all around the world. Retrieval of
financial statistics would mean limiting the user to a certain group of exchanges. However,
we want the user to be able to get a prediction for any stock on any exchange in any part of
the world.
W5 Author and topic details - the product will retrieve social media data for any topic on
command; this is settled. However, it will not provide the user with details about specifically
what is being said, nor who is saying it, simply outputting numerical information.
Privacy issues must be considered, as we do not wish for this application to be unethical or
controversial in any way.

3.3 Non-functional requirements


We explore some requirements that involve the usability of the application.
NF1 Easy to use - the application should be accessible to someone who has zero computing
science background.
NF2 Easy to access - the application should be accessible to a wide range of users without
requiring them to install dependencies. On top of that, it should work on a variety of
devices, whether they want to use their phone, laptop or Amazon fire stick.
NF3 Fast - the product should be able to retrieve data and return a prediction in a short amount
of time.
NF4 Accurate - the prediction that we provide must meet a certain level of accuracy. If it is at
or below 50%, the user may as well flip a coin.
NF5 Reliable - the application should work at any time of the day, any day of the week. Packages
and dependencies of the application must therefore be selected carefully.
NF6 Configurable - the user should be able to choose additional configurations for their stock
pick, such as related topics to filter, the amount of social media to ingest, and the choice of
prediction model.
NF7 Informative - the product should provide information in addition to a prediction, such as
an aggregated sentiment score.
NF8 Maintainable - the code must be maintainable and extendable so it can be supported in the
future.
NF9 Tested for errors - stress tests for any given situation must be conducted on the product so
that it is reliable and does not fail under any circumstance when used.

4 Design

The design of the project has been split into two sections: one for the system design and
architecture, and one for component prototyping and selection.

4.1 System design and architecture


In this section, we will go through the stages of development the project has gone through. You
will see initial ideas, how the project has grown, design choices, and all the major changes along
the way. We will only be looking at the big picture here; we will go through each individual
component and choice of tools in the following section.

4.1.1 Initial architecture proposal

The project was heavily researched over the summer leading up to fourth year. I was ambitious,
wanting to create the most accurate stock prediction tool. Needless to say, I was way out of my
depth. However, the research and ideas would come in handy as the project developed. The initial
architecture is shown below.

Figure 4.1: Original planned architecture for the system

The initial idea was to get predictions from a number of classifier components and combine
them in a central predictor. This predictor would determine the weights of each prediction
and generate a final prediction, which was to have the best accuracy. Components included
a tool which would automatically generate related words for a topic, a sentiment analyzer,
a macroeconomic sentiment analyzer, prediction using popular financial models, and pattern
recognition using a neural network. Some of the ideas we kept, but most of these either required
too much work, were pointless, or were completely out of scope. We will discuss some of the things
we cut out and why.
Related word generation would have been a mess to implement for several reasons. Firstly, finding
related words for companies and stock names is completely different to finding related words in
a dictionary. Synonyms are established, but finding words related to 'Tencent' is another matter
altogether, requiring us to almost implement a search engine. The second reason is that even
if we had a tool for gathering related words on the spot, it would be extremely unreliable. For
example, http://relatedwords.org lists the top related word for 'Microsoft' as 'Nokia'. This would
have been a nightmare for stock sentiment ingestion, and for these reasons, this component was
replaced with a more sensible method.
At the start of the project, I imagined financial models being one of the foundation components.
They are used by professional analysts, researchers, and fund managers. On top of that, they
would be relatively simple to implement compared to the rest of the project. It is for this very
reason that they were not included. Financial modeling is common and highly accessible. Most
trading software includes it by default. I had initially hoped that combining information
from models with predictions from my own product would enhance its accuracy, but results
from these models do not change day to day. They use statistics from a company's financial reports,
which remain consistent, thereby only causing needless interference with the product. The
same went for regression models, which mainly extrapolated the past prices of individual stocks,
and pattern recognition using neural networks. However, these will come into play later, and
will be explained in more detail as we describe their implementation during the evaluation stages.
Macroeconomic sentiment analysis was likely going to be the most powerful component. Politics
and national news affect stock prices as much as, if not more than, company level news; look at the
2008 financial crisis, for example. However, despite how powerful this component would have
been, it was just not meant to be. There are again several problems with it. Firstly, we would
require a completely new method of ingesting data, crawling through various news sources, as
well as identifying the topic, the tone of the article, which stocks it relates to, which countries and
industries are affected, and so on. We would essentially need to build a robot that could read
and understand news at a human level. On top of that, we would need advanced knowledge of
economics to even know what to do with the data, a topic which experts have studied for years.
We would not simply be able to apply machine learning techniques to this due to the uniqueness
of each case. Regretfully, but to no surprise, this was not going to be part of the project.

4.1.2 Early prototyping architecture

Figure 4.2: Prototype system architecture

The first task at hand was to test the three core components of the project: social media
ingestion, sentiment analysis, and classification, shown in Figure 4.2. The other components used at this
stage would be temporary and quickly replaced down the line.
I implemented a Twitter search function that would gather a certain number of tweets on a certain
topic. A script that processed the content of these tweets was written, and the output of these
results would be temporarily stored in text files, with different directories for different companies
and days. The changes in stock prices were manually input by me, and another script would then
retrieve the collection of data and run it through a classifier. Some arrays of arbitrary features
were hard coded to test whether the model was generating predictions.
Many different techniques were tried, especially for the ingestion component (details in the next
chapter). Once these three core components were working, we were ready to replace the rest of
the parts.

4.1.3 Developmental architecture

The next stage was to implement a proper data storage tool, as well as to continue improving the
core components. I began by improving the ingestion component. Due to the limitations of the
search method of getting tweets, we began using a stream which filtered out undesirable
content and live streamed data about 10 companies at a time. The data would be unpacked,
converted from JSON into a readable format, and then broken down into a barrage of features.
Then a noSQL database was set up to store the features for each company on each day, with an
automated method of retrieving the up and down movement of stocks on a particular day, given
that the stock market was open. Social media data would be collected in the hours leading up to a
market open, dated the previous day, and paired with the change in that company's stock price over
the trading day that followed, a method designed around the principle of Granger causality.

Figure 4.3: Developmental system architecture

With the training and validation data now in the database, we created another script to retrieve
and structure the data. We scaled the features, due to the vast disparity in scale between them, and ran
them through a classifier to create a trained model. A range of classifiers was used, and the
training data was split so that we had a portion to test the evaluation metrics of different models.
The same process was carried out to obtain the data of the company we wanted to predict. It
would be processed the same way as the training data, and then run through one of the
models to obtain a prediction.

4.1.4 Final architecture

At this stage, it was a matter of improving usability, fine tuning the predictor for accuracy, and
setting up the evaluation techniques.
For the usability aspect, I wanted the whole product to make a prediction with only one user
action. To do this, I started working on the module that would turn out to be the application
itself. From the training stages of the project, all we needed were the classifier models and
a way to save the scaling and feature reduction settings. Once we had those, we could set up
another module which retrieved and streamed only the social media data needed to
make a prediction, loaded the trained models, and processed the data directly, without the need
for external storage, to generate a prediction. A method of connecting companies to related words
was created to obtain a wider range of data for any topic. Finally, a graphical user interface
would be implemented so that the product would become fully usable by anyone.
The accuracy of predictions at this stage was being constantly tested and improved from all angles.
Classifiers would be compared, classifier configurations would be tweaked, and features would be
adjusted, reduced, added and removed for the purpose of obtaining a slightly higher accuracy.
Testing and comparison techniques will be described in the evaluation chapter.

Figure 4.4: Final system architecture

Finally, methods of evaluation were carried out that judged classifier performance. The predictions
from our product would be compared to predictions from other methods. The viability and real
life practicality of the product would also be considered.

4.1.5 Summary

In this section we have explored the gist of the inner workings of this project. The system
architecture essentially revolves around four main components.
1. Data ingestion
2. Sentiment and feature analysis
3. Prediction
4. User application
The outcome is a product that will retrieve social media data about a stock on demand, break the data
down into features, and make a prediction. Although reliant on Twitter's API for data retrieval,
it has no other dependencies and should be easy to maintain and improve.

4.2 Technology prototyping, selection and justification


In this section, we will discuss the possible technologies we could have used for each component,
what some of the issues were with certain libraries, and justify our chosen tools. We will not go
through every component here, only the parts where tough choices had to be made. The pros and
cons of each option will be discussed to reach a verdict. Code samples shown here are not in
the final product; they only act as evidence that these choices were thoroughly explored
before being disregarded in favor of another option.

4.2.1 Data retrieval

The first major component of the project is a method of ingesting social media data and retrieving
stock price information. Twitter had been determined to be the primary source of social media
data since the start of the project. Several solutions were explored for this task.

4.2.2 Twitter

The official Twitter API was the only legitimate method of retrieving Twitter data, so it had
to be involved no matter the circumstances. This meant that registering as a developer
and authenticating was required, which was simple to achieve and quickly set up. The only
disadvantage of this was that I would be required to keep my authentication keys in the final
products and would need a way to protect them. There are a number of libraries which facilitate
the use of the Twitter API. The two that were considered were Twitter4J and Tweepy, which
the project would then be based around. Initially Twitter4J was the apparent choice; however, a
short time later it was replaced with Tweepy due to Python being better suited for the machine
learning stages later on in the project. Both had similar functionality, so transitioning was not
difficult.

4.2.3 Secondary methods of retrieval

Other methods of retrieving social media data were also considered. The StockTwits and Reddit APIs
were planned to be implemented at a later stage, but this never happened due to time
constraints.

4.2.4 Stock price retrieval

Again, several methods were considered for this task. Many of the APIs which retrieved live
stock quotes were subscription based with free trials and would soon require payment. This was
not an issue, since stock price retrieval would not be necessary for the standalone product, just
for the training process. That being said, I would still need access to it for at least 6 months,
which narrowed down the options. Eventually, I stumbled upon a humble Python module named
iexfinance. Despite doubts about its dependability, I began using it as a temporary solution,
as it met all the functional requirements I needed. I figured I would continue with it until I was
forced to switch to something more reliable. It has, luckily, lasted the entire training process
without need of replacement.
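For illustration, a minimal sketch of the kind of call iexfinance supported at the time (assuming the 0.4-era API; function and column names may differ between versions):

    from datetime import datetime
    from iexfinance.stocks import get_historical_data

    # Fetch daily OHLC data for a symbol over a short window
    start = datetime(2018, 10, 1)
    end = datetime(2018, 10, 5)
    prices = get_historical_data("MSFT", start, end, output_format="pandas")

    # Derive the up/down label: 1 if the stock closed above its open
    movement = (prices["close"] > prices["open"]).astype(int)
    print(movement)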

4.2.5 Data storage

At the start of my project, I was against the use of a database due to a badly designed initial system
architecture. I was convinced that the data I retrieved should be stored as a CSV dataset, so it
could be more easily used to train a classifier. This would also mean that it could be included
with the final packaged product, as opposed to having to set up hosting if I were to proceed with
a database. However, as discussed in the previous chapter, plans for the design of the product
quickly changed and hosting was no longer part of the specification. With the recommendation
of my supervisor, I set up a noSQL database using MongoDB, which conveniently had a Python
API, PyMongo.
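A hedged sketch of what one stored record could look like with PyMongo (the database, collection, and field names are illustrative, not the project's actual schema):

    from pymongo import MongoClient

    # Connect to a local MongoDB instance and pick a collection
    client = MongoClient("mongodb://localhost:27017")
    features = client["enular"]["daily_features"]

    # One document per (company, day): deconstructed tweet features
    # plus the price movement label they will be trained against
    features.insert_one({
        "symbol": "MSFT",
        "date": "2019-02-04",
        "avg_polarity": 0.31,    # mean TextBlob polarity for the day
        "avg_compound": 0.22,    # mean VADER compound score for the day
        "tweet_volume": 4120,
        "movement": 1,           # 1 = next trading day closed up, 0 = down
    })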

4.2.6 Sentiment Analysis

Sentiment analysis was the second major component of my project, and it proved to be a challenging
process with a rather anticlimactic solution. This component was interestingly one of the first
and last concepts explored during my project, with two very different goals each time. At the
beginning, the task was to analyze the sentiment of the contents of a particular tweet, then
assign it one or several numerical scores. Near the end of the project, after a large part of the
evaluation had been done, feature adding was explored and we came back to sentiment analysis.
This time around, I focused a lot more on other natural language processing methods to extract
more features from our archived tweets. We will go through the tools chronologically, thereby
starting with finding a method of detecting the polarity of a tweet.

4.2.7 NLTK Text Classification

With much ambition, I had hoped to implement my own text classifier that would be tailored to
keywords relating to stocks, finance, and economics. I began by using a Twitter dataset
from Kaggle [14] to train a classifier which would take a string of text and output a polarity score
indicating whether the contents of a tweet held a positive or negative sentiment.
Only the sentiment and tweet content were kept from the Kaggle dataset. Usernames, hashtags,
links, and stop words were then filtered out of each tweet. Individual words from each tweet were
split off to create a word list, which was then mapped onto a frequency distribution dictionary with
words as keys. The keys were then extracted to get a list of unique words.
To get the features, a method was implemented that took a document as an argument and
checked each word in the document against the list of unique words, returning a dictionary keyed
by the unique words with a boolean indicating whether each word was in the document. This
was done to each element in the dataset to create a LazyMap training set containing the features
for each element along with the sentiment. This was then used to train an NLTK Naïve
Bayes classifier, creating a model which would predict whether a string of text had positive or
negative sentiment. After testing, an accuracy score of 0.8854 was found.
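The pipeline just described can be reconstructed roughly as follows. This is a simplified sketch, not the project's exact code, and the tiny `dataset` stands in for the cleaned Kaggle (text, label) pairs:

    import nltk
    from nltk.corpus import stopwords

    # Tiny stand-in for the Kaggle data: (tweet_text, sentiment_label) pairs
    # (nltk.download('stopwords') may be needed on first run)
    dataset = [("I love this stock", "pos"), ("terrible quarter, selling now", "neg")]
    stop_words = set(stopwords.words("english"))

    def clean(text):
        # Drop usernames, hashtags, links, and stop words
        return [w.lower() for w in text.split()
                if not w.startswith(("@", "#", "http"))
                and w.lower() not in stop_words]

    # Frequency distribution over all words, from which the
    # list of unique words is extracted
    all_words = nltk.FreqDist(w for text, _ in dataset for w in clean(text))
    word_features = list(all_words)

    def document_features(text):
        # One boolean feature per unique word: present in this tweet or not
        doc_words = set(clean(text))
        return {word: (word in doc_words) for word in word_features}

    # LazyMap training set, then the Naive Bayes model described above
    train_set = nltk.classify.apply_features(document_features, dataset, labeled=True)
    classifier = nltk.NaiveBayesClassifier.train(train_set)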
However, I soon found out there were several problems with using this method practically. First
of all, the training set used was of questionable quality, with tweets obtained over two days in
2015 and a very strong emphasis on the presidential election. It also only gave a prediction of
positive, negative, or neutral, as opposed to the continuous polarity score I was after. On top of that,
filtering for, or placing emphasis on, words specifically related to stocks would be difficult to implement,
and time constraints had to be considered. Although there were solutions to these issues, this
method of sentiment analysis was disregarded, as there were several existing tools
that would outdo the one I tried to produce myself.

4.2.8 Azure, Google Cloud Platform, TextBlob, VaderSentiment and SpaCy

There was a range of libraries which offered sentiment analysis for Python. At first, the Azure
Text Analytics API and Google Cloud Platform's Natural Language API were considered due to
their prominence. However, as sentiment analysis was a required part of the packaged product, I
wanted to minimize the use of tools which required authentication keys. Therefore, TextBlob
and VADER were chosen instead. TextBlob is a module for natural language processing
which offers scores for sentiment polarity and subjectivity. VADER sentiment analysis is a
lexicon and rule based tool built specifically for the task at hand. It takes negations, punctuation,
capitalization, slang, emoticons and acronyms into consideration, generating scores for positivity,
negativity and neutrality, along with a compound score which aggregates the three measurements.
Finally, SpaCy would be used after the evaluation stages to further add features for classifier
training.
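To illustrate the two scorers side by side (a minimal sketch; the example tweet is invented):

    from textblob import TextBlob
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    tweet = "Loving the new Surface lineup, well done Microsoft!!!"

    # TextBlob: polarity in [-1, 1] and subjectivity in [0, 1]
    sentiment = TextBlob(tweet).sentiment
    print(sentiment.polarity, sentiment.subjectivity)

    # VADER: positive/neutral/negative proportions plus an aggregated
    # compound score in [-1, 1]
    scores = SentimentIntensityAnalyzer().polarity_scores(tweet)
    print(scores["pos"], scores["neg"], scores["compound"])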

4.2.9 Classification

The third major component of our project was building a predictor. The choice for this task was
fairly evident from the beginning: a range of tools from Sklearn would meet our requirements
for both training and evaluation.
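A rough sketch of the train-and-evaluate loop these tools enable; the classifier choice, PCA size, and random stand-in features below are placeholders, not the tuned configuration discussed later:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    # Stand-in data: a feature matrix and up/down labels
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 20))
    y = rng.integers(0, 2, size=300)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Scale features, reduce dimensionality, then fit a classifier
    scaler = StandardScaler().fit(X_train)
    pca = PCA(n_components=10).fit(scaler.transform(X_train))
    clf = SVC().fit(pca.transform(scaler.transform(X_train)), y_train)

    # Held-out accuracy, to be judged against the 50% coin-flip baseline
    preds = clf.predict(pca.transform(scaler.transform(X_test)))
    print(accuracy_score(y_test, preds))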

4.2.10 User Application

The final task for the project was to package the code into a usable format with few dependencies.
Tkinter was used to create a user interface for the standalone application, and PyInstaller to
package the code into an executable for multiple operating systems. For the web application,
the Play and Django frameworks were explored, but in the end Flask was used due to the
lightweight nature of the task. Development of a mobile app using Android Studio was initiated,
but forgone due to time constraints.
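As an indication of how lightweight the web layer can be, here is a skeleton of a single prediction route (illustrative only; the real ingestion and model calls are omitted):

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/predict")
    def predict():
        symbol = request.args.get("symbol", "").upper()
        # In the real application: stream tweets for `symbol`, build features,
        # and run them through the saved scaler and classifier (omitted here)
        return jsonify({"symbol": symbol, "prediction": "up"})

    if __name__ == "__main__":
        app.run()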

4.2.11 Summary

A range of tools were explored, shortlisted, and selected during the planning stages of each
component. With everything we need in place, we are ready to discuss how the product was
built.

5 Implementation

The following two chapters will go through the implementation of the product in its entirety.
In this project, I went through two coding stages and two evaluation stages. Approximately
80% of the tangible content and code was completed in the first stage, which produced a full
product meeting the requirements, after which evaluation was carried out. At that stage, I
discovered a number of additional features I could implement that could potentially further
elevate the quality of the project. Therefore, in this section, we will discuss the implementation
of the project, followed by initial analysis and improvements, and then the final evaluations.
As with the previous chapter, we will split this section into the four main components, going
through their respective sub-components or related modules in each section.

5.1 Data ingestion


This section will cover everything involving the intake of data. Every day, the social media data
of 10 blue chip stock companies would be collected, with the previous day’s change in stock
prices paired to each one. Between 3,000 and 5,000 tweets were collected per day; however,
most of these would be eliminated during the cleaning process.

5.1.1 Tweepy implementation

In the previous chapter we discussed the usage of Tweepy as the main method of data retrieval.
As it uses the official Twitter API, authentication was necessary but simple to achieve.
Tweepy offered both search and streaming capabilities. As the scale of the project grew, search
became less and less efficient, so data was thereafter retrieved solely via streaming.
In order to stream, a listener class was implemented which inherited properties and methods from
Tweepy’s StreamListener class. A stream was then created using the Twitter API authentication
keys and the listener class. Streaming itself was then initiated by calling the stream’s filter method.

Figure 5.1: Code snippet of Tweepy in use
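As the snippet itself is an image, the following is a minimal reconstruction of the setup described
above, assuming Tweepy 3.x; the credentials are placeholders, process_tweet stands in for the
downstream handling, and keywords holds the 40 company filter words:

import json
import tweepy

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

class TweetListener(tweepy.StreamListener):
    # Inherits from Tweepy's StreamListener; called for every incoming tweet.
    def on_data(self, raw_data):
        tweet = json.loads(raw_data)  # JSON object to dictionary for unpacking
        process_tweet(tweet)          # hypothetical downstream handler
        return True

    def on_error(self, status_code):
        return status_code != 420     # disconnect if rate limited

stream = tweepy.Stream(auth, TweetListener())
stream.filter(track=keywords)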

The result was a JSON object every time a tweet was retrieved. This was converted into a
dictionary format using Python’s JSON module, which simplified the unpacking process.
Almost none of the tweets had consistent keys once converted to dictionaries. Some had
no location, or no posting date, or no author name, and so on. Many had vital data missing
which could not be replaced with substitute values. Try and except clauses were used to suppress
exceptions during streaming, and a check was put in place to ensure that all necessary data was
present for each element before posting to the database. This ensured that features would be
consistent for each object.
The dates for the tweets had to be carefully monitored. Although using the datetime module or
the tweet’s posting date may seem the obvious way to keep track of object dates, a severe issue
would later present itself. Tweets collected during the period before the stock market opened
had to be dated the day prior, and tweets collected on Monday mornings had to be dated the
previous Friday. This meant that the date input was manual for every day the stream was run.
The reason why will be explained further down in this chapter, because it also involves the
classification process.

5.1.2 Related keyword querying

The first order of business was to set up the querying mechanics for streaming. For each of
the 10 companies I focused on, there was a company name, a stock code, and two related
words for that company. Each of these four words for all 10 companies was used as a filter
for the Twitter data stream. As elements obtained from the data stream did not indicate which
filter they came from, it was difficult to assign elements to companies. As an alternative to running
40 streams synchronously with multiprocessing, a function was used to identify which keyword
was detected and which company that keyword belonged to, and the data was subsequently
stored under that company in the database. The stock code and related words were stored as
dictionary values with the overseeing company name as the key.

Figure 5.2: Diagram illustrating the related word process
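A simplified sketch of this mechanism is shown below; the dictionary entries are illustrative,
with Apple’s values taken from the consistency test in Section 6.2.5:

# Stock code and related words stored as dictionary values under the company name.
companies = {
    "Apple": {"code": "AAPL", "related": ["iPhone", "Mac"]},
    # ... nine further companies
}

def match_company(tweet_text):
    # Identify which keyword was detected and which company it belongs to.
    text = tweet_text.lower()
    for name, info in companies.items():
        for keyword in [name, info["code"]] + info["related"]:
            if keyword.lower() in text:
                return name
    return None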

5.1.3 Market data retrieval

Using the iexfinance module, the change in stock price was obtained. Two methods were
required for this. The first simply offered the most recent price change of a stock, which was a
built-in function by default. The second returned the change of the stock price on any given
day. As mentioned above, tweets had to be dated the working day before they were collected,
and the same applied to stock prices. The historical data retrieval function did not offer a value
for the percentage change of the stock on a day, only the open and closing prices, which were
sufficient after applying a brief formula.

Figure 5.3: Code snippet showing how historical data was collected
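A hedged reconstruction of the historical retrieval is given here; the exact import path and
output layout varied between iexfinance versions, so this assumes a version where historical
records in JSON format are keyed by date:

from datetime import datetime
from iexfinance.stocks import get_historical_data  # path varies by version

def daily_change(symbol, day):
    # Percentage change on a given day, computed from open and close prices.
    data = get_historical_data(symbol, start=day, end=day, output_format="json")
    prices = data[day.strftime("%Y-%m-%d")]
    return (prices["close"] - prices["open"]) / prices["open"] * 100

print(daily_change("AAPL", datetime(2019, 3, 1)))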



5.1.4 Spam filtering

Approximately 90% of collected tweets were filtered out in this first stage. Tweets were checked
to ensure their contents were not spam, did not consist solely of links, and were not gibberish,
while also ensuring that each tweet’s content contained at least one of the keywords of its
company, a check which was surprisingly not performed by default.

5.1.5 Preliminary feature deconstruction

Figure 5.4: Initial 15 features

Using a mixture of available data and analysis of tweet content, each retrieved object was
deconstructed into an initial 15 features, shown in Figure 5.4, before even being stored in the
database. A number of these features were already included in the JSON object by default, while
others required analyzing the tweet content itself.

Figure 5.5: Features grouped by similar traits

Additionally, the features were split into four more or less even groups, as shown in Figure 5.5.
Each of these groups represented an aspect of the tweet. Six more features were added at a later
stage during evaluation; see Section 6.1.6.

5.1.6 Data storage

MongoDB was used as the sole method of data storage, accessed via the PyMongo module. Each
viable object was added to the database with a single post. The 15 features above were all
empirical and were posted to the database along with the company name, stock code, stock price
change, date, and tweet content. This meant that further features could be added at any stage.
Backups of the database were made frequently.
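A minimal sketch of a single post follows; the database and collection names, and the variable
names, are invented for illustration:

from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017/")["project"]["tweets"]

# One post per viable object: the 15 empirical features plus identifying fields,
# so that further features can be appended at any stage.
post = dict(features)  # the 15 deconstructed features
post.update({"company": company, "stock_code": code,
             "price_change": change, "date": date, "content": content})
collection.insert_one(post)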

5.2 Sentiment analysis


Implementation of the failed NLTK text classifier was described in the previous chapter. Using
TextBlob and VaderSentiment, sentiment features were produced both before and after tweets
went through the database. At this stage of the project, only polarity scores had been used; see
Section 6.1.6 for elaboration.

5.2.1 Natural language processing

spaCy would be used to break a tweet down. It was able to determine people, organizations, dates,
nouns, verbs, locations and so on. I would use these as additional features after the initial evaluation
had been carried out, to examine how language processing would improve performance.
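For example, a short sketch of the kind of breakdown spaCy provides; the sentence is invented:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple unveiled a new iPhone in Cupertino on Tuesday.")

# Named entities: people, organizations, dates, locations and so on.
print([(ent.text, ent.label_) for ent in doc.ents])

# Nouns and verbs counted from part-of-speech tags (used later as features).
nouns = sum(1 for token in doc if token.pos_ == "NOUN")
verbs = sum(1 for token in doc if token.pos_ == "VERB")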

5.2.2 Collected data

Here, we will disclose statistics about the data we retrieved.

Figure 5.6: Table containing number of tweets for each company

This gave us an aggregate of approximately 40 good-quality (non-spam, non-gibberish) tweets
per sample, with 400 samples over 40 consecutive market days.

5.3 Prediction
Now that we had the data, we needed to build a model which would predict whether a stock
would go up or down the next day.

5.3.1 Data retrieval and structuring

Extensive trialing and prototyping were conducted on the database, resulting in certain days
which could not be used. The posting structure gradually changed as more features and values
were added; for this reason, posts before a certain date were not used. Lists of the 10 companies
and the viable dates were created manually. Simple loops were then implemented to parse
every viable post.

Figure 5.7: Code snippet showing how posts were iterated

Features in each post were appended to initial arrays, one per feature. These initial arrays were
then aggregated to get a single array of features corresponding to one company on one day.
This array was then appended to the main training data set, a 2D array with an array of size 14
for each element. A similar process was carried out to get the validation data set of price
changes.

5.3.2 Granger causality

The data we had correlated samples to the prices of the previous day, since that was the only
information known at the time. However, we wanted to see whether this data would correlate to
prices during the market hours later that day. Granger causality is a statistical concept of causality
that is based on prediction [16]: if a certain signal X causes signal Y, then past values of X should
contain information that helps predict Y. This is the principle we will use.

Figure 5.8: A graph illustrating the concept of Granger Causality

To implement this principle, we simply shifted the training arrays one day forward and the
validation data one day back, by deleting the last and first days’ worth of values from the
respective arrays.

Figure 5.9: How the concept was implemented in code
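In essence, the shift amounts to the following, assuming X and y are the feature and price-change
arrays, ordered chronologically with one entry per company per day:

SAMPLES_PER_DAY = 10  # one aggregated sample per company per day

X_shifted = X[:-SAMPLES_PER_DAY]  # drop the final day's feature arrays
y_shifted = y[SAMPLES_PER_DAY:]   # drop the first day's price changes
# X_shifted[i] (day d) is now paired with y_shifted[i] (day d + 1).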

5.3.3 Feature scaling

The disparity in scale between individual features was several orders of magnitude. For this
reason, it was necessary to use a scaler to standardize features to unit variance, using the formula

z = (x − u) / s,

where z is the scaled feature, x is the sample, u is the mean of the training samples, and s is
the standard deviation of the training samples.

5.3.4 Principal component analysis

Due to the large number of features, their wildly different scales, and the noisy nature of the
data, principal component analysis was used to reduce the complexity of the training data.
Sklearn’s PCA module was used for this.
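A sketch of these two steps together; the two-component setting anticipates the analysis in
Section 6.1.4:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize each feature to zero mean and unit variance: z = (x - u) / s.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_shifted)

# Reduce the noisy, differently scaled features to principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)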

5.3.5 Classifier training

From Sklearn’s library, 8 classifiers, including a dummy, were implemented, trained, tuned, and
evaluated. This will be heavily explored in the next chapter as we discuss evaluation techniques.
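As an indicative sketch only, four of the eight might be trained and scored as follows, with the
up/down labels derived from the sign of the price change:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.dummy import DummyClassifier

labels = [change > 0 for change in y_shifted]  # up/down classes
X_train, X_test, y_train, y_test = train_test_split(X_reduced, labels,
                                                    test_size=0.25)

models = {"DTC": DecisionTreeClassifier(), "SVM": SVC(),
          "GNB": GaussianNB(), "DUM": DummyClassifier(strategy="uniform")}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))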

5.3.6 Prediction data retrieval

Data for companies I wanted to make a prediction for was streamed the same way as the training
data. I approached this in two ways. Firstly, the data could be streamed directly, formatted into
arrays, and predicted on using the previously trained classifier models. Secondly, if we wanted to
save the prediction to the database, we posted the data the same way as the training data, but
with an extra boolean value in the object to differentiate between training and prediction data.

5.4 User application


To finalize our product, we needed to package it into an accessible form, both as a standalone
application and as a web application.

5.4.1 Model saving

The database was not going to be hosted online for several reasons, including application
dependence, cost, and privacy issues. On top of that, we did not want the application to train a
model every time a prediction was to be made. The solution was simple: use Python’s Pickle
module to save the classifier, scaler and PCA models, so that only those three external files
would need to be packaged with the application.

Figure 5.10: Code describing how the model was saved
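A minimal sketch of the save and load steps; the file names are invented:

import pickle

# Persist the three trained models so only these files ship with the application.
for obj, filename in [(classifier, "classifier.pkl"),
                      (scaler, "scaler.pkl"), (pca, "pca.pkl")]:
    with open(filename, "wb") as f:
        pickle.dump(obj, f)

# At prediction time, the application loads them back.
with open("classifier.pkl", "rb") as f:
    classifier = pickle.load(f)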

5.4.2 User interface

Figure 5.11: Main menu for the standalone application



The user interface was built using Tkinter. It allowed the user to input the company name
and stock code, which were required, along with two optional related words for that company.
The number of tweets to be retrieved was up to the user. Figure 5.11 shows how the window
appears on the Windows OS.
The streaming process was difficult to show on the user interface: the string variables used
by Tkinter would have to be updated every time new data was retrieved, and the root window
would have to refresh. To work around this issue, I closed the window and let the process run in
the terminal in the background while tweets were collected, with a pop-up appearing once the
prediction was ready.

Figure 5.12: How results are presented after a prediction is made

We would let the user know how many tweets were collected for each keyword they input;
the company name and stock code were considered the same category. We would also show
them an aggregated sentiment score. Finally, they would be given a prediction as to whether
their stock would go up or down at closing time the next market day.

5.4.3 Standalone application

From the requirements, we had to make the product as easy to use as possible. This meant we
would not make the user install hundreds of Python modules in order to run the code.
To package the entirety of the project into an executable for an operating system, without
dependencies, PyInstaller was used. This was a much more challenging process than expected, as
the project involved several heavy modules which all had to work together. It meant going
through the modules I already had installed, then upgrading or downgrading them so that
everything would be compatible. As a fix for several recurring issues, variables in several methods
inside the Python modules themselves had to be changed. On top of that, many resources had to
be manually or specifically included.

Figure 5.13: Specific instructions for packaging this project, based on its dependencies

5.4.4 Web application

Finally, a web application was built to allow for even easier access. The web app could not depend
on Tkinter for the user interface, so an HTML form had to be used instead, with GET requests.

(a) User interface of the web app (b) Results page of the web app

Figure 5.14: Web application snapshots

Using the Flask framework, the application was successfully adapted into a hosted web application.
However, the same problem arose as with Tkinter: it was extremely difficult to show streaming.
On top of this, due to slightly different limitations, tweet limits could not be used, so I
implemented a time limit instead. The final screen shows the same details as the standalone
application.
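The route handling may have looked roughly like the following; run_prediction stands in for
the streaming-and-prediction pipeline, and the template names are invented:

from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/")
def index():
    return render_template("index.html")  # the HTML form, submitted via GET

@app.route("/predict")
def predict():
    company = request.args.get("company")   # GET parameters from the form
    code = request.args.get("stock_code")
    result = run_prediction(company, code)  # hypothetical: stream for a time
                                            # limit, then apply the saved models
    return render_template("results.html", result=result)

if __name__ == "__main__":
    app.run()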

5.5 Summary
At this stage, the product had met most of the essential requirements. The goal now was to attain
reasonable results using the product we had built, and that would involve improving and adding
features as well as configuring the prediction process.

6 Evaluation

This chapter will be split into two sections. In the first section, we will discuss the initial analysis
of the product’s performance and what was implemented to further improve its accuracy and
refine the product. In the second section, we will carry out our final evaluations to answer the
questions set out at the start of the project.

6.1 Performance analysis and refinement


In this section, we will discuss evaluation procedures carried out to improve and fine-tune
components, while adding features to try to improve the performance of our product.

6.1.1 Classifier metrics

The first order of business was to have a method of measuring the performance of our classifiers.
Using the Sklearn metrics library, I was able to measure the F1, accuracy, precision and recall
scores for each model. Accuracy measures the ratio of correctly predicted observations to total
observations; precision measures the ratio of correctly predicted positive observations to total
predicted positive observations; recall measures the ratio of correctly predicted positive
observations to all observations in the class; and the F1 score is the weighted average of precision
and recall [18].
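For reference, these four metrics can be computed as follows, assuming classifier, X_test and
y_test from the training sketch earlier:

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_pred = classifier.predict(X_test)
print("accuracy: ", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("F1:       ", f1_score(y_test, y_pred))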

6.1.2 Classifier comparison

(a) General accuracy comparison between classifiers (b) Detailed comparison between shortlisted classifiers

Figure 6.1: Comparing performance of various classifiers trained using the same data

Immediately, using the data that I had, I trained eight classifiers: decision tree (DTC), support
vector machine (SVM), Gaussian Naïve Bayes (GNB), k-nearest neighbors (KNN), multi-layer
perceptron (MLP), random forest (RFC), extra trees (ETC), and finally a random dummy classifier
(DUM). They were evaluated with different splits of the testing and training data to see which
of the classifiers would consistently perform well, adhering to the principle of cross-validation.
An example of the results is shown in Figure 6.1a.
Over a number of evaluations, DTC, SVM and GNB consistently performed well. These
three classifiers, along with the random dummy, were shortlisted and compared further, with the
results shown in Figure 6.1b.
The decision tree classifier outperformed the other classifiers in the majority of the tests. Because
of this, and given the flexible nature of decision tree classifiers and the comprehensive analysis
they allow, I used this classifier for the rest of the evaluation. However, the choice of classifier
would be reexamined later on.

6.1.3 Classifier configurations

Figure 6.2: A graph showing the performance over a range of classifier configurations

The classifier’s configuration was adjusted to try to find the optimal settings for performance.
A loop was used to cycle through parameters such as maximum depth, minimum samples required
for splitting, weighted fraction, and so on. For many of these options, the default gave the best
performance; for those that showed an improvement, the new value was used to replace the
default. An example of the results from this analysis is shown in Figure 6.2.

6.1.4 PCA component analysis

Figure 6.3: A graph showing the performance over a range of PCA configurations

As we had 15 features at this point, I wanted to find out whether reducing them to components
would improve the performance, with the results shown in Figure 6.3.
Reducing the data down to two components seemed to improve the performance drastically.
This was most likely due to the independent nature of the features, with many possibly being
arbitrary.

6.1.5 Data reduction and filtering

The initial spam filtering stage had already removed more than 90% of our data. However, some
samples would still be of more use than others, such as tweets with more words. For this reason,
minimum word and character limits were implemented to see if they would make a difference.
First, I examined the distribution of tweets in my data set, as shown in Figure 6.4.

(a) Distribution of minimum word count (b) Distribution of minimum character count

Figure 6.4: Distribution of various tweet characteristics

With a close to linear distribution of samples, I created a loop which gave the performance of
the classifier as each of these limits was put in place, plotting with Python’s matplotlib
module.

(a) Effect of filtering minimum words (b) Effect of filtering minimum characters

Figure 6.5: Performance with various filters

Surprisingly, filtering on the word and character counts of the tweets did not improve the
performance of the classifier. This was possibly due to the fact that word and character counts
were already included as features.

6.1.6 Feature adding

At this point, I had a few ideas for extracting additional features from the data we already had.
Using the content of the tweet, additional methods were implemented to derive further values
for sentiment, subjectivity, and word types.

Figure 6.6: A graph showing performance with new feature groups added

Using several natural language processing libraries, including TextBlob, VaderSentiment and
spaCy, three more groups of features were generated, comprising six individual features. Using
VaderSentiment, values for positivity, negativity and a compound score were used as features
under the group ’Sentiment’. TextBlob’s subjectivity analysis was used, and finally, the numbers
of nouns and verbs in each tweet were counted using spaCy. The results are shown in Figure 6.6.
Interestingly, adding these additional features only reduced the performance of the classifier.
Perhaps this was because variations of these features already existed as part of the control model,
and adding them simply introduced more noise.

6.1.7 Summary

I was successful to an extent in improving the performance of the model. Many methods which I
thought would improve the performance of prediction did not, and explanations were given to
the best of my ability.

6.2 Final evaluations


With no further improvements to be made, we were ready to evaluate our results to try to learn
what causes stocks to go up or down.

6.2.1 Unit testing

The software practice of unit testing was used throughout the project. Every lower-level
component was tested separately and individually before being merged into a larger component,
as seen in Section 4.1.4. This was convenient because every component had a very specific
expected output for a given situation. The larger components were then tested on their own to
ensure that results were consistent, such as getting the same metrics from the classifier. Finally,
the product was tested as a whole, with print checks at every stage. This was repeated any
time a large change was implemented.

Figure 6.7: Principle of unit testing

6.2.2 Ablation study

The features that we used were split into four groups. The first group contained features describing
the author of the tweet. The second group was about the reception of the post itself. The third
group included information about the company, such as sentiment and the number of times it
was mentioned. Finally, the fourth group contained Twitter-specific information, such as the
number of hashtags. I then compared the accuracy with each group removed against the accuracy
of the control. The results are shown in Figure 6.8.

Figure 6.8: A graph showing performance with various groups removed

As we can see, removing group 1 in particular caused the accuracy of the classifier to deteriorate
drastically. Group 2, on the other hand, made no difference at all. This made sense, as I later
found out that group 2 contained no valuable information whatsoever: as tweets were collected
as they were posted, they had no time to accumulate likes and retweets. With this information, I
further expanded the study to find out which features in particular made the biggest difference,
shown in Figure 6.9.

Figure 6.9: Graph showing the performance with various features removed

It seemed that the biggest indicators were the number of followers of the author, the length of the
tweet, the mentions of the company, the number of hashtags, and the sentiment of the content.

6.2.3 Training with fewer days

A fixed number of days was set aside, and the model was trained the same way on the remaining
training data to see if the results would be affected. It was expected that the more training
data I took away, the worse the performance would get.

Figure 6.10: A graph showing performance while training with fewer days

Although taking away 10, 15 and 20 days of training data worsened the performance,
surprisingly, taking away only 5 days improved it slightly. This could have been for multiple
reasons, but the most likely explanation is that the 5 days taken away were giving unusual results,
possibly due to macroeconomic changes.

6.2.4 Random data split testing

Finally, with our changes made, we conducted one final test of our classifier. Following
cross-validation principles, we split our training and testing data randomly 10 times and obtained
an accuracy value for each model trained. The results are shown in Figure 6.11.

Figure 6.11: A graph showing the results from cross validation

The model was consistently achieving over 50% accuracy, with the average score being 0.5731.
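A sketch of this procedure, assuming the randomness comes from varying the seed of the split
and reusing X_reduced and labels from earlier:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

scores = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X_reduced, labels, test_size=0.25, random_state=seed)
    model = DecisionTreeClassifier().fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))
print(sum(scores) / len(scores))  # averaged 0.5731 in the runs reported above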

6.2.5 Prediction consistency

Using the final product, we attempted to find out how consistent the predictions were for the
same stocks during the same period. Our goal was that, if the application were run 10 times,
all runs would give the same prediction as well as similar values for sentiment and tweet
distribution. We tested this with Apple, using the stock code AAPL and the related words
’iPhone’ and ’Mac’, retrieving 500 tweets each time.

Figure 6.12: A table showing the results from running the application 10 separate times

These results were extremely surprising. They indicated that the retrieved tweet data was not as
random as I had thought. It was also very suspicious: the consistency of the sentiment did not
seem possible, and for that reason I suspect it could be manipulation by marketing firms.
However, this is difficult to investigate and confirm.

6.2.6 Comparison with empirical standards

A regression model was built using historical stock data for the purpose of comparing its
results with the sentiment model. It used many of the same libraries to retrieve stock data, along
with Sklearn’s linear regression algorithm. Although the model was able to give numerical results,
it was not adjusted for a day-to-day comparison with the sentiment model due to time constraints.

6.2.7 Summary

Firstly, features related to sentiment, such as content polarity and the number of mentions of the
search query, as well as tweet length and the author’s friend count, had the biggest negative impact
when removed from training. We can conclude that these four features may be the greatest
indicators of stock price movements the following day.
Unsurprisingly, training with more data generally improved the performance of the classifier,
with the exception of unusual periods, such as national-level news affecting the economy.
Finally, the data retrieval process was extremely precise. Collecting a certain number of tweets
using the same parameters consecutively gave very consistent results, especially for sentiment
values. To confirm that this was not an error in the code, the same search was performed
on another company with different keywords, producing very different results. Furthermore, the
experiment was performed on Apple again, using the same keywords, but on a different day at a
different time; again, the results were completely different. This means that the randomness and
unpredictability of social media data was in fact not an issue, but rather a matter of what is done
with that data.

7 Conclusion

At last, we will discuss whether this project was successful, why or why not, and what more
could have been done.

7.1 Requirements review


We will begin by reviewing the initial requirements set out at the start of the project.

7.1.1 Functional requirements

The ’must have’ requirements were all met, as every one of them was essential for the product to
exist at all. All the ’should have’ requirements were also met, with the exceptions of S5 and S9.
S5, as explained in Section 6.2.6, was forgone due to time constraints, while S9 will be addressed
later in this chapter.
The ’could have’ requirements were slightly too ambitious. C1 and C2 were not feasible due to
the barely passable accuracy of the existing model. C3 was attempted as described in Section 4.2.7
but met with failure. C4, C5, C6, C7, and C8 could not be completed simply due to time
constraints.

7.1.2 Non-functional requirements

The product was informally trialed by colleagues and students from both technological and
non-technological backgrounds to gauge the product’s general usability. Of the non-functional
requirements, NF4 and NF7 were not met, while NF2 and NF8 are only just acceptable. NF4,
an accuracy concern, is elaborated on later in this chapter. As for NF7, users felt that a bare
prediction and a value for sentiment were not enough: they wanted to know why the prediction
was made, which cannot simply be explained by saying ’trust the machine’, something I myself
do not believe. Although a web app was implemented for NF2, it was not hosted, meaning I
would have to send the standalone application in order for someone to use it; the standalone was
over 1GB, which inconvenienced some users. Finally, although the code is expandable, whether
NF8 was met is debatable due to the unclean commenting and structuring of the source files.

7.1.3 Achievements and failures

Out of the four major components, I believe that the data ingestion and sentiment/feature
breakdown were executed rather well, given the preciseness of the averaged features and the
consistency of predictions. The user applications satisfied their requirements. Generating an
accurate prediction using the retrieved data, however, was questionable at best, as evident from
the mere 57.31% accuracy and the large deviation in performance metrics when configurations
were adjusted.

7.2 Product viability


Given the performance of the product, we will discuss whether it is a reasonable application to
expect professionals to use.

7.2.1 Overall accuracy

The overall accuracy obtained was 57.31%. This may seem awfully low compared to other
machine learning problems; however, one must take into consideration the largely independent
relationship between tweets and stock prices.

7.2.2 Expected return

The average price increase was 0.54% and the average price decrease was 0.36%. Assuming we
only buy stocks that the predictor tells us will increase each day, we use the formula:

return = accuracy(1 + increase) + (1 − accuracy)(1 − decrease), (7.1)

Substituting our values gives 0.5731 × 1.0054 + 0.4269 × 0.9964 ≈ 1.0016, an expected return
of 0.16% per day. If many trades are made, this value of profit becomes consistent, although it
does not account for trading fees. Over a 365-day year, the expected return on one’s portfolio is
50.12%.

7.2.3 Verdict

Although a return of 50.12% a year may seem good, we must take risk and reliability into
consideration. The data used to train the models was collected over a 3-month period with very
questionable political circumstances; as the world’s situation changes, so might this model.
The 57.31% accuracy was based on training data. When applied to real-life scenarios, the results
may vary wildly.
On top of that, many professional prediction models achieve accuracies over 70%. My application
is most definitely unable to compete with them.
The main disadvantage of this product is that it only examines sentiment, which is largely
independent of stock movements, while other models mostly use empirical data from the stocks
themselves, which is much easier to correlate and make accurate predictions from.
Overall, this product would not be viable on its own. That being said, it is not useless either, as
there are still improvements that could bring viability to the product.

7.3 Limitations
In this section, we will briefly discuss why this project was particularly difficult.

Figure 7.1: Distribution of positive and negative samples

7.3.1 The classifier problem

The choice of classifier was questionable. Although the decision tree gave the best performance
out of all the default classifiers, the task could possibly have been more suitable for random forest
or Naïve Bayes. A graph of the training data with its components reduced to two is shown in
Figure 7.1, with the components as the x and y values and the color indicating the class each
sample belonged to. The decision tree was not the best option for this, but then, what was?
Using Sklearn’s classifier comparison code [19], we get the results shown in Figure 7.2.

Figure 7.2: Comparison of Sklearn classifiers on our data

We find that no classifier really does well given the data set. This is a problem which must be
tackled at the lowest level.

7.4 Future development


So, where do we go from here? We will briefly explore some possible enhancements for future
development.

7.4.1 Mood analysis

In several of the papers we explored previously, mood analysis was heavily involved and was
believed to be a very strong indicator of stock price movements. Given more time, mood
analysis could have been implemented to derive a feature from each of a number of moods.

7.4.2 Terrier data ranking

Another thing I would have liked to do was implement a method of ranking tweets according to
how significant they were, but this was not done due to time limitations. Glasgow’s own
Terrier platform could be used for this.

7.4.3 Combination with empirical prediction methods

With current standards of stock prediction heavily focused on using empirical data, it is essential
that this product be used in combination with those products to achieve maximum accuracy.

7.5 Summary
This was a difficult task from start to finish, particularly building a good model given the data that
was retrieved. However, the project achieved better-than-random predictions of the movement
of stock prices at closing time the next day. The data retrieval process was done well, achieving
extremely consistent information about any stock at any time and giving the same prediction
reliably, indicating a high precision in converting public sentiment into data. That being said,
the accuracy of the model is the main limiting factor and must be improved, with a lot of work
needed on designing and perfecting the classifier. Overall, this product may be feasible in the
real world, but as a supplement meant to be combined with other methods of prediction.

A Appendices

A.1 Understanding the source code submission


The source code comes with a README.txt file which describes each of the components inside
as well as a requirements.txt file containing a list of dependencies. The only thing that was not
included is the standalone user application, due to the file size exceeding 1GB.

Bibliography

[1] Experfy, 2017, https://www.experfy.com/blog/the-future-of-algorithmic-trading


[2] Lemke and Lins, Soft Dollars and Other Trading Activities, §§ 2:25-2:29 (Thomson
West, 2013-2014 ed.).
[3] Jean FOLGER, Investopedia, 2019, https://www.investopedia.com/articles/trading/11/automated-
trading-systems.asp
[4] Matthew PHILIPS, Bloomberg, 2012, https://www.bloomberg.com/news/articles/2012-08-
02/knight-shows-how-to-lose-440-million-in-30-minutes
[5] D. M. LEVINE, Fortune, 2013, http://fortune.com/2013/05/29/a-day-in-the-quiet-life-of-a-
nyse-floor-trader/
[6] H.Cootner, P. (1964), The random character of stock market prices, MIT
[7] Will KENTON, Investopedia, 2018, https://www.investopedia.com/terms/r/randomwalktheory.asp
[8] Justin KUEPPER, Investopedia, 2019, https://www.investopedia.com/terms/e/efficientmarkethypothesis.asp
[9] Gallagher, L. A & Taylor, M. P. (2002) Southern Economic Journal 69, 345-362.
[10] Gruhl, D, Guha, R, Kumar, R, Novak, J, & Tomkins, A. (2005) The predictive power of
online chatter. (ACM, New York, NY, USA), pp. 78-87.
[11] Mishne, G & Glance, N. (2006) Predicting Movie Sales from Blogger Sentiment. AAAI
2006 Spring Symposium on Computational Approaches to Analysing Weblogs
[12] Liu, Y, Huang, X, An, A, & Yu, X. (2007) ARSA: a sentiment-aware model for predicting
sales performance using blogs. (ACM, New York, NY, USA), pp. 607-614
[13] Choi, H & Varian, H. (2009) Predicting the present with google trends., (Google), Technical
report
[14] Peter NAGY, Kaggle, 2017, https://www.kaggle.com/ngyptr/python-nltk-sentiment-
analysis
[15] Geeksforgeeks, https://www.geeksforgeeks.org/twitter-sentiment-analysis-using-python/
[16] Scholarpedia, http://www.scholarpedia.org/article/Granger_causality
[17] Wikipedia, https://en.wikipedia.org/wiki/Granger_causality
[18] Renuka JOSHI, Exilio, 2018, https://blog.exsilio.com/all/accuracy-precision-recall-f1-score-
interpretation-of-performance-measures/
[19] Scikit-learn, https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
[20] BOLLEN, https://arxiv.org/pdf/1010.3003.pdf
[21] MITTAL, https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0138441&type=printable
[22] RANCO, http://cs229.stanford.edu/proj2011/GoelMittal-
StockMarketPredictionUsingTwitterSentimentAnalysis.pdf
[23] Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment
Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media
(ICWSM-14). Ann Arbor, MI, June 2014.
[24] Sayak PAUL, Datacamp, 2018, https://www.datacamp.com/community/tutorials/simplifying-
sentiment-analysis-python
[25] Aaron KUB, towardsdatascience, 2018, https://towardsdatascience.com/sentiment-analysis-
with-python-part-1-5ce197074184

[26] PingShiuanChua, 2018, https://www.pingshiuanchua.com/blog/post/simple-sentiment-
analysis-python?utm_campaign=News&utm_medium=Community&utm_source=DataCamp.com
[27] Softwaretestingfundamentals, http://softwaretestingfundamentals.com/unit-testing/
[28] Khan Saad Bin Hasan, towardsdatascience, 2019, https://towardsdatascience.com/stock-
prediction-using-twitter-e432b35e14bd
[29] iexfinance, https://pypi.org/project/iexfinance/
[30] spaCy, https://spacy.io/
[31] TweePy, http://www.tweepy.org/
[32] Lucas Kohorst, towardsdatascience, 2018, https://towardsdatascience.com/predicting-stock-
prices-with-python-ec1d0c9bece1
[33] TextBlob, https://textblob.readthedocs.io/en/dev/
