com/sgeletta/95577
Simon Geletta
Saturday, July 25, 2015
Milestone Report
Introduction and Objectives
The main goal of this report is to demonstrate the level of competency achieved in working with
unstructured data to produce a structured set of records that can then be used for statistical
modeling. The first step in any such task is to learn, as far as possible, what the raw data (the
document corpus) contains and to separate the useful information from the not-so-useful. Note that
because running the code while preparing the document for publication on RPubs.com took an
unreasonably long time, this report is based on a 10% sample of the data that was provided. The
idea is to present this as evidence of what I will do with the full data at the end of the capstone
project.
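The report does not say how the 10% sample was drawn. As a minimal sketch, assuming each line of a source file is kept independently with probability 0.10, the sampling step could look like the following. The function name `sample_lines`, the seed value, and the use of Python are illustrative assumptions, not the report's own code.

```python
import random


def sample_lines(lines, fraction=0.10, seed=2015):
    """Keep each line independently with probability `fraction`.

    Hypothetical helper: the report does not describe its sampling method,
    so per-line Bernoulli sampling is an assumption made for illustration.
    A fixed seed makes the sample reproducible between runs.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


if __name__ == "__main__":
    # Stand-in data; in practice `lines` would be read from a corpus file.
    corpus = [f"document {i}" for i in range(1000)]
    subset = sample_lines(corpus)
    print(f"kept {len(subset)} of {len(corpus)} lines")
```

Sampling line by line keeps memory use low when the lines are read lazily from a file, at the cost of a sample size that is only approximately 10% of the input.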
Methods
The first task was to download the raw resources to be used for the analytics tasks, chiefly the
three data sources en_US.blogs.txt, en_US.news.txt, and en_US.tweets.txt. In addition, a list of
bad/profane words was obtained, to be used later to exclude those words from the analysis. The raw
data were downloaded in compressed form from
http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
and uncompressed locally. The bad/profane words were downloaded from
https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en
and stored locally as en_bws.txt. The following chunk of code shows how the file acquisition went.
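The report's original code chunk is not reproduced in this copy. As a stand-in, here is a minimal Python sketch of the acquisition steps described above: download the archive, unpack it, and save the bad-words list as en_bws.txt. The two URLs and the file name en_bws.txt come from the text; the helper names `fetch`, `unpack`, and `acquire` and the `data` directory are hypothetical.

```python
import urllib.request
import zipfile
from pathlib import Path

# URLs as given in the text above.
CORPUS_URL = (
    "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/"
    "Coursera-SwiftKey.zip"
)
BAD_WORDS_URL = (
    "https://raw.githubusercontent.com/shutterstock/"
    "List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
)


def fetch(url, dest):
    """Download `url` to `dest` unless `dest` already exists; return the path."""
    dest = Path(dest)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, dest)
    return dest


def unpack(zip_path, out_dir):
    """Extract every member of the archive into `out_dir`; return member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()


def acquire(data_dir="data"):
    """Hypothetical driver: fetch and unpack the corpus, then fetch the word list."""
    data_dir = Path(data_dir)
    archive = fetch(CORPUS_URL, data_dir / "Coursera-SwiftKey.zip")
    members = unpack(archive, data_dir)
    bad_words = fetch(BAD_WORDS_URL, data_dir / "en_bws.txt")
    return members, bad_words
```

Calling `acquire()` performs the downloads. Because `fetch` skips files that already exist, re-running the report does not download the large archive again.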