
Click Stream Analysis using Hadoop

Click Stream Analysis

A click stream is a record of a user's interactions with
a website or other computer application. Each
row of the click stream contains a timestamp and
an indication of what the user did. Every click or
other action is logged, hence the term click
stream. This is useful when the website does
different things for different users, such as post
recommendations.

Data
Data is obtained from the site in the form of click
stream records. Each record captures the details
of a visitor's click and contains the following fields:
Server IP
Client IP
Time stamp with Date
URL visited
No. of bytes transferred
Custom record(s)

The country of origin for a specific request is
identified using the IP address.
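As a rough illustration, below is a minimal Java sketch of such a record and a parser for it. The tab-separated layout and field order are assumptions for illustration only; the actual log format depends on the server configuration.

// One click stream record; assumes the tab-separated field order
// listed above (this layout is an assumption, not a fixed standard).
public class ClickRecord {
    public final String serverIp;
    public final String clientIp;
    public final String timestamp;
    public final String url;
    public final long bytesTransferred;

    public ClickRecord(String serverIp, String clientIp, String timestamp,
                       String url, long bytesTransferred) {
        this.serverIp = serverIp;
        this.clientIp = clientIp;
        this.timestamp = timestamp;
        this.url = url;
        this.bytesTransferred = bytesTransferred;
    }

    // Parse a single log line; returns null for malformed records.
    // Custom fields, if present, would follow and are ignored here.
    public static ClickRecord parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 5) {
            return null; // skip incomplete records
        }
        try {
            return new ClickRecord(fields[0], fields[1], fields[2],
                                   fields[3], Long.parseLong(fields[4]));
        } catch (NumberFormatException e) {
            return null; // bytes field was not a number
        }
    }
}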

Methodology
To run our algorithm on a larger data set, we first
need to understand big data and Hadoop.
What is Big Data?
A dataset with the following three characteristics can be called big
data:
Volume - on the order of terabytes to petabytes
Variety -
Structured - relational databases
Unstructured - text, PDF, Word, images
Semi-structured - XML, log files
Velocity - how fast the data arrives (e.g. more than 5,000 tweets
per second)

To handle such a large amount of data, the best
solution is a distributed approach,
where the system can process all types of
data in a distributed manner.
Google's solution: divide the task,
assign the pieces to many computers, collect the
results, and integrate them into the final result.
This is called the MapReduce algorithm.

HDFS (Hadoop Distributed File System)

HDFS uses a block size of 64 MB or 128 MB (recommended); block
sizes of 256 MB or even 1 GB are also possible. Hadoop therefore
works more efficiently on large files than on many small files.
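As an illustration, the block size can also be set per file through the HDFS Java API. This is a minimal sketch; the file path and replication factor are assumptions for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a file with an explicit 128 MB block size instead of
        // relying on the cluster default (dfs.blocksize in hdfs-site.xml).
        long blockSize = 128L * 1024 * 1024;
        Path path = new Path("/clickstream/logs.txt"); // hypothetical path
        FSDataOutputStream out = fs.create(path, true, 4096, (short) 3, blockSize);
        out.writeUTF("sample record");
        out.close();
    }
}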
MapReduce framework
Hadoop provides this framework for dividing a job and
combining the results. It consists of two phases (a sketch
follows below):
o Map phase
o Reduce phase
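Below is a minimal sketch of what such a mapper and reducer could look like for counting visits per URL. It assumes the tab-separated record layout from the Data section, with the URL in the fourth field; both assumptions are for illustration only.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (url, 1) for every click record.
public class UrlCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text url = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length >= 4) {   // fields[3] = URL visited (assumed layout)
            url.set(fields[3]);
            context.write(url, ONE);
        }
    }
}

// Reduce phase: sum the counts for each URL.
class UrlCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}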

Hadoop Architecture
Hadoop follows a master-slave architecture.

NameNode: This acts as the master and is therefore
the most vital component of Hadoop. It is the
bookkeeper of HDFS, tracking how files are broken
into blocks, which nodes hold those blocks, and so on.

DataNode: This acts as a slave; unlike the NameNode
(one per cluster), there are many slave nodes in a
Hadoop cluster. A DataNode can directly access the
local file system and perform read/write operations.

Secondary NameNode (SNN): It takes snapshots of the
HDFS metadata at intervals and communicates with the
NameNode so as to minimize the impact of a failure.

JobTracker: This determines the execution plan: which
files to process, which nodes to assign, and it
monitors all tasks.

TaskTracker: Manages the execution of individual tasks,
with one TaskTracker per slave node. It constantly
communicates with the JobTracker; if there is no
response for a timeout period (generally 10 minutes),
the task is resubmitted to another node.
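To illustrate the NameNode's bookkeeping role, the sketch below asks HDFS which blocks make up a file and which DataNodes hold each block; the file path is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/clickstream/logs.txt"); // hypothetical path

        // The NameNode answers this query from its metadata: which
        // blocks make up the file and which DataNodes store each one.
        FileStatus status = fs.getFileStatus(path);
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
    }
}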

Simulation Result

The objective of this simulation is to
collect click stream data from US
government websites, which is high in
volume and velocity, and to store it for
analysis in a cost-effective manner for
enhanced insight and decision making.
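A minimal driver sketch that wires the mapper and reducer shown earlier into a runnable Hadoop job might look like this; the input and output paths are passed as command-line arguments and are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ClickStreamDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "clickstream url counts");
        job.setJarByClass(ClickStreamDriver.class);
        job.setMapperClass(UrlCountMapper.class);
        job.setCombinerClass(UrlCountReducer.class); // safe: summing is associative
        job.setReducerClass(UrlCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // raw logs in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // results directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}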

[Charts: Most Clicked Websites; Websites Visited per Country; Websites Visited per Month]

Conclusion

For click stream analysis we need to create our own
mapper and reducer classes, which can then be
applied to business models.

Future Work

OpenTracker Clickstream Analysis Tool

o An interactive tool that lets you see all the visitors on
your site in real time, both online and offline.
o Every visitor is represented by an icon.
o If you click on any visitor's icon, you see a graphic
representation of their click stream. You also see
that visitor's profile, which consists of their country of
origin, their ISP, technical specs, the frequency of visits
they have made to your site, and any search terms
they may have used.
o You will also know if they are a first-time visitor, and
can view the details of their visit, i.e. the times they
entered and left.

References

http://hadoop.apache.org/

http://www.usa.gov/About/developerresources/1usagovt.shtml

http://www.cloudera.com/content/cloudera/en/about/hadoop-and-big-data.html

Thank You
