
Applied Machine Learning

CT046-3-M

Topic 1 Intro to Data Science & Machine Learning

Outline
Why Data Science?
What is Data Science?
What are some prominent examples of Data Science?
How to become a Data Scientist?
Who is hiring Data Scientists now?

CE52604-5-Object Oriented Methods

Module Introduction

Why Data Science?


The Dawn of Big Data


Google, Yahoo! today
Web search and computational advertising
Google: 35,000 searches/sec
Yahoo! scale: 600 million users per month, 4 billion clicks per day, 25 terabytes of data collected every day
Netflix 2007
Movie recommendations, Netflix Prize
100 million ratings, 500,000 users, 18,000 movies
Amazon 2003
Product recommendations, reviews
29 million customers, millions of products
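
To get a feel for these numbers, a rough back-of-the-envelope calculation helps. The sketch below only uses the figures quoted on this slide; the variable names are illustrative, and the decimal convention (1 PB = 1000 TB) is an assumption.

```python
# Back-of-the-envelope arithmetic on the figures quoted on the slide.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

google_searches_per_sec = 35_000
google_searches_per_day = google_searches_per_sec * SECONDS_PER_DAY

yahoo_tb_per_day = 25
# Assumes decimal units: 1 PB = 1000 TB.
yahoo_pb_per_year = yahoo_tb_per_day * 365 / 1000

netflix_ratings = 100_000_000
netflix_users = 500_000
netflix_movies = 18_000
# Fraction of all possible (user, movie) pairs that actually carry a rating.
rating_density = netflix_ratings / (netflix_users * netflix_movies)

print(f"Google searches per day: {google_searches_per_day:,}")
print(f"Yahoo! data per year: {yahoo_pb_per_year:.3f} PB")
print(f"Netflix rating density: {rating_density:.2%}")
```

The last figure shows that only about 1% of the possible user-movie pairs have a rating, which is why recommendation is a prediction problem rather than a lookup.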

How Big is Your Data?


Kilobyte (1 000 bytes)
Megabyte (1 000 000 bytes)
Gigabyte (1 000 000 000 bytes)
Terabyte (1 000 000 000 000 bytes)
Petabyte (1 000 000 000 000 000 bytes)
Exabyte (1 000 000 000 000 000 000 bytes)
Zettabyte (1 000 000 000 000 000 000 000 bytes)
Yottabyte (1 000 000 000 000 000 000 000 000 bytes)
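
The unit ladder above can be turned into a small helper that expresses a raw byte count in the largest fitting unit. This is a minimal sketch using the decimal (power-of-1000) units from the slide; the function name `human_size` and the abbreviations are illustrative choices, not part of the original material.

```python
# Decimal (SI) units, matching the powers of 1000 listed on the slide.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def human_size(num_bytes: int) -> str:
    """Express num_bytes in the largest unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


print(human_size(1500))           # a bit more than one kilobyte
print(human_size(25 * 10**12))    # the 25 terabytes per day quoted for Yahoo!
```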

5 Vs of Big Data
Raw Data: Volume
Change over time: Velocity
Data types: Variety
Data Quality: Veracity
Information for Decision Making: Value


Cloud Computing
"The practice of using a network of remote servers hosted on the Internet to store, manage, and process data, rather than a local server or a personal computer."
-- Gartner IT Glossary
"Cloud Computing is a new term for a long-held dream of computing as a utility."
-- Above the Clouds, 2009

Cloud Computing = Cloud + SaaS
Cloud computing refers to both:
Cloud: The hardware and system software in the datacenters that provide those services.
Public Cloud (Utility Computing) vs. Private Cloud
SaaS: Any cloud service where consumers are able to access software applications over the internet (e.g., Facebook, Twitter).
Cloud Computing started around 2006
Big Data and Data Science (Big Data Analytics) started around 2011

Current Trends
Applications have bigger data and need more advanced analysis
Example: Web, corporate documents and emails: Natural Language Processing
Example: Social media: Network/Graph Analysis
IT infrastructure is moving to Cloud Computing
Data Science arises from this application pull and technology push


What is Data Science?


Data Science: A Definition

Data Science is the science which uses computer science, statistics and machine learning, visualization, and human-computer interaction to collect, clean, integrate, analyze, visualize, and interact with data to create data products.


Goal of Data Science

Turn data into data products.

Data to Data Products

Transaction Databases → Fraud Detection
Wireless Sensor Data → Smart Home
Text Data, Social Media Data → Product Review and Consumer Satisfaction
Software Log Data → Automatic Troubleshooting

What are some prominent examples of Data Science?


Data Products: Google

Web Search
Google Ads
News Recommendation Engine
Google Maps
Currently one of the best, if not the best, IT companies to work for. (Google event on Jan 21/22)

Data Products: Netflix

Personalized Movie Ratings
Movie Recommendations
Similar Movies
Movie Categories (e.g., 80s movies with a strong female lead, Kung Fu movies)
Blockbuster is out of business
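
One simple way to think about a "Similar Movies" feature is item-to-item cosine similarity over user ratings. The sketch below is a toy illustration, not a description of Netflix's actual system: the ratings are made up, the names `movie_vector`, `cosine_similarity`, and `most_similar` are hypothetical, and a missing rating is treated as zero.

```python
from math import sqrt

# Hypothetical toy ratings (user -> {movie: stars}); not real Netflix data.
ratings = {
    "ann": {"Movie A": 5, "Movie B": 4, "Movie C": 1},
    "bob": {"Movie A": 4, "Movie B": 5},
    "cara": {"Movie A": 1, "Movie C": 5},
}


def movie_vector(movie):
    """Ratings for one movie, keyed by user (users who did not rate it are absent)."""
    return {user: r[movie] for user, r in ratings.items() if movie in r}


def cosine_similarity(m1, m2):
    """Cosine of the angle between two movies' rating vectors (missing = 0)."""
    v1, v2 = movie_vector(m1), movie_vector(m2)
    shared = v1.keys() & v2.keys()
    if not shared:
        return 0.0
    dot = sum(v1[u] * v2[u] for u in shared)
    norm1 = sqrt(sum(x * x for x in v1.values()))
    norm2 = sqrt(sum(x * x for x in v2.values()))
    return dot / (norm1 * norm2)


def most_similar(movie):
    """Return the other movie whose rating pattern is closest to `movie`."""
    others = {m for r in ratings.values() for m in r} - {movie}
    return max(others, key=lambda m: cosine_similarity(movie, m))


print(most_similar("Movie A"))
```

In this toy data, users who liked Movie A also liked Movie B, so Movie B comes out as the nearest neighbour.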

Data Products: LinkedIn/Facebook
People you may know
Applications you may like
Jobs/Events you might be interested in
Classifier for bad users and bad content
With high accuracy, Facebook can guess whether you are single or married
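
A "People you may know" feature can be approximated by counting mutual friends in the social graph. The following is a minimal sketch on a made-up friendship graph; the names and the `people_you_may_know` function are hypothetical, and real systems use far richer signals.

```python
from collections import Counter

# Hypothetical undirected friendship graph, stored as adjacency sets.
friends = {
    "ana": {"ben", "cy"},
    "ben": {"ana", "cy", "dee"},
    "cy": {"ana", "ben", "dee"},
    "dee": {"ben", "cy", "eli"},
    "eli": {"dee"},
}


def people_you_may_know(user):
    """Rank non-friends by how many friends they share with `user`."""
    counts = Counter()
    for friend in friends[user]:
        for candidate in friends[friend]:
            # Skip the user themselves and people they already know.
            if candidate != user and candidate not in friends[user]:
                counts[candidate] += 1
    return counts.most_common()


print(people_you_may_know("ana"))
```

Here "dee" is suggested to "ana" because they share two friends ("ben" and "cy").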

Data Products: Twitter

Text Analysis: Spam Filter/Similarity Search
User Sentiment/Satisfaction/Feedback
News Breakout
Trend and Topics
200 million users as of 2011, generating


Data Products: Splunk

Degradation, Failure Detection
Identify Security Breaches
Event Monitoring
Troubleshooting Tools
Cross-platform Event Correlation
Splunk is an American multinational corporation headquartered in San Francisco, California, which produces software for searching, monitoring, and analyzing machine-generated big data via a web-style interface.

How to become a Data Scientist?


The Life of Data


[Figure: data flows from Data Sources through Collect, Clean, Integrate, Analysis, Visualization, and Interface to Users]
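
The stages of the life of data can be sketched as a chain of small functions. This is a toy illustration under stated assumptions: the sensor readings are invented, the function names simply mirror the stage names, and the "visualization" is just a text bar chart.

```python
from statistics import mean

# Hypothetical raw readings from two sources; None and "n/a" mark bad records.
source_a = [("kitchen", "21.5"), ("kitchen", "n/a"), ("hall", "19.0")]
source_b = [("hall", "20.0"), ("kitchen", None), ("kitchen", "22.5")]


def collect():
    """Collect: gather raw records from every data source."""
    return [source_a, source_b]


def clean(records):
    """Clean: drop records whose value cannot be parsed as a number."""
    cleaned = []
    for room, value in records:
        try:
            cleaned.append((room, float(value)))
        except (TypeError, ValueError):
            continue
    return cleaned


def integrate(sources):
    """Integrate: merge records from all sources, grouped by room."""
    merged = {}
    for records in sources:
        for room, value in records:
            merged.setdefault(room, []).append(value)
    return merged


def analyze(merged):
    """Analysis: compute the average temperature per room."""
    return {room: mean(values) for room, values in merged.items()}


def visualize(summary):
    """Visualization: render one text bar per room."""
    return [f"{room:<8}{'#' * round(avg)} {avg:.1f}" for room, avg in sorted(summary.items())]


summary = analyze(integrate(clean(source) for source in collect()))
print("\n".join(visualize(summary)))
```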

Challenges in Data Science


Preparing Data (noisy, incomplete, diverse, streaming)
Analyzing Data (scalable, accurate, real-time, advanced methods, probabilities and uncertainties, ...)
Representing Analysis Results, i.e., the data product (story-telling, interactive, explainable)


Skill Set of a Data Scientist

Data Management: data collection, storage, cleaning, filtering, integration
Large-scale Parallel Data Processing: parallel computing
Statistics and Machine Learning: data modeling, inference, prediction, pattern recognition
Interface and Data Visualization: HCI design, visualization, story-telling

Who is hiring Data Scientists now?


Sexy Job in the Next 10 Years

"The sexy job in the next ten years will be ... The ability to take data, to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it, that's going to be a hugely important skill."
-- Hal Varian, Google Chief

Who's hiring Data Scientists?

IT companies: Google, Twitter, LexisNexis, Facebook
Media and financial sectors: Fox, CNN, NYT, Bloomberg, ...
Research: Biology, Medicine, Physics, Psychology, ...
Information offices in government and corporations
Law firms: e-discovery tools


Books on Data Science


Additional Reading Pointers


Data Science Summit (Strata) (http://www.datascientistsummit.com/)
Kaggle Competitions (http://www.kaggle.com/)
Data Science course at Berkeley & Coursera (http://datascienc.es/)


Summary
Why now: Dawn of Big Data, need for advanced analytics, and Cloud Computing
What is it: Data → Data Product, with many examples incl. Google, Netflix, Splunk, LinkedIn
How to become: Data management, parallel computing and data processing, statistical machine learning, and visualization skills
Life of Data
Who is hiring: Data Scientists are in great demand, from industry to government to science.

Question & Answer Session

Q&A