
BIG

DATA
A revolution that will transform how
we live, work and think
What is Big Data ?
• Big [volume] Data is not new!
• Traditionally, “Big Data” = massive volumes of data

• Big Data is a misnomer!


• Big Data is more than just “big”
• The Vs that define Big Data
– Volume
– Variety
– Velocity
– Veracity
– Variability
– Value
– …
What is BIG DATA?
A collection of data sets so large and
complex that it becomes difficult to
process using traditional data processing
applications

Volume – amount of data


Velocity – speed of data in and out (Big Data must
be used as it streams)
Variety – data types, data sources
Veracity – trust in the Big Data
Value
VARIETY
• Very large data sets have existed for decades –
what’s new is the emergence of the collection and
storage of unstructured data primarily from
“unconventional sources”
Big Data – Unconventional sources
• Where does the Big Data come from?
Everywhere! Web logs, RFID, GPS systems, sensor networks, social
networks, Internet-based text documents, Internet search indexes, detail
call records, astronomy, atmospheric science, biology, genomics, nuclear
physics, biochemical experiments, medical records, scientific research,
military surveillance, multimedia archives, …

In an online world, every interaction leaves a trail. Every purchase,
every click, and every message creates a footprint. This data, if
analyzed, has the potential to generate historically unimaginable
insight. (Evan Stubbs, 2014)

• The World Economic Forum has classified social media data as an


economic asset
VELOCITY
BIG DATA

“…the digital bread crumbs we all leave behind


us as we move through the world – call records,
credit card transactions, and GPS location fixes
among others. These data tells the story of
everyday life by recording what each of us has
chosen to do….Who we actually are is
determined by where we spend our time and
which things we buy, not just by what we say
or do”
(A. Pentland – Social Physics)
BIG DATA
• More
• Messy
• Correlation
• Datafication
• Value
• Implications
• Risks
• Control
• Next
MORE
• Data volumes have grown geometrically

– 2007 – 300 exabytes of stored data – 7% analog data
– 2013 – 1,200 exabytes of stored data – 2% analog data
– By 2020, the amount of digital information will grow to
40,000 exabytes!
– By 2025 … 160,000 exabytes!

The volume of digital data doubles about every 2 years.
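As a rough check on the growth figures above, the doubling time implied by any two data points can be computed directly. This is only a sketch that assumes steady exponential growth between the quoted years; the function name `doubling_time` is made up for illustration.

```python
import math


def doubling_time(start_value, end_value, years):
    """Years needed to double, assuming steady exponential growth."""
    if start_value <= 0 or end_value <= start_value or years <= 0:
        raise ValueError("need 0 < start_value < end_value and years > 0")
    return years * math.log(2) / math.log(end_value / start_value)


# Stored-data figures quoted on the slide, in exabytes.
figures = [(2007, 300), (2013, 1200), (2020, 40_000), (2025, 160_000)]

# Compare each consecutive pair of data points.
for (y0, v0), (y1, v1) in zip(figures, figures[1:]):
    rate = doubling_time(v0, v1, y1 - y0)
    print(f"{y0}-{y1}: doubling every {rate:.1f} years")
```

Under that assumption the three intervals give doubling times of roughly 3.0, 1.4 and 2.5 years, so "about every 2 years" is best read as an average over the whole period.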


Technology Insights
The Data Size Is Getting Big,
Bigger…
Names for Big Data Sizes
• Hadron Collider - 1 PB/sec
• Boeing jet - 20 TB/hr
• Facebook - 500 TB/day.
• YouTube – 1 TB/4 min.
• The proposed Square
Kilometer Array telescope
(the world’s proposed
biggest telescope) – 1
EB/day
Annual size of the datasphere

1 ZB = 10^21 bytes
Data creation by type
Where data is stored
Data criticality over time
MORE
• From SOME to ALL
The need to sample disappears !

Sampling
• Involves costs
• Loses detail

Sampling will no longer be the predominant
way we analyze large volumes of data.
Example: Genome sequencing
• SOME - just a small part of the genetic code is
investigated
Mass-market technique (under $1,000)
• ALL – sequence the entire DNA
MORE
• The change of scale has led to a change of state -
the quantitative change has led to a
qualitative one

• Just like the internet radically changed the world
by adding communication to computers, Big Data
will change fundamental aspects of life by
giving it a quantitative dimension it never
had before
MESSY
• Increasing the volume opens the door to
inexactitude
• Big Data transforms figures into something
more probabilistic than precise
• People are willing to sacrifice a bit of accuracy
in return for knowing the general trend
• We need to embrace messiness when we
increase scale
More and messy vs. fewer and exact

• BI – “a single version of truth”


• Database design – intolerance of imprecision

“We can no longer pretend to live in a clean


world….It’s ok if we have lousy answers – that’s
frequently what business needs”
(Pat Helland – If you have too much data, then ‘Good
Enough’ is ‘Good Enough’)
More and messy vs. fewer and exact
• The quantity of data is so breathtakingly enormous that it
can’t be moved and must be analyzed where it is (NoSQL,
Hadoop…)

- The credit card company Visa was able to reduce the processing
time for two years’ worth of test records (73 billion
transactions) from one month to 13 minutes

- ZestFinance analyzes a huge number of weak variables, for
which a lot of data is missing, to decide on small, short-term
loans. In 2012 it boasted a loan default rate that was
a third less than the industry average
More and messy vs. fewer and exact
• Only a small percent of all data is structured – by
allowing for imprecision, we open a window into
an untapped universe of insights

• The Assumptions :
We can’t use far more data, so we don’t
The quality of information – accuracy, consistency
and exactitude
are no longer true !
More and messy vs. fewer and exact
MINDSET SHIFT
We have to change – to become
comfortable with disorder and
uncertainty

“We don’t give up on exactitude entirely; we


only give up our devotion to it. What we lose in
accuracy at the micro level we gain in insight at
the macro level.”
CORRELATION
• Knowing what, not why, is good enough
• Predictions based on correlations lie at the heart of
Big Data
Insurance companies (AVIVA, PRUDENTIAL, AIG)
TARGET – discount retailer

• PREDICTIVE ANALYTICS may not explain the cause
of a problem – it only indicates that the problem
exists!
CORRELATION
• Hypotheses are no longer crucial for correlational
analysis
• More sophisticated analyses to identify non-linear
relationships among data
- happiness and income relationship
- network analysis: map, measure and calculate the
nodes and links for everything (from one’s friends on
Facebook, to who calls whom, to which court
decisions cite which precedents)
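A small illustration of why non-linear relationships need more than a plain linear correlation coefficient. This sketch uses made-up data (a U-shaped curve, not the happiness and income data mentioned above) and a hand-written Pearson coefficient; the function name `pearson` is an assumption for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one linear and one U-shaped dependence on x.
xs = list(range(-5, 6))
linear = [2 * x + 1 for x in xs]
u_shaped = [x * x for x in xs]

print(pearson(xs, linear))    # close to 1: the linear link is detected
print(pearson(xs, u_shaped))  # 0: y depends fully on x, yet r sees nothing
```

The second value is zero even though `u_shaped` is completely determined by `xs`. That is the kind of relationship a linear measure misses and a more sophisticated analysis is meant to find.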
DOES CAUSALITY REALLY EXIST?
• HUMAN COGNITION explains our intuitive desire to
see causal connections
• Two modes of thinking (D. Kahneman):
1. the FAST way
2. the SLOW and HARD way

“just as sampling was a shortcut we used because


we could not process all the data, the perception of
causality is a shortcut our brain uses to avoid
thinking hard and slow”
HABITS vs. BELIEFS
            FAST          SLOW
PROCESS     Fast          Slow
            Parallel      Serial
            Automatic     Controlled
            Associative   Rule-based
CONTENT     Conceptual representations
            Past, present, future
            Can be evoked by language


HABITS vs. BELIEFS
• Most of our behavior is habitual rather
than reasoned – based on fast
judgments of intuition and habit (social
influence)

• The majority of our most important


decisions are due to the slow process of
reasoning (free will)
CORRELATION vs. CAUSALITY
• CORRELATION - fast & cheap
- mathematical and statistical methods to analyze
relationships
- digital tools to demonstrate them with confidence
• CAUSALITY
- no obvious mathematical way to “prove” it
- experiments

Causality is no longer the primary “fountain” of
meaning
CORRELATION IS ENOUGH?
• From a hypothesis-driven world to a data-driven
world
• Big Data – the end of theory?
“the data deluge makes the scientific method
obsolete”
(Chris Anderson, 2008) – quantum physics against
gene sequencing

Big Data analysis has no need of any


conceptual model?
CORRELATION vs. CAUSALITY
MINDSET SHIFT

We have to accept that we don’t need to


understand the reasons behind all that
happens

“We have to put our trust in correlations


without fully knowing the causal basis for the
predictions”
DATAFICATION

“The move to Big Data is a continuation of


humankind’s ancient quest to measure, record
and analyze the world” – not IT!
Datafication – quantifying the world

Datafication
• To put a phenomenon in quantified form so it can be
analyzed
• Google
- From digitized text to datafied text
- Improving its machine learning translation service
- http://books.google.com/ngrams
- Culturomics – computational lexicology that tries to
understand human behavior and cultural trends
through the quantitative analysis of text

Digitization
• To convert analog information into binary code
• Amazon
- Datafied books
- Focus on the content that humans read, not on the
analysis of datafied text
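The step from digitized to datafied text can be sketched with a simple n-gram count: once text is treated as data, word sequences become something that can be counted and compared. This is a minimal illustration only, not how Google's Ngram service is built; the function name `ngram_counts` and the sample sentence are invented for the example.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count every run of n consecutive words in the text."""
    # Lowercase and keep only letters and apostrophes as word characters.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


sample = "Big data is big and big data is messy"
bigrams = ngram_counts(sample, 2)

# Show the most frequent two-word sequences.
for ngram, count in bigrams.most_common(3):
    print(" ".join(ngram), count)
```

Run over a large collection of dated books, counts like these are the raw material for the trend analysis that culturomics describes.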
The Datafication of everything
• Words
• Location
- people, objects - “datafied floor”, “quantified self”…
- Reality mining

• Interactions
- Relationships, experiences, moods
• Social graph (Facebook),
• Tweets as signals for investments in the stock market (Derwent Capital &
MarketPsych)
• An analysis of 509 million tweets over 2 years from 2.4 million people in 84
countries: people’s moods follow similar daily and weekly patterns across
cultures! (Science, 2011)

• Physical activity, sleep


DATAFICATION - An infrastructure
project ?

“With the help of Big Data, we will no longer


regard our world as a string of happenings
that we explain as natural or social
phenomena, but as a universe comprised
essentially of information”
Closing the cycle…
• DATA – GIVEN (Latin) – fact
• Present:
- A numerate society
- The power of the “written word” - Knowledge can
be transmitted across time and space
• Tomorrow?
A Big Data consciousness

Quantifying reality – transforming myriad dimensions


of reality in data will be treated as a GIVEN
DATAFICATION
MINDSHIFT

Information is the basis of all that is


VALUE
• DATA
- a significant corporate asset
- a vital economic input
- the foundation of new business models

“Personal data is the new oil of the Internet and the


new currency of the digital world”
(M. Kuneva, European consumer commissioner)
Data’s value shifts from its primary
use to its potential future uses
• The “option value” of data

• The reuse of data

• Recombinant data

• Extensible data
VALUE
• Depreciating value of data
- Limit of data usefulness for some purposes
- It can destroy the value of fresher data
- No influence on its option value
• The value of data exhaust
- Google: recursively “learning from data”
- Can be a huge competitive advantage or a powerful barrier
to entry against rivals
• The value of open data
Open government data
IMPLICATIONS
The Big data value chain
• The data
• The skills
Data scientist – statistician, software programmer,
infographics designer, storyteller
• The mindset – to see opportunities before others
do
Data Scientist
“The Sexiest Job of the 21st Century”
Thomas H. Davenport and D. J. Patil
Harvard Business Review, October 2012

• Data Scientist = Big Data guru


– One with skills to investigate Big Data

• Very high salaries, very high expectations

• Where do Data Scientists come from?


– M.S./Ph.D. in MIS, CS, IE,… and/or Analytics
– There is not a specific degree program for DS!
Skills That Define a Data Scientist
• Domain Expertise, Problem Definition and Decision Modeling
• Data Access and Management (both traditional and new data systems)
• Communication and Interpersonal
• Curiosity and Creativity
• Programming, Scripting and Hacking
• Internet and Social Media/Social Networking Technologies
IMPLICATIONS

• The new data intermediaries


Convincing the data holders of the value in sharing

§ In Big Data’s early stages, the ideas and the skills


seem to hold the greatest worth.

§ But eventually most of the value will be in data


itself.
IMPLICATIONS
• The demise of the expert ?
Data-driven decisions – statistical analyses
force people to reconsider their instincts

Changes:
§ What you need to know
§ Whom you need to know
§ What it takes for an employee to be valuable
to a company
REALITY MINING
The most valuable flows of ideas within an
organization are face-to-face and telephone
conversations, because they carry the most complex,
sensitive information
• Sociometric badges – to collect data on individual
communication behavior
Patterns of communication – the most important
predictor of a team’s success.
www.sociometricsolutions.com
• Mobile phone sensing – behavioral activity sensing
www.funf.org
IMPLICATIONS
• Shift in corporate decision making -
EMPHASIZE DATA DRIVEN DECISION MAKING
to achieve competitive advantage

Research study :
Productivity levels were as much as 6% higher at
firms that excel at data-driven decision making
than at companies not guided by data.
IMPLICATIONS
• Big Data squeezes the middle of an industry,
pushing firms to be very large, or small and quick,
or dead

• Is Big Data Disruptive for states as well…?


Big data for Decision Management
• BIG DATA

• PROCESSING POWER (High performance computing)


- Data is useless without the ability to effectively
analyze it.
→ INSIGHT
Insight without action is the same as doing nothing!

• DECISION MANAGEMENT
BUSINESS ANALYTICS

• ANALYTICS: any data-driven process that provides


insight

• Each of these is based on
o Data rather than opinion
o Quantitative (rather than qualitative) techniques
BIG DATA Analytics
The process of examining large
amounts of data of a variety of types
to uncover hidden patterns, unknown
correlations and other useful
information

Primary goal: to help companies make


better business decisions
BUSINESS ANALYTICS
• Analytics create insight.
• Business Analytics is concerned with taking
that insight and using it to create value.

Understanding the value of insight


and convincing an organization to
change the way it does business.
BUSINESS ANALYTICS
• ANALYTICS
• CHANGE MANAGEMENT
• BUSINESS INTELLIGENCE
• INFORMATION MANAGEMENT
• VALUE MEASUREMENT
• PERSUASION AND INFLUENCE
• ORGANIZATIONAL DESIGN
• ADVANCED ANALYTICS
• BUSINESS STRATEGY
• …
DSS versus DSMS
• In BIG DATA environments, it is important to
analyze, decide and act quickly and often

• Data Stream Management Systems (DSMS) –


to maintain and query streaming data
DSS versus DSMS
The main difference: the data stream

• In a DSMS, new data are generated continually and
the arrival rates may differ dramatically (from
millions of items/second to several items/hour)
• A traditional DSS queries static data to
generate statistics and reports; a DSMS constantly
queries streaming data in an effort to give the end
user the most up-to-date real-time information
possible
DSS vs. DSMS
• The way in which data is stored.

In a DSMS – the data is stored in the memory for


as long as processing the data is needed.
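
The in-memory storage described above can be illustrated with a sliding-window query: each arriving item is kept only while it is still needed to answer the running query, then dropped. This is a toy sketch rather than the design of any particular DSMS product; the class name `SlidingWindowAverage` is hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowAverage:
    """Running average over the most recent window of a data stream."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._items = deque()  # (timestamp, value) pairs still in the window
        self._total = 0.0

    def add(self, timestamp, value):
        """Accept one new stream item and evict items that have expired."""
        self._items.append((timestamp, value))
        self._total += value
        cutoff = timestamp - self.window_seconds
        while self._items and self._items[0][0] <= cutoff:
            _, old_value = self._items.popleft()
            self._total -= old_value

    def average(self):
        """Current answer to the continuous query, or None if empty."""
        if not self._items:
            return None
        return self._total / len(self._items)


window = SlidingWindowAverage(window_seconds=10)
for timestamp, value in [(0, 10), (5, 20), (12, 30)]:
    window.add(timestamp, value)
    print(timestamp, window.average())
```

Unlike a report run against a warehouse, the answer here is updated on every arrival, and memory use is bounded by the window instead of by the full history.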
DW architecture
Business data → DW (“static” storage system) → Query reports from the end user

DSMS architecture
Sources of business data streaming from multiple sources
(unstructured data: text, video, pictures, media information, web data)
→ Data arrives in real time and is stored in computer memory
→ Queries are constantly being generated for the end user
IMPLICATIONS
MINDSHIFT

Adjustment of traditional ideas of


management, decision making,
human resources and education
Trust (only ?) data
RISKS
• Our perceptions and institutions
were constructed for a world of
information scarcity, not surfeit.

• We may not be ready for its impact


on our privacy and on our sense of
freedom.
RISKS
1. Privacy – can we protect it anymore?

2. Penalties based on propensities – undermine the


principles of fairness, justice and free will
Use Big Data predictions about people to judge and
punish them even before they’ve acted.

3. Dictatorship of data
Fetishize the information and end up missing it
Used unwisely, Big Data may be turned into a source of
repression
PRIVACY

• Big Data increases the risk to privacy


• Big Data changes the character of this risk

The tried and trusted concept of notice and


consent is either too restrictive to unearth
data’s latent value or too empty to protect
individuals’ privacy
STRATEGIES to protect privacy

• Individual notice and consent

• Opting out

• Anonymization
PRIVACY
• BIG DATA is used not only in the private
sector!!!

• Governments do it too…

The people are the sum of their social


relationships, online interaction and
connections with content.
PROBABILITY AND PUNISHMENT
• Big Data used to prevent crime from happening
FAST (Future Attribute Screening Technology) – identifying
potential terrorists by monitoring individuals’ vital signs, body
language and other physiological patterns (70% accurate)
• In many contexts, data analysis is already employed in the
name of prevention - “profiling” in a small-data world

• In Big Data world


- non-causal analysis helps to identify the most suitable
predictors from the sea of information
- Specific individuals rather than groups
PROBABILITY AND PUNISHMENT
• The Promise : we do what we’ve been doing all
along, but make it better, less discriminatory, and
more individualized.
• The main danger : the idea of punishment related
to future behavior
Deciding on parole releases according to whether predictions indicate a high
probability of committing a new crime (Pennsylvania)

Deny that humans have a capacity for moral choice


Depriving of Responsibility
PROBABILITY AND PUNISHMENT
• Big Data relies on CORRELATION

• THREAT: to ABUSE Big Data for causal purposes

“…Big Data THREATENS to imprison us – perhaps


literally – in probabilities”
Still…
The Dictatorship of Data

• THREAT : to become obsessed with collecting facts


and figures for data’s sake
“In GOD we trust – all others bring data”
(the mantra of the modern manager)
Data deluge
“Garbage in, garbage out”
CORRELATIONS DO NOT IMPLY
CAUSATION!
The Dictatorship of Data
• HR policy at GOOGLE
No correlation between the scores and job
performance!!!

• BRILLIANCE DOESN’T DEPEND ON DATA

• The use of data often serves to empower the


powerful
CONTROL
• FROM PRIVACY TO ACCOUNTABILITY
From individual consent to data-user accountability

• SAFEGUARDS in place to assure


- Openness: making available the data and algorithm
underlying the prediction that affects an individual
- Certification: having the algorithm certified for
certain sensitive uses by an expert third party
- Disprovability: specifying concrete ways that people
can disprove a prediction about themselves
CONTROL

• EXPLAINABILITY in terms of monitoring and


transparency
Big data predictions, the algorithms and
datasets behind them are black boxes - no
accountability, traceability or confidence.

New types of expertise


Algorithmists - reviewers of big data analysis
and predictions
CONTROL
• A NEW DEAL ON DATA
Provide both regulatory standards and financial
incentives that entice owners to share data while at
the same time serving the interests of both
individuals and society

§ You have the right to possess data about you


§ You have the right to full control over the use of your
data
§ You have the right to dispose of or distribute your data
The future
• What is not? - innovation source
• New winners and losers
• Technical skills and a lot of imagination – the
big-data mindset
• Shift from technology to what happens when the
data speaks
• The nature of decision making, destiny, justice
• The possession of knowledge (an understanding of
the past) is coming to mean an ability to predict
the future
