
Seminar Talk at Chadwick Building 102, University College London (UCL), London.
Date: Friday, July 8, 2016, 3 pm.

Big Data and Human Dynamics: A Human-Centered
Approach to Analyze Human Activities and
Movements with Social Media and GIS Data

Prof. Ming-Hsiang Tsou


Twitter: @mingtsou, Email: mtsou@mail.sdsu.edu
Director of the Center for Human Dynamics in the Mobile Age
Professor, Department of Geography, San Diego State University
The Center for Human Dynamics in the
Mobile Age at San Diego State University

Advancing Transdisciplinary Research on Big Data,
Human Dynamics, and the Social Web
http://humandynamics.sdsu.edu/
HDMA Team Members

Dr. John Elder
Jiue-An (Jay) Yang
Dr. Lourdes Martinez
Dr. Ming-Hsiang Tsou (Director)
Jessica Dozier
Dr. Heather Corliss
Dr. Sheldon Zhang
Su Yeon Han
Dr. Atsushi Nara
Dr. Jay Lee
Alejandra Coronado
Dr. Brian Spitzberg
Hao Zhang
Dr. Piotr Jankowski
Joey Lee
Dr. Eric Buhi
Chris Allen
Rich Zhang
Dr. Xuan Shi
Stephanie Nowinski
Dr. Jean Mark Gawron
Jared Jashinsky
Dr. Bruce Appleyard
Dr. Joseph Gibbons
Elias Issa
Dr. Michael Peddecord
Dr. Xinyue Ye
HDMA Center Graduate Students
NSF Projects
The HDMA Center has been hosting two large NSF
projects (CDI and IBSS):
1. Mapping Ideas from Cyberspace to Realspace. Funded by the NSF
Cyber-Enabled Discovery and Innovation (CDI) program. Award
#1028177. $1.3 million (2010-2015). http://mappingideas.sdsu.edu/

2. Spatiotemporal Modeling of Human Dynamics Across Social
Media and Social Networks. Funded by the Interdisciplinary Behavioral
and Social Science Research (IBSS) program. Award #1416509.
$1 million (2014-2019). http://socialmedia.sdsu.edu

Principal Investigator: Dr. Ming-Hsiang Tsou, mtsou@mail.sdsu.edu (Geography). Co-PIs: Dr. Dipak K. Gupta
(Political Science), Dr. Jean Mark Gawron (Linguistics), Dr. Brian Spitzberg (Communication), Dr. Li An (Geography),
Dr. Jay Lee (Kent State, Geography), Dr. Ruoming Jin (Kent State, Computer Science), Dr. Xinyue Ye (Kent State,
Geography), Dr. Heather Corliss (Public Health, SDSU), Dr. Xuan Shi (Geoscience, U of Arkansas).
San Diego State University, Kent State University, University of Arkansas, USA.
What is Human Dynamics?

Human Dynamics is a transdisciplinary research field
focusing on the understanding of dynamic patterns,
relationships, narratives, changes, and transitions of
human activities, behaviors, and communications.

Smart Phones - 2007 (The Mobile Age):
the most important scientific instrument in the 21st Century
(2014 sales: 1.2 billion units).

Animated image created by the HDMA Center (Hao Zhang).
Big Data is Human-Centered Data

Big Data is a large dynamic dataset created by or derived
from human activities, communications, movements,
and behaviors (Tsou, 2015).

The term, Big Data, refers to big ideas, big impacts, and big
changes for our society in addition to a big volume of datasets.

Tsou, M. H. (2015). Research challenges and opportunities in mapping social
media and Big Data. Cartography and Geographic Information Science,
42:sup1, 70-74. doi: 10.1080/15230406.2015.1059251.
http://www.tandfonline.com/doi/full/10.1080/15230406.2015.1059251#.VeCVyPlVhBd
The Challenge of Big Data Analytics:

Big Data are very Messy, Noisy, and Unstructured!

Image Source: http://www.contentverse.com/office-pains/10-messy-desks-successful-people/

Requires collaborative efforts from linguists, geographers (GIS experts),
computer scientists, data mining experts, statisticians, physicists, modelers,
and domain experts.

Human Dynamic in the Mobile Age (HDMA)


Geography (place and time) is the KEY for
Understanding and Integrating Big Data

[Diagram: Big Data (information), Time, Place (Tsou and Leitner, 2013)]


KDC (Knowledge Discovery in Cyberspace) framework

Tsou, M. H. and Leitner, M. (2013). Editorial: Visualization of Social Media: Seeing a Mirage or a Message? In Special Content Issue: "Mapping
Cyberspace and Social Media". Cartography and Geographic Information Science. 40(2), pp. 55-60. DOI: 10.1080/15230406.2013.776754
Data Integration / Data Fusion
Explore their spatiotemporal relationships in both network
space (cyberspace) and geographical space (real world).

Health or Disaster Data Layer

Image provided by Dr. Atsushi Nara (Associate Director of the HDMA Center).
Big Data Category (Tsou, 2015).

Social life data: social media services (Twitter, Flickr, Snapchat, YouTube,
Foursquare, etc.), online forums, online video games, and web blogs.

Health data: electronic medical records (EMR) from hospitals and health
centers, cancer registry data, disease outbreak tracking and epidemiology data.

Business and commercial data: credit card transactions, online business
reviews (such as Yelp and Amazon reviews), supermarket membership records,
shopping mall transaction records, credit card fraud examination data, enterprise
management data, and marketing analysis data.

Transportation and human traffic data: GPS tracks (from taxis, buses,
Uber, bike sharing programs, and mobile phones), traffic sensor data (from
subways, trolleys, buses, bike lanes, highways), and mobile phone data (from
data transmission records and cellular network data).

Scientific research data: earthquake sensors, weather sensors,
satellite images, crowdsourced data for biodiversity research (iNaturalist),
volunteered geographic information, and census data.

Geography (place and time) is the KEY for understanding Big Data!
Research Showcase #1:

Geo-Targeted Social Media (Twitter) Analytics
for Tracking Flu Outbreaks in the U.S.
Data Filtering, Mining, and Visualization

Workflow: Geo-Targeting Data Collection (Twitter Application Programming
Interfaces, APIs) -> Filter -> Machine Learning -> Trend Analysis and
Spatial Analysis -> SMART Dashboard Visualization
What can we get from Twitter data?
Where can we find geospatial information?

Example: use the Twitter Search API to search for the keywords "HIV test" or "HIV testing".
Only 1%-7% of tweets have X, Y geo-coordinates (from GPS or geo-tagged devices).
But 50%-60% of tweets have city-level locations provided by their user profiles.
80% of tweets have a time zone (limited spatial meaning).
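
The three kinds of location evidence listed above can be checked in order of precision. The following is a minimal sketch, not the HDMA Center's code: it assumes each tweet is a dict laid out like the Twitter REST API JSON of the time (`coordinates` as a GeoJSON point, `user.location`, `user.time_zone`), and the function name `best_location` is hypothetical.

```python
from typing import Optional, Tuple


def best_location(tweet: dict) -> Tuple[str, Optional[object]]:
    """Return (source, value) for the most precise location evidence in a tweet."""
    # 1. Exact coordinates (rare): GeoJSON stores them as [longitude, latitude].
    coords = tweet.get("coordinates")
    if coords and coords.get("coordinates"):
        lon, lat = coords["coordinates"]
        return "gps", (lat, lon)

    # 2. Free-text location from the user profile (needs geocoding afterwards).
    user = tweet.get("user") or {}
    profile = (user.get("location") or "").strip()
    if profile:
        return "profile", profile

    # 3. Time zone: available for most tweets but spatially coarse.
    time_zone = user.get("time_zone")
    if time_zone:
        return "time_zone", time_zone

    return "none", None
```

Counting the returned source labels over a collection gives the kind of percentages quoted on this slide.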
Geocoding Engine for Social Media
The HDMA Center has built its own internal geocoding
engine for user profile locations, using GeoNames.org
gazetteers (Creative Commons data) plus user-defined rules.

It enables flexible or self-defined geo-target boundaries
(California, Santa Barbara, Los Angeles, and San Diego
bounding boxes, or state boundaries).
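
A gazetteer lookup combined with a bounding-box test can be sketched as follows. This is an illustration only, not the Center's engine: the three-entry gazetteer, the alias rules, and the bounding-box values are rough placeholders invented for the example, and a real system would load the full GeoNames.org tables.

```python
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]  # (latitude, longitude)

# Toy gazetteer (assumption): lowercase place name -> approximate city center.
GAZETTEER: Dict[str, Point] = {
    "san diego": (32.7157, -117.1611),
    "los angeles": (34.0522, -118.2437),
    "santa barbara": (34.4208, -119.6982),
}

# Hypothetical user-defined rules that map nicknames to gazetteer names.
ALIASES: Dict[str, str] = {"sd": "san diego", "la": "los angeles"}

# Illustrative box as (min_lat, min_lon, max_lat, max_lon); not an official boundary.
SAN_DIEGO_BOX = (32.5, -117.6, 33.5, -116.1)


def geocode_profile(text: str) -> Optional[Point]:
    """Resolve a free-text profile location such as 'San Diego, CA' to a point."""
    name = text.lower().split(",")[0].strip()
    name = ALIASES.get(name, name)
    return GAZETTEER.get(name)


def in_box(point: Point, box: Tuple[float, float, float, float]) -> bool:
    """Check whether a (lat, lon) point falls inside a geo-target bounding box."""
    lat, lon = point
    min_lat, min_lon, max_lat, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
```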
SMART Dashboard
Social Media Analytic and Research Testbed
YouTube video (3 minutes)
http://vision.sdsu.edu/hdma/smart/

Real-time social media analytics (Trend Analysis, Word Clouds, Top URL,
web pages, Top Hashtags/Mentions/Stories).
Collect Tweets from the Top 31 U.S. Cities (17-mile radius)
31 cities across the United States (chosen based on population size): Atlanta, Austin,
Baltimore, Boston, Chicago, Cleveland, Columbus, Dallas, Denver, Detroit, El Paso, Fort Worth,
Houston, Indianapolis, Jacksonville, Los Angeles, Memphis, Milwaukee, Nashville-Davidson, New
Orleans, New York, Oklahoma City, Philadelphia, Phoenix, Portland, San Antonio, San Diego, San
Francisco, San Jose, Seattle, and Washington, D.C.
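
A 17-mile radius test around a city center can be written with the haversine great-circle distance. This is a sketch under stated assumptions: the slides do not say how the radius is applied (the search API may do it server-side), and the Earth radius constant and function names below are choices made for the example.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(point, center, radius_miles: float = 17.0) -> bool:
    """True if point (lat, lon) lies within radius_miles of center (lat, lon)."""
    return haversine_miles(point[0], point[1], center[0], center[1]) <= radius_miles
```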

Filter and Refine Big Data (Remove Noise)

Number of tweets (chart values): 10,678; 5,398; 4,947; 4,944; 3,279

Machine Learning

Total flu tweets collected: 307,070.
Final valid flu tweets: 88,979.
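
The totals above imply that roughly 29% of the collected tweets survived filtering (88,979 / 307,070). A staged filter that records how many tweets remain after each step might look like the sketch below. The two rules shown (dropping retweets and tweets containing URLs) are assumptions made for illustration; the slide does not list the actual rules, and the machine learning stage is not modeled here.

```python
def is_retweet(tweet: dict) -> bool:
    """Hypothetical rule: treat text starting with 'RT @' as a retweet."""
    return tweet["text"].startswith("RT @")


def has_url(tweet: dict) -> bool:
    """Hypothetical rule: flag tweets that contain a web link."""
    return "http://" in tweet["text"] or "https://" in tweet["text"]


def run_filters(tweets, drop_rules):
    """Apply drop rules in order; return survivors and the count left after each stage."""
    counts = [len(tweets)]
    for should_drop in drop_rules:
        tweets = [t for t in tweets if not should_drop(t)]
        counts.append(len(tweets))
    return tweets, counts


# Share of tweets kept in the slide's totals (about 0.29).
retention = 88_979 / 307_070
```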
Real-Time Monitoring of Flu Outbreaks in the U.S.
(national scale, 31 cities combined), 2013-2014 flu season
Red line: national ILI (influenza-like illness) data, provided by the CDC
Purple line: weekly tweeting rate (two weeks earlier than the CDC data)
Correlation between tweeting rate and ILI: R = 0.8494
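
The R values on these slides are correlation coefficients between two weekly series. A minimal sketch of Pearson's R follows, together with a lagged variant that compares the tweet rate in week t with ILI in week t + lag. Reading "two weeks earlier" as a two-week lead in the signal is one possible interpretation, and the function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def lagged_r(tweet_rate, ili, lag):
    """Correlate the tweet rate in week t with ILI in week t + lag (lag >= 0)."""
    if lag == 0:
        return pearson_r(tweet_rate, ili)
    return pearson_r(tweet_rate[:-lag], ili[lag:])


def best_lag(tweet_rate, ili, max_lag=4):
    """Lag (in weeks) that gives the highest correlation."""
    return max(range(max_lag + 1), key=lambda k: lagged_r(tweet_rate, ili, k))
```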
Trend Analysis at the Municipal Scale (San Diego)
with lab-confirmed flu cases

San Diego: lab-confirmed flu cases vs tweeting rate:
R = 0.9331
Two research papers in the
Journal of Medical Internet Research (2013 and 2014)

Tracking Flu Outbreaks in the 2014/2015 Flu Season

Number of filtered ILI tweets, top 30 U.S. cities, as of February 9, 2015
(from the SMART dashboard)

Only 1%-4% of tweets have geo-tagged coordinates.

CDC Influenza Positive Tests, National Data Summary,
through Weeks 40-3, 2014-2015 Season
Problem: Twitter broke its Search API on 11/20/2014, which from then on
returned only geo-tagged tweets. This reduced the number of tweets
collected by 90%-95%.
2014-2015 Comparison between ILI and
Geo-tagged-only Tweets (4%) among 30 U.S. Cities

R= 0.90559

2016 Flu Tweets vs CDC ILI Data

R = 0.5566

Comparison between the national ILI rate and the tweeting rate for 32 cities, with
prediction up to Week 15. Red: national ILI; purple: tweet rate for 2015-2016.
How to Build a Flu Prediction Model?
Daily Patterns or Weekly Patterns?
Time Scale: Daily, Weekly, Monthly.
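
Moving between time scales is an aggregation step. The sketch below rolls daily tweet counts up into weekly totals using ISO weeks (Monday start). This is an assumption made for simplicity: the CDC reports by its own epidemiological week, which starts on Sunday, so a real comparison would need that calendar instead.

```python
from collections import Counter
from datetime import date


def weekly_counts(daily):
    """Sum a {date: count} mapping into {(iso_year, iso_week): count}."""
    totals = Counter()
    for day, count in daily.items():
        iso = day.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        totals[(iso[0], iso[1])] += count
    return dict(totals)
```

Note that early January dates can belong to the last ISO week of the previous year, which matters when a flu season spans the new year.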

2015 Flu Tweets (Daily Pattern)

2016 Flu Tweets (Daily Pattern)


SMART will be OPEN SOURCE soon!
It will be available on GitHub soon.
Users will need to install both SMART and the Geocoding Engine (MongoDB and
GeoNames.org gazetteer databases).
New journal paper: http://bds.sagepub.com/content/3/1/2053951716652914

The SMART Dashboard won the BEST METHOD PAPER award
at the 2015 International Conference on Social Media and Society, Toronto,
Canada. http://dl.acm.org/citation.cfm?doid=2789187.2789196
SMART Dashboard

Client/Server System
Design Framework

Next Step:
Open Source Initiative
(GPL license, copyleft)
Research Showcase #2:

Real-time Situation Awareness Viewer for Monitoring
Disaster Impacts Using Location-Based Social Media
Messages (Twitter).

CBS8 News Video (4 mins)

San Diego County: Office of Emergency Services (OES)


Geo-targeted Event Observation (GEO) Viewer 1.0
http://vision.sdsu.edu/hdma/wildfire/
Provides a dynamic map display of ground-truth event observations,
linking GPS locations, text, photos, and time.

(Geo-tagged Tweets)

Monitor disaster impacts, recovery activities, and victims' needs
Spatial Clustering (Wildfire Tweets)

Spatial clusters (hotspots) of tweets are near the
actual locations of the wildfires (events).
GeoViewer Tool v.2.2
(Video demo) EC2: http://vision.sdsu.edu/ec2/geoviewer/sanDiego (Live)

Real time (Streaming API); hot spots: Kernel Density Estimation
(KDE) method; automatic three-keyword labels with cluster maps.
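
Kernel density estimation turns a set of tweet points into a smooth surface whose peaks mark hot spots. The sketch below is a plain-Python Gaussian KDE evaluated on a grid, assuming points in a projected (planar) coordinate system. The kernel, the bandwidth, and the grid are choices made for the example; the slides do not specify the GeoViewer's settings.

```python
import math


def kde_grid(points, bandwidth, xs, ys):
    """Gaussian kernel density at every (x, y) grid node; rows follow ys."""
    norm = 1.0 / (2 * math.pi * bandwidth ** 2 * len(points))
    grid = []
    for y in ys:
        row = []
        for x in xs:
            total = sum(
                math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * bandwidth ** 2))
                for px, py in points
            )
            row.append(norm * total)
        grid.append(row)
    return grid


def hotspot(points, bandwidth, xs, ys):
    """Grid node (x, y) with the highest estimated density."""
    grid = kde_grid(points, bandwidth, xs, ys)
    density, x, y = max(
        (grid[j][i], xs[i], ys[j]) for j in range(len(ys)) for i in range(len(xs))
    )
    return x, y
```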
Monitoring Emergency Responses and Rescue Efforts
How can we find critical information among thousands of tweets?
Nepal earthquake example (keyword search: "trap")

One possible solution: manual labeling (first 1,000 tweets, by
volunteers) + machine learning classification (built-in).
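
The "label first, then classify" idea can be illustrated with a small multinomial Naive Bayes classifier trained on volunteer-labeled tweets. Naive Bayes is a stand-in chosen for brevity; the slides do not name the classifier that is built in, and the labels "critical" and "other" are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(labeled):
    """labeled: list of (text, label) pairs produced by volunteers."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labeled:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, label_counts, vocab


def classify(model, text):
    """Return the label with the highest log-probability (Laplace smoothing)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Once trained on the manually labeled tweets, the model can score the remaining stream, and volunteers only need to review the tweets it marks as critical.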
Digital Volunteers may help us identify and select
important Tweets (for machine learning) during and
after the disaster events.
We need some programming and design
help from OES, the Red Cross, and 211:

1. How to combine multiple volunteers' inputs and integrate
systems (ranking system).
2. Which category and color schemes/labels should we use for
each type of disaster (flooding, wildfires, earthquakes, hurricanes)?
3. Which tags might be useful?
4. Who are the target users? What kinds of output system should we
create (for OES staff? for Red Cross staff?)
5. Other suggestions?

Donald Trump visited San Diego on May 27, 2016.
(GeoViewer search for "Trump" from 5/27-5/28)
Geo-tagged Tweets (in GeoViewer)
Spatial Analysis with the keyword "drunk" in San Diego
Big Data Fusion (Integration)

Comparing Spatial Cluster of DUI Records (Red dots, Left side) and
Tweets with Drunk keyword (Right side).

GIS Map with DUI Records GeoViewer (Search drunk for two months)

Similar Spatial Pattern? (Dynamic monitoring in REAL TIME?)


Errors and Noise in Geo-tagged Tweets

Detect robot tweets or advertisement tweets (noise) in geo-tagged
tweets by examining the source metadata field. The
proportion of noise is significant (29.42%) in our case study.

Figure: the number of tweets produced by different platforms inside the
San Diego bounding box during November 2015.
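
Checking the source field amounts to counting tweets per posting platform and flagging platforms that are not ordinary human clients. In the sketch below, the whitelist `HUMAN_CLIENTS` is an assumption made for illustration, not the Center's list, and the `source` value is assumed to arrive as an HTML anchor string, as in the Twitter REST API JSON.

```python
import re
from collections import Counter

# Hypothetical whitelist of platforms treated as human-operated clients.
HUMAN_CLIENTS = {
    "Twitter for iPhone",
    "Twitter for Android",
    "Twitter Web Client",
    "Instagram",
    "Foursquare",
}


def platform_name(source_html: str) -> str:
    """Strip the HTML anchor around the platform name in the source field."""
    return re.sub(r"<[^>]+>", "", source_html).strip()


def noise_share(tweets):
    """Return (tweets per platform, fraction of tweets from non-whitelisted platforms)."""
    counts = Counter(platform_name(t["source"]) for t in tweets)
    noisy = sum(n for name, n in counts.items() if name not in HUMAN_CLIENTS)
    return counts, noisy / len(tweets)
```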
2014 San Diego Wildfire Tweets

Social Network Analysis (SNA)

@ReadySanDiego

@10News @SanDiegoCounty

@KPBSnews

@UTsandiego

Identify the network influence of each individual (who are the opinion leaders?)
Predict the spread (speed, scale, and range) of social media messages in
different social networks (following, retweet, and mention relationships).
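
One simple proxy for network influence is in-degree: how many distinct users retweeted or mentioned an account. The sketch below ranks accounts that way. It is only one of many SNA measures and is not necessarily the one used in this study; the edge list format is an assumption.

```python
from collections import defaultdict


def influence_ranking(edges):
    """edges: (source_user, target_user) pairs, meaning source retweeted or
    mentioned target. Returns [(user, distinct_audience_size)], largest first."""
    audience = defaultdict(set)
    for source, target in edges:
        audience[target].add(source)  # repeated interactions count once
    ranked = [(user, len(sources)) for user, sources in audience.items()]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))
```

Accounts at the top of the ranking are candidate opinion leaders whose messages are likely to travel furthest.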
Hyperlocal Relationship
From Online Connections to Offline Locations

Master's Thesis by Jessica Dozier, 2016


2015 Nepal Earthquake

Master's Thesis by Jessica Dozier, 2016


[ReadySD Social]
Mobile App Development

Recruit 1,000 local volunteers for broadcasting disaster information.
Uses the Ionic platform for both Android and iOS apps. https://ionic.io/
Users will receive push notifications.
They can then decide whether or not to RETWEET the message.
Gamification + Peer Pressure
Android: download from the Play Store (search for "readysd").
iOS: App Store. Getting the app approved for the App Store will take a few weeks.
Current side-loading method: TestFlight
(an installation invitation is emailed to test users).
The Limitations and Challenges of Social
Media and Big Data Research
Social Media User Profiles

Social media messages cannot represent the whole population,
but they can provide warning signals and real-time updates.
Twitter users are:

Young (60% are between 16 and 34 years old).
More urban residents than rural.
Higher adoption rate among African Americans.
Many journalists and mass media staff.
20% are not real human beings (robots): many advertisement and marketing
activities.

2014 Survey (Business Insider)

Different keywords can reach different demographic groups:
#Healthcare includes more senior people (very few teenagers tweet
about healthcare). (We need more background study.)
Keywords could be used as a sampling tool for social media users.
Who are the Users?
Humans or robots (bots)?
Use the SMART dashboard to track e-cigarette topics

High peak on Feb 11, 2016 (why?)

1,553 Twitter accounts said the exact same sentence in one day (2/11/2016)!
From 9,561 to 11,114: 11,114 - 9,561 = 1,553 (mummy or ghost Twitter accounts, for advertisement?)
Are they mummies and ghosts (zombies)?
Who are they? How do they post the messages?
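
A spike like this one can be surfaced by grouping tweets on their normalized text and counting how many distinct accounts posted each sentence. The following is a sketch under stated assumptions: the normalization (lowercasing, collapsing whitespace) and the threshold are illustrative choices, and each tweet is assumed to be a dict with `text` and `user` keys.

```python
from collections import defaultdict


def mass_duplicates(tweets, min_accounts=100):
    """Return {normalized_text: number_of_distinct_accounts} for sentences
    posted by at least min_accounts different accounts."""
    accounts = defaultdict(set)
    for tweet in tweets:
        key = " ".join(tweet["text"].lower().split())
        accounts[key].add(tweet["user"])
    return {
        text: len(users)
        for text, users in accounts.items()
        if len(users) >= min_accounts
    }
```

Sentences that many accounts post verbatim on the same day are strong candidates for automated advertising rather than human conversation.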
User Privacy Issue
Concerns about Big Brother.

All the tweets collected from the APIs are public tweets (anyone can
search and retrieve them). However, some tweets may contain personal
private information (real names, locations of homes and offices,
private conversations, medical situations, etc.).

* The HDMA Center conceals tweet locations by randomly selecting a
coordinate within a 100 m radius of the original location to protect
Twitter users' privacy.
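
Random displacement within a 100 m radius can be sketched as below. The sampling scheme (uniform over the disc) and the flat-Earth conversion from meters to degrees are assumptions made for the example; the slide states only that a random coordinate within 100 m is chosen.

```python
import math
import random

METERS_PER_DEGREE = 111_320.0  # approximate length of one degree of latitude


def mask_location(lat, lon, radius_m=100.0, rng=random):
    """Return a random (lat, lon) within radius_m meters of the original point."""
    # The square root makes the draw uniform over the disc's area, not its radius.
    distance = radius_m * math.sqrt(rng.random())
    bearing = rng.uniform(0.0, 2.0 * math.pi)
    dlat = distance * math.cos(bearing) / METERS_PER_DEGREE
    dlon = distance * math.sin(bearing) / (METERS_PER_DEGREE * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon
```

Passing a seeded `random.Random` instance as `rng` makes the masking reproducible for testing, while the default uses the module-level generator.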
Ethical and Sensitive Issues
SMART Dashboard for Marijuana Legalization Issues
http://vision.sdsu.edu/hdma/smart/marijuana

Publicly available data (Twitter data) may
include a lot of sensitive personal information.

How should we deal with this information?

Original Big Data (not sensitive)
Process + Filter + Noise Reducing
Sensitive Data
Final Remark: Big Data = Transdisciplinary
Geospatial Technology is essential for Big Data Science.
We will transform Science and Technology in the age of
Big Data -- from isolated instruments (disciplines)
into an epic orchestra (collaboration).

Image source: wikipedia.org


Human Dynamic in the Mobile Age (HDMA)
http://humandynamics.sdsu.edu/

Thank You Q&A


Director: Dr. Ming-Hsiang (Ming) Tsou
mtsou@mail.sdsu.edu
Twitter @mingtsou

Funded by:
NSF Cyber-Enabled Discovery and Innovation (CDI) program, Award #1028177
(2010-2015). http://mappingideas.sdsu.edu/
NSF Interdisciplinary Behavioral and Social Science (IBSS) program, Award #1416509
(2014-2018): Spatiotemporal Modeling of Human Dynamics Across Social Media and
Social Networks. http://socialmedia.sdsu.edu/
Why Choose Twitter?
80% of academic researchers use Twitter APIs to get their social media data.

1. Free and open access data from APIs (you can write a program on your desktop
to download Twitter data (tweets) automatically). But the free APIs have a 1%
data limit.
2. Large user base (500+ million users) and very popular in the U.S., Europe, and
Japan, but not in China, Taiwan, or Korea (China has a similar platform called
Weibo).
3. Easy to program in Python or PHP (Tweepy, TwitterSearch, etc.). Many API
libraries are available now.
4. Historical data and 100% data can be purchased from Twitter (but are very
expensive).
5. Rich metadata tags in each tweet (time stamp, user, follower, platform, time
zone, text, URL, retweet, language, devices).

Other possible social media APIs: Flickr, Instagram, Foursquare, Yelp, YouTube.
Why not Facebook? Facebook Graph APIs are VERY LIMITED and PROTECTIVE, with no
public data feed. You need internal connections to Facebook staff to
conduct research.
Next Step: Syndromic Surveillance (Under Development)
(tracking multiple symptoms: fever, cold, cough, vomiting, etc.)
http://vision.sdsu.edu/hdma/smart/syndromic

Designed for early detection of unknown disease
outbreaks, such as Swine Flu and SARS
Other Examples
Building a transformative research agenda for
Big Data Science

Examples of Research Topics:

1. Use Social Media as Human Sensors to Monitor Our Environment (Air
Pollution, Water Quality, Temperature) and Social Movement (Depression,
Social Problems, Election, etc.).
2. Visualizing dynamic spatiotemporal patterns of geographic awareness.
3. Mapping Dynamic Urban Land Use Patterns using Points of Interest (POI)
and Social Media.
4. Building a Comprehensive Food Environment Database (using Yelp, Google
Places, and Foursquare Check-Ins).
5. Building a Dynamic Urban Population Model (Twitter, Instagram, and Flickr).
Use Social Media as Human Sensors
to Monitor Air Pollution in China.

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141185
Chinese Twitter: Sina Weibo

The worst month of air quality (April 2012)
generated the highest correlation coefficient
between the filtered social media messages and AQI.
Mapping Dynamic Urban Land Use Patterns
(using Temporal Pattern Cluster Analysis K-means)

Mapping Dynamic Urban Land Use Patterns with Crowdsourced Geo-tagged Social
Media (Sina-Weibo) and Commercial Points of Interest (POI) Collections in Beijing,
China (submitted to the journal of CaGIS, under review now).
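
Temporal pattern clustering with K-means groups places whose 24-hour activity profiles look alike. The sketch below is a small plain-Python K-means with a deterministic farthest-point initialization. The initialization, the use of raw (unnormalized) hourly counts, and the function names are assumptions made for the example, not details from the paper.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, k, iters=50):
    """Cluster equal-length profiles (e.g. 24 hourly counts); returns (labels, centers)."""
    # Deterministic start: first profile, then the profile farthest from chosen centers.
    centers = [list(profiles[0])]
    while len(centers) < k:
        farthest = max(profiles, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(farthest))

    labels = [0] * len(profiles)
    for _ in range(iters):
        # Assignment step: each profile joins its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in profiles]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(profiles, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

A cluster whose center peaks in the evening and at night would be a candidate residential pattern, while a daytime peak suggests offices or commercial areas.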

Hourly Temporal Patterns (24 hours)

Residential Areas
+ College Dormitory
Building a Comprehensive Food Environment Database
Yelp, Foursquare, Flickr, and Google Places.
Building a Dynamic Urban Population Model:

Building an hourly population density model for urban areas
(for disaster response and management)

(Su Han, Ph.D. student in the HDMA Center).
