Introduction
In 2013, the mobile-phone user base reached almost
93% of the global population, with more than one billion
smartphones in use [1]. Those devices have created a whole
ecosystem of applications and services, which have rapidly
changed our daily lives and continue to transform our
society. This ongoing mobile revolution has brought
tremendous opportunities for businesses to capitalize on the
vast amount of data that is being generated through mobile
phone usage. For example, phone calls are evidence of a
social link between users. The applications used, or web
pages visited, give hints of a user's topical interests, activity,
or commercial intention. The location of the device can
also be used to enrich data about customers. Hence,
the electronic traces of a mobile phone recorded by a telecom
can be used to give deep insights about people's interests,
lifestyles, and social patterns. The need to analyze this data
motivated us to develop a SoLoMo (social-location-mobile)
analytics solution, which exploits knowledge of a user's
social network, locations, and online usage patterns to
provide meaningful insights in a reasonable timeframe.
One of the main challenges for processing this data is
the size and the speed at which it is being generated. Every
day, billions of electronic records are generated by mobile
H. Cao
W. S. Dong
L. S. Liu
C. Y. Ma
W. H. Qian
J. W. Shi
C. H. Tian
Y. Wang
D. Konopnicki
M. Shmueli-Scheuer
D. Cohen
N. Modani
H. Lamba
A. Dwivedi
A. A. Nanavati
M. Kumar
Copyright 2014 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without
alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied by any means or distributed
royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.
0018-8646/14 © 2014 IBM
VOL. 58
NO. 5/6
PAPER 9
SEPTEMBER/NOVEMBER 2014
H. CAO ET AL.
9:1
Figure 1
SoLoMo architecture. (RT: real-time.)
System architecture
The primary motivation for building the tool was to address
the issues faced in analyzing the huge amount of data
generated by a telecom every day. The data in the telecom
domain is unique and creates challenges related not only
to its volume but also to its
variety and velocity. Therefore, it was crucial
to develop a system that can deal effectively with all three
characteristics.
Our proposed Big Data solution uses a modular
architecture to address the challenges posed in this domain.
Figure 2
Parallel social network analysis architecture. (SNA: social network analysis; KPI: key performance indicator; HDFS: Hadoop Distributed File System;
HITS: hyperlink-induced topic search.)
Location-based profiling
Location information can provide useful insights for building
customer profiles. CDRs contain location information that
reflects movements and behaviors of subscribers. Although
the location data in CDRs is typically at cell granularity,
which is spatially coarse and not as precise as GPS
(Global Positioning System) data, this data is still useful
in characterizing the mobility of users. There are numerous
mobility features that can be extracted from the CDR
sequence to build customer profiles from the location
perspective. We classify the features into three levels, as
shown in Figure 3.
Location features
We now describe the location features that we extracted
from the CDRs. The first of these features are the low-level
Figure 3
Location features extracted from CDR for location-based profiling. (OD: origin-destination; POI: point-of-interest.)
Figure 4
Mobile usage analysis flow.
Listing 1
Utility indices
A few utility indices are prepared in advance and
consulted for categorization: the Wikipedia taxonomy index,
the Wikipedia textual index, the ODP textual index, and the ODP URL
index.
Wikipedia indices: The Wikipedia taxonomy index
captures the category hierarchy of Wikipedia, with each
category accessible by name and pointing to all its parent
categories. The Wikipedia textual index contains a document for
each Wikipedia non-category page (i.e., each article that
appears in Wikipedia). The text of that document is indexed
and searchable, and the Wikipedia categories of that
document are saved as its metadata. In
addition to the immediate categories, their ancestor categories
are indexed as metadata of that document, thereby
maintaining the entire ancestor category set for the document,
up to a certain predefined height. Our experiments indicated
that in Wikipedia, ancestor categories of heights larger
than four diverge significantly from the original immediate
category; therefore, we ignore higher ancestors. The ancestor
height is maintained in the index: zero for the immediate
category, one for its parent categories, etc. In this process,
cycles, which evidently exist in the Wikipedia taxonomy,
are avoided by keeping only the shortest path to an ancestor.
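The ancestor bookkeeping described above can be sketched as a breadth-first walk over the taxonomy index. This is a minimal illustration under stated assumptions, not the authors' implementation: the `PARENTS` dictionary, the function name `ancestor_heights`, and the toy category names are hypothetical. Only the height convention (zero for immediate categories), the cutoff of four, and the shortest-path rule come from the text.

```python
from collections import deque

MAX_HEIGHT = 4  # ancestors above this height are ignored, as stated in the text

# Hypothetical toy taxonomy: category name -> parent category names.
# "Musicians" points back to "Jazz pianists" to mimic a cycle.
PARENTS = {
    "Jazz pianists": ["Jazz musicians", "Pianists"],
    "Jazz musicians": ["Musicians"],
    "Pianists": ["Musicians"],
    "Musicians": ["People", "Jazz pianists"],
    "People": ["Main topics"],
    "Main topics": ["Contents"],
}


def ancestor_heights(immediate, parents, max_height=MAX_HEIGHT):
    """Map every category reachable from `immediate` to its smallest height."""
    heights = {category: 0 for category in immediate}
    queue = deque(immediate)
    while queue:
        category = queue.popleft()
        height = heights[category]
        if height == max_height:
            continue  # do not climb above the cutoff
        for parent in parents.get(category, ()):
            # BFS reaches each category first along a shortest path, so
            # skipping already-seen categories also breaks cycles.
            if parent not in heights:
                heights[parent] = height + 1
                queue.append(parent)
    return heights


heights = ancestor_heights(["Jazz pianists"], PARENTS)
print(heights)
```

In this toy taxonomy, "Musicians" is recorded once at height two even though it is reachable by two paths, and "Contents" is dropped because it would sit at height five.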
During search time (i.e., upon arrival of a query-type
URL), we access both Wikipedia indices as follows. First, the
Wikipedia textual index is accessed, and the best articles
that match the query are retrieved. Then, the Wikipedia
taxonomy index is used to extract the labels of the categories
associated with the retrieved documents.
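The two-step lookup can be sketched with in-memory stand-ins for the two indices. Everything named here is an assumption made for illustration: `TEXT_INDEX`, `TAXONOMY`, `search_articles`, and `categories_for_query` are hypothetical, and the term-count ranking is a toy substitute for whatever scoring the real textual index uses.

```python
# Hypothetical textual index: article title -> indexed text and category metadata.
TEXT_INDEX = {
    "Bill Evans": {
        "text": "american jazz pianist and composer",
        "categories": ["Jazz pianists"],
    },
    "Piano": {
        "text": "keyboard musical instrument played by a pianist",
        "categories": ["Keyboard instruments"],
    },
}

# Hypothetical taxonomy index: category name -> parent category names.
TAXONOMY = {
    "Jazz pianists": ["Jazz musicians", "Pianists"],
    "Keyboard instruments": ["Musical instruments"],
}


def search_articles(query, text_index, k=2):
    """Step 1: return the k best-matching article titles (toy term-count score)."""
    terms = query.lower().split()
    scored = []
    for title, doc in text_index.items():
        words = doc["text"].lower().split()
        score = sum(words.count(term) for term in terms)
        if score > 0:
            scored.append((score, title))
    scored.sort(key=lambda pair: -pair[0])  # highest score first
    return [title for _, title in scored[:k]]


def categories_for_query(query, text_index, taxonomy, k=2):
    """Step 2: collect the category labels of the retrieved articles."""
    labels = []
    for title in search_articles(query, text_index, k):
        for category in text_index[title]["categories"]:
            # Keep only categories the taxonomy index knows by name.
            if category in taxonomy and category not in labels:
                labels.append(category)
    return labels


print(categories_for_query("jazz pianist", TEXT_INDEX, TAXONOMY, k=1))
```

Keeping retrieval and label extraction in separate functions mirrors the separation between the textual index and the taxonomy index in the text.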
ODP indices: The ODP textual index contains a single
document for each ODP category. The short description of
that category is indexed and searchable for that document.
Related work
Several parallel social network analysis platforms have been
proposed in [7, 24, 25] to process large-scale graphs.
The authors propose to accelerate the processing through
parallel computation. The focus of our work is on
provisioning an integrated analytics solution using
parallel social network analysis algorithms. The front-end
visualizations in social network profiling are
based on D3 (data-driven documents) [26], which is a
representation-transparent approach to visualization for
the web. In this work, we extended D3 to make our
visualizations finely integrated with the web environment.
Community detection has long been one of the
fundamental topics of attention for network science
researchers. Ever since the seminal paper by Girvan and
Newman [27], much work and interest has been generated
in this field. The set of algorithms that can be used to find
communities can be broadly divided into six categories [28],
namely (i) graph partitioning, (ii) hierarchical clustering,
(iii) partitional clustering, (iv) spectral clustering, (v) divisive
algorithms, and (vi) modularity-based methods. Most of
the approaches try to optimize a given objective
function. The most notable state-of-the-art algorithms
are [29] and [30]. Both of these approaches try to
maximize the given objective function, which in this case
is modularity.
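To make the objective concrete, the sketch below computes the standard Newman-Girvan modularity of a given partition, Q = sum over communities c of [L_c / m - (d_c / 2m)^2], where m is the number of edges, L_c the number of edges inside c, and d_c the total degree of the nodes in c. It is an illustration only, assuming an undirected, unweighted graph without self-loops; the function name `modularity` and the toy graph are hypothetical, and the sketch is not the method of [29] or [30], which search for a partition that maximizes this quantity.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity of a partition of an undirected, unweighted simple graph.

    edges:     list of (u, v) pairs, each undirected edge listed once
    community: dict mapping every node to its community label
    """
    m = len(edges)
    internal = defaultdict(int)  # L_c: edges with both endpoints in c
    degree = defaultdict(int)    # d_c: sum of node degrees in c
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph: two triangles joined by a single bridge edge (2, 3).
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
TWO_GROUPS = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
ONE_GROUP = {node: "a" for node in range(6)}

print(modularity(EDGES, TWO_GROUPS))  # 5/14, about 0.357
print(modularity(EDGES, ONE_GROUP))   # 0.0
```

Splitting the graph into its two triangles scores higher than lumping all nodes together, which is the behavior a modularity-maximizing algorithm exploits.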
Characterizing human mobility by analyzing anonymized
mobile phone data (typically CDRs) has recently become an active
research topic. Modani et al. [22] reviewed
the methods of collecting location data from cellular phone
networks. Isaacman et al. [10] proposed algorithms that
identify generally important places, such as home and work
locations, of subscribers. Becker et al. [31] presented a
comprehensive study on how to use CDRs to calculate
subscribers' daily travel ranges, traffic volumes, the carbon
footprint of home-to-work commutes, etc.
In this work, we extracted user interests from the mobile
browsing log using open source taxonomies such as ODP
and Wikipedia. Several papers have utilized the ODP
taxonomy along with web browsing logs for different
uses. Recently, Konopnicki and Shmueli-Scheuer [19] used
the ODP taxonomy to model user profiles based only on the
domain and URL levels of their browsing logs; in this work, we
extend the scope to utilize Wikipedia as a source and to support
dynamic queries, such as Title. The work in [32] focused on
exploiting the ODP to achieve high-quality personalized
web search based on the distance of the categories of the
returned URL to the user profile categories. The distance is
measured by hierarchical semantics and the ODP tree
structure. Tanudjaja and Mui [33]
References
1. Global Mobile Statistics 2013 Part A. [Online]. Available: http://mobithinking.com/mobile-marketing-tools/latest-mobilestats/a
2. IBM InfoSphere Streams, IBM Corporation, Armonk, NY, USA. [Online]. Available: http://www-03.ibm.com/software/products/en/infosphere-streams/
3. IBM Netezza Data Warehouse, IBM Corporation, Armonk, NY, USA. [Online]. Available: http://www-01.ibm.com/software/data/netezza/
4. IBM InfoSphere BigInsights, IBM Corporation, Armonk, NY, USA. [Online]. Available: http://www-01.ibm.com/software/data/infosphere/biginsights/
5. IBM SPSS, IBM Corporation, Armonk, NY, USA. [Online]. Available: http://www-01.ibm.com/software/analytics/spss/
6. L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the web," Stanford InfoLab, Stanford, CA, USA, Tech. Rep., 1999.
7. W. Xue, J. Shi, and B. Yang, "X-RIME: Cloud-based large scale social network analysis," in Proc. IEEE Int. Conf. SCC, 2010, pp. 506-513.
8. J. Shi, W. Xue, W. Wang, Y. Zhang, B. Yang, and J. Li, "Scalable community detection in massive social networks using MapReduce," IBM J. Res. & Dev., vol. 57, no. 3/4, pt. 12, pp. 12:1-12:14, May-Jul. 2013.
9. L. Stenneth, O. Wolfson, P. S. Yu, and B. Xu, "Transportation mode detection using mobile phones and GIS information," in Proc. 19th ACM SIGSPATIAL Int. Conf. Adv. Geogr. Inf. Syst., 2011, pp. 54-63.
10. S. Isaacman, R. Becker, R. Caceres, S. Kobourov, M. Martonosi, J. Rowland, and A. Varshavsky, "Identifying important places in people's lives from cellular network data," in Proc. 9th Int. Conf. Pervasive Comput., 2011, pp. 133-151.
11. J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. San Mateo, CA, USA: Morgan Kaufmann, 2011.
12. W. Dong, L. Li, C. Zhou, Y. Wang, M. Li, C. Tian, and W. Sun, "Discovery of generalized spatial association rules," in Proc. IEEE Int. Conf. SOLI, 2012, pp. 60-65.
13. W. Dong, W. Fan, L. Shi, C. Zhou, and X. Yan, "A general framework to encode heterogeneous information sources for contextual pattern mining," in Proc. ACM Int. CIKM, 2012, pp. 65-74.
14. URL Types - The URL Clearinghouse. [Online]. Available: http://urlclearinghouse.wikidot.com/types
15. X. Qi and B. D. Davison, "Web page classification: Features and algorithms," ACM Comput. Surv., vol. 41, no. 2, pp. 1-31, Feb. 2009.
16. D. Cohn and T. Hofmann, "The missing link - A probabilistic model of document content and hypertext connectivity," in Proc. Adv. NIPS, 2001, pp. 430-436.
17. Open Directory Project (ODP). [Online]. Available: http://www.dmoz.org/
18. Wikipedia. [Online]. Available: http://www.wikipedia.org/
19. D. Konopnicki and M. Shmueli-Scheuer, "Customer analyst for the telecom industry," in Large-Scale Data Analytics. New York, NY, USA: Springer Science and Business Media, 2014.
20. Articles With Open Directory Project Links. [Online]. Available: http://en.wikipedia.org/wiki/Category:Articles_with_Open_Directory_Project_links
21. Wikipedia Mapping. [Online]. Available: http://projects.dmoz.org/project.cgi?id=7
David Konopnicki IBM Research Division, IBM Research - Haifa, Haifa University Campus, 31905 Haifa (davidko@il.ibm.com).
Dr. Konopnicki manages the Information Retrieval Group in IBM
Research - Haifa and has been involved in unstructured content analytics
both from a theoretical and a practical point of view. In academia,
Dr. Konopnicki developed search systems for the early web. In the IBM
Software Group, and in IBM Research, he has been leading a variety
of projects: development of large-scale full-text search engines,
building customer profiles from enterprise and social media sources,
massive-scale analytics with applications to Telco companies, and more.
Dr. Konopnicki is an IBM Master Inventor and holds a Ph.D. degree in
computer science from the Technion-Israel Institute of Technology.
Michal Shmueli-Scheuer IBM Research Division, IBM
Research - Haifa, Haifa University Campus, 31905 Haifa
(shmueli@il.ibm.com). Dr. Shmueli-Scheuer is a Researcher in the Information
Retrieval Group in IBM Research - Haifa. Dr. Shmueli-Scheuer
received her Ph.D. degree in information and computer science at
the University of California, Irvine, in 2009. Her area of expertise is in
the fields of large-scale analytics, database, and information systems,
focusing on user-behavior analytics and information management on
the web. She has authored numerous papers on data management and
information retrieval in leading conferences.