SoLoMo Analytics For Telco Big Data Monetization - 06964900

SoLoMo analytics for telco
Big Data monetization

The mobile Internet brought tremendous opportunities for businesses
to capitalize on the vast amount of SoLoMo (social-location-mobile)
data for delivering high-quality and personalized customer
services. In this paper, we describe algorithms and technologies for
discovering actionable customer insights using the combined power
of social network, location pattern mining, and mobile usage
analysis. We illustrate our implementation using Big Data platforms
including IBM InfoSphere* BigInsights, IBM InfoSphere Streams,
and IBM Netezza* Data Warehouse, while addressing various
Big Data-related challenges, such as context generation of
unstructured data and high-performance analytics for both data at
rest and data in motion. The presented system combines location,
social interactions, and user behavior data to find like-minded
communities. The system leverages Big Data capabilities in an
attempt to scale efficiently to the subscriber base of large
telecoms.
Introduction
In 2013, the mobile-phone user base reached almost
93% of the global population, with more than one billion
smartphones in use [1]. Those devices have created a whole
ecosystem of applications and services, which have rapidly
changed our daily lives and continue to transform our
society. This ongoing mobile revolution has brought
tremendous opportunities for businesses to capitalize on the
vast amount of data that is being generated through mobile
phone usage. For example, phone calls are evidence of a
social link between users. The applications used, or web
pages visited, give hints of a user's topical interest, activity,
or commercial intention. The location of the device can
also be used to enrich data about customers. Hence,
the electronic traces of a mobile phone recorded by a telecom
can be used to give deep insights about people's interests,
lifestyles, and social patterns. The need to analyze this data
motivated us to develop a SoLoMo (social-location-mobile)
analytics solution, which exploits knowledge of a user's
social network, locations, and online usage patterns to
provide meaningful insights in a reasonable timeframe.
Digital Object Identifier: 10.1147/JRD.2014.2336177
H. Cao, W. S. Dong, L. S. Liu, C. Y. Ma, W. H. Qian, J. W. Shi, C. H. Tian, Y. Wang, D. Konopnicki, M. Shmueli-Scheuer, D. Cohen, N. Modani, H. Lamba, A. Dwivedi, A. A. Nanavati, M. Kumar
One of the main challenges for processing this data is
the size and the speed at which it is being generated. Every
day, billions of electronic records are generated by mobile
phone users through the phone calls they make, web
activities they perform, the changes in their location, etc.
New Big Data systems and technologies are required to
handle such data (most of which are perishable) in a highly
scalable, cost-effective, and fault-tolerant fashion. In this
paper, we demonstrate our Big Data solution that is designed
to process mobile data and discover actionable customer
insights using the combination of social networks, location
pattern mining, and mobile usage analysis. We also
discuss details of the implementation, which was designed
on top of Big Data platforms, including IBM InfoSphere*
BigInsights, IBM InfoSphere Streams [2], and IBM
Netezza* Data Warehouse [3].
Specifically, we present a unified SoLoMo analysis
approach that can enable insights that were previously not
possible. In particular, we introduce a method for finding
like-minded communities that factor in the location
data as well as derived interests along with the social
interaction data. By taking location into account, we can find
communities of users that are not only connected but also
share the same physical spaces. Using derived interests from
the mobile browsing history can help us find communities
that are not only connected but also have similar interests.
The like-minded communities we thus find can be leveraged
wherever a word-of-mouth or peer-pressure marketing
strategy is deemed appropriate, e.g., recommendations or
viral marketing. Our focus in this paper is to present a
system-oriented view, and due to space constraints, we do
not present results on the efficiency or accuracy of the
algorithms. Detailed experiments with real-world data would
provide a fertile territory for future research.
Copyright 2014 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without
alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied by any means or distributed
royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.
0018-8646/14 © 2014 IBM
IBM J. RES. & DEV.
VOL. 58
NO. 5/6
PAPER 9
SEPTEMBER/NOVEMBER 2014
H. CAO ET AL.
9:1
Figure 1
SoLoMo architecture. (RT: real-time.)
The privacy policies and laws with respect to data mining
of phone information do vary from country to country, and
any such analysis would be expected to take into account
such laws and sensitivities.
System architecture
The primary motivation for building the tool was to address
the issues faced in analyzing the huge amount of data
generated by a telecom every day. The data in the telecom
domain is unique and creates challenges related not only
to the volume of the data but also to its
variety and velocity. Therefore, it was crucial
to develop a system that can effectively deal with all the
above-mentioned characteristics.
Our proposed Big Data solution uses a modular
architecture to address the challenges posed in this domain.
The functionalities provided by the solution turn vast
amounts of low-value inaccurate data into high-value
customer insights.
Figure 1 presents the five key components (marked with
numbers in the figure) in the solution, as well as how this
solution integrates with campaign management systems in
the enterprise.
IBM InfoSphere Streams component
The IBM InfoSphere Streams component (denoted as 1 in
Figure 1) is responsible for providing real-time performance
on a continuous stream of data, such as network data and
call data records (CDR). Real-time responses are essential
for some of the key functionalities of the solution to be
carried out effectively. For example, an event detected on the
basis of location and mobile usage must be analyzed in real
time so that a location-based service can be created
instantaneously. We use
IBM InfoSphere Streams [3] to provide such key features.
The Streams-processed events are archived into a file-storage
format, and the analyzed data is saved in the IBM Netezza
Data Warehouse.
IBM InfoSphere BigInsights component
To analyze the vast amount of CDR data, an IBM InfoSphere
BigInsights [4] component (denoted as 2 in Figure 1) is used.
BigInsights provides a Hadoop** MapReduce framework
on top of which are implemented key functionalities such as
general social network analytics (SNA) and spatial network
analytics algorithms. These implementations help to
efficiently extract each individual's social neighborhood
and key location characteristics. Note that the base Hadoop
platform does not support social network analysis, and we
add the parallel social network analysis accelerator to support
the graph analysis. The compressed and cleaned customer
data is then stored in a data warehouse.
Telco analytics data warehouse/datamart component
After IBM InfoSphere BigInsights and IBM InfoSphere
Streams processing, the higher value insights are stored in a
data warehouse with facts along three key dimensions. The
social dimension contains the user's social influence score
and community affiliation. The location dimension stores
facts about the user's time and location. The mobile usage
dimension stores the user's top websites, mobile apps, and
interests. In this component (denoted as 3 in Figure 1) in
particular, we fully leverage the IBM Netezza system, which
provides parallelism as well as integration capability with
IBM SPSS* [5] to allow deep statistics and clustering
algorithms to efficiently run on top of these base facts, so
further insights can be derived. Another key feature of
Netezza is that it natively supports in-database MapReduce.
The inputs and outputs of MapReduce jobs are based on
database tables.
Modeling environment
The new customer segments such as community, mobility,
and lifestyle are discovered using graph algorithms and
clustering models implemented through IBM SPSS (denoted
as 4 in Figure 1). We describe these analytical models in
subsequent sections.
Visualization
Finally, the solution also includes several web-based
visualization components (denoted as 5 in Figure 1) that
allow the end user of the solution to efficiently consume the
SoLoMo insights. Those include visualizations related to
social networks, location, and aggregation of the three key
dimensions. We describe these in the sections below.
Social networks profiling

We construct a social network of customers from call data
records (CDRs) to mine social network features, which can
provide deep insights into customer roles and behaviors on
the social network, such as their social influence and the
community to which they belong. These features can be
used in various applications like viral marketing, customer
targeting, churn prediction, etc. Figure 2 shows the
architecture of the social networks profiling module.
Social network features
The individual social network features describe user social
influence, activity, etc. Such features are defined based on
the key performance indicators (KPIs) such as PageRank [6],
in-degree, out-degree, etc., which are derived from SNA.
The group-based social network features represent the
characteristics of communities with strong and weak
connections. These features are defined based on group-based
SNA KPIs like communities, cliques, k-cores, etc., which
are derived from group-based SNA. The layered approach
shown in Figure 2 allows developers to customize their
own social network features based on the SNA KPIs.
Parallel social network analysis
As shown in Figure 2, the parallel social network analysis
is built upon the MapReduce computation platform. Above
the MapReduce computing framework, we have a graph
data model and the message passing framework (MPF)
as the unified SNA framework. The graph data model is
object-oriented and is represented by the adjacency list.
MPF is a unified architecture for parallel graph analysis
algorithms. SNA developers can use the graph object
data model and MPF to implement arbitrary graph
analysis algorithms.
In the SNA KPIs layer, we implemented typical SNA
algorithms such as weakly connected components, k-core,
maximal cliques, community detection, etc., using the graph
object model and MPF. The details of these parallel
algorithms are described in [7, 8].
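As a rough illustration of the message-passing style on an adjacency-list graph, the sketch below computes weakly connected components by repeatedly passing the smallest known label between neighbors. It is not the MPF API (which is described in [7, 8]); the function name and the single-process loop are assumptions made for illustration, and a MapReduce implementation would distribute each round.

```python
def weakly_connected_components(adjacency):
    """Label each node with the smallest node id in its weakly connected component.

    adjacency: dict mapping node -> iterable of successor nodes (directed edges).
    """
    # Weak connectivity ignores edge direction, so build undirected neighbor sets.
    neighbors = {node: set() for node in adjacency}
    for src, dsts in adjacency.items():
        for dst in dsts:
            neighbors.setdefault(src, set()).add(dst)
            neighbors.setdefault(dst, set()).add(src)

    labels = {node: node for node in neighbors}
    changed = True
    while changed:
        changed = False
        # Message phase: every node sends its current label to its neighbors.
        inbox = {node: [] for node in neighbors}
        for node, nbrs in neighbors.items():
            for nbr in nbrs:
                inbox[nbr].append(labels[node])
        # Update phase: every node keeps the smallest label it has seen so far.
        for node, messages in inbox.items():
            best = min(messages, default=labels[node])
            if best < labels[node]:
                labels[node] = best
                changed = True
    return labels
```

Nodes that end up with the same label belong to the same component; isolated nodes keep their own id.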
Front-end visualization
Front-end visualization for social network profiling
represents social networking features from two perspectives.
In particular, the community graph visualization represents
group-based social features, such as the closeness of
community. The ego-network visualization depicts
individual social features, such as PageRank, in-degree,
and out-degree, etc.
We divide the problem of community-graph rendering into
two sub-problems: 1) how to cluster the nodes within the
same community and 2) how to distinguish individual
features among the group. While the first is a common
problem of the graph layout, the second problem is related
to node visualization methods that involve information
visualization and graph-drawing techniques. We use a
multidimensional scaling (MDS) graph layout to assign
locations to individual nodes in multidimensional spaces,
such that individual nodes that are in the same community
are close to one another. Furthermore, individual nodes are
displayed in different sizes and colors. Different colors
correspond to different communities, and different node
sizes represent different levels of influence.
Figure 2
Parallel social network analysis architecture. (SNA: social network analysis; KPI: key performance indicator; HDFS: Hadoop Distributed File System;
HITS: hyperlink-induced topic search.)
The ego network for a given individual in a network
is defined as the subgraph that represents all the direct
relationships and two-step relationships between the selected
individual and others. It indicates the impact of the selected
individual on others.
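The ego-network definition above can be sketched as a bounded breadth-first search. The function name and the dict-of-sets graph representation are assumptions for illustration; the sketch also assumes that an edge belongs to the ego network only if it lies on a path of at most two steps from the selected individual, which excludes edges between two nodes that are both exactly two steps away.

```python
def ego_network(adjacency, ego, max_hops=2):
    """Return (nodes, edges) reachable from ego within max_hops steps.

    adjacency: dict mapping node -> set of neighbor nodes (undirected graph).
    """
    distance = {ego: 0}
    frontier = [ego]
    for hop in range(1, max_hops + 1):
        next_frontier = []
        for node in frontier:
            for nbr in adjacency.get(node, ()):
                if nbr not in distance:
                    distance[nbr] = hop
                    next_frontier.append(nbr)
        frontier = next_frontier

    nodes = set(distance)
    # Keep an edge only if at least one endpoint is closer than max_hops,
    # i.e. the edge lies on a path of length <= max_hops starting at ego.
    edges = {
        tuple(sorted((u, v)))
        for u in nodes
        for v in adjacency.get(u, ())
        if v in nodes and min(distance[u], distance[v]) < max_hops
    }
    return nodes, edges
```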
Location-based profiling
Location information can provide useful insights for building
customer profiles. CDRs contain location information that
reflects movements and behaviors of subscribers. Although
the location data in CDRs is typically at cell granularity,
which is spatially coarse and not as precise as GPS
(Global Positioning System) data, this data is still useful
in characterizing the mobility of users. There are numerous
mobility features that can be extracted from the CDR
sequence to build customer profiles from the location
perspective. We classify the features into three levels, as
shown in Figure 3.
Location features
We now describe the location features that we extracted
from the CDRs. The first of these features are the low-level
features (with physical context) that are directly extracted
by statistics from the data layer, where the raw CDR data
is combined with the map reference data. These features
include the visiting frequency (which can be represented by
hotspots on a map), the top-k frequently visited POI (points of
interest) at specific location levels, and the top-k frequently
visited POI types at aggregated concept levels, etc. With
varying time window bases, these features can characterize
customers' behaviors at different time scales (e.g., last
week, last month, etc.) and based on different temporal
characteristics (e.g., working hours on working days,
weekends and holidays, etc.).
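A minimal sketch of one such low-level feature, the top-k frequently visited POI within a time window, optionally restricted to certain hours and weekdays. The function name, the (timestamp, POI) input shape, and the parameter names are assumptions for illustration; mapping raw cell identifiers to POIs is assumed to have happened already.

```python
from collections import Counter
from datetime import datetime


def top_k_pois(visits, k, start, end, hours=None, weekdays=None):
    """Count POI visits in [start, end) and return the k most frequent.

    visits:   iterable of (timestamp: datetime, poi: str)
    hours:    optional collection of allowed hours (0-23)
    weekdays: optional collection of allowed weekdays (Monday is 0)
    """
    counts = Counter()
    for ts, poi in visits:
        if not (start <= ts < end):
            continue
        if hours is not None and ts.hour not in hours:
            continue
        if weekdays is not None and ts.weekday() not in weekdays:
            continue
        counts[poi] += 1
    return counts.most_common(k)
```

Calling it with `hours=range(9, 18)` and `weekdays=range(0, 5)` would approximate "working hours on working days"; holiday calendars are left out of this sketch.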
The middle-level features (with semantics context) are
derived using probabilistic modeling and data mining
techniques from the low-level features and the data layer.
Such features include daily ranges of travel, speed (average
and instant), likely transportation mode [9], likely home
location(s) and work location(s) [10], popular routes given
origin-destination (OD) pairs (e.g., home-to-work routes),
likely OD pairs during a particular time period, and
different types of geometric features of trajectories, etc.
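Two of the simpler middle-level features, daily range of travel and average speed, can be sketched from a sequence of cell-tower coordinates. The definitions used here are assumptions, not taken from the paper: daily range is taken as the largest great-circle distance between any two positions observed in a day, and average speed as path length divided by elapsed time. The probabilistic models behind features such as transportation mode [9] or home and work locations [10] are not reproduced.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def daily_range_km(points):
    """Largest pairwise distance among one day's observed positions."""
    return max((haversine_km(p, q) for p, q in combinations(points, 2)), default=0.0)


def average_speed_kmh(track):
    """track: list of (hours_since_start, (lat, lon)) sorted by time."""
    if len(track) < 2:
        return 0.0
    dist = sum(haversine_km(a[1], b[1]) for a, b in zip(track, track[1:]))
    elapsed = track[-1][0] - track[0][0]
    return dist / elapsed if elapsed > 0 else 0.0
```

Because CDR positions are at cell granularity, values computed this way are coarse estimates.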
The high-level features (with application context) are
defined by application-driven methods. Different applications
may focus on different aspects of the customers' mobility features.
For instance, the managers of a shopping mall may not
care about how far the target customers travel in a day, but
may be more interested in which competitor shopping
centers the target customers visit often. Such features can be
defined, e.g., by combinations of low-level and middle-level
features, incorporating application-driven thresholds.
Figure 3
Location features extracted from CDR for location-based profiling. (OD: origin-destination; POI: point-of-interest.)
Location analysis
Based on CDR data, the location analytics component
consists of two parts: (1) off-line location analytics on
data-at-rest, which extracts mobility features and models
the customers' moving patterns, and (2) real-time location
analytics for customer targeting.
Off-line location analytics
The offline location analytics component maintains a
subscriber data model as the base for analyses. The raw CDR
data is first processed by BigInsights, and the parsed data
representing the trajectories is stored in a data warehouse.
The aforementioned three-level mobility features are
then calculated according to predefined time-window
configurations; thus, the customer profiles can adapt
over time as new data arrives. Data mining techniques
provided by IBM SPSS are adopted in the data manipulations
and the advanced pattern analyses. The two most
typical moving-pattern modeling tasks are customer
micro-segmentation and association rule mining.
The customer micro-segmentation can be done by
rule-based filtering or unsupervised learning. Filtering is
preferable if the application user has clear requirements
on what kind of customers they are targeting, and the
provided features sufficiently characterize the target
customers. Unsupervised learning such as clustering
(e.g., k-means and DBSCAN [density-based spatial
clustering of applications with noise]) [11] is done based
on a feature set that the user is interested in. The subscriber
clustering results typically are a number of subscriber groups.
The subscribers in the same group behave similarly in the
feature space, whereas the subscribers in different groups
behave differently.
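To make the clustering option concrete, here is a small pure-Python sketch of k-means (Lloyd's algorithm) over subscriber feature vectors. It is only an illustration: the paper uses the clustering provided by IBM SPSS, and the deterministic initialization (the first k points) and the two hypothetical features in the usage note are assumptions.

```python
def k_means(points, k, iterations=100):
    """Cluster equal-length numeric tuples with Lloyd's algorithm.

    Returns (assignment, centroids), where assignment[i] is the cluster
    index of points[i].
    """
    # Deterministic initialization for the sketch: use the first k points.
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids
```

Each point could be, for example, a (daily range, weekly mall visits) pair; features on different scales would normally be normalized first.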
The association rule mining aims to provide additional
insights into customers' behaviors. By combining additional
data sources such as the pay-by-card data or mobile usage
data, the analytics component is capable of determining
the associations between behaviors (e.g., credit card
consumption, web browsing, etc.) and the spatiotemporal
contexts [12, 13]. Typical spatiotemporal contexts
include the attributes of location, e.g., what kinds of POI
are around, which can be dynamic and changing over
time. As mentioned earlier, because CDR location data are
typically at cell granularity, the location uncertainty may
introduce some noise while carrying out location analytics.
However, empirical results show that the noise is often,
if not always, tolerable in practice. The temporal
characteristics of a behavior, such as day of week and
hour in a day, are also important. For example, a typical
association rule found from data can be:
Sunday ∧ noon ∧ xxx mall → shopping ∧ have lunch (25%, 86%)
The first number in the parentheses is the rule support,
and the second is the rule confidence [11]. This rule is
interpreted as: (1) 25% of all subscribers go to xxx mall
for lunch every Sunday at noon, and (2) once a subscriber
visits the mall on Sunday at noon, there is an 86% probability
that he/she will buy something as well as have lunch. Such
an analysis can provide predictive rules for a campaign.
If it is known that some customers regularly go out (i.e., have a
high probability of going out) for dinner on weekends, a coupon
sent on a Friday afternoon or evening may significantly
impact a customer's final decision. This helps the user
to reach the target customers at the best time before the
purchase really happens.
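For reference, support and confidence of a rule such as the one above can be computed as follows. This is the standard textbook definition, not the SPSS implementation; the function name and the representation of each record as a set of items (context attributes plus behaviors) are assumptions for illustration.

```python
def rule_support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    transactions: list of item collections, one per observed record.
    support    = fraction of records containing antecedent and consequent
    confidence = fraction of antecedent-matching records that also
                 contain the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence
```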
Real-time location analytics
The incoming data stream from the offline analytics contains
location information (e.g., GPS location of a cell-tower).
Based on the patterns and rules discovered from the off-line
analytics, real-time location analytics processes the incoming
data stream in real time to (1) calculate or infer the precise
location from multiple sources of location data with different
precisions, including 2G/3G/4G CDR, available GPS
tracking, and WiFi offload, etc., and (2) trigger predictive
model scoring with the inferred real-time location and other
spatiotemporal contexts as inputs. If a subscriber is regarded
as satisfying the promotion condition, then additional
promotion actions will be taken, e.g., sending a coupon
to the subscriber's cell phone number.
Mobile usage profiling

Telecom companies derive user profiles from both structured
and unstructured data such as user demographics and
analysis of CDRs. While social network and location analysis
usually handles structured CDR data, mobile usage analysis
examines web browsing activity, extracted from the
unstructured elements of Event Data Records (EDRs). EDRs
extend CDRs beyond voice calls, and they capture various
telecom network activities such as sent messages, web
browsing, movie downloads, etc.
From the browsing activity, we can understand the
user interests and utilize them for customer targeting,
micro-segmentation, and more. For example, users accessing
www.nba.com are likely to be basketball fans and can be
targeted with promotions of basketball-related products. We
perform categorization and aggregation for utilizing web
browsing activity information: browsed pages are first
effectively categorized, thereby transforming opaque URLs
(uniform resource locators) into meaningful categories; later
on, categories are carefully aggregated into comprehensive
user profiles. In the following sections, we describe these two
phases for generating user profiles based on mobile usage.
Mobile usage features
With respect to high-level features, telecom companies
monitor the data traffic that traverses their systems. In
particular, each HTTP (Hypertext Transfer Protocol) request
is recorded in a system log, containing all users' interactions
with web pages (documents). A log record of the form
⟨user, document-url, context⟩ captures a single "user-document"
association, in some context. Note that context
captures metadata extracted from the association, e.g., time,
date, user agent, and content type.
With low-level features, every web page has a unique URL
made of mandatory scheme and domain along with optional
port, path, and query: "scheme://domain:port/path?query".
For brevity, we now ignore the scheme and port parts.
We define three URL levels: domain, path, and query,
respectively, made of just domain, domain with path, and all
three. Similar to [14], we denote query URLs as dynamic
queries. In this spirit, path URLs may contain information
about static queries, which we also denote as title queries.
Dynamic queries are typical for search engines such as
Google** and Yahoo!**, as well as for internal searches
of corporate networks. For example, from this query-level
URL http://www.google.com/search?q=starbucks+menu we
extract the dynamic query words "starbucks" and "menu".
Title queries usually originate from URLs that represent articles,
where we split the path into query words, separated by "-"
or "_". For example, from
http://news.yahoo.com/will-states-accept-obama-s-insurance-exchange-fix214110316.html,
a news article about Obama's insurance
exchange fix, we extract the title-query words "will",
"states", "accept", "obama", "insurance", "exchange", and
"fix".
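The three URL levels and the two kinds of query words can be sketched with Python's standard `urllib.parse`. The function names are hypothetical, the query parameter name `q` is an assumption that fits the Google example only, and the rule for discarding non-alphabetic or single-letter fragments in title queries is a guess at the kind of filtering involved, not the paper's actual logic.

```python
import re
from urllib.parse import urlsplit, parse_qsl


def url_levels(url):
    """Return the domain-, path-, and query-level forms of a URL
    (scheme and port are ignored, as in the text)."""
    parts = urlsplit(url)
    domain = parts.hostname or ""
    path_level = domain + parts.path.rstrip("/")
    query_level = path_level + ("?" + parts.query if parts.query else "")
    return {"domain": domain, "path": path_level, "query": query_level}


def dynamic_query_words(url, param="q"):
    """Extract the words of a dynamic query from the given query parameter."""
    for key, value in parse_qsl(urlsplit(url).query):
        if key == param:
            return value.lower().split()
    return []


def title_query_words(url):
    """Split the last path segment on '-' or '_' and keep alphabetic words."""
    last_segment = urlsplit(url).path.rsplit("/", 1)[-1]
    stem = last_segment.rsplit(".", 1)[0]  # drop a file extension such as .html
    return [w for w in re.split(r"[-_]", stem.lower()) if w.isalpha() and len(w) > 1]
```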
Mobile usage analysis
In this section, we describe the analysis flow for mobile
usage as depicted in Figure 4.
Modeling web pages
Common approaches to modeling web page data [15, 16]
extract page content and metadata such as title, hyperlinks,
and layout and apply categorization of this information. Our
approach makes use of two public taxonomies: DMOZ
Open Directory Project (ODP) [17] and Wikipedia** [18].
In the next paragraphs, we describe these open data sources
and how they are utilized.
The ODP (Open Directory Project), one of the largest
collaborative sources for manually annotated web pages,
categorizes more than 4 million web pages into more than
590,000 categories. Categories such as arts, business, and
computers are expressed as a tree, where subcategories
represent more specific concepts than their parents. For
example, the branch "Top/Arts/Television/Networks" is split
into two subcategories, "Top/Arts/Television/Networks/Cable"
and "Top/Arts/Television/Networks/Satellite". Each
category contains URL links to related web sites, along
with a short description of each website. Most ODP URLs
are of type domain or path. For example, the domain URL
money.cnn.com and the path URL www.cnn.com/CNN/Programs
are both associated with the ODP category
"Arts/Television/Networks/Cable/CNN". Wikipedia is a dynamic
collaborative free encyclopedia, containing more than
4 million articles, and more than 900,000 categories that
are structured as a graph. Being highly dynamic, Wikipedia
quickly reflects newly emerging events and concepts. Since
Wikipedia articles tend to be quite wordy, Wikipedia is well
suited for categorizing query URLs, both dynamic-query and
title-query URLs (e.g., emerging topics).
Figure 4
Mobile usage analysis flow.
URL pattern analyzer
For the processing of dynamic-query and title-query URL
forms and for handling some special case URLs, we employ
a hierarchical pattern analysis phase. Each pattern is driven
by a regular expression and may depend on other patterns.
Listing 1 shows an example pattern, applied for detecting
dynamic-query URLs and extracting the query words. The
element "Res" enables specifying which part of the regular
expression is the result string, while the "Sep" field dictates
how to split that string into words. The "Good Examples"
and "Bad Examples" elements allow validating patterns at
system startup. Finally, the "Type" element classifies the
detected pattern, dictating which categorization logic should
be applied. The type value is one of "Title Query", "Dynamic
Query", "Explicit", and "Page". Matches of "Title Query"
and "Dynamic Query" are handled by searching for
categories over textual indices of ODP and Wikipedia.
"Explicit" matches are special patterns tailored to match
certain families of URLs to a predefined category, e.g., the
"Adult" category. "Page" matches usually indicate that none
of the other pattern types applied. We expand on handling the
pattern types in the categorization section.
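A pattern with the elements described above might look like the following sketch. The dictionary layout, the field spellings, the regular expression, and the helper names are all hypothetical; they are not a reconstruction of Listing 1. "Res" is assumed here to be the index of the capture group that holds the result string.

```python
import re

# Hypothetical pattern definition using the element names from the text.
GOOGLE_SEARCH_PATTERN = {
    "Regex": r"^https?://www\.google\.com/search\?(?:.*&)?q=([^&]+)",
    "Res": 1,               # capture group that holds the result string
    "Sep": r"\+|%20",       # how to split the result string into words
    "Type": "Dynamic Query",
    "Good Examples": ["http://www.google.com/search?q=starbucks+menu"],
    "Bad Examples": ["http://www.google.com/maps"],
}


def apply_pattern(pattern, url):
    """Return (type, words) if the URL matches the pattern, else None."""
    match = re.search(pattern["Regex"], url)
    if match is None:
        return None
    result = match.group(pattern["Res"])
    words = [w for w in re.split(pattern["Sep"], result) if w]
    return pattern["Type"], words


def validate_pattern(pattern):
    """Startup check: good examples must match, bad examples must not."""
    for url in pattern["Good Examples"]:
        if apply_pattern(pattern, url) is None:
            raise ValueError("good example did not match: " + url)
    for url in pattern["Bad Examples"]:
        if apply_pattern(pattern, url) is not None:
            raise ValueError("bad example matched: " + url)
```

Running `validate_pattern` for every pattern at startup catches a broken regular expression before any traffic is categorized. Dependencies between patterns (the hierarchical aspect) are not modeled in this sketch.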
Categorization
In this section, we describe the process of categorization that
consists of creating indices from the ODP and Wikipedia
sources, and the categorization flow.
Listing 1
Example of pattern analyzer.
Utility indices
A few utility indices are prepared in advance and
consulted during categorization: the Wikipedia taxonomy,
Wikipedia textual index, ODP textual index, and ODP URL
index.
Wikipedia indices: The Wikipedia taxonomy index
captures the category hierarchy of Wikipedia, with each
category accessible by name and pointing to all its parent
categories. The Wikipedia textual index contains a document for
each Wikipedia non-category document (i.e., articles that
appear in Wikipedia). The text of that document is indexed
and searchable, and the Wikipedia categories of that
document are saved as metadata of that document. In
addition to its immediate categories, their ancestor categories
are indexed as metadata of that document, thereby
maintaining the entire ancestor category set for the document,
up to a certain predefined height. Our experiments indicated
that in Wikipedia, ancestor categories of heights larger
than four diverge significantly from the original immediate
category; therefore, we ignore higher ancestors. The ancestor
height is maintained in the index: zero for the immediate
category, one for its parent categories, etc. In this process,
cycles that evidently exist in the Wikipedia taxonomy
are avoided by keeping only the shortest path to an ancestor.
During search time (i.e., upon the arrival of a query-type
URL), we access both Wikipedia indices as follows: the
Wikipedia textual index is accessed, and the best articles
that match the queries are retrieved; then the Wikipedia
taxonomy index is used to extract the labels of the categories
associated with the retrieved documents.
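The ancestor-height bookkeeping described above (height zero for immediate categories, shortest path only, cut-off at height four) can be sketched as a breadth-first traversal of the parent links. The function name and the dict-based taxonomy are assumptions for illustration, and the category names in the usage note are made up.

```python
from collections import deque


def ancestor_heights(parents, categories, max_height=4):
    """Map each ancestor category to its smallest height above a document.

    parents:    dict mapping category -> list of parent categories
    categories: the document's immediate categories (height 0)
    """
    heights = {c: 0 for c in categories}
    queue = deque(categories)
    while queue:
        current = queue.popleft()
        if heights[current] == max_height:
            continue  # ignore ancestors above the cut-off height
        for parent in parents.get(current, ()):
            # In a breadth-first traversal the first visit is along a
            # shortest path, so skipping visited nodes also breaks cycles.
            if parent not in heights:
                heights[parent] = heights[current] + 1
                queue.append(parent)
    return heights
```

For a taxonomy in which "Espresso" has parent "Coffee" and "Coffee" lists "Espresso" among its parents (a cycle), the traversal still terminates and assigns "Coffee" height 1.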
ODP indices: The ODP textual index contains a single
document for each ODP category. The short description of
that category is indexed and searchable for that document.
The ODP URL index also contains a document for each ODP
category. The URLs associated with an ODP category are
indexed into two searchable fields: the entire URL phrase
is added to the "Complete URL" field, and if the URL has
no query part, it is also added to the "Domain Suffix Path
Prefix" (DSPP) field. For example, the URL
http://www.x.y.z/a/b/c?q=r is added as is to the "Complete URL" field but
not to the DSPP field, while the URL http://www.x.y.z/a/b/c
is added to both fields.
At search time, when accessing the ODP URL index, a
special tokenization applies for searching the DSPP field. For
example, the URL http://www.x.y.z/a/b/c?q=r is tokenized
into {"$$x.y.z/a/b/c", "$$x.y.z/a/b", "$$x.y.z/a", "$$x.y.z",
"x.y.z", "$$y.z", "y.z"}, and the longest match of these
tokens is returned. Note a few aspects of this tokenization:
only the domain and path of the URL are considered; the
first generic domain component ("www") is ignored; the
"$$" string marks an entire domain part; first the path is
trimmed into its prefixes, then the domain is trimmed into
its suffixes; and the domain trimming stops at a domain name
with two components.
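The search-time tokenization can be sketched so that it reproduces the token list given in the example above, longest token first. The function names are hypothetical, and the way indexed entries are stored (here, a plain dict keyed by token) is an assumption; the text does not specify how the index side is tokenized.

```python
from urllib.parse import urlsplit


def dspp_tokens(url):
    """Tokenize a URL for the DSPP field, longest candidate first."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    if len(labels) > 2 and labels[0] == "www":
        labels = labels[1:]  # ignore the generic first domain component
    domain = ".".join(labels)
    segments = [s for s in parts.path.split("/") if s]

    tokens = []
    # Path prefixes attached to the full domain, longest first.
    for end in range(len(segments), 0, -1):
        tokens.append("$$" + domain + "/" + "/".join(segments[:end]))
    # Domain suffixes, longest first, stopping at two components.
    for start in range(0, max(len(labels) - 1, 1)):
        suffix = ".".join(labels[start:])
        tokens.append("$$" + suffix)
        tokens.append(suffix)
    return tokens


def longest_match(url, dspp_index):
    """Return the category of the first (longest) token found in the index."""
    for token in dspp_tokens(url):
        if token in dspp_index:
            return dspp_index[token]
    return None
```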
Categorization flow
In this component, each input URL is passed through
cascading logic, and the first step that succeeds sets the
result category. We now explain the cascading steps.
Step 1: Complete URL. We search for the complete URL
in the "Complete URL" field of the ODP URL index.
Step 2: URL pattern analysis. Analysis is applied on
the URL, and the URL is handled according to the result
type of the pattern. For the "Explicit" type, the explicitly
specified category is returned. For the "Dynamic Query"
and "Title Query" types, the extracted words are used
for constructing search queries, first for the ODP textual
index and then for the Wikipedia taxonomy and textual
indexes. Upon matching in the ODP textual index, the top
result category is returned. Otherwise, first a category by
the same name is searched for in the Wikipedia taxonomy.
If one exists, it is returned. Otherwise, upon a match in the
Wikipedia textual index, the top 100 result documents
are selected, and their indexed ancestor categories (up
to height 4) are accumulated, taking into account both
document scores and ancestor heights. This results in a set
of candidate categories, each with a score and name. To
further select a meaningful category, a two-pass voting
process is performed, in which candidate categories
vote for each other: first, each category propagates its
score to the words that make up its name, and then each
word propagates (back) its score to all the categories that
contain it. The top scored category is returned.
Step 3: ODP URL search. We search the DSPP field of the
ODP URL index, tokenizing the input URL as explained
above, and returning the longest (and hence first) match, if
one exists.
Step 4: Fail.
Table 1 Example of user profile with top-three categories.
Experiments with various logs from different geographies
achieve average coverage of 87%. The quality is discussed
in [19].
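The two-pass voting used in Step 2 can be sketched as follows. The function name, the whitespace tokenization of category names, and the plain summation of scores are assumptions; how document scores and ancestor heights are combined into the initial candidate scores is not specified in the text and is taken as given input here.

```python
from collections import defaultdict


def two_pass_vote(candidates):
    """Pick a category by letting candidates vote for each other.

    candidates: dict mapping category name -> accumulated score.
    Returns (top_category, voted_scores).
    """
    # Pass 1: each category propagates its score to the words in its name.
    word_scores = defaultdict(float)
    for name, score in candidates.items():
        for word in set(name.lower().split()):
            word_scores[word] += score
    # Pass 2: each word propagates its score back to every category
    # whose name contains it.
    voted = {
        name: sum(word_scores[w] for w in set(name.lower().split()))
        for name in candidates
    }
    return max(voted, key=voted.get), voted
```

The effect is that categories sharing words with many other strong candidates are boosted, which favors the dominant theme among the candidates.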
User profile
Categorization turns user-URL associations into
user-category associations and allows them to be aggregated
into higher-level user profiles. Similar to the GROUP BY
operator in databases,
profile categories are ranked by accumulation, and top ranked
categories are selected. For a more consistent profile
presentation, ODP categories can be mapped into Wikipedia
categories as suggested in [20, 21]. Table 1 shows an
example of a user profile consisting of top category
accumulations.
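The aggregation step amounts to a per-user count of categories followed by a top-n selection. A minimal sketch, with a hypothetical function name and made-up category labels in the usage note:

```python
from collections import Counter, defaultdict


def build_profiles(associations, top_n=3):
    """Aggregate (user, category) associations into per-user top categories.

    Returns a dict mapping user -> list of (category, count), most frequent first.
    """
    counts = defaultdict(Counter)
    for user, category in associations:
        counts[user][category] += 1
    return {user: c.most_common(top_n) for user, c in counts.items()}
```

For instance, a user with three "Sports/Basketball" page views, two "News" views, and one "Shopping" view would get those three categories, in that order, as a top-three profile.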
Finding like-minded communities: Combining SoLoMo analysis
Understanding the target audience is important for designing
a campaign. Typically, a campaign for individual users
leverages only clustering and micro-segmentation
techniques. However, if peer pressure or social influence
is used in a campaign (e.g., in a viral marketing
campaign), it is important to also identify target groups
whose members are not only well
connected, but also share similar attributes such as common
interests or shopping patterns. We call a group
with such features a like-minded community [22]. The
similarity of a pair of users is calculated as the cosine
similarity between their attribute vectors. The like-mindedness
of a community is defined as the average similarity of all pairs
of users in that community.
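The two definitions above can be written down directly. The sketch below assumes each user is represented by a sparse attribute vector stored as a dict (for example, interest category to weight); the function names and the choice to return 0.0 for a community with fewer than two members are illustrative assumptions.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def like_mindedness(profiles, members):
    """Average cosine similarity over all pairs of community members."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 0.0  # assumption: undefined for fewer than two members
    return sum(cosine(profiles[u], profiles[v]) for u, v in pairs) / len(pairs)
```

With non-negative attribute weights, the resulting score lies between 0 and 1.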
In this section, we describe the steps to build a
like-minded community using points of interest (POIs) computed
by the location analysis component and interest profiles
computed by the mobile usage analysis component. For
example, in the location analysis component, we can use
the points of interest to compute like-mindedness,
claiming that two people are like-minded if they frequently
visit a similar set of locations. Similarly, in mobile usage
analysis, two people are like-minded if they share a number
of interests. We also provide high-level implementation
details of the SoLoMo analytics system using Big Data
frameworks. We do not describe the algorithm to find
like-minded communities in full detail; instead, we focus on
the details specific to the Big Data frameworks we used.
First, we construct the social network graph based on
the CDR data of the subscribers. In the social graph,
each node indicates a subscriber, and an edge between
two nodes indicates interaction between the two subscribers.
The weight on the edge indicates how frequently and for
what duration the two subscribers have been talking. We
then compute the interests of each user as described in the mobile
usage analysis section and the POIs as described in the
location-based profiling section. We then find the induced
subgraph for each interest topic, and also for each point
of interest. An induced subgraph for a particular topic
retains only those nodes from the social graph that have an
interest in that particular topic and the edges between these
nodes. These induced subgraphs help establish which
subscribers are connected over similar interests and also
frequently talk to one another. The induced subgraphs
are found in parallel, leveraging Netezza's NZSQL
framework, using a join query on the subscribers and
the topics in nzsql.
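As a rough illustration of these two steps, the following Python sketch builds a weighted social graph from call records and extracts the induced subgraph for one topic. It is an in-memory stand-in for the nzsql join described above, not the system's query. The record layout `(caller, callee, duration)`, the function names, and the choice to store the weight as a call count plus a total duration are assumptions.

```python
from collections import defaultdict


def build_social_graph(cdrs):
    """Map each undirected edge to [call count, total duration].

    cdrs is an iterable of (caller, callee, duration_seconds) tuples.
    """
    weights = defaultdict(lambda: [0, 0.0])
    for caller, callee, duration in cdrs:
        if caller == callee:
            continue  # ignore self-calls
        edge = tuple(sorted((caller, callee)))  # undirected edge key
        weights[edge][0] += 1
        weights[edge][1] += duration
    return dict(weights)


def induced_subgraph(graph, interests, topic):
    """Keep subscribers interested in `topic` and the edges between them."""
    keep = {user for user, topics in interests.items() if topic in topics}
    return {edge: w for edge, w in graph.items()
            if edge[0] in keep and edge[1] in keep}
```

The same `induced_subgraph` function applies to points of interest if `interests` maps each subscriber to the set of POIs he or she visits.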
Next, we find maximal cliques in each such induced
subgraph using the method proposed in [7]. As stated
earlier, Netezza provides native support for the MapReduce
computing paradigm. Over the cliques found in the previous
step, we run frequent itemset mining. Since each
maximal clique is a collection of users, it can be treated
as a transaction whose items are the users. For this purpose,
we use the Netezza
analytical function ARULE. The model name and the support
level for the frequent itemset mining are specified when
calling the ARULE function of Netezza. Identifiers for the
cliques are passed as the TID (transactionID) parameter, and
the members of the cliques correspond to the item parameter.
The maximum set size is also specified as a parameter.
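The clique-to-transaction idea can be sketched in plain Python. Two substitutions are made here and should be read as assumptions: the basic Bron-Kerbosch algorithm stands in for the parallel clique method of [7], and a naive subset counter stands in for the Netezza ARULE function, whose actual call syntax is not reproduced. The names `maximal_cliques`, `frequent_itemsets`, `min_support`, and `max_size` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def maximal_cliques(adj):
    """Basic Bron-Kerbosch; adj maps each node to its set of neighbours."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))  # r cannot be extended further
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def frequent_itemsets(transactions, min_support, max_size):
    """Count user groups (size 2..max_size) across clique transactions.

    transactions maps a clique identifier (the TID) to its set of users.
    """
    counts = Counter()
    for users in transactions.values():
        members = sorted(users)
        for size in range(2, max_size + 1):
            counts.update(combinations(members, size))
    # Keep groups that appear in at least min_support cliques.
    return {group: n for group, n in counts.items() if n >= min_support}
```

The naive counter enumerates every subset of each clique, so it is only practical for small cliques; a production system would rely on a proper frequent itemset miner.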
After finding the frequent itemsets (FIS), we apply a support
threshold to prune the set of frequent itemsets (i.e., we keep groups of
users who have appeared in at least a certain number of maximal
cliques across the induced subgraphs for different points of
interest and/or different mobile-usage-based interest topics).
We can, of course, find the FIS separately on the POIs
and interest topics and then choose to retain the FIS that
meet the support criteria in either or both.
Once the FIS of interest are determined, we find the union
of all the FIS. We call the members of this union
the core people. Using the method mentioned
above, we again find the induced subgraph for the core
people, which is called the induced graph of core people.
We then find communities on this induced graph of core
people. Note that the community finding algorithms used
in [22] are not suitable for a parallel implementation,
and hence we use the algorithm proposed in [8] to find the
like-minded communities. This algorithm provides
potentially overlapping communities.
Once the like-minded communities are computed, the
influencer score is calculated for all the members in the
various communities (e.g., using PageRank [6]). The rank of
the members within each community is calculated using the
Netezza rank analytical function along with the partition by
utility in Netezza. The analytical function helps improve
query processing [23] by allowing simpler SQL queries to be executed.
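A minimal Python sketch of this step is given below, under stated assumptions: PageRank is computed by plain power iteration on the undirected social graph (each edge treated as two directed links), and the per-community ranking imitates what a SQL RANK() window function with a PARTITION BY clause would return. The function names, the damping factor of 0.85, and the fixed iteration count are illustrative choices, not values from the system.

```python
def pagerank(adj, damping=0.85, iterations=50):
    """Power-iteration PageRank; adj maps node -> set of neighbours."""
    nodes = list(adj)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        nxt = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            if adj[u]:
                share = damping * rank[u] / len(adj[u])
                for v in adj[u]:
                    nxt[v] += share
            else:
                # Dangling node: spread its score evenly over all nodes.
                for v in nodes:
                    nxt[v] += damping * rank[u] / n
        rank = nxt
    return rank


def rank_within_communities(communities, score):
    """Rank members per community by descending score; ties share a rank."""
    return {
        cid: {u: 1 + sum(score[v] > score[u] for v in members)
              for u in members}
        for cid, members in communities.items()
    }
```

Because communities may overlap, a subscriber can receive a different rank in each community to which he or she belongs.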
For each community found, the following four metrics are
computed to help characterize it:
Size: Size indicates the number of members in a particular
community.
Density: We use the average degree (within community) as
the density measure, i.e., the ratio of the number of edges
inside the community to the number of members in the
community.
Like-mindedness: The like-mindedness metric indicates
how like-minded/similar the members of the particular
community are. Like-mindedness is computed over one
or more dimensions available from other modules such as
the location module or mobile usage module. The score
ranges from 0 to 1, with 0 indicating that the community is not at all
like-minded, and 1 indicating that the community consists of
highly like-minded people.
Activity score: The activity score indicates, on average, how
active each member is in terms of purchasing items or
rating an item, or in general, in any activity. It is the mean
of the activity score of each community member.
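The four metrics can be computed together, as in the following sketch. The function signature is hypothetical: `similarity` is any pairwise similarity function returning values in [0, 1], and `activity` maps each member to an activity score. Density follows the definition in the text (edges inside the community divided by the number of members).

```python
from itertools import combinations


def community_metrics(members, edges, similarity, activity):
    """Return size, density, like-mindedness, and activity score."""
    members = set(members)
    size = len(members)
    # Count edges whose two endpoints are both inside the community.
    inside = sum(1 for u, v in edges if u in members and v in members)
    density = inside / size if size else 0.0
    pairs = list(combinations(sorted(members), 2))
    like = (sum(similarity(u, v) for u, v in pairs) / len(pairs)
            if pairs else 0.0)
    mean_activity = (sum(activity[u] for u in members) / size
                     if size else 0.0)
    return {"size": size, "density": density,
            "like_mindedness": like, "activity": mean_activity}
```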
All these metrics can help in launching an effective viral
marketing campaign or provide deep insights into the
subscribers. Using the Netezza High Capacity Appliance,
these metrics can be computed efficiently over extremely
large data volumes.
Related work
A few parallel social network analysis platforms are
proposed in [7, 24, 25] to process large-scale graphs.
The authors propose to accelerate the processing through
parallel computation. The focus of our work is on the
provisioning of an integrated analytics solution using
parallel social network analysis algorithms. The front-end
visualizations in social network profiling are
based on D3 (data-driven documents) [26], which is a
representation-transparent approach to visualization for
the web. In this work, we extended D3 to make our
visualizations finely integrated with the web environment.
Community detection has long been one of the
fundamental topics of attention for network science
researchers. Ever since the seminal paper by Girvan and
Newman [27], much work and interest has been generated
in this field. The set of algorithms that can be used to find
communities can be broadly divided into six categories [28],
namely (i) graph partitioning, (ii) hierarchical clustering,
(iii) partitional clustering, (iv) spectral clustering, (v) divisive
algorithms, and (vi) modularity-based methods. Most of
the approaches try to optimize a given objective
function. The most notable and state-of-the-art algorithms
are [29] and [30]. Both of these approaches try to
maximize the given objective function, which in this case
is modularity.
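For readers unfamiliar with the objective, the following sketch computes the standard modularity of a given partition for an unweighted, undirected graph, i.e., the sum over communities of the fraction of edges inside the community minus the squared fraction of edge endpoints attached to it. It only evaluates a partition; it is not the optimization procedure of [29] or [30]. The function name and input layout are assumptions.

```python
def modularity(edges, community):
    """Modularity of a partition.

    edges is a list of undirected (u, v) pairs; community maps each node
    to a community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    degree = {}
    inside = {}  # number of edges inside each community
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if community[u] == community[v]:
            c = community[u]
            inside[c] = inside.get(c, 0) + 1
    total = {}  # sum of node degrees per community
    for u, k in degree.items():
        c = community[u]
        total[c] = total.get(c, 0) + k
    return sum(inside.get(c, 0) / m - (total[c] / (2.0 * m)) ** 2
               for c in total)
```

For two triangles joined by a single edge, placing each triangle in its own community yields a positive score, whereas placing all nodes in one community yields zero.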
Characterizing human mobility by analyzing anonymized
mobile phone data (typically CDRs) has become a hot
research topic recently. Modani et al. [22] reviewed
the methods of collecting location data from cellular phone
networks. Isaacman et al. [10] proposed algorithms that
identify generally important places, such as home and work
locations, of subscribers. Becker et al. [31] presented a
comprehensive study on how to use CDRs to calculate
subscribers' daily travel ranges, traffic volumes, and the carbon
footprint of home-to-work commutes, etc.
In this work, we extracted user interests from the mobile
browsing log using open source taxonomies such as ODP
and Wikipedia. Several papers have utilized the ODP
taxonomy along with web browsing logs for different
uses. Recently, Konopnicki and Shmueli-Scheuer [19] used
the ODP taxonomy to model user profiles based only on their
domain- and URL-level browsing logs; in this work, we
extend the scope to utilize Wikipedia as a source and support
dynamic queries, such as Title. The work in [32] focused on
exploiting the ODP to achieve high-quality personalized
web search based on the distance of the categories of the
returned URL to the user profile categories. The distance is
measured by hierarchical semantics and the ODP tree
structure. Tanudjaja and Mui [33]
applied the ODP to enhance the HITS algorithm [34] using
dynamic user profiles. Wikipedia is also used to model
user behavior for both search terms and web documents.
Min and Jones [35] used an unsupervised clustering
method to model user search interests using Wikipedia's
categories. Other authors [36, 37] generated user profiles
based on the page content, whereas in our setting, we
only allow access to the URL without fetching the
page content.
Most of the work performed in this area has used either
connections or interests of users to find communities. Little
attention has been paid to the integration of connections
among people and their shared interests to find
communities. In most cases, text has been considered
as the second attribute [38, 39]. Attribute information-based
clustering has also been proposed [40]. Modani et al. [22]
proposed a way of finding communities with higher
modularity and with higher "like-mindedness" as well, in
which like-mindedness was a metric used to represent the
similarity among members of a community, in terms of
product purchases or movie ratings. The same algorithm
is used in SoLoMo on a variety of data attributes.
The algorithm had previously been applied only to a movie
rating dataset. The algorithm's ability to handle various types
of data was tested in this tool.
To the best of our knowledge, our tool may be the first
to enable analyzing subscribers on all three dimensions:
location, social, and mobile. Most existing tools have used
combinations of only two of the dimensions. The Livehoods
project [41] studied the friendship network along with
the locations checked in by people to come up with
various neighborhoods of the city, which were qualitatively
validated. Work by the MIT Reality Mining group discusses
using mobile phones as social sensors [42], inferring social
networks based on calling patterns [43], computing
communities [44], and profiling users based on
spatiotemporal patterns [45]. Similarly, much work has
been done over mobile and social networks [46-49].
Conclusions and future work

In this paper, we presented our Big Data solution called
SoLoMo, which provides a coherent set of analytics
functions to process the vast amounts of data generated in
the telecom area every day. In addition, we addressed
the challenge of scale involved in this setting, while still
providing meaningful insight in an acceptable timeframe,
as telcos have very large (several million) subscriber bases.
Some of the directions in which our work can be extended
include 1) fusing available location data from multiple
sources (CDR, 2G/3G/4G, GPS, etc.) while considering
spatiotemporal constraints for more accurate location
inference and 2) using the like-minded communities
generated to design efficient strategies for social campaigns
and to determine the appropriate targets for such campaigns.
*Trademark, service mark, or registered trademark of International
Business Machines Corporation in the United States, other countries,
or both.
**Trademark, service mark, or registered trademark of Apache
Software Foundation, Google, Inc., Yahoo! Inc., or Wikimedia
Foundation in the United States, other countries, or both.
References
1. Global Mobile Statistics 2013 Part A. [Online]. Available:
http://mobithinking.com/mobile-marketing-tools/latest-mobilestats/a
2. IBM InfoSphere Streams, IBM Corporation, Armonk, NY, USA.
[Online]. Available: http://www-03.ibm.com/software/products/en/infosphere-streams/
3. IBM Netezza Data Warehouse, IBM Corporation, Armonk, NY,
USA. [Online]. Available: http://www-01.ibm.com/software/data/netezza/
4. IBM InfoSphere BigInsights, IBM Corporation, Armonk, NY,
USA. [Online]. Available: http://www-01.ibm.com/software/data/infosphere/biginsights/
5. IBM SPSS, IBM Corporation, Armonk, NY, USA. [Online].
Available: http://www-01.ibm.com/software/analytics/spss/
6. L. Page, S. Brin, R. Motwani, and T. Winograd, "The pagerank
citation ranking: Bringing order to the web," Stanford InfoLab,
Stanford, CA, USA, Tech. Rep., 1999.
7. W. Xue, J. Shi, and B. Yang, "X-RIME: Cloud-based large scale
social network analysis," in Proc. IEEE Int. Conf. SCC, 2010,
pp. 506-513.
8. J. Shi, W. Xue, W. Wang, Y. Zhang, B. Yang, and J. Li,
"Scalable community detection in massive social networks
using MapReduce," IBM J. Res. & Dev., vol. 57, no. 3/4, pt. 12,
pp. 12:1-12:14, May-Jul. 2013.
9. L. Stenneth, O. Wolfson, P. S. Yu, and B. Xu, "Transportation
mode detection using mobile phones and GIS information," in
Proc. 19th ACM SIGSPATIAL Int. Conf. Adv. Geogr. Inf. Syst.,
2011, pp. 54-63.
10. S. Isaacman, R. Becker, R. Caceres, S. Kobourov, M. Martonosi,
J. Rowland, and A. Varshavsky, "Identifying important places
in people's lives from cellular network data," in Proc. 9th Int.
Conf. Pervasive Comput., 2011, pp. 133-151.
11. J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and
Techniques, 3rd ed. San Mateo, CA, USA: Morgan Kaufmann,
2011.
12. W. Dong, L. Li, C. Zhou, Y. Wang, M. Li, C. Tian, and W. Sun,
"Discovery of generalized spatial association rules," in Proc.
IEEE Int. Conf. SOLI, 2012, pp. 60-65.
13. W. Dong, W. Fan, L. Shi, C. Zhou, and X. Yan, "A general
framework to encode heterogeneous information sources for
contextual pattern mining," in Proc. ACM Int. CIKM, 2012,
pp. 65-74.
14. URL Types - The URL Clearinghouse. [Online]. Available:
http://urlclearinghouse.wikidot.com/types
15. X. Qi and B. D. Davison, "Web page classification: Features
and algorithms," ACM Comput. Surv., vol. 41, no. 2, pp. 1-31,
Feb. 2009.
16. D. Cohn and T. Hofmann, "The missing link - A probabilistic
model of document content and hypertext connectivity," in Proc.
Adv. NIPS, 2001, pp. 430-436.
17. Open Directory Project (ODP). [Online]. Available: http://www.dmoz.org/
18. Wikipedia. [Online]. Available: http://www.wikipedia.org/
19. D. Konopnicki and M. Shmueli-Scheuer, "Customer analyst
for the telecom industry," in Large-Scale Data Analytics.
New York, NY, USA: Springer Science and Business Media,
2014.
20. Articles With Open Directory Project Links. [Online]. Available:
http://en.wikipedia.org/wiki/Category:Articles_with_Open_Directory_Project_links
21. Wikipedia Mapping. [Online]. Available: http://projects.dmoz.org/project.cgi?id=7
22. N. Modani, S. Nagar, S. Shannigrahi, R. Gupta, K. Dey, S. Goyal,
and A. A. Nanavati, "Like-Minded communities: Bringing the
familiarity and similarity together," J. World Wide Web, vol. 17,
no. 5, pp. 899-919, 2014.
23. Netezza NPS v7.0.3 IEHSc. [Online]. Available: http://pic.dhe.ibm.com/infocenter/ntz/v7r0m3/index.jsp?topic=%2Fcom.ibm.nz.dbu.doc%2Fc_dbuser_overview_analytic_funcs.html
24. U. Kang, C. E. Tsourakakis, and C. Faloutsos, "Pegasus: A
peta-scale graph mining system implementation and observations,"
in Proc. IEEE Int. Conf. Data Mining, 2009, pp. 229-238.
25. G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn,
N. Leiser, and G. Czajkowski, "Pregel: A system for large-scale
graph processing," in Proc. ACM Int. Conf. SIGMOD, 2010,
pp. 135-146.
26. M. Bostock, V. Ogievetsky, and J. Heer, "D3 data-driven
documents," IEEE Trans. Vis. Comput. Graphics, vol. 17, no. 12,
pp. 2301-2309, Dec. 2011.
27. M. Girvan and M. E. J. Newman, "Community structure in social
and biological networks," Proc. Nat. Acad. Sci. USA, vol. 99,
no. 12, pp. 7821-7826, Jun. 2002.
28. S. Fortunato, "Community detection in graphs," Phys. Rep.,
vol. 486, no. 3-5, pp. 75-174, Feb. 2010.
29. M. E. J. Newman, "Modularity and community structure in networks,"
Proc. Nat. Acad. Sci. USA, vol. 103, no. 23, pp. 8577-8582,
Jun. 2006.
30. V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, "Fast
unfolding of communities in large networks," J. Stat. Mech.,
Theory Exp., vol. 2008, no. 10, p. P10008, Oct. 2008.
31. R. Becker, R. Caceres, K. Hanson, S. Isaacman, J. M. Loh,
M. Martonosi, J. Rowland, S. Urbanek, A. Varshavsky, and
C. Volinsky, "Human mobility characterization from
cellular network data," Commun. ACM, vol. 56, no. 1, pp. 74-82,
Jan. 2013.
32. P. A. Chirita, W. Nejdl, R. Paiu, and C. Kohlschütter, "Using ODP
metadata to personalize search," in Proc. 28th Annu. Int. ACM
SIGIR, 2005, pp. 178-185.
33. F. Tanudjaja and L. Mui, "Persona: A contextualized and
personalized web search," in Proc. 35th Annu. Hawaii Int.
Conf. Syst. Sci., 2001, p. 67.
34. J. M. Kleinberg, "Authoritative sources in a hyperlinked
environment," J. ACM, vol. 46, no. 5, pp. 604-632,
Sep. 1999.
35. J. Min and G. J. F. Jones, "Building user interest profiles from
Wikipedia clusters," presented at the Workshop Enriching
Information Retrieval ENIR/SIGIR, Beijing, China, Jul. 2011.
36. K. Ramanathan, J. Giraudi, and A. Gupta, "Creating hierarchical
user profiles using wikipedia," HP Labs, Palo Alto, CA, USA,
Tech. Rep. 127, 2008.
37. K. Ramanathan and K. Kapoor, "Creating user profiles using
wikipedia," in Proc. 28th Int. Conf. Conceptual Modeling, 2009,
pp. 415-427.
38. N. Barbieri, F. Bonchi, and G. Manco, "Cascade-based community
detection," in Proc. WSDM, 2013, pp. 33-42.
39. M. Sachan, D. Contractor, T. Faruquie, and L. Subramaniam,
"Using content and interactions for discovering communities in
social networks," in Proc. World Wide Web, 2012, pp. 330-340.
40. Y. Zhou, H. Cheng, and J. Yu, "Graph clustering based on
structural/attribute similarities," J. Proc. VLDB Endowment,
vol. 2, no. 1, pp. 718-729, Aug. 2009.
41. J. Cranshaw, R. Schwartz, J. Hong, and N. Sadeh, "The livehoods
project: Utilizing social media to understand the dynamics of a
city," in Proc. ICWSM, 2012, pp. 58-65.
42. N. Eagle, "Mobile phones as social sensors," in Handbook of
Emergent Technologies in Social Research. Oxford, U.K.:
Oxford Univ. Press, 2005.
43. N. Eagle, A. Pentland, and D. Lazer, "Inferring social network
structure using mobile phone data," Proc. Nat. Acad. Sci. USA,
vol. 106, no. 36, pp. 15 274-15 278, Sep. 2009.
44. N. Eagle, Y. de Montjoye, and L. Bettencourt, "Community
computing: Comparisons between rural and urban societies
using mobile phone data," in Proc. IEEE Soc. Comput., 2009,
pp. 144-150.
45. M. A. Bayir, M. Demirbas, and N. Eagle, "Discovering
spatiotemporal mobility profiles of cellphone users," in Proc. Int.
Symp. World Wireless, Mobile Multimedia Netw., 2009, pp. 1-9.
46. A. Nanavati et al., "Analyzing the structure and evolution of
massive telecom graphs," IEEE Trans. Knowl. Data Eng., vol. 20,
no. 5, pp. 703-718, May 2008.
47. V. Pandit, N. Modani, S. Mukherjea, A. Nanavati, S. Roy, and
A. Agarwal, "Extracting dense communities from telecom
call graphs," in Proc. Commun. Syst. Softw. Middleware, 2008,
pp. 82-89.
48. K. Dasgupta, R. Singh, B. Vishwanathan, D. Chakraborty,
S. Mukherjea, A. Nanavati, and A. Joshi, "Social ties and their
relevance to churn in mobile telecom networks," in Proc.
Extending Database Technol., 2008, pp. 668-677.
49. A. Nanavati, S. Gurumurthy, G. Das, D. Chakraborty,
K. Dasgupta, S. Mukherjea, and A. Joshi, "On the structural
properties of massive telecom graphs: Finding and implications,"
in Proc. Int. Conf. Knowl. Manage., 2006, pp. 435-444.
Received February 22, 2014; accepted for publication
March 17, 2014
Heng Cao IBM Research - China, Shanghai 201203, China
(hengcao@cn.ibm.com). Ms. Cao is an IBM Senior Technical Staff
Member and heads the IBM Research - China Shanghai Lab. She also
serves as the IBM Research Global Labs Analytics Leader and leads
the cross-geography Research teams in developing innovative analytics
technologies to address the emerging analytics requirements from
Growth Markets. Prior to that, she was on assignment from the IBM
Thomas J. Watson Research Center to the IBM China Research Lab
as the CTO for business analytics and optimization. Ms. Cao and her
team participated in many IBM on-demand business transformation
projects and successfully helped the business to improve performance
through analytics. She was the recipient of many IBM awards including
the IBM Outstanding Technical Achievement. She also received the
2008 National Women of Color Rising Star Award and the 2010
INFORMS (Institute for Operations Research and the Management
Sciences) Daniel H. Wagner Prize.
Wei Shan Dong IBM Research - China, Beijing 100193, China
(dongweis@cn.ibm.com). Dr. Dong is a Research Staff Member in IBM
Research - China. He received his B.E. degree in computer science
from the University of Science and Technology of China (USTC) in
2004, and his Ph.D. degree in pattern recognition and intelligent system
from the Institute of Automation, Chinese Academy of Sciences in
2009. He joined IBM Research - China in 2009. His research interests
include data mining (especially on spatiotemporal data), evolutionary
computation, and computer vision.
Leslie S. Liu IBM Research - China, Beijing 100193, China
(lesliu@cn.ibm.com). Dr. Liu currently is a Research Staff Member
at IBM Research - China working on Big Data-related research
including telecom mobility patterns and user profiling for the connected
vehicle industry. Before joining IBM Research - China, Dr. Liu was
a Staff Member at the IBM Thomas J. Watson Research Center,
where he worked on innovations and research such as secure and
scalable mobile systems in the enterprise, next-generation application
development platforms, and cloud-based service models. Dr. Liu also
led a mobile service engagement team with members from North
America, China, Taiwan, and India. Dr. Liu and his team have been
actively engaged with opportunities with customers from financial,
automotive, defense and insurance industries. The team has generated
multi-million dollars worth of revenue since 2008. Dr. Liu is the author
of nine patent applications and has published many technical papers
in mobile, multimedia, and cloud-related conferences and journals.
Dr. Liu also served as program chair and technical committee member
for multiple IEEE (Institute of Electrical and Electronics Engineers)
and ACM (Association for Computing Machinery) conferences.
Chun Yang Ma IBM Research - China, Beijing 100193,
China (machybj@cn.ibm.com). Dr. Ma is a Staff Researcher in IBM
Research - China. She received her B.S. and Ph.D. degrees in computer
science from Zhejiang University, China, in 2006 and 2012, respectively.
Her current research interests include spatial databases, data access
methods, and spatiotemporal data mining.
Wei Hong Qian IBM Research - China, Beijing 100193, China
(qianwh@cn.ibm.com). Ms. Qian is a staff researcher in IBM
Research - China. She received her B.S. and M.S. degrees from
Zhejiang University, majoring in computer science and technology. Her
research interests include interactive visual text analysis, interactive
visual social network analysis, simple visualization, text analytics,
embedded systems, etc.
Ju Wei Shi IBM Research - China, Beijing 100193, China (jwshi@
cn.ibm.com). Mr. Shi is a Research Staff Member in the Information
Management department at IBM Research - China. He received his B.S.
and M.S. degrees in electrical engineering from Beijing University of
Posts and Telecommunications, Beijing, China, in 2005 and 2008,
respectively. He subsequently joined IBM Research - China, where he
worked on Big Data analytics, such as Hadoop self-tuning, social
network analysis using MapReduce, Hadoop performance on PowerPC,
and data management and analytics applications across industries.
He also worked in Microsoft Research Asia as a visiting student in
2005. Mr. Shi has more than 10 papers published and 20 patents filed.
Chun Hua Tian IBM Research - China, Beijing 100193, China
(chtian@cn.ibm.com). Dr. Tian is a Research Staff Member and
Manager in the Service Research department. He holds a Ph.D. degree
in automation science and engineering from Tsinghua University. His
current research interests include data mining, logistics and supply
chain management, and rule-based optimization.
Yu Wang IBM Research - China, Shanghai 201203, China
(wangyuwyu@cn.ibm.com). Mr. Wang is a Researcher in the IBM
Research - China Shanghai Lab. He received his B.S. degree at XiDian
University and an M.S. degree at SiChuan University. His current
research interests include real-time database and Big Data analytics
such as spatial-temporal data analysis and social network analysis using
MapReduce.
David Konopnicki IBM Research Division, IBM Research - Haifa,
Haifa University Campus, 31905 Haifa (davidko@il.ibm.com).
Dr. Konopnicki manages the Information Retrieval Group in IBM
Research - Haifa and has been involved in unstructured content analytics
both from a theoretical and a practical point of view. In academia,
Dr. Konopnicki developed search systems for the early web. In the IBM
Software Group, and in IBM Research, he has been leading a variety
of projects: development of large-scale full-text search engines,
building customer profiles from enterprise and social media sources,
massive-scale analytics with applications to Telco companies, and more.
Dr. Konopnicki is an IBM Master Inventor and holds a Ph.D. degree in
computer science from the Technion-Israel Institute of Technology.
Michal Shmueli-Scheuer IBM Research Division, IBM
Research - Haifa, Haifa University Campus, 31905 Haifa (shmueli@il.
ibm.com). Dr. Shmueli-Scheuer is a Researcher in the Information
Retrieval Group in IBM Research - Haifa. Dr. Shmueli-Scheuer
received her Ph.D. degree in information and computer science at
the University of California, Irvine, in 2009. Her area of expertise is in
the fields of large-scale analytics, database, and information systems,
focusing on user-behavior analytics and information management on
the web. She has authored numerous papers on data management and
information retrieval in leading conferences.
Doron Cohen IBM Research Division, IBM Research - Haifa,
Haifa University Campus, 31905 Haifa (doronc@il.ibm.com).
Mr. Cohen is a Researcher in the Information Retrieval Group in IBM
Research - Haifa. He holds an M.Sc. degree from the Technion-Israel
Institute of Technology, and in 1990 he joined IBM to first work
on compiler backend optimizations and later on information retrieval.
Mr. Cohen has authored several papers on information retrieval in
leading conferences.
Natwar Modani IBM Research - India, ISID Campus, Vasant
Kunj, New Delhi-70, India (namodani@in.ibm.com). Mr. Modani is a
Senior Software Engineer in the Telecom Research Innovation Center
at the IBM Research - India Lab. He received an M.E. (Integrated)
degree in electrical communication engineering from Indian Institute of
Science (IISc), Bangalore, India. He subsequently joined IBM Research
- India, where he has worked in eCommerce, autonomic systems,
and social network analysis areas. He has received an IBM Client
Value Outstanding Technical Achievement Award and IBM Research
Division Award. He is coauthor of 15 patents and 19 technical papers.
Hemank Lamba IBM Research - India, ISID Campus, Vasant
Kunj, New Delhi-70, India (helamba1@in.ibm.com). Mr. Lamba is a
software engineer in the Social Network Analytics Group at IBM
Research - India. Prior to IBM, he was at Indraprastha Institute of
Information Technology Delhi, where he completed his B.Tech. (with
Hons.) in computer science engineering in 2012. He has authored
or coauthored six papers in peer-reviewed international conferences.
He has worked on solutions dealing with viral marketing, mining
unusual patterns, and incentive mechanism design.
Ananth Dwivedi IBM Research - India, ISID Campus, Vasant
Kunj, New Delhi-70, India (anadwive@in.ibm.com). Mr. Dwivedi is
a software engineer in the Social Network Analysis Group at IBM
Research - India. He received a B.Tech. degree in computer engineering
from Indian Institute of Technology, Banaras Hindu University in 2012.
He has worked on viral marketing campaign management (Vibes).
Amit A. Nanavati IBM Research - India, ISID Campus, Vasant
Kunj, New Delhi-70, India (namit@in.ibm.com). Dr. Nanavati is
a Research Staff Member in the Mobile and Telecom Research
department at the India Research Lab. He received a B.S. degree in
computer science from Maharaja Sayajirao University in 1989, and an
M.S. degree in systems science and a Ph.D. degree in computer science
from Louisiana State University in 1994 and 1996, respectively. He
subsequently joined Netscape Communications Corporation and then
moved to IBM Research - India in 2000. In 2011, he was named
a Master Inventor and became an IBM Academy of Technology
member in 2013. He has authored over 40 patents (19 issued) and
45 publications. He coauthored a book on Speech in Mobile and
Pervasive Environments published by John Wiley, United Kingdom,
in 2012.
Manish Kumar IBM Telecom Industry, Sales and Distribution,
IBM Singapore Pte Ltd, Singapore 486048 (manish@sg.ibm.com).
Mr. Kumar is a Solutions Leader in the IBM Asia Pacific region.
He received his B.Sc. Honors degree in physics from Dibrugarh
University, India, in 1994. He has worked in a number of large
communications service provider companies and subsequently
joined IBM Global Telco Solutions Center, where he created the
industry-defining service delivery platform solution that created
a multi-million revenue opportunity for IBM customers and IBM.
He was awarded the Asia Pacific Hypergrowth Award. Mr. Kumar
specializes in new growth and revenue-generating services and
platforms and is an active member of multiple Telco forums
and communities.
Digital Object Identifier: 10.1147/JRD.2014.2336177
