Rapport de Stage-Newversion-Yujie ZHENG

École Française d'Électronique et d'Informatique
IT for finance
Promotion 2019
Analyze customer data to build a recommendation algorithm
Internship report presented in October 2018
by
ZHENG Yujie
Internship done at: Hangzhou and Kunming

Name of company:
Hangzhou Eastcom-BUPT Information Technology Co. Ltd
Complete address of internship:
1. Block A, B, Xixi Ginza, 780 Wen'er West Road, Xihu District,
Hangzhou.
2. 118 Huancheng S Rd, Guandu Qu, Kunming
I thank Mr. LIU Tongcun, who gave me the most important and
selfless help.
At the end of this internship, I want to express my respect and my
sincere thanks to SHE Guiwei and JIANG Bo, engineers of the
company, who made my internship possible and helped me to
integrate into the company. In addition, I thank Buli Hai, an
engineer at China Mobile, who helped me to carry out my internship
in the best conditions during this period.
I would like to express my gratitude to all those who helped me
directly or indirectly during this internship.
Table of contents
1 Introduction
2 Content
2 Background
2.1 Description of the company
2.1.1 Eastcom-BUPT Information Technology Company
2.1.2 MIGU
2.1.3 China Mobile
2.2 Technical background
2.2.1 Hadoop
2.2.2 Oracle database
2.2.3 Recall and precision
2.2.4 K-means for classifying products
3 Recommendation system
3.1 Theme and main purpose
3.1.1 Approach to meet the objectives
3.2 Working environment
3.3 Technical environment
3.4 Technical methods
3.4.1 Use PL/SQL to extract the MIGU data
3.5 Analysis and results
3.5.1 Data
3.5.2 Statistical analysis
3.6 A top-n recommendation algorithm based on user social network and book classification
3.6.1 Functional requirements
3.6.2 Modeling method
3.6.3 Test
3.7 Collaborative filtering recommendation algorithm verification
3.7.1 Verification ideas
3.7.2 Results and conclusions
4 User classification
4.1 Data
4.2 Four clustering models to generate 4 types of user tags
4.2.1 Flow-flow consumption
4.2.2 Flow-flow preference
4.2.3 Call-consumption preferences
4.2.4 Call-behavior preference
4.3 KNN algorithm for supervised learning classification
4.4 Find the flow suppressed customer
3 Conclusion
Introduction
I am a fourth-year student at EFREI (École Française d'Électronique
et d'Informatique), majoring in IT for Finance. I did a 20-week
internship as part of my training. It was a technical experience
that allowed me to practice the techniques we learn at school and
to understand the techniques used in a real company.
I did my internship at Hangzhou Eastcom-BUPT Information
Technology Co. Ltd. Today, thanks to new database technologies,
web technologies and so on, the company holds a huge amount of
customer data. The new challenge is how to use this data. A friend
offered me a position as assistant algorithm engineer at
Eastcom-BUPT. Having always been interested in Big Data
techniques, I was happy to have the chance to take part in a Big
Data project in a real company and to fill a need.
EB, short for Hangzhou Eastcom-BUPT Information Technology,
provides various business solutions to telecom operators and
industrial customers, in areas such as telecommunications, big
data, communications security, software development, systems
integration, operational support and consulting services.
My position was assistant algorithm engineer. The aim of the
internship was for me to be able to carry out a project on my own
by the end of it. The internship had two main parts. The first part
took place in Hangzhou at MIGU, a client and partner of EB. My main
work in this part was to study different algorithms and test them.
The second part took place at the Yunnan branch office of China
Mobile. The main work of this part was analyzing customer data.
Content
2 Background
2.1 Description of the company
2.1.1 Eastcom-BUPT Information Technology Company
Overview:
I did my internship at a company called Hangzhou Eastcom-BUPT
Information Technology Co. Ltd. It is located at 4/F, Dongxin
Mansion, No. 398 Wensan Road, Hangzhou. It is engaged in the
research, development, integration and engineering of mobile
communications and Internet software.
Hangzhou Eastcom-BUPT Information Technology Co. Ltd. (referred
to as Eastcom-BUPT or EB for short) was co-founded by Professor
LIAO Jianxin and Eastcom Communications Group in February
2000. Eastcom-BUPT has a capital of 50 million yuan and more than
20 branches in different provinces across the country. In addition,
EB has more than 4,000 square meters of office space and has
invested 20 million yuan in a large-scale laboratory.
Eastcom-BUPT's core technical team is made up of PhD and master's
graduates in the fields of computing and telecommunications, and
some of them are leading researchers in value-added service
research and product development.
EB provides various business solutions for telecom operators as
well as for industrial customers, in areas such as
telecommunications, big data, communications security, software
development, systems integration, operational support and
consulting services. It has become one of the leading systems
providers for the national telecommunications market and one of
the systems vendors that can provide controllable technology and
products defining the core capabilities of telecommunication
networks (4G VoLTE AS). As the second biggest market share holder,
Eastcom-BUPT accounts for one third of the market for mobile
intelligent network products.
I worked in the Big Data department. We offer Big Data solutions
to other companies. Our main partners are the branches of China
Mobile and China Telecom. Most of the time, we work with our
clients on their premises.
The company's website is www.ebupt.com
Human environment:
Eastcom-BUPT has 800 employees, two centers, 20 departments
and several branch offices.
[Figure: organization chart of Eastcom-BUPT. Units shown: Managing
Director's office; General Management Department; Sales and
service; Telecommunications; Big Data department; Test department;
Finance department; Commerce department; Product business
department; Sales department; Technical department; Pre-sale
services; Product department; Engineering department; Technical
Service Department.]
I worked in the technical department, as did my tutor. It is a
sub-department of the Big Data department. There are 158 employees
in the Big Data department: 97 work in the product department and
61 work in the technical department. In the technical department
we work in small groups, each in charge of a different project. In
Hangzhou, for example, our project had three people working on
algorithms and seventeen people working on the database.
2.1.2 MIGU
Overview:
I did the first part of my internship at Migu Co Ltd, one of the
subsidiaries of China Mobile Communications Corp. It is the
full-media publishing arm of China Mobile. The company was
established to help the largest telecom company run its content
business better.
It is a culture and technology service arm of China Mobile. Migu
Digital Media has a registered capital of 10.4 billion yuan. It
integrates the music, video, reading, game and comic sections,
which had been run separately by China Mobile, into one platform,
migu.cn. Last year, the business volume of Migu Co Ltd was 6
billion. It also has five subsidiaries: MIGU Music Ltd., MIGU
Video Technology Ltd., Migu Digital Media Co. Ltd., MIGU
Interactive Entertainment Co., Ltd. and MIGU Animation Co., Ltd.
[Figure: organization chart. Under China Mobile: the Yunnan branch
company, 42 other branch companies, and Migu Co. Ltd. Under Migu
Co. Ltd.: MIGU Music, MIGU Video Technology, Migu Digital Media,
MIGU Interactive Entertainment Co. and MIGU Animation Co.]
I worked at MIGU Interactive Entertainment Co., which is located in
Hangzhou, a beautiful city. Migu Digital Media Co. Ltd. was
established on December 18, 2014. It is a professional Internet
company offering rich-media mobile newspaper services. Its
predecessor, China Mobile's mobile phone reading base, was
set up at China Mobile Zhejiang Company in early 2009, and the
mobile phone reading business was officially launched in May 2010.
In 2011, China Mobile's mobile phone reading base signed a
strategic cooperation memorandum with the former General
Administration of Press and Publication, which further integrated
the national policy platform. In December 2013, China Mobile
released its main business brand "HE", and the mobile phone
reading business was renamed "and reading". In April 2015, China
Mobile's mobile phone reading base was officially listed and
transformed into Migu Digital Media Co., Ltd.
By the end of 2017, the company had reached an industry value of
5.1 billion yuan, and its MIGU reading platform had gathered more
than 500,000 licensed books. It has 110 million monthly users and
has held 1,000 events featuring well-known figures in more than
200 cities across the country, fully serving the development of
reading for all and promoting a new era of digital reading in
China.
2017 mobile reading market share
QQ Reader   36%
iReader     24%
Migu         8%
Shu Qi       8%
Other       24%
Product of MIGU Interactive Entertainment
The main business of MIGU Interactive Entertainment is selling
eBooks online. It has a mobile platform and a website.
At the end of September 2017, MIGU became the third largest mobile
reading provider.
MIGU earns money from subscriptions, sales of copyright and
monthly payments. Writers publish their books online. Usually, the
first few chapters are free for the reader, but readers who are
interested in the book need to pay for the other chapters. For
books that have already been published on another platform, MIGU
buys the copyright and sells the book directly on its own
platform. MIGU also has a service called "listening to books",
which lets users listen to a book's content instead of reading it.
2.1.3 China Mobile
Overview:
The second part of my internship took place at the Yunnan
subsidiary of China Mobile.
China Mobile is the leading telecommunications services provider in
China. It provides full communications services in all 31 provinces
of China and in Hong Kong. It is a world-class telecommunications
operator with the world's largest network and customer base, and it
holds a leading position in profitability and market value ranking.
Its businesses primarily consist of mobile voice and data,
wireline broadband and other information and communications
services. At the end of 2017, it had 464,656 employees, 887 million
mobile customers and 113 million wireline broadband customers, and
its annual revenue exceeded 740 billion RMB. Its controlling
shareholder indirectly holds approximately 73 percent of the total
number of issued shares of the Company. Its shares are listed in
Hong Kong and the US, and nearly 28 percent of them are held by
public investors.
In 2017, the Company was once again selected as one of "The
World's 2,000 Biggest Public Companies" by Forbes magazine.
The organizational structure of China Mobile
2.2 Technical background
2.2.1 Hadoop
The recommendation system of MIGU Interactive Entertainment uses
the Hadoop Distributed File System (HDFS).
Hadoop is an open-source framework for writing and running
distributed applications that process large-scale data. It is
designed for offline, large-scale data analysis and is not suitable
for online transaction processing, which reads and writes a few
records at random. Hadoop = HDFS (file system, data storage) +
MapReduce (data processing). A Hadoop data source can take any
form, and Hadoop performs better than a relational database on
semi-structured and unstructured data, with more flexible
processing capabilities. Whatever its form, the data is eventually
converted to key/value pairs, which are the basic data unit.
Processing is written as MapReduce functions rather than SQL: SQL
is a query language, while MapReduce uses scripts and code. For
users who are used to SQL on relational databases, Hadoop has the
open-source tool Hive.
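The key/value idea can be illustrated without a cluster. The sketch below is only a single-process imitation of the map, shuffle and reduce phases in plain Python; it is not Hadoop code, and the record layout (user id, book id) and the function names are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (book_id, 1) key/value pair per reading record."""
    for _user_id, book_id in records:
        yield book_id, 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical reading records: (user id, book id).
records = [("u1", "b1"), ("u2", "b1"), ("u1", "b2"), ("u3", "b1")]
read_counts = reduce_phase(shuffle(map_phase(records)))
print(read_counts)
```

On a real cluster the map and reduce functions run in parallel on different nodes, and the grouping step is done by the framework.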
2.2.2 Oracle database
The BI system of MIGU uses Oracle as its database. Oracle
Database is a multi-model database management system produced and
marketed by Oracle Corporation.
It is commonly used for running online transaction processing
(OLTP), data warehousing (DW) and mixed (OLTP & DW) database
workloads. The latest generation is Oracle Database 18c. It has
the largest share of the database market.
The Oracle relational database management system stores data
logically in tablespaces and physically in data files. A
tablespace can contain several types of segments, such as data
segments and index segments. A segment in turn consists of one or
more extents, and an extent consists of contiguous data blocks. A
data block is the basic unit of data storage.
The Oracle database management system tracks data storage
through information stored in the SYSTEM tablespace. The SYSTEM
tablespace contains the data dictionary and, by default, indexes
and clusters. The data dictionary consists of tables that hold
information about all user objects in the database. Starting with
version 8i, Oracle began to support locally managed tablespaces,
which store space management information in bitmaps in their own
headers rather than in the SYSTEM tablespace.
2.2.3 Recall and precision
Precision (also called positive predictive value) is the fraction
of relevant instances among the retrieved instances. Recall (also
known as sensitivity) is the fraction of relevant instances that
have been retrieved over the total number of relevant instances.
Both precision and recall are therefore based on an understanding
and measure of relevance.
In simple terms, high precision means that an algorithm returned
substantially more relevant results than irrelevant ones, while high
recall means that an algorithm returned most of the relevant
results.
Suppose we have 60 positive samples and 40 negative samples, and
we need to find all positive samples. The system selects 50
samples, of which only 40 are true positives. The indicators above
are then calculated as follows.
TP (positive class predicted as positive): 40
FN (positive class predicted as negative): 20
FP (negative class predicted as positive): 10
TN (negative class predicted as negative): 30
accuracy = (TP+TN)/(TP+FN+FP+TN) = 70%
precision = TP/(TP+FP) = 80%
recall = TP/(TP+FN) = 2/3
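The arithmetic of this example can be checked with a few lines of Python. This is a minimal sketch; the function name is an assumption, and the F1 value (the harmonic mean of precision and recall, used later in this report) is added here for convenience.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Counts from the example: 60 positives, 40 negatives, 50 samples selected.
accuracy, precision, recall, f1 = classification_metrics(tp=40, fn=20, fp=10, tn=30)
print(accuracy, precision, recall, f1)
```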
2.2.4 K-means for classifying products
The product information is quantitative data. To classify the
products we use the k-means method.
The k-means method was introduced by J. MacQueen in 1967.
In the context of unsupervised classification, we generally try to
partition the space into classes that are concentrated and isolated
from each other. From this perspective, the k-means algorithm aims
to minimize the intra-class variance, which amounts to minimizing
the following energy:
$$E = \sum_{c \in C} \sum_{x \in c} \lVert x - \mu_c \rVert^2 = \sum_{c \in C} N_c V_c$$
where $C$ is the set of clusters and, for each cluster $c$,
$\mu_c$ is its center (called the core), $V_c$ its variance and
$N_c$ the number of its elements, and $x$ ranges over the data that
one seeks to classify.
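A minimal sketch of the algorithm in plain Python may help to see how the energy is reduced. It is not the implementation used in the project: the sample points are invented, and the farthest-first choice of initial centers is an assumption made to keep the example deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def farthest_first_centers(points, k):
    """Assumed initialization: start from the first point, then repeatedly
    take the point farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    return centers


def k_means(points, k, max_iterations=100):
    """Alternate assignment and center update until the assignment is stable."""
    centers = farthest_first_centers(points, k)
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = mean_point(members)
    # Energy: sum of squared distances of the points to their cluster center.
    energy = sum(squared_distance(p, centers[a]) for p, a in zip(points, assignment))
    return centers, assignment, energy


# Invented two-dimensional product features forming two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centers, assignment, energy = k_means(points, k=2)
print(assignment, energy)
```

Each iteration can only lower the energy or leave it unchanged, which is why the loop stops once the assignment no longer changes.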
3 Recommendation system
3.1 Theme and main purpose

The main content of this internship is to build a book
recommendation system based on existing algorithms. Other work
includes, but is not limited to, assisting my tutor in extracting
and cleaning data.
3.1.1 Approach to meet the objectives
In order to finish the project successfully and adapt to the
company's environment, we divided the work into several parts.
1. Understand the internship content, become familiar with the
environment, and read the thesis.
2. Read papers on social network mining and review what other
people have done.
3. Understand the structure of the database, communicate with
colleagues, and propose what data is needed.
4. MIGU data: extract the data myself.
5. China Mobile data: ask the relevant colleagues for help.
6. Use Python to implement the initial model.
7. Observe the data to summarize why the old model is not good
enough and which problems arise, and create new models.
8. Implement an improved model.
9. Extract data and run the experiments again.
3.2 Working environment
The working environment at MIGU Interactive Entertainment is very
relaxed and pleasant. The working day starts at half past eight in
the morning and ends at half past five. The food in the company's
cafeteria is delicious. There is a gym on the first floor of the
building, but unfortunately it does not have room for everyone at
noon. I was happy to work there. MIGU Interactive Entertainment
has 700 employees on this site. Our company has a team of 50 people
there, with 3 people working on algorithms and 17 people working on
the database. We are carrying out a BI platform project worth 20
million for MIGU Interactive Entertainment.
Figure 1 The lobby of company
Figure 2 My work space
3.3 Technical environment
We use Oracle as the database and Java as the programming
language. When the algorithm engineers design a new algorithm, they
write a document describing it together with the test results. The
team leader reads this document and decides whether we use the
algorithm. If we decide to use it, the programming engineers
implement it in Java.
Our database has five layers. The first layer holds the
subscription information of all kinds provided by Huawei. The
second layer is called the source layer, and its data is kept for
seven days. The last layer is the data mining layer. Most of the
tables I used came from the last two layers.
3.4 Technical methods
3.4.1 Use PL/SQL to extract the MIGU data
The first task was to extract the data from the database, and this
was one of my main tasks. I worked in PL/SQL.
An example query is shown below.
select c.MSISDN, c.book_id, d.bookname, c.class_id, c.class_name
from (select MSISDN, book_id, a.class_id, class_name
      from temp_booklistcalledah2018004 a, dim.dim_book_class_type b
      where a.class_id = b.class_id) c,
     dim.dim_bookinfo d
where c.book_id = d.bookid;
3.5 Analysis and results
3.5.1 Data
I did not have access rights to the China Mobile data, so I had to
write a document explaining what data was needed, and a colleague
extracted it for me.
We then used R and Python to clean the data.
3.5.2 Statistical analysis

In order to choose a suitable algorithm, a statistical analysis of
the data is needed.
The example below analyzes the call data of Anhui Province and
Guangxi Province. As the histogram shows, more than 90 percent of
calling users call fewer than 100 different users every month.
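A statistic of this kind can be computed directly from the call records. The sketch below is illustrative only: the record layout (caller, called user) and the tiny sample are assumptions, and the threshold of 3 stands in for the threshold of 100 used on the real data.

```python
from collections import defaultdict


def callee_counts(call_records):
    """Count the distinct called users of each calling user."""
    callees = defaultdict(set)
    for caller, called in call_records:
        callees[caller].add(called)
    return {caller: len(called_set) for caller, called_set in callees.items()}


def share_below(counts, threshold):
    """Fraction of callers whose number of distinct called users is below the threshold."""
    below = sum(1 for n in counts.values() if n < threshold)
    return below / len(counts)


# Invented call records for one month: (caller, called user).
call_records = [
    ("a", "b"), ("a", "c"), ("a", "b"),
    ("b", "c"),
    ("c", "a"), ("c", "b"), ("c", "d"),
]
counts = callee_counts(call_records)
share = share_below(counts, threshold=3)
print(counts, share)
```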
Another example compares the age distribution of calling users and
called users.
The comparison chart shows that the age of the called users is
more concentrated and younger than the age of the calling users.
3.6 A top-n recommendation algorithm based on user social network and book classification
The top-n recommendation algorithm is the main work I did at MIGU.
It is based on the user's social network and on book
classification.
3.6.1 Functional requirements
The user's social network is built from the call records, and books
are recommended to the user according to the classification of the
books and what the user's friends read.
3.6.2 Modeling method
Data pre-processing
Save the call records as a csv file and remove service numbers.
Extract from the database the subscription information of the
calling users and the called users, together with the
corresponding book categories.
Create a new file containing the reading information of all calling
and called users. Rank the books: each time a user reads a book,
the book's popularity ("heat") is increased by one. The resulting
file is stored as a pkl file.
Create a list of books with the book id and its name, and a
category list with the category id and its name, both stored as csv
files. In the actual project, we can use the tables in the database
directly without creating new ones.
Create a list of called-user information and a list of calling-user
information, including the book id and its name. Create a category
list containing the category id and its name, also saved as csv
files. All reading information has been deduplicated.
Calculate the share of the caller's total call duration that is
spent on calls between the two users: the more two users call each
other, the closer they are considered to be.
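The last step can be sketched as follows. This is an illustration under assumptions: the record layout (caller, called user, duration in seconds), the names and the durations are invented for the example.

```python
from collections import defaultdict


def call_weights(call_records):
    """Return {caller: {called: share of the caller's total call duration}}."""
    durations = defaultdict(lambda: defaultdict(float))
    for caller, called, seconds in call_records:
        durations[caller][called] += seconds
    weights = {}
    for caller, per_called in durations.items():
        total = sum(per_called.values())
        if total > 0:  # skip callers whose recorded durations are all zero
            weights[caller] = {
                called: duration / total for called, duration in per_called.items()
            }
    return weights


# Invented call records: (caller, called user, duration in seconds).
call_records = [
    ("lily", "b", 120), ("lily", "c", 60), ("lily", "b", 60), ("lily", "d", 60),
    ("b", "e", 30), ("b", "d", 90),
]
weights = call_weights(call_records)
print(weights["lily"])
```

The weights of one caller always sum to 1, so they can be read as the probability that a second of the caller's talk time is spent with a given user.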
Build a multi-level social network
The main purpose of this step is to build a multi-level social
network. The program takes the call records as input and returns a
list of affinity relationships for the social network centered on
the target user.
To extend the social network, at each iteration we add the newly
called users to the network and then use them as calling parties,
until there are k users in the social network or the maximum number
of iterations n is reached.
For example, user Lily calls b, c and d. We add b, c and d to the
social network, which gives a one-layer social network around Lily.
The program then uses b, c and d as starting points. The next
iteration extends the network to a two-layer social network
centered on Lily. b calls d, e and f, so we update the value of d
and add e and f to the social network. c calls d and g, so we
update the value of d again and add g to the network. Note that the
value of d has changed at this point, and we have already used the
value of d to calculate the affinity of other users. So, in the
next loop, we need to recalculate the intimacy of the users
associated with d.
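The expansion itself can be sketched as a layer-by-layer traversal. This is a simplified illustration, not the project code: it only records the layer at which each user enters the network and leaves the affinity updates aside. The function name and the input format (a dictionary from caller to called users) are assumptions.

```python
def build_social_network(calls, target, max_users, max_iterations):
    """Expand the network around `target` layer by layer.

    `calls` maps each caller to the list of users they called.
    Returns {user: layer at which the user first entered the network}.
    Stops when `max_users` users are in the network or after
    `max_iterations` layers.
    """
    layers = {}
    frontier = [target]
    for layer in range(1, max_iterations + 1):
        next_frontier = []
        for caller in frontier:
            for called in calls.get(caller, []):
                if called == target or called in layers:
                    continue  # already in the network
                if len(layers) >= max_users:
                    return layers
                layers[called] = layer
                next_frontier.append(called)
        if not next_frontier:
            break
        frontier = next_frontier
    return layers


# The example from the text: Lily calls b, c, d; b calls d, e, f; c calls d, g.
calls = {"lily": ["b", "c", "d"], "b": ["d", "e", "f"], "c": ["d", "g"]}
network = build_social_network(calls, "lily", max_users=10, max_iterations=2)
print(network)
```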
For indirect social relationships, such as the one between Lily and
e, there are two ways to calculate the intimacy: one is based on
entropy and the other on a probability model. Here we use the
probability model, where R is the intimacy, T is the weight of the
call and the subscript l denotes Lily:
$$R_{l,e} = T_{l,b} \cdot T_{b,e}$$
We use the formula below to calculate the relationship between Lily
and d, who is reached both directly and through c:
$$R_{l,d} = T_{l,d} + T_{l,c} \cdot T_{c,d} - T_{l,d} \cdot T_{l,c} \cdot T_{c,d}$$
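The two formulas can be checked numerically. In the sketch below the call weights are invented values and the function names are assumptions; the combination rule treats the direct path and the indirect path as independent events, as in the probability model.

```python
def path_intimacy(*call_weights):
    """Intimacy along one call path: the product of the call weights on it."""
    result = 1.0
    for weight in call_weights:
        result *= weight
    return result


def combine(r1, r2):
    """Probabilistic combination of two independent paths: r1 + r2 - r1 * r2."""
    return r1 + r2 - r1 * r2


# Invented call weights T between users.
t_lb, t_lc, t_ld = 0.6, 0.2, 0.2
t_be, t_cd = 0.25, 0.5

r_le = path_intimacy(t_lb, t_be)                 # Lily -> b -> e
r_ld = combine(t_ld, path_intimacy(t_lc, t_cd))  # direct call and Lily -> c -> d
print(r_le, r_ld)
```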
Calculate the value of 'trust' between the target user and the
other users in the social network
We assume that people of similar age and the same gender who spend
more time calling each other have similar reading preferences.
The main purpose of this procedure is to use the age, the gender
and the distance in the social network to calculate how much the
target user trusts the recommendations coming from the other users
in the social network. The distance in the social network is the
list of values calculated in the previous step.
The program takes three variables as input (age, gender and
distance in the social network) and returns a list of trust values.
The trust value is calculated with the formula below:
$$\text{ValueTrust} = \sqrt{(\text{gender coefficient} \times \text{gender value})^2 + (\text{age coefficient} \times \text{age value})^2 + (\text{intimacy coefficient} \times \text{intimacy weight})^2}$$
The gender value is 1 when the two users have the same gender and
0 when they differ.
The gender coefficient, the age coefficient and the intimacy
coefficient are three adjustable coefficients. We can change the
weight of the corresponding factor by changing these three
coefficients.
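The formula translates directly into code. The sketch below is an illustration: the report does not say how the age value is derived, so it is passed in as a ready-made number, and the default coefficients of 1.0 are assumptions.

```python
import math


def same_gender_value(gender_a, gender_b):
    """Gender value: 1 when the two users have the same gender, 0 otherwise."""
    return 1.0 if gender_a == gender_b else 0.0


def trust_value(gender_value, age_value, intimacy_weight,
                gender_coef=1.0, age_coef=1.0, intimacy_coef=1.0):
    """Euclidean norm of the three weighted factors."""
    return math.sqrt(
        (gender_coef * gender_value) ** 2
        + (age_coef * age_value) ** 2
        + (intimacy_coef * intimacy_weight) ** 2
    )


# Hypothetical friend: same gender, age value 0.5, intimacy 0.28.
trust = trust_value(same_gender_value("F", "F"), age_value=0.5, intimacy_weight=0.28)
print(trust)
```

Setting one coefficient to 0 removes the corresponding factor from the trust value, which is a simple way to test how much each factor contributes.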
Make a recommendation based on the social network and the value of
'trust'
The purpose of this step is to produce a sorted top-n list of
books. The program takes as input a list of trust values and a csv
file with the reading information of the called users, and returns
a list of recommended books drawn from the circle of friends.
The recommendation degree of a book is the trust value. If a book
has been read by several people, we add up the trust values of
these people. If a book is read by the same user in several
different months, the records are not deduplicated.
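The scoring rule above can be sketched in a few lines. The data structures (a dictionary of trust values and a list of reading records) and the sample values are assumptions made for the example; ties are broken by book id only to keep the output deterministic.

```python
from collections import defaultdict


def recommend_top_n(trust, readings, n):
    """Rank books by the summed trust values of the friends who read them.

    `trust` maps each friend to a trust value; `readings` is a list of
    (user, book) records. Repeated records are counted again, since the
    reading records are not deduplicated at this step.
    """
    scores = defaultdict(float)
    for user, book in readings:
        if user in trust:  # ignore readers outside the social network
            scores[book] += trust[user]
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [book for book, _score in ranked[:n]]


# Invented trust values and reading records.
trust = {"b": 1.2, "c": 0.8, "d": 0.5}
readings = [
    ("b", "book1"), ("c", "book1"), ("c", "book2"),
    ("d", "book3"), ("d", "book3"), ("x", "book4"),
]
top_books = recommend_top_n(trust, readings, n=2)
print(top_books)
```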
Give the relationship between the categories of books read in the
social network and the books actually read
This step is very important: it is one of the key parts of this
recommendation system. Its purpose is to use book classification
labels to classify books, so that books can later be recommended
to the user by category.
During the process above, I observed two phenomena. The first is
that the reading preferences of core users are not the same as
those of their friends. The second is that users whose social
networks are of the same type tend to have similar preferences.
Applied to the current project: if the types of books read by two
users' friends are similar, the books read by the two users will be
similar.
Because users have social networks of different sizes, the amount
their friends read also varies, so the number of books recommended
by the previous method is not consistent. Some users have only four
or five books read in their social network, while others have more
than forty or fifty. By learning the relationship between the book
categories read in the social network and the book categories
recommended to users, we can achieve accurate recommendations by
category.
Recommending by category has two benefits:
1. The number of books recommended to the user each time is stable.
2. A user generally prefers certain book categories, so if the
recommended category is correct, there is a high chance of finding
books the user likes.
3.6.3 Test
Result
The sample was divided into 200 parts, 5 of which were used as
test samples against the remaining 195. In previous tests, I found
that this kind of model does not apply to users who have a very
large social network or a very small one. Users with a large social
network (more than 300 friends) and users with a small social
network were therefore removed.
          recall (%)    precision (%)   F1
Group 1   23.0357143    1.9607843       1.760375929
Group 2   14.2210884    1.6806723       1.413916545
Group 3   44.5102041    5.0389356       4.443426213
Group 4    7.7777778    1.9607843       1.594516595
Group 5   32.4404762    4.4117647       3.889996641
mean      25.9094388    3.0104342       2.630984541
3.7 Collaborative filtering recommendation algorithm verification
3.7.1 Verification ideas
Verification ideas: divide the historical data of the same batch of
users into ten parts, one for testing and nine for model training.
Output the top 30 books with the highest matching degree and
compare them with the test data to obtain the precision, the recall
and the F1 value. Repeat ten times, and take the averages of the
precision, the recall and the F1 value of the five results as the
verification result of the final model.
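The comparison between a recommended list and the test data can be sketched as follows. The function names and the small example are assumptions; in the verification described above, the recommended list would contain the top 30 items for each user.

```python
def top_n_metrics(recommended, actual):
    """Precision, recall and F1 of one recommended list against the items
    the user actually interacted with in the test data."""
    hits = len(set(recommended) & set(actual))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(actual) if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def mean_metrics(per_run):
    """Average each of the three metrics over several runs or users."""
    n = len(per_run)
    return tuple(sum(metrics[i] for metrics in per_run) / n for i in range(3))


# Invented example: 4 recommended items, 2 items in the test data, 1 hit.
one_run = top_n_metrics(["b1", "b2", "b3", "b4"], ["b2", "b5"])
averaged = mean_metrics([one_run, (0.75, 0.5, 0.6)])
print(one_run, averaged)
```

With a long recommended list and few test items per user, precision is bounded by the ratio of the two, which explains why a high recall can coexist with a low precision.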
Verification data: the verification data consists of six months of
historical behavior data, from January 1 to June 30, for 10,000
randomly selected users. The behaviors include click, watch,
download and collect. The user behavior score is used as the rating
input of collaborative filtering. Each behavior score is obtained
as a weighted sum of the counts of each behavior, and the weights
are calculated with the entropy method.
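The report does not detail the entropy computation, so the sketch below follows the common textbook form of the entropy weight method and should be read as an assumption: each behavior column is normalized to proportions, its entropy is computed, and behaviors whose counts vary more between users receive a larger weight. The sample counts are invented.

```python
import math


def entropy_weights(matrix):
    """Entropy weight method on a users x behaviors matrix of counts.

    Assumes at least two users and at least one non-uniform column.
    """
    n_users = len(matrix)
    n_behaviors = len(matrix[0])
    divergences = []
    for j in range(n_behaviors):
        column = [row[j] for row in matrix]
        total = sum(column)
        entropy = 0.0
        for count in column:
            if count > 0:
                p = count / total
                entropy -= p * math.log(p)
        entropy /= math.log(n_users)       # normalize the entropy to [0, 1]
        divergences.append(1.0 - entropy)  # low entropy means informative behavior
    total_divergence = sum(divergences)
    return [d / total_divergence for d in divergences]


def behavior_score(counts, weights):
    """Weighted sum of one user's behavior counts."""
    return sum(w * c for w, c in zip(weights, counts))


# Invented counts per user: click, watch, download, collect.
behavior_counts = [
    [10, 5, 1, 0],
    [10, 3, 0, 1],
    [10, 8, 4, 0],
]
weights = entropy_weights(behavior_counts)
scores = [behavior_score(row, weights) for row in behavior_counts]
print(weights, scores)
```

In this example every user clicks equally often, so the click column carries no information and its weight is practically zero.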
3.7.2 Results and conclusions
Table: Collaborative filtering offline test results
          Precision (%)   Recall (%)   F1 (%)
Group 1   3.17            58           6.01
Group 2   3.14            56.83        5.96
Group 3   3.21            57.65        6.09
Group 4   3.23            58.83        6.12
Group 5   3.16            68.14        6.00
mean      3.18            60           6
The results above show that the precision, the recall and the F1
value do not differ much from one run to the next, which indicates
that the extracted users are relatively random and that the model
is stable. The recall is very high, which indicates that the
recommendation results include the animations that most users
would actively watch even if they did not follow the recommended
list. The precision is about 3%, which means that, in addition to
the animations that the user would watch on their own initiative,
there are many results based on the user's historical interests.
The user's true feedback on this part of the animations needs to be
tested after the recommendation goes online.
4 User classification
After 12 weeks of work at MIGU, my tutor was assigned to another
project in another city. So I also moved to another branch company,
where I worked for 8 weeks. This time, I worked for the Yunnan
branch of China Mobile. My new task was to classify clients into
different groups and to analyze the client profiles.
4.1 Data
My colleague extracted the data for me. The data contains
information on the user, the product, the type of phone of the
user, the flow (mobile data traffic) used and the calls.
There are 22.64 million users in Yunnan. My colleague extracted a
sample of 1 million users for me. Most of the data is quantitative.
Only the product name, the mobile phone brand and the type of
operating system are qualitative.
In order to understand the data and find the best algorithms, we
needed to carry out a descriptive statistical analysis.
I computed the covariance between each pair of variables and
sorted the covariances.
24
I found many very interesting results. Such as the amount which
people use flow between twenty o’clock to six o’clock second day is
relative to price of telephone. The number of supplement packages
ordered is proportional to music preferences and free time flow
used.
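The covariance ranking described above can be sketched in a few lines. This is only an illustration: the column names and values are hypothetical, and sorting by absolute value is an assumption. Covariance depends on the units of each variable, so a correlation coefficient may be easier to compare across pairs.

```python
from itertools import combinations

def covariance(xs, ys):
    """Sample covariance of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sorted_covariances(columns):
    """Return (name_a, name_b, covariance), largest absolute covariance first."""
    pairs = [
        (a, b, covariance(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    return sorted(pairs, key=lambda item: abs(item[2]), reverse=True)

# Hypothetical columns for four users.
columns = {
    "night_flow_mb": [120.0, 300.0, 80.0, 450.0],
    "phone_price": [1500.0, 3200.0, 900.0, 5000.0],
    "call_minutes": [200.0, 180.0, 220.0, 150.0],
}
ranking = sorted_covariances(columns)
for name_a, name_b, value in ranking:
    print(name_a, name_b, round(value, 1))
```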
4.2 Four clustering models to generate 4 types of user tags
We use the k-means algorithm for this clustering. We want to
identify the different types of client from their flow and call
records in order to sell our packages.
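For reference, a minimal pure-Python sketch of k-means (Lloyd's algorithm) is given below. It is not the implementation used in the project. The two features, the sample points and the choice of k = 2 are hypothetical, and real features would normally be scaled before clustering.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=100, seed=0):
    """Plain k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initial centroids: k distinct points drawn from the data.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels

# Hypothetical users: (monthly flow in GB, share of in-package flow used).
points = [
    (1.0, 0.2), (1.2, 0.3), (0.8, 0.25),
    (9.0, 1.1), (10.0, 1.3), (9.5, 1.2),
]
centroids, labels = k_means(points, k=2)
```

Each of the four models in this chapter would apply the same procedure to a different set of variables and a different number of groups.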
4.2.1 Flow-flow consumption

In the first model we use the flow data to cluster users into four
groups:
Type 1: medium flow; in-package flow is sufficient; 3% buy a
supplementary flow package.
Type 2: medium-to-high flow; in-package flow is insufficient; 17%
buy supplementary flow packages.
Type 3: low flow; in-package flow is sufficient; 2% buy
supplementary flow packages.
Type 4: high flow; in-package flow is sufficient; 12% buy
supplementary flow packages.
4.2.2 Flow-flow preference
We use the flow and flow-preference data to separate customers into
five groups: economical entertainment flow preference;
entertainment flow preference; no entertainment flow preference;
entertainment flow preference with music aversion; and
entertainment flow preference with social aversion.
4.2.3 Call-consumption preferences
We use the call and flow-preference data to separate customers into
four groups: normal call requirements with enough in-package voice;
normal call requirements with insufficient local voice; normal call
requirements with insufficient domestic voice; and weak call
requirements with enough in-package voice.
4.2.4 Call-behavior preference
We use the call-behavior preference data to separate customers into
four groups: local residents with a local social network; frequent
intra-provincial travel users; light domestic travel users in the
province; and frequent outbound travel users.
The result is not satisfactory. We need only eight or nine groups
and we want to use nine or ten variables. In this case, however, we
use more than ten variables and obtain more than fifteen groups. We
cannot use it in the actual system even if the model has a good
score.
4.3 KNN algorithm for supervised learning classification
After observing the data, I found that there are three types of
supplementary package. I took these three types as the target of a
supervised learning classification. I chose the 6 variables most
correlated with the supplementary-package variable and used the KNN
algorithm to perform the classification. The score of this model
reaches 0.695.
We need to pay attention to the balance of the sample and to
overfitting.
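A minimal sketch of KNN classification with a held-out check is shown below. It is an illustration only: the report uses 6 variables and three package types, while this example uses two hypothetical features and hypothetical labels, and the accuracy helper is an assumption about how the score could be measured.

```python
from collections import Counter

def knn_predict(train_features, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(features, query)), label)
        for features, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

def accuracy(train_features, train_labels, test_features, test_labels, k=3):
    """Share of held-out samples whose predicted label is correct."""
    correct = sum(
        knn_predict(train_features, train_labels, x, k) == y
        for x, y in zip(test_features, test_labels)
    )
    return correct / len(test_labels)

# Hypothetical training data: two features, three package types.
train_features = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8), (9.0, 1.0), (9.2, 1.1)]
train_labels = ["type_1", "type_1", "type_2", "type_2", "type_3", "type_3"]

# Hypothetical held-out data, kept separate from the training data.
test_features = [(1.1, 0.9), (5.1, 5.1), (9.1, 1.0)]
test_labels = ["type_1", "type_2", "type_3"]

prediction = knn_predict(train_features, train_labels, (5.1, 5.1))
held_out_accuracy = accuracy(train_features, train_labels, test_features, test_labels)
```

Scoring on samples that were not used for training is one simple way to detect overfitting. Checking the number of samples per package type before training helps with the balance issue.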
4.4 Find the flow-suppressed customers
Some customers use a lot of flow at the beginning of the month and
then find that they do not have enough left. So, they repress their
demand for flow until the end of the month.
Figure: Diagram of the flow repression curve (horizontal axis: days
1 to 31; vertical axis: 0 to 1.5).
Repression index: the area between the cumulative flow percentage
and the reference line represents the degree of flow suppression.
The larger the area, the higher the flow suppression.
Using the historical flow data, we can construct a model to find
the customers who repress their demand for flow. China Mobile can
then offer them a supplementary flow package.
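The repression index can be sketched as a simple area computation. The report does not specify the reference line or the normalization, so the following are assumptions of this sketch: the cumulative flow is divided by the package allowance, the reference line rises linearly from 0 to 1 over the month, and only the part of the curve above the line is counted. The function name and the sample users are hypothetical.

```python
def repression_index(daily_flow, allowance):
    """Area between the cumulative flow percentage and a linear reference line.

    Assumptions (not from the report): the cumulative flow is normalised by
    the package allowance, the reference line rises linearly from 0 to 1 over
    the month, and only the part of the curve above the line is counted.
    """
    n_days = len(daily_flow)
    cumulative = 0.0
    area = 0.0
    for day, flow in enumerate(daily_flow, start=1):
        cumulative += flow
        cumulative_pct = cumulative / allowance
        reference = day / n_days
        # Rectangle rule: each day contributes a strip of width 1 / n_days.
        area += max(cumulative_pct - reference, 0.0) / n_days
    return area

# Hypothetical 30-day users with a 3000 MB allowance.
steady_user = [100.0] * 30
front_loaded_user = [250.0] * 10 + [25.0] * 20

steady_index = repression_index(steady_user, 3000.0)
front_loaded_index = repression_index(front_loaded_user, 3000.0)
```

Under these assumptions, a user who spreads consumption evenly gets an index of zero, while a user who consumes most of the allowance in the first ten days and then holds back gets a clearly positive index.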
Conclusion
Conclusion of top-n recommendation algorithm
What can be improved on the original model:
Model aspect:
1. The current data generate only a one-layer social network. If we
could build a multi-layer social network, we would reduce the impact
of noise and the results would be more accurate. It might even be
possible to find new users for the user community.
2. MIGU already uses a recommendation algorithm in its current
system, but the results of that system are neither very accurate nor
very diverse. The system always recommends the most popular books in
a book category, so the recommendations are always similar. But
people may like several different types of books, including
non-popular subcategories. The recommendation results need more
diversity. If we took other information into account, such as book
tags, we could obtain a more accurate model.
Data aspect:
1. The relevant parameters can be tuned further. The number of
books recommended for each category and the total number of books
recommended have the greatest influence on recall and precision.
2. I do not have telecommunication data for other provinces. We
cannot generate a multi-layer social network because the sample is
small, so the social network in the current experiment has only one
layer. As the number of training samples increases, the
recommendation model will become more accurate.
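To make the two parameters in the first data item concrete, here is a hypothetical sketch of a top-n selection with a per-category cap. It is not MIGU's system nor the model built during the internship; the function name, the candidate tuples and the scores are invented for illustration.

```python
def top_n_with_category_cap(candidates, total_n, per_category):
    """Pick up to `total_n` books by score, with at most `per_category` per category."""
    chosen = []
    used = {}  # category -> number of books already chosen
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    for book_id, category, score in ranked:
        if len(chosen) == total_n:
            break
        if used.get(category, 0) < per_category:
            chosen.append(book_id)
            used[category] = used.get(category, 0) + 1
    return chosen

# Hypothetical candidates: (book id, category, predicted score).
candidates = [
    ("b1", "romance", 0.9),
    ("b2", "romance", 0.8),
    ("b3", "romance", 0.7),
    ("b4", "history", 0.6),
    ("b5", "science", 0.5),
]
list_cap_2 = top_n_with_category_cap(candidates, total_n=3, per_category=2)
list_cap_1 = top_n_with_category_cap(candidates, total_n=3, per_category=1)
```

Lowering the per-category cap makes the list more diverse at the cost of skipping some high-scoring books, which is why these two settings affect recall and precision so strongly.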
Update the model:
How can we get a better model in this case? First, build a
multi-layer social network from the communication data. Then,
establish the active-passive relationship between the recommending
user and the recommended user.
Conclusion of user classification:
In an actual project, we do not need an overly complex algorithm to
do a classification. The key to success is understanding the
business of the company. Personally, I prefer to extract the data
myself, because data extracted by others sometimes lose information,
and it does not help me understand the database and the business.
The most important thing in a company is to solve the problem. Most
of the time, a simple model can solve a big problem.
Conclusion for this experience
I did my internship at Hangzhou Eastcom-BUPT Information Technology
Co., Ltd., working at the MIGU Reading Base. Although this
internship lasted only five months, I gained a great deal. First,
this was my first internship in this major. It gave me a preliminary
understanding of the Internet industry and of the big data field.
My internship mainly consisted of building a social network-based
book recommendation system. I surveyed the literature and found that
recommendation system papers are usually similar: most papers
discuss collaborative filtering recommendation algorithms but not
social network-based recommendation algorithms. Recommendation
systems are mostly used in e-commerce to recommend products, and few
are specialized in recommending books. Daily supplies are very
different from books in the recommendation field. There are more
categories of goods than of books, and people repeatedly buy books
of the same category, but they rarely buy the same daily supplies
again within a short time. For example, someone who bought a mouse
will not buy a second one shortly afterwards, but someone who has
read a romantic novel will look for a second, similar one. Through
this research, I formed a preliminary idea for building a model. By
learning Python, I implemented the model according to my own ideas.
However, during verification, I found that the model did not give
good results, because the books read by users differ from those read
by their friends. This matches the common saying that people look
for partners who are different from themselves but within the same
social class. Then, by observing the data, I found that similar
users always have similar social networks. So, the second model was
proposed, built and verified, and it gave good results. In addition,
I found that most people read only a few kinds of popular books,
which may be due to previous popularity-based recommendations. This
suggests, first, that users are strongly influenced by the
recommendation results and sometimes read without a particular
purpose. Second, our user types are relatively uniform. Most users
read "city of enthusiasm" books. Users who like other types of books
may not become MIGU users. If our recommendations were more diverse,
we might also attract more different types of users and expand the
number of users.
There are two main aspects to improve in this internship. The first
is that I did not have enough time. The second is that my
programming ability was insufficient. I wasted a lot of time
debugging, and my code is not very clean, which makes it difficult
for others to understand.
An internship is important for students. A good internship can
improve professional quality and strengthen analytical and
problem-solving skills. An internship is also an important way to
train talent, expand our knowledge, broaden our contact with
society, increase our experience of social competition, and exercise
and improve our abilities. In addition, an internship helps us adapt
to the needs of the job: this time, I analyzed data in practice. I
found that it is important to adapt quickly to a new working
environment after entering the workplace.
Learning quickly is also important, because knowledge is updated
very quickly in this era.
References:
[1] http://www.ebupt.com/list/roll/snavid/5
[2] https://www.w3cschool.cn/hadoop/hadoop_introduction.html
[3] http://218.200.230.11/about/intro.html
[4] https://www.chinamobileltd.com/en/about/overview.php
[5] Jiawei Han, Data Mining: Concepts and Techniques.
[6] J. MacQueen, Some methods for classification and analysis of
multivariate observations. In Proceedings of the Fifth Berkeley
Symposium on Mathematical Statistics and Probability, 1:281-297,
1967.