IT for finance
Promotion 2019
by
ZHENG Yujie
Table of contents
1 Introduction
2 Content
2 Background
2.1 Description of the company
2.1.1 Eastcom-BUPT Information Technology Company
2.1.2 MIGU
2.1.3 China Mobile
2.2 Technical background
2.2.1 Hadoop
2.2.2 Oracle database
2.2.3 Recall and precision
2.2.4 K-means product classification
3 Recommendation system
3.1 Theme and main purpose
3.1.1 Approach to meet the objectives
3.2 Working environment
3.3 Technical environment
3.4 Technical methods
3.4.1 Using PL/SQL to extract the MIGU data
3.5 Analysis and results
3.5.1 Data
3.5.2 Statistical analysis
3.6 A top-n recommendation algorithm based on user social network
and book classification
3.6.1 Functional requirements:
3.6.2 Modeling method
3.6.3 Test
3.7 Collaborative filtering recommendation algorithm verification
3.7.1 Verification ideas
3.7.2 Results and conclusions
4 User classification
4.1 Data
4.2 Four clustering models to generate 4 types of user tags
4.2.1 Flow-flow consumption
4.2.2 Flow-flow preference
4.2.3 Call-consumption preferences
4.2.4 Call-behavior preference
4.3 KNN algorithm for supervised learning classification
4.4 Find the flow suppressed customer
3 Conclusion
Introduction
I am a fourth-year student at EFREI (École française d'électronique
et d'informatique), majoring in IT for finance. As part of my
training, I completed a 20-week internship. It was a technical
experience that allowed me to practice the techniques we learn at
school and to understand the techniques used in a real company.
I did my internship at Hangzhou Eastcom-BUPT Information
Technology Co. Ltd. Today, thanks to new database and website
technologies, among others, the company holds a huge amount of
customer data. The new challenge is how to use this data. A friend
offered me a position as assistant algorithm engineer at
Eastcom-BUPT. Having always been interested in big data
techniques, I was happy to have the chance to take part in a big
data project in a real company.
EB, the abbreviation of Hangzhou Eastcom-BUPT Information
Technology, provides various business solutions to telecom
operators and industrial customers, in areas such as
telecommunications, big data, communications security, software
development, systems integration, operational support and
consulting services.
My role was assistant algorithm engineer. The aim of the
internship was for me to be able to carry out a project on my own
by the end of it. The internship had two main parts. The first part
took place in Hangzhou at MIGU, a client and partner of EB. My
main work there was studying different algorithms and testing
them. The second part took place at the Yunnan branch office of
China Mobile, where my main work was analyzing customer data.
Content
2 Background
Overview:
I did my internship at a company called Hangzhou Eastcom-BUPT
Information Technology Co. Ltd. It is located at 4/F, Dongxin
Mansion, No. 398, Wensan Road, Hangzhou. It is engaged in the
research, development, integration and engineering of mobile
communications and Internet software.
Hangzhou Eastcom-BUPT Information Technology Co. Ltd. (referred
to as Eastcom-BUPT or EB for short) was co-founded by Professor
LIAO Jianxin and Eastcom Communications Group in February
2000. Eastcom-BUPT has a registered capital of 50 million yuan and
more than 20 branches in different provinces across the country.
In addition, EB has more than 4,000 square meters of office space
and a large-scale laboratory representing an investment of 20
million yuan.
Eastcom-BUPT's core technical team is made up of holders of
doctoral and master's degrees in computing and telecommunications,
and some of them are top researchers in value-added service
research and product development.
EB provides various business solutions for telecom operators and
industrial customers, in areas such as telecommunications, big
data, communications security, software development, systems
integration, operational support and consulting services. It has
become one of the leading systems providers for the national
telecommunications market and one of the system vendors able to
provide controllable technology and products that define the core
capabilities of telecommunication networks (4G VoLTE AS). As the
second-largest market share holder, Eastcom-BUPT accounts for
one third of the market for mobile intelligent network products.
I worked in the Big Data department. We offer Big Data solutions
to other companies. Our main partners are the branches of China
Mobile and China Telecommunication. Most of the time, we work
with our clients on their premises.
The company's website is www.ebupt.com
Human environment:
Eastcom-BUPT has 800 employees, two centers, 20 departments
and several branch offices.
[Organization chart: Managing Director's office; General Management
Department; Engineering Department; Technical Service Department]
2.1.2 MIGU
Overview:
I did the first part of my internship at Migu Co Ltd, one of the
subsidiaries of China Mobile Communications Corp. It is the
full-media publishing arm of China Mobile. The company was
established to help the largest telecom company run its content
business better.
It is the culture and technology service arm of China Mobile. Migu
Digital Media has a registered capital of 10.4 billion yuan. It
integrates the music, video, reading, game and comic sections,
which had been run separately by China Mobile, into one platform,
migu.cn. Last year, the business volume of Migu Co Ltd was 6
billion. It also has five subsidiaries: MIGU music Ltd., MIGU
video technology Ltd., Migu Digital Media Co. Ltd, MIGU
Interactive Entertainment Co., Ltd. and MIGU Animation Co., Ltd.
[Organization chart: China Mobile and its MIGU subsidiaries: MIGU
music; MIGU video technology; Migu Digital Media; MIGU Interactive
Entertainment Co.; MIGU Animation Co.]
the national policy platform. In December 2013, China Mobile
released its main business brand “HE”, and the mobile phone
reading business was renamed “and reading”. In April 2015, China
Mobile's mobile phone reading base was officially listed and
transformed into Migu Digital Media Co., Ltd.
By the end of 2017, the company had realized an industry value of
5.1 billion yuan, and its MIGU reading business platform had
gathered more than 500,000 genuine books. The total number of
monthly users is 110 million, and 1,000 famous events have been
held in more than 200 cities across the country, fully serving the
development of reading for all and promoting a new era of digital
reading in China.
[Pie chart of reading platforms: QQ reader 36%, iReader 24%,
Shu Qi 8%, Migu 8%, other 24%]
2.1.3 China Mobile
Overview:
The second part of my internship took place at the Yunnan
subsidiary of China Mobile.
China Mobile is the leading telecommunications services provider in
China. It provides full communications services in all 31 provinces
of China and in Hong Kong. It is a world-class telecommunications
operator with the world's largest network and customer base, and
it holds a leading position in profitability and market value
ranking.
Its businesses primarily consist of mobile voice and data,
wireline broadband and other information and communications
services. At the end of 2017, the company had 464,656 employees,
887 million mobile customers and 113 million wireline broadband
customers, with annual revenue exceeding 740 billion RMB. Its
parent group indirectly holds approximately 73 percent of the
total number of issued shares of the Company. The Company is
listed in Hong Kong and the US, and nearly 28 percent of its
shares are held by public investors.
In 2017, the Company was once again selected as one of "The
World's 2,000 Biggest Public Companies" by Forbes magazine.
2.2 Technical background
2.2.1 Hadoop
The recommendation system of MIGU Interactive Entertainment uses
the Hadoop Distributed File System.
Hadoop is an open source framework for writing and running
distributed applications on large-scale data. It is designed for
offline, large-scale data analysis and is not suitable for online
transaction processing, which reads and writes a few records at
random. Hadoop consists of HDFS (the file system, for data
storage) and MapReduce (for data processing). A Hadoop data source
can take any form, and Hadoop handles semi-structured and
unstructured data better and more flexibly than a relational
database. Whatever its original form, the data is eventually
converted to key/value pairs, which are the basic data unit.
Processing is expressed as MapReduce functions instead of SQL:
SQL is a query language, while MapReduce uses scripts and code.
For users accustomed to SQL on relational databases, Hadoop
offers the open source tool Hive instead.
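The key/value idea can be illustrated with a minimal in-memory sketch in Python. It does not use Hadoop itself; the function names (map_phase, shuffle, reduce_phase, run_job) and the sample records are assumptions made for illustration, and a real job would run distributed over HDFS.

```python
from collections import defaultdict


def map_phase(record):
    """Map: turn one input record (a line of text) into (key, value) pairs."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on a small list of records."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Made-up sample records, for illustration only.
counts = run_job(["big data big", "data analysis"])
```

Every intermediate result is a key/value pair, which is why the input data can take any form as long as the map step can turn it into such pairs.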
been retrieved over the total amount of relevant instances. Both
precision and recall are therefore based on an understanding and
measure of relevance.
In simple terms, high precision means that an algorithm returned
substantially more relevant results than irrelevant ones, while high
recall means that an algorithm returned most of the relevant
results.
Suppose we have 60 positive samples and 40 negative samples, and we
need to find all the positive samples. The system selects 50
samples, of which only 40 are truly positive. The indicators above
are then calculated as follows.
TP (positive class predicted as positive): 40
FN (positive class predicted as negative): 20
FP (negative class predicted as positive): 10
TN (negative class predicted as negative): 30
accuracy = (TP+TN)/(TP+FN+FP+TN) = 70%
precision = TP/(TP+FP) = 80%
recall = TP/(TP+FN) = 2/3
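The same arithmetic can be checked with a short Python sketch. The function name classification_metrics is an assumption made for illustration; exact fractions are used so that recall comes out as 2/3 rather than a rounded decimal.

```python
from fractions import Fraction


def classification_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision and recall from the four confusion counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": Fraction(tp + tn, total),
        "precision": Fraction(tp, tp + fp),
        "recall": Fraction(tp, tp + fn),
    }


# Counts taken from the worked example: TP=40, FN=20, FP=10, TN=30.
metrics = classification_metrics(tp=40, fn=20, fp=10, tn=30)
```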
With all the clusters and, for each cluster, its center (called the
core), its variance, the number of its elements, and all the data
that one seeks to classify.
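As a rough illustration of the per-cluster quantities listed above (center, variance, number of elements), here is a minimal K-means sketch in plain Python. The initialization with the first k points, the function names and the sample points are assumptions made for illustration, not the configuration used in the project.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on a list of equal-length numeric tuples."""
    centers = [points[i] for i in range(k)]  # naive init: first k points (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


def describe(center, members):
    """Center, variance (mean squared distance to the center) and size."""
    variance = sum(math.dist(p, center) ** 2 for p in members) / len(members)
    return {"center": center, "variance": variance, "size": len(members)}


# Made-up sample points forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centers, clusters = kmeans(points, k=2)
summary = [describe(c, m) for c, m in zip(centers, clusters) if m]
```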
3 Recommendation system
Understand the internship content, become familiar
with the environment, and read the thesis.
Read papers related to social network mining
and review what other people have done.
Figure 1 The lobby of the company
Our database has five layers. In the first layer, Huawei provides
all kinds of subscription information. The second layer is called
source, and its data is retained for seven days. The last layer is
the data mining layer. Most of the tables I used came from the last
two layers.
The first task I had to do was extract the data from the database,
and this was one of my main tasks. I worked in PL/SQL.
An example query is shown below.
3.5.1 Data
I do not have access rights to the China Mobile data. So I
need to write a document explaining what data is needed, and a
colleague extracts it for me.
We then use R and Python to clean the data.
Another example is the comparison of the age distribution between
callers and called users.
It can be seen from the comparison chart above that the age of the
called user is more concentrated and younger than the age of the
calling party.
3.6.1 Functional requirements:
The user's social network is constructed from the call information,
and books are recommended to the user according to the
classification of the books and what his friends read.
Data pre-processing
Save the call information as a csv file and remove the service numbers.
The subscription information of the calling users and the called users
and their corresponding classifications are extracted from the
database.
Create a new file containing reading information for all calling and
called users. Sort the books; each time a user reads a book, the
book's heat is increased by one. The result is stored in a pkl
file.
Create a list of books, including the book id and its name. Create
a category list with the category id and its name, both stored as csv
files. In the actual project, we can use the tables in the database
directly without creating new ones.
Create a list of called user information and a list of calling user
information, including the book id and its name. Create a category
list containing the category id and its name, also saved as csv files.
All reading information has been deduplicated.
Calculate the proportion of the call duration between two users
in the total call duration of the caller.
The more two users call each other, the closer they are.
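Two of the pre-processing steps above, the book heat count and the call weight, can be sketched in Python as follows. The record layouts, the function names book_heat and call_weights, and the sample data are assumptions made for illustration; the real inputs come from the csv files and database tables described above.

```python
import pickle
from collections import Counter, defaultdict


def book_heat(reading_records):
    """Add one unit of heat to a book for every (user, book) reading record."""
    return Counter(book_id for _user, book_id in reading_records)


def call_weights(calls):
    """Share of each caller's total call time spent with each called user."""
    pair_total = defaultdict(float)
    caller_total = defaultdict(float)
    for caller, callee, seconds in calls:
        pair_total[(caller, callee)] += seconds
        caller_total[caller] += seconds
    return {
        (caller, callee): seconds / caller_total[caller]
        for (caller, callee), seconds in pair_total.items()
        if caller_total[caller] > 0
    }


# Made-up sample data, for illustration only.
readings = [("lily", "book1"), ("b", "book1"), ("b", "book2")]
calls = [("lily", "b", 60), ("lily", "d", 30), ("lily", "b", 30), ("b", "e", 50)]

heat = book_heat(readings)
hottest_first = heat.most_common()   # books sorted by decreasing heat
heat_bytes = pickle.dumps(heat)      # bytes that could be written to a .pkl file
weights = call_weights(calls)
```

In this sample, Lily spends 90 of her 120 seconds of calls with b, so the weight of that call relation is 0.75.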
calculate the affinity of other users. So, in the next loop we need to
recalculate the intimacy of the user associated with d.
For indirect social relationships such as the one between Lily and
e, there are two ways to calculate the intimacy. One is based on
entropy, and the other is a probability model. Here we use the
probability model, where R is the intimacy and T is the weight of
the call:
$$R_{1e} = T_{1b} \cdot T_{be}$$
We use the formula below to calculate the relationship between Lily
and d:
$$R_{1d} = T_{1d} + T_{ac} \cdot T_{cd} - T_{1d} \cdot T_{ac} \cdot T_{cd}$$
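Reading the two formulas above as a probabilistic model, the intimacy along one path is the product of the call weights on that path, and two paths to the same user are combined like the probability of a union of independent events. The sketch below follows that reading; the function names and the sample weights are assumptions made for illustration, and the extension to more than two paths is an assumption as well.

```python
def path_intimacy(weights):
    """Intimacy along one path: the product of the call weights on the path."""
    product = 1.0
    for weight in weights:
        product *= weight
    return product


def combine_paths(path_values):
    """Combine several paths to the same user as 1 - prod(1 - r).

    For two paths a and b this equals a + b - a * b.
    """
    miss = 1.0
    for value in path_values:
        miss *= 1.0 - value
    return 1.0 - miss


# Made-up weights: a direct call relation and a two-step path to the same user.
direct = path_intimacy([0.2])
indirect = path_intimacy([0.5, 0.4])
intimacy = combine_paths([direct, indirect])
```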
Calculate the value of ‘trust’ between the target user and the other
users in the social network
We assume that people of similar age and the same gender who spend
more time on calls with each other have similar reading
preferences.
The main purpose of this procedure is to use age, gender and
distance in the social network to calculate how much the target
user trusts the recommendations from other users in the social
network. The distance in the social network is the list of values
calculated in the previous step.
The procedure takes three variables (age, gender, and distance in
the social network) and returns a list of trust values. The trust
value is calculated with the formula below.
The gender value is 1 if the two users have the same gender and 0
otherwise.
The gender coefficient, age coefficient and intimacy coefficient are
three adjustable coefficients. We can change the weight of the
corresponding factor by changing the three coefficients.
The recommendation degree of a book is the value of the trust
relationship. If a book has been read by several people, we add up
the ‘trust’ values of these people. If a book is read by one user in
several different months, the records are not deduplicated.
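The accumulation of trust values into a top-n list can be sketched as follows. The function name recommend_top_n, the data layout and the sample values are assumptions made for illustration; ties are broken by book id only to make the output deterministic.

```python
from collections import defaultdict


def recommend_top_n(trust, readings, n):
    """Rank books by the summed trust of the friends who read them.

    trust: {friend: trust value of the target user in that friend}
    readings: list of (friend, book_id), one entry per reading record,
              so a book read in several months counts several times.
    """
    score = defaultdict(float)
    for friend, book_id in readings:
        if friend in trust:
            score[book_id] += trust[friend]
    ranked = sorted(score.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Made-up sample data, for illustration only.
trust = {"b": 0.75, "d": 0.36}
readings = [("b", "book1"), ("d", "book1"), ("b", "book2"), ("b", "book2")]
top = recommend_top_n(trust, readings, n=2)
```

Here book2 ranks first because the same friend read it in two different records, which are deliberately not deduplicated.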
3.6.3 Test
Result
The sample was divided into 200 parts, 5 of which were used as
test samples against the other 195. In previous tests, I found that
such models did not apply to users with a large social network or
to users with a small social network. Users with a large social
network (more than 300 friends) and users with a small social
network were therefore removed.
          recall (%)    precision (%)   F1
Group 1   23.0357143    1.9607843       1.760375929
Group 2   14.2210884    1.6806723       1.413916545
Group 3   44.5102041    5.0389356       4.443426213
Group 4    7.7777778    1.9607843       1.594516595
Group 5   32.4404762    4.4117647       3.889996641
mean      25.9094388    3.0104342       2.630984541
actively read even if they do not follow the recommended list. The
accuracy rate is about 3%, which means that in addition to the
animation that the user will take the initiative to watch, there are
many results based on the user's historical interest. The user's true
feedback on this part of the animation needs to be tested after
online recommendation.
4 User classification
4.1 Data
My colleague extracted the data for me. The data contains user
information, product information, the user's phone type, the data
traffic used (called flow in this report) and call information.
I found many interesting results. For example, the amount of flow
people use between 20:00 and 6:00 the next day is related to the
price of their phone. The number of supplement packages ordered is
proportional to music preferences and to the free-time flow used.
4.2.2 Flow-flow preference
We use the flow and flow preference data to separate customers
into five groups: economical entertainment flow preference,
entertainment flow preference, no entertainment flow preference,
entertainment flow preference with music aversion, and
entertainment flow preference with social aversion.
4.2.3 Call-consumption preferences
4.2.4 Call-behavior preference
We use the call-behavior preference data to separate customers
into four groups: local residents with a local social network,
frequent intra-provincial travel users, mild domestic travel users
in the province, and frequent outbound travel users.
The result is not satisfactory. We need only eight to nine groups
and want to use nine to ten variables, but in this case we use more
than ten variables and obtain more than fifteen groups. We cannot
use it in the actual system even if the model has a good score.
4.4 Find the flow suppressed customer
Some customers use a lot of flow at the beginning of the month and
then find that they do not have enough left. So they repress their
demand for flow until the end of the month.
[Figure: diagram of the flow repression curve; horizontal axis from
1 to 31, vertical axis from 0 to 1.5]
Repression index: the area between the cumulative flow percentage
and the reference line represents the degree of flow suppression.
The larger the area, the higher the flow suppression.
We use historical flow data to build a model that finds the
customers who repress their demand for flow. China Mobile can then
offer them a flow supplement package.
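A minimal sketch of such a repression index is given below. It rests on several assumptions that are not stated in the report: the cumulative flow is expressed as a share of a monthly quota, the reference line is perfectly even consumption (day / number of days), and only the days where the customer is ahead of the reference line contribute to the area. The function name and the sample data are hypothetical.

```python
def repression_index(daily_flow, quota):
    """Approximate area between the cumulative flow share and an even-use line.

    daily_flow: flow used on each day of the month, in the same unit as quota.
    quota: monthly flow allowance (assumed to be the denominator of the share).
    """
    days = len(daily_flow)
    area = 0.0
    cumulative = 0.0
    for day, used in enumerate(daily_flow, start=1):
        cumulative += used
        gap = cumulative / quota - day / days  # distance above the reference line
        area += max(gap, 0.0) / days           # each day has a width of 1 / days
    return area


# Made-up four-day "months": even use versus everything used on the first day.
even_use = repression_index([1, 1, 1, 1], quota=4)
front_loaded = repression_index([4, 0, 0, 0], quota=4)
```

A customer who consumes evenly gets an index of zero, while a customer who uses the whole quota at the start of the month and then stops gets a clearly positive index.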
Conclusion
Conclusion of top-n recommendation algorithm
What can be improved in the original model:
Model aspect:
1. The current data generates only a one-layer social network. If
we could build multiple layers of social networks, we would reduce
the impact of noise and the results would be more accurate. It
might even be possible to find new users for the user community.
2. MIGU already uses a recommendation algorithm in the current
system, but the results of its current recommendation system are
not very accurate or diverse. The system always recommends the
most popular books in a book category, so the recommendations are
always similar.
But people may like several different types of books, including
non-popular subcategories. The recommendation results need more
diversity. If we took other information into account, such as book
tags, we could have a more accurate model.
Data aspect:
1. The relevant parameters can be tuned in the future. The number
of books recommended for each category and the total number of
books to recommend have the greatest influence on recall and
accuracy.
2. I do not have telecommunication data for other provinces. We
cannot generate multiple layers of social networks because the
sample data is small, so the social network in the current
experiment has only one layer. As the number of training samples
increases, the recommendation model will be more accurate.
important in the company is solving the problem. Most of the time,
a simple model can solve a big problem.
An internship is important for students. A good internship can
improve professional quality and enhance analytical and
problem-solving skills. An internship is also an important way to
develop talent, expand our knowledge, broaden our contact with
society, increase our experience of competition in society, and
exercise and improve our abilities. In addition, the internship can
be adapted to the needs of the job; this time, I analyzed data in
practice. I found that it is important to adapt quickly to a new
working environment after entering the workplace.
Learning quickly is also important, because knowledge is updated
very quickly in this era.