
Data Mining for Technical Operation of

Telecommunications Companies: a Case Study


Wiktor Daszczuk*, Piotr Gawrysiak*, Tomasz Gerszberg+, Marzena Kryszkiewicz*, Jerzy Mieścicki*,
Mieczysław Muraszkiewicz*, Michał Okoniewski*, Henryk Rybiński*, Tomasz Traczyk°, Zbigniew Walczak*

* Institute of Computer Science, Warsaw University of Technology
{wbd, gawrysia, mkr, jms, mrm, okoniews, hrb, walczakz}@ii.pw.edu.pl

° Institute of Control and Computation Engineering, Warsaw University of Technology
T.Traczyk@ia.pw.edu.pl

ul. Nowowiejska 15/19, 00-665 Warsaw, Poland

+ Polska Telefonia Cyfrowa Sp. z o.o.
tgerszberg@eragsm.com.pl
Al. Jerozolimskie 181, Warsaw, Poland

Abstract: This paper is an overview of a Data Mining project carried out by the Warsaw University of Technology in the Network Planning and Maintenance Department of a Polish cellular telecom provider. The project provided an excellent opportunity to test various Data Mining methods on real, non-classic (i.e. mostly not related to purely marketing problems) data from the technology area. The results of the Data Mining experiments are presented together with a short description of the applied methods and algorithms. Some remarks on managerial problems that emerged during the implementation of Data Mining techniques in a large corporation are also included.

Keywords: data mining, automatic knowledge discovery, cellular telecommunication systems, business process analysis

1. Introduction.
The Data Mining methodology has evolved rapidly over the last five years and, despite being quite a new concept in Information Technology applications, it has gained widespread market acceptance. This mainly refers to applications in which the analyzed data is easily interpretable by humans and can be relatively easily discretized. This includes marketing, sales analysis, company strategy building and so on - in short, the areas where other data analysis methods (such as statistics) have been used successfully for years [1].

We should, however, realize that the increasing amount of "intelligence" (in the form of microprocessor-based controllers) embedded in various machinery and tools enlarges the amount of automatically generated diagnostic information, which cannot be efficiently analyzed by humans. The cellular telephone network is a very good example of this phenomenon. An average GSM network consists of several thousand so-called base stations, each incorporating several controllers, possibly informing the network monitoring center about their status every couple of seconds. This activity generates an enormous amount of data, which is purely of a technological nature and usually difficult to interpret. In most situations this information is simply discarded; in fact, according to W. Schmidt [2], several companies adopt the "switch off" methodology and simply disable most of the network telemetry equipment.

Hence, the demand for some kind of automated knowledge discovery in such environments seems obvious, yet relatively few research projects have been undertaken towards this end. The occasion to perform a Data Mining analysis in a department of ERA GSM (one of the three Polish cellular telecom providers) was therefore a very stimulating and promising research opportunity for the Warsaw University of Technology, and especially for the Data Mining Team set up within its Information Systems Division. We present an overview of this research in this paper.

The paper is organized as follows. Section 2 describes several experiments performed by the Data Mining Team, together with an evaluation of their results. In Section 3 the managerial aspects of Data Mining projects are discussed. Concluding remarks in Section 4 complete the paper.

2. Case study.
As the business process analysis proved (see Section 3), data mining solutions may enhance the telecommunications company value chain at many different stages. However, because the idea of the data mining project came from the technology department managers, the research was focused on, but not limited to, the problems of this particular area. The research team analyzed the processes of two sections of the company - the Network Planning section and the Network Quality section. The Network Planning section's main responsibility includes tasks related to network expansion and increasing coverage, either by building new base stations or by reconfiguring existing ones. The Network Quality Management section, on the other hand, ensures the quality of services, i.e. attempts to minimize the number of dropped calls, unsuccessful handover1 attempts, etc.

2.1 Data mining models for cellular network planning.

One of the first problems where the local engineers thought data mining would be applicable was support for cellular radio network planning. At the beginning of its market activity, a cellular telecom company has to establish a network of base stations and the cells related to them. The users of cellular phones in a given area contact the nearest base station, which collects calls and initiates further transmission. In this way the so-called "cellular traffic" is generated. Some cells and areas generate more traffic and need more transmitters in the base station (to increase the number of concurrently available communication channels), which is obviously more expensive. The goal of the network planning team is to establish an optimal network of base stations in the area of activity. The transmitters should have enough capacity to handle the local cellular traffic without problems but, on the other hand, installing too many transmitters is improper due to the cost factor. Usually the planning process should smoothly anticipate the growth of the network, and it should also reflect the network expansion strategy of the company. The crucial success factors here are the proper location of a new base station and the prediction of the traffic, which determines the transmitter power.

The Data Mining team was supplied with information about the existing cellular network. The data had the form of a simple relational table with the following attributes:
- unique cell identification number;
- cell size in pixels;
- amount of each landuse (terrain type) in the cell, in pixels;
- average traffic in the cell, in Erlangs.

A pixel is a unit of land area of 5 by 5 arc seconds, which gives on average 100 by 150 meters in Poland. There were 9 landuse classes: forests, agricultural, water, swamps, concrete, residential, dense residential, city, industrial. On the basis of this data, the Data Mining team was expected to predict the cellular traffic for new cells.
Obviously, the major scope of the research was the assessment of how much traffic is generated by a single pixel of a particular landuse. This may be done using multiple regression, which is a good method to find the coefficients of the equation:
T = a1*l1 + a2*l2 + ... + a9*l9
where T is the traffic (in Erlangs) in the cell, l1..l9 are the numbers of pixels of each landuse, and a1..a9 are the traffic coefficients for the particular landuses (in Erlangs per pixel).
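As an illustration, the per-landuse coefficients of the regression model described above can be estimated by ordinary least squares. The sketch below is ours, not the paper's code; the cell data and coefficient values are synthetic:

```python
import numpy as np

# Hypothetical illustration of the traffic model T = a1*l1 + ... + a9*l9:
# fit per-landuse coefficients (Erlangs per pixel) by least squares.
rng = np.random.default_rng(0)

n_cells, n_landuses = 500, 9
L = rng.integers(0, 200, size=(n_cells, n_landuses)).astype(float)  # pixels per landuse
true_a = np.array([0.001, 0.002, 0.0, 0.0, 0.01, 0.02, 0.05, 0.08, 0.04])
T = L @ true_a + rng.normal(0, 0.5, n_cells)  # observed traffic in Erlangs

# No intercept term: a cell containing no land should carry no traffic.
a_hat, residuals, rank, _ = np.linalg.lstsq(L, T, rcond=None)

print(np.round(a_hat, 3))  # estimated Erlangs-per-pixel coefficients
```

Fitting without an intercept matches the form of the equation above, in which all traffic is attributed to the landuse pixels.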

Applying the above method to the data from the entire network resulted in quite poor accuracy. However, we expected the approximation to improve when the regression was calculated for subsets of cells with similar characteristics. For example, typically urban, rural or industrialized cells were expected to have similar traffic coefficients.
In this way a research methodology was introduced:
- classification of cells into subsets with similar characteristics, using clustering and decision trees;
- building a traffic model for every subset, using multiple regression or neural networks.

It turned out that this problem is a good example of an approach in which several classic data mining techniques are used together, some of them competitive and some complementary. In addition to the above methodology, the mining team developed the concept of a method intended to find the best possible solution to this problem. The method, described in [3] and [4] as regressional clustering, is in brief an algorithm based on k-means clustering (or a genetic algorithm) that divides the whole population into clusters, using regression quality estimators as the measure of cluster quality. Such an algorithm, after a successful implementation, should produce the best possible classification of cells.
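The regressional clustering idea can be sketched as a k-means-style loop that alternates between fitting one regression per cluster and reassigning each cell to the cluster whose model predicts its traffic best. This is our illustrative reconstruction, not the algorithm from [3] and [4]; the data and cluster count are invented:

```python
import numpy as np

# Hypothetical sketch of "regressional clustering": assign each cell to the
# cluster whose regression model gives the smallest residual, then refit one
# regression per cluster, and repeat until the assignment stabilizes.
rng = np.random.default_rng(1)

def regressional_clustering(L, T, k=2, iters=20):
    n = len(T)
    labels = rng.integers(0, k, n)            # random initial assignment
    for _ in range(iters):
        coefs = []
        for c in range(k):                    # refit one regression per cluster
            mask = labels == c
            if mask.sum() < L.shape[1]:       # keep clusters large enough to fit
                coefs.append(np.zeros(L.shape[1]))
                continue
            a, *_ = np.linalg.lstsq(L[mask], T[mask], rcond=None)
            coefs.append(a)
        # reassign each cell to the model with the smallest squared residual
        resid = np.stack([(T - L @ a) ** 2 for a in coefs])
        new_labels = resid.argmin(axis=0)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, coefs

# Two synthetic cell populations with different traffic coefficients
L = rng.uniform(0, 100, size=(400, 3))
a_urban, a_rural = np.array([0.05, 0.2, 0.1]), np.array([0.01, 0.0, 0.002])
T = np.concatenate([L[:200] @ a_urban, L[200:] @ a_rural])

labels, coefs = regressional_clustering(L, T, k=2)
```

The regression quality of each cluster (here, per-point squared residuals) plays the role that distance-to-centroid plays in ordinary k-means.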

1 Handover - switching a moving user between the GSM cells he traverses.

Another approach to the problem involved training a neural network (a 3-layer perceptron). The landuse values were the network inputs, while the network output was the traffic value. This method generated results similar to those of multiple regression with clustering, but was substantially slower.
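A minimal from-scratch sketch of such a perceptron (one hidden layer, trained by plain batch gradient descent) could look as follows; the layer sizes, learning rate and synthetic data are our assumptions, not the paper's setup:

```python
import numpy as np

# Hypothetical sketch of the second approach: a small 3-layer perceptron
# (input, one tanh hidden layer, linear output) mapping the 9 landuse
# values of a cell to its traffic.
rng = np.random.default_rng(2)
L = rng.uniform(0, 1, size=(300, 9))          # scaled landuse inputs
T = L @ rng.uniform(0, 1, 9)                  # synthetic traffic target

W1 = rng.normal(0, 0.5, (9, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, (16, 1)); b2 = np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, (h @ W2 + b2).ravel()

lr = 0.1
for _ in range(2000):                         # plain batch gradient descent
    h, pred = forward(L)
    err = (pred - T)[:, None] / len(T)        # gradient of MSE/2 w.r.t. pred
    dh = (err @ W2.T) * (1 - h ** 2)          # backprop through tanh
    W2 -= lr * h.T @ err
    b2 -= lr * err.sum(0)
    W1 -= lr * L.T @ dh
    b1 -= lr * dh.sum(0)

_, pred = forward(L)
mse = np.mean((pred - T) ** 2)
```

Training the coefficients iteratively like this is what makes the neural approach substantially slower than a closed-form least-squares fit.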

has the following meaning: if, in the interval of one hour, events A, B, C occur, then events D and F also occur in this hour.
Each rule is associated with two coefficients: support and confidence. Support describes the percentage of transactions in the database in which this particular set of events occurs; we are interested only in rules whose support is above a certain (user-specified) level. Confidence, on the other hand, gives a measure of how good (or "strong") the rule is.
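Support and confidence for such hourly transactions can be computed directly; the toy event log below is invented for illustration:

```python
# Hypothetical illustration of support and confidence: each "transaction"
# is the set of event types observed in one hour.
hours = [
    {"A", "B", "C", "D", "F"},
    {"A", "B", "C", "D"},
    {"A", "B", "C", "D", "F"},
    {"B", "C"},
    {"A", "B", "C", "F"},
]

def support(itemset):
    # fraction of hourly transactions containing the whole itemset
    return sum(itemset <= h for h in hours) / len(hours)

def confidence(body, head):
    # of the hours containing the body, how many also contain the head
    return support(body | head) / support(body)

body, head = {"A", "B", "C"}, {"D", "F"}
print(support(body | head))       # -> 0.4 (2 of 5 hours)
print(confidence(body, head))     # -> 0.5 (2 of the 4 hours with A,B,C)
```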

The research was supported by strong feedback from the Network Planning engineers. Already in the initial phase, the Data Mining team discovered some strange rules in the classification of landuses. Our experts interpreted them as misclassifications of roads and of some kinds of residential areas, which were quite important for traffic prediction.

Obviously, a much better approach is to use so-called sliding time window analysis (see e.g. [5, 6]), since it avoids the errors introduced by fixed time intervals. To the best of our knowledge, no commercial system that can generate such rules is available. This is particularly inconvenient, because the experiments proved that using a time window of a fixed length can result in omitting many important rules while, on the other hand, generating some rules that are not significant.
A separate project, inspired by the above observations, has now been started at the Warsaw University of Technology in order to create specialized software addressing the aforementioned problem; a first prototype is expected in summer 2000.
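A sliding-window analysis over a raw (time, event) log can be sketched as follows; the log, window length and rule shape are illustrative assumptions of ours:

```python
# Hypothetical sketch of sliding-window analysis over a (time, event) log:
# instead of fixed hourly buckets, slide a window over the timestamps and
# count how often an event of type B follows an event of type A within
# `window` seconds.
log = [(0, "A"), (5, "B"), (60, "C"), (61, "A"), (70, "B"), (300, "A")]

def follows_within(log, first, second, window):
    hits, total = 0, 0
    for i, (t, e) in enumerate(log):
        if e != first:
            continue
        total += 1
        if any(e2 == second and 0 < t2 - t <= window for t2, e2 in log[i + 1:]):
            hits += 1
    return hits, total

hits, total = follows_within(log, "A", "B", window=30)
print(hits, total)  # -> 2 3: two of the three A events are followed by B
```

Because the window is anchored at each event rather than at fixed hour boundaries, an A at the end of one hour and a B at the start of the next are still counted as a sequential pair.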

The major difficulty in this problem was caused by the poor quality of the data. The information about landuses proved to be neither very accurate nor up-to-date. The purity of decision trees based on the landuse data was below 70%, and this determined the final outcome of the regression approach. The mean square error in both models was approximately 40% of the average traffic. Therefore, we decided to include other attributes, not normally used in GSM network planning. For example, adding information about the average population and income in the area improved the accuracy up to 80%.

2.2 Time oriented data in a GSM network.


2.3 Network anomalies.

In our work we found that a lot of the data gathered in different places of the technical department had the following format:

The analysis of the behavior of different cells in the network is another problem that we encountered. Each cell in the network is associated with a number of parameters, divided into a few subgroups. The first subgroup contains configuration parameters of the cell; the other attributes are parameters gathered over some period of time, describing the behavior of certain network elements assigned to the cell.

Time1 Event1
Time2 Event2
...
Timen Eventn

For example, the parameter Attempts gives the total number of requests to allocate a channel, and the parameter Blocks gives the number of unsuccessful requests. If Blocks/Attempts reaches a value above 2%, the quality of service is not good enough and network optimization is needed. The ERA experts defined several types of errors indicating cell anomalies, such as channel congestion, blocking and call drops at levels higher than acceptable for a well-designed network.

The aim of data mining within such data is to discover interesting rules of the following form:
If a message of type A is generated, a message of type B will also be generated within a very short time.
The main idea is to identify the types of events that typically occur sequentially. Note that this approach could be successfully used to create efficient rules for expert systems, which might then reduce the number of alarms in GSM networks. Usually a single fault in such a network can generate a lot of (say, thousands of) different messages. Similarly, this approach can be used to analyze the network's behavior based on the SS7 messages stored in a large database.

The team applied association rule discovery to find relationships between the values of such parameters, but the results obtained were not very interesting, as the experts in the field already knew all the discovered rules.
Because the analysis of the standard data was not satisfactory, additional information about cell neighborhoods was used. Two cells are defined to be neighbors if it is possible to hand over a call between them. The task was redefined to finding association rules of the form:

Association rules seem to be a good tool for the above purpose. Time can be divided into intervals of a fixed length (one hour, half an hour, etc.). Each interval forms a transaction in terms of association rule discovery. For instance, a generated rule

(Cellid1,Error_or_<attribute,valueRange>)
...
(Cellidn-1,Error_or_<attribute,valueRange>)
=>(Cellidn,Error_or_<attribute,valueRange>)

[A, B, C] => [D, F]

Unfortunately, the existing data mining tools are not well suited to the needs of radio network optimization specialists. The necessary data pre-processing is a tedious activity, and the post-processing functionality provided by existing data mining tools is not sufficient to filter out the knowledge of interest easily and quickly. In practice, the necessary post-processing had to be done by means of classical query languages (e.g. SQL), which makes mining around rules a relatively slow and unfriendly activity. The rule languages proposed in [7, 8, 9, 10] do not address several pre- and post-processing issues the DM team faced when working on the problem of network anomalies.

where:
- Cellid1, Cellid2, ..., Cellidn-1 are identification numbers of cells in the network that are neighbors of the cell identified by Cellidn;
- Error is an expression from the set of errors predefined by the ERA experts; errors of different types were allowed to occur in one rule;
- <attribute,valueRange> is a value range to which the attribute value of the respective cell belongs; different attributes were allowed to occur in one rule.
The association rules discovered, representing knowledge about the mutual influence of different network elements, were much more significant to the experts. Especially interesting were "one-way" rules, describing situations in which a cell located at one site influenced the behavior of a second cell located at another site, and not the other way round. This potentially allows the identification of cells that are the source of faults in the network.

2.4 In search of too thrifty consumers.

In the final stage of the project, the Data Mining team also performed a quick analysis for the marketing department. The goal was defined as follows: to find the customer profile of subscribers who make calls shorter than 5 seconds. According to the billing schema, such calls are treated as mistakes and are thus free of charge. However, thrifty subscribers made quite a lot of such calls to send short messages (like "call me back") instead of, for example, SMS texts. Some of them were supposed to use 5-second calls for automatic communication purposes, generating hundreds of them each day. Our preliminary analysis showed that these 5-second calls, while not generating any profit for the company, were responsible for 40% of the total network load. The team therefore tried to find a "thrifty customer profile" using attributes such as the type of tariff, age, and so on.

The association rules with parameters belonging to different subgroups were evaluated as more promising than those with parameters belonging to the same subgroup. In particular, they were found to be useful for the specialists working on the optimization of the radio network. In addition to rules that confirmed the experts' knowledge, unknown dependencies were identified. In the opinion of the ERA experts, these can be applied directly to generate intelligent trouble lists for the radio network optimization groups.
It was interesting to observe that many rules with only one condition had high confidence (greater than 90%). The experts found it useful to apply an additional rule parameter, lift, when looking for useful rules. Lift determines how much greater the computed confidence of a rule is than its expected confidence (i.e. the confidence the rule would have if the occurrences of the condition and decision values were statistically independent).
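Lift can be computed from the same support counts as confidence: under independence, the expected confidence of a rule equals the support of its head. The toy transactions below are invented:

```python
# Hypothetical illustration of lift: the confidence of a rule divided by
# the confidence expected if body and head were statistically independent.
transactions = [
    {"congestion", "drops"},
    {"congestion", "drops"},
    {"congestion"},
    {"drops"},
    {"blocking"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def lift(body, head):
    confidence = support(body | head) / support(body)
    return confidence / support(head)   # expected confidence = support(head)

print(lift({"congestion"}, {"drops"}))  # -> ~1.11, slightly above chance
```

A lift of 1 means the rule is no better than chance; values well above 1 indicate a genuinely informative rule even when the raw confidence looks high.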

This problem of customer segmentation [1] is close to the classic examples of data mining analyses. The team therefore adopted the classic approach, utilizing the full scope of DM tools and methodologies: association rules, clustering, decision trees, statistics and neural methods.
Data about the subscribers and their calls was extracted from the company's data warehouse. It contained information about the following attributes:

Further on, the DM team processed the found set of rules in order to identify the essential neighboring cells that influence the unrequired behavior of a faulty cell. A neighbor of a faulty cell, say FC, was treated as essential if its error or attribute occurred in the body of a rule that indicated an error for FC with sufficiently high support, confidence and lift. The results were different for different faulty cells and different rule threshold values, but in general the found set of essential neighbors was a proper subset of the originally defined set of neighbors. In particular, for one faulty cell with 9 predefined neighboring cells, 7 essential neighbors were extracted from the set of association rules with minSup > 70%, minConf > 20%, and lift > 1.2. On the other hand, only 2 essential neighboring cells were extracted when the minSup was increased to 25%.
After checking the robustness of this method on a small subset of cells, we are now extending it to the entire network.
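The filtering of essential neighbors by rule thresholds can be sketched as follows; the rule tuples are invented, and the threshold values merely mirror the ones quoted above:

```python
# Hypothetical sketch of extracting "essential" neighbors of a faulty cell
# FC: keep a neighbor if some rule predicting an error in FC contains that
# neighbor in its body and passes the support/confidence/lift thresholds.
rules = [
    # (body_cells, head_cell, support, confidence, lift)
    ({"N1", "N3"}, "FC", 0.80, 0.35, 1.5),
    ({"N2"},       "FC", 0.90, 0.10, 1.1),   # fails the confidence threshold
    ({"N4"},       "FC", 0.75, 0.25, 1.3),
    ({"N5"},       "XX", 0.95, 0.90, 2.0),   # predicts a different cell
]

def essential_neighbors(rules, faulty, min_sup, min_conf, min_lift):
    keep = set()
    for body, head, sup, conf, lift in rules:
        if head == faulty and sup > min_sup and conf > min_conf and lift > min_lift:
            keep |= body
    return keep

print(sorted(essential_neighbors(rules, "FC", 0.70, 0.20, 1.2)))
# -> ['N1', 'N3', 'N4']
```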

- age;
- gender;
- county of residence;
- tariff plan;
- contract duration;
- total number of calls (during a 2-week period - this gives about 0.6 million subscriber records) in four classes:
  - <0-5> second calls;
  - (5-30> second calls;
  - (30-60> second calls;
  - (60, +inf) second calls.
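The four duration classes above can be assigned with a simple binary search over the class boundaries; this sketch and its sample durations are ours:

```python
import bisect

# Hypothetical sketch of classifying call durations (in seconds) into the
# four classes listed above: [0-5], (5-30], (30-60], (60, inf).
boundaries = [5, 30, 60]
labels = ["0-5s", "5-30s", "30-60s", "60+s"]

def call_class(duration):
    # bisect_left puts a duration equal to a boundary into the lower class,
    # matching the right-closed intervals <0-5>, (5-30>, (30-60>.
    return labels[bisect.bisect_left(boundaries, duration)]

durations = [3, 5, 12, 30, 45, 61, 600]
print([call_class(d) for d in durations])
# -> ['0-5s', '0-5s', '5-30s', '5-30s', '30-60s', '60+s', '60+s']
```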

equipment, so even to decide on its potential usefulness for Data Mining we badly needed the assistance of a telecommunications specialist. The need for a multidisciplinary team therefore became evident (for other practical purposes it is of course also good to have a dedicated person in the company's structure who would deal with the security and interpersonal contact problems of the project).

The final outcome proved that there is no such thing as a typical too thrifty consumer, because all groups of subscribers generated a similar number of these short calls, which was quite an unexpected result. The marketing department learned from this that switching off the free 5-second call feature might have a significant impact on the behavior of all users.
This experiment is still in progress. The analyses are now being applied to data sets gathered over a longer period (6 months). Some interesting relationships in the data have already been observed, such as an apparently smaller tendency to use 5-second calls among women aged between 40 and 50.

During the investigations another problem emerged. Technology specialists from the company, having no previous knowledge of automatic knowledge discovery methodology, had problems imagining what was possible with Data Mining and how these techniques might help them. Some of them even expressed fear that automatic (and intelligent) data processing tools might make their jobs redundant.

3. Management perspective.

We decided that it was easier to show the potential of Data Mining to the telecommunications specialists than to learn GSM technology ourselves. In a series of five seminars, open to all the company's employees, the most important Data Mining methods, such as association rule discovery, clustering, classification and statistics, were presented. An additional bonus of this activity was building awareness of the project throughout the company. This resulted, among other things, in interest from other departments (such as marketing), which finally created an opportunity to study customer-behavior related problems at the end of the first phase of the project.

In the world of management, Data Mining is a concept often discussed and "referred to", yet rarely implemented. Data Mining is now such a buzzword as Data Warehouse was only five years ago. For many companies, the reason to "go DM" is the fact that their competitors use Data Mining, or claim to do so.
In this project we were not driven by the Data Mining fashion. The commissioner wanted the Data Mining team to be creative and pro-active in identifying the areas that might be subject to data mining, as well as in the data mining itself. The Data Mining team had an entire technical department of the company to work with and, surprisingly enough, no specific and urgent business goal was defined. Therefore, the first and vital step was to identify the business processes and information sources available throughout the department that were amenable to Data Mining.

The final seminars involved presentations of commercial tools, but before we were able to exhibit them we had to perform an extensive evaluation of the Data Mining packages available on the market. During this analysis it turned out that not all software surveys can be trusted, as several well-respected software tools proved not to be as useful as promised [12].

Such an analysis differs significantly from a conventional business process analysis (used, for example, for reengineering purposes), in which analysts describe an entire organization as a system of cooperating processes, pinpoint their objectives, and then redesign the structure in order to maximize performance. Our approach was oriented more towards the data sources, keeping in mind that they are parts of certain business processes.

The problems analyzed by our team have already been described in Section 2. However, we would like to point out two very important aspects that seemed to significantly influence our findings.
The first of them is the apparent lack of methods that can cope with raw numerical data. Such data is quite rare in traditional Data Mining, but in technological problems it represents the majority of all the information available. One of the most popular ways of dealing with this is discretization, but it turns out that without good insight into the very nature of the data this is difficult to perform, even with quite complex methods [13]. Manual discretization, with thresholds proposed by domain specialists, proved to be the most effective, but for certain data types it may be a very laborious process. Numerical mining (such as quantitative association rules) therefore seems to be a very interesting and mostly unexplored research area.

The team was trying to find those places in the company structure where large amounts of possibly important data were generated and dumped without further analysis, because traditional manual or OLAP methods simply could not cope with the complexity or size of the data [11]. These "hot spots" did not necessarily have to be crucial from the point of view of the entire company's efficiency (which is the biggest difference from a conventional business process analysis). Several such processes have already been described in Section 2.
During the analysis it turned out that it is practically impossible to complete it without the help of a company insider. Whereas in most Data Mining projects the data is fairly straightforward and intuitive (such as sales figures or customer data), this was not the case in our experiments. We have been dealing mostly with output from telemetry

The presentation of results also proved to be a very important step of the whole Data Mining process. First of all, it is very difficult to evaluate the quality of the mined knowledge. In several of our experiments the extracted

[4]
Piotr Gawrysiak, Michał Okoniewski, "Applying Data Mining Methods for Cellular Radio Network Planning", submitted to IIS'2000 conference, 2000

rules, which seemed at first to be quite interesting, proved to be well known to the domain specialists and therefore of no great importance (albeit their presence in the extracted knowledge proved the reliability of the Data Mining methods used). Frequent evaluation of the results is therefore necessary, and an in-house expert should proofread the final report before it is presented to the company's authorities.

[5]
H. Mannila, H. Toivonen, and A. I. Verkamo.
Discovering Frequent Episodes in Sequences, First
International Conference on Knowledge Discovery and
Data Mining (KDD'95), 210-215, Montreal, Canada,
August 1995. AAAI Press

Because the technical problems are not very intuitive, the final report should also contain a description of the problem, and the presentation definitely must be understandable to non-specialists. It turned out that even in a field as narrow as GSM network planning and maintenance, the management staff does not necessarily have the expertise needed to evaluate results from departments other than their own.

[6]
H. Mannila and H. Toivonen, Discovering
generalized episodes using minimal occurrences, Second
International Conference on Knowledge Discovery and
Data Mining (KDD'96), 146-151, Portland, Oregon,
August 1996. AAAI Press
[7]
T. Imielinski., A. Virmani., Abdulghani, A.,
Discover Board Application Programming Interface and
Query Language for Database Mining, In Proc. of
KDD 96, Portland Ore., August 1996, pp. 20-26.

Finally, we would like to give some words of warning to all those just starting their Data Mining projects. First, it is very difficult to evaluate the results of a whole project and to determine whether the project has been a success or not.

[8]
T. Imielinski, H. Mannila., A Database
Perspective on Knowledge Discovery, Communications
of the ACM, November 1996 - Vol. 39, No 11.

The second thing - and it is a fact that is rarely remembered - is that Data Mining may serve as an analysis tool for external data as well as internal data. Concentrating only on internal data (i.e. data generated within the company) may be very dangerous, especially for organizations operating in dynamically changing environments.

[9]
R. Meo, G. Psaila, S. Ceri, A New SQL-like Operator for Mining Association Rules, Proc. of the 22nd VLDB Conference, Mumbai (Bombay), India, 1996.

We must also remember that Data Mining actions do not contribute directly to the company's value chain - they only provide information. This information may be used wisely, thereby increasing the company's competitive advantage, but it may also be discarded and therefore wasted. In short, successful Data Mining gives companies an opportunity to act more wisely on the market, but it is up to the managerial staff to make use of this opportunity.

[10]
T. Morzy, M. Zakrzewicz, SQL-like Language
for Data Mining, 1st International Conference on
Advances in Databases and Information Systems, St.
Petersburg, 1997
[11]
Rob Mattison, Data Warehousing and Data
Mining for Telecommunications - Artech House
Computer Science Library 1997
[12]
DM tools manuals and reference materials: IBM
Intelligent Miner, SGI Mine Set, SAS Enterprise Miner,
Rosetta, RD2, Oracle Darwin

4. Concluding remarks.

[13]
Andrzej Skowron, Son H. Nguyen, "Quantization
of Real Value Attributes", Warsaw University of
Technology Report, 1995

This project, while still not finished, has already proved that mining technological data creates several new problems not experienced by "conventional" data miners. The nature of the data and of the business processes makes the whole analysis a much more demanding and also delicate task. Fortunately, the solutions that can be found by Data Mining in technical departments seem to be quite effective, as they contribute directly to the efficiency of the existing processes and systems. This is not always true in the case of marketing or strategy building.

5. References
[1]
Michael J. A. Berry, Gordon Linoff, "Data
Mining Techniques: For Marketing, Sales, and Customer
Support", John Wiley & Sons 1997
[2]
W. Schmidt, private communication, 21/01/2000
[3]
Piotr Gawrysiak, Michał Okoniewski, Henryk Rybiński, "Regression - yet another clustering method", submitted to DEXA'2000 conference, 2000
