Bhavani Thuraisingham
The MITRE Corporation
Burlington Road, Bedford, MA
On leave at the National Science Foundation, Arlington, VA
Abstract:
Data mining is becoming a useful tool for detecting and preventing terrorism. This
paper first discusses some technical challenges for data mining as applied to
counter-terrorism applications. Next it provides an overview of the various types of
terrorist threats and describes how data mining techniques could provide solutions for
counter-terrorism. Finally, it discusses some privacy concerns and potential solutions
that could detect terrorist activities while still attempting to maintain privacy.
3.1 Introduction
Data mining is the process of posing queries and extracting useful patterns or trends
often previously unknown from large amounts of data using various techniques such
as those from pattern recognition and machine learning. There have been several de-
velopments in data mining and the technology is being used for a wide variety of ap-
192 C HAPTER T HREE
Other directions include graph and pattern mining. For example, one has to connect
all the dots. Essentially one builds a graph structure based on the information he or she
has. If multiple agencies are working on the problem, then each agency will have its
own graph. The challenge is to be able to make inferences about missing nodes and
links in the graph. Also the graph could be very large. The question is how can one
reduce the graph to a more manageable size?
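As a minimal sketch of the idea above, assume each agency holds an undirected graph stored as a map from a node to its set of neighbours. The graphs are merged and unlinked pairs are ranked by the number of neighbours they share, which is one simple heuristic for suggesting missing links. The node names, the function names and the `min_shared` cut-off are illustrative assumptions, not part of any particular tool.

```python
from itertools import combinations

def merge_graphs(*graphs):
    """Union several adjacency maps (node -> set of neighbours) into one."""
    merged = {}
    for graph in graphs:
        for node, neighbours in graph.items():
            merged.setdefault(node, set()).update(neighbours)
            for other in neighbours:
                # Keep the graph undirected: record the reverse link too.
                merged.setdefault(other, set()).add(node)
    return merged

def candidate_links(graph, min_shared=2):
    """Rank unlinked node pairs by how many neighbours they share."""
    scored = []
    for a, b in combinations(sorted(graph), 2):
        if b in graph[a]:
            continue  # already linked, nothing to infer
        shared = len(graph[a] & graph[b])
        if shared >= min_shared:
            scored.append((shared, a, b))
    return sorted(scored, reverse=True)

# Hypothetical data: each agency sees only part of the picture.
agency_a = {"P1": {"P2", "P3"}, "P4": {"P2"}}
agency_b = {"P4": {"P3"}, "P5": {"P1"}}

combined = merge_graphs(agency_a, agency_b)
suggestions = candidate_links(combined)
print(suggestions)
```

Neither agency alone could suggest a link between P1 and P4; it only appears once the two graphs are combined. A suggested link is a lead for an analyst to check, not an established fact.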
Finally finding the data to test the ideas is still a major challenge. How can we
get unclassified data? Is it possible to scrub and clean the classified data and produce
reasonable data at the unclassified level? How can we find large data sets consisting
of multimedia data types? Is it possible to develop a test-bed where one can apply the
various data mining tools to determine their efficiency?
Web mining is a challenge for detecting unusual patterns. In a way web mining
encompasses data mining as one has to mine all the data on the web as well as mine
the structure and usage patterns. By mining the usage patterns one could find patterns
such as an unusual number of visits to a federal web site from Paris around
3 a.m. Data on the web includes structured as well as unstructured data.
Therefore the tools developed for data mining apply to web mining also. In addition,
we need tools to mine the structure of the web as well as the usage patterns.
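The usage-pattern example can be sketched as a comparison of today's visit counts against a historical baseline. The sketch below assumes visits have already been reduced to (origin, hour) pairs and that a list of past daily counts exists for each pair; the z-score test and the threshold of 3 are illustrative choices, not a recommended detector.

```python
from collections import Counter
from statistics import mean, pstdev

def unusual_buckets(history, today, threshold=3.0):
    """Flag (origin, hour) buckets whose count today is far above their history.

    history: dict mapping (origin, hour) -> list of past daily visit counts
    today:   iterable of (origin, hour) pairs, one per visit seen today
    """
    counts = Counter(today)
    flagged = []
    for bucket, count in counts.items():
        past = history.get(bucket, [0])
        mu = mean(past)
        sigma = pstdev(past) or 1.0  # avoid dividing by zero on flat history
        z = (count - mu) / sigma
        if z > threshold:
            flagged.append((bucket, count, round(z, 1)))
    return flagged

# Hypothetical baseline and log for one day.
history = {("Paris", 3): [2, 1, 3, 2], ("Boston", 14): [40, 38, 42, 40]}
today = [("Paris", 3)] * 25 + [("Boston", 14)] * 41

flagged = unusual_buckets(history, today)
print(flagged)
```

Only the Paris bucket is flagged here: 25 visits is far outside its usual range, while 41 visits from Boston is ordinary even though the absolute number is larger.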
Privacy is a major challenge with respect to data mining for counter-terrorism. The
challenge is to extract useful information from data mining but at the same time main-
tain privacy. Several efforts are under way for privacy preserving data mining. The
idea here is to use various techniques such as randomization, cover stories, as well
as multi-party policy enforcement for privacy preserving data mining. While there is
some progress, the effectiveness of these techniques needs to be determined.
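To illustrate the randomization idea, the following sketch adds zero-mean noise to each individual value before release. Individual records are distorted, but an aggregate such as the mean can still be estimated. The data, the uniform noise and the `spread` parameter are assumptions made for illustration; this is not a complete privacy-preserving scheme and it says nothing about how much protection a given amount of noise provides.

```python
import random
from statistics import mean

def randomize(values, spread, rng):
    """Add zero-mean uniform noise in [-spread, spread] to every value."""
    return [v + rng.uniform(-spread, spread) for v in values]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
true_ages = [rng.randint(20, 60) for _ in range(10_000)]  # hypothetical data
released = randomize(true_ages, spread=15, rng=rng)

# The miner sees only `released`, yet the aggregate stays close to the truth.
print(round(mean(true_ages), 1), round(mean(released), 1))
```

Whether such noise is enough to protect individuals while keeping the mined patterns useful is exactly the effectiveness question raised above.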
The above are some of the challenges for data mining for counter-terrorism dis-
cussed at the workshop. That is, while data mining could become a useful tool for
counter-terrorism, there are many challenges that need to be addressed. They include
mining multimedia data, graph mining, building models in real-time, knowledge di-
rected data mining to eliminate false positives and false negatives, web mining, and
privacy sensitive data mining. Research is progressing in the right direction. However,
there is still much to be done (see also [14]).
Now that we have provided an overview of the challenges on data mining for
counter-terrorism, in the next three sections we will provide some more details on this
topic. To understand how data mining may be applied, we need a good understanding
of what the threats are. In section 3.3 we will provide an overview of various threats
and protection measures. In section 3.4 we will examine how data mining could pro-
vide potential counter-terrorism solutions, especially for the threats discussed in section
3.3. Because of the importance of privacy and the potential threats to privacy due to data
mining, we will discuss various privacy issues in Section 3.5.
experiences.
Human errors are also a source of major concern. We need to continually train
people such as operators and advise them to be cautious and alert. We need to take
proper action if humans have been careless. That is, unless there is a very good
excuse, human errors should not be treated lightly. This way, humans will be cautious
and perhaps not make such errors.
Terrorist attacks are quite different. The problem is, one does not know when an attack
will happen and how it will happen. Many of us could never have imagined that airplanes
would be used as weapons of mass destruction to bring the famous World Trade Center
towers down. Many of us still may not know what the next attack may be. Would it
be an attack caused by suicide bombers, by chemical weapons, or by cyber
terrorism? The counter-measures for
prevention and detection may be quite intense for terrorist attacks. As we have stated,
we are not experts on counter-terrorism, nor have we studied the nature of the attacks. Our
goal is to examine the various data mining techniques to see how they could be applied
to handle the various threats that have been discussed almost daily in the newspapers
and on television.
It should however be noted that to develop effective techniques, the data mining
specialists have to work together with counter-terrorism experts. That is, one cannot
use the techniques without a good understanding of what the threats are. Therefore,
while the contents of this paper may be used as a reference, I would urge those in-
terested in applying data mining techniques to solve real world problems and terrorist
attacks to work with counter-terrorism specialists. In the next few sections we will
discuss various types of terrorism and counter-terrorism measures.
terrorism and vandalism. Then later on we heard of airplane hijackings where a group
of terrorists hijack airplanes and then make demands on governments such as releasing
political prisoners who could possibly be terrorists. Then we heard of suicide bomb-
ings where terrorists carry bombs and blow themselves up as well as others nearby.
Such attacks usually occur in crowded places. More recently we have heard of using
airplanes to blow up buildings.
While the above acts are all terrorist attacks, we hear almost daily about someone
shooting and killing someone else when neither party belongs to any gangs or terrorist
groups. This in a way is terrorism also, but these acts are more difficult to detect and
prevent because there are always what are called “crazy people” in our society. While
the technologies should detect and prevent such attacks also, this paper focuses
on how to detect attacks from people belonging to terrorist groups.
All of the threats we have discussed above are sort of external threats. These are
threats occurring from the outside. In general, the terrorists are usually neither friends
nor acquaintances of the victims involved. But there are also other kinds of threats and
they are insider threats. We will discuss them in the next section.
entering a country. We are not saying that illegal immigrants are dangerous or are
terrorists. They may be very decent people. However, they have entered a country
without the proper papers and that could be a major issue. For official immigration into
say the USA, one needs to go through interviews at US embassies, go through medical
checkups and X-rays as well as checks for diseases such as tuberculosis, background
checks and many more things. It does not mean that people who have entered a country
legally are always innocent. They could be terrorists also. At least there is some
assurance that proper procedures have been followed. Illegal immigration can also
cause problems for the economy of a society and violate human rights through cheap
illegal labor etc.
As we have stated, drug trafficking has occurred a lot at borders. Drugs are a danger
to society. They could cripple a nation, corrupt its children, cause havoc in families,
damage the education system and cause extensive harm. It is therefore critical
that we protect the borders from drug trafficking as well as other types of trafficking
including firearms and human slaves. Other threats at borders include prostitution and
child pornography, which are serious threats to decent living. It does not mean that ev-
erything is safe inside the country and these problems are only at borders. Nevertheless
we have to protect our borders so that there are no additional problems to a nation.
Transportation systems security violations can also cause serious problems. Buses,
trains and airplanes are vehicles that can carry tens or hundreds of people at the same
time and any security violation could cause serious damage and even deaths. A bomb
exploding in an airplane or a train or a bus could be devastating. Transportation systems
are also the means for terrorists to escape once they have committed crimes. Therefore
transportation systems have to be secure. A key aspect of transportation systems secu-
rity is port security. These ports are responsible for ships of the United States Navy.
Since these ships are at sea throughout the world, terrorists may have opportunities to
attack these ships and the cargo. Therefore, we need security measures to protect the
ports, cargo, and our military bases. In Section 3.3.7 we will discuss various counter-
terrorism measures for the threats we have discussed here. The next three sections will
discuss additional types of terrorism.
various threats and solutions is the one by Ghosh [10]. We also discuss some of the
cyber threats and countermeasures in [11].
Cyber-terrorism is one of the major terrorist threats posed to our nation today. As we
have mentioned earlier, there is now so much information available electronically
and on the web. Attacks on our computers as well as networks, databases and the
Internet could be devastating to businesses. It is estimated that cyber-terrorism could
cost businesses billions of dollars. For example, consider a banking information
system. If terrorists attack such a system and deplete accounts of the funds, then the
bank could lose millions and perhaps billions of dollars. By crippling the computer
system millions of hours of productivity could be lost and that equates to money in
the end. Even a simple power outage at work through some accident could cause
several hours of productivity loss and as a result a major financial loss. Therefore it is
critical that our information systems be secure. Next we discuss various types of cyber
terrorist attacks. One is spreading viruses and Trojan horses that can wipe away files
and other important documents. Another is intruding into computer networks, which we
will discuss in the next section. Information security violations such as access control
violations, as well as various other threats such as sabotage and denial
of service, will be discussed later.
Note that threats can occur from outside or from the inside of an organization. Outside
attacks are attacks on computers from someone outside the organization. We hear
of hackers breaking into computer systems and causing havoc within an organization.
There are hackers who start spreading viruses and these viruses cause great damage to
the files in various computer systems. But a more sinister problem is the insider threat.
Just like non-information related attacks, there is the insider threat with information
related attacks. There are people inside an organization who have studied the business
practices and develop schemes to cripple the organization’s information assets. These
people could be regular employees or even those working at computer centers. The
problem is quite serious as someone may be masquerading as someone else and causing
all kinds of damage. In the next few sections we will examine how data mining
could detect and perhaps prevent such attacks.
posing as legitimate users can pose queries such as SQL queries and access the data
that they are not authorized to know.
Essentially cyber terrorism includes malicious intrusions as well as sabotage through
malicious intrusions or otherwise. Cyber security consists of security mechanisms that
attempt to provide solutions to cyber attacks or cyber terrorism. When we discuss ma-
licious intrusions or cyber attacks, we need to think about the non cyber world, that is
non information related terrorism and then translate those attacks to attacks on com-
puters and networks. For example, a thief could enter a building through a trap door. In
the same way, a computer intruder could enter the computer or network through some
sort of a trap door that has been intentionally built by a malicious insider and left unat-
tended through perhaps careless design. Another example is a thief entering the bank
with a mask and stealing the money. The analogy here is an intruder masquerading as
someone else, legitimately entering the system and taking all the information assets.
Money in the real world would translate to information assets in the cyber world. That
is, there are many parallels between non-information related attacks and information
related attacks. We can proceed to develop counter-measures for both types of attacks.
These counter-measures are discussed in Section 3.3.8.
As mentioned in section 3.3.4.1, there are numerous security attacks that can occur
due to the web. We discuss some of the web security threats in this section. As we
have mentioned, in his book Ghosh [10] has provided an excellent introduction to web
security and various threats. Note that while we have focused on web threats in this
section, the threats discussed are applicable to any information system such as net-
works, databases and operating systems. The threats include access control violations,
integrity violations, sabotage, fraud, denial of service and infrastructure attacks.
For example, the traditional access control violations could be extended to the web.
Users may access unauthorized data across the web. Note that with the web there is
so much data all over the place that controlling access to this data will be quite a
challenge. Data on the web may be subject to unauthorized modifications. This makes
it easier to corrupt the data. Also, data could originate from anywhere and the producers
of the data may not be trustworthy. Incorrect data could cause serious damage such
as incorrect bank accounts, which could result in incorrect transactions. We hear of
hackers breaking into systems and posting inappropriate messages. With so much
business and commerce being carried out on the web without proper controls, Internet
fraud could cause businesses to lose millions of dollars. Intruders could obtain the
identity of legitimate users and through masquerading may empty the bank accounts.
We hear about infrastructures being brought down by hackers. Infrastructures could be
the telecommunication system, power system, and the heating system. These systems
are being controlled by computers and often through the Internet. Such attacks would
cause denials of service.
Other threats include violations of confidentiality, authenticity, and non-repudiation.
Confidentiality violations enable intruders to listen in on the message. Authentication
violations include using passwords without permission, and non-repudiation violations
enable someone to deny that he sent the message. The web threats discussed
here occur because of insecure clients, servers and networks. To have complete
security, one needs end-to-end security; that means secure clients, secure servers, secure
operating systems, secure databases, secure middleware and secure networks.
The threats that we have discussed so far can be grouped into two categories: non real-time
threats and real-time threats. In a way all threats are real-time as we have to act
in real-time once the threats have occurred. However, some threats are analyzed over
a period of time while some others have to be handled immediately. We discuss the
various threats here.
Consider for example the biological, chemical and nuclear threats. These threats
have to be handled in real-time. That is, the responses to these threats have timing
constraints. If the smallpox virus is being spread maliciously, then we have to start vaccinations
immediately. Similarly, if networks for, say, critical infrastructures are being attacked,
the response has to be immediate. Otherwise we could lose millions of lives and/or
millions of dollars.
There are some other threats that do not have to be handled in real-time. For ex-
ample consider the behavior of suspicious people such as those belonging to a certain
terrorist organization or those enrolling in flight training schools. In a way these people
are also planning attacks but sometimes even they are not sure when they will attack.
Therefore, one has to monitor these people, analyze their behavior and predict their
actions. While there are timing constraints for these threats, the urgency is not as great
as say the spread of the smallpox virus. But one should be vigilant about these non
real-time threats also.
In general there is no way to say that A is a real-time threat and B is a non real-time
threat. A non real-time threat could turn into a real-time threat. For example, once
the terrorists had hijacked the airplanes on September 11, 2001, the threat became a
real-time threat as action had to be taken within say an hour.
Now that we have provided some discussion on various types of terrorist attacks includ-
ing non-information related terrorism, information related terrorism, bio-terrorism, etc.
we will discuss what counter-terrorism is all about. Counter-terrorism is a collection of
techniques used to combat, prevent, and detect terrorism. Our goal in this paper is to
examine various data mining techniques to see how we can combat terrorism using
these techniques. In this section we will briefly discuss what counter-terrorism is all
about for the terrorist attacks discussed in the previous sections.
In Section 3.3.8.2 we discuss protecting from non-information related terrorism. In
section 3.3.8.3 we discuss protecting from information related terrorism. In particular,
we discuss various web security measures as well as other aspects such as intrusion
detection and access control, briefly. In section 3.3.8.4 we discuss protecting from
bio-terrorism and chemical attacks and nuclear attacks. In section 3.3.8.5 we discuss
protecting the critical infrastructures. We analyze counter-terrorism measures for non
real-time threats as well as for real-time threats in Section 3.3.8.6.
General Discussion
We will first provide an overview of counter-terrorism with respect to information
related terrorism. We will give special consideration to security solutions for the web
later on. Essentially, protecting against information related terrorism involves
detecting and preventing malicious attacks and intrusions. These attacks could be attacks
due to viruses or spoofing or masquerading and stealing, say, information assets. These
attacks could also be attacks on databases and malicious corruption of data. That is,
terrorist attacks are not necessarily stealing and accessing unauthorized information.
They could also include malicious corruption and alteration of the data so that the data
will be of little or no use. Terrorist attacks also include credit card frauds and identity
thefts.
Various data mining techniques are being proposed for detecting intrusions as well
as credit card fraud. We will discuss them in later sections. Preventing malicious
attacks is more challenging. We need to design systems in such a way that malicious
attacks and intrusions are prevented. When an intruder attempts to attack the system,
the system would figure this out and alert the security officer. There is research being
carried out on secure systems design so that such intrusions are prevented. However
there is more focus on detecting such intrusions than prevention.
Enforcing appropriate access control techniques is also a way to enforce security.
For example, users may have certificates to access the information they need to carry
out the jobs that they are assigned to do. The organization should give the users no
more and no fewer privileges than those jobs require. There is much research on managing privileges and access
rights to various types of systems.
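The principle that users should hold exactly the privileges their jobs require can be sketched as a comparison between what a user has been granted and what the assigned roles need. The role names, permission strings and function names below are hypothetical; real systems would draw them from a policy store.

```python
# Hypothetical job roles and the permissions each one needs.
ROLE_PERMISSIONS = {
    "analyst": {"read:reports"},
    "auditor": {"read:reports", "read:audit_log"},
}

def needed_permissions(roles):
    """Collect every permission required by the given roles."""
    needed = set()
    for role in roles:
        needed |= ROLE_PERMISSIONS.get(role, set())
    return needed

def privilege_gaps(granted, roles):
    """Report privileges held beyond the roles' needs and privileges still missing."""
    needed = needed_permissions(roles)
    return {"excess": granted - needed, "missing": needed - granted}

report = privilege_gaps({"read:reports", "write:schema"}, ["auditor"])
print(report)
```

Here the user holds one privilege the role does not call for and lacks one that it does, so both sides of the rule are violated.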
We have briefly discussed cyber security measures. We will discuss security solu-
tions for the web in more detail next. Note that there are also additional problems such
as the inference problem where users pose sets of queries and infer sensitive informa-
tion. This is also an attack. We will visit the inference problem later when we discuss
privacy.
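A small, assumed example shows why the inference problem counts as an attack. The query interface below refuses to answer about a single person, yet two permitted aggregate queries reveal one individual's value by subtraction. The names and salary figures are invented for illustration.

```python
salaries = {"alice": 70, "bob": 82, "carol": 95, "dave": 64}  # hypothetical data

def total_salary(names):
    """An aggregate query that refuses to answer about a single person."""
    names = set(names)
    if len(names) < 2:
        raise PermissionError("query set too small")
    return sum(salaries[n] for n in names)

# Each query on its own is allowed...
everyone = total_salary(salaries)
without_carol = total_salary(n for n in salaries if n != "carol")

# ...but their difference is exactly the value the rule was meant to hide.
inferred = everyone - without_carol
print(inferred)
```

Checking each query in isolation is therefore not sufficient; a defence has to reason about what a set of answers reveals together.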
Security Solutions for the Web
We need end-to-end security and therefore the components include secure clients,
secure servers, secure databases, secure operating systems, secure infrastructures,
secure networks, secure transactions and secure protocols. One needs good encryption
mechanisms to ensure that the sender and receiver communicate securely. Ultimately
whether it be exchanging messages or carrying out transactions, the communication
between sender and receiver or the buyer and the seller has to be secure. Secure client
solutions include securing the browser, securing the Java virtual machine, securing
Java applets, and incorporating various security features into languages such as Java.
Note that Java is not the only component that has to be secure. Microsoft has come up
with a collection of products including ActiveX and these products have to be secure
also. Securing the protocols includes secure HTTP and the Secure Sockets Layer. Securing the
web server means the server has to be installed securely and it has to be ensured
that the server cannot be attacked. Various mechanisms that have been used to secure
operating systems and databases may be applied here. Notable among them are access
control lists, which specify which users have access to which web pages and data. The
web servers may be connected to databases at the backend and these databases have
to be secure. Finally various encryption algorithms are being implemented for the net-
works and groups such as OMG (Object Management Group) are envisaging security
for middleware such as ORB (Object Request Brokers).
One of the challenges faced by the web managers is implementing security policies.
One may have policies for clients, servers, networks, middleware, and databases. The
question is how do you integrate these policies? That is how do you make these policies
work together? Who is responsible for implementing these policies? Is there a global
administrator or are there several administrators that have to work together? Security
policy integration is an area that is being examined by researchers.
Finally, one of the emerging technologies for ensuring that an organization’s assets
are protected is firewalls. Various organizations now have web infrastructures for in-
ternal and external use. To access the external infrastructure one has to go through the
firewall. These firewalls examine the information that comes into and out of an orga-
nization. This way, the internal assets are protected and inappropriate information may
be prevented from coming into an organization. We can expect sophisticated firewalls
to be developed in the future. Other security mechanisms include cryptography.
infrastructure could cripple businesses and the country. We need to determine the mea-
sures to be taken when the infrastructures are attacked.
Essentially the counter-measures include those developed for non information-
based terrorism as well as for information-based terrorism. For example one could
bomb the telecommunication lines or create viruses that would affect the telecom-
munications software. This means that communication through telephones as well
as computer communications that occur through phone lines could be crippled. The
counter-measures developed for non information related terrorism as well as for information
related terrorism could be applied here. We need to gather information about the
terrorist groups and extract patterns. We also need to detect any unauthorized intru-
sions. Our ultimate goal is to prevent such disastrous acts.
Even biological, chemical and nuclear weapons could attack the infrastructure of
the nation. For example our food supplies, water supplies and hospitals could be dam-
aged by biological warfare. Here again we need to examine the counter-terrorism mea-
sures for biological, chemical and nuclear attacks and apply them here.
In this section we will provide a high level overview of how web data mining as well
as data mining could help toward counter-terrorism. Note that we have used web data
mining and data mining sort of interchangeably as our definition of web data mining
goes beyond just mining structured data. We have included mining unstructured data,
mining for business intelligence, web usage mining and web structure mining as part
of web data mining. That is, in a way web data mining encompasses data mining.
As we have stated data mining could contribute towards counter-terrorism. We are
not saying that data mining will solve all our national security problems. However
the ability to extract hidden patterns and trends from large quantities of data is very
important for detecting and preventing terrorist attacks.
The organization of this section is as follows. Section 3.4.2 provides an overview
of web data mining for counter-terrorism. We will analyze the techniques in Section
3.4.3. A particular technique, called link analysis, that may be very important for
counter-terrorism applications will be given more consideration in Section 3.4.4. The
section is summarized in section ??.
In Section 3.3 we grouped threats in different ways. One grouping was whether they were
information related or non-information related. It was somewhat artificial, as
we need information for all types of threats. However in our terminology, information
related threats were threats dealing with computers; some of these threats were real-
time threats while some others were non real-time threats. Even here the grouping
was somewhat arbitrary, as a non real-time threat could become a real-time threat. For
example, one could suspect that a group of terrorists will eventually perform some act
of terrorism. However when we set time bounds such as a threat will likely occur say
before July 1, 2003, then it becomes a real-time threat and we have to take actions
immediately. If the time bounds are tighter such as a threat will occur within two days
then we cannot afford to make any mistakes in our response.
The purpose of this section is to examine both the non real-time threats and real-
time threats and see how data mining in general and web data mining in particular could
handle such threats. Again we want to stress that web data mining in our terminology
encompasses data mining as it deals with data mining on the web as well as mining
structured and unstructured data. Furthermore, we are assuming that much of the data
will be on the web whether they be public networks such as the Internet or private
networks such as corporate intranets. Therefore, we are using the terms data mining
and web data mining interchangeably. In section 3.4.2.2 we discuss non real-time
threats and in section 3.4.2.3 we discuss real-time threats. We will refer to the specific
examples that we have mentioned in the previous section in our discussions as needed.
Section 3.4.3 will examine the various data mining outcomes and techniques and see
how they can help toward counter-terrorism. Some very good articles on data mining
for counter-terrorism have been presented at the Security Informatics Workshop held
in June 2003 (see [6]).
T HURAISINGHAM 209
Non real-time threats are threats that do not have to be handled in real-time. That is,
there are no timing constraints for these threats. For example, we may need to collect
data over months, analyze the data and then detect and/or prevent some terrorist attack,
which may or may not occur. The question is how does data mining help towards such
threats and attacks? As we have stressed in [14], we need good data to carry out data
mining and obtain useful results. We also need to reason with incomplete data. This is
the big challenge, as organizations are often not prepared to share the data. This means
that the data mining tools have to make assumptions about the data belonging to other
organizations. The other alternative is to carry out federated data mining under some
federated administrator. For example, the Homeland security department could serve
as the federated administrator and ensure that the various agencies have autonomy but
at the same time collaborate when needed.
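One way to picture federated data mining is that each organization summarizes its own data and only the summaries are combined. The sketch below is an assumption about how this might look in its simplest form: each agency counts its records by some attribute, and a federated administrator merges the counts without seeing any raw record. The attribute name and the data are invented.

```python
from collections import Counter

def local_summary(records, key):
    """Run inside one agency: count its records per value of `key`."""
    return Counter(r[key] for r in records)

def federate(summaries):
    """Run by the federated administrator: combine per-agency counts."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total

# Hypothetical records held separately by two agencies.
agency_a = [{"region": "north"}, {"region": "north"}, {"region": "east"}]
agency_b = [{"region": "east"}, {"region": "south"}]

combined = federate([
    local_summary(agency_a, "region"),
    local_summary(agency_b, "region"),
])
print(combined)
```

Each agency keeps its autonomy over the raw data, while the combined counts show a pattern that neither could see alone. Counts are only the simplest kind of summary; what may be shared is a policy decision.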
Next, what data should we collect? We need to start gathering information about
various people. The question is, who? Everyone in the world? This is quite impossible.
Nevertheless we need to gather information about as many people as possible; because
sometimes even those who seem most innocent may have ulterior motives. One possi-
bility is to group the individuals depending on say where they come from, what they
are doing, who their relatives are etc. Some people may have more suspicious back-
grounds than others. If we know that someone has had a criminal record, then we need
to be more vigilant about that person.
Again to have complete information about people, we need to gather all kinds of
information about them. This information could include information about their behav-
ior, where they have lived, their religion and ethnic origin, their relatives and associates,
their travel records etc. Yes, gathering such information is a violation of one’s privacy
and civil liberties. The question is what alternative do we have? By omitting informa-
tion we may not have the complete picture. From a technology point of view, we need
complete data not only about individuals but also about various events and entities. For
example, suppose I drive a particular vehicle and information is being gathered about
me. This will also include information about my vehicle, how long I have driven, whether I
have other hobbies or interests such as flying airplanes, and whether I have enrolled in flight schools
and told the instructor that I would like to learn to fly an airplane but do not care
about learning take-offs or landings, etc.
Once the data is collected, the data has to be formatted and organized. Essentially
one may need to build a warehouse to analyze the data. Data may be structured or
unstructured data. Also, there will be some data that is warehoused that may not be of
much use. For example, the fact that I like ice cream may not help the analysis a great
deal. Therefore, we can segment the data in terms of critical data and non-critical data.
Once the data is gathered and organized, the next step is to carry out mining. The
question is what mining tools to use and what outcomes to find? Do we want to find
associations or clusters? This will depend on what our goal is. We may want to find
anything that is suspicious. For example, the fact that I want to learn flying without
caring about take-off or landing should raise a red flag as in general one would want to
take a complete course on flying. In Section 3.4.3 we discuss the various outcomes of
interest to counter-terrorism activities. Once we determine the outcomes we want, we
determine the mining tools to use and start the mining process.
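If the chosen outcome is associations, two standard measures are support (how often a set of items occurs together) and confidence (how often the consequent occurs when the antecedent does). The sketch below computes both over records represented as sets of observed attributes; the attribute labels A, B and C are placeholders, not real indicators.

```python
def support(records, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(1 for r in records if items <= r) / len(records)

def confidence(records, antecedent, consequent):
    """Of the records containing the antecedent, the fraction that also contain the consequent."""
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    matches = [r for r in records if antecedent <= r]
    if not matches:
        return 0.0  # the rule never applies, so it has no evidence behind it
    return sum(1 for r in matches if both <= r) / len(matches)

# Placeholder records; each is the set of attributes observed together.
records = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

print(support(records, {"A", "B"}))       # 3 of 5 records contain both
print(confidence(records, {"A"}, {"B"}))  # 3 of the 4 records with A also have B
```

A rule with high confidence but very low support rests on few observations, which is one reason such results still need review by a specialist.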
Then comes the very hard part. How do we know that the mining results are use-
ful? There could be false positives and false negatives. For example, the tool could
incorrectly produce the result that John is planning to attack the Empire State Building
on July 1, 2003. Law enforcement officials would then go after John, and the
consequences could be disastrous. The tool could also incorrectly produce the result that
James is innocent when he is in fact guilty. In this case law enforcement officials
may not pay much attention to James. The consequences here could be disastrous as well.
As we have stated, we need intelligent mining tools. At present we need human
specialists to work with the mining tools. If the tool states that John could be a
terrorist, the specialist will have to do some more checking before John is arrested or
detained. On the other hand, if the tool states that James is innocent, the specialist
should do some more checking in this case also.
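The two kinds of error described above can be made concrete by counting them. The following is only a minimal sketch in Python, not a description of any particular mining tool; the function name error_rates and the toy data are assumptions made for illustration.

```python
def error_rates(predicted, actual):
    """Count the four possible outcomes of a yes/no flagging tool.

    predicted[i] is True when the tool flags case i; actual[i] is True
    when case i really is a threat. Both lists are hypothetical inputs.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    tp = fp = tn = fn = 0
    for p, a in zip(predicted, actual):
        if p and a:
            tp += 1
        elif p and not a:
            fp += 1  # innocent case flagged (the "John" error)
        elif not p and a:
            fn += 1  # real threat missed (the "James" error)
        else:
            tn += 1
    # A rate is undefined when its class is empty, so report None then.
    fpr = fp / (fp + tn) if (fp + tn) else None
    fnr = fn / (fn + tp) if (fn + tp) else None
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn,
            "false_positive_rate": fpr, "false_negative_rate": fnr}


# Toy data: six cases, of which the last two are real threats.
predicted = [True, False, False, False, True, False]
actual = [False, False, False, False, True, True]
result = error_rates(predicted, actual)
print(result)
```

Because real threats are rare compared with the population being examined, even a small false positive rate can translate into many innocent people being flagged, which is one reason the human specialist remains in the loop.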
Essentially, with non real-time threats, we have time to gather data, build, say,
profiles of terrorists, analyze the data and take actions. Now, a non real-time threat could
become a real-time threat. That is, the data mining tool could state that there could
be some potential terrorist attacks. But after a while, with some more information, the
tool could state that the attacks will occur between September 10, 2001 and September
12, 2001. Then it becomes a real-time threat. The challenge will then be to find out
exactly what the attack will be. Will it be an attack on the World Trade Center, on the
Tower of London, or on the Eiffel Tower? We need data
mining tools that can continue with the reasoning as new information comes in. That
is, as new information comes in, the warehouse needs to get updated and the mining
tools should be dynamic and take the new data and information into consideration in
the mining process.
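One simple way to picture a tool that "continues with the reasoning as new information comes in" is a profile that is updated one observation at a time. The sketch below is an assumption made for illustration, not a technique named in this chapter: it uses Welford's online algorithm for a running mean and variance, and flags values that lie far from the profile built so far. The class name RunningProfile, the threshold and the sample numbers are all hypothetical.

```python
import math


class RunningProfile:
    """Running mean and variance of one numeric signal (Welford's online
    algorithm), so each new observation updates the profile without
    reprocessing everything stored so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Sample standard deviation; 0.0 until two observations exist.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def score(self, x):
        """Number of standard deviations between x and the current mean."""
        s = self.std()
        return abs(x - self.mean) / s if s > 0 else 0.0


def process_stream(values, threshold=3.0, warmup=5):
    """Score each value against the profile built so far, then fold it in."""
    profile = RunningProfile()
    flagged = []
    for i, x in enumerate(values):
        if profile.n >= warmup and profile.score(x) > threshold:
            flagged.append(i)
        profile.update(x)
    return flagged, profile


# Hypothetical daily counts of some monitored event.
stream = [10, 12, 11, 9, 10, 11, 40, 10]
flagged, profile = process_stream(stream)
print(flagged)
```

In this sketch a flagged value is still folded into the profile, which inflates the variance afterwards. Whether to exclude flagged values from the update is a design choice that a real system would have to make.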
again many of us could never have imagined that the sniper would do the shootings
from the trunk of a car. So the question is, how do we train the data mining tools such
as neural networks without historical data? Here we need to use hypothetical data as
well as simulated data. We need to work with counter-terrorism specialists and get as
many examples as possible. Once we gather the examples and start training the neural
networks and other data mining tools, the question is what sort of models do we build?
Often the models for data mining are built beforehand. These models are not dynamic.
To handle real-time threats, we need the models to change dynamically. This is a big
challenge.
Data gathering is also a challenge for real-time data mining. In the case of non real-
time data mining, we can collect data, clean data, format the data, build warehouses and
then carry out mining. All these tasks may not be possible for real-time data mining
as there are time constraints. Therefore, the questions are what tasks are critical and
what tasks are not? Do we have time to analyze the data? Which data do we discard?
How do we build profiles of terrorists for real-time data mining? We need real-time
data management capabilities for real-time data mining.
From the previous discussion it is clear that a lot has to be done before we can
effectively carry out real-time data mining. Some have argued that there is no such thing
as real-time data mining and that it will be impossible to build models in real-time.
Others have argued that without real world examples and historical data we cannot do
effective data mining. These arguments may be true. However, our challenge is then
perhaps to redefine data mining and figure out ways to handle real-time threats.
As we have stated, there are several situations that have to be managed in real-
time. Examples are the spread of smallpox, network intrusions, and even analyzing
data emanating from sensors. For example, there are surveillance cameras placed in
various places such as shopping centers and in front of embassies and other public
places. The data emanating from the sensors have to be analyzed in many cases in
real-time to detect/prevent attacks. For example, by analyzing the data, we may find
that there are some individuals at a mall carrying bombs. Then we have to alert the
law enforcement officials so that they can take action. This also raises questions
of privacy and civil liberties. What alternatives do we have? Should
we sacrifice privacy to protect the lives of millions of people? As stated in [12], we
need technologists, policy makers and lawyers to work together to come up with viable
solutions. We will revisit privacy in Section 3.5.
associations, link analysis, forming clusters, classification and anomaly detection. The
techniques that result in these outcomes are techniques based on neural networks,
decision trees, market basket analysis techniques, inductive logic programming, rough
sets, link analysis based on graph theory, and nearest neighbor techniques. As we
have stated in [14], the methods used for data mining are either top-down reasoning,
where we start with a hypothesis and then determine whether the hypothesis is true, or
bottom-up reasoning, where we start with examples and then come up with a hypothesis.
Let us start with association techniques. Examples of these techniques are market
basket analysis techniques. The goal is to find which items go together. For example,
we may apply a data mining tool to data that has been gathered and find that
John comes from Country X and that he has associated with James, who has a criminal
record. The tool also outputs the result that an unusually large percentage of people
from Country X have carried out some form of terrorist attack. Because of the
associations between John and Country X, between John and James, and between James
and criminal records, one may conclude that John should be placed under observation.
This is an example of an association. Link analysis is closely related to making
associations. While association-rule based techniques are essentially intelligent search
techniques, link analysis uses graph theoretic methods for detecting patterns. With
graphs (i.e., nodes and links), one can follow the chain and find links. For example, A
is seen with B, B is friends with C, C and D travel a lot together, and D has a
criminal record. The question is, what conclusions can we draw about A? Link analysis
is becoming a very important technique for detecting abnormal behavior. Therefore,
we will discuss this technique in a little more detail in the next section.
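The chain from A to D in the example above can be followed mechanically. The following is a minimal sketch, assuming the links are stored as undirected (node, node, relation) tuples and using a plain breadth-first search; the function name shortest_chain and the extra E and F nodes are hypothetical and are not taken from any specific link analysis tool.

```python
from collections import deque


def shortest_chain(links, start, goal):
    """Breadth-first search for the shortest chain of links between two nodes.

    links is a list of (node, node, relation) tuples, treated as undirected.
    Returns the list of nodes on the chain, or None if no chain exists.
    """
    graph = {}
    for a, b, _relation in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    if start not in graph or goal not in graph:
        return None
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            chain = []
            while node is not None:
                chain.append(node)
                node = previous[node]
            return chain[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


links = [
    ("A", "B", "seen with"),
    ("B", "C", "friends with"),
    ("C", "D", "travels with"),
    ("E", "F", "seen with"),  # unrelated pair, added for illustration
]
chain = shortest_chain(links, "A", "D")
print(chain)           # ['A', 'B', 'C', 'D']
print(len(chain) - 1)  # 3 links separate A from D
```

The search only reports that A is three links away from D. It says nothing about what, if anything, should be concluded about A, which is the question the text raises.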
Next let us consider clustering techniques. One could analyze the data and form
various clusters. For example, people with origins in Country X who belong to a
certain religion may be grouped into Cluster I. People with origins in Country Y
who are less than 50 years old may form Cluster II. These clusters are formed
based on travel patterns, eating patterns, buying patterns or behavior patterns.
While clustering divides the population without any pre-specified condition,
classification divides the population based on some predefined condition. The condition is
found based on examples. For example, we can form a profile of a terrorist. He could
have the following characteristics: male, less than 30 years old, of a certain religion and
of a certain ethnic origin. This means that all males under 30 years belonging to the same
religion and the same ethnic origin will be classified into this group and could possibly
be placed under observation.
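The difference between the two techniques can be sketched on a single numeric feature. This is only an illustration under stated assumptions: the feature (trips per year), the data, the initial cluster centers and the function names kmeans_1d and nearest_neighbor are all hypothetical. Clustering is shown with a tiny k-means loop and classification with a nearest neighbor rule, one of the technique families listed earlier.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny k-means on one numeric feature. No labels are supplied; the
    groups emerge from the data. Initial centers are passed in explicitly
    so that the result is deterministic."""
    centers = list(centers)
    assignment = []
    for _ in range(iterations):
        # Assign each value to its nearest center.
        assignment = [
            min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            for v in values
        ]
        # Move each center to the mean of its members.
        for k in range(len(centers)):
            members = [v for v, a in zip(values, assignment) if a == k]
            if members:
                centers[k] = sum(members) / len(members)
    return assignment, centers


def nearest_neighbor(examples, value):
    """1-nearest-neighbor classification: the label is taken from the
    closest previously labeled example, i.e. a predefined condition."""
    _, label = min(examples, key=lambda e: abs(e[0] - value))
    return label


# Hypothetical feature: number of trips per year for eight records.
trips = [1, 2, 2, 3, 20, 22, 25, 21]

# Clustering: no labels are given, only the data.
clusters, centers = kmeans_1d(trips, centers=[0.0, 10.0])
print(clusters)
print(centers)

# Classification: labels come from examples supplied in advance.
labeled = [(2, "infrequent traveler"), (22, "frequent traveler")]
print(nearest_neighbor(labeled, 18))
```

In the clustering call the two groups are discovered from the values alone, whereas the classifier can only assign labels that were present in the examples it was given.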
Another data mining outcome is anomaly detection. A good example here is learning
to fly an airplane without wanting to learn to take off or land. The general pattern
is that people want to get a complete training course in flying. However, there are now
some individuals who want to learn flying but do not care about take-off or landing.
This is an anomaly. Another example is that John always goes to the grocery store on
Saturdays. But on Saturday, October 26, 2002, he goes to a firearms store and buys a
rifle. This is an anomaly and may need some further analysis as to why he is going
to a firearms store when he has never done so before. Is it because he is nervous after
hearing about the sniper shootings, or is it because he has some ulterior motive? If he is
living, say, in the Washington DC area, then one could understand why he wants to buy
a firearm, possibly to protect himself. But if he is living in, say, Socorro, New Mexico, then
data. It is not straightforward to do this, as one has to make sure that all classified
information, even through implications, is removed. Another alternative is to find as
good data as possible in an unclassified setting for the researchers to work on. However,
the researchers have to work not only with counter-terrorism experts but also with data
mining specialists who have the clearances to work in classified environments. That is,
the research carried out in an unclassified setting has to be transferred to a classified
setting later to test the applicability of the data mining algorithms. Only then can we
get the true benefits of data mining.
searchers are taking different approaches to such data mining. Some have argued that
privacy enhanced data mining may be time consuming and may not be scalable.
However, we need to investigate this area more before we can come up with viable solutions.
Disclaimer: The views and conclusions expressed in this paper are those of the
author and do not reflect the policies or procedures of the MITRE Corporation or of
the National Science Foundation.
Bibliography
[16] http://www.kdnuggets.com.