Yinghui-Web User Behavioral Profiling For User Identification-Decision Support System, 2010

Decision Support Systems 49 (2010) 261–271
Web user behavioral profiling for user identification

Yinghui (Catherine) Yang ⁎
Graduate School of Management, University of California, Davis, One Shields Ave., Davis, CA 95616, United States
⁎ Tel.: +1 530 754 5967; fax: +1 530 752 2924. E-mail address: yiyang@ucdavis.edu.
doi:10.1016/j.dss.2010.03.001

Article info

Article history:
Received 3 January 2008
Received in revised form 24 January 2010
Accepted 4 March 2010
Available online 9 March 2010

Keywords:
Data mining
User profiles
User identification
Behavioral patterns

Abstract

In this paper, we propose a simple, yet powerful approach to profile users' web browsing behavior for the purpose of user identification. The importance of being able to identify users can be significant given a wide variety of applications in electronic commerce, such as product recommendation, personalized advertising, etc. We create user profiles capturing the strength of users' behavioral patterns, which can be used to identify users. Our experiments indicate that these profiles can be more accurate at identifying users than decision trees when sufficient web activities are observed, and can achieve higher efficiency than Support Vector Machines. The comparisons demonstrate that profile-based methods for user identification provide a viable and simple alternative to this problem.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

The development of Information Technology has led to an explosion of data. Businesses now gather and store more consumer data than ever before. Similarly, the amount of recorded data associated with individual users has grown tremendously in recent years. This is particularly true in the online world. User information is collected through online registrations and surveys. Moreover, information about a web user's activity is often collected surreptitiously, even as the user browses the web. Efficiently summarizing user-level information so that it can be used effectively in electronic commerce is an important problem.

User profiles can help to summarize the large amounts of information available from a user and to achieve goals such as product recommendations and personalized information delivery. A user profile can include explicit information provided by users through registration and surveys [5,31]. Demographics such as a user's name, telephone number, address and information about hobbies are often offered by users explicitly. The simple facts about users' activity or transactions can also be categorized as explicit information. For example, such facts can include the frequency of a user's visit to a web site, the average amount of money spent per purchase transaction and the most purchased product category. A user profile can also contain implicit information that is generated from analyzing users' activity often through more sophisticated statistical or data mining techniques (e.g. finding usage rules [4]).

In this paper, we build implicit user profiles for the purpose of user identification. Previous research on web user identification has focused mainly on the use of tracking techniques such as cookies, logins and keys [20,23]. Below, we define the problem of user identification based on user profiles built on web usage patterns. We establish a user profile for each user from a group of known users based on their web activities, and a given anonymous web session or a series of sessions can get matched (or identified) to be one of these users or none of these users (i.e. if the matching score is very low, then we can say that the user is not matched to any one of the profiles). This is different from most biometric-based identification (e.g. finger print or iris identification). For example, finger print identification is able to handle very large scale identification problems because there is a unique match between the finger print and a user. Behavior-based identification will be less accurate when it comes to large scale problems because many users could share similar behavioral patterns. Therefore, identifying web users based solely on their online behavior rather than using tracking techniques is a difficult problem. Under cases where user identification based solely on behavior is not feasible, behavioral information can be used together with other tools to achieve better performance. For example, for a large online retailer, several user accounts may share the same IP address (e.g. members of the same family). The retailer might be able to identify the members sharing the same IP address (assuming that there is some known login information from this IP address which can be used to build user profiles). The ability to further identify users within the shared IP address can help achieve higher performance for many applications.

The Federal Financial Institutions Examination Council (FFIEC) advocates that authentication methodologies should involve multiple channels to avoid exploiting single channel authentication systems. Behavioral pattern-based user identification systems, if designed properly, may provide one such additional channel in a multi-channel approach. Designing such a system will require developing accurate user identification models. This in turn requires a deeper understanding of the factors in user behavior that can result in greater accuracy in
user identification. Our research in this paper focuses on finding such factors in capturing users' behavioral patterns using user profiles.

The recurring nature of one's behavior can potentially benefit user identification. Behavioral patterns can represent the repeating elements in users' activity. In the web context, such patterns can capture web users' repeat visits to certain sites (e.g. in 40% of the days, a user visits yahoo.com). Some more examples are: (a) 80% of a user's web sessions¹ are over one hour long. (b) The number of pages a user visits in a web session is often between 5 and 10. (c) Most of a user's sessions start with cnn.com. The extent to which a behavioral pattern repeats itself can be captured by the strength of a behavioral pattern. In this paper, we profile web users according to the strength of their behavioral patterns. The behavioral patterns are learned from users' web browsing activity (e.g. visiting msn.com is a frequent behavioral pattern). Such frequent behavioral patterns can be discovered using existing frequent pattern discovery methods (e.g. Apriori [2]). Metrics similar to support and lift are used to measure the strength of a behavioral pattern. For example, if 50% of a user's web sessions contain MySpace.com, then 50% measures the strength of the pattern "visiting MySpace.com in a web session". If we consider MySpace.com as an item, then 50% is the support of this item [2]. A list of behavioral patterns together with metrics measuring their strength is stored in a user's profile. At a certain point of time, user profiles are built first on the data before that time (the profiling stage), and then are applied to the data observed after that time point in order to identify users (identification stage).

¹ A web session is a commonly used unit of analyzing web user behavior. It contains a list of consecutive URLs visited. Typically, industry heuristics draw session boundaries if the time difference between consecutive clicks exceeds a chosen threshold, say 30 min [36].

We conduct extensive experiments on user-centric web data to evaluate the effectiveness of using our profiling approach to identify web users. We compare our profile-based user identification procedure with the classification models and demonstrate that profile-based methods can be more accurate at identifying users than decision trees when sufficient web activities are observed, and can achieve higher efficiency than Support Vector Machines. The comparisons demonstrate that profile-based methods for user identification provide a viable and simple alternative to this problem.

User identification based on behavioral information can be applied in various situations. It can be directly applied to situations where enough distinctive information about the users is available. Such situations could include small to medium sized web sites, intranets and environments with shared computers. Sometimes user identification is the ultimate goal and sometimes it is an intermediate goal to help achieve better personalization and targeting. Even though the data set we used in this paper is user-centric (a richer source for researching web user behavior [33]), the same idea can be applied to data collected by a web site as well. The site-centric data has very detailed behavioral information about pages users visited within the site. If a web site is able to capture enough information about its users, the site can adopt the same method of profiling to capture the strength of users' behavioral patterns. If a site can identify a user who is not explicitly signed in, there may be opportunities for targeted recommendations. When the scale of a site gets large, which makes identification based solely on behavioral information infeasible, behavioral patterns can be combined with other tracking tools to enhance performance (see the Conclusion section for more detailed discussion).

There are also potential client-side applications that can benefit directly from profiling user-centric data. First, affiliated web sites can share data among themselves (e.g. Windows Live ID, originally Microsoft Passport, is based on similar ideas of sharing information among affiliated web sites). User profiles across these affiliated sites can certainly help them coordinate product recommendations. Second, users may opt-in to download client-side software from a trusted third party that will track client-side activity to build user identification models. Such models may be used to provide behavioral authentication services on behalf of the user. Other potential client-side or user-centric applications can also be found in existing research [14,15,29].

The contributions of this paper are as follows. First, we introduced a simple, yet powerful method for generating user profiles from web usage data. The profiles capture the strength of user behavioral patterns. We conducted systematic experiments to compare the performance of the profile-based approach and the classification models, and have found good results. Second, this paper is among the first to study user behavior patterns in web usage data for the purpose of user identification. We applied user profiles for a new application, user identification, and proposed a framework for profiling and identifying users. The framework was applied in the user-centric web browsing context and was demonstrated to be effective at identifying users.

The rest of the paper is organized as follows. In Section 2, we survey the research on user profiling and user identification. Section 3 formulates the problem of using user profiles to identify users and describes the detailed procedure for user profiling and user identification. In Section 4, we extensively evaluate the proposed method using 140 sub-datasets derived from a large dataset that captures user-centric web browsing activity. Finally, we discuss limitations and future work in Section 5. We also include the procedures used to efficiently update user profiles and user rankings in the Appendix.

2. User profiles and user identification

In the data mining community, user profiling is often studied in the context of personalization [4,25] and fraud detection [7,13,18,19]. In the information retrieval community, user profiles are used for personalized information retrieval or filtering [12,16,37].

Constructing accurate and comprehensive user profiles of individual users is one of the key issues in developing personalization applications. A user profile normally contains a list of variables [8,9] and/or conjunctive rules (e.g. association rules [2,3] and classification rules [6]). Ref. [4] studies user profiles as a set of rules. Their focus is on rule validation since there are large numbers of rules discovered for each user. Rules validated by experts serve as the user profiles. Instead of just individual user profiles for personalization, there has also been work on learning aggregate profiles, such as the work of Ref. [25], which builds profiles that can apply across groups of users who may have similar interests. They use clustering to generate aggregate user profiles which are evaluated in the context of providing recommendations as an integrated part of a personalization engine. Rather than learning profiles from click-stream data, Ref. [17] describes an approach that unobtrusively monitors user activity on pages (such as how much the user scrolls) to build profiles that capture user interest in specific pages or content.

There is also other user profiling work based on web usage data that is not associated with personalization. Ref. [27] generates user profiles from various web log files of a web site. They assume that the owner of the web site defines a list of topics and associates each web page with one or more of these topics. A user profile is generated according to the time the user spent on the web pages of different topics. Ref. [30] generates profiles from server logs of a career web site. A user profile is generated from the number of clicks on job openings. Ref. [34] builds a rule-based web user profiling platform that can be applied to different web services to generate user profiles from their usage data, and the platform is not tied to any specific applications. None of the previous research on user profiling using web usage data has considered user identification based on user-centric data.

The use of profiles for fraud detection is crucial, since it is not possible (in real time) to extract and analyze all the associated records in order to detect a potentially fraudulent deviation in behavior. Using profiles for fraud or intrusion detection is often studied in the context of telecommunication and network intrusion. For fraud and intrusion detection, profiles are used in two ways: in signature-based detection
methods, and in anomaly detection methods [9,10]. In signature-based methods, a library of aberrant attack profiles (called signatures) is stored in a signature database. Profiles of a user's recent traffic are then compared to these signatures to detect fraudulent behavior. In anomaly detection, the profile for each entity is itself the baseline for comparison. A well formulated profile captures the typical behavior of a user, allowing one to compare recent events to the appropriate profile to ensure that the behavior is within the norm. New traffic for a user is compared against their individual profile to determine if the user's behavior has changed. A significant departure from the baseline is a signal that the account may have been compromised [9]. The profiling problem within the context of fraud detection was first studied in Ref. [13] in the cellular phone industry. The call data is organized by account, and each call record is labeled as fraudulent or legitimate. The rule learning system is applied to an account's calls and it produces a set of rules that serve to distinguish, within that account, the fraudulent calls from the legitimate calls. The performance of the system was compared with that of manual systems. Ref. [18] also studies telecommunication fraud. They implement three different variable-based user profiles and each profile's ability to characterize user behavior in order to discriminate normal activities from fraudulent ones is tested using feed-forward neural networks. The evaluation schema in the fraud detection domain is more straightforward than that for personalization. The performance measurement is simply the accuracy in identifying fraud in the testing data (both false positive and false negative).

In the information retrieval context, a user profile takes into account the feedback of the retrieval task, that is, the relevance of the previously returned documents. Research in information retrieval is now moving into a personalized scenario where a retrieval or filtering system maintains a separate user profile for each user. In this context, user profiles are often generated to capture users' feedback (both explicit and implicit) to the retrieval results. Users can indicate explicitly whether a retrieved document is relevant, or the retrieval system can implicitly infer the users' feedback based on users' interaction with the documents (e.g. mouse and keyboard usage, page, navigation, book-marking, and editing) [37]. Such profiles often contain a list of words [12]. The evaluation is on the retrieval performance. This type of profile is not the focus of this paper.

There is currently no research on building user behavioral profiles from web usage data for the purpose of user identification. Previous research on web user identification has mainly focused on using tracking techniques such as cookies [20], web bugs [20], login [23] and keys [23]. Ref. [1] also disclosed several ways of identifying users and collecting demographic information and market information, including branding a browser with a unique identification in each user request, identifying a user by key strokes or mouse clicks, gathering demographic information using multiple datasets and by monitoring network traffic. Very limited research is being done on analyzing web users' behavior for user identification. Our previous research [28] sheds some light on the possibilities of user identification based on users' web usage behavior, and it demonstrates that users can be identified, to some extent, based on their behavior if enough information is observed. In our current research in this paper, we further investigate whether user profiles can help to identify users based on their behavioral patterns.

Recently, with the widespread use of technology, there has been substantial interest in identifying unique behavioral characteristics that can possibly serve as identifiers in domains outside of web usage. Ref. [26] shows that users have distinct mannerisms in which they use computer keyboards, and that users have unique keystroke dynamics. Ref. [11] extends this work to the use of mouse movements in addition to keystroke dynamics, and they note that the combination can often be used to uniquely identify humans. Ref. [22] shows that authors have unique writing styles that enable identifying them from text. In a similar vein, Ref. [21,35] shows that users have unique writing patterns when they author content for online message boards.

3. Building user profiles for user identification

User identification through behavioral patterns is a new and exciting area of research. In this paper, we propose a simple yet powerful (as shown in the experiments) profile-based method for user identification.

A general framework for user identification based on user profiles is depicted in Fig. 1. User profiles are built from consecutive web sessions with known user IDs. These profiles are used to predict the owner of future anonymous web sessions (i.e. user identification). More specifically, we assume a web browsing dataset containing $N$ users $U = \{u_1, u_2, \ldots, u_N\}$. We pick a time $T$ as the present time. Before time $T$, user $u_i$ has $E(u_i)$ web browsing sessions $\{e_1^i, e_2^i, \ldots, e_{E(u_i)}^i\}$, where $u_i \in \{u_1, u_2, \ldots, u_N\}$. The sessions are ordered by time. The problem of building user profiles for user identification is defined as learning the profiles of users by using sessions observed before time $T$ to help in identifying users after time $T$, upon observing anonymous sessions.

At time $T$, upon observing the web activity (organized into labeled web sessions with known user IDs) of $N$ users ($U = \{u_1, u_2, \ldots, u_N\}$), the user profile for user $u_i$ (for $i = 1$ to $N$) is constructed from user $u_i$'s web sessions, either alone or together with other users' web sessions (Profiling Stage, see Section 3.1). Looking forward from time $T$, a series of web sessions coming from an anonymous user is observed. Based on the behavioral patterns we observe from the list of new sessions, we estimate the likelihood that these new sessions are generated by a certain user (Identification Stage, see Section 3.2).

The curved arrows indicate that each component in the procedure is dynamic. As new data comes in, we can update the present time $T$ and update the user profiles incorporating more recent data. When time $T$ and the user profiles are fixed, we can also update the user rankings used for user identification as we observe more anonymous sessions (see the Appendix for methods used to update user profiles and rankings efficiently). The users ranked on the top are the users who are most likely generating the new anonymous sessions.

Fig. 1. The procedure of building user profiles for user identification.

3.1. Profiling stage

The first stage in the procedure illustrated in Fig. 1 is to build user profiles. We first construct user profiles based on the within-user strength of behavioral patterns, then on both the within-user strength and the overall strength of behavioral patterns. Let $P_{all} = \{p_1, p_2, \ldots, p_K\}$ be the list of patterns we consider when building the profiles. $K$ is the number of candidate patterns considered. Below, we define the within-user strength of a behavioral pattern.

Definition 1. Within-user pattern strength.

For user $u_i$ (for $i = 1$ to $N$), the within-user strength of a behavioral pattern $p_j$ (for $j = 1$ to $K$) is

$$\frac{|D_{u_i}^{p_j}|}{|D_{u_i}|},$$

where $|D_{u_i}|$ is the number of sessions from $u_i$ and $|D_{u_i}^{p_j}|$ is the number of sessions from user $u_i$ that contain behavioral pattern $p_j$.
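As an illustration of Definition 1, the following minimal Python sketch computes the within-user strength of a few patterns for one user. The data layout is an assumption made only for this example: each session is a list of visited sites, and each pattern is a `frozenset` of sites that a session "contains" when it includes all of them. The function name `within_user_strength` and the toy sessions are hypothetical and do not come from the paper.

```python
def within_user_strength(sessions, patterns):
    """Return {pattern: share of the user's sessions that contain the pattern}.

    sessions: list of sessions, each a list of visited sites (assumed format).
    patterns: iterable of frozensets of sites; a session contains a pattern
              when it includes every site in the frozenset.
    """
    total = len(sessions)
    strengths = {}
    for pattern in patterns:
        # |D_ui^pj|: number of this user's sessions containing the pattern
        hits = sum(1 for session in sessions if pattern <= set(session))
        # |D_ui^pj| / |D_ui|, with 0.0 for a user who has no sessions
        strengths[pattern] = hits / total if total else 0.0
    return strengths


# Toy sessions for a single hypothetical user
sessions_a = [
    ["cnn.com", "yahoo.com"],
    ["cnn.com"],
    ["yahoo.com", "amazon.com"],
    ["cnn.com", "amazon.com"],
]
patterns = [
    frozenset({"cnn.com"}),
    frozenset({"yahoo.com"}),
    frozenset({"cnn.com", "amazon.com"}),
    frozenset({"msn.com"}),
]
profile_a = within_user_strength(sessions_a, patterns)
```

In this toy data the pattern `{cnn.com}` appears in three of the four sessions, so its within-user strength is 0.75, while `{msn.com}` never appears and gets strength 0.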
For a given user, the within-user strength of a behavioral pattern is equivalent to the support of a pattern among all the sessions of this user (i.e. the number of sessions containing this pattern divided by the total number of sessions of a user).

Fig. 2 illustrates the profiling procedure that uses the within-user strength of behavioral patterns (i.e. within-user support). For the rest of the paper, we refer to this profiling method as the support-based profiling method.

Fig. 2. Support-based profiling for user identification.

Steps 1–4 select the candidate patterns we use for building the profiles. The candidate patterns are the union of the top $X$ strongest patterns from each user. For each user, we select the top $X$ patterns with the highest within-user strength, and then combine them together to form the set of patterns we use later for building profiles. The rationale behind this selection heuristic is that a pattern must be strong for at least one out of the $N$ users in order for it to be of interest. We use M1 to discover the top patterns within each user. If we consider visiting a site or combination of sites as the format of a pattern, M1 can be a frequent itemset discovery method such as Apriori [2]. There can be various ways to select candidate patterns depending on the format of the pattern.

Step 5 computes the within-user strength of all the candidate patterns for every user, and user profiles are created. For each user, the profile is a vector containing $K = |P_{all}|$ numeric values, where $|P_{all}|$ is the number of candidate patterns. Each number in the profile corresponds to the within-user strength of a candidate pattern. For example, if the top patterns for user A are {cnn.com, yahoo.com, amazon.com}, and the top patterns for user B are {msn.com, yahoo.com, myspace.com}, then the patterns we consider when building profiles for both user A and B are {cnn.com, yahoo.com, amazon.com, msn.com, myspace.com}. For user A, the profile could look like (0.8, 0.6, 0.5, 0.01, 0), and the profile for user B could be (0.01, 0.7, 0.03, 0.8, 0.6).

In the field of information retrieval, the term weighting function known as IDF (Inverse-Document-Frequency) is widely used, usually as part of a TF-IDF function. The intuition was that a term which occurs in many documents is not a good discriminator, and should be given less weight than one which occurs in few documents, and IDF was a heuristic implementation of this intuition [32]. In data mining, a corresponding term is lift. In the context of web user browsing sessions, the lift of a pattern can be defined as the frequency of a pattern within a user's sessions divided by the frequency of the pattern across all users. A lift value greater than 1 indicates that a pattern is more likely to happen for this user than expected. Because IDF has been proven to be very successful in information retrieval, in this paper, we will also investigate using lift of patterns to represent user profiles, to see if it has any advantage over support.

Before presenting the profiling procedure based on lift, we first need to define the overall strength of a pattern and the relative pattern strength.

Definition 2. Overall pattern strength.

The overall strength of a behavioral pattern $p_j$ (for $j = 1$ to $K$) is

$$\frac{|D^{p_j}|}{|D|},$$

where $|D|$ is the total number of sessions from all users and $|D^{p_j}|$ is the number of sessions from all users that contain behavioral pattern $p_j$.

Definition 3. Relative pattern strength.

For user $u_i$ (for $i = 1$ to $N$) and patterns $p_j$ (for $j = 1$ to $K$), the relative strength of a behavioral pattern $p_j$ for user $u_i$ is

$$\frac{|D_{u_i}^{p_j}| / |D_{u_i}|}{|D^{p_j}| / |D|},$$

where $|D_{u_i}|$ is the number of sessions from $u_i$, $|D_{u_i}^{p_j}|$ is the number of sessions from user $u_i$ that have behavioral pattern $p_j$, $|D|$ is the total number of sessions from all users and $|D^{p_j}|$ is the number of sessions from all users that contain behavioral pattern $p_j$.

For a given user, the relative strength of a behavioral pattern is equivalent to the lift of a pattern² (i.e. the support of the pattern within this user divided by the support of the pattern across all users, which is the same as the within-user strength of this pattern divided by the overall strength of this pattern).

² Lift is normally defined within the context of association rules. The lift of rule $A \to C$ is defined as $\frac{\mathrm{confidence}(A \to C)}{\mathrm{support}(C)} = \frac{\mathrm{support}(A \cap C) / \mathrm{support}(A)}{\mathrm{support}(C)}$. Definition 3 is equivalent to the lift of rule $u_i \to p_j$.

Fig. 3 below illustrates the profiling procedure using lift.

Similar to Fig. 2, Steps 1–4 select the candidate patterns for building the profiles. Step 5 calculates the overall strength of the candidate patterns. The overall strength of a given pattern is actually the support of that pattern across all user sessions. Step 6 computes the within-user strength of all candidate patterns for each user. In Step 7, user profiles are created. Again, for each user, the profile is a vector
containing $K = |P_{all}|$ numeric values. Each number in the profile corresponds to the relative pattern strength of a candidate pattern.

Fig. 3. Lift-based profiling.

3.2. Identification stage

The second stage in Fig. 1 is the user identification stage. In this stage, the user profiles that were previously built are used to help in identifying the owner of future sessions. User identification is based on the difference between two profiles, which is captured in the distance function defined in Definition 4. The distance between two profiles is defined as the Euclidean distance between two vectors.

Definition 4. Distance between two profiles.

For two profiles $R_1 = (a_1, a_2, \ldots, a_K)$ and $R_2 = (b_1, b_2, \ldots, b_K)$, the distance between $R_1$ and $R_2$ is

$$\sqrt{\sum_{j=1}^{K} (a_j - b_j)^2}.$$

When new sessions are observed after time $T$, we assume that these sessions are from the same anonymous user. Given this list of sessions, we can use a similar method to calculate the support and lift values of all the candidate patterns for that anonymous user. The candidate patterns are still the ones generated based on all of the user sessions before time $T$. The overall pattern strength used to calculate the lift value is also calculated from all the user sessions observed before time $T$. Since sessions coming from all $N$ users are observed before time $T$, the overall pattern strength calculated from sessions before time $T$ is more accurate. Also, since the new sessions after time $T$ are from the same user, it is infeasible to compute the overall pattern strength from the data after time $T$. Instead, in the identification stage, the within-user strength of all the candidate patterns for the anonymous user is computed from the new sessions after time $T$. The profile for that anonymous user is based on the within-user strength calculated from sessions after time $T$ and the overall pattern strength calculated from sessions before time $T$. The user profiles constructed for all the $N$ users in the profiling stage are then compared with the profile of this anonymous user to decide the identity of the anonymous user.

Fig. 4 illustrates the detailed procedure for user identification, based on relative pattern strength. The procedure based on within-user pattern strength is similar, so we do not repeat it here (the difference is explained at the end of Fig. 4).

Fig. 4. Identification procedure.

Step 1 calculates the within-user strength of all the candidate patterns for this anonymous user, based on newly observed sessions from this anonymous user. In Step 2, the within-user strength is then divided by the overall pattern strength computed before time $T$. This forms the user profile (after time $T$) for this anonymous user. In Steps 3 and 4, this anonymous profile is compared with all of the user profiles constructed from sessions before time $T$. Definition 4 is used to calculate the distance between two profiles, and the users are ranked according to the distance between their profiles and the anonymous profile. The ranking can change dynamically as $W$ (the number of newly observed sessions) changes. This approach for identification is similar to the idea of k-nearest neighbor with k being 1. If we were to assign a label to the anonymous user, we would pick the profile with the smallest distance and assign the user ID of that profile to the anonymous user. Of course, we can also assign several user IDs to the anonymous user, with the associated likelihood values based on the distance.

Over time, we observe more and more user activities, and so the user profiles may change. In the Appendix, we provide a procedure to update user profiles, based solely on incremental sessions. This procedure allows us to calculate the updated user profiles without looking at the sessions before time $T$, which can be much more efficient. Also in the Appendix, we include a procedure to update user rankings as we observe more anonymous sessions, when time $T$ and the user profiles are fixed.

4. Experiments

In this section, we discuss the data we use, the experimental design and the results.
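Before the data description, the lift-based profiling and identification procedure of Section 3 can be summarized in a short Python sketch. It is only an illustration under stated assumptions, not the paper's implementation: patterns are restricted to single sites, a simple visit count stands in for the pattern discovery method M1 (the paper mentions Apriori as one option), and the function names (`support`, `top_patterns`, `build_lift_profiles`, `rank_users`) and toy sessions are hypothetical. `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def support(sessions, pattern):
    """Share of sessions that contain every site in the pattern."""
    if not sessions:
        return 0.0
    return sum(1 for s in sessions if pattern <= set(s)) / len(sessions)


def top_patterns(sessions, x):
    """Top-x single-site patterns by within-user support (assumed stand-in for M1)."""
    counts = Counter(site for s in sessions for site in set(s))
    ranked = sorted(counts, key=lambda site: (-counts[site], site))
    return [frozenset([site]) for site in ranked[:x]]


def build_lift_profiles(train, x):
    """train maps user ID -> sessions observed before time T."""
    # Candidate patterns: union of each user's top-x patterns
    candidates = sorted(
        {p for sess in train.values() for p in top_patterns(sess, x)}, key=sorted
    )
    # Overall pattern strength (Definition 2), from all sessions before T
    all_sessions = [s for sess in train.values() for s in sess]
    overall = {p: support(all_sessions, p) for p in candidates}
    # Relative pattern strength (Definition 3); overall[p] > 0 because every
    # candidate occurs in at least one training session
    profiles = {
        user: [support(sess, p) / overall[p] for p in candidates]
        for user, sess in train.items()
    }
    return candidates, overall, profiles


def rank_users(new_sessions, candidates, overall, profiles):
    """Rank known users by Euclidean distance (Definition 4) to the anonymous profile."""
    anon = [support(new_sessions, p) / overall[p] for p in candidates]
    distances = {user: math.dist(anon, prof) for user, prof in profiles.items()}
    return sorted(distances.items(), key=lambda item: item[1])


# Toy data: sessions before time T for two hypothetical users
train = {
    "A": [["cnn.com", "yahoo.com"], ["cnn.com"], ["cnn.com", "amazon.com"], ["yahoo.com"]],
    "B": [["msn.com", "myspace.com"], ["msn.com"], ["yahoo.com", "msn.com"], ["myspace.com"]],
}
candidates, overall, profiles = build_lift_profiles(train, x=2)

# Anonymous sessions observed after time T
new_sessions = [["cnn.com"], ["cnn.com", "yahoo.com"]]
ranking = rank_users(new_sessions, candidates, overall, profiles)
```

With this toy data the anonymous profile is closest to user A, so A is ranked first. Dropping the division by `overall[p]` in both functions would give the support-based variant of the same procedure.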
4.1. Data

User-centric data provides benefits in understanding user online behavior for the following reason. Site-centric data represents an incomplete picture of user behavior on the web because it does not capture a user's activity on external sites. User-centric data, on the other hand, is data collected at the user level, so it captures entire histories of web surfing behavior for each user. The dataset that we use was provided by a commercial data vendor and it captures the web browsing history of a panel of users who volunteered to be tracked in this manner. The data provided to us is the browsing behavior of a sample of 50,000 users over one year.

The raw data has simple summary statistics for each web session, specifically the name of the site visited, number of pages seen, the starting time and duration of a session. In order to have adequate data for testing (specifically, in order to have a sufficient number of sessions for each user) and to have adequate features that can be constructed from the data, we adopt a user-centric perspective here, where a session is one continuous period of user activities where a user may visit multiple web sites.

Rather than running our method on the entire dataset, we create multiple datasets by combining data from a specific number of users. This enables us to explicitly study the effect of the number of users/classes and to deal with data sparsity by selecting only those users for whom enough data is available. Specifically, we create several sub-datasets by combining sessions belonging to 2, 5, 10, 20, 50, 75 and 100 different users.³ That is, for each such sub-dataset, the dependent variable (user ID) may take 2, 5, 10, 20, 50, 75 or 100 different values, respectively. However, since each such sub-dataset represents only one sample of a chosen set of users, we repeat the random selection process twenty times to create 140 datasets. To each of those datasets, we apply the two profiling methods as well as two classification methods for comparison. There are three additional criteria that we use when selecting the users, discussed below.

First, each panelist in the sample represents a single computer that is tracked; hence, we restrict our selection of users to those corresponding to a household size of one in the household demographic file provided. However, we note that this does not guarantee that only a single “person” uses each computer.

Second, we need to sample users that have a sufficient level of browsing activity such that we have enough out-of-sample data on which the models can be tested. The larger the (minimum) number of sessions per user, the larger the training as well as the testing datasets used. However, using a higher value may bias the sample toward users who are particularly active online. In the experiments here, we use 300 sessions as the minimum cutoff (i.e. we choose users who have at least 300 sessions in the data).

Third, when a given number of users are selected, the priors in the data are unequal to start with. For instance, when constructing a dataset of three users, the number of sessions for each user may be 300, 600 and 2100, resulting in class priors of 0.1, 0.2 and 0.7 respectively. A naïve classifier predicting the most frequent class will start with an accuracy of 70%. To prevent this from introducing any accuracy bias, we select the same number of sessions for all of the users selected in any given sub-dataset. In the previous example, this would mean selecting the first 300 sessions for each user. Since the original sessions are ordered by time, the sessions selected in this manner will still be ordered by time. This guarantees exactly equal class priors. For example, a 20-user dataset would have exactly 5% class priors for all the classes. Note that this would mean having to ignore the later sessions of some of the users selected.

After the user screening process, we have 2798 qualified users. We then randomly select sessions from these users to form the sub-datasets as discussed earlier. Each sub-dataset is then split into a training set and a validation set. For each user in a sub-dataset, we keep the first 2/3 of their sessions in the training dataset (data before time T) and the last 1/3 of the sessions in the validation dataset (data after time T).
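The sub-dataset construction described in this section (minimum session cutoff, equal class priors, and a time-ordered 2/3 to 1/3 split) can be sketched as follows. This is an illustrative sketch rather than the procedure actually used in the experiments: the function name, its parameters, and the toy data are hypothetical, and it assumes that each user's sessions are already sorted by time.

```python
import random


def build_sub_dataset(sessions_by_user, n_users, min_sessions=300, seed=0):
    """Pick n_users qualified users, equalize session counts, and split by time.

    sessions_by_user maps a user ID to a time-ordered list of sessions.
    Returns (train, valid), each mapping a user ID to a list of sessions.
    """
    # Screening: keep only users with at least min_sessions sessions.
    qualified = [u for u, s in sessions_by_user.items() if len(s) >= min_sessions]
    rng = random.Random(seed)
    chosen = rng.sample(sorted(qualified), n_users)

    # Equal class priors: keep the same number of earliest sessions per user.
    keep = min(len(sessions_by_user[u]) for u in chosen)
    cut = (2 * keep) // 3  # first 2/3 for training, last 1/3 for validation

    train, valid = {}, {}
    for u in chosen:
        first = sessions_by_user[u][:keep]
        train[u], valid[u] = first[:cut], first[cut:]
    return train, valid


# Toy data with a cutoff of 3 sessions instead of 300.
toy = {
    "a": list(range(3)),
    "b": list(range(6)),
    "c": list(range(21)),
    "d": [0, 1],  # too few sessions, screened out
}
train, valid = build_sub_dataset(toy, n_users=3, min_sessions=3)
print({u: (len(train[u]), len(valid[u])) for u in sorted(train)})
```

Because every selected user contributes the same number of sessions, the class priors in both the training and validation sets are equal by construction, at the cost of ignoring later sessions of the more active users.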
³ These numbers are chosen to reflect different scales (i.e. a few are in the lower range (2, 5), some are in the intermediate range (10, 20, 50) and the rest are in the higher range (e.g. 75, 100)).

4.2. Experimental design

For each of the 140 sub-datasets, we implement three methods to predict user IDs (identify users): the support-based profiling method, the lift-based profiling method and the classification method. User profiles are constructed using all of the user sessions in the training set (corresponding to all the sessions before time T in Fig. 1), and are then used to predict user IDs in the validation set (sessions after time T in Fig. 1). In the identification stage, the number of sessions observed (the sliding window size) is changed. We use 11 different sliding window sizes, 1, 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100, to investigate how sliding window size affects the predictive accuracy. Each sliding window contains one or more sessions. The number of sessions contained in a sliding window is the size of the sliding window. For each session, we observe all the web sites visited in that session. For multiple sessions in a sliding window, we are able to compute the support and lift of all the candidate patterns (selected using only the training set). The sessions in a sliding window correspond to the new observed sessions {e1,e2, ... , eW} mentioned in Fig. 4. For each sliding window, we predict the user IDs using the two profiling methods and the classification method.

Fig. 5. Experiment procedure.

In order to evaluate the effectiveness of the profile-based method for user identification, we also implement a classification method. We choose Weka's J4.8 (implementing the C4.5 decision tree) as the classifier, since classification trees in general have been shown to be highly accurate classifiers, and it also works well with classification problems that contain a large number of classes. Here, when we mix 100 users' sessions together, the number of classes is 100. The specific choice of Weka is also for convenience, since Weka is an open-source data mining platform that lends itself easily to automation within scripts. We also take the sliding window approach in creating different units of analysis. But for the classification model, both the training set and the validation set are preprocessed using the sliding window approach, because the data records used for both training and validating the classification model should take the same format. Each data point corresponds to a sliding window containing one or more sessions, and the features for the classification model are the support of the candidate patterns within a sliding window. There are K different features, corresponding to the support of the K candidate patterns. We then compare the two profile-based methods and the sliding window-based classification method to see which one achieves higher predictive accuracy under different settings (varying the number of
Table 1
Predictive accuracy for the three methods (in percentages).

                       Sliding window size
# of users  Method      1     10     20     30     40     50     60     70     80     90    100
2           C       89.01  98.13  94.29  95.89  98.68  95.33  92.51  94.77  93.19  90.87  88.56
            SP      87.65  94.93  95.84  95.51  95.64  95.82  95.86  96.47  95.69  95.43  95.29
            LP      89.80  98.61  98.93  99.18  99.18  99.20  99.21  99.19  99.18  99.25  99.30
5           C       85.35  90.53  91.58  93.72  91.59  89.88  87.24  88.40  84.69  84.93  81.36
            SP      80.27  92.16  92.99  93.18  93.46  93.70  93.93  94.49  94.86  95.16  95.23
            LP      72.78  89.20  90.65  91.77  92.09  92.80  93.21  93.83  94.76  95.09  95.05
10          C       79.36  90.72  89.13  87.31  88.06  86.20  82.31  73.71  75.53  72.88  66.80
            SP      73.19  88.83  89.79  90.76  91.17  91.49  92.00  92.40  92.85  93.13  93.11
            LP      62.57  84.55  87.58  89.62  90.40  91.01  91.46  91.92  92.04  92.52  92.53
20          C       74.84  87.98  85.57  83.86  78.75  73.43  71.04  67.32  61.89  55.16  48.95
            SP      68.45  87.13  88.41  88.75  89.16  89.44  89.85  90.02  90.20  90.70  90.90
            LP      53.16  78.67  82.00  83.36  84.11  84.95  85.64  85.96  86.46  87.25  87.83
50          C       69.71  83.56  82.09  77.51  75.09  71.44  66.12  58.31  51.17  46.92  38.32
            SP      61.91  83.24  85.46  86.45  87.46  88.15  88.68  89.21  89.54  89.80  90.12
            LP      44.62  68.08  71.81  73.77  75.06  75.96  76.94  77.61  78.30  79.29  80.04
75          C       64.96  83.36  81.27  77.88  74.20  71.04  64.22  58.54  50.40  42.77  39.22
            SP      57.64  82.60  85.08  86.20  86.84  87.44  87.85  88.28  88.67  89.12  89.49
            LP      41.05  67.07  71.26  72.67  73.93  75.00  75.78  76.61  77.39  78.22  78.88
100         C       62.90  81.01  78.83  75.94  72.06  67.27  61.47  54.56  47.36  42.23  38.04
            SP      55.76  80.65  83.20  84.36  85.19  85.81  86.27  86.67  86.92  87.17  87.36
            LP      38.50  63.41  66.98  68.71  69.69  70.44  70.98  71.53  72.04  72.74  72.93

Note: in each cell, the first number (C) is the accuracy for the classification model, the second number (SP) is for the support-based profiling method, and the third (LP) is for the lift-based profiling method. In the typeset table, the bold number indicates the highest accuracy in each cell.
Table 2
The method with the highest accuracy.

                 Sliding window size
# of users    1   10   20   30   40   50   60   70   80   90  100
2            LP   LP   LP   LP   LP   LP   LP   LP   LP   LP   LP
5             C   LP   LP    C   SP   SP   SP   SP   SP   SP   SP
10            C    C   SP   SP   SP   SP   SP   SP   SP   SP   SP
20            C    C   SP   SP   SP   SP   SP   SP   SP   SP   SP
50            C    C   SP   SP   SP   SP   SP   SP   SP   SP   SP
75            C    C   SP   SP   SP   SP   SP   SP   SP   SP   SP
100           C    C   SP   SP   SP   SP   SP   SP   SP   SP   SP

Note: C stands for classification; SP stands for support-based profile; LP stands for lift-based profile.

users and the size of the sliding window). For simplicity, the pattern representation we choose in the experiments is items. In this web browsing dataset, it is the web sites visited in a web session. Fig. 5 illustrates the detailed steps used in the experiments.

In the experiment, we pick the top 10 web sites (X in Figs. 2 and 3) from each user and take their union to form the set of candidate patterns. We also ran the experiments with X set to 20 and 30, but the higher X did not contribute to significantly higher accuracy.

4.3. Evaluation results

Table 1 reports the accuracy of these three methods, varying the number of users and the size of the sliding window. Each number is the average of 20 runs. Table 2 lists the method with the best performance within each cell. Table 3 shows both the absolute and relative improvement of the best profiling method over the classification model. As shown in Tables 1, 2 and 3, the support-based profiling method significantly dominates the classification model when the size of the sliding window gets larger and the number of users increases. The lift-based profiling method performs best when the number of users is small. The classification model performs best when the sliding window size is small. In the cells where the classification method performs the best, the difference in performance between the classification model and the profile-based method is small (lower left corner in Table 3). But the advantage that the profile-based method has over the classification model gets much larger as the size of the sliding window grows (lower right corner in Table 3). Also, as the number of users increases, the decrease in accuracy for the profiling method is not as dramatic when the sliding window size increases (e.g. for sliding window size 100, the accuracy decreases from 95.29% to 87.36% for the support-based profiling method, but the accuracy decreases from 88.56% to 38.04% for the classification method). The highest relative improvement, as shown in Table 3, reaches 135.17%. When the sliding window size is small, the profiling method does not show an advantage because not enough sessions are observed to portray an accurate picture of a user's behavior. For higher numbers of users, say 100, the support-based profiling method achieves accuracy as high as 87.36%. This is a dramatic increase from the baseline. For example, when there are 100 users, the chance of guessing the correct user is only 1%. One implication of this result is that when enough sessions can be obtained, the profiling method is the best choice. Otherwise, the classification model is better.

Figs. 6 and 7 illustrate the trend of the absolute accuracy improvement. As shown in Fig. 6, for all different numbers of users (each corresponding to a line in Fig. 6), the trend is clear that the improvement in accuracy increases with the size of the sliding window. The 3D surface in Fig. 7 shows that the profiling-based methods demonstrate more benefit as the number of users (the difficulty of the problem) and the sliding window size (the amount of information available) increase.

We also compared our methods with Support Vector Machines (SVMs), which have been shown to be highly accurate in classification [24]. While the accuracy improvement was not significant, our methods achieve higher efficiency than SVMs as illustrated in Figs. 8 and 9. The x-axis is the number of users, and the y-axis is the runtime in seconds. We compare the time taken by SVM and the sum of the time taken by the two profile methods (SUM_profile). The comparisons with both decision trees and SVMs demonstrate that profile-based methods for user identification provide a viable and simple alternative to this problem.

5. Conclusions

In this paper, we propose an approach to build user profiles that can be used to identify users. Our experiments show that this approach can be highly effective and efficient. This paper is among the first to study user behavior patterns in web usage data for the purpose of user identification. As discussed in the introduction, there are many potential applications for this approach. Once the profiles are proven to be effective at user identification, they can be further utilized for other applications such as targeted advertising, product recommendation and fraud detection.

While it is a simple yet powerful approach, it also has some simplifying assumptions and limitations. (a) Over time, users' behavioral patterns may change. Even though we have considered updating user profiles as discussed in the Appendix, we have a fixed set of candidate patterns. Essentially, this set of patterns can also evolve over time. Moreover, more recent behavior can play a more
Table 3
Accuracy improvement (profile-based vs. classification method).

                            Sliding window size
# of users  Improvement      1      10      20      30      40      50      60      70      80      90     100
2           Absolute      0.79    0.48    4.64    3.29    0.50    3.87    6.70    4.42    5.99    8.38   10.74
            Relative      0.89    0.49    4.92    3.43    0.51    4.06    7.25    4.66    6.43    9.22   12.13
5           Absolute     −5.09    1.63    1.41   −0.54    1.88    3.82    6.69    6.08   10.17   10.23   13.86
            Relative     −5.96    1.81    1.54   −0.58    2.05    4.25    7.67    6.88   12.01   12.05   17.04
10          Absolute     −6.17   −1.90    0.66    3.44    3.10    5.29    9.69   18.69   17.32   20.25   26.31
            Relative     −7.78   −2.09    0.74    3.94    3.52    6.13   11.77   25.36   22.93   27.78   39.39
20          Absolute     −6.39   −0.85    2.83    4.90   10.41   16.02   18.81   22.70   28.31   35.54   41.95
            Relative     −8.53   −0.97    3.31    5.84   13.22   21.81   26.48   33.73   45.74   64.44   85.70
50          Absolute     −7.80   −0.32    3.38    8.94   12.37   16.71   22.56   30.90   38.37   42.88   51.80
            Relative    −11.19   −0.38    4.11   11.53   16.48   23.39   34.12   52.98   74.98   91.39  135.17
75          Absolute     −7.32   −0.77    3.81    8.32   12.64   16.41   23.62   29.74   38.28   46.36   50.27
            Relative    −11.27   −0.92    4.68   10.69   17.04   23.10   36.78   50.81   75.95  108.40  128.19
100         Absolute     −7.14   −0.36    4.37    8.42   13.13   18.54   24.80   32.11   39.57   44.94   49.32
            Relative    −11.35   −0.44    5.54   11.08   18.22   27.56   40.34   58.85   83.55  106.40  129.64

Note: in each cell, the first number is the absolute accuracy improvement of the best profiling method over the classification model, and the second number is the relative accuracy improvement (both in percentages).
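As a worked example of how the two improvement figures relate to Table 1, the short sketch below recomputes the cell for 50 users and sliding window size 100 from the rounded accuracies reported in Table 1 (classification 38.32, support-based 90.12, lift-based 80.04). The function name is hypothetical. Because the inputs are already rounded averages, the relative figure agrees with the reported 135.17 only up to rounding.

```python
def improvement(classification_acc, profile_accs):
    """Absolute and relative improvement of the best profiling method (in percentages)."""
    best = max(profile_accs)
    absolute = best - classification_acc
    relative = 100.0 * absolute / classification_acc
    return absolute, relative


# 50 users, sliding window size 100, values taken from Table 1.
abs_imp, rel_imp = improvement(38.32, [90.12, 80.04])
print(f"absolute: {abs_imp:.2f}, relative: {rel_imp:.2f}")
```

A negative value in either row therefore means that the classification model was more accurate than both profiling methods in that cell.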
Fig. 6. Accuracy improvement with sliding window size.

Fig. 7. Accuracy improvement with sliding window size and number of users.

Fig. 8. Runtime in seconds (#users 2–100).

Fig. 9. Runtime in seconds (#users 2–20).

important role in determining what a user's future behavior is like. In future work, more weight can be placed on the most recent activities to take this into consideration. (b) The approach works well in situations where there are many repeat visitors. On sites where occasional users dominate, and a majority of the users do not log in, it might be more beneficial to assign users to various groups based on their behavior instead of attempting to uniquely identify them. (c) In the identification stage, we assume that the sessions we observe after time T are from the same anonymous user. There need to be methods (e.g. IP addresses or cookies) to connect consecutive sessions. In cases where this assumption does not hold, identification can only be done on fairly short periods of web activities, which can be quite difficult. (d) Because of the difficulty in obtaining online fraud data where one logs in using someone else's identity, we cannot directly test the effectiveness of the method on fraud detection. But the results in user identification can shed some light on the application of fraud detection, and the method can be slightly modified in the identification stage to detect fraud. (e) In our experiments, the maximum number of users in each sub-dataset is 100. Since we were not able to extend the experiments to a much larger number of users, we cannot conclude that our method can be applied to large scale problems. Therefore, we should either restrict our method to small and medium sized applications or convert the problem itself to a manageable scale (see more discussions about this in the Introduction and the future work below).

In the future, this work can be extended in several directions. (a) More experiments can be conducted to further investigate the conditions under which the profile-based method or a certain classification model works the best. This is always a difficult problem, but some insights can be of value. (b) When data is available, the approach can be applied to more applications, such as fraud detection and product recommendation. (c) In large scale problems with a huge number of users, when direct user identification based on behavioral patterns alone becomes infeasible (e.g. due to lack of sufficient information to distinguish users), behavioral features can be combined with other tools to help identify users. For example, for a large online retailer, several user accounts may share the same IP address (e.g. members of the same family). The retailer might be able to identify the members sharing the same IP address (assuming that there is some known login information from this IP address which can be used to build user profiles). Being able to further identify users within the shared IP address can help achieve high performance for personalization. Another solution for the large scale problem is to focus on the segment level instead of the individual level. The problem can be converted into a segment identification or segment assignment problem. One such approach is to first cluster users into different segments based on profile similarities. Future unknown users can be assigned to the segment that has the highest profile similarity. While the segment-based solution for large scale problems is not proven in this paper, it remains a viable research question in our future work.
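The segment assignment idea raised as future work can be sketched as a nearest-centroid rule. This is only an illustration under stated assumptions: the segments are taken as given (the clustering step is omitted), each segment is summarized by the mean of its members' profiles, and cosine similarity stands in for the unspecified "profile similarity". All names and numbers below are hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two profile vectors (0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def centroid(vectors):
    """Component-wise mean of a list of equal-length profile vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def assign_segment(segments, profile):
    """Name of the segment whose centroid is most similar to the given profile."""
    centroids = {name: centroid(members) for name, members in segments.items()}
    return max(centroids, key=lambda name: cosine(centroids[name], profile))


# Hypothetical segments of known users, each profile over three candidate patterns.
segments = {
    "news_readers": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "shoppers": [[0.1, 0.2, 0.9], [0.0, 0.3, 0.8]],
}
unknown_profile = [0.05, 0.25, 0.7]
assigned = assign_segment(segments, unknown_profile)
print(assigned)
```

Assigning to a segment rather than to an individual keeps the number of classes small regardless of the number of users, which is the motivation given above for this direction.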
Appendix A

Fig. 1A. Updating user profiles.

Fig. 2A. Updating user rankings.

A.1. Updating user profiles

Over time, we observe more and more user activity, and the individual user profiles may change. The procedure in Fig. 1A updates user profiles based solely on incremental sessions. This procedure allows us to calculate the updated user profiles without looking at the sessions before time T. Here, we only provide the update procedure for lift-based profiling, since the update procedure for support-based profiling is similar.

Once the user profiles before time T are calculated, we simply need to incorporate the new sessions between time T and T1 into the old user profiles in order to generate the updated user profiles at time T1. We do not need to follow the profiling algorithm in Fig. 3 and we do not need to use any session before time T. This can be much more efficient.

The set of candidate patterns Pall does not require frequent updates. It can be updated when there is a significant number of additional sessions that need to be considered for generating candidate patterns.

A.2. Updating user rankings

Again, we use time T as the present time. As we observe more and more sessions from an anonymous user after time T, we are able to update the user rankings using the procedure in Fig. 2A. Again, Fig. 2A is based on lift, and the updating procedure based on support can be easily derived, so we do not include the details here.

References

[1] M. Abraham, M. Brown and S. Heyman, Systems and Methods for User Identification, User Demographic Reporting and Collecting Usage Data Usage Biometrics, United States Patent 7260837.
[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, A.I. Verkamo, Fast Discovery of Association Rules, in: U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining, Ch.12, AAAI Press, 1996.
[3] C.C. Aggarwal, Z. Sun, P.S. Yu, Fast algorithms for online generation of profile association rules, IEEE Transactions on Knowledge and Data Engineering 14 (5) (2002) 1017–1028.
[4] G. Adomavicius, A. Tuzhilin, Expert-driven validation of rule-based user models in personalization applications, Data Mining and Knowledge Discovery 5 (1/2) (2001) 33–58.
[5] J. Ahn, P. Brusilovsky, J. Grady, D. He, S.Y. Syn, Open User Profiles for Adaptive News Systems: Help or Harm? Proceedings of the 16th International Conference on World Wide Web, 2007, pp. 11–20.
[6] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth Publisher, 1984.
[7] P. Burge, J. Shawe-Taylor, C. Cooke, Y. Moreau, B. Preneel, C. Stoermann, Fraud Detection and Management in Mobile Telecommunications Networks, Proceedings of the 2nd European Conference on Security and Detection, 1997, pp. 91–96.
[8] I.V. Cadez, P. Smyth, H. Mannila, Probabilistic Modeling of Transaction Data with Applications to Profiling, Visualization, and Prediction, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2001, pp. 37–46.
[9] C. Cortes, D. Pregibon, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 1999, pp. 327–331.
[10] C. Cortes, D. Pregibon, Signature-based methods for data streams, Data Mining and Knowledge Discovery 5 (3) (2001) 167–182.
[11] R.A. Everitt, P.W. McOwan, Java-based internet biometric authentication system, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (9) (2003) 1166–1172.
[12] W. Fan, M.D. Gordon, P. Pathak, Effective profiling of consumer information retrieval needs: a unified framework and empirical comparison, Decision Support Systems 40 (2) (2005) 213–233.
[13] T. Fawcett, F. Provost, Combining Data Mining and Machine Learning for Effective User Profiling, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD), 1996, pp. 8–13.
[14] K.D. Fenstermacher, M. Ginsburg, Mining Client-Side Activity for Personalization, Proceedings of the Fourth IEEE International Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems, 2002.
[15] K.D. Fenstermacher, M. Ginsburg, Client-side monitoring for web mining, Journal of the American Society for Information Science and Technology 54 (7) (2003) 625–637.
[16] D. Godoy, A. Amandi, User profiling in personal information agents: a survey, The Knowledge Engineering Review 20 (4) (2006) 329–361.
[17] J. Goecks, J. Shavlik, Learning Users' Interests by Unobtrusively Observing Their Normal Behavior, Proceedings of the 2000 ACM Intelligent User Interfaces Conference, 2000, pp. 129–132.
[18] C.S. Hilas, J.N. Sahalos, Testing the Fraud Detection Ability of Different User Profiles by Means of FF-NN Classifiers, Proceedings of the 16th International Conference on Artificial Neural Networks, Part II, Lecture Notes in Computer Science, 4132, 2006, pp. 872–883.
[19] J. Hollmen, User profiling and Classification for Fraud Detection in Mobile Communications Networks, PhD dissertation, Helsinki University of Technology, Department of Computer Science and Engineering, 2000.
[20] A. Juels, M. Jakobsson, T. Jagatic, Cache Cookies for Browser Authentication, Proceedings of 2006 IEEE Symposium on Security and Privacy, 2006, pp. 301–305.
[21] J. Li, R. Zheng, H. Chen, From fingerprint to writeprint, Communications of the ACM 49 (4) (2006) 76–82.
[22] D. Madigan, A. Genkin, D.D. Lewis, S. Argamon, D. Fradkin, L. Ye, Author Identification on the Large Scale, Proceedings of the Classification Society of North America, 2005.
[23] K. Mangipudi, R. Katti, A secure identification and key agreement protocol with user anonymity (SIKA), Computers and Security 25 (6) (2006) 420–425.
[24] D. Meyer, F. Leisch, K. Hornik, The support vector machine under test, Neurocomputing 55 (1–2) (2003) 169–186.
[25] B. Mobasher, H. Dai, T. Luo, M. Nakagawa, Discovery and evaluation of aggregate usage profiles for web personalization, Data Mining and Knowledge Discovery 6 (1) (2002) 61–82.
[26] F. Monrose, A. Rubin, Authentication via Keystroke Dynamics, Proceedings of the 4th ACM Conference on Computer and Communications Security, 1997, pp. 48–56.
[27] N. Mushtaq, K. Tolle, P. Werner, R. Zicari, Building and Evaluating Non-obvious User Profiles for Visitors of Web Sites, Proceedings of IEEE International Conference on E-commerce Technology, 2004.
[28] B. Padmanabhan and Y. Yang, Clickprints on the Web: Are There Signatures in Web Browsing Data? Working Paper, http://ssrn.com/abstract=931057.
[29] B. Padmanabhan, Z. Zheng, S. Kimbrough, An Empirical Analysis of the Value of Complete Information for eCRM Models, MIS Quarterly 30 (2) (2006) 247–267.
[30] R. Rafter, B. Smyth, Passive Profiling from Server Logs in an Online Recruitment Environment, Proceedings of the Workshop on Intelligent Techniques for Web Personalization, 2001, pp. 35–41.
[31] T.S. Raghu, P.K. Kannan, H.R. Rao, A.B. Whinston, Dynamic Profiling of Consumers for Customized Offerings Over the Internet: A Model and Analysis, Decision Support Systems 32 (2) (2001) 117–134.
[32] S. Robertson, Understanding Inverse Document Frequency: on theoretical arguments for IDF, Journal of Documentation 60 (5) (2004) 503–520.
[33] Y. Yang, B. Padmanabhan, GHIC: a hierarchical pattern based clustering algorithm for grouping web transactions, IEEE Transactions on Knowledge and Data Engineering 17 (9) (2005) 1300–1304.
[34] J. Zhang, M. Shukla, Rule-Based Platform for Web User Profiling, Proceedings of the Sixth International Conference on Data Mining (ICDM), 2006, pp. 1183–1187.
[35] R. Zheng, J. Li, Z. Huang, H. Chen, A framework for authorship identification of online messages: writing style features and classification techniques, Journal of the American Society for Information Science and Technology (JASIST) 57 (3) (2006) 378–393.
[36] Z. Zheng, B. Padmanabhan, S. Kimbrough, On the existence and significance of data preprocessing biases in web usage mining, INFORMS Journal on Computing 15 (2) (2003) 148–170.
[37] P. Zigoris, Y. Zhang, Bayesian Adaptive User Profiling with Explicit & Implicit Feedback, Proceedings of the 15th ACM International Conference on Information and Knowledge Management, 2006, pp. 397–404.

Yinghui (Catherine) Yang is an assistant professor of the Graduate School of Management at University of California, Davis. She received her Ph.D. in Operations and Information Management from The Wharton School at the University of Pennsylvania, and B.E. in Management Information Systems from The School of Economics and Management at Tsinghua University. Her research is interdisciplinary between data mining and marketing. Her research has been published in top data mining journals (e.g. IEEE Transactions on Knowledge and Data Engineering) and Marketing journals (e.g. Marketing Science).
