Dynamic Neural Portal for Minimal Knowledge Anonymous User Profiling

Tomas Arredondo, Rodrigo Gómez, Daniel Arancibia, César León, Bruno Mundaca
Departamento de Electrónica, Universidad Técnica Federico Santa María, Av. España 1680, Casilla 110-V, Valparaíso, Chile
tarredondo@elo.utfsm.cl

Abstract. This paper assesses the automatic profiling of anonymous internet users of web portals and search engine sites. The objective is to show a web user the content most suited to his tastes without requiring the user to log in or give any explicit personal information. The system tries to determine the user profile based on the user's selections in his current HTTP session. To this end, a neural network is first trained on user internet usage information and is then embedded in a dynamically generated Java page. In order to train the neural network, a web-based survey was developed and user information was obtained from it. A test web portal using this method was implemented and experiments were performed.

Introduction

Optimizing web navigation and mining web usage data are of current research interest [1-3]. We investigate an associated problem: providing personalized content to users by predicting their profiles from minimal information. Personalized content improves the user's perception of the portal or search site. Given the millions of web sites in existence and the fact that, for many different motives (e.g. privacy, lack of time, and lack of interest), many users do not want to give profile information, this problem becomes an important one. The interpretation of minimal information can vary. Milani [4] has proposed a fuzzy similarity method for matching an appropriate target profile using only information from the current HTTP request (e.g. time/date, form keywords and IP location information). We consider minimal information to be the information in the current HTTP session. Towards this goal, we developed a web-based survey that captures the internet usage preferences of various users [5]. We also developed a neural network engine that, given survey results stored in a database, trains a multilayer neural network to predict the user profile (e.g. age, sex) based on their browsing selection preferences (e.g. first, second, third, fourth browsing preferences). The neural network is encoded as a Java applet for use in the portal [6]. The neural engine periodically retrains the neural network and regenerates the Java code.

Architecture and Implementation

In this section, we present the architecture of our system. Figure 1 shows the overall organization of the dynamic neural training system (WeNDi). Based on the information stored in the database the neural engine trains a multilayer neural network [7].

Fig. 1. Neural Training System

2.1 Neural Network

We investigated several neural network configurations with the aim of obtaining profile information with a minimum training error. The configuration finally used is shown in Figure 2. The number of input nodes was set at 4, where each node corresponds to a user's browsing preference (an integer from 0 to 7) as obtained in the questionnaire (first, second, third, and fourth preferences). Three hidden layers were used (with 22, 14, and 10 nodes), and the output layer has two real-valued nodes for profile information (sex and age). Each node uses a sigmoidal activation function. Other architectures (e.g. three layers) were tried but could not converge to a low enough error (under 0.1) given the small training set (only the most recent 120 surveys were used). The surveys were selected at random in groups of 30 to be presented to the network for training.

2.2 Neural Engine

This module is tasked with training the multilayer perceptron (MLP) using gradient descent backpropagation. The neural engine creates a thread that periodically connects to the database and extracts the new user preference patterns used to train the network. The neural engine uses a Euclidean distance measure for error calculations. Figure 3 shows the neural engine pseudocode used to update the neural weights.

Fig. 2. Neural Network
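To make the configuration of Section 2.1 concrete, the following minimal Python sketch builds a 4-22-14-10-2 network with sigmoid units and runs one forward pass. It is an illustration only, not the applet code: the random weight range, the bias terms, and the scaling of the preference codes to the range 0 to 1 (`scale_preferences`) are assumptions that the text does not specify.

```python
import math
import random

# Inputs, three hidden layers, outputs, as described in Section 2.1.
LAYER_SIZES = [4, 22, 14, 10, 2]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_network(sizes, seed=0):
    """Return one (weights, biases) pair per layer, filled with small random values."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
        layers.append((weights, biases))
    return layers


def forward(layers, inputs):
    """Propagate one input vector through every sigmoid layer."""
    activations = list(inputs)
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations


def scale_preferences(preferences):
    """Map category codes 0..7 to the range 0..1 (an assumed preprocessing step)."""
    return [p / 7.0 for p in preferences]


# Untrained network, so the two outputs (sex, age) are meaningless here;
# the call only demonstrates the shape of the computation.
network = init_network(LAYER_SIZES)
outputs = forward(network, scale_preferences([0, 3, 5, 1]))
```

Because every unit is sigmoidal, both outputs lie strictly between 0 and 1, so any mapping from these values to a sex label or an age range has to be defined separately.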

    begin NeuralEngine
        Create_Thread = Trainer
        MLP = Create_MLP(Struct, Sigmoid)                         // create MLP
        CF = CUADRATIC                                            // specify quadratic error cost function
        Create_Algorithm(ERROR, ITE_MAX, LEARNING_RATE)
        Start_Trainer
        while (true)
            DB.Connect
            if (DB.Registers - Actual_Registers > Threshold)      // enough new entries
                Actual_Registers = DB.Registers
                Pattern = DB.getDataPattern                       // retrieve database patterns
                [Weights, Error] = MinimunVector(MLP, CF, Pattern) // train MLP
                Trainer.WriteHTML                                 // generate HTML code using Weights
                Trainer.Report                                    // generate error report using Error
            end if
            DBClose
            Sleep_Trainer                                         // sleep for a while
        end while
    end NeuralEngine

Fig. 3. Neural Engine Pseudocode
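The training step of the pseudocode can be sketched in Python as plain gradient descent backpropagation on a quadratic cost, with a Euclidean distance used to report the error. This is a minimal sketch under stated assumptions, not the authors' engine: the function names, the learning rate, the per-pattern (online) weight updates, the bias handling, and the two toy patterns are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_network(sizes, rng):
    """Each layer is a list of neurons; each neuron is its weights plus a trailing bias."""
    return [
        [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(network, inputs):
    """Return the activations of every layer, starting with the inputs."""
    activations = [list(inputs)]
    for layer in network:
        prev = activations[-1] + [1.0]  # the constant 1.0 feeds the bias weight
        activations.append([sigmoid(sum(w * a for w, a in zip(n, prev))) for n in layer])
    return activations


def train_pattern(network, inputs, targets, rate):
    """One gradient-descent step on the quadratic error of a single pattern."""
    activations = forward(network, inputs)
    # Output deltas for E = 0.5 * sum((out - target)^2) with sigmoid units.
    deltas = [(o - t) * o * (1.0 - o) for o, t in zip(activations[-1], targets)]
    for index in range(len(network) - 1, -1, -1):
        layer = network[index]
        prev = activations[index] + [1.0]
        below = []
        if index > 0:
            # Deltas for the layer below, computed before this layer's weights change.
            below = [
                sum(d * n[j] for d, n in zip(deltas, layer)) * a * (1.0 - a)
                for j, a in enumerate(activations[index])
            ]
        for d, neuron in zip(deltas, layer):
            for j, a in enumerate(prev):
                neuron[j] -= rate * d * a
        deltas = below


def euclidean_error(network, patterns):
    """Mean Euclidean distance between network outputs and targets."""
    total = 0.0
    for inputs, targets in patterns:
        outputs = forward(network, inputs)[-1]
        total += math.sqrt(sum((o - t) ** 2 for o, t in zip(outputs, targets)))
    return total / len(patterns)


def train(network, patterns, rate=0.5, max_iterations=2000, target_error=0.1):
    """Iterate until the error drops below target_error or max_iterations is reached."""
    for iteration in range(1, max_iterations + 1):
        for inputs, targets in patterns:
            train_pattern(network, inputs, targets, rate)
        error = euclidean_error(network, patterns)
        if error < target_error:
            break
    return iteration, error


# Two made-up patterns (scaled preferences -> sex, age), for illustration only.
toy_patterns = [
    ([0.0, 3 / 7, 5 / 7, 1 / 7], [0.0, 0.3]),
    ([6 / 7, 1 / 7, 0.0, 4 / 7], [1.0, 0.6]),
]
network = init_network([4, 22, 14, 10, 2], random.Random(0))
iterations, error = train(network, toy_patterns, max_iterations=200)
```

The stopping rule mirrors the `ERROR` and `ITE_MAX` parameters of the pseudocode: training ends when the error falls below the target (0.1 in Section 2.1) or when the iteration limit is reached.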

2.3 Web Survey

Obtaining data was a fundamental part of the process. Figure 4 shows the questionnaire with the alternatives (an integer from 0 to 7) that were used to obtain user preferences for the web sites that they visited most frequently: Search Engines, Chat/Email, Press, E-Commerce, Downloads, Sports, Beauty, and Games. The chosen preferences are stored in a database together with the user profile (sex and age).

Fig. 4. Web Survey
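As an illustration of how one survey response could be stored, the Python sketch below encodes four ordered preferences as integers from 0 to 7 together with the profile. The record layout, the assignment of codes in the order the categories are listed, and the example response are assumptions made for this sketch; the actual database schema is not described in the text.

```python
from dataclasses import dataclass

# Categories in the order listed in the survey; numbering them 0..7 in this
# order is an assumption.
CATEGORIES = [
    "Search Engines", "Chat/Email", "Press", "E-Commerce",
    "Downloads", "Sports", "Beauty", "Games",
]

# Age ranges as reported for the survey; the index 0..7 is the stored code.
AGE_RANGES = ["0-15", "16-20", "21-25", "26-30", "31-40", "41-50", "51-60", "60+"]


@dataclass(frozen=True)
class SurveyRecord:
    preferences: tuple  # four category codes, most preferred first
    sex: str            # "male" or "female"
    age_group: int      # index into AGE_RANGES


def encode_preferences(names):
    """Translate four category names into their integer codes."""
    if len(names) != 4:
        raise ValueError("exactly four preferences are expected")
    return tuple(CATEGORIES.index(name) for name in names)


# A made-up response, for illustration only.
record = SurveyRecord(
    preferences=encode_preferences(["Press", "Sports", "Search Engines", "Games"]),
    sex="male",
    age_group=AGE_RANGES.index("26-30"),
)
```

Keeping the preferences in selection order matters, because the network inputs correspond to the first, second, third, and fourth preference respectively.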

2.4 Dynamic Portal

After training the neural network, the Neural Engine retrieves the weight set of the resulting network and generates a dynamic HTML page, written in Java, with the portal buttons and the associated weights. The portal uses the Java code with the embedded neural network to determine user profile information (sex, age) based on the order of the user's selections. Figure 5 portrays the dynamic portal generation process. When a user accesses the portal page in a session and selects four buttons, the page determines his profile and can present information appropriate to his interests.

Fig. 5. Dynamic Portal Generation
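The session logic described above can be sketched as a small collector that records button selections in order and asks the embedded network for a profile once four are known. This Python sketch is hypothetical: the class name, the choice to ignore repeated clicks on the same button, and the `stub_predict` placeholder (which stands in for the trained network and returns fixed made-up values) are assumptions, and the real portal runs as Java code in the page.

```python
class SessionProfiler:
    """Collects button selections for one session and predicts once four are known."""

    REQUIRED_SELECTIONS = 4

    def __init__(self, predict):
        self._predict = predict  # callable: list of four category codes -> profile
        self._selections = []
        self.profile = None

    def select(self, category):
        """Record a button click; return the profile once it can be determined."""
        # Assumption: a repeated click on an already selected button is ignored.
        if self.profile is None and category not in self._selections:
            self._selections.append(category)
            if len(self._selections) == self.REQUIRED_SELECTIONS:
                self.profile = self._predict(list(self._selections))
        return self.profile


def stub_predict(selections):
    """Placeholder for the embedded network; returns fixed made-up outputs."""
    return {"sex": 0.5, "age": 0.5, "inputs": selections}


profiler = SessionProfiler(stub_predict)
results = [profiler.select(code) for code in [2, 5, 2, 0, 7]]
```

In this run the profile stays undetermined for the first four clicks (one of them a repeat) and becomes available on the fifth, when the fourth distinct button is selected.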

Experimental Results

A summary of the survey statistics, giving the proportion of each preference obtained by the web questionnaire, is shown in Figure 6. The survey covers eight age ranges (0-15, 16-20, 21-25, 26-30, 31-40, 41-50, 51-60, 60+) and eight preference categories (Search Engines, Chat/Email, Press, E-Commerce, Downloads, Sports, Beauty, and Games).

[Table: proportion of first and second selection preferences (categories 1 to 7) by sex (male, female) and age group (0 to 7).]

Fig. 6. Statistics for Button Selections 1 and 2

Figure 7 shows the Euclidean error of the profile prediction obtained by the neural network for the preference patterns presented to it. As can be seen, the error evolution over the iterations improves with a greater number of patterns. This result is for a single training process (hence the resultant noise).

Conclusions

We remarked that the problem of obtaining profile information from minimal knowledge is important for a variety of reasons to portal and search engine web sites. Our contribution is a demonstration that information from within an HTTP session alone is sufficient to obtain knowledge about the user. The test results show that a small error was obtained given the minimal number of patterns used (120); this reflects that there are specific patterns of Internet usage depending on age and sex. The error obtained should continue to decrease given a higher number of patterns.

[Plot: error evolution versus number of iterations, in four panels for 30, 60, 90, and 120 training patterns. The error axis runs from 0 to 2 with the 0.1 level marked; the iteration axes extend to 1000, 5000, 12000, and 2 x 10^4 respectively.]

Fig. 7. Prediction Error

It can also be seen that when the number of patterns presented is increased, the number of iterations necessary for the network to converge increases substantially. This may be due to the subtler differences between patterns in a larger training set. A possible improvement is to increase the number of user profile elements returned by the network (e.g. education level, geographic location, religious and political orientation), but this would require a greater number of patterns and possibly different network characteristics.

References
[1] Baldonado, M., Chang, C.-C.K., Gravano, L., Paepcke, A.: The Stanford Digital Library Metadata Architecture. Int. J. Digit. Libr. 1 (1997) 108-121
[2] Borzemski, L., Lopatka, P.: Complementing Search Engines with Text Mining. Lecture Notes in Computer Science, LNAI 3533 (2005) 743-745
[3] Abraham, A., Ramos, V.: Web Usage Mining Using Artificial Ant Colony Clustering and Genetic Programming. Congress on Evolutionary Computation (CEC 03) (2003) 1384-1391
[4] Milani, A.: Minimal Knowledge Anonymous User Profiling for Personalized Services. Lecture Notes in Computer Science, LNAI 3533 (2005) 709-711
[5] Survey: http://www.elo.utfsm.cl/wendi/index.html
[6] Portal: http://www.elo.utfsm.cl/wendi/Portal/index.html
[7] Jang, J.-S., Sun, C.-T., Mizutani, E.: Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence. Prentice Hall, NJ (1997)
