International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR), ISSN 2249-6831, Vol. 3, Issue 2, Mar 2013, 113-122. TJPRC Pvt. Ltd.

COMPARISON OF CLUSTERING BASED ON SEARCH ENGINE DATASET USING TANAGRA AND WEKA
PREETI BANSAL & MD. EZAZ AHMED
Department of Computer Science, ITM University, Gurgaon, Haryana, India

ABSTRACT
This paper applies clustering to a dataset of factors that influence search engine optimization, using the Tanagra and Weka data mining tools, and finds the values of the factors that affect relevance. As the number of available Web pages grows, it becomes more difficult for users to find documents relevant to their interests. To a search engine, relevance means more than simply finding a page with the right words. In the early days of the web, search engines didn't go much further than this simplistic step, and their results suffered as a consequence. Over time, engineers at the search engines devised better ways to find valuable results that searchers would appreciate. Today, hundreds of factors influence relevance, several of which we discuss in this paper. Clustering is the partitioning of a data set into subsets (clusters) so that the data in each subset (ideally) share some common trait according to a defined distance measure. It can enable users to find relevant documents more easily and also help them understand the different facets of a query submitted to a web search engine. We use the K-means and EM clustering algorithms, compare the clustering results obtained in Weka and Tanagra, and find the values of factors such as title length, number of backlinks, domain length, and keywords in title that affect search engine optimization.

KEYWORDS: Weka, Tanagra, K-Means, EM, Search Engine, Dataset

INTRODUCTION


With the rapid growth of network information resources, the number of results returned by a search engine is very large. Users have to filter the result list one by one to get the results they want [1]. According to surveys, users generally read no more than five pages of results. How to extract valuable information quickly and efficiently from this mass of network information, and how to organize the presentation of query results, have become objectives that the information industry competes to research and develop [17], [18]. Data mining tools predict future trends and behavior, allowing businesses to make proactive, knowledge-driven decisions [6]. They can answer business questions that traditionally were too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Clustering is one of the key technologies [2], [8]. We used the K-means and EM clustering algorithms in Weka and Tanagra. K-means minimizes the sum of squared distances from each point to its nearest center. Related formulations include the Euclidean k-medians problem, in which the objective is to minimize the sum of distances to the nearest center, and the geometric k-center problem, in which the objective is to minimize the maximum distance from every point to its closest center. The EM algorithm is used to find the maximum likelihood parameters of a statistical model in cases where the equations cannot be solved directly. Typically these models involve latent variables in addition to unknown parameters and known data observations. That is, either there are missing values among the data, or the model can be formulated more simply by assuming the existence of additional unobserved data points [20].


THE K-MEANS ALGORITHM


From a practical point of view, cluster analysis is one of the main tasks of data mining. It is now used in many areas, such as knowledge discovery and pattern recognition. There are many clustering algorithms, of which the best known is K-means. K-means is a very popular algorithm that finds the clusters in a dataset by iterative computation. Its advantages are that it is simple to implement and that it finds at least a locally optimal clustering. The algorithm [3] is composed of the following steps:
1. Initialize k cluster centers as seed points. (These centers can be produced randomly or generated in other ways.)
2. For each sample, find the nearest cluster center, put the sample in that cluster, and recompute the center of the altered cluster. (Repeat n times.)
3. Examine all samples again and put each one in the cluster identified with the nearest center (don't recompute any cluster centers).
4. If the members of each cluster have not changed, stop. If they have changed, go to step 2.
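The steps above can be sketched in a few lines of Python. This is a minimal illustration of the common batch (Lloyd-style) variant, in which all centers are recomputed once per pass; it is not the exact procedure of [3] or the implementation used by Weka or Tanagra. The function name, its parameters, and the toy data are assumptions made for the example.

```python
import math
import random


def kmeans(points, k, seed=10, max_iter=100):
    """Batch (Lloyd-style) K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k distinct samples as the initial cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = [None] * len(points)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Steps 2-3: put every sample in the cluster with the nearest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Step 4: stop when no sample changed cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recompute each center as the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment, iteration


# Toy one-attribute data (made up for illustration, not the paper's dataset).
toy = [(0.0,), (0.1,), (10.0,), (10.1,)]
centers, labels, n_iter = kmeans(toy, k=2)
print(centers, labels, n_iter)
```

On well-separated toy data such as this, the two groups are recovered whichever samples are drawn as seeds; on real data the result can depend on the initial centers, which is why the algorithm is only guaranteed to reach a local optimum.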

EM ALGORITHM
The EM (Expectation Maximization) algorithm is a general method for maximum likelihood estimation of model parameters from "incomplete data". Incomplete data generally arise in two situations. In the first, limitations or errors in the observation process leave the observed data incomplete. In the second, direct optimization of the likelihood function of the parameters is very difficult, and introducing additional (hidden or missing) variables makes the optimization easier; the original observations together with the additional data then form the "complete data", and the original observations naturally become the "incomplete data". The basic principle of EM can be expressed as follows. Y is the observed data, X = (Y, Z) is the complete data, Z is the missing data, and θ denotes the model parameters. The posterior distribution p(θ | Y) of θ given Y is very complicated and difficult to use for statistical calculation. If the missing data Z were known, we could obtain a simpler augmented posterior distribution p(θ | Y, Z) of θ. We can use the simplicity of p(θ | Y, Z) for statistical calculation, then return to examine and improve the assumption about Z, and so transform a complex maximization or sampling problem into a series of simple ones. The greatest advantages of the EM algorithm are its simplicity and stability, and its main purpose is to provide a simple iterative algorithm for calculating the posterior mode [14].
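To make the alternation between guessing Z and re-estimating the parameters concrete, the following sketch runs EM on a two-component one-dimensional Gaussian mixture. Here the observed values play the role of Y, the unknown component membership of each value plays the role of Z, and the weights, means, and variances are the parameters. The function names, the initial guess, the variance floor, and the data are assumptions chosen for illustration; this is not the EM implementation of Weka or Tanagra.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture (illustrative only)."""
    # Crude initial guess: one component at the smallest value, one at the largest.
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability of each latent label given the observation.
        resp = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(2)]
            total = sum(joint)
            resp.append([value / total for value in joint])
        # M-step: re-estimate the parameters from the softly completed data.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(spread, 1e-6)  # floor avoids a zero-variance component
    return weights, means, variances


# Made-up observations drawn around two levels.
observations = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
weights, means, variances = em_gmm_1d(observations)
print(weights, means, variances)
```

Each pass is simple on its own: the E-step is a normalization and the M-step is a set of weighted averages, which is the sense in which EM replaces one hard maximization with a sequence of easy ones.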

ORGANIZATION OF DATA
It is important for a search engine to maintain high-quality websites in its results, as this improves optimization. According to a 2011 search engine report, the factors that affect ranking are as shown in Figure 1 [15]. We built a database of 30 different websites [12], [13] with the following factors: length of title and keywords in title [18] from page-level keyword usage, domain length from domain-level keyword usage, number of backlinks from page-level keyword-agnostic features, and one discrete factor, top-rank website [21], [11]. The title tag and the meta description tag are of critical importance to SEO best practice, but many other factors, such as relevant content [19] and paid advertisements, also affect relevance. Here we discuss some of the on-page factors.


Figure 1: Dataset

Figure 2: Search Ranking Factors

Working with Weka on Dataset Using K-Means
Open Weka, click the Explorer option on the right side, and then open the data file, which is in csv or arff format, under the Preprocess tab. Once the Explorer option is chosen, the screen appears as shown in Figure 3 [5], which clearly indicates the Open file option. We click Open file and choose the dataset. Although Weka provides filters for preprocessing tasks, they are not necessary for clustering in Weka. This is because the Weka SimpleKMeans algorithm automatically handles a mixture of categorical and numerical attributes, and it automatically normalizes numerical attributes when doing distance computations. The Preprocess tab lists all attributes that are present in the dataset. We can select the ones we want to include, or select all.
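The effect of normalizing numerical attributes before computing distances can be illustrated with a small sketch. It assumes min-max scaling of each attribute to the range [0, 1] followed by a Euclidean distance; this is an assumption about the kind of normalization involved, not a reproduction of Weka's code, and the attribute values are made up.

```python
import math


def min_max_normalize(rows):
    """Rescale every numeric column to [0, 1] using its observed min and max."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        tuple((v - lo) / span if span else 0.0 for v, lo, span in zip(row, lows, spans))
        for row in rows
    ]


# Hypothetical rows: (length of title, number of backlinks).
raw = [(30.0, 20000.0), (60.0, 5000.0), (45.0, 12500.0)]
scaled = min_max_normalize(raw)
print(math.dist(raw[0], raw[1]))        # dominated by the backlinks column
print(math.dist(scaled[0], scaled[1]))  # both attributes now carry equal weight
```

Without rescaling, an attribute measured in thousands (backlinks) would swamp one measured in tens (title length), so the clusters would effectively be formed on backlinks alone.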

Figure 3: Opening Page


After this, click the Cluster tab, click the Choose button on the left side, and select the clustering algorithm to apply. We select SimpleKMeans; the resulting screen is shown in Figure 4.

Figure 4: Select Algorithm

Next, click the text box to the right of the "Choose" button to get the pop-up window shown in Figure 5 for editing the clustering parameters. In the pop-up window we enter the number of clusters and leave the value of "seed" as is. The seed value is used to generate a random number which is, in turn, used for making the initial assignment of instances to clusters. Note that, in general, K-means is quite sensitive to how clusters are initially assigned: when we change the seed, the result and the number of iterations may differ. With seed values such as 100, 1000, 500, and 300, the result did not change although the number of iterations did; with values such as 150 and 175, the result also differed.
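The role of the seed can be sketched as follows. The example assumes that the initial centers are drawn at random from the data and that the seed fixes which ones are drawn; the helper name and the values are hypothetical, and Weka's own random number generator will produce different draws for the same seed values.

```python
import random

# Hypothetical one-attribute dataset (values made up for illustration).
values = [12.0, 15.0, 18.0, 40.0, 43.0, 90.0, 95.0]


def initial_centers(data, k, seed):
    """Draw k distinct starting centers; the seed determines which are drawn."""
    return sorted(random.Random(seed).sample(data, k))


for seed in (10, 100, 150):
    print(seed, initial_centers(values, 2, seed))
```

The same seed always reproduces the same starting centers, while a different seed may start the iteration elsewhere, which can change the number of iterations and, for less clearly separated data, the final clusters.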

Figure 5: Choose Parameters

Once the options have been specified, we can run the clustering algorithm. We make sure that the "Use training set" option is selected in the "Cluster Mode" panel, and we click "Start". We can right-click the result set in the "Result list" panel and view the results of clustering in a separate window. The result window shows the centroid of each cluster as well as statistics on the number and percentage of instances assigned to the different clusters. Cluster centroids are the mean vectors of the clusters (so each dimension value in a centroid represents the mean value of that dimension in the cluster). Thus, centroids can be used to characterize the clusters. The result shows that cluster 0 contains 13 websites that are not top rank, with a title length of 59, 6638 backlinks, 5.3 keywords in title, and a domain length of 22.53. Cluster 1 contains 16 websites that are top rank, with a title length of 36.37, 19163 backlinks, 2.9 keywords in title, and a domain length of 29.81, as shown in Figure 6.
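The statistics reported in the result window (cluster size, percentage of instances, and centroid as the per-attribute mean) can be computed as in the sketch below. The function name and the rows are assumptions for illustration; the values are not taken from the paper's dataset.

```python
from collections import defaultdict


def summarize_clusters(rows, labels):
    """Return {cluster: (size, percentage, centroid)} for labelled rows."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    summary = {}
    for label, members in sorted(groups.items()):
        # The centroid is the mean of each attribute over the cluster members.
        centroid = tuple(sum(col) / len(members) for col in zip(*members))
        summary[label] = (len(members), 100.0 * len(members) / len(rows), centroid)
    return summary


# Made-up rows: (length of title, keywords in title).
rows = [(60.0, 5.0), (58.0, 6.0), (36.0, 3.0), (38.0, 3.0)]
labels = [0, 0, 1, 1]
for cluster, (size, pct, centroid) in summarize_clusters(rows, labels).items():
    print(cluster, size, f"{pct:.0f}%", centroid)
```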


Figure 6: Result of Clustering in Weka

Working with Tanagra on Dataset Using K-Means
Open Tanagra, and then open the data file, which is in txt, xls, or arff format. Once the file is opened, the dataset appears as in the screenshot in Figure 7, which clearly indicates the opened file name, research.xls. We then right-click View dataset and choose View from the pop-up menu [7].

Figure 7: Opening Page

Now we select View dataset from the Data visualization tab and drag and drop it onto the dataset. We then select Define status from the Feature selection tab, drag and drop it onto the dataset, right-click Define status, and select Parameters from the pop-up menu, as in Figure 8.

Figure 8: Selection of Input Parameters


Now select the attributes page rank, backlinks, length of title, keywords in title, and domain length as input and press the OK button. From the Statistics tab we choose Univariate continuous stat and drag and drop it onto Define status 1. Using the View command from the pop-up menu, we get results such as the Min and Max values. We want to standardize the variables before performing the K-means approach; the aim is to eliminate the discrepancy of scales between the variables. We add the Standardize component (Feature construction tab) to the diagram and click the View menu. Strictly speaking, this operation is not necessary with Tanagra, because the K-Means component can standardize the variables automatically [9]. We again select Define status from the Feature selection tab, drag and drop it onto the dataset, right-click Define status, select Parameters from the pop-up menu, and select all the standardized variables, such as std_length of title_1. We insert the K-Means component from the Clustering tab, click the Parameters contextual menu, and set the parameters as in Figure 9.
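The standardization step described above can be sketched as a z-score transformation: each variable is centered on its mean and divided by its standard deviation. The function name and data are assumptions for illustration, and the population form of the standard deviation is used here; Tanagra's Standardize component may use a different convention.

```python
import statistics


def standardize(column):
    """Center a column on its mean and divide by its standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population form, chosen for this sketch
    if std == 0:
        return [0.0 for _ in column]  # a constant column carries no scale
    return [(v - mean) / std for v in column]


backlinks = [2000.0, 4000.0, 6000.0, 8000.0]  # made-up values
z = standardize(backlinks)
print(z)
```

After this transformation every variable has mean 0 and standard deviation 1, so backlinks counted in thousands and title lengths counted in tens contribute on the same scale to the distance computation.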

Figure 9: Define the Number of Clusters

We ask for a partitioning into two groups. It is not necessary to normalize the distance because we already use standardized variables. We validate and click the View menu. This gives the TSS (total sum of squares), the WSS (within-cluster sum of squares), and the centroids of the clusters. Tanagra automatically computes and adds a new column to the current dataset. We can visualize it with the View Dataset component (Data visualization tab). We again insert the Define Status component under the K-Means component to distinguish the clusters. We set the computed column cluster k means_1 as Target and select the input attributes, as in Figure 10. Then we add the Group Characterization component from the Statistics tab to get the final result. The result shows that in cluster 1 the websites are top rank, with a title length of 41.54, 14908.42 backlinks, 3.42 keywords in title, and a domain length of 27.46; in cluster 2 the websites are not top rank, with a title length of 105, 2511 backlinks, 13 keywords in title, and a domain length of 18, as shown in Figure 11.
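The TSS and WSS quantities mentioned above can be computed as in the following sketch. TSS sums the squared distances of all rows to the overall mean, and WSS sums the squared distances of each row to the mean of its own cluster. The function names and the points are assumptions made for the example, not Tanagra's output.

```python
def sum_of_squares(rows, center):
    """Sum of squared Euclidean distances from each row to a center."""
    return sum(sum((v - c) ** 2 for v, c in zip(row, center)) for row in rows)


def mean_vector(rows):
    """Per-attribute mean of a list of rows."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))


def tss_wss(rows, labels):
    """Total and within-cluster sums of squares for a labelled partition."""
    tss = sum_of_squares(rows, mean_vector(rows))
    wss = 0.0
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        wss += sum_of_squares(members, mean_vector(members))
    return tss, wss


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]  # made-up points
tss, wss = tss_wss(points, [1, 1, 2, 2])
print(tss, wss, 1 - wss / tss)  # last value: share of variation explained
```

A WSS that is small relative to the TSS indicates compact, well-separated clusters, which is one way to judge the quality of a partition.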

Figure 10: Selection of Discrete Attribute


Figure 11: Final Result

Working with Weka on Dataset Using EM
Open Weka, click the Explorer option on the right side, and then open the data file, which is in csv or arff format [4], [5], under the Preprocess tab, as for K-means. The starting steps are the same as described above for K-means. After this, click the Cluster tab, click the Choose button on the left side, and select the clustering algorithm to apply. We select EM; the resulting screen is shown in Figure 12.

Figure 12: Selection of Algorithm

Once the options have been specified, we can run the clustering algorithm. We make sure that the "Use training set" option is selected in the "Cluster Mode" panel, and we click "Start". We can right-click the result set in the "Result list" panel and view the results of clustering in a separate window. The result shows that there are 11 websites in cluster 0, 14 websites in cluster 1, 1 website in cluster 2, and 3 websites in cluster 3. The websites in cluster 1 are top rank, with a title length of 33, 3941 backlinks, 3.44 keywords in title, and a domain length of 18.61, as shown in Figure 13.

Figure 13: Final Result


Working with Tanagra on Dataset Using EM
Open Tanagra, and then open the data file, which is in txt, xls, or arff format. Once the file is opened, the dataset appears as in the screenshot in Figure 7 above. We select Define status from the Feature selection tab, drag and drop it onto the dataset, right-click Define status, and select Parameters from the pop-up menu, as in Figure 14. We then select the attributes backlinks, length of title, keywords in title, and domain length as input and press the OK button [16]. After this, we drag and drop EM-Clustering from the Clustering tab under Define status, right-click EM-Clustering, set the number of clusters to 2, and press OK, as in Figure 15. We then drag and drop EM-Selection under EM-Clustering and set the start and end values; since we want two clusters, the start value is 1 and the end value is 2. To see the output, double-click EM-Clustering and check the values of the factors in the clusters, as shown in Figure 16.

Figure 14: Selection of Input Parameters

Figure 15: Specify Number of Clusters

Figure 16: Final Result


The result shows that in cluster 1 the websites are top rank, with a title length of 41, 14908 backlinks, 3.42 keywords in title, and a domain length of 27.46; in cluster 2 the websites are not top rank, with a title length of 105, 2511 backlinks, 13 keywords in title, and a domain length of 18.

Comparison of K Means and EM in Tanagra and Weka


Weka
In Weka, the top-rank websites have a title length of 36.37 using K-means and 33.62 using EM; the number of backlinks is 19143 by K-means and 3941.4 by EM; the keywords in title are 2.93 by K-means and 3.42 by EM; and the domain length is 29.81 by K-means and 18.16 by EM. The graphical comparison is shown in Figure 17 and Figure 18.

Tanagra
In Tanagra, the top-rank websites have a title length of 41.54 using K-means and 41.5 using EM; the number of backlinks is 14908.42 by K-means and 14908.9 by EM; the keywords in title are 3.42 by both K-means and EM; and the domain length is 27.46 by K-means and 27.41 by EM. The graphical comparison is shown in Figure 19 and Figure 20. Because the values are nearly the same, the line shown in Figure 19 appears in a single colour.
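The agreement between the two algorithms in each tool can be quantified with a simple relative gap, as in the sketch below. The numbers are the top-rank cluster values stated in this comparison, except that the Tanagra K-means title length is taken as 41.54 from the Tanagra K-means results reported earlier; the gap measure itself is an assumption introduced for illustration and is not a statistic used in the study.

```python
# Top-rank cluster values as (K-means, EM) pairs for each factor.
weka = {
    "length of title": (36.37, 33.62),
    "backlinks": (19143.0, 3941.4),
    "keywords in title": (2.93, 3.42),
    "domain length": (29.81, 18.16),
}
tanagra = {
    "length of title": (41.54, 41.5),
    "backlinks": (14908.42, 14908.9),
    "keywords in title": (3.42, 3.42),
    "domain length": (27.46, 27.41),
}


def relative_gap(a, b):
    """Absolute difference as a fraction of the larger of the two values."""
    return abs(a - b) / max(abs(a), abs(b))


def largest_gap(table):
    """Largest K-means versus EM gap over all factors in a table."""
    return max(relative_gap(km, em) for km, em in table.values())


for name, table in (("Weka", weka), ("Tanagra", tanagra)):
    print(f"{name}: largest K-means vs EM gap = {largest_gap(table):.1%}")
```

Under this measure the Tanagra values differ by well under one percent for every factor, whereas the Weka values differ substantially, most of all for backlinks.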

CONCLUSIONS
As we continue to contend with the huge volume of data on the web, we should identify many more influencing factors in order to optimize results. From this comparison we conclude that Tanagra is the better tool in comparison with Weka, as its results for the two algorithms, K-means and EM, are nearly the same [10]. In future we will work on social signals, as this feature has great importance in optimization.

Figure 17: Comparison of Optimized Factors in Weka

Figure 18: Comparison of Backlinks in Weka

Figure 19: Comparison of Optimized Factors in Tanagra

Figure 20: Comparison of Backlinks in Tanagra


ACKNOWLEDGEMENTS
We thank Mrs. Latika Singh for her invaluable comments and suggestions to improve the manuscript.

REFERENCES
1. Hongwei Yang, A Document Clustering Algorithm for Web Search Engine Retrieval System, School of Software, Yunnan University, Kunming 650021, China.
2. S. Kantabutra, Efficient Representation of Cluster Structure in Large Data Sets, Ph.D. Thesis, Tufts University, Medford MA, September 2001.
3. Wang Jun, OuYang Zheng-Zheng, The Research of K-Means Clustering Algorithm Based on Association Rules.
4. http://maya.cs.depaul.edu/classes/ect584/weka/ k- means.html.
5. http://www.cs.ccsu.edu/~markov/weka-tutorial.pdf.
6. http://thesai.org/Downloads/Volume3No4/Paper_20Knowledge_Discovery_in_Health_Care_Datasets Using_Data_Mining_Tools.pdf.
7. Jessica Enright, Jonathan Klippenstein, Tanagra: An Evaluation.
8. C. Romero, S. Ventura, "Educational data Mining: A Survey from 1995 to 2005", Expert System with Applications (33), pp. 135-146, 2007.
9. http://eric.univ-lyon2.fr/~ricco/tanagra/fichier/ tanagra_etles autres_KMeans.pdf.
10. R. Kannan, S. Vempala, and Adrian Vetta, On Clusterings: Good, Bad, and Spectral, Proc. of the 41st Foundations of Computer Science, Redondo Beach, 2000.
11. http://klageswebdesign.com/seo-blog/2011/06/search-ranking-factors-released-what-you- need-to-know/.
12. http://www.backlinkswatch.com.
13. http://www.submitexpress.com/cgi-bin/analyzer/meta.pl.
14. http://cptra.ln.edu.hk/~mlwong/conference/isda2002.pdf
15. http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=04381759
16. http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/ en- Tanagra_EM_Clustering.pdf
17. http://www.seomoz.org/beginners-guide-to-seo.
18. http://www.ieee.org/about/webteam/resources/search_ optimization.html.
19. http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35291
20. http://en.wikipedia.org/wiki/Expectation%E2%80%93 maximization algorithm.
21. http://www.seomoz.org/article/search-ranking-factors