
Search Engines
Text Mining in Action


Himanshu Joshi Roll no. 1114025

Contents
Introduction to Text Mining
    What is Text Mining?
    Why Did I Choose Text Mining?
Comparison with Data Mining
    Similarities
    Dissimilarities
Internet Industry
    History
    The Uprising of Google
    Social Media and Micro-blogging
Search Engines
    What is a Search Engine?
    Types of Search Engines
        Web Search Engines
        Vertical Search Engines
        Semantic Search Engines
Application of Text Mining to Various Types of Search Engines
    Process of Retrieval in Web Search Engine
    Usage of Text Mining in Search Engines
        Text categorization (faceted search systems)
        Contextualized clustering
    Concepts in Action
        Text Categorization
        Contextualized Clustering
    Usage of Text Mining in Semantic Search and Natural Language Processing
Conclusion and Learning

Search Engines Text Mining in action by Himanshu Joshi

Introduction to Text Mining


What is Text Mining?
Text mining is a burgeoning field that attempts to glean meaningful information from natural language text [1]. In simple terms, it is a way to extract meaning from text. The meaning that is extracted is put to a particular purpose, depending on why the text is being mined. Compared with data stored in databases and tables, data stored as text is much more difficult to analyze by computer, because the algorithms needed to extract it can be highly sophisticated. However, the payoff is good, as text is the simplest and most common means of data exchange. Text mining usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities) [2].

Why Did I Choose Text Mining?


With the advance of technology and the popularity of micro-blogging and social media, the patterns of the internet are changing. Search engines are moving towards providing more relevant data, semantic search engines are coming up, and emotions are freely expressed on social websites such as Twitter and Facebook. In the coming times, searching a website like Twitter or Facebook will therefore provide more sensible, user-provided content than normal web search engines like Google, Bing or Yahoo.
[1] Text Mining, Ian H. Witten, Computer Science, University of Waikato, Hamilton, New Zealand
[2] Text Mining, Wikipedia, The Free Encyclopedia

Being a technocrat and also the creator of a search engine, I do not want to miss the developments and thus have chosen this topic so that I may gain some knowledge about the changes.

Comparison with Data Mining


Similarities
Just as data mining looks for patterns in data, text mining looks for patterns in a chunk of text. This inevitably means that both techniques attempt to mine context out of reference. Another important similarity between the two is that the information to be extracted should be potentially useful. There is no point in mining for data that is of little or practically no importance to the user. The definition of 'potentially useful' is, however, different for data mining and text mining. For data mining, it means that the information extracted should be comprehensible, i.e. it helps to explain the data. In the case of text mining, however, the data itself is comprehensible without the help of machines. Still, the strong similarity between text mining and data mining remains.

Dissimilarities
Even though the two techniques are similar conceptually, they have some differences between them. The information to be extracted in data mining is hidden, unknown and could hardly be extracted without using the automatic techniques of data mining. On the contrary, the data in text mining is still useful if used alone and can very easily be comprehended without the usage of sophisticated techniques and technology. It is the context that is missing in the data.

Internet Industry
History
The Internet was the result of some visionary thinking by people in the early 1960s that saw great potential value in allowing computers to share information on research and development in scientific and military fields. The Internet, then known as ARPANET, was brought online in 1969 under a contract let by the renamed Advanced Research Projects Agency (ARPA) which initially connected four major computers at universities in the southwestern US (UCLA, Stanford Research Institute, UCSB, and the University of Utah). The early Internet was used by computer experts, engineers, scientists, and librarians. There was nothing friendly about it. There were no home or office personal computers in those days, and anyone who used it, whether a computer professional or an engineer or scientist or librarian, had to learn to use a very complex system. E-mail was adapted for ARPANET by Ray Tomlinson of BBN in 1972. He picked the @ symbol from the available symbols on his teletype to link the username and address.

The Internet matured in the 70's as a result of the TCP/IP architecture first proposed by Bob Kahn at BBN and further developed by Kahn and Vint Cerf at Stanford and others throughout the 70's. It was adopted by the Defense Department in 1980, replacing the earlier Network Control Protocol (NCP), and universally adopted by 1983. In 1986, the National Science Foundation funded NSFNet as a cross country 56 Kbps backbone for the Internet. They maintained their sponsorship for nearly a decade, setting rules for its non-commercial government and research uses.

As the commands for e-mail, FTP, and telnet were standardized, it became a lot easier for non-technical people to learn to use the nets. It was not easy by today's standards by any means, but it did open up use of the Internet to many more people, in universities in particular. Other departments besides the libraries, computer, physics, and engineering departments found ways to make good use of the nets: to communicate with colleagues around the world and to share files and resources. While the number of sites on the Internet was small, it was fairly easy to keep track of the resources of interest that were available. But as more and more universities and organizations (and their libraries) connected, the Internet became harder and harder to track. There was more and more need for tools to index the resources that were available. [3]

This is where web directories like Yahoo!, Excite and DMOZ came into the picture. However, the user himself had to find the category he was looking for in such directories. Moreover, such directories were not updated in real time. To address these problems, people started building what are known today as search engines; AltaVista, Ask Jeeves, Google and others began this way.
[3] A Brief History of the Internet by Walt Howe (http://www.walthowe.com/navnet/history.html)

The Uprising of Google


Google began in January 1996 as a research project by Larry Page and Sergey Brin when they were both PhD students at Stanford University in California. While conventional search engines ranked results by counting how many times the search terms appeared on the page, the two theorized about a better system that analyzed the relationships between websites. They called this new technology PageRank, where a website's relevance was determined by the number of pages, and the importance of those pages, that linked back to the original site. A small search engine called "RankDex" from IDD Information Services, designed by Robin Li, had been exploring a similar strategy for site-scoring and page ranking since 1996. The technology in RankDex would be patented and used later when Li founded Baidu in China.

Page and Brin originally nicknamed their new search engine "BackRub", because the system checked backlinks to estimate the importance of a site. Eventually, they changed the name to Google, originating from a misspelling of the word "googol", the number one followed by one hundred zeros, which was picked to signify that the search engine wants to provide large quantities of information for people. Originally, Google ran under the Stanford University website, with the domain google.stanford.edu. The domain name for Google was registered on September 15, 1997, and the company was incorporated on September 4, 1998. It was based in the garage of a friend (Susan Wojcicki) in Menlo Park, California. Craig Silverstein, a fellow PhD student at Stanford, was hired as the first employee.

In May 2011, the number of unique visitors to Google surpassed 1 billion for the first time, an 8.4 percent increase on the 931 million unique visitors of a year earlier. Google specializes in searching the internet for links, images, videos and other multimedia files, and has made its mark as the search engine with the best search results.
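The PageRank idea described above (a page matters if many pages, and important pages, link to it) can be illustrated with a small power-iteration sketch. This is only a toy version under stated assumptions: the function name pagerank, the damping value 0.85, the fixed iteration count and the four-page toy_web graph are all chosen for illustration and are not taken from Google's actual system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with every page equally important.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: a, b and d all link to c; c links back to a.
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
best_page = max(ranks, key=ranks.get)
```

In this toy graph, page c collects the most backlinks and ends up with the highest score, and page a outranks b and d because its single backlink comes from the important page c.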

Social Media and Micro-blogging


A recent trend that is building up is that of social media and micro-blogging. The amount of content being uploaded to the internet per second runs into terabytes, and the databases of search engines need to be updated very quickly. With the advent of websites like Twitter and Facebook, there is a shift in the presence of relevant data towards the social media platforms. Search engines based on Twitter are quickly making their presence felt on the internet, and Facebook is also planning a search engine to find public content across its platform. With these developments, data mining techniques such as Text Mining and Web Mining have become highly relevant and heavily relied upon.

Search Engines
What is a Search Engine?
A Search Engine is an internet-based website that allows a user to search for content on the internet. The data to be searched can be in any form: text, links, images or even natural language processed data. A search engine essentially uses a web crawler, or spider (a program that crawls various web pages on the internet), and extracts meaningful data from the web pages using complex algorithms and concepts of text mining. The data extracted is stored on servers and is retrieved for the user when he enters a query on the website of the search engine.
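The store-then-retrieve cycle described above is often implemented with an inverted index, which maps each word to the pages that contain it. The sketch below is a minimal illustration, assuming the crawler has already fetched the page text; the function names build_index and search and the example.org URLs are hypothetical.

```python
from collections import defaultdict


def build_index(pages):
    """Map every word to the set of page URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical crawled pages (URL -> extracted text).
crawled_pages = {
    "example.org/a": "text mining extracts meaning from text",
    "example.org/b": "search engines crawl web pages",
    "example.org/c": "text search engines index pages",
}
index = build_index(crawled_pages)
hits = search(index, "search engines")
```

Building the index once at crawl time means a query only has to look up a few word entries instead of scanning every stored page.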

Types of Search Engines


There are many types of search engines. A few of them are listed here:

Web Search Engines
A web search engine is designed to search for information on the World Wide Web and FTP servers. The search results are generally presented in a list of results often referred to as SERPs, or "search engine results pages". The information may consist of web pages, images, information and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, search engines also maintain real-time information by running an algorithm on a web crawler [4].

Vertical Search Engines
A vertical search engine, as distinct from a general web search engine, focuses on a specific segment of online content. The vertical content area may be based on topicality, media type, or genre of content. Common verticals include shopping, the automotive industry, legal information, medical information, and travel. In contrast to general Web search engines, which attempt to index large portions of the World Wide Web using a web crawler, vertical search engines typically use a focused crawler that attempts to index only Web pages that are relevant to a pre-defined topic or set of topics.

[4] Web Search Engine, Wikipedia, The Free Encyclopedia

Some vertical search sites focus on individual verticals, while other sites include multiple vertical searches within one search engine. Vertical search offers several potential benefits over general search engines:
- Greater precision due to limited scope
- Leverage of domain knowledge, including taxonomies and ontologies
- Support for specific, unique user tasks [5]

Semantic Search Engines
Semantic search seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear in the searchable dataspace, whether on the Web or within a closed system, to generate more relevant results. There are two major forms of search: Navigational and Research. In navigational search, the user is using the search engine as a navigation tool to reach a particular intended document. Semantic Search is not applicable to navigational searches. In Research Search, the user provides the search engine with a phrase which is intended to denote an object about which the user is trying to gather information. There is no particular document which the user knows about and is trying to get to. Rather, the user is trying to locate a number of documents which together will give him/her the information s/he is trying to find. Semantic Search lends itself well here.

Rather than using ranking algorithms such as Google's PageRank to predict relevancy, Semantic Search uses semantics, or the science of meaning in language, to produce highly relevant search results. In most cases, the goal is to deliver the information queried by a user rather than have a user sort through a list of loosely related keyword results. [6]

[5] Vertical Search Engines, Wikipedia, The Free Encyclopedia (http://en.wikipedia.org/wiki/Vertical_search)

Application of Text Mining to Various Types of Search Engines


We now move on to investigate the functioning of a search engine. To keep the report simple (the area of research would otherwise become too wide), we concentrate primarily on Web Search Engines and Semantic Search Engines. A vertical search engine more or less inherits its ingredients from a web search engine and can thus be skipped. The functionality of a search engine is broadly based on two data mining techniques:
a. Web Mining or Link Mining, wherein the spiders collect data from various websites. The data is passed to the text mining tool to extract meaning out of it, and to the web mining tool to extract other potential information from it (such as web links, images etc.).
b. Text Mining, for fetching the data when a user submits a query.
Our study here is primarily concerned with Text Mining, and thus we will assume that the web search spider has collected the data and passed it to the text mining tool.

Process of Retrieval in Web Search Engine


The process flow of retrieval of data in a web search engine consists of the following steps:

[6] Semantic Search, Wikipedia, The Free Encyclopedia (http://en.wikipedia.org/wiki/Semantic_search)

Stop Word Removal

All common usage words like 'a', 'an', 'the' are removed from the input string. If the search query is given in quotations (""), stop words are not removed for phrase search.
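The stop word step can be sketched in a few lines. This is a minimal illustration, assuming a tiny hand-picked stop word list; the STOP_WORDS set and the function name remove_stop_words are hypothetical, and real engines use much longer lists.

```python
# A small illustrative stop word list; real engines use far larger ones.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "is", "to", "and"}


def remove_stop_words(query):
    """Return the query terms, dropping stop words unless the query is quoted."""
    query = query.strip()
    if len(query) >= 2 and query[0] == '"' and query[-1] == '"':
        # Phrase search: keep every word, including stop words.
        return query[1:-1].lower().split()
    return [word for word in query.lower().split() if word not in STOP_WORDS]


plain_terms = remove_stop_words("the history of the internet")
phrase_terms = remove_stop_words('"the history of the internet"')
```

The unquoted query is reduced to the terms history and internet, while the quoted query keeps all five words so that the exact phrase can be matched.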

Stemming

Similar words are 'stemmed' down to their root. Mostly the root is a noun or a verb. E.g. Cat, Catty and Catlike are all stemmed down to the word 'cat'. Words with the same stem are treated as synonyms. This process is called conflation.
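The Cat/Catty/Catlike example can be reproduced with a deliberately naive suffix stripper. This sketch is not a production stemmer: the SUFFIXES tuple and the doubled-consonant rule are assumptions made only so that the example words conflate, and established algorithms such as the Porter stemmer use far more careful rules.

```python
# Suffixes to strip, checked in this order (illustrative, not exhaustive).
SUFFIXES = ("like", "ing", "ed", "s", "y")


def stem(word):
    """Strip one known suffix, then collapse a doubled final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if at least three letters of root remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # "catt" -> "cat": drop one letter of a doubled final consonant.
    if len(word) > 3 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


def conflate(words):
    """Group words that share the same stem."""
    groups = {}
    for word in words:
        groups.setdefault(stem(word), []).append(word)
    return groups


groups = conflate(["Cat", "Catty", "Catlike", "dogs"])
```

Here the three cat-related words end up in one group under the stem cat, which is the conflation described above.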

Database Search

The database is mined. First the complete phrase is searched, and then the individual words are searched. The data is retrieved and shown to the user.
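The "phrase first, then words" order of the database search step can be sketched as follows. This is a simplified illustration over an in-memory dictionary rather than a real database; the function names contains_phrase and search_documents and the sample documents are hypothetical, and punctuation handling is left out.

```python
def contains_phrase(tokens, phrase_tokens):
    """True if phrase_tokens occurs as a consecutive run inside tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens
               for i in range(len(tokens) - n + 1))


def search_documents(documents, query):
    """Return document ids: full-phrase matches first, then single-word matches."""
    query_tokens = query.lower().split()
    if not query_tokens:
        return []
    phrase_hits, word_hits = [], []
    for doc_id, text in documents.items():
        tokens = text.lower().split()
        if contains_phrase(tokens, query_tokens):
            phrase_hits.append(doc_id)
        elif any(word in tokens for word in query_tokens):
            word_hits.append(doc_id)
    return phrase_hits + word_hits


# Hypothetical stored documents (id -> text).
stored_documents = {
    1: "Mining data from text",
    2: "Text mining extracts meaning",
    3: "Search engines",
}
results = search_documents(stored_documents, "text mining")
```

Document 2 contains the complete phrase and is listed first; document 1 only contains the individual words and follows it; document 3 matches nothing and is left out.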

The process of retrieval might look easy, but it is probably the most difficult process for a search engine. The task is difficult for the following reasons:
- Overwhelming information in typical user query results. The data to be mined is huge, and the number of combinations to be mined makes it more difficult.
- Results are only partly related to each other. It is quite likely that two results that appear at rank 1 and rank 2 in the search engine results page (SERP) are not linked at all. An example will be given in the upcoming section.
- Many users investigate only the two or three top ranked documents. Thus a search engine needs an algorithm that puts the most relevant documents at the top. This essentially means that the efficiency of retrieval and the efficiency of indexing both need to be at their best.
- Traditional lists of ranked documents do not seem to be sufficient for exploratory search tasks.

Usage of Text Mining in Search Engines


Text mining is primarily used in two areas in web search engines: text categorization (faceted search systems) and contextualized clustering.

Text categorization (faceted search systems)
Faceted search (sometimes known as faceted browsing or faceted navigation) is a technique for accessing and exploring a collection of information (database, catalogue, repository). It presents the user with a faceted (layered, categorized, grouped) classification, allowing them to explore by filtering available information. A faceted search system allows each item in the catalogue/repository/database to be assigned multiple classifications, enabling the classifications to be ordered in multiple ways, rather than in a single, pre-determined, taxonomic order.

Facets can be derived manually from analysis of the item, or from pre-existing fields in the item's metadata such as author, descriptor, language, and format. The former enables facets to be derived and sourced via a range of user and content research methods. The latter permits existing items in a catalogue/repository/database to have this extra metadata extracted, mapped and presented as a navigation facet, without extra data input being needed. [7]

Key benefits of faceted Search:

Enhanced feedback - users receive an overview of their search results broken down by category that they can then use for refining their search.

Informed choices - users know in advance how many items are available in each category, so they can search first in categories more likely to bring them a successful result. Categories with zero items in a given field are usually not shown; hence the user is very unlikely to encounter a 'no results' outcome.

Users can select their own searching path or hierarchy based on the information presented to them and can add or remove filters or facets at will.
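The benefits listed above (category counts shown in advance, no empty categories, filters added at will) can be illustrated with a small sketch over item metadata. The catalogue entries, the field names and the function names facet_counts and apply_filters are all hypothetical and serve only to show the idea.

```python
from collections import Counter

# Hypothetical catalogue: each item carries several metadata fields (facets).
CATALOGUE = [
    {"title": "Text Mining Basics", "author": "A. Rao",
     "language": "English", "format": "book"},
    {"title": "Search Engine Design", "author": "B. Klein",
     "language": "German", "format": "book"},
    {"title": "Clustering Web Results", "author": "A. Rao",
     "language": "English", "format": "article"},
]


def facet_counts(items, facet):
    """Count the items per value of one facet; absent values never appear."""
    return Counter(item[facet] for item in items if facet in item)


def apply_filters(items, filters):
    """Keep only the items that match every selected facet value."""
    return [item for item in items
            if all(item.get(field) == value for field, value in filters.items())]


language_counts = facet_counts(CATALOGUE, "language")
english_items = apply_filters(CATALOGUE, {"language": "English"})
english_formats = facet_counts(english_items, "format")
```

Because the counts are computed from the items that are actually left after filtering, a value with zero items simply never shows up, which is why the user is unlikely to hit a 'no results' page.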

[7] Faceted Search, EIFL (http://www.eifl.net/faceted-search)

Contextualized clustering
Contextualized clustering is a method to group similar search results together, clustering documents or web pages according to the terms found in them. The core advantage of providing contextualized results is that the user can more easily find the relevant content he is looking for, which gives a more meaningful search experience. A newly developed system, HOBSearch, makes use of suffix tree clustering to overcome many of the weaknesses of traditional clustering approaches. Using result snippets rather than full documents, HOBSearch both speeds up clustering substantially and manages to tailor the clustering to the topics indicated in the user's query. An inherent problem with clustering, though, is the choice of cluster labels.


Concepts in Action
Text Categorization
Most of the examples I will quote here are from the search engine which I have created: Molu The Search Spider (http://www.themolu.com). Text categorization can be applied in various ways in a search engine. Some of them are shown below:

The above screenshot is from an upcoming version of the website. The concept lets the user drill the search down to a single website, narrowing the scope to the websites he trusts.


Using a data mining technique called faceted categorization, the data item (news in this case) is assigned two properties: a date and a relevance. Both of them have their own hierarchy according to popularity and other factors. When either of them is invoked, the relevant search results are mined and returned to the user. This concept is widely applied in search engines these days. The best example is FlickR.com, where the user gets to see the search results by Date/Relevance/Interestingness. FlickR was also one of the first companies to apply this concept.

Contextualized Clustering
Search engines today have become intelligent. Using text mining techniques, they group the results into different categories and present them to the user. The method of clustering is simple to describe but difficult to apply. The density of tags (the different words extracted after stop word removal and stemming) in one particular record (how many words of similar origin match in a particular page) decides the group in which a particular page will fall. A very nice implementation of the technique can be found at Yippy (http://search.yippy.com/). The search engine classifies the results by groups (city, software, india etc.), sources (Bing, Google etc.), sites (.com, .org etc.) and time of update.
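The tag-density rule described above can be sketched as follows. This is only one possible reading of that rule, assuming that each group is described by a fixed set of terms and that the page's tags have already been through stop word removal and stemming; the GROUPS dictionary and the function names are hypothetical and do not describe how Yippy or any other engine actually works.

```python
# Hypothetical groups, each described by a set of characteristic terms.
GROUPS = {
    "city": {"delhi", "mumbai", "city", "travel"},
    "software": {"software", "download", "code", "program"},
}


def tag_density(tags, group_terms):
    """Fraction of a page's tags that belong to the group's terms."""
    if not tags:
        return 0.0
    return sum(1 for tag in tags if tag in group_terms) / len(tags)


def assign_group(tags, groups):
    """Put the page into the group with the highest tag density."""
    best_name, best_density = "other", 0.0
    for name, terms in groups.items():
        density = tag_density(tags, terms)
        if density > best_density:
            best_name, best_density = name, density
    return best_name


# Tags are assumed to be already stop-word-filtered and stemmed.
page_group = assign_group(["delhi", "travel", "guide"], GROUPS)
```

A page whose tags match no group at all falls into a catch-all group called other, which is one simple way to handle results that do not fit any cluster.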


Some search engines also present the results in tree based form making it much easier for the user to use them:


Usage of Text Mining in Semantic Search and Natural Language Processing


Semantic Search requires a lot of text mining in order to return accurately what the user is looking for. The major difference between a semantic or natural language processing search engine and a web search engine is that a semantic search engine needs to understand the meaning of the search query and answer accordingly, whereas a web search engine just needs to take the query as a whole and give results that match it. A Natural Language Linguistic Extractor (NLLE) automatically identifies the concepts structuring the texts. Each significant word is a semantic chain. One word suffices to find all the documents containing that word and its equivalents using plain English (or French, Spanish, etc.). For instance, a query on the word "election" will retrieve documents containing the words "campaigning", "ballot" and "vote", even if the word "election" does not occur explicitly in the source document.

The number of Semantic Search engines is increasing every day. However, no search engine of today can be called 100% semantic. The search engine that comes very close to being semantic is Wolfram Alpha. Upon searching "Who is Mahatma Gandhi?" in the search engine, the following results were obtained:


The results are intelligent and better than those of a normal web search. This is a very good example of how the Information Extraction technique has been used in selecting the output. Another question put to the search engine was "Why is sky blue in colour?" As expected, the result was a plain answer rather than a cluster of web search results pointing to web pages containing the keywords "sky", "blue" and "colour".
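The "election" example given earlier in this section, where one word retrieves documents containing its equivalents, can be sketched as a simple concept lookup. This is a heavily simplified illustration, assuming a hand-written concept table; the CONCEPTS dictionary, the function names and the sample documents are hypothetical, and a real Natural Language Linguistic Extractor derives such relations automatically rather than from a fixed table.

```python
# Hypothetical concept table: a query word and its semantic equivalents.
CONCEPTS = {
    "election": {"election", "campaigning", "ballot", "vote"},
}


def expand_query(word):
    """Return the word together with its known equivalents."""
    word = word.lower()
    return CONCEPTS.get(word, {word})


def semantic_search(documents, word):
    """Return ids of documents containing the word or any of its equivalents."""
    terms = expand_query(word)
    return [doc_id for doc_id, text in documents.items()
            if terms & set(text.lower().split())]


# Hypothetical documents; none of them contains the word "election" itself.
sample_documents = {
    "d1": "voters cast a ballot on tuesday",
    "d2": "campaigning ended last week",
    "d3": "the sky is blue",
}
election_hits = semantic_search(sample_documents, "election")
```

Documents d1 and d2 are returned even though neither contains the query word, while a plain keyword match on "election" would have returned nothing.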


Using semantic technologies and mathematical computation power, Wolfram Alpha was able to calculate and predict a mathematical function correctly:


Conclusion and Learning


Through this study, I came to understand the working of a search engine better. The amount of data posted on the internet every day is huge. To make sense of the data and make it available to users, search engines must be able to index it faster and in a sensible manner. Techniques like web mining and text mining come in handy in such a scenario. Users are moving towards customized search and semantic search. With the advent of semantic search engines like Hakia, True Knowledge and Wolfram Alpha, the scope for text mining has increased further. The results of mining are now used to get the meaning out of the query, and not just the context, as was the case before. The results are improving day by day, and search engines are far more advanced than they were some 10 years ago. This industry is definitely one to watch in the next 10 years.
