Contents
Introduction to Text Mining
  What is Text Mining?
  Why Did I Choose Text Mining?
Comparison with Data Mining
  Similarities
  Dissimilarities
Internet Industry
  History
  The Uprising of Google
  Social Media and Micro-blogging
Search Engines
  What is a Search Engine?
  Types of Search Engines
    Web Search Engines
    Vertical Search Engines
    Semantic Search Engines
Application of Text Mining to Various Types of Search Engines
  Process of Retrieval in Web Search Engine
  Usage of Text Mining in Search Engines
    Text categorization (faceted search systems)
    Contextualized clustering
  Concepts in Action
    Text Categorization
    Contextualized Clustering
  Usage of Text Mining in Semantic Search and Natural Language Processing
Conclusion and Learning
Text Mining, Ian H. Witten, Computer Science, University of Waikato, Hamilton, New Zealand
Text Mining, Wikipedia, The Free Encyclopedia
As a technocrat and the creator of a search engine, I do not want to miss the developments, and I have therefore chosen this topic to gain some knowledge about the changes.
Dissimilarities
Although the two techniques are conceptually similar, they differ in some respects. The information to be extracted in data mining is hidden and unknown, and could hardly be extracted without the automatic techniques of data mining. In text mining, by contrast, the data is still useful when taken alone and can easily be comprehended without sophisticated techniques and technology. What is missing from the data is the context.
Internet Industry
History
The Internet was the result of some visionary thinking by people in the early 1960s who saw great potential value in allowing computers to share information on research and development in scientific and military fields. The Internet, then known as ARPANET, was brought online in 1969 under a contract let by the renamed Advanced Research Projects Agency (ARPA), which initially connected four major computers at universities in the southwestern US (UCLA, Stanford Research Institute, UCSB, and the University of Utah). The early Internet was used by computer experts, engineers, scientists, and librarians. There was nothing friendly about it. There were no home or office personal computers in those days, and anyone who used it, whether a computer professional, an engineer, a scientist, or a librarian, had to learn to use a very complex system. E-mail was adapted for ARPANET by Ray Tomlinson of BBN in 1972. He picked the @ symbol from the available symbols on his teletype to link the username and address.
The Internet matured in the 1970s as a result of the TCP/IP architecture, first proposed by Bob Kahn at BBN and further developed by Kahn and Vint Cerf at Stanford and by others throughout the 1970s. It was adopted by the Defense Department in 1980, replacing the earlier Network Control Protocol (NCP), and was universally adopted by 1983. In 1986, the National Science Foundation funded NSFNet as a cross-country 56 Kbps backbone for the Internet. The NSF maintained its sponsorship for nearly a decade, setting rules for the network's non-commercial government and research uses.
As the commands for e-mail, FTP, and telnet were standardized, it became a lot easier for non-technical people to learn to use the nets. It was not easy by today's standards by any means, but it did open up use of the Internet to many more people, in universities in particular. Departments other than the libraries, computer, physics, and engineering departments found ways to make good use of the nets: to communicate with colleagues around the world and to share files and resources.
While the number of sites on the Internet was small, it was fairly easy to keep track of the resources of interest that were available. But as more and more universities and organizations, and their libraries, connected, the Internet became harder and harder to track. There was more and more need for tools to index the resources that were available.3
This is where web directories like Yahoo!, Excite and DMOZ came into the picture. However, users had to find the category they were looking for in such directories themselves. Moreover, such directories were not updated in real time. Search engines, as we know them today, were started to address these problems; Altavista, Ask Jeeves, Google and others all began this way.
in presence of relevant data towards the social media platform. Search engines based on Twitter are quickly making their presence felt on the internet, and Facebook is also planning a search engine to find public content across its platform. With these developments, data mining techniques such as Text Mining and Web Mining have become highly relevant and heavily relied upon.
Search Engines
What is a Search Engine?
A Search Engine is an internet-based website that allows a user to search for content on the internet. The data to be searched can be in any form: text, links, images or even natural language processed data. A search engine essentially uses a web crawler, or spider, which is a program that crawls various web pages on the internet and extracts meaningful data from them using complex algorithms and concepts of text mining. The extracted data is stored on servers and is retrieved for the user when he or she enters a query on the search engine's website.
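The store-then-retrieve cycle described above is commonly built around an inverted index. The following is only a minimal sketch under stated assumptions: the pages, the URLs and the function names build_index and lookup are hypothetical, the crawling step is replaced by a hard-coded dictionary, and real engines add ranking, stemming and much more.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def lookup(index, query):
    """Return the URLs that contain every word of the query, sorted."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)


# Hypothetical output of a crawler: URL -> text extracted from the page.
PAGES = {
    "http://example.com/a": "text mining extracts patterns from text",
    "http://example.com/b": "data mining finds hidden patterns",
}
INDEX = build_index(PAGES)
print(lookup(INDEX, "mining patterns"))
```

The index is built once, when pages are crawled, so that answering a query only requires intersecting a few small sets instead of rescanning every page.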
Common verticals include shopping, the automotive industry, legal information, medical information, and travel. In contrast to general Web search engines, which attempt to index large portions of the World Wide Web using a web crawler, vertical search engines typically use a focused crawler that attempts to index only Web pages that are relevant to a pre-defined topic or set of topics.
Some vertical search sites focus on individual verticals, while other sites include multiple vertical searches within one search engine. Vertical search offers several potential benefits over general search engines:
Greater precision due to limited scope
Leverage domain knowledge including taxonomies and ontologies
Support specific unique user tasks5
Semantic Search Engines
Semantic search seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear in the searchable dataspace, whether on the Web or within a closed system, to generate more relevant results. There are two major forms of search: Navigational and Research.
In navigational search, the user is using the search engine as a navigation tool to navigate to a particular intended document. Semantic Search is not applicable to navigational searches.
In Research Search, the user provides the search engine with a phrase which is intended to denote an object about which the user is trying to gather/research information. There is no particular document which the user knows about that s/he is trying to get to. Rather, the user is trying to locate a number of documents which together will give him/her the information s/he is trying to find. Semantic Search lends itself well here.
Rather than using ranking algorithms such as Google's PageRank to predict relevancy, Semantic Search uses semantics, or the science of meaning in language, to produce highly relevant search results. In most cases, the goal is to deliver the information queried by a user rather than have a user sort through a list of loosely related keyword results.6
All commonly used words, such as 'a', 'an' and 'the', are removed from the input string. If the search query is given in quotation marks (""), stop words are not removed, so that the query can be searched as a phrase.
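The stop word step above can be sketched as follows. This is an illustration only: the STOP_WORDS set is a small hypothetical sample rather than the list any particular engine uses, and remove_stop_words is an assumed name.

```python
# A small, hypothetical stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "is"}


def remove_stop_words(query):
    """Return the query terms, keeping stop words only for quoted phrases."""
    query = query.strip()
    if len(query) >= 2 and query.startswith('"') and query.endswith('"'):
        # Quoted query: treat as a phrase search and keep every word.
        return query[1:-1].lower().split()
    return [word for word in query.lower().split() if word not in STOP_WORDS]


print(remove_stop_words("the history of the internet"))
print(remove_stop_words('"the sky is blue"'))
```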
Stemming
Similar words are 'stemmed' down to their root, which is usually a noun or a verb. For example, Cat, Catty and Catlike are all stemmed down to the word 'cat'. Words with the same stem are treated as synonyms. This process is called conflation.
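A toy version of stemming and conflation is sketched below. It is deliberately naive: the three suffixes are chosen only to cover the example words above, the names stem and conflate are assumptions, and production systems rely on a full algorithm such as the Porter stemmer instead.

```python
from collections import defaultdict

# Illustrative suffixes only, checked longest first; not a real stemming rule set.
SUFFIXES = ("like", "ty", "s")


def stem(word):
    """Strip one known suffix, provided at least three letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def conflate(words):
    """Group words that share a stem, so they can be treated as synonyms."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return dict(groups)


print(conflate(["Cat", "Catty", "Catlike", "dog"]))
```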
Database Search
The database is mined. First the complete phrase is searched, and then the individual words are searched. The data is retrieved and shown to the user.
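The phrase-first, words-second lookup of the database search step might look like the sketch below. The in-memory dictionary stands in for the database, the function name search is hypothetical, and the phrase test is a plain substring check, which is cruder than what a real engine would do.

```python
def search(documents, terms):
    """Return document ids: full-phrase matches first, then single-word matches."""
    phrase = " ".join(terms)
    phrase_hits = [
        doc_id for doc_id, text in documents.items() if phrase in text.lower()
    ]
    word_hits = [
        doc_id
        for doc_id, text in documents.items()
        if doc_id not in phrase_hits
        and any(term in text.lower().split() for term in terms)
    ]
    return phrase_hits + word_hits


# Hypothetical stored records: id -> text.
DOCUMENTS = {
    1: "Text mining in search engines",
    2: "Mining of coal",
    3: "Plain text files",
    4: "Cooking recipes",
}
print(search(DOCUMENTS, ["text", "mining"]))
```

Documents containing the whole phrase are listed ahead of those matching only one of the words, which mirrors the order of the two searches.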
The process of retrieval might look easy, but it is probably the most difficult process for a search engine. The task is difficult for the following reasons:
Overwhelming information in typical user query results. The data to be mined is huge, and the number of combinations to be mined makes the task more difficult.
Results are only partly related to each other. It is quite likely that two results that appear at rank 1 and rank 2 in the search engine result page (SERP) are not linked at all. An example is given in a later section.
Many users investigate only the two or three top ranked documents. Thus a search engine needs an algorithm that puts the most relevant documents at the top. This essentially means that the efficiency of retrieval and the efficiency of indexing both need to be at their best.
Traditional lists of ranked documents do not seem to be sufficient for exploratory search tasks.
classifications to be ordered in multiple ways, rather than in a single, pre-determined, taxonomic order.
Facets can be derived manually from analysis of the item, or from pre-existing fields in the item's metadata such as author, descriptor, language, and format. The former enables facets to be derived and sourced via a range of user and content research methods. The latter permits existing items in a catalogue/repository/database to have this extra metadata extracted, mapped and presented as a navigation facet, without extra data input being needed.7
Enhanced feedback - users receive an overview of their search results broken down by category that they can then use for refining their search.
Informed choices - users know in advance how many items are available in each category, so they can search first in categories more likely to bring them a successful result. Categories with zero items in a given field are usually not shown; hence the user is very unlikely to encounter a 'no results' outcome.
Users can select their own searching path or hierarchy based on the information presented to them and can add or remove filters or facets at will.
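Deriving facets from existing metadata fields, and showing per-category counts, can be sketched as follows. The catalogue items, field values and function names here are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter


def facet_counts(items, field):
    """Count the items per value of a metadata field; empty categories never appear."""
    return dict(Counter(item[field] for item in items if item.get(field)))


def apply_filters(items, filters):
    """Keep only the items that match every selected facet value."""
    return [
        item
        for item in items
        if all(item.get(field) == value for field, value in filters.items())
    ]


# Hypothetical catalogue records with pre-existing metadata fields.
ITEMS = [
    {"title": "A", "language": "English", "format": "PDF"},
    {"title": "B", "language": "English", "format": "HTML"},
    {"title": "C", "language": "German", "format": "PDF"},
]
print(facet_counts(ITEMS, "language"))
german_items = apply_filters(ITEMS, {"language": "German"})
print(facet_counts(german_items, "format"))
```

Because counts are computed over the current result set, a category with zero matching items simply does not occur in the output, which is why the user rarely hits a 'no results' page.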
Contextualized clustering
Contextualized clustering is a method of grouping similar search results together, clustering documents or web pages according to the terms found in them. The core advantage of contextualized results is that the user can more easily find the relevant content he or she is looking for, which provides a more meaningful search experience. A newly developed system, HOBSearch, makes use of suffix tree clustering to overcome many of the weaknesses of traditional clustering approaches. By using result snippets rather than full documents, HOBSearch both speeds up clustering substantially and manages to tailor the clustering to the topics indicated in the user's query. An inherent problem with clustering, though, is the choice of cluster labels.
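The idea of clustering result snippets by the phrases they share can be illustrated with a much simplified sketch. It does not build a suffix tree and is not HOBSearch's implementation: it only groups snippets by shared two-word phrases and uses the phrase as the cluster label. The snippets and the name phrase_clusters are assumptions.

```python
from collections import defaultdict


def phrase_clusters(snippets, min_size=2):
    """Group snippet indexes by shared two-word phrases; the phrase is the label."""
    clusters = defaultdict(set)
    for index, snippet in enumerate(snippets):
        words = snippet.lower().split()
        for first, second in zip(words, words[1:]):
            clusters[f"{first} {second}"].add(index)
    # Keep only phrases shared by at least min_size snippets.
    return {
        phrase: sorted(members)
        for phrase, members in clusters.items()
        if len(members) >= min_size
    }


# Hypothetical result snippets for the ambiguous query "jaguar".
SNIPPETS = [
    "jaguar car review",
    "new jaguar car prices",
    "jaguar cat habitat",
    "big jaguar cat facts",
]
print(phrase_clusters(SNIPPETS))
```

Using the shared phrase as the label sidesteps the labelling problem only in this toy case; choosing readable labels remains hard in general.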
Concepts in Action
Text Categorization
Most of the examples I will quote here are from the search engine which I have created: Molu The Search Spider (http://www.themolu.com). Text categorization can be applied in various ways in a search engine. Some of them are shown below:
The above screenshot is from an upcoming version of the website. The concept lets the user drill a search down to a single website, narrowing its scope to the websites he or she trusts.
Using a data mining technique called faceted categorization, the data item (news, in this case) is assigned two properties: a date and a relevance. Each of them has its own hierarchy according to popularity and other factors. When either of them is invoked, relevant search results are mined and returned to the user. This concept is widely applied in search engines these days. The best example is Flickr.com, where the user gets to see the search results by Date/Relevance/Interestingness. Flickr was also among the first companies to apply this concept.
Contextualized Clustering
The search engines of today have become intelligent. Using text mining techniques, they group the results into different categories and present them to the user. The method of clustering is simple to describe but difficult to apply. The density of tags (the distinct words extracted after stop word removal and stemming) in a particular record (how many words of similar origin match in a particular page) decides the group into which that page will fall. A very nice implementation of the technique can be found at Yippy (http://search.yippy.com/). The search engine classifies the results based on groups (city, software, India etc.)/sources (Bing, Google etc.)/sites (.com, .org etc.)/time of update.
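One way to read the tag density rule is sketched below: a page goes to the group whose tag set covers the largest share of the page's own tags. The group names, tag sets and function names are hypothetical, and this is an interpretation of the description above, not the method used by Yippy or any other engine.

```python
def tag_density(page_tags, group_tags):
    """Fraction of the page's tags that also occur in the group's tag set."""
    if not page_tags:
        return 0.0
    matches = sum(1 for tag in page_tags if tag in group_tags)
    return matches / len(page_tags)


def assign_group(page_tags, groups):
    """Pick the group with the highest tag density, or None if nothing matches."""
    best = max(groups, key=lambda name: tag_density(page_tags, groups[name]))
    return best if tag_density(page_tags, groups[best]) > 0 else None


# Hypothetical groups, each described by a set of already stemmed tags.
GROUPS = {
    "city": {"delhi", "mumbai", "population"},
    "software": {"download", "install", "version"},
}
print(assign_group(["download", "version", "delhi"], GROUPS))
```

The page tags are assumed to have passed through stop word removal and stemming already, as described in the retrieval process.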
Some search engines also present the results in tree based form making it much easier for the user to use them:
The results are intelligent and better than those of a normal web search. This is a very good example of how the Information Extraction technique has been used in selecting the output. Another question put to the search engine was 'Why is sky blue in colour?' As expected, the result was a plain answer rather than a cluster of web search results pointing to web pages containing the keywords 'sky', 'blue' and 'colour'.
Using semantic technologies and mathematical computation power, Wolfram|Alpha was able to calculate and predict a mathematical function correctly: