
DWDM (CASE STUDY) MINING TEXT DATA PRACTICAL-12

08IT007 08IT008

AIM:- To study research papers on advanced topics in data mining, and to prepare and present a report on them.

TOPIC:- Mining Text Databases

Text mining, also known as text data mining or knowledge discovery from textual databases, refers to the process of extracting interesting and non-trivial patterns or knowledge from text documents.

Motivation for Text Mining:-

Approximately 90% of the world's data is held in unstructured formats. Information-intensive business processes demand that we move from simple document retrieval to knowledge discovery.

What is Text Mining?

Text mining is a multidisciplinary field, involving information retrieval, text analysis, information extraction, clustering, categorization, visualization, database technology, machine learning, and data mining.

A text mining framework

Text refining converts unstructured text documents into an intermediate form (IF). IF can be document-based or concept-based. Knowledge distillation from a document-based IF deduces patterns or knowledge across documents. A document-based IF can be projected onto a concept-based IF by extracting object information relevant to a domain. Knowledge distillation from a concept-based IF deduces patterns or knowledge across objects or concepts.

7th IT-1

CITC,CHANGA

Text data analysis and information retrieval

Information retrieval (IR) is a field that has been developed in parallel with database systems for many years. Unlike database systems, which have focused on query and transaction processing of structured data, information retrieval has focused on the organization and retrieval of information from a large number of text-based documents. A typical information retrieval problem is to locate relevant documents based on user input, such as keywords or example documents. Typical information retrieval systems include online library catalog systems and online document management systems.

Problems with information retrieval

Since information retrieval and database systems handle different kinds of data, some database system problems are usually not present in information retrieval systems, such as concurrency control, recovery, transaction management and update. Conversely, some common information retrieval problems are usually not encountered in traditional database systems, such as unstructured documents.

Basic measures for text retrieval
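A minimal sketch of the two measures this section defines, precision and recall, computed for a single query. Precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The function name `precision_recall` and the sample document ids are hypothetical, chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of the documents the system returned
    relevant:  ids of the documents that are actually relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # documents that are both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 documents returned, 3 documents truly relevant.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d1", "d3", "d5"]

precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Two of the four retrieved documents are relevant, and two of the three relevant documents were found, so the two measures differ for the same result set.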

There are two basic measures for content-based text retrieval. One is precision, which is the percentage of retrieved documents that are in fact correct (i.e., relevant to the query). The other is recall, which is the percentage of the documents that should be retrieved (i.e., those that are in the database and are relevant to the query) that were in fact retrieved.

How do such keyword-based and similarity-based information retrieval systems work?

A text retrieval system often associates a stop list with a set of documents. A stop list is a set of words that are deemed irrelevant. For example, a, the, of, for, with, and so on are stop words even though they may appear frequently. A group of words with minor syntactic differences may share the same word stem. A text retrieval system needs to identify groups of words that are small syntactic variants of each other and collect only their common word stem. For example, the words drug, drugged, and drugs share the common word stem drug, and can be viewed as different occurrences of the same word.
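The two preprocessing steps above can be sketched as follows. This is a toy illustration under stated assumptions: `STOP_WORDS` holds only the stop words named in the text, and `naive_stem` is a hypothetical suffix stripper with a handful of hand-written rules, not a real stemming algorithm.

```python
import re

# Only the stop words named in the text; a real stop list is much longer.
STOP_WORDS = {"a", "the", "of", "for", "with"}


def naive_stem(word):
    """Strip a few common English suffixes (toy rules, for illustration only)."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "drugg" -> "drug".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def index_terms(text):
    """Lowercase the text, drop stop words, and reduce each word to its stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


terms = index_terms("The drugs of the drugged patient with a drug")
print(terms)
```

With these rules, drug, drugged, and drugs all collapse to the single index term drug, while the stop words are discarded before indexing.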


Latent Semantic Indexing
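As a reading aid for the matrix sizes discussed in this section, the standard truncated-SVD formulation behind latent semantic indexing can be written as below. The symbols A, U_K, Sigma_K and V_K are assumptions introduced here and are not notation from the report. D is the number of documents, T the number of terms, and K the number of singular values retained.

```latex
A \approx A_K = U_K \, \Sigma_K \, V_K^{\top},
\qquad
A \in \mathbb{R}^{D \times T},\quad
U_K \in \mathbb{R}^{D \times K},\quad
\Sigma_K \in \mathbb{R}^{K \times K},\quad
V_K \in \mathbb{R}^{T \times K}
```

In this formulation Sigma_K is a diagonal matrix holding the K largest singular values. Discarding the remaining, smaller singular values is what causes the information loss.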

The latent semantic indexing method uses a singular value decomposition (SVD) technique to reduce the size of the term frequency table, retaining the K most significant rows of the frequency table, where K is usually taken to be a few hundred (e.g., 200) for large document collections. Notice that such a reduction, which takes a D×T matrix as input and represents it as a much smaller K×K matrix, leads to some information loss. We must ensure that only the least significant parts of the frequency table are lost.

Other Text Retrieval Indexing Techniques

There are also several other widely adopted text retrieval indexing techniques, including inverted indices and signature files.

Inverted indices

An inverted index is an index structure widely used in industry for indexing text documents. It maintains two hash-indexed or B+-tree-indexed tables: a document table and a term table. The document table consists of a set of document records, each containing two fields: doc id and posting list, where the posting list is a list of terms (or pointers to terms) that occur in the document, sorted according to some relevance measure. The term table consists of a set of term records, each containing two fields: term id and posting list, where the posting list specifies a list of identifiers of the documents in which the term appears. With such an organization, it is easy to answer queries like "find all the documents associated with a set of terms" or "find all the terms associated with a set of documents".
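The two tables and the two example queries can be sketched as follows. This is a minimal in-memory illustration, not a production index: Python dictionaries stand in for the hash-indexed tables, posting lists are kept as unordered sets rather than sorted by a relevance measure, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_tables(documents):
    """Build the document table and the term table.

    documents: dict mapping doc id -> list of terms occurring in that document
    """
    document_table = {}            # doc id -> set of terms in the document
    term_table = defaultdict(set)  # term   -> set of doc ids containing it
    for doc_id, doc_terms in documents.items():
        document_table[doc_id] = set(doc_terms)
        for term in doc_terms:
            term_table[term].add(doc_id)
    return document_table, dict(term_table)


def docs_with_all_terms(term_table, query_terms):
    """Find all the documents associated with every term in query_terms."""
    postings = [term_table.get(term, set()) for term in query_terms]
    return set.intersection(*postings) if postings else set()


def terms_in_docs(document_table, doc_ids):
    """Find all the terms associated with a set of documents."""
    result = set()
    for doc_id in doc_ids:
        result |= document_table.get(doc_id, set())
    return result


documents = {
    "d1": ["text", "mining", "data"],
    "d2": ["data", "warehouse"],
    "d3": ["text", "retrieval", "data"],
}
document_table, term_table = build_tables(documents)

print(docs_with_all_terms(term_table, ["text", "data"]))
print(terms_in_docs(document_table, ["d2", "d3"]))
```

The first query intersects the posting lists of the query terms in the term table. The second takes the union of the posting lists of the chosen documents in the document table.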


Signature Files
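As an illustration of the encoding scheme this section describes, the sketch below builds a fixed-size bit signature per document and tests whether one signature matches another. The signature size `B = 16`, the use of `zlib.crc32` to map a term to a bit position, and the function names are all assumptions made for illustration; the report does not say how terms are mapped to bits.

```python
import zlib

B = 16  # signature size in bits (hypothetical value)


def term_bit(term, b=B):
    """Map a term to one of b bit positions (assumed mapping, via CRC32)."""
    return zlib.crc32(term.encode("utf-8")) % b


def signature(terms, b=B):
    """Start from all-zero bits and set the bit of every term that occurs."""
    sig = 0
    for term in terms:
        sig |= 1 << term_bit(term, b)
    return sig


def matches(s1, s2):
    """S1 matches S2 if each bit set in S2 is also set in S1."""
    return (s1 & s2) == s2


doc_signature = signature(["text", "mining", "data"])
query_signature = signature(["text", "data"])

print(format(doc_signature, f"0{B}b"))
print(matches(doc_signature, query_signature))
```

Because several terms can map to the same bit, a match only marks a document as a candidate. The document itself still has to be checked to confirm that it contains the query terms.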

A signature file is a file that stores a signature record for each document in the database. Each signature has a fixed size of b bits. A simple encoding scheme goes as follows. Every bit of a document's signature is initialized to 0. A bit is set to 1 if the term corresponding to it appears in the document. A signature S1 matches another signature S2 if each bit set in signature S2 is also set in S1.

Keyword-based association analysis

Text data consists of structured, semi-structured or unstructured text.

Term Extraction

Text Mining at the Word Level

The association generation process detected either compounds, i.e. domain-dependent terms such as [wall, street] or [treasury, secretary, james, baker], or uninterpretable associations such as [dollars, shares, exchange, total, commission, stake, securities].

Conclusion

1. Term-level text mining attempts to benefit from the advantages of two extremes.
2. On the one hand, there is no need for human effort in tagging documents, and we do not lose most of the information present in the document, as happens in the tagged-documents approach.
3. On the other hand, the number of meaningless results is greatly reduced, and the execution time of the mining algorithms is also reduced.

Questions-Answers

1. How can we prevent loss of data? To get useful results, one needs to create a warehouse first, before mining the converted database. This warehouse is essentially a relational database that holds the essential data from the text data.

2. What is the possibility of multilingual text refining? Whereas data mining is largely language independent, text mining involves a significant language component. It is essential to develop text refining algorithms that process multilingual text documents and produce language-independent intermediate forms. While most text mining tools focus on processing English documents, mining documents in other languages allows access to previously untapped information and offers a host of new opportunities.
