
Text mining

From Wikipedia, the free encyclopedia


Text mining, also referred to as text data mining and roughly equivalent to text analytics,
refers to the process of deriving high-quality information from text. High-quality
information is typically derived by identifying patterns and trends through
means such as statistical pattern learning. Text mining usually involves the process of
structuring the input text (usually parsing, along with the addition of some derived
linguistic features and the removal of others, and subsequent insertion into a database),
deriving patterns within the structured data, and finally evaluation and interpretation of
the output. 'High quality' in text mining usually refers to some combination of relevance,
novelty, and interestingness. Typical text mining tasks include text categorization, text
clustering, concept/entity extraction, production of granular taxonomies, sentiment
analysis, document summarization, and entity relation modeling (i.e., learning relations
between named entities).
Text analysis involves information retrieval, lexical analysis to study word frequency
distributions, pattern recognition, tagging/annotation, information extraction, data mining
techniques including link and association analysis, visualization, and predictive analytics.
The overarching goal is, essentially, to turn text into data for analysis, via application of
natural language processing (NLP) and analytical methods.
A typical application is to scan a set of documents written in a natural language and either
model the document set for predictive classification purposes or populate a database or
search index with the information extracted.
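The idea of turning text into data can be illustrated with a word frequency
distribution, one of the lexical analyses mentioned above. The following is a minimal
sketch, not a reference implementation: the function names tokenize and
word_frequencies and the two-sentence corpus are assumptions made for illustration,
and the tokenizer simply keeps runs of ASCII letters.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def word_frequencies(documents):
    """Count token occurrences across an iterable of document strings."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts

# Hypothetical two-document corpus, for illustration only.
docs = [
    "Text mining derives information from text.",
    "Text analytics turns text into data for analysis.",
]
freqs = word_frequencies(docs)
print(freqs.most_common(3))  # the most frequent token here is "text"
```

The resulting counts are structured data that could be stored in a database or passed
to a statistical model. A practical system would usually add steps such as stop-word
removal or stemming, which are omitted here.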
Contents
1 Text mining and text analytics
2 History
3 Text analysis processes
4 Applications
o 4.1 Security applications
o 4.2 Biomedical applications
o 4.3 Software applications
o 4.4 Online media applications
o 4.5 Marketing applications
o 4.6 Sentiment analysis
o 4.7 Academic applications
5 Software
6 Implications
7 See also
8 Notes
9 References
10 External links
Text mining and text analytics
The term text analytics describes a set of linguistic, statistical, and machine learning
techniques that model and structure the information content of textual sources for
business intelligence, exploratory data analysis, research, or investigation.[1] The term is
roughly synonymous with text mining; indeed, Ronen Feldman modified a 2000
description of "text mining"[2] in 2004 to describe "text analytics."[3] The latter term is
now used more frequently in business settings while "text mining" is used in some of the
earliest application areas, dating to the 1980s,[4] notably life-sciences research and
government intelligence.
The term text analytics also describes the application of these techniques to respond to
business problems, whether independently or in conjunction with query and analysis of
fielded, numerical data. It is a truism that 80 percent of business-relevant information
originates in unstructured form, primarily text.[5] These techniques and processes discover
and present knowledge (facts, business rules, and relationships) that is otherwise locked
in textual form, impenetrable to automated processing.
History
Labor-intensive manual text mining approaches first surfaced in the mid-1980s,[6] but
technological advances have enabled the field to advance during the past decade. Text
mining is an interdisciplinary field that draws on information retrieval, data mining,
machine learning, statistics, and computational linguistics. As most information (common
estimates say over 80%)[5] is currently stored as text, text mining is believed to have a
high commercial potential value. Increasing interest is being paid to multilingual data
mining: the ability to gain information across languages and cluster similar items from
different linguistic sources according to their meaning.
The challenge of exploiting the large proportion of enterprise information that originates
in "unstructured" form has been recognized for decades.[7] It is recognized in the earliest
definition of business intelligence (BI), in an October 1958 IBM Journal article by H.P.
Luhn, A Business Intelligence System, which describes a system that will:
"...utilize data-processing machines for auto-abstracting and auto-encoding of documents
and for creating interest profiles for each of the 'action points' in an organization. Both
incoming and internally generated documents are automatically abstracted, characterized
by a word pattern, and sent automatically to appropriate action points."
Yet as management information systems developed starting in the 1960s, and as BI
emerged in the '80s and '90s as a software category and field of practice, the emphasis
was on numerical data stored in relational databases. This is not surprising: text in
"unstructured" documents is hard to process. The emergence of text analytics in its
current form stems from a refocusing of research in the late 1990s from algorithm
development to application, as described by Prof. Marti A. Hearst in the paper
Untangling Text Data Mining:[8]
For almost a decade the computational linguistics community has viewed large text
collections as a resource to be tapped in order to produce better text analysis algorithms.
In this paper, I have attempted to suggest a new emphasis: the use of large online text
collections to discover new facts and trends about the world itself. I suggest that to make
progress we do not need fully artificial intelligent text analysis; rather, a mixture of
computationally-driven and user-guided analysis may open the door to exciting new
results.
Hearst's 1999 statement of need fairly well describes the state of text analytics technology
and practice a decade later.
Text analysis processes
Subtasks (components of a larger text-analytics effort) typically include:
Information retrieval or identification of a corpus is a preparatory step: collecting
or identifying a set of textual materials, on the Web or held in a file system,
database, or content management system, for analysis.
Although some text analytics systems apply exclusively advanced statistical
methods, many others apply more extensive natural language processing, such as
part of speech tagging, syntactic parsing, and other types of linguistic analysis.[citation needed]
Named entity recognition is the use of gazetteers or statistical techniques to
identify named text features: people, organizations, place names, stock ticker
symbols, certain abbreviations, and so on. Disambiguation (the use of
contextual clues) may be required to decide whether, for instance, "Ford"
refers to a former U.S. president, a vehicle manufacturer, a movie star, a river
crossing, or some other entity.
Recognition of Pattern Identified Entities: Features such as telephone numbers,
e-mail addresses, and quantities (with units) can be discerned via regular expression or
other pattern matches.
Coreference: identification of noun phrases and other terms that refer to the same
object.
Relationship, fact, and event Extraction: identification of associations among
entities and other information in text.
Sentiment analysis involves discerning subjective (as opposed to factual) material
and extracting various forms of attitudinal information: sentiment, opinion, mood,
and emotion. Text analytics techniques are helpful in analyzing sentiment at the
entity, concept, or topic level and in distinguishing opinion holder and opinion
object.[9]
Quantitative text analysis is a set of techniques stemming from the social sciences
in which either a human judge or a computer extracts semantic or grammatical
relationships between words in order to find the meaning or stylistic patterns
of, usually, a casual personal text, for purposes such as psychological profiling.[10]
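Two of the subtasks listed above, recognition of pattern-identified entities and
gazetteer-based named entity recognition, can be sketched in a few lines. This is an
illustrative sketch under stated assumptions: the regular expressions are deliberately
simple, and the PATTERNS table, the toy GAZETTEER, and both function names are
hypothetical rather than part of any particular system.

```python
import re

# Deliberately simple, illustrative patterns; real systems need stricter rules.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "quantity": re.compile(r"\b\d+(?:\.\d+)?\s?(?:kg|km|mg|ml)\b"),
}

# Hypothetical toy gazetteer mapping surface forms to entity types.
GAZETTEER = {"Ford": "ORGANIZATION", "Berkeley": "PLACE"}

def extract_pattern_entities(text):
    """Return (label, matched_text) pairs for every pattern match in text."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group()))
    return found

def gazetteer_lookup(text):
    """Return (entity_type, token) pairs for tokens listed in the gazetteer."""
    tokens = re.findall(r"\w+", text)
    return [(GAZETTEER[tok], tok) for tok in tokens if tok in GAZETTEER]

sample = "Contact jane.doe@example.com or 555-123-4567 about the 2.5 kg shipment from Ford."
print(extract_pattern_entities(sample))
print(gazetteer_lookup(sample))
```

A plain gazetteer lookup always assigns "Ford" the same type, which is exactly the
limitation that disambiguation with contextual clues is meant to address.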
Applications
The technology is now broadly applied for a wide variety of government, research, and
business needs. Applications can be sorted into a number of categories by analysis type
or by business function. Using this approach to classifying solutions, application
categories include:
Enterprise Business Intelligence/Data Mining, Competitive Intelligence
E-Discovery, Records Management
National Security/Intelligence
Scientific discovery, especially Life Sciences
Sentiment Analysis Tools, Listening Platforms
Natural Language/Semantic Toolkit or Service
Publishing
Automated ad placement
Search/Information Access
Social media monitoring
Security applications
Many text mining software packages are marketed for security applications, especially
monitoring and analysis of online plain text sources such as Internet news, blogs, etc. for
national security purposes.[11] Text mining is also involved in the study of text
encryption/decryption.
Biomedical applications
Main article: Biomedical text mining
A range of text mining applications in the biomedical literature has been described.[12]
One online text mining application in the biomedical literature is PubGene, which combines
biomedical text mining with network visualization as an Internet service.[13][14] TPX is a
concept-assisted search and navigation tool for biomedical literature analyses;[15] it runs
on PubMed/PMC and can be configured, on request, to run on local literature repositories
too.
GoPubMed is a knowledge-based search engine for biomedical texts.
Software applications
Text mining methods and software are also being researched and developed by major
firms, including IBM and Microsoft, to further automate the mining and analysis
processes, and by different firms working in the area of search and indexing in general as
a way to improve their results. Within the public sector much effort has been concentrated on
creating software for tracking and monitoring terrorist activities.[16]
Online media applications
Text mining is being used by large media companies, such as the Tribune Company, to
clarify information and to provide readers with better search experiences, which in turn
increases site "stickiness" and revenue. Additionally, on the back end, editors
benefit by being able to share, associate and package news across properties,
significantly increasing opportunities to monetize content.
Marketing applications
Text mining is starting to be used in marketing as well, more specifically in analytical
customer relationship management. Coussement and Van den Poel (2008)[17][18] apply it to
improve predictive analytics models for customer churn (customer attrition).[17]
Sentiment analysis
Sentiment analysis may involve analysis of movie reviews for estimating how favorable a
review is for a movie.[19] Such an analysis may need a labeled data set or labeling of the
affectivity of words. Resources for affectivity of words and concepts have been made for
WordNet[20] and ConceptNet,[21] respectively.
Text has been used to detect emotions in the related area of affective computing.[22]
Text-based approaches to affective computing have been used on multiple corpora such as
student evaluations, children's stories and news stories.
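The role of word-level affectivity labels can be illustrated with a simple lexicon-based
scorer. The sketch below is hypothetical throughout: the four-word AFFECT_LEXICON and
its polarity values are invented for illustration and are not taken from WordNet,
ConceptNet, or any published resource, and averaging word scores is only one of many
possible ways to combine them.

```python
import re

# Hypothetical hand-made affect lexicon: word -> polarity in [-1, 1].
AFFECT_LEXICON = {
    "great": 1.0,
    "enjoyable": 0.5,
    "boring": -0.5,
    "awful": -1.0,
}

def sentiment_score(review):
    """Average the polarity of lexicon words in the review (0.0 if none occur)."""
    tokens = re.findall(r"[a-z]+", review.lower())
    scores = [AFFECT_LEXICON[tok] for tok in tokens if tok in AFFECT_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(sentiment_score("A great and enjoyable movie"))  # positive score
print(sentiment_score("Awful, boring plot"))           # negative score
```

A scorer of this kind ignores negation and context ("not great" still counts as
positive), which is one reason a labeled data set and a trained classifier may be
preferred.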
Academic applications
The issue of text mining is of importance to publishers who hold large databases of
information needing indexing for retrieval. This is especially true in scientific disciplines,
in which highly specific information is often contained within written text. Therefore,
initiatives have been taken such as Nature's proposal for an Open Text Mining Interface
(OTMI) and the National Institutes of Health's common Journal Publishing Document
Type Definition (DTD) that would provide semantic cues to machines to answer specific
queries contained within text without removing publisher barriers to public access.
Academic institutions have also become involved in the text mining initiative:
The National Centre for Text Mining (NaCTeM) is the first publicly funded text
mining centre in the world. NaCTeM is operated by the University of
Manchester[23] in close collaboration with the Tsujii Lab,[24] University of Tokyo.[25]
NaCTeM provides customised tools and research facilities and offers advice to the
academic community. It is funded by the Joint Information Systems
Committee (JISC) and two of the UK Research Councils (EPSRC & BBSRC).
With an initial focus on text mining in the biological and biomedical sciences,
research has since expanded into the areas of social sciences.
In the United States, the School of Information at University of California,
Berkeley is developing a program called BioText to assist biology researchers in
text mining and analysis.
Further, private initiatives also offer tools for academic text mining:
Newsanalytics.net provides researchers with a free scalable solution for
keyword-based text analysis. The initiative's research apps were developed to support news
analytics, but are equally useful for regular text analysis applications.
Software
Text mining computer programs are available from many commercial and open source
companies and sources. See List of text mining software.
Implications
Until recently, websites most often used text-based searches, which only found
documents containing specific user-defined words or phrases. Now, through use of a
semantic web, text mining can find content based on meaning and context (rather than
just by a specific word).
Additionally, text mining software can be used to build large dossiers of information
about specific people and events. For example, large datasets based on data extracted
from news reports can be built to facilitate social networks analysis or
counter-intelligence. In effect, the text mining software may act in a capacity similar to an
intelligence analyst or research librarian, albeit with a more limited scope of analysis.
Text mining is also used in some email spam filters as a way of determining the
characteristics of messages that are likely to be advertisements or other unwanted
material.
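One common way to learn such message characteristics is a naive Bayes classifier over
word counts. The article does not say which method any particular filter uses, so the
following is only a sketch of that one technique: the NaiveBayesFilter class, its method
names, and the four training messages are all assumptions made for illustration, and the
code assumes each class has at least one training message.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayesFilter:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        """Add one labeled message to the training counts."""
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def log_score(self, text, label):
        """Log prior of the label plus smoothed log likelihood of each token."""
        vocab = set().union(*self.word_counts.values())
        total = sum(self.word_counts[label].values())
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for tok in tokenize(text):
            count = self.word_counts[label][tok]
            score += math.log((count + 1) / (total + len(vocab)))
        return score

    def classify(self, text):
        """Return the label with the higher log score."""
        return max(self.LABELS, key=lambda label: self.log_score(text, label))

# Hypothetical training messages, for illustration only.
spam_filter = NaiveBayesFilter()
spam_filter.train("Buy cheap pills now", "spam")
spam_filter.train("Cheap offer, buy now", "spam")
spam_filter.train("Meeting agenda for Monday", "ham")
spam_filter.train("Lunch on Monday?", "ham")

print(spam_filter.classify("cheap pills offer"))
print(spam_filter.classify("agenda for lunch meeting"))
```

With so little training data the result is only suggestive; a deployed filter would be
trained on many thousands of messages and would typically use additional features.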
See also
Approximate nonnegative matrix factorization, an algorithm used for text mining
BioCreative text mining evaluation in biomedical literature
Concept Mining
Name resolution
Stop words
Text classification sometimes is considered a (sub)task of text mining.
Web mining, a task that may involve text mining (e.g. first find appropriate web
pages by classifying crawled web pages, then extract the desired information from
the text content of these pages considered relevant).
w-shingling
Sequence mining: String and Sequence Mining
Noisy text analytics
Named entity recognition
Identity resolution
News analytics
Notes
1. Defining Text Analytics [dead link]
2. KDD-2000 Workshop on Text Mining
3. Text Analytics: Theory and Practice [dead link]
4. Hobbs, Jerry R.; Walker, Donald E.; Amsler, Robert A. (1982). "Natural
language access to structured text". Proceedings of the 9th conference on
Computational linguistics. pp. 127-32. doi:10.3115/991813.991833.
5. Unstructured Data and the 80 Percent Rule [dead link]
6. Content Analysis of Verbatim Explanations
7. http://www.b-eye-network.com/view/6311 [full citation needed]
8. Hearst, Marti A. (1999). "Untangling text data mining". Proceedings of
the 37th annual meeting of the Association for Computational Linguistics on
Computational Linguistics. pp. 3-10. doi:10.3115/1034678.1034679.
ISBN 1-55860-609-2.
9. http://www.clarabridge.com/default.aspx?tabid=137&ModuleID=635&ArticleID=722 [dead link]
10. Mehl, Matthias R. (2006). "Quantitative Text Analysis". Handbook of
multimethod measurement in psychology. p. 141. doi:10.1037/11383-011.
ISBN 1-59147-318-7.
11. Zanasi, Alessandro (2009). "Virtual Weapons for Real Wars: Text Mining
for National Security". Proceedings of the International Workshop on
Computational Intelligence in Security for Information Systems CISIS'08.
Advances in Soft Computing 53. p. 53. doi:10.1007/978-3-540-88181-0_7.
ISBN 978-3-540-88180-3.
12. Cohen, K. Bretonnel; Hunter, Lawrence (2008). "Getting Started in Text
Mining". PLoS Computational Biology 4 (1): e20.
doi:10.1371/journal.pcbi.0040020. PMC 2217579. PMID 18225946.
13. Jenssen, Tor-Kristian; Lægreid, Astrid; Komorowski, Jan; Hovig, Eivind
(2001). "A literature network of human genes for high-throughput analysis of
gene expression". Nature Genetics 28 (1): 21-8. doi:10.1038/ng0501-21.
PMID 11326270.
14. Masys, Daniel R. (2001). "Linking microarray data to the literature".
Nature Genetics 28 (1): 9-10. doi:10.1038/ng0501-9. PMID 11326264.
15. Joseph, Thomas; Saipradeep, Vangala G; Venkat Raghavan, Ganesh
Sekar; Srinivasan, Rajgopal; Rao, Aditya; Kotte, Sujatha; Sivadasan, Naveen
(2012). "TPX: Biomedical literature search made easy". Bioinformation 8 (12):
578-80. doi:10.6026/97320630008578. PMC 3398782. PMID 22829734.
16. Texor
17. Coussement, Kristof; Van Den Poel, Dirk (2008). "Integrating the voice of
customers through call center emails into a decision support system for churn
prediction". Information & Management 45 (3): 164-74.
doi:10.1016/j.im.2008.01.005.
18. Coussement, Kristof; Van Den Poel, Dirk (2008). "Improving customer
complaint management by automatic email classification using linguistic style
features as predictors". Decision Support Systems 44 (4): 870-82.
doi:10.1016/j.dss.2007.10.010.
19. Pang, Bo; Lee, Lillian; Vaithyanathan, Shivakumar (2002). "Thumbs
up?". Proceedings of the ACL-02 conference on Empirical methods in natural
language processing 10. pp. 79-86. doi:10.3115/1118693.1118704.
20. Alessandro Valitutti, Carlo Strapparava, Oliviero Stock (2005).
"Developing Affective Lexical Resources". Psychology Journal 2 (1): 61-83.
21. Erik Cambria; Robert Speer, Catherine Havasi and Amir Hussain (2010).
"SenticNet: a Publicly Available Semantic Resource for Opinion Mining".
Proceedings of AAAI CSK. pp. 14-18.
22. Calvo, Rafael A; d'Mello, Sidney (2010). "Affect Detection: An
Interdisciplinary Review of Models, Methods, and Their Applications". IEEE
Transactions on Affective Computing 1 (1): 18-37. doi:10.1109/T-AFFC.2010.1.
23. The University of Manchester
24. Tsujii Laboratory
25. The University of Tokyo
References
Ananiadou, S. and McNaught, J. (Editors) (2006). Text Mining for Biology and
Biomedicine. Artech House Books. ISBN 978-1-58053-984-5
Bilisoly, R. (2008). Practical Text Mining with Perl. New York: John Wiley &
Sons. ISBN 978-0-470-17643-6
Feldman, R., and Sanger, J. (2006). The Text Mining Handbook. New York:
Cambridge University Press. ISBN 978-0-521-83657-9
Indurkhya, N., and Damerau, F. (2010). Handbook Of Natural Language
Processing, 2nd Edition. Boca Raton, FL: CRC Press. ISBN 978-1-4200-8592-1
Kao, A., and Poteet, S. (Editors). Natural Language Processing and Text Mining.
Springer. ISBN 1-84628-175-X
Konchady, M. Text Mining Application Programming (Programming Series).
Charles River Media. ISBN 1-58450-460-9
Manning, C., and Schutze, H. (1999). Foundations of Statistical Natural
Language Processing. Cambridge, MA: MIT Press. ISBN 978-0-262-13360-9
Miner, G., Elder, J., Hill, T., Nisbet, R., Delen, D. and Fast, A. (2012). Practical
Text Mining and Statistical Analysis for Non-structured Text Data Applications.
Elsevier Academic Press. ISBN 978-0-12-386979-1
McKnight, W. (2005). "Building business intelligence: Text data mining in
business intelligence". DM Review, 21-22.
Srivastava, A., and Sahami, M. (2009). Text Mining: Classification, Clustering,
and Applications. Boca Raton, FL: CRC Press. ISBN 978-1-4200-5940-3
Zanasi, A. (Editor) (2007). Text Mining and its Applications to Intelligence, CRM
and Knowledge Management. WIT Press. ISBN 978-1-84564-131-3
External links
Marti Hearst: What Is Text Mining? (October, 2003)
Automatic Content Extraction, Linguistic Data Consortium
Automatic Content Extraction, NIST
Categories:
Artificial intelligence applications
Data mining
Computational linguistics
Data analysis
Natural language processing
Statistical natural language processing
This page was last modified on 29 April 2014 at 12:50.
Text is available under the Creative Commons Attribution-ShareAlike License;
additional terms may apply. By using this site, you agree to the Terms of Use and
Privacy Policy. Wikipedia® is a registered trademark of the Wikimedia
Foundation, Inc., a non-profit organization.