WHITE PAPER
Table of Contents
Introduction
How Does Text Analytics Work?
Text Analytics Applications
Deciding on Text Analytics Software: The Process
  Self-Knowledge Phase
  Filter Phase
  Proof-of-Concept Phase
The POC Process for Text Analytics Technology: Key Issues
  Stage One: Preparation
  Stage Two: Development
    Key considerations for categorization
    Developing extraction catalogs
  Stage Four: Report
A Simple Solution?
Conclusion: Getting the Most from Your Investment
Tom Reamy is the Chief Knowledge Architect and founder of KAPS Group, a group of knowledge architecture, taxonomy and text analytics consultants. Reamy has 20 years of experience in information architecture, enterprise search, intranet management and consulting, education software and text analytics consulting. His academic background includes a master's in the history of ideas, research in artificial intelligence and cognitive science, and a strong focus in philosophy, particularly epistemology. He has published articles in various journals and is a frequent speaker at knowledge management conferences. When not writing or developing knowledge management projects, Reamy can usually be found at the bottom of the ocean in Carmel, CA, taking photos of strange creatures.
Introduction
Although it is a fast-growing area, text analytics is still new to most organizations. Many are looking for help to understand exactly what text analytics can do for their business and how to choose a platform and vendor that work best for them. This paper describes what text analytics is, how it works and why it is so valuable across many different organizational areas. It is also intended to give guidelines when evaluating various text analytics technologies and vendors.
Text analytics includes four basic text-handling capabilities: text extraction, categorization, sentiment analysis and summarization.
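To make these four capabilities concrete, here is a deliberately toy sketch of each one using naive keyword matching. All entity names, category terms and sentiment words below are invented for illustration; real platforms use linguistic rules and statistical models rather than simple word lookups.

```python
# Toy illustration of the four text analytics capabilities:
# extraction, categorization, sentiment analysis and summarization.
# All term lists are hypothetical examples, not a real catalog.

ENTITIES = {"SAS", "Ford"}                                        # extraction catalog
CATEGORIES = {"telecom": {"bandwidth", "billing"},                # taxonomy terms
              "finance": {"loan", "rates"}}
POSITIVE, NEGATIVE = {"great", "love"}, {"poor", "hate"}          # sentiment lexicons

def analyze(text):
    words = text.split()
    lower = {w.lower().strip(".,") for w in words}
    return {
        # extraction: pull out known named entities
        "entities": [w for w in words if w.strip(".,") in ENTITIES],
        # categorization: assign categories whose terms appear in the text
        "categories": [c for c, terms in CATEGORIES.items() if lower & terms],
        # sentiment: positive-minus-negative word count
        "sentiment": len(lower & POSITIVE) - len(lower & NEGATIVE),
        # summarization: crude leading-words summary
        "summary": " ".join(words[:8]) + ("..." if len(words) > 8 else ""),
    }

print(analyze("Ford billing support is great but bandwidth is poor"))
```

Even this crude version hints at why real evaluation is hard: "Ford" matched as an entity could be a company, a person or a car, which is exactly the disambiguation problem discussed later in this paper.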
[Figure 1 diagram: Self-knowledge → Filter → Proof of Concept (Preparation, Development, Results, Report) → Vendor Selection]
Figure 1: Before choosing a vendor, most companies go through three major phases and four sub-phases in the process of evaluating text analytics solutions.
Self-Knowledge Phase
Too many decision makers decide suddenly that they need to jump on the social media or text analytics bandwagon, and then try to pick a vendor without really understanding what business value they are looking for. A good evaluation process starts with a deep dive into what text analytics might mean to your organization. This deep dive includes:

• Understanding the strategic and business context for text analytics. For example, how does information flow within specific business processes? Is it mostly when you write large Word documents where research is done as a formal activity, or is it when research is done on the fly as documents are being written? For every company the answers will be different, but the main task for everyone is to map out the relative strategic importance of each type of information or business process flow.

• Deciding what your information problems are. You must decide what, how severe and how critical each information problem is to your organization. For example, do your problems mostly relate to the difficulty of finding information within your company, or do they relate to the inability to understand what your customers are saying about your products, or to the need to find better patent information?

• Asking strategic questions. You need to ask why you need text analytics, what value you get from the taxonomy or text analytics, and how you are going to use it. This will involve getting an idea of how much money and time you are currently losing to your information problems and understanding how a text analytics solution will help. This can be done abstractly (by applying the results of analyst research); by doing actual studies or surveys to determine how much productivity is lost now; or by calculating how much profit you think a new text analytics application will generate.
• Determining what content and content resources you have. This will involve answering questions such as: What is the mix of unstructured content and database content? Does most of the unstructured content live in a content management system, or is it distributed on file shares? Is the content mostly just business content, or do you have large collections of topical content such as biological research results? Do you have existing taxonomies or glossaries, or even just good overview books with good chapter structure?

• Assessing your technology environment and how text analytics will integrate with it. This will involve answering questions such as: Is SharePoint a major part of your technology environment? Do you have well-integrated technology, or does each department or division have its own technology? Do multiple programs have to share information? Do you have multiple search engines within the organization, and how integrated are they?

Answering these questions can be done during a formal two- to four-week process, or as an informal set of research and discussion activities.1 This new self-knowledge needs to be documented, describing the extent of the potential value and effect text analytics could have on your organization.
The only way to really understand a text analytics solution is by doing a proof of concept that tests with your content, your scenarios and your people.
Filter Phase
The filter phase is the one that most resembles traditional software evaluation. It consists of activities such as:

• Market research into the company's reputation, history and projected future.

• Technology research into the underlying technology behind the software, so that you can decide how it might integrate with your existing environment.

• A feature scorecard with a focus on minimum features, must-have features and an understanding of how those features matter to your organization. These features can include general software features such as price, usability and editing, but can also include comparisons of how well the basic text analytics features (text extraction and categorization) are implemented.

These traditional software evaluation activities can produce a scorecard, but this scorecard should be thought of as a filter to eliminate offerings that don't fit your needs, not as a final scorecard that you use to select your software. This phase should reduce the number of viable alternatives to a short list. Then, you can invite those vendors to do extended demos of one to three hours each.

Why not simply base the decision on features? First, because software features change. But more importantly, because your content is unique. So the real issue is to find features that are useful in understanding your materials.
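A filter-phase scorecard can be as simple as weighted feature scores. The sketch below shows one way to compute such a scorecard; the feature names, weights and vendor scores are entirely invented for illustration, and the ranking should be read as an elimination filter, not a final selection.

```python
# Illustrative weighted feature scorecard for the filter phase.
# Weights and scores (1-5 scales) are hypothetical examples.

WEIGHTS = {"price": 2, "usability": 3, "categorization": 5, "extraction": 4}

vendors = {
    "Vendor A": {"price": 4, "usability": 3, "categorization": 5, "extraction": 4},
    "Vendor B": {"price": 5, "usability": 4, "categorization": 2, "extraction": 3},
}

def weighted_score(scores):
    """Sum of feature scores, each multiplied by its organizational weight."""
    return sum(WEIGHTS[feature] * score for feature, score in scores.items())

# Rank vendors by total weighted score, highest first
for name, scores in sorted(vendors.items(), key=lambda kv: -weighted_score(kv[1])):
    print(name, weighted_score(scores))
```

Note that the weights encode your organization's priorities (here, categorization quality outweighs price), which is exactly why a generic analyst scorecard cannot substitute for your own.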
1 Such processes may be called a Readiness Assessment, performed by vendors or third parties like the KAPS Group.
Overall, the filter phase should reduce the number of candidates to between two and four. These are the candidates you will consider in the next phase. In some cases, you might even be able to reduce your candidates to one clear leader. But even if this happens, it still makes sense to do the last phase described below. In some cases, a company could start with a preferred vendor because of an ongoing relationship, or on the basis of a trusted recommendation or some other reason. In these cases, it still makes sense to do the next phase, but with a different focus to make sure that the text analytics offering works in your environment.
Proof-of-Concept Phase
The proof-of-concept (POC) phase is the most important of the entire text analytics evaluation. That is because text analytics is all about language and semantics and how people think and express their thoughts. The only way to really understand that is to test with your content, your scenarios and your people. A basic approach to a POC is to set it up as a contest, of sorts, between the top two or three vendors.

POCs are needed because the complexity of language demands that you look beyond simple out-of-the-box (OOB) capabilities. The key question is not how well a vendor can set up a demo with carefully selected content and scenarios, but how well those capabilities can be refined through two or more develop-test-refine cycles. This is what will really tell you if the software can solve the information problems you uncovered in the self-knowledge phase.

A POC will also answer another critical question: How much effort will it take to get to acceptable levels of accuracy? For example, some vendors have expended a lot of effort to get better OOB results with built-in semantic networks, large multiple dictionaries and the like. While those resources make the product look good in an initial comparison, the real question is how much effort will be required to achieve the 90 percent accuracy rate you may have set as your goal. For example, let's say a product can determine OOB that specific content is about telecommunications. That doesn't really tell you much if you are a telecommunications company and almost all of your content is about telecommunications. And it can often take more time and effort to go from telecommunications to specific concepts (like bill plans) than it would take to reach those levels from scratch with another product.

Another question that a POC can answer is how well you can establish a working relationship with the vendor.
And from the vendor's perspective, the POC can uncover any special issues that you need to have addressed, so the vendor can work out solutions while you are still in a relatively forgiving research frame of mind.
One of the most valuable aspects of a POC is that it gives you a head start in development with the support of the vendors. This is true even in a case where the initial selection was reduced to one. The POC creates a foundation for your initial project as well as for any future projects. This foundation consists of both the actual development of taxonomies and rules for categorization, sentiment and extraction. But just as important, it provides on-the-job training for your internal resources (taxonomists, text analytics developers and others) under the guidance of experts. Training by doing, in this case, is by far the best and cheapest way to train your internal resources so they can take over after the initial POC. A second benefit of using a POC for initial development is that it allows you to build the right kind of foundation one that is designed from the beginning to be a platform technology that can support multiple applications. This keeps you from getting caught in the trap of thinking about text analytics just in terms of your first project a sure way to not get the maximum benefit from text analytics software. In addition to getting the maximum direct value from your investment, this approach can also enable you to integrate text analytics with other advanced analytic technologies like text mining, data mining and predictive analytics.
2 If you don't have experienced personnel, most vendors will have a range of consultants and partners who can help you bridge this interim skills gap.
Another key consideration is the selection and recruitment of the internal resources who will participate in the POC. These include subject-matter experts (SMEs) who will select and categorize appropriate content for each individual category and act as expert evaluators of the success of the POC categorization and other scenarios. Others who might need to be included are technical people who can support the technical aspects of the POC, and business users who can generate use case scenarios and also evaluate the text analytics success.

Another important task for the preparation phase is to identify or develop a taxonomy for use with the categorization portion of the POC. Categorization requires a taxonomy as its organizing structure. It need not be a big taxonomy; it can often just be a list of important concepts. But if you have a large taxonomy (as biopharmaceutical companies or government and military organizations often do), you should select a small subset of the overall taxonomy and focus on getting good results, not complete coverage.

Once you have defined the use case scenarios for the evaluation, the next step is to map those to specific text analytics capabilities and then develop tests for each one. This will vary from organization to organization, but a general suggestion is to develop a set of small extraction catalogs that include both named entities and rule-based functionality, and test them on your selected content. The other primary test case(s) will be categorization and/or sentiment. During the preparation phase, you will also need to determine what accuracy level you will aim for.
This entire process can be repeated. But if time or resources are short, or if you get good results against the new content, then this is as far as you need go.
Disambiguation is something that can and should be tested. Disambiguation is the ability to distinguish between words that look the same but mean something different, or between two words that are different but mean the same thing. The latter case is usually relatively easy to handle through the development of extended synonyms. But the first case often calls for much more sophisticated rules that take context into consideration. For example, Ford can refer to a person, a car or a company (or in some contexts, a fictional person). To distinguish which is being referred to in a particular text, you must be able to incorporate multiple levels of context, from the type of work (fiction, newspaper, economic analysis) to the types of words in the document, the paragraph or the sentence. Because you need this level of disambiguation even for sentiment applications, it is important to look at the categorization functionality of each offering. It is the underlying categorization capability that will typically be used for disambiguation.
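The Ford example above can be sketched as a simple context-scoring rule. The context-term lists below are hypothetical, and real text analytics platforms express this logic as categorization rules rather than Python code, but the principle is the same: pick the sense whose context clues best match the surrounding words.

```python
# Minimal sketch of context-based disambiguation for an ambiguous term.
# The clue lists for each sense are invented for illustration.

CONTEXT_CLUES = {
    "company": {"motor", "shares", "ceo", "earnings", "dealership"},
    "person":  {"gerald", "president", "harrison", "actor"},
    "car":     {"mustang", "f-150", "engine", "drove", "sedan"},
}

def disambiguate(term, sentence):
    """Pick the sense whose context clues overlap most with the sentence."""
    words = set(sentence.lower().split())
    scores = {sense: len(words & clues) for sense, clues in CONTEXT_CLUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(disambiguate("Ford", "President Gerald Ford addressed the nation"))  # person
print(disambiguate("Ford", "Ford shares rose after the CEO spoke"))        # company
```

A production rule would also weight clues by proximity (same sentence versus same document) and by document type, which is exactly the "multiple levels of context" the text describes.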
3 For an in-depth description of the develop-test-refine cycle, see: Enterprise Content Categorization: How to Successfully Choose, Develop and Implement a Semantic Strategy. Available at: sas.com/reg/wp/corp/25624.
Also, keep in mind that the right balance between recall and precision is dependent on the particular application. For example, in a discovery application in which humans will be reviewing the results, recall is the most important measure. But for an automated application that is exposed to users, precision is often the most important, because too many false positives will cause users to lose faith in the application. Recall and precision are normally applied to categorization, but they can also be applied to extraction, with the focus on specific entities instead of documents.
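Recall and precision are straightforward to compute once SMEs have hand-labeled an evaluation set. The sketch below shows the calculation; the document IDs are invented placeholders for your own judged test content.

```python
# Precision and recall for a categorization test against an SME-judged set.
# Document IDs are hypothetical examples.

def precision_recall(predicted, relevant):
    """predicted/relevant are sets of document IDs for one category."""
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"doc1", "doc2", "doc3", "doc4", "doc5"}   # SMEs judged these on-topic
predicted = {"doc2", "doc3", "doc4", "doc6"}          # what the categorizer returned

p, r = precision_recall(predicted, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.60
```

In this toy case the categorizer is precise but misses documents; per the text, a discovery application might accept that precision-heavy balance, while an automated user-facing application would need false positives (here, doc6) driven down further.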
The third key consideration, knowing that the develop-test-refine cycle is not a linear process, is extremely important for an overall evaluation of the project. For example, you may be looking at only 30 percent accuracy after one round, which seems so poor that the entire idea is questionable. Or it may be that after one round, one vendor is way ahead, with accuracy of roughly 80 percent versus 50 percent. In the first case, project owners may be thinking that it took two weeks to get 30 percent accuracy, so they assume it will take another four weeks to get up to 80 or 90 percent, when in fact a particular category can go from 30 percent accuracy to 90 percent in one hour with a simple addition or deletion to the rule. In the second case, the relative scores could easily be an artifact of experience with the software or inexperience with the particular subject matter, which could be easily reversed with a second round of effort. This scenario also highlights the importance of doing at least two rounds of development and testing.
To get accurate results, it's important to do at least two rounds of development and testing.
• Initial evaluation. This part of the report should describe the research and thinking that went into the initial evaluation of the entire vendor space. It should then review the outcomes and describe the initial high-level conclusions. An interim version would also contain any unresolved discussion points from that phase. This section would end with a description of, and justification for, the recommendations from that phase.

• Proof of concept. This section typically describes the methodology employed during the POC, describes and interprets the results, and presents the final conclusions. The interim version would also lay out the remaining discussion points, while the final version would contain the results of those discussions.

• Final recommendations. This section could be as simple as listing the final vendor selection and the justifications for that selection. It could also contain a set of recommendations about how to proceed to implement the software in one or more applications, with an initial approach, resourcing recommendations and prioritization of potential applications. The level of detail will vary, depending on how much effort went into the self-knowledge phase.

These reports can be relatively informal or can follow any formal requirements that are in place. Reports should provide both the history of and justification for any final conclusions and decisions.
When you take a platform approach to text analytics right from the start, your solution will continue to deliver value over time, across your organization.
A Simple Solution?
This might seem like a very involved and complex process, and the question often comes up: Isn't there an easier way? Or isn't there one product that is better than all the others?

First, no, there is not an easier way. You could try setting up some sort of number generator with the randomly placed names of all the text analytics vendors on your "wheel of future text analytics fortunes" spreadsheet and spin the wheel. Or you could ask a friend, but how many people have the kind of friend who has done it before and happens to have exactly the same content, scenarios and use cases as your organization?

Second, there is not one product that is better in all ways for all customer environments. It would be nice if there were, but the reality is that different environments sometimes call for different solutions. For example, an organization may be primarily interested in developing products for resale in the voice-of-the-customer space. In that case, its best fit might be a vendor that has spent the last five years developing built-in customer intelligence reporting capabilities that are available out of the box.
We've known for years that the way to maximize value from unstructured content is to add more structure to it. With sophisticated text analytics software, we can add structure faster, more effectively and with better quality.
About SAS
SAS is the leader in business analytics software and services, and the largest independent vendor in the business intelligence market. Through innovative solutions, SAS helps customers at more than 55,000 sites improve performance and deliver value by making better decisions faster. Since 1976, SAS has been giving customers around the world THE POWER TO KNOW. For more information on SAS Business Analytics software and services, visit sas.com.