WHITE PAPER
Table of Contents
Introduction
How Does Text Analytics Work?
Text Analytics Applications
Deciding on Text Analytics Software: The Process
  Self-Knowledge Phase
  Filter Phase
  Proof-of-Concept Phase
The POC Process for Text Analytics Technology: Key Issues
  Stage One: Preparation
  Stage Two: Development
    Key considerations for categorization
    Developing extraction catalogs
  Stage Four: Report
A Simple Solution?
Conclusion: Getting the Most from Your Investment
Tom Reamy is the Chief Knowledge Architect and founder of KAPS Group, a group of knowledge architecture, taxonomy and text analytics consultants. Reamy has 20 years of experience in information architecture, enterprise search, intranet management and consulting, education software and text analytics consulting. His academic background includes a master's in the history of ideas, research in artificial intelligence and cognitive science, and a strong focus in philosophy, particularly epistemology. He has published articles in various journals and is a frequent speaker at knowledge management conferences. When not writing or developing knowledge management projects, Reamy can usually be found at the bottom of the ocean in Carmel, CA, taking photos of strange creatures.
Introduction
Although it is a fast-growing area, text analytics is still new to most organizations. Many are looking for help to understand exactly what text analytics can do for their business and how to choose a platform and vendor that work best for them. This paper describes what text analytics is, how it works and why it is so valuable across many different organizational areas. It is also intended to give guidelines when evaluating various text analytics technologies and vendors.
Text analytics includes four basic text-handling capabilities: text extraction, categorization, sentiment analysis and summarization.
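To make these four capabilities concrete, here is a deliberately toy sketch of each one using naive keyword matching. All entity names, category terms and sentiment words below are invented for illustration; real platforms use linguistic rules and statistical models rather than simple word lookups.

```python
# Toy illustration of the four text analytics capabilities:
# extraction, categorization, sentiment analysis and summarization.
# All term lists are hypothetical examples, not a real catalog.

ENTITIES = {"SAS", "Ford"}                                        # extraction catalog
CATEGORIES = {"telecom": {"bandwidth", "billing"},                # taxonomy terms
              "finance": {"loan", "rates"}}
POSITIVE, NEGATIVE = {"great", "love"}, {"poor", "hate"}          # sentiment lexicons

def analyze(text):
    words = text.split()
    lower = {w.lower().strip(".,") for w in words}
    return {
        # extraction: pull out known named entities
        "entities": [w for w in words if w.strip(".,") in ENTITIES],
        # categorization: assign categories whose terms appear in the text
        "categories": [c for c, terms in CATEGORIES.items() if lower & terms],
        # sentiment: positive-minus-negative word count
        "sentiment": len(lower & POSITIVE) - len(lower & NEGATIVE),
        # summarization: crude leading-words summary
        "summary": " ".join(words[:8]) + ("..." if len(words) > 8 else ""),
    }

print(analyze("Ford billing support is great but bandwidth is poor"))
```

Even this crude version hints at why real evaluation is hard: "Ford" matched as an entity could be a company, a person or a car, which is exactly the disambiguation problem discussed later in this paper.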
[Figure 1 diagram: Self-knowledge → Filter → Proof of Concept (Preparation, Development, Results, Report) → Vendor Selection]
Figure 1: Before choosing a vendor, most companies go through three major phases and four sub-phases in the process of evaluating text analytics solutions.
Self-Knowledge Phase
Too many decision makers decide suddenly that they need to jump on the social media or text analytics bandwagon, and then try to pick a vendor without really understanding what business value they are looking for. A good evaluation process starts with a deep dive into what text analytics might mean to your organization. This deep dive includes:

• Understanding the strategic and business context for text analytics. For example, how does information flow within specific business processes? Is it mostly when you write large Word documents where research is done as a formal activity, or is it when research is done on the fly as documents are being written? For every company the answers will be different, but the main task for everyone is to map out the relative strategic importance of each type of information or business process flow.

• Deciding what your information problems are. You must decide what, how severe and how critical each information problem is to your organization. For example, do your problems mostly relate to the difficulty of finding information within your company, or do they relate to the inability to understand what your customers are saying about your products, or to the need to find better patent information?

• Asking strategic questions. You need to ask why you need text analytics, what value you get from the taxonomy or text analytics, and how you are going to use it. This will involve getting an idea of how much money and time you are currently losing to your information problems and understanding how a text analytics solution will help. This can be done abstractly (by applying the results of analyst research); by doing actual studies or surveys to determine how much productivity is lost now; or by calculating how much profit you think a new text analytics application will generate.
• Determining what content and content resources you have. This will involve answering questions such as: What is the mix of unstructured content and database content? Does most of the unstructured content live in a content management system, or is it distributed on file shares? Is the content mostly just business content, or do you have large collections of topical content such as biological research results? Do you have existing taxonomies or glossaries, or even just good overview books with good chapter structure?

• Assessing your technology environment and how text analytics will integrate with it. This will involve answering questions such as: Is SharePoint a major part of your technology environment? Do you have well-integrated technology, or does each department or division have its own technology? Do multiple programs have to share information? Do you have multiple search engines within the organization, and how integrated are they?

Answering these questions can be done during a formal two- to four-week process, or as an informal set of research and discussion activities.1 This new self-knowledge needs to be documented, describing the extent of the potential value and effect text analytics could have on your organization.
The only way to really understand a text analytics solution is by doing a proof of concept that tests with your content, your scenarios and your people.
Filter Phase
The filter phase is the one that most resembles traditional software evaluation. It consists of activities such as:

• Market research into the company's reputation, history and projected future.

• Technology research into the underlying technology behind the software, so that you can decide how it might integrate with your existing environment.

• A feature scorecard with a focus on minimum features, must-have features and an understanding of how those features matter to your organization. These features can include general software features such as price, usability and editing, but can also include comparisons of how well the basic text analytics features (text extraction and categorization) are implemented.

These traditional software evaluation activities can produce a scorecard, but this scorecard should be thought of as a filter to eliminate offerings that don't fit your needs, not as a final scorecard that you use to select your software. This phase should reduce the number of viable alternatives to a short list. Then, you can invite those vendors to do extended demos of one to three hours each.

Why not simply base the decision on features? First, because software features change. But more importantly, because your content is unique. So the real issue is to find features that are useful in understanding your materials.
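A filter-phase scorecard can be as simple as weighted feature scores. The sketch below shows one way to compute such a scorecard; the feature names, weights and vendor scores are entirely invented for illustration, and the ranking should be read as an elimination filter, not a final selection.

```python
# Illustrative weighted feature scorecard for the filter phase.
# Weights and scores (1-5 scales) are hypothetical examples.

WEIGHTS = {"price": 2, "usability": 3, "categorization": 5, "extraction": 4}

vendors = {
    "Vendor A": {"price": 4, "usability": 3, "categorization": 5, "extraction": 4},
    "Vendor B": {"price": 5, "usability": 4, "categorization": 2, "extraction": 3},
}

def weighted_score(scores):
    """Sum of feature scores, each multiplied by its organizational weight."""
    return sum(WEIGHTS[feature] * score for feature, score in scores.items())

# Rank vendors by total weighted score, highest first
for name, scores in sorted(vendors.items(), key=lambda kv: -weighted_score(kv[1])):
    print(name, weighted_score(scores))
```

Note that the weights encode your organization's priorities (here, categorization quality outweighs price), which is exactly why a generic analyst scorecard cannot substitute for your own.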
1 Such processes may be called a Readiness Assessment, performed by vendors or third parties like the KAPS Group.
Overall, the filter phase should reduce the number of candidates to between two and four. These are the candidates you will consider in the next phase. In some cases, you might even be able to reduce your candidates to one clear leader. But even if this happens, it still makes sense to do the last phase described below. In some cases, a company could start with a preferred vendor because of an ongoing relationship, or on the basis of a trusted recommendation or some other reason. In these cases, it still makes sense to do the next phase, but with a different focus to make sure that the text analytics offering works in your environment.
Proof-of-Concept Phase
The proof-of-concept (POC) phase is the most important of the entire text analytics evaluation. That is because text analytics is all about language and semantics and how people think and express their thoughts. The only way to really understand that is to test with your content, your scenarios and your people. A basic approach to a POC is to set it up as a contest, of sorts, between the top two or three vendors.

POCs are needed because the complexity of language demands that you look beyond simple out-of-the-box (OOB) capabilities. The key question is not how well a vendor can set up a demo with carefully selected content and scenarios, but how well those capabilities can be refined through two or more develop-test-refine cycles. This is what will really tell you if the software can solve the information problems you uncovered in the self-knowledge phase.

A POC will also answer another critical question: How much effort will it take to get to acceptable levels of accuracy? For example, some vendors have expended a lot of effort to get better OOB results with built-in semantic networks, large multiple dictionaries and the like. While those resources make the product look good in an initial comparison, the real question is how much effort will be required to achieve the 90 percent accuracy rate you may have set as your goal. For example, let's say a product can determine OOB that specific content is about telecommunications. That doesn't really tell you much if you are a telecommunications company and almost all of your content is about telecommunications. And it can often take more time and effort to go from telecommunications to specific concepts (like bill plans) than it would take to reach those levels from scratch with another product.

Another question that a POC can answer is how well you can establish a working relationship with the vendor.
And from the vendor's perspective, the POC can uncover any special issues that you need to have addressed, so the vendor can work out solutions while you are still in a relatively forgiving research frame of mind.
One of the most valuable aspects of a POC is that it gives you a head start in development with the support of the vendors. This is true even in a case where the initial selection was reduced to one. The POC creates a foundation for your initial project as well as for any future projects. This foundation consists of both the actual development of taxonomies and rules for categorization, sentiment and extraction. But just as important, it provides on-the-job training for your internal resources (taxonomists, text analytics developers and others) under the guidance of experts. Training by doing, in this case, is by far the best and cheapest way to train your internal resources so they can take over after the initial POC. A second benefit of using a POC for initial development is that it allows you to build the right kind of foundation one that is designed from the beginning to be a platform technology that can support multiple applications. This keeps you from getting caught in the trap of thinking about text analytics just in terms of your first project a sure way to not get the maximum benefit from text analytics software. In addition to getting the maximum direct value from your investment, this approach can also enable you to integrate text analytics with other advanced analytic technologies like text mining, data mining and predictive analytics.
2 If you don't have experienced personnel, most vendors will have a range of consultants and partners who can help you bridge this interim skills gap.
Another key consideration is the selection and recruitment of the internal resources who will participate in the POC. These include subject-matter experts (SMEs) who will select and categorize appropriate content for each individual category and act as expert evaluators of the success of the POC categorization and other scenarios. Others who might need to be included are technical people who can support the technical aspects of the POC, and business users who can generate use case scenarios and also evaluate the text analytics success.

Another important task for the preparation phase is to identify or develop a taxonomy for use with the categorization portion of the POC. Categorization requires a taxonomy as its organizing structure. It need not be a big taxonomy; it can often just be a list of important concepts. But if you have a large taxonomy (as biopharmaceutical companies or government and military organizations often do), you should select a small subset of the overall taxonomy and focus on getting good results, not complete coverage.

Once you have defined the use case scenarios for the evaluation, the next step is to map those to specific text analytics capabilities and then develop tests for each one. This will vary from organization to organization, but a general suggestion is to develop a set of small extraction catalogs that include both named entities and rule-based functionality, and test them on your selected content. The other primary test case(s) will be categorization and/or sentiment. During the preparation phase, you will also need to determine what accuracy level you will aim for.
This entire process can be repeated. But if time or resources are short, or if you get good results against the new content, then this is as far as you need go.
Disambiguation is something that can and should be tested. Disambiguation is the ability to distinguish between words that look the same but mean something different, or between two words that are different but mean the same thing. The latter case is usually relatively easy to handle through the development of extended synonyms. But the first case often calls for much more sophisticated rules that take context into consideration. For example, Ford can refer to a person, a car or a company (or in some contexts, a fictional person). To distinguish which is being referred to in a particular text, you must be able to incorporate multiple levels of context, from the type of work (fiction, newspaper, economic analysis) to the types of words in the document, the paragraph or the sentence. Because you need this level of disambiguation even for sentiment applications, it is important to look at the categorization functionality of each offering. It is the underlying categorization capability that will typically be used for disambiguation.
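The Ford example above can be sketched as a simple context-scoring rule. The context-term lists below are hypothetical, and real text analytics platforms express this logic as categorization rules rather than Python code, but the principle is the same: pick the sense whose context clues best match the surrounding words.

```python
# Minimal sketch of context-based disambiguation for an ambiguous term.
# The clue lists for each sense are invented for illustration.

CONTEXT_CLUES = {
    "company": {"motor", "shares", "ceo", "earnings", "dealership"},
    "person":  {"gerald", "president", "harrison", "actor"},
    "car":     {"mustang", "f-150", "engine", "drove", "sedan"},
}

def disambiguate(term, sentence):
    """Pick the sense whose context clues overlap most with the sentence."""
    words = set(sentence.lower().split())
    scores = {sense: len(words & clues) for sense, clues in CONTEXT_CLUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(disambiguate("Ford", "President Gerald Ford addressed the nation"))  # person
print(disambiguate("Ford", "Ford shares rose after the CEO spoke"))        # company
```

A production rule would also weight clues by proximity (same sentence versus same document) and by document type, which is exactly the "multiple levels of context" the text describes.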
3 For an in-depth description of the develop-test-refine cycle, see: Enterprise Content Categorization: How to Successfully Choose, Develop and Implement a Semantic Strategy. Available at: sas.com/reg/wp/corp/25624.
Also, keep in mind that the right balance between recall and precision is dependent on the particular application. For example, in a discovery application in which humans will be reviewing the results, recall is the most important measure. But for an automated application that is exposed to users, precision is often the most important, because too many false positives will cause users to lose faith in the application. Recall and precision are normally applied to categorization, but they can also be applied to extraction, with the focus on specific entities instead of documents.
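Recall and precision are straightforward to compute once SMEs have hand-labeled an evaluation set. The sketch below shows the calculation; the document IDs are invented placeholders for your own judged test content.

```python
# Precision and recall for a categorization test against an SME-judged set.
# Document IDs are hypothetical examples.

def precision_recall(predicted, relevant):
    """predicted/relevant are sets of document IDs for one category."""
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"doc1", "doc2", "doc3", "doc4", "doc5"}   # SMEs judged these on-topic
predicted = {"doc2", "doc3", "doc4", "doc6"}          # what the categorizer returned

p, r = precision_recall(predicted, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.60
```

In this toy case the categorizer is precise but misses documents; per the text, a discovery application might accept that precision-heavy balance, while an automated user-facing application would need false positives (here, doc6) driven down further.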
The third key consideration, knowing that the develop-test-refine cycle is not a linear process, is extremely important for an overall evaluation of the project. For example, you may be looking at only 30 percent accuracy after one round, which seems so poor that the entire idea is questionable. Or it may be that after one round, one vendor is way ahead, with accuracy of roughly 80 percent versus 50 percent. In the first case, project owners may be thinking that it took two weeks to get 30 percent accuracy, so they assume it will take another four weeks to get up to 80 or 90 percent, when in fact a particular category can go from 30 percent accuracy to 90 percent in one hour with a simple addition or deletion to the rule. In the second case, the relative scores could easily be an artifact of experience with the software or inexperience with the particular subject matter, which could be easily reversed with a second round of effort. This scenario also highlights the importance of doing at least two rounds of development and testing.
To get accurate results, it's important to do at least two rounds of development and testing.
• Initial evaluation. This part of the report should describe the research and thinking that went into the initial evaluation of the entire vendor space. It should then review the outcomes and describe the initial high-level conclusions. An interim version would also contain any unresolved discussion points from that phase. This section would end with a description of, and justification for, the recommendations from that phase.

• Proof of concept. This section typically describes the methodology employed during the POC, describes and interprets the results, and presents the final conclusions. The interim version would also lay out the remaining discussion points, while the final version would contain the results of those discussions.

• Final recommendations. This section could be as simple as listing the final vendor selection and the justifications for that selection. It could also contain a set of recommendations about how to proceed to implement the software in one or more applications, with an initial approach, resourcing recommendations and prioritization of potential applications. The level of detail will vary, depending on how much effort went into the self-knowledge phase.

These reports can be relatively informal or can follow any formal requirements that are in place. Reports should provide both the history of and justification for any final conclusions and decisions.
When you take a platform approach to text analytics right from the start, your solution will continue to deliver value over time, across your organization.
A Simple Solution?
This might seem like a very involved and complex process, and the question often comes up: Isn't there an easier way? Or isn't there one product that is better than all the others?

First, no, there is not an easier way. You could try setting up some sort of number generator with the randomly placed names of all the text analytics vendors on your "wheel of future text analytics fortunes" spreadsheet and spin the wheel. Or you could ask a friend, but how many people have the kind of friend who has done it before and happens to have exactly the same content, scenarios and use cases as your organization?

Second, there is not one product that is better in all ways for all customer environments. It would be nice if there were, but the reality is that different environments sometimes call for different solutions. For example, an organization may be primarily interested in developing products for resale in the voice-of-the-customer space. In that case, its best fit might be a vendor that has spent the last five years developing built-in customer intelligence reporting capabilities that are available out of the box.
We've known for years that the way to maximize value from unstructured content is to add more structure to it. With sophisticated text analytics software, we can add structure faster, more effectively and with better quality.
About SAS
SAS is the leader in business analytics software and services, and the largest independent vendor in the business intelligence market. Through innovative solutions, SAS helps customers at more than 55,000 sites improve performance and deliver value by making better decisions faster. Since 1976, SAS has been giving customers around the world THE POWER TO KNOW. For more information on SAS Business Analytics software and services, visit sas.com.