
Text Analytics Using Oracle Text

Roger Ford
Principal Product Manager

Asha Tarachandani
Development Manager
Hands-On Lab - Text Analytics using Oracle Text

Table Of Contents
Hands-On Lab - Text Analytics using Oracle Text
Table Of Contents
Login Information
Introduction
Prerequisites
How to use this document
Part 1: Theme Indexes
    Purpose
    Time to Complete
    Topics
    Prerequisites
    Starting SQL Developer or SQL*Plus
    Creating a table
    Inserting some data in our table
    Creating a theme index on our table
    Searching for Themes
    What have we learned?
    Common Questions and Answers
Part 2: Extracting Themes and Gists
    Purpose
    Time to Complete
    Topics
    Prerequisites
    Creating and populating the sample table
    Creating an index for theme extraction
    Creating a themes table
    Gist Extraction
    What have we learned?
Part 3: Named Entity Extraction
    Purpose
    Time to Complete
    Topics
    Method
    Preparation
    Lesson 1: Simple Extraction
    Lesson 2: More Entities
    Lesson 3: Selective Extraction
    Lesson 4: Creating a Custom Entity Rule
    Lesson 5: Adding a new Dictionary
    What have we learned?
Part 4: Classification and Clustering
    Purpose
    Time to Complete
    Classification
    What have we learned?
    Clustering
    What have we learned?
Conclusion
Further Reading
Login Information

Linux system username: oracle
Linux system password: welcome1
Database username:     hol
Database password:     hol

Introduction
Oracle Text is well-known as the text searching engine within Oracle Database 12c. Less well known are its text
analytics capabilities.

What do we mean by text analytics? Also known as "text mining", this is the process of deriving high quality
information from text through linguistic and statistical analysis of the text.

This lab takes you through the basic analytic capabilities of Oracle Text in Oracle Database 12c.

This is a self-paced lab. You can run through the exercises yourself with reference to this document. There are
several highly knowledgeable helpers available, so please don't be afraid to ask for help at any point if you get
stuck or need further explanation of some feature.

Prerequisites
To participate effectively in this Hands On Lab, you need some familiarity with SQL and Oracle's SQL tools -
either the command-line SQL*Plus or SQL Developer.

How to use this document


This document is intended to guide you through the exercises. In each section, there will be a description of what
you must do, followed by a screenshot or image showing you what needs to be done. Remember, if the description
is not clear, please look at the image after the description.

In general, things you need to type are emphasized in "red" (don't type the quotes). Output that you would expect
to receive is in a black mono-spaced font.

When following the examples, you are most likely to remember the lessons if you type the text yourself. However,
experience shows that nearly all problems with hands-on labs are due to people mistyping things.

So I would suggest the following:

1. If there are only one to three lines of text and you are a reasonably good typist, then type the lines yourself
with reference to this document.
2. If you hit errors, retry by cutting-and-pasting from this document.
3. If all else fails, there are scripts available in the "HOL" folder and you can run them directly, or cut-and-
paste from them.

In general this document does not list the "drop table", "drop index" etc. commands necessary to repeat an
exercise. They should all be obvious, but if not, please ask a lab helper for help. All the scripts do contain the
"drop" statements, so you could also look there for them. One side-effect of that is that if you run the scripts
directly you will usually get errors the first time they are run, since they try to drop objects that don't yet exist.
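If you want to reset and repeat an exercise, the cleanup statements look like the following sketch (the object names match those used later in this lab; adjust them if you chose different names):

```sql
-- Remove the sample objects so an exercise can be re-run from scratch.
-- Each statement raises an error if the object doesn't exist yet;
-- that is harmless and can be ignored.
DROP INDEX docs_index;
DROP TABLE docs;
DROP TABLE themes_table;
DROP TABLE gists_table;
```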

Those completely unfamiliar with Oracle Text may find there is not enough time in the lab to complete all the
sections in this book - if that is the case don't worry, just complete what you can. The more advanced topics are
intended for users who already have some familiarity with Oracle Text.
Part 1: Theme Indexes
Purpose

This tutorial introduces you to the basics of Oracle Text theme indexes and queries. It can all be run from
SQL*Plus or SQL Developer and requires no additional files.

Time to Complete

Approximately 15 minutes

Topics

This tutorial creates a simple text document and a theme index on it. It covers the following topics:

 Creating a table with a simple text document
 Creating a Theme Index on that table
 Examining the themes that were found in the document

The scripts for this section are in the directory /home/oracle/hol/ThemeIndexes

Prerequisites

Before starting this tutorial you should be logged into your system. You should be logged on already, but if for any
reason you get logged off, the username is "oracle" and the password is "welcome1".

Starting SQL Developer or SQL*Plus

To start SQL Developer:


1. Double-click on the SQL Developer icon on your desktop. Then double-click on the "Hands On Lab"
connection.

To start SQL*Plus:
1. Double-click on the Terminal icon on your desktop
2. Type "sqlplus hol/hol"

Tech note: sqlplus is aliased to "rlwrap sqlplus". rlwrap allows command recall and editing which is not usually
possible with the Linux command line version of SQL*Plus. If you don't like it, or it causes problems, just type
"unalias sqlplus". rlwrap is open source software, available from: http://utopia.knoware.nl/~hlub/rlwrap/

Creating a table

Create a table called DOCS. It should have a single column called TEXT which is
VARCHAR2(2000). You can do this using "Tables (Right-Click) -> New Table" in SQL Developer or by running the
SQL statement:
CREATE TABLE docs (text VARCHAR2(2000));

in either SQL*Plus or SQL*Developer. Don't forget to add a semi-colon or a "/" on a new line for SQL*Plus.
From here on, we're just going to show the SQL as run in SQL*Plus - you can run the same SQL from SQL
Developer if you like, or make use of the GUI features (eg to load a table) if you want.

Inserting some data in our table

Load the table with some simple data. (Note: you don't have to use this exact text, but it is strongly
recommended since later examples build on the same text.)

INSERT INTO docs VALUES ('
The interest rate for deposit accounts has not increased for several years.
This indicates a poor rate of return for investors.
');

Creating a theme index on our table

Next we're going to create an index - which includes themes - on our table. Before we do that, though, let's talk a
little bit more about themes and Oracle Text.
 A theme is "something that a document is about". A document cannot be about "invested" – that doesn't
make sense, grammatically. However, the document could be about "investors" or "investing".
 Themes are derived from the Oracle Text Knowledge Base – a large body of knowledge (technically, an
ontology) about the English language.
 The Knowledge Base is loaded with the database "Examples" pack. We have already installed it for this
Hands On Lab. However, when working with your own database later, if you have not downloaded and
installed the Examples, you will not get theme indexes by default, and if you try to repeat exercises from
this lab on your own machine you may get errors.

OK, so let's create an index:

CREATE INDEX docs_index ON docs(text) INDEXTYPE IS CTXSYS.CONTEXT;

If you're not currently familiar with Oracle Text, you might be asking "what does that do?" Like any other
CREATE INDEX statement, we have specified the table and column on which to create the index, but we have
told it to use a special "indextype" of CTXSYS.CONTEXT.

Indextypes are part of the Extensibility Framework of the Oracle kernel, and CTXSYS.CONTEXT is an extensible
(or DOMAIN ) index used by Oracle Text. There are other indextypes in Oracle Text, but CONTEXT is the most
common type and the only one we shall consider here.

When Oracle Text creates an index, it creates several special tables to hold index data and metadata, the most
interesting of which is the so-called "Dollar-I" table. Its name is derived from the index name, and here it will be
called DR$DOCS_INDEX$I.
We can describe it:

SQL> describe dr$docs_index$i


Name Null? Type
------------------------------- -------- ----------------------------
TOKEN_TEXT NOT NULL VARCHAR2(64)
TOKEN_TYPE NOT NULL NUMBER(10)
TOKEN_FIRST NOT NULL NUMBER(10)
TOKEN_LAST NOT NULL NUMBER(10)
TOKEN_COUNT NOT NULL NUMBER(10)
TOKEN_INFO BLOB

Of these columns, we're most interested in TOKEN_TEXT (the indexed words from your documents) and
TOKEN_TYPE (the type of word indexed). Let's examine them:

SELECT token_type, token_text FROM dr$docs_index$i;

The output looks like this (we've abbreviated it – you'll see more words)

TOKEN_TYPE TOKEN_TEXT
---------- ----------------------------------------------------------------
0 ACCOUNTS
0 DEPOSIT
0 INCREASED
0 INDICATES
0 INTEREST

Note that all the TOKEN_TYPE values are 0. Those are "ordinary indexed words" because we created a basic
index. Next we'll tell it we want themes indexed as well. Drop the index…

DROP INDEX docs_index;

And recreate it, this time creating an index preference telling the lexer (process which selects words from text) to
include themes. The index preference is created using the CTX_DDL package, and then specified using the
PARAMETERS clause of the create index statement. A user needs the CTXAPP role to use the CTX_DDL
package.

exec ctx_ddl.create_preference( 'my_lexer', 'BASIC_LEXER' )
exec ctx_ddl.set_attribute ( 'my_lexer', 'INDEX_THEMES', 'YES' )

CREATE INDEX docs_index ON docs(text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ( 'lexer my_lexer' );

Now we can select the tokens again:

SELECT token_type, token_text FROM dr$docs_index$i;


And this time we'll see (abbreviated again)

TOKEN_TYPE TOKEN_TEXT
---------- ----------------------------------------------------------------
0 ACCOUNTS
0 DEPOSIT
0 INCREASED
0 INDICATES
0 INTEREST

TOKEN_TYPE TOKEN_TEXT
---------- ----------------------------------------------------------------
1 accountability
1 business and economics
1 depositaries
1 financial investments
1 financial lending
1 general investment
1 investors
1 poverty

TOKEN_TYPE = 0 are the ordinary indexed words – words that appear in the text. Note that "stopwords" such as
the, for, has are not indexed.

TOKEN_TYPE=1 are indexed themes, that is, things that the document is "about". In general, these words and
phrases will not have appeared in the original text, but will have been derived from it.

Searching for Themes

Let's remind ourselves about the syntax for a "standard" Oracle Text query. We use the contains function in a
query, and provide it with the name of the indexed column, and the search expression:

SQL> SELECT * FROM docs WHERE contains( text, 'interest' ) > 0;

TEXT
-------------------------------------------------------------------------

The interest rate for deposit accounts has not increased for several years.
This indicates a poor rate of return for investors.

To do a theme search, we use a similar query, but use the ABOUT operator inside the contains:

SQL> SELECT * FROM docs WHERE contains(text, 'about(general investment)')>0;

TEXT
---------------------------------------------------------------------------
The interest rate for deposit accounts has not increased for several years.
This indicates a poor rate of return for investors.

What have we learned?


1. A theme is something that a document is "about".
2. Themes are found using the built-in Knowledge Base.
3. Themes can be indexed as well as normal words.
4. Themes can be searched for using the ABOUT operator within the CONTAINS clause.

Common Questions and Answers

1. Q: Why don't I get any themes indexed?
   A: You probably haven't installed the database Examples package. Please refer to the standard download
   location for the database version you are using.
2. Q: Can the Knowledge Base be extended?
   A: Yes, absolutely. You can create a thesaurus, then compile it into the Knowledge Base using the ctxkbtc
   (ctx Knowledge Base Thesaurus Compiler) command-line tool.
3. Q: Does this work for languages other than English?
   A: The Knowledge Base is supplied for English and French only. You can, however, compile extensions
   to it in other languages.
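For reference, extending the Knowledge Base looks roughly like the sketch below. The thesaurus name, file name and passwords are illustrative, not part of this lab; ctxkbtc is run as the CTXSYS user:

```shell
# Sketch only - names, file and passwords are illustrative.
# A thesaurus file (finance.ths) links new terms to existing
# Knowledge Base concepts using standard thesaurus relations, e.g.:
#     cryptocurrency
#       BT financial investments
# 1. Load the thesaurus into the database:
ctxload -user hol/hol -thes -name finance -file finance.ths
# 2. Compile it into the Knowledge Base (as CTXSYS):
ctxkbtc -user ctxsys/password -name finance
```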
Part 2: Extracting Themes and Gists

Purpose

This tutorial looks further into themes, showing you how to extract them for a single document. It also looks at
gists, or document summaries.

Time to Complete

Approximately 10 minutes

Topics

This tutorial covers the following topics

 Fetching themes for a document
 Getting "full" themes showing the hierarchy within the Knowledge Base
 Getting generic and point-of-view gists for a document

Prerequisites

None

Creating and populating the sample table

We're going to create a table similar to the previous example, but this time it needs a primary key column:

drop table docs;

create table docs (id number primary key, text varchar2(2000));

insert into docs values (1, '
The interest rate for deposit accounts has not increased for several years.
This indicates a poor rate of return for investors.
');

Creating an index for theme extraction

Before we can extract themes using CTX_DOC.THEMES, we need an index. That index only really tells us where
to find the text and how to process it, so it can be a non-populated or NOPOPULATE index. This takes almost no
time to create. (An alternative is to use POLICY_THEMES which doesn't require an index but that's beyond the
scope of this lab).
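For completeness, the policy-based alternative looks roughly like the sketch below. It is not used in this lab and the policy name is illustrative; the themes come back in a PL/SQL table rather than a database table:

```sql
-- Sketch: extracting themes without an index, via a policy.
-- (Not used in this lab; the policy name is illustrative.)
exec ctx_ddl.create_policy( 'my_policy' )

set serveroutput on

declare
  the_themes ctx_doc.theme_tab;
begin
  ctx_doc.policy_themes(
    policy_name => 'my_policy',
    document    => 'The interest rate for deposit accounts has not increased.',
    restab      => the_themes );
  -- Print each theme with its weight.
  for i in 1 .. the_themes.count loop
    dbms_output.put_line( the_themes(i).theme || ' : ' || the_themes(i).weight );
  end loop;
end;
/
```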

Create the index with

create index docs_index on docs(text) indextype is ctxsys.context
  parameters ('nopopulate');

Creating a themes table

We're going to extract the themes from that document, and we need somewhere to put them. We'll create a table
called "THEMES_TABLE" thus:

create table themes_table ( query_id number, theme varchar2(2000),
  weight number );

Now we will use the ctx_doc package to get themes from the docs table into the themes_table table. We specify
the table containing our docs, the index name, and the primary key value for the row (in the textkey parameter).
We could use the rowid instead of the primary key value – see ctx_doc.set_key_type in the documentation.
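Switching to rowid keys is a one-line change; a sketch (not needed for this lab, shown for reference only):

```sql
-- Sketch: tell the ctx_doc package to accept rowids rather than
-- primary-key values in its textkey parameters.
exec ctx_doc.set_key_type( 'ROWID' )
```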

begin
ctx_doc.themes (
index_name => 'docs_index',
restab => 'themes_table',
textkey => 1,
full_themes => FALSE );
end;
/

Let's look at what got put there. If you're using SQL*Plus, the following will make the output cleaner:

column theme format a60

Now select the themes from themes_table:

select theme, weight from themes_table order by weight desc;

The output of this is:

THEME WEIGHT
----------------------------------------- ----------
depositaries 40
accountability 40
indication 39
poverty 39
rate of return 33
interest rates 32
increase 32
investors 26
years 9

9 rows selected.

Note that these are similar to the themes we saw indexed before – but we're getting more information because we
can now see how important each theme is for the document, by referencing the WEIGHT column.

We can get even more information by extracting FULL themes as below. Note the full_themes parameter is now
set to "true".

delete from themes_table;

begin
ctx_doc.themes (
index_name => 'docs_index',
restab => 'themes_table',
textkey => 1,
full_themes => TRUE );
end;
/

select theme, weight from themes_table order by weight desc;

The output is:

THEME WEIGHT
----------------------------------------------------------- ----------
:depositaries: 40
:accountability: 40
:indication: 39
:poverty: 39
:rate of return:financial investments:business and economics: 33
:interest rates:financial lending:business and economics: 32
:increase: 32
:investors:general investment:financial investments:business and economics:  26
:years: 9

9 rows selected.

You will notice that some themes are "solitary" and some are part of a hierarchy. The "solitary" themes are
unproved. That means they were found in the text, but there were no other related terms around them to back them
up. Proved themes can be located properly in the Knowledge Base hierarchy due to the presence of related terms
nearby.
Gist Extraction

A gist is a summary of a document – usually a paragraph or two that best describes the overall content of the
document.

There are two types of gist:


1. The generic gist, which best describes the document as a whole
2. Point-of-view gists, which describe the document from the point of view of a particular theme

For this example to make sense, we need a lot more text. Cut and paste the following, or use the script in
/home/oracle/hol/ThemesAndGists/get_gists.sql

drop table docs;

create table docs (id number primary key, text varchar2(4000));

insert into docs values (1, '
Baby Cambridge''s first day out to Granny Carole''s: Proud parents Kate and
William leave Kensington Palace for Bucklebury with son after visit from
Queen and Harry.'||chr(10)||'
Baby Cambridge was taken to see his grandparents on his first afternoon out
today as it was revealed the Queen and Prince Harry have now met the royal
baby for the first time.'||chr(10)||'
Kate and William smiled broadly and waved from their car as they were driven
away from Kensington Palace by security, where they had spent their first
night together as a family.'||chr(10)||'
Both the Duke and Duchess looked happy and fresh-faced, with William sat in
the front passenger seat and Kate in the back with their child in his baby
seat.'||chr(10)||'
They left their west London home shortly after the Queen had visited her new
great-grandson and potentially discussed names with the parents, as the
guessing game over what he will be called continues.'||chr(10)||'
Kensington Palace officials would not confirm where the young family were
going this afternoon, but their black Land Rover was later seen arriving at
grandparents Carole and Michael Middleton''s mansion in Bucklebury,
Berkshire.'||chr(10)||'
Her Majesty, who will travel to Balmoral for her summer holiday on Friday,
spent 30 minutes with the Duke, Duchess and Baby Cambridge from
11am.'||chr(10)||'
It has also emerged that Prince Harry, who is said to be thrilled to have
become an uncle, may have been there last night after the trio left
hospital, having raced back to London from Wattisham airbase in Suffolk
where he is on duty with the RAF.'||chr(10)||'
It is understood that James Middleton and Pippa Middleton were also at
Kensington Palace yesterday evening.'||chr(10)||'
');
Create an index on that table. As before, the index does not need to be populated:

create index docs_index on docs(text) indextype is ctxsys.context
  parameters ('nopopulate');

Now create a gists table and fetch the gists into it:

create table gists_table ( query_id number, pov varchar2(80), gist clob );

begin
ctx_doc.gist (
index_name => 'docs_index',
textkey => '1',
restab => 'gists_table' );
end;
/

And select the gists from it. If you're using SQL*Plus:

set pagesize 60
set long 50000

First get the "generic" gist

select gist from gists_table where pov = 'GENERIC';

Then take a look at all the point-of-view gists:

select upper(pov), gist from gists_table;

Here are a couple of examples from the output:

FAMILY
Kate and William smiled broadly and waved from their car as they were driven
away from Kensington Palace by security, where they had spent their first
night together as a family.

GRANDPARENTS
Baby Cambridge was taken to see his grandparents on his first afternoon out
today as it was revealed the Queen and Prince Harry have now met the royal
baby for the first time.
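If you only want the gist for one particular theme, ctx_doc.gist accepts a pov parameter. A sketch, reusing the index and result table created above:

```sql
-- Sketch: fetch only the point-of-view gist for a single theme.
-- Assumes the docs table, docs_index and gists_table created above.
begin
  ctx_doc.gist (
    index_name => 'docs_index',
    textkey    => '1',
    restab     => 'gists_table',
    pov        => 'GRANDPARENTS' );
end;
/
```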

What have we learned?


1. Themes and gists can be extracted from documents.
2. An index is needed to specify options for the extraction, but it need not be populated.
3. Gists can be generic, or from a particular point-of-view.
Part 3: Named Entity Extraction
Purpose

This section shows you how to use the Entity Extraction technology, first introduced in Oracle Database 11g.

Time to Complete

Approximately 15 minutes

Topics

Entity extraction involves finding and extracting named entities within texts. These named entities include
things like people, places, job titles, dates, phone numbers and many other things.

 Simple extraction of all known entities
 Extracting specific types of entities

Method

This tutorial is somewhat different from the earlier ones. Since the commands involved are rather more
complicated, it is not feasible to type them all in. Instead, you will run pre-defined scripts and observe the results.
The scripts are best run in SQL*Plus. You can, if you really want to, run them in SQL Developer, but I would
strongly recommend SQL*Plus.

Entity extraction is XML-based: the configuration and the output are both XML. This makes it easy to process the
output mechanically, but it does complicate the process of demoing entity extraction.

Preparation
Open a terminal using the icon on your desktop. Change directory to the Entities directory under hol:

[oracle@localhost ~]$ cd /home/oracle/hol/Entities

Now open SQL*Plus. Use the "/nolog" parameter to prevent it prompting for login details:

[oracle@localhost Entities]$ sqlplus /nolog

SQL*Plus: Release 12.1.0.1.0 Production on Mon Aug 5 21:22:01 2013

SQL>

You should do this at the start of each lesson.

Lesson 1: Simple Extraction


At the SQL> prompt (see above) type @entities1

SQL> @entities1

The script will pause at each important stage so you can see what it is doing. Here is what is happening in the
entities1 script at each stage:

1. Connect as "system" and drop the demo user (if it exists – it won't first time round). Create the user
enttest and provide it with the necessary privileges
2. Connect as user enttest, create a table and insert a short "financial news" text document. Create a table
"entities" for inserting the found entities.
3. Create an "extraction policy". This is used to define what type of entities we want to find. In this case,
we're taking all the default options.
4. Create a short anonymous PL/SQL block which
a. Fetches the news doc into a PL/SQL variable
b. Runs the entity extraction procedure ctx_entity.extract to find all the entities in that variable
c. inserts the resulting XML output into another table for us to examine later
5. Examine the output from ctx_entity.extract. First we look at the full XML output. Then we apply some
"XML Database" magic (via the XMLTable function) to turn the output into a SQL table format.
6. A PL/SQL procedure is invoked which marks up the original document with the entities found. You
will see a URL listed – use "Left-Control Click" on the URL, or cut and paste it into your browser, to
see the document with highlighted entities. Mouse hover over each entity to see the type of entity and
why it was identified. Note that "New York" is identified as both a City and a State.

Close the browser when done and return to the terminal window
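The heart of the entities1 script is a sequence like the sketch below. It is simplified, and the policy, table and column names here are illustrative; see the script itself for the exact code:

```sql
-- Sketch of what entities1 does (simplified; names are illustrative).
exec ctx_entity.create_extract_policy( 'p1' )

declare
  mydoc   clob;
  results clob;
begin
  -- Fetch the news document into a PL/SQL variable.
  select text into mydoc from docs where id = 1;
  -- Find all entities in the document; the output arrives as XML.
  ctx_entity.extract( 'p1', mydoc, null, results );
  -- Save the XML for later examination (table shape is illustrative).
  insert into entities ( id, xml ) values ( 1, results );
end;
/
```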

Lesson 2: More Entities

If your terminal window is open at the SQL> prompt, you are ready to go. If not, follow the "preparation" section
above to open a terminal window and open SQL*Plus with the "/nolog" parameter.

Then type

SQL> @entities2

Again the script will pause at the end of each step. Here's what is done at each step:

1. The user from the previous lesson is dropped (to clear out all its tables) and recreated. A "docs" table is
created, and a much longer news article is inserted into it.
2. Create a default extraction policy
3. Run extraction on the document text, and save the extracted entities XML into the table "entities".
4. Show the extracted XML in raw XML format. Notice the various attributes associated with each entity.
5. Turn the extracted XML into a SQL table format for easier viewing. Note the wide variety of different
entity types found.
6. Generate a marked-up HTML file with entities displayed within the original text. Again, use CTRL-click
on the URL to open it in a browser, or cut-and-paste to the browser window.
Lesson 3: Selective Extraction

In this lesson we look at how we can specify the types of entities to extract, rather than having all the recognized
entities extracted. Proceed as before to get a SQL prompt and type:

SQL> @entities3

At each step:
1. Step 1 proceeds as in Lesson 2 until the actual extraction. This time ctx_entity.extract includes an
extra argument – the list of entity types to extract. Here we will only extract city and company
information.
2. We can see that the output only contains city and company entities.
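The entity-type filter is simply an extra argument to ctx_entity.extract: a comma-separated list of type names. A sketch (table and column names are illustrative):

```sql
-- Sketch: restrict extraction to city and company entities only.
-- The fifth argument is a comma-separated list of entity types.
declare
  mydoc   clob;
  results clob;
begin
  select text into mydoc from docs where id = 1;
  ctx_entity.extract( 'p1', mydoc, null, results, 'city,company' );
  insert into entities ( id, xml ) values ( 1, results );
end;
/
```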

Lesson 4: Creating a Custom Entity Rule

So far all the entities we've extracted have been found using the built-in dictionary and rules. In this lesson we will
look at how we extend the policy to define new entity types via a regular-expression-based rule.

SQL> @entities4

At each step:
1. Proceeds exactly as Lesson 2, but then adds a new rule to the policy that was created, using
ctx_entity.add_extract_rule
This rule looks for expressions like "climbed by 20 percent". The new entity type identified by this rule is
named "xPositiveGain" – all custom entity types must start with "x".
2. All entities of the new type are extracted and displayed. We have chosen to extract only the new type – of
course we could have extracted all the standard types as well.
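A rule is itself a small XML document containing a regular expression and the entity type it produces. The sketch below shows the general shape; the regular expression and rule id are illustrative, not the lab script's exact rule:

```sql
-- Sketch: add a regular-expression rule to policy p1, then recompile
-- the policy so the new rule takes effect. The rule XML and regex
-- here are illustrative.
begin
  ctx_entity.add_extract_rule( 'p1', 1,
    '<rule>' ||
      '<expression>((climbed|rose|gained) by \d+ percent)</expression>' ||
      '<type refid="1">xPositiveGain</type>' ||
    '</rule>' );
  ctx_entity.compile( 'p1' );
end;
/
```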

Lesson 5: Adding a new Dictionary

The document mentions the "Dow Jones Industrial Average" and the "S&P 500". Both of these are stock market
indexes, and it would be handy if we could extract them as such. Now we could define a regular expression which
would recognize both of these. However, if there are a lot of possible values the regular expression would get very
slow. Instead, it is better to load a custom dictionary. We do that using XML. There is a file in the Entities
directory called "dict.load" which contains:

<dictionary>
<entities>
<entity>
<value>dow jones industrial average</value>
<type>xIndex</type>
</entity>
<entity>
<value>S&amp;P 500</value>
<type>xIndex</type>
</entity>
</entities>
</dictionary>

Note that the "&" in "S&P 500" must be escaped as we are processing XML.

The XML file is loaded using the command-line utility "ctxload" (also used for loading thesauri).

Get to a SQL prompt and type:

SQL> @entities5

1. Loading proceeds as before. The system then pauses to allow you to view the loader file
2. The loader file is displayed. Note we are defining two new entries, both of type "xIndex".
3. ctxload is invoked to actually load the loader file. Note that the arguments to ctxload include
-name p1
This associates the new dictionary with policy P1 which we will use in the extraction
4. The extraction proceeds, using policy P1, which now includes both the regular expression rules, and the
newly loaded dictionary.
5. The extracted entities are listed and the HTML file generated with markup.
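For reference, the ctxload invocation has roughly the shape below. This is a sketch only; the exact flags may vary by release, so check ctxload's own help output before relying on it.

```
# Load dict.load as an entity-extraction dictionary, associating it with
# policy P1 (the hol/hol credentials are those used throughout this lab)
ctxload -user hol/hol -extract -name p1 -file dict.load
```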

What have we learned?


1. Entities are named "things" in documents.
2. We can extract all entity types, or specific types.
3. Entity extraction uses a built-in set of rules and a dictionary.
4. We can extend this by adding our own rules and/or dictionaries.

Part 4: Classification and Clustering


Purpose

Clustering and classification are "text mining" techniques, which allow you to organize documents according to
the content within them.

First, a slight diversion: Oracle Text supports an indextype called CTXRULE which allows for the "routing" of
documents. The idea is that people save queries about documents which interest them, and these queries are
automatically run against incoming documents. Documents that match particular queries are then routed to the
people (or programs) interested in them. Oracle Text classification works by automatically generating the rules for
a CTXRULE index.
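To make the routing idea concrete, here is a minimal standalone CTXRULE sketch. The table name and the saved queries are our own illustrations, not part of the lab scripts:

```sql
-- Each row stores a saved query; a CTXRULE index inverts the queries so
-- that one incoming document can be matched against all of them at once.
create table saved_queries( qid number, qtext varchar2(200) );

insert into saved_queries values ( 1, 'interest rates' );
insert into saved_queries values ( 2, 'oil AND gas' );

create index saved_queries_idx on saved_queries( qtext )
  indextype is ctxsys.ctxrule;

-- "Route" an incoming document: find every saved query it satisfies.
select qid from saved_queries
where matches( qtext, 'The central bank left interest rates unchanged.' ) > 0;
```

Note that the roles are reversed compared to a normal CONTEXT query: the queries are indexed, and the document is the argument passed to MATCHES at query time.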

What is usually termed "classification" is more strictly described as "supervised classification". In this case we
have a set of training documents, which have already been assigned (usually manually) to categories or topics.
We then train the system using these documents. The system then contains an internal body of knowledge of what
associates particular documents with particular categories. For example, it might know that documents mentioning
"deposit accounts", "interest rates" and "investment" should be categorized as financial documents. New
documents can thus be automatically assigned to one of these existing categories.

Oracle Text has two distinct algorithms for supervised classification:

 Decision trees – a set of human-readable rules are generated
 Support Vector Machine (SVM) – a more sophisticated set of binary rules are created
Both of these algorithms require an Oracle Text index to drive them, though in the case of SVM the index does not
need to be populated.

Time to Complete

Approximately 15 minutes

Classification

The scripts for this lesson are in the directory /home/oracle/hol/Classification

In the directory /home/oracle/hol/Classification/Training you will find a set of directories, each
representing a topic heading, and each containing a set of documents relevant to that topic heading.

The script /home/oracle/hol/Classification/training1.sql loads those documents into a set of tables.
This has already been run for you; you do not need to run (or understand) the script, though you can run it again if
you choose. The documents have been loaded as follows:

Table Name       Description
---------------  ------------------------------------------------------------
trainingdocs     Contains the document text
categories       List of categories
cat_doc_mapping  Indicates which categories are associated with each document

We have also created an Oracle Text index "trainingindex" on the TrainingDocs table.

We will now create a table to contain the generated rules, and run ctx_cls.train to process the set of documents, and
generate rules for the various categories.

The script with these commands is /home/oracle/hol/Classification/training2.sql, but I would
encourage you to run each step separately so as to understand it.

Classification training uses a text index to populate a table of rules, which can later be used to see which
documents "match" particular categories.

Create the rules table first. You can do this from SQL*Plus or SQL Developer, logged on as hol/hol.

create table rules(
  topicid integer,
  ruletext varchar2(4000),
  confidence number
);

Create a "classifier preference" to tell the system to use the decision-tree based RULE_CLASSIFIER:

exec ctx_ddl.create_preference( 'my_rule_classifier', 'RULE_CLASSIFIER')

Then run the training process (you will probably want to cut-and-paste this):

begin
ctx_cls.train(
index_name => 'trainingindex',
docid => 'id',
cattab => 'cat_doc_mapping',
catdocid => 'docid',
catid => 'catid',
restab => 'rules',
rescatid => 'topicid',
resquery => 'ruletext',
resconfid => 'confidence',
pref_name => 'my_rule_classifier'
);
end;
/

This uses the "features" (indexed words, and possibly themes and stems) from the trainingindex index to
generate rules to classify new documents. We will find the generated rules in our table "rules".

These rules can be used in a CTXRULE index to classify incoming documents.

We can look at them – for example to look at rules for foreign exchange ("money-fx") we can do:

SQL> select ruletext from rules where topicid =
       ( select catid from categories where category = 'money-fx' );

RULETEXT
--------------------------------------------------------------------------------
{CURRENCIES} ~ about(currency stability) ~ about(bands) ~ {INTERVENING} ~
about(Six Nations) ~ about(taking out) ~ about(central-bank interventions) ~
about(money market) ~ {INTERVENED} ~ {INTERVENE} ~ about(repurchase
agreements) ~ {STABILITY} ~ {STABILIZE} ~ {MIYAZAWA} ~ about(Group of Five)
~ {FED} ~ {FOSTER} ~ {SUMITA} & about(money) ~ about(Bank of England) &
about(intervention)
...

We will create a rule index so we can actually use the rules to classify documents:

create index rules_idx on rules (ruletext) indextype is ctxsys.ctxrule;


and then use that to test a new "document". We join the RULES table (where we have our index) to the
CATEGORIES table so we can see the category name rather than just the topicid:

select category from rules, categories
where matches( ruletext, 'on the currency exchanges today...') > 0
and rules.topicid = categories.catid;

This gives us:

CATEGORY
--------------------------------------------------
money-fx

Thus we can see that the system has correctly identified our "document" as being about foreign exchange.

The decision trees algorithm has the advantage that it generates readable and editable rules. However, it is not
particularly accurate or complete. For better results we should use the Support Vector Machine (SVM) algorithm.
We will do this now. It is based on the same training set as before but uses a different rules table, as the generated
rules are now binary:

create table svm_rules(
  cat_id number,
  type number,
  rule blob
);

We also need a new object, an SVM_CLASSIFIER preference. We can set various attributes for that if we choose:

exec ctx_ddl.create_preference( 'my_svm_classifier', 'SVM_CLASSIFIER' )
exec ctx_ddl.set_attribute ( 'my_svm_classifier', 'MAX_FEATURES', '1000' )

Then run the training process (you will probably want to cut-and-paste this):

begin
ctx_cls.train(
index_name => 'trainingindex',
docid => 'id',
cattab => 'cat_doc_mapping',
catdocid => 'docid',
catid => 'catid',
restab => 'svm_rules',
pref_name => 'my_svm_classifier'
);
end;
/

That will take a minute or two. Then we need to create a CTXRULE index on the SVM training output. We must
specify that we're using the SVM classifier this time:

create index svm_rules_idx on svm_rules (rule)
indextype is ctxsys.ctxrule
parameters('classifier my_svm_classifier');

Now a query to confirm that it is working. The query is similar to our simple query with decision trees, but since
we're using the SVM classifier:
 We need a little more text to get a useful result.
 We only want results where the match is more likely than not, that is, where the MATCHES function returns a
value > 50. In other words, the system considers there is a greater than 50% chance the document belongs in that
category.

Again, you will probably need to cut-and-paste this. Do not leave any blank lines.

select category from categories, svm_rules
where matches( rule, '
ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT
<AUTHOR> By William Kazer, Reuters</AUTHOR>
HONG KONG, April 8 - Mounting trade friction between the
U.S. And Japan has raised fears among many of Asia''s exporting
nations that the row could inflict far-reaching economic
damage, businessmen and officials said.
They told Reuter correspondents in Asian capitals a U.S.
Move against Japan might boost protectionist sentiment in the
U.S. And lead to curbs on American imports of their products.
') > 50
and categories.catid = svm_rules.cat_id;

The output from this should be:

CATEGORY
--------------------------------------------------
trade

1 row selected.

Which seems like a good result from that text. You can try some other texts if you choose.

If you have the time, you can run test.sql from the Classification directory. It loads the complete Reuters
"testing" set, and runs each document against the rule index. It should show an accuracy of about 89% - that is, we
have correctly identified 89% of the original categories that were manually associated with the test documents.
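In outline, that accuracy check amounts to a join like the one below. The table and column names here are hypothetical, for illustration only; the real script's names will differ.

```sql
-- Hypothetical sketch: count a test document as correctly classified if
-- a category it was manually assigned to also matches with MATCHES > 50.
-- testdocs and test_mapping are invented names, not the script's own.
select count(distinct t.id) as correct
from testdocs t, svm_rules r, test_mapping m
where matches( r.rule, t.text ) > 50
  and m.docid  = t.id
  and m.catid  = r.cat_id;
```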

What have we learned?


1. Classification populates a RULES table from a training set.
2. A CTXRULE index on the RULES table allows you to classify documents one at a time into the trained
categories.
3. The two types of classification are DECISION TREE and SUPPORT VECTOR MACHINE.
4. Decision Tree generates human readable/editable rules. SVM generates binary rules which are
generally more accurate.

Clustering

Unlike normal classification, clustering requires no training data. It is therefore also known as "unsupervised
classification". The idea is that documents are grouped into sets of documents which are similar to each other.

Oracle Text does this using an algorithm known as K-Means clustering. It collects attributes about documents, and
arranges them in an N-dimensional space, where N is the number of attributes. Each document then has a location in
that N-dimensional space, and cluster center points can be positioned within this space. Each document is then
assigned to the cluster whose center is nearest to it in that space.
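In sketch form, if x_d is the feature vector for document d and c_1 ... c_K are the cluster centers, the two alternating steps of standard K-Means are as follows (this is the textbook algorithm; Oracle's internal implementation details are not spelled out here):

```latex
% Assignment: each document joins the cluster with the nearest center.
\mathrm{cluster}(d) = \arg\min_{k}\; \lVert x_d - c_k \rVert^2

% Update: each center moves to the mean of its assigned documents S_k.
c_k = \frac{1}{\lvert S_k \rvert} \sum_{d \in S_k} x_d
```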

For this exercise we will use a trivial collection of documents. The script for this exercise is
/home/oracle/hol/Clustering/clustering.sql

In SQL*Plus, or SQL Developer, logged on as hol/hol, run this SQL:

create table collection (id number primary key, text varchar2(4000));

insert into collection values (1, 'Oracle Text can index any document or
textual content.');
insert into collection values (2, 'SES uses a crawler to access
documents.');
insert into collection values (3, 'XML is a tag-based markup language.');
insert into collection values (4, 'Oracle12c XML DB treats XML as a native
datatype in the database.');
insert into collection values (5, 'There are three Text index types to cover
all text search needs.');
insert into collection values (6, 'SES also provides an API for content
management solutions.');

We now need to create an index on the collection. Note that - unlike supervised training - this index does NOT have to
be populated, so we can use the NOPOPULATE parameter. The index preferences (such as datastore and filter) tell
the clustering code how to fetch the data to be clustered. In our case we're using the default
DIRECT_DATASTORE to fetch data directly from the table column.

create index col_idx on collection(text)
indextype is ctxsys.context
parameters('nopopulate');

Create the preference to be used for clustering. We'll tell the system to use a total of three clusters. It may create
more than that internally, but all documents will be assigned to three of those clusters.

exec ctx_ddl.create_preference('my_cluster','KMEAN_CLUSTERING')
exec ctx_ddl.set_attribute('my_cluster','CLUSTER_NUM','3')
exec ctx_ddl.set_attribute('my_cluster','THEME_ON','YES')
exec ctx_ddl.set_attribute('my_cluster','TOKEN_ON','YES')

And do the clustering...

begin
ctx_cls.clustering(index_name =>'col_idx',
docid =>'id',
doctab_name =>'restab',
clstab_name =>'clusters',
pref_name =>'my_cluster');
end;
/

Now let's look at how it has clustered the documents. If using SQL*Plus it will be helpful to do:

column description format a50
column text format a65

then run

select text, clusterid from collection c, restab
where docid = c.id
order by clusterid;

Which should show us:

TEXT CLUSTERID
----------------------------------------------------------------- ----------
There are three Text index types to cover all text search needs. 2
Oracle Text can index any document or textual content. 2
SES uses a crawler to access documents. 4
SES also provides an API for content management solutions. 4
Oracle12c XML DB treats XML as a native datatype in the database. 5
XML is a tag-based markup language. 5

6 rows selected.

Note the CLUSTERIDs are not contiguous: it actually created six clusters, but only used three. We can examine
the "description" for our clusters. These are the "major features" for the cluster (generally of more use in much
larger collections).

select clusterid, descript from clusters
where clusterid in (2,4,5);

Which produces:

CLUSTERID
----------
DESCRIPT
---------------------------------------------------------------------------
2
DOCUMENT,NEEDS,COVER,THREE,TEXTUAL,ORACLE,TYPES,INDEX,CONTENT,TEXT,SEARCH

4
DOCUMENTS,SOLUTIONS,MANAGEMENT,ACCESS,PROVIDES,CRAWLER,USES,SES,CONTENT,API

5
DATABASE,DB,TAG,DATATYPE,MARKUP,XML,BASED,LANGUAGE,TREATS,NATIVE,ORACLE12C

This is obviously a trivial example, but the reader is encouraged to try this on a real dataset at a later date.

What have we learned?


1. Clustering does not require any training.
2. Documents are grouped by similarity into a number of clusters specified by us.
3. Clusters have a "description" which shows the main features used.
4. Clustering operates on a complete set of documents; we can't later add a new document and expect it to be
added to a particular cluster, we must recalculate all clusters.

Conclusion
If you have made it this far - congratulations! That's the end of this Hands-On Lab.

You now have a good grounding in using Oracle Text.

There's plenty more to learn, but having grasped the basics you can now tackle the other tasks with much more
confidence.

Note: Oracle Text is a standard part of Oracle Database 12c. All versions (from Enterprise to XE) include Oracle
Text.

The clustering and classification features make use of Oracle Advanced Analytics (OAA) functions internally.
Although OAA is separately licensed, a license is not required when these functions are used only within Oracle
Text functions, as demonstrated in this lab. However, since OAA is only available with Enterprise Edition, you can
only use Text Clustering and Classification with the Enterprise Edition.

Further Reading
Oracle Text Homepage on OTN:
http://www.oracle.com/technetwork/database/enterprise-edition/index-098492.html
Oracle Text Discussion Forums on OTN:
https://forums.oracle.com/community/developer/english/oracle_database/text

Oracle Text Reference and Application Developers' Guide:
http://www.oracle.com/pls/db121/portal.all_books#index-TEX

SearchTech Blog
https://blogs.oracle.com/searchtech/

Author: Roger Ford
roger.ford@oracle.com

Text Analytics Using Oracle Text
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.
August 2013
Author: Roger Ford

This document is provided for information purposes only, and the contents hereof are subject to change without notice. This
document is not warranted to be error-free, nor subject to any other warranties or conditions, whether expressed orally or implied in
law, including implied warranties and conditions of merchantability or fitness for a particular purpose. We specifically disclaim any
liability with respect to this document, and no contractual obligations are formed either directly or indirectly by this document. This
document may not be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without our
prior written permission.

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.
Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and
are trademarks or registered trademarks of SPARC International, Inc. AMD, Opteron, the AMD logo, and the AMD Opteron logo are
trademarks or registered trademarks of Advanced Micro Devices. UNIX is a registered trademark of The Open Group. 0113

Oracle Corporation
World Headquarters
500 Oracle Parkway
Redwood Shores, CA 94065
U.S.A.
Worldwide Inquiries:
Phone: +1.650.506.7000
Fax: +1.650.506.7200
oracle.com
