Roger Ford
Principal Product Manager
Asha Tarachandani
Development Manager
Hands-On Lab - Text Analytics using Oracle Text
Table Of Contents
Login Information
Introduction
Prerequisites
How to use this document
Part 1: Theme Indexes
    Purpose
    Time to Complete
    Topics
    Prerequisites
    Starting SQL Developer or SQL*Plus
    Creating a table
    Inserting some data in our table
    Creating a theme index on our table
    Searching for Themes
    What have we learned?
    Common Questions and Answers
Part 2: Extracting Themes and Gists
    Purpose
    Time to Complete
    Topics
    Prerequisites
    Creating and populating the sample table
    Creating an index for theme extraction
    Creating a themes table
    Gist Extraction
    What have we learned?
Part 3: Named Entity Extraction
    Purpose
    Time to Complete
    Topics
    Method
    Preparation
    Lesson 1: Simple Extraction
    Lesson 2: More Entities
    Lesson 3: Selective Extraction
    Lesson 4: Creating a Custom Entity Rule
    Lesson 5: Adding a new Dictionary
    What have we learned?
Part 4: Classification and Clustering
    Purpose
    Time to Complete
    Classification
    What have we learned?
    Clustering
    What have we learned?
Conclusion
Further Reading
Login Information
Introduction
Oracle Text is well-known as the text searching engine within Oracle Database 12c. Less well known are its text
analytics capabilities.
What do we mean by text analytics? Also known as "text mining", this is the process of deriving high quality
information from text through linguistic and statistical analysis of the text.
This lab takes you through the basic analytic capabilities of Oracle Text in Oracle Database 12c.
This is a self-paced lab. You can run through the exercises yourself with reference to this document. Several
highly knowledgeable helpers are available, so please don't be afraid to ask for help at any point if you get stuck or
need further explanation of some feature.
Prerequisites
To participate effectively in this Hands On Lab, you need some familiarity with SQL and Oracle's SQL tools -
either the command-line SQL*Plus or SQL Developer.
How to use this document
In general, things you need to type are emphasized in red (don't type any surrounding quotes). Output that you would expect
to receive is in a black mono-spaced font.
When following the examples, you are most likely to remember the lessons if you type the text yourself. However,
experience shows that nearly all problems with hands-on labs are due to people mis-typing stuff.
1. If there are only one to three lines of text and you are a reasonably good typist, then type the lines yourself
with reference to this document.
2. If you hit errors, retry by cutting-and-pasting from this document.
3. If all else fails, there are scripts available in the "HOL" folder and you can run them directly, or cut-and-paste from them.
In general this document does not list the "drop table", "drop index" etc. commands necessary to repeat an
exercise. They should all be obvious, but if not, please ask a lab helper. All the scripts do contain the "drop"
statements, so you could also look there for them. One side-effect of that is that if you run the scripts directly
you will usually get errors the first time they are run, since they try to delete objects that don't yet exist.
Those completely unfamiliar with Oracle Text may find there is not enough time in the lab to complete all the
sections in this book - if that is the case don't worry, just complete what you can. The more advanced topics are
intended for users who already have some familiarity with Oracle Text.
Part 1: Theme Indexes
Purpose
This tutorial introduces you to the basics of Oracle Text theme indexes and queries. It can all be run from
SQL*Plus or SQL Developer and requires no additional files.
Time to Complete
Approximately 15 minutes
Topics
This tutorial creates a simple text table, builds a theme index on it, and runs theme queries against it.
Prerequisites
You should be logged into your system already, but if for any reason you get logged off, the username is "oracle"
and the password is "welcome1".
To start SQL*Plus:
1. Double-click on the Terminal icon on your desktop
2. Type "sqlplus hol/hol"
Tech note: sqlplus is aliased to "rlwrap sqlplus". rlwrap allows command recall and editing which is not usually
possible with the Linux command line version of SQL*Plus. If you don't like it, or it causes problems, just type
"unalias sqlplus". rlwrap is open source software, available from: http://utopia.knoware.nl/~hlub/rlwrap/
Creating a table
Create a table in the HOL schema called "DOCS". It should have a single column called TEXT which is
varchar2(2000). You can do this using "Tables (Right-Click) -> New Table" in SQL Developer or by running the
SQL statement:
CREATE TABLE docs (text VARCHAR2(2000));
in either SQL*Plus or SQL Developer. Don't forget to add a semi-colon or a "/" on a new line for SQL*Plus.
From here on, we're just going to show the SQL as run in SQL*Plus - you can run the same SQL from SQL
Developer if you like, or make use of the GUI features (eg to load a table) if you want.
Inserting some data in our table
Load the table with some simple data. (Note: you don't have to use the exact text here, but it is strongly
recommended since later examples build on the same text.)
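The insert statement itself did not survive in this copy; the text below is taken from the query output shown later in this section, so a single row like this is enough:

INSERT INTO docs VALUES ('The interest rate for deposit accounts has not increased for several years. This indicates a poor rate of return for investors.');
COMMIT;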
Creating a theme index on our table
Next we're going to create an index – which includes themes – on our table. Before we do that, though, let's talk a
little bit more about themes and Oracle Text.
A theme is "something that a document is about". A document cannot be about "invested" – that doesn't
make sense, grammatically. However, the document could be about "investors" or "investing".
Themes are derived from the Oracle Text Knowledge Base – a large body of knowledge (technically, an
ontology) about the English language.
The Knowledge Base is loaded with the database "Examples" pack. We have already installed it for this
Hands On Lab. However, when working with your own database later, if you have not downloaded and
installed the Examples, you will not get theme indexes by default, and if you try to repeat exercises from
this lab on your own machine you may get errors.
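The CREATE INDEX statement was lost in this copy. Based on the Dollar-I table name shown below (DR$DOCS_INDEX$I), it would have been along these lines:

CREATE INDEX docs_index ON docs (text)
  INDEXTYPE IS CTXSYS.CONTEXT;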
If you're not currently familiar with Oracle Text, you might be asking "what does that do?" Like any other
CREATE INDEX statement, we have specified the table and column on which to create the index, but we have
told it to use a special "indextype" of CTXSYS.CONTEXT.
Indextypes are part of the Extensibility Framework of the Oracle kernel, and CTXSYS.CONTEXT is an extensible
(or DOMAIN ) index used by Oracle Text. There are other indextypes in Oracle Text, but CONTEXT is the most
common type and the only one we shall consider here.
When Oracle Text creates an index, it creates several special tables to hold index data and metadata, the most
interesting of which is the so-called "Dollar-I" table. Its name is derived from the index name, and here it will be
called DR$DOCS_INDEX$I.
We can describe it:
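In SQL*Plus, for example:

DESCRIBE dr$docs_index$i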
Of these columns, we're most interested in TOKEN_TEXT (the indexed words from your documents) and
TOKEN_TYPE (the type of word indexed). Let's examine them:
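A query along these lines (the ordering is our choice) produces the listing that follows:

SELECT token_type, token_text
  FROM dr$docs_index$i
 ORDER BY token_type, token_text;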
The output looks like this (we've abbreviated it – you'll see more words)
TOKEN_TYPE TOKEN_TEXT
---------- ----------------------------------------------------------------
0 ACCOUNTS
0 DEPOSIT
0 INCREASED
0 INDICATES
0 INTEREST
Note that all the TOKEN_TYPE values are 0. Those are "ordinary indexed words" because we created a basic
index. Next we'll tell it we want themes indexed as well. Drop the index…
And recreate it, this time creating an index preference telling the lexer (process which selects words from text) to
include themes. The index preference is created using the CTX_DDL package, and then specified using the
PARAMETERS clause of the create index statement. A user needs the CTXAPP role to use the CTX_DDL
package.
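The drop and re-create statements were not preserved here. A sketch using the standard BASIC_LEXER attribute INDEX_THEMES (the preference name my_lexer is our invention) would be:

DROP INDEX docs_index;

exec ctx_ddl.create_preference('my_lexer', 'BASIC_LEXER')
exec ctx_ddl.set_attribute('my_lexer', 'INDEX_THEMES', 'YES')

CREATE INDEX docs_index ON docs (text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('lexer my_lexer');

Querying the Dollar-I table again then shows both token types: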
TOKEN_TYPE TOKEN_TEXT
---------- ----------------------------------------------------------------
0 ACCOUNTS
0 DEPOSIT
0 INCREASED
0 INDICATES
0 INTEREST
TOKEN_TYPE TOKEN_TEXT
---------- ----------------------------------------------------------------
1 accountability
1 business and economics
1 depositaries
1 financial investments
1 financial lending
1 general investment
1 investors
1 poverty
TOKEN_TYPE = 0 are the ordinary indexed words – words that appear in the text. Note that "stopwords" such as
the, for, has are not indexed.
TOKEN_TYPE=1 are indexed themes, that is, things that the document is "about". In general, these words and
phrases will not have appeared in the original text, but will have been derived from it.
Searching for Themes
Let's remind ourselves about the syntax for a "standard" Oracle Text query. We use the contains function in a
query, and provide it with the name of the indexed column, and the search expression:
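The query was lost in this copy; for the sample document, a word query such as the following returns the row shown below:

SELECT text FROM docs WHERE CONTAINS (text, 'interest') > 0;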
TEXT
-------------------------------------------------------------------------
The interest rate for deposit accounts has not increased for several years.
This indicates a poor rate of return for investors.
To do a theme search, we use a similar query, but use the ABOUT operator inside the contains:
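Again the query itself is missing here. The theme chosen below ("investors", one of the indexed themes we saw above) is our assumption; any indexed theme would work:

SELECT text FROM docs WHERE CONTAINS (text, 'ABOUT(investors)') > 0;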
TEXT
---------------------------------------------------------------------------
The interest rate for deposit accounts has not increased for several years.
This indicates a poor rate of return for investors.
Part 2: Extracting Themes and Gists
Purpose
This tutorial looks further into themes, showing you how to extract them for a single document. It also looks at
gists, or document summaries.
Time to Complete
Approximately 10 minutes
Topics
Prerequisites
None
Creating and populating the sample table
We're going to create a table similar to the previous example, but this time it needs a primary key column:
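The statements were not preserved in this copy. A minimal version consistent with the textkey => 1 call used below (column names assumed) is:

CREATE TABLE docs (id NUMBER PRIMARY KEY, text VARCHAR2(2000));

INSERT INTO docs VALUES (1, 'The interest rate for deposit accounts has not increased for several years. This indicates a poor rate of return for investors.');
COMMIT;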
Creating an index for theme extraction
Before we can extract themes using CTX_DOC.THEMES, we need an index. That index only really tells us where
to find the text and how to process it, so it can be a non-populated or NOPOPULATE index. This takes almost no
time to create. (An alternative is to use POLICY_THEMES which doesn't require an index but that's beyond the
scope of this lab).
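A sketch of the NOPOPULATE index creation (the index name is taken from the ctx_doc.themes call below):

CREATE INDEX docs_index ON docs (text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('nopopulate');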
Creating a themes table
We're going to extract the themes from that document, and we need somewhere to put them. We'll create a table
called "THEMES_TABLE" thus:
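The CREATE TABLE statement is missing from this copy; a version matching the documented restab format for ctx_doc.themes (column sizes assumed) is:

CREATE TABLE themes_table (
  query_id  NUMBER,
  theme     VARCHAR2(2000),
  weight    NUMBER
);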
Now we will use the ctx_doc package to get themes from the docs table into the themes_table table. We specify
the table containing our docs, the index name, and the primary key value for the row (in the textkey parameter).
We could use the rowid instead of the primary key value – see ctx_doc.set_key_type in the documentation.
begin
ctx_doc.themes (
index_name => 'docs_index',
restab => 'themes_table',
textkey => 1,
full_themes => FALSE );
end;
/
Let's look at what got put there. If you're using SQL*Plus, the following will make the output cleaner:
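Something like the following (the exact formatting commands are our assumption) produces the output below:

set pagesize 60
column theme format a45

SELECT theme, weight FROM themes_table ORDER BY weight DESC;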
THEME WEIGHT
----------------------------------------- ----------
depositaries 40
accountability 40
indication 39
poverty 39
rate of return 33
interest rates 32
increase 32
investors 26
years 9
9 rows selected.
Note that these are similar to the themes we saw indexed before – but we're getting more information because we
can now see how important each theme is for the document, by referencing the WEIGHT column.
We can get even more information by extracting FULL themes as below. Note the full_themes parameter is now
set to "true".
begin
ctx_doc.themes (
index_name => 'docs_index',
restab => 'themes_table',
textkey => 1,
full_themes => TRUE );
end;
/
THEME WEIGHT
----------------------------------------------------------- ----------
:depositaries: 40
:accountability: 40
:indication: 39
:poverty: 39
:rate of return:financial investments:business and economics: 33
:interest rates:financial lending:business and economics: 32
:increase: 32
:investors:general investment:financial investments:business and 26
economics:
:years: 9
9 rows selected.
You will notice that some themes are "solitary" and some are part of a hierarchy. The "solitary" themes are
unproved. That means they were found in the text, but there were no other related terms around them to back them up.
Proved themes can be located properly in the knowledge base hierarchy due to the presence of related terms
nearby.
Gist Extraction
A gist is a summary of a document – usually a paragraph or two that best describes the overall content of the
document.
For this example to make sense, we need a lot more text. Cut and paste the following, or use the script in
/home/oracle/hol/ThemesAndGists/get_gists.sql
Now create a gists table and fetch the gists into it:
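The CREATE TABLE statement is missing here; a version matching the documented restab format for ctx_doc.gist is below. The POV ("point of view") column holds the theme each gist is written from, as seen in the output further down:

CREATE TABLE gists_table (
  query_id  NUMBER,
  pov       VARCHAR2(80),
  gist      CLOB
);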
begin
ctx_doc.gist (
index_name => 'docs_index',
textkey => '1',
restab => 'gists_table' );
end;
/
set pagesize 60
set long 50000
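Then a query along these lines displays one gist per point of view:

SELECT pov, gist FROM gists_table;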
FAMILY
Kate and William smiled broadly and waved from their car as they were driven
away from Kensington Palace by security, where they had spent their first
night together as a family.
GRANDPARENTS
Baby Cambridge was taken to see his grandparents on his first afternoon out
today as it was revealed the Queen and Prince Harry have now met the royal
baby for the first time.
Part 3: Named Entity Extraction
Purpose
This section shows you how to use the Entity Extraction technology, first introduced in Oracle Database 11g.
Time to Complete
Approximately 15 minutes
Topics
Entity extraction involves the finding and extracting of Named Entities within texts. These named entities include
things like people, places, job titles, dates, phone numbers and many other things.
Method
This tutorial is somewhat different from earlier ones. Since the commands involved are rather more
complicated, it is not feasible to type them all in. Instead, you will run pre-defined scripts and observe the results.
The scripts are best run in SQL*Plus. You can, if you really want to, run them in SQL Developer, but I would
strongly recommend SQL*Plus.
Entity extraction is XML-based: the configuration and the output are both XML. This makes it easy to process the
output mechanically, but it does complicate the process of demoing entity extraction.
Preparation
Open a terminal using the icon on your desktop. Change directory to the Entities directory under hol:
Now open SQL*Plus. Use the "/nolog" parameter to prevent it prompting for login details:
SQL>
Lesson 1: Simple Extraction
To run the first lesson, type:
SQL> @entities1
The script will pause at each important stage so you can see what it is doing. Here is what is happening in the
entities1 script at each stage:
1. Connect as "system" and drop the demo user (if it exists – it won't first time round). Create the user
enttest and provide it with the necessary privileges
2. Connect as user enttest, create a table and insert a short "financial news" text document. Create a table
"entities" for inserting the found entities.
3. Create an "extraction policy". This is used to define what type of entities we want to find. In this case,
we're taking all the default options.
4. Create a short anonymous procedure which
a. Fetches the news doc into a PL/SQL variable
b. Runs the entity extraction procedure ctx_entity.extract to find all the entities in that variable
c. inserts the resulting XML output into another table for us to examine later
5. Examine the output from ctx_entity.extract. First we look at the full XML output. Then we apply some
"XML Database" magic (via the XMLTable function) to turn the output into a SQL table format.
6. A PL/SQL procedure is invoked which marks up the original document with the entities found. You
will see a URL listed – use "Left-Control Click" on the URL, or cut and paste it into your browser, to
see the document with highlighted entities. Mouse hover over each entity to see the type of entity and
why it was identified. Note that "New York" is identified as both a City and a State.
Close the browser when done and return to the terminal window
Lesson 2: More Entities
If your terminal window is open at the SQL> prompt, you are ready to go. If not, follow the "Preparation" section
above to open a terminal window and open SQL*Plus with the "/nolog" parameter.
Then type
SQL> @entities2
Again the script will pause at the end of each step. Here's what is done at each step:
1. The user from the previous lesson is dropped (to clear out all tables) and recreated. A "docs" table is created,
and a much longer news article is inserted into it.
2. Create a default extraction policy
3. Run extraction on the document text, and save the extracted entities XML into the table "entities".
4. Show the extracted XML in raw format. Notice the various attributes associated with each entity.
5. Turn the extracted XML into a SQL table format for easier viewing. Note the wide variety of different
entity types found.
6. Generate a marked-up HTML file with entities displayed within the original text. Again, use CTRL-click
on the URL to open it in a browser, or cut-and-paste to the browser window.
Lesson 3: Selective Extraction
In this lesson we look at how we can specify the types of entities to extract, rather than having all the recognized
entities extracted. Proceed as before to get a SQL prompt and type:
SQL> @entities3
At each step:
1. Step 1 proceeds as in Lesson 2 until the actual extraction. This time ctx_entity.extract includes an
extra argument – the list of entity types to extract. Here we will only extract city and company
information.
2. We can see that the output only contains city and company entities.
Lesson 4: Creating a Custom Entity Rule
So far all the entities we've extracted have been found using the built-in dictionary and rules. In this lesson we will
look at how we extend the policy to define new entity types via a regular-expression-based rule.
SQL> @entities4
At each step:
1. Proceeds exactly as Lesson 2, but then adds a new rule to the policy that was created, using
ctx_policy.add_extract_rule
This rule looks for expressions like "climbed by 20 percent". The new entity type identified by this rule is
named "xPositiveGain" – all custom entity types must start with "x".
2. All entities of the new type are extracted and displayed. We have chosen to extract only the new type – of
course we could have extracted all the standard types as well.
Lesson 5: Adding a new Dictionary
The document mentions the "Dow Jones Industrial Average" and the "S&P 500". Both of these are stock market
indexes, and it would be handy if we could extract them as such. Now we could define a regular expression which
would recognize both of these. However, if there are a lot of possible values the regular expression would get very
slow. Instead, it is better to load a custom dictionary. We do that using XML. There is a file in the Entities
directory called "dict.load" which contains:
<dictionary>
<entities>
<entity>
<value>dow jones industrial average</value>
<type>xIndex</type>
</entity>
<entity>
<value>S&amp;P 500</value>
<type>xIndex</type>
</entity>
</entities>
</dictionary>
Note that the "&" in "S&P 500" must be escaped as we are processing XML.
The XML file is loaded using the command-line utility "ctxload" (also used for loading thesauri).
SQL> @entities5
1. Processing proceeds as before. The system then pauses to allow you to view the loader file.
2. The loader file is displayed. Note we are defining two new entries, both of type "xIndex".
3. ctxload is invoked to actually load the loader file. Note that the arguments to ctxload include
-name p1
This associates the new dictionary with policy P1 which we will use in the extraction
4. The extraction proceeds, using policy P1, which now includes both the regular expression rules, and the
newly loaded dictionary.
5. The extracted entities are listed and the HTML file generated with markup.
Part 4: Classification and Clustering
Purpose
Clustering and classification are "text mining" techniques, which allow you to organize documents according to
the content within them.
First, as a slight diversion. Oracle Text supports an indextype called CTXRULE which allows for the "routing" of
documents. The idea is that people save queries about documents which interest them, and these queries are
automatically run against incoming documents. Documents that match particular queries are then routed to the
people (or programs) interested in them. Oracle Text classification works by automatically generating the rules for
a CTXRULE index.
What is usually termed "classification" is more strictly described as "supervised classification". In this case we
have a set of training documents, which have already been assigned (usually manually) into categories or topics.
We then train the system using these documents. The system then contains an internal body of knowledge of what
associates particular documents with particular categories. For example it might know that documents mentioning
"deposit accounts", "interest rates" and "investment" should be categorized as financial documents. New
documents can thus be automatically assigned to one of these existing categories.
Time to Complete
Approximately 15 minutes
Classification
A table TRAININGDOCS has been pre-created and loaded with a set of training documents, and an Oracle Text
index "trainingindex" has been created on it.
We will now create a table to contain the generated rules, and run ctx_cls.train to process the set of documents, and
generate rules for the various categories.
Classification training uses a text index to populate a table of rules, which can later be used to see which
documents "match" particular categories.
Create the rules table first. You can do this from SQL*Plus or SQL Developer, logged on as hol/hol.
Create a "classifier preference" to tell the system to use the decision-tree based RULE_CLASSIFIER:
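Neither statement survived in this copy. Sketches consistent with the ctx_cls.train call below (column names taken from its restab parameters, sizes assumed) are:

CREATE TABLE rules (
  topicid     NUMBER,
  ruletext    VARCHAR2(4000),
  confidence  NUMBER
);

exec ctx_ddl.create_preference('my_rule_classifier', 'RULE_CLASSIFIER')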
Then run the training process (you will probably want to cut-and-paste this):
begin
ctx_cls.train(
index_name => 'trainingindex',
docid => 'id',
cattab => 'cat_doc_mapping',
catdocid => 'docid',
catid => 'catid',
restab => 'rules',
rescatid => 'topicid',
resquery => 'ruletext',
resconfid => 'confidence',
pref_name => 'my_rule_classifier'
);
end;
/
This is using the "features" (indexed words, and possibly themes and stems) from the trainingindex index to
generate rules to classify new documents. We will find the generated rules in our table "rules"
We can look at them – for example to look at rules for foreign exchange ("money-fx") we can do:
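The query is missing here. Assuming the training schema includes a table mapping category ids to names (the "categories" table below is hypothetical; adjust to the actual schema), it would be something like:

SELECT ruletext
  FROM rules
 WHERE topicid = (SELECT catid FROM categories WHERE name = 'money-fx');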
RULETEXT
--------------------------------------------------------------------------------
{CURRENCIES} ~ about(currency stability) ~ about(bands) ~ {INTERVENING} ~
about(Six Nations) ~ about(taking out) ~ about(central-bank interventions) ~
about(money market) ~ {INTERVENED} ~ {INTERVENE} ~ about(repurchase
agreements) ~ {STABILITY} ~ {STABILIZE} ~ {MIYAZAWA} ~ about(Group of Five)
~ {FED} ~ {FOSTER} ~ {SUMITA} & about(money) ~ about(Bank of England) &
about(intervention)
...
We will create a rule index so we can actually use the rules to classify documents:
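The statements are missing from this copy. A sketch follows; the sample "document" text passed to MATCHES is invented for illustration, and the "categories" table is assumed as above:

CREATE INDEX rules_idx ON rules (ruletext)
  INDEXTYPE IS CTXSYS.CTXRULE;

SELECT DISTINCT c.name AS category
  FROM rules r, categories c
 WHERE r.topicid = c.catid
   AND MATCHES (r.ruletext,
       'the central bank intervened in the money market to stabilize the currency') > 0;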
CATEGORY
--------------------------------------------------
money-fx
Thus we can see that the system has correctly identified our "document" as being about foreign exchange.
The decision-tree algorithm has the advantage that it generates readable and editable rules. However, it is not
particularly accurate or complete. For better results we should use the Support Vector Machine (SVM) algorithm.
We will do this now. It is based on the same training set as before but uses a different rules table, as the generated
rules are now binary:
We also need a new object, an SVM_CLASSIFIER preference. We can set various attributes for that if we choose:
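Both statements are missing from this copy. The table below follows the documented restab format for SVM classification (the rules are stored as binary BLOBs), and the preference takes all default attributes:

CREATE TABLE svm_rules (
  cat_id  NUMBER,
  type    NUMBER(3),
  rule    BLOB
);

exec ctx_ddl.create_preference('my_svm_classifier', 'SVM_CLASSIFIER')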
Then run the training process (you will probably want to cut-and-paste this):
begin
ctx_cls.train(
index_name => 'trainingindex',
docid => 'id',
cattab => 'cat_doc_mapping',
catdocid => 'docid',
catid => 'catid',
restab => 'svm_rules',
pref_name => 'my_svm_classifier'
);
end;
/
That will take a minute or two. Then we need to create a CTXRULE index on the SVM training output. We must
specify that we're using the SVM classifier this time:
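The index creation statement did not survive; a sketch, naming the SVM classifier preference in the PARAMETERS clause, is:

CREATE INDEX svm_rules_idx ON svm_rules (rule)
  INDEXTYPE IS CTXSYS.CTXRULE
  PARAMETERS ('classifier my_svm_classifier');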
Now a query to confirm that it is working. The query is similar to our simple query with decision trees, but since
we're using the SVM classifier we need a little more text to get a useful result. We only want results where the
match is more likely than not, that is, where the MATCHES function returns a value greater than 50. In other
words, the system considers there is a greater than 50% chance that the document belongs in that category.
Again, you will probably need to cut-and-paste this. Do not leave any blank lines.
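The query itself is missing here. A sketch follows; the sample text is invented for illustration, and the "categories" table is assumed as before:

SELECT DISTINCT c.name AS category
  FROM svm_rules r, categories c
 WHERE r.cat_id = c.catid
   AND MATCHES (r.rule,
       'tariff negotiations on agricultural imports and exports continued this week between the two governments') > 50;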
CATEGORY
--------------------------------------------------
trade
1 row selected.
Which seems like a good result from that text. You can try some other texts if you choose.
If you have the time, you can run test.sql from the Classification directory. It loads the complete Reuters
"testing" set, and runs each document against the rule index. It should show an accuracy of about 89% - that is, we
have correctly identified 89% of the original categories that were manually associated with the test documents.
Clustering
Unlike normal classification, clustering requires no training data. It is therefore also known as "unsupervised
classification". The idea is that documents are grouped into sets of documents which are similar to each other.
Oracle Text does this using an algorithm known as K-Means clustering. It collects attributes about documents, and
arranges them in a N-dimensional space, where N is the number of attributes. Each document then has a location in
that N-dimensional space, and a center points can be positioned within this space. Each document will then be
assigned to the cluster nearest to it in that space.
For this exercise we will use a trivial collection of documents. The script for this exercise is
/home/oracle/hol/Clustering/clustering.sql
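The CREATE TABLE statement was not preserved; a minimal version consistent with the inserts below and the id/text columns used later is:

CREATE TABLE collection (id NUMBER PRIMARY KEY, text VARCHAR2(2000));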
insert into collection values (1, 'Oracle Text can index any document or textual content.');
insert into collection values (2, 'SES uses a crawler to access documents.');
insert into collection values (3, 'XML is a tag-based markup language.');
insert into collection values (4, 'Oracle12c XML DB treats XML as a native datatype in the database.');
insert into collection values (5, 'There are three Text index types to cover all text search needs.');
insert into collection values (6, 'SES also provides an API for content management solutions.');
We now need to create an index on the collection. Note that - unlike supervised training - this index does NOT
have to be populated, so we can use the NOPOPULATE parameter. The index preferences (such as datastore and
filter) tell the clustering code how to fetch the data to be clustered. In our case we're using the default
DIRECT_DATASTORE to fetch data directly from the table column.
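The statement is missing from this copy; the index name below is taken from the ctx_cls.clustering call further down:

CREATE INDEX col_idx ON collection (text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('nopopulate');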
Create the preference to be used for clustering. We'll tell the system to use a total of three clusters. It may create
more than that internally, but all documents will be assigned to three of those clusters.
exec ctx_ddl.create_preference('my_cluster','KMEAN_CLUSTERING')
exec ctx_ddl.set_attribute('my_cluster','CLUSTER_NUM','3')
exec ctx_ddl.set_attribute('my_cluster','THEME_ON','YES')
exec ctx_ddl.set_attribute('my_cluster','TOKEN_ON','YES')
begin
ctx_cls.clustering(index_name =>'col_idx',
docid =>'id',
doctab_name =>'restab',
clstab_name =>'clusters',
pref_name =>'my_cluster');
end;
/
Now let's look at how it has clustered the documents. If using SQL*Plus it will be helpful to do:
then run
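The commands were not preserved here. Something like the following (joining the result table back to the collection to show the document text; the restab column names follow the documented doctab format) produces the output below:

set pagesize 60
column text format a65

SELECT c.text, r.clusterid
  FROM collection c, restab r
 WHERE c.id = r.docid
 ORDER BY r.clusterid;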
TEXT CLUSTERID
----------------------------------------------------------------- ----------
There are three Text index types to cover all text search needs. 2
Oracle Text can index any document or textual content. 2
SES uses a crawler to access documents. 4
SES also provides an API for content management solutions. 4
Oracle12c XML DB treats XML as a native datatype in the database. 5
XML is a tag-based markup language. 5
6 rows selected.
Note that the CLUSTERIDs are not contiguous: it actually created six clusters internally, but only used three. We
can examine the "description" for our clusters. These are the "major features" for each cluster (generally of more
use in much larger collections).
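A query like this (the formatting command is our assumption) produces the output that follows:

column descript format a75

SELECT clusterid, descript FROM clusters;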
CLUSTERID
----------
DESCRIPT
---------------------------------------------------------------------------
2
DOCUMENT,NEEDS,COVER,THREE,TEXTUAL,ORACLE,TYPES,INDEX,CONTENT,TEXT,SEARCH
4
DOCUMENTS,SOLUTIONS,MANAGEMENT,ACCESS,PROVIDES,CRAWLER,USES,SES,CONTENT,API
5
DATABASE,DB,TAG,DATATYPE,MARKUP,XML,BASED,LANGUAGE,TREATS,NATIVE,ORACLE12C
This is obviously a trivial example, but the reader is encouraged to try this on a real dataset at a later date.
Conclusion
If you have made it this far - congratulations! That's the end of this Hands-On Lab.
There's plenty more to learn, but having grasped the basics you can now tackle the other tasks with much more
confidence.
Note: Oracle Text is a standard part of Oracle Database 12c. All versions (from Enterprise to XE) include Oracle
Text.
The clustering and classification features make use of Oracle Advanced Analytics (OAA) functions internally.
Although OAA is separately licensed, a license is not required when these functions are used only within Oracle
Text functions, as demonstrated in this lab. However, since OAA is only available with Enterprise Edition, you can
only use Text Clustering and Classification with the Enterprise Edition.
Further Reading
Oracle Text Homepage on OTN:
http://www.oracle.com/technetwork/database/enterprise-edition/index-098492.html
Oracle Text Discussion Forums on OTN:
https://forums.oracle.com/community/developer/english/oracle_database/text
SearchTech Blog
https://blogs.oracle.com/searchtech/
Text Analytics Using Oracle Text
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.
August 2013
Author: Roger Ford

This document is provided for information purposes only, and the contents hereof are subject to change without notice. This
document is not warranted to be error-free, nor subject to any other warranties or conditions, whether expressed orally or implied in
law, including implied warranties and conditions of merchantability or fitness for a particular purpose. We specifically disclaim any
liability with respect to this document, and no contractual obligations are formed either directly or indirectly by this document. This
document may not be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without our
prior written permission.

Oracle Corporation
World Headquarters
500 Oracle Parkway
Redwood Shores, CA 94065
U.S.A.
Worldwide Inquiries:
Phone: +1.650.506.7000
Fax: +1.650.506.7200
oracle.com

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.
Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and
are trademarks or registered trademarks of SPARC International, Inc. AMD, Opteron, the AMD logo, and the AMD Opteron logo are
trademarks or registered trademarks of Advanced Micro Devices. UNIX is a registered trademark of The Open Group. 0113