
Mining social media data for customer feedback is one of the greatest untapped
opportunities for customer analysis in many organizations today.


As many are aware, twenty-first century corporations are facing a crisis. Many
corporations have been accurately and comprehensively storing data for years. The
data comes in a variety of forms, such as social media posts, emails, blogs, news, feedback,
tweets and business documents.
It is very important to extract meaningful information without having to read every single
sentence. But what is meaningful information? The extraction process should identify
the "who", "what", "where", "when" and "how much" (among other things) in this
data.
For example, you can use social media data to find out:

What are people saying about my brand or products?

How many people recommend my brand vs. advocate against it?


Text Analysis is the solution to all of these problems.
In this article we will explain:

What is Text Analysis?

Why is Text Analysis so important for business?

How does SAP HANA support text analysis?


Before understanding Text Analysis, you first need to understand structured data
and unstructured data.

Structured and Unstructured Data:


Structured Data:
Data that resides in a fixed field within a record or file is called structured data. This
includes data contained in relational databases and spreadsheets.
For example, data stored in database tables is structured data.

Structured data has the advantage of being easily entered, stored, queried and analyzed.

Unstructured Data:
The phrase "unstructured data" usually refers to information that doesn't reside in a
traditional row-column database.

Unstructured data files often include text and multimedia content. Examples include email
messages, word processing documents, videos, photos, audio files, presentations,
webpages and many other kinds of business documents.

Digging through unstructured data can be cumbersome and costly. Email is a good example of
unstructured data. It's indexed by date, time, sender, recipient, and subject, but the body
of an email remains unstructured. Other examples of unstructured data include books,
documents, medical records, and social media posts.
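
The email example can be made concrete with a small Python sketch. It uses the standard library `email` module to separate the structured header fields from the unstructured body. The message itself is invented for illustration.

```python
from email import message_from_string

# Hypothetical raw message, invented for illustration.
RAW_EMAIL = """\
From: customer@example.com
To: support@example.com
Subject: Order 4711 arrived damaged
Date: Mon, 03 Jun 2013 09:15:00 +0000

Hi, the phone I ordered arrived with a cracked screen.
I would like a replacement as soon as possible.
"""


def split_email(raw):
    """Return (structured header fields, unstructured body text)."""
    msg = message_from_string(raw)
    headers = {name: msg[name] for name in ("From", "To", "Subject", "Date")}
    body = msg.get_payload()
    return headers, body


headers, body = split_email(RAW_EMAIL)
print(headers["Subject"])  # fixed field: can be filtered or sorted directly
print(body)                # free text: needs text analysis to interpret
```

The header fields can be queried like table columns, while the complaint in the body only becomes usable after some form of text analysis.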

Why is unstructured data so important for business?


Experts estimate that 80 to 90 percent of the data in any organization is unstructured.
And the amount of unstructured data in enterprises is growing significantly -- often many
times faster than structured databases are growing.

The only problem is extracting meaningful information from unstructured data.

What is Text Analysis?

Text Analysis is the process of analyzing unstructured text, extracting relevant
information and then transforming that information into structured information
that can be leveraged in different ways.
Text Analysis refers to the ability to do Natural Language Processing, linguistically
understand the text and apply statistical techniques to refine the results.
With the help of text analysis we can model and structure the information content of
unstructured data for the purpose of business analysis, research and investigation.

Mapping Business Needs to Text Analysis

Example of Meaning Extraction from a sentence

A few important techniques are used in Text Analysis:

Full Text Search

Full Text Indexing

Fuzzy Search

Let's look at them one by one.

Full Text Search:


The primary function of full-text search is to optimize linguistic searches.
Full text search is designed to perform linguistic (language-based) searches against text
and documents stored in your database.
In a full-text search, the search engine examines all of the words in every stored
document as it tries to match search criteria (text specified by a user).

Full Text Indexing:


When dealing with a small number of documents, it is possible for the full-text-search
engine to directly scan the contents of the documents with each query, a strategy called
"serial scanning." This is what some rudimentary tools, such as grep, do when
searching.
However, when the number of documents to search is potentially large, the problem of
full-text search is often divided into two tasks: indexing and searching.
The indexing stage will scan the text of all the documents and build a list of search
terms (often called an index). In the search stage, when performing a specific query,
only the index is referenced, rather than the text of the original documents.
The indexer will make an entry in the index for each term or word found in a document,
and possibly note its relative position within the document.
Conceptually, full-text indexes support searching on columns in the same way that
indexes support searching through books.
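
The difference between serial scanning and the two-stage index/search approach can be sketched in a few lines of Python. This is a minimal illustration with invented sample documents, not how SAP HANA builds its full-text index internally.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def serial_scan(documents, term):
    """Scan every document on every query (fine for a handful of documents)."""
    term = term.lower()
    return sorted(doc_id for doc_id, text in documents.items()
                  if term in tokenize(text))


def build_index(documents):
    """Indexing stage: map each term to {doc_id: [positions in the document]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index, term):
    """Search stage: consult only the index, never the original text."""
    return sorted(index.get(term.lower(), {}))


# Hypothetical sample documents.
documents = {
    1: "SAP HANA supports full text search",
    2: "Fuzzy search tolerates typos",
    3: "HANA stores text in column tables",
}
index = build_index(documents)
print(search(index, "search"))  # ids of the documents containing the term
print(index["hana"])            # per-document word positions of the term
```

Both functions return the same document ids, but `search` only performs a dictionary lookup, which is why the index pays off once the number of documents grows.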

Fuzzy Search:
Fuzzy search is also known as approximate string matching.
It is the technique of finding strings that match a pattern approximately
(rather than exactly).
It is a type of search that will find matches even when users misspell words or enter
only partial words for the search.
A Real World Example:
If a user types "SAP HANA Tutorl" into Yahoo or Google (both of which use fuzzy
matching), a list of hits is returned along with the question, "Did you mean 'SAP HANA
Tutorial'?"
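
A "Did you mean" suggestion of this kind can be approximated with Python's standard library. The sketch below uses `difflib.get_close_matches` against a small, made-up vocabulary. It only illustrates approximate string matching and makes no claim about how any search engine implements it.

```python
from difflib import get_close_matches

# Hypothetical vocabulary of known search terms.
KNOWN_TERMS = ["tutorial", "hana", "sap", "table", "text"]


def did_you_mean(query, vocabulary=KNOWN_TERMS, cutoff=0.7):
    """Suggest a corrected query, or return None if nothing needs correcting."""
    corrected = []
    changed = False
    for word in query.split():
        lower = word.lower()
        if lower in vocabulary:
            corrected.append(word)  # known word: keep it as typed
            continue
        matches = get_close_matches(lower, vocabulary, n=1, cutoff=cutoff)
        if matches:
            corrected.append(matches[0])  # replace with the closest known term
            changed = True
        else:
            corrected.append(word)  # no close match: leave it untouched
    return " ".join(corrected) if changed else None


print(did_you_mean("SAP HANA Tutorl"))
```

The `cutoff` parameter plays a role similar to a fuzzy threshold: the lower it is, the more distant the suggestions that are accepted.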

How Businesses Can Leverage Text Analysis:

All that tech talk is fine, but how can Text Analysis help companies make more
money?

Below are a few real-world examples.


Automate the process of customer response:
An airline wanted to automate the process of responding to customer requests sent
via email. Using SAP Text Analysis technology, it is able to classify incoming emails
and respond to requests accurately and effectively. This also helps it reduce
call-center costs.

Automate document categorization, search and retrieval:
Another example is a financial services company that uses SAP Text Analysis
technology as the backbone of its automatic content enrichment platform. It uses
Text Analysis to discover metadata in incoming text data feeds, making document
categorization, search and retrieval a seamless process.

Find public intent to buy a product from Twitter:
Suppose your company is planning to launch a new product (say, a smartphone or a bike)
on the market.
You can run a text analysis on Twitter data to find out:

How many people are showing interest in buying this product?

How frequently are people talking about this new product?

Are there any negative comments or rumors going around about this product?

Top Business Use Cases of Text Analysis:


Brand/ Product/ Reputation Management

Market research and social media monitoring, i.e. what are people saying about
my brand or products?

Voice of the Customer/ Customer Experience Management

Do I need to step in and offer customer service?

How many people recommend my brand vs. advocate against it?

Search, Information Access, or Question Answering

Which bloggers are negative towards USA policies?

Which of the hotels in India get great reviews for room service?

Competitive Intelligence

What competing products are people considering and why?

Is competitors' media spend generating purchase intent?

Implementation of Text Analysis in SAP HANA:


The implementation of Text Analysis is one of the coolest features of SAP HANA.
Text analysis is supported from SAP HANA SP05 onwards.
SAP HANA Text Analysis has market-leading, out-of-the-box predefined entity types that
are packaged as part of the platform. Looking at a clause, sentence, paragraph, or
document, the technology can identify the "who", "what", "where", "when" and "how
much" and classify it accordingly.
For example, in the sentence "India celebrates Independence day on 15th
August", the analysis can identify the country, holiday and month using HANA's
predefined core extraction.
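
To illustrate what "identify and classify" means for this sentence, here is a deliberately naive Python sketch that uses a lookup table. The dictionary and the type names are hypothetical. SAP HANA's predefined extraction relies on linguistic rules and uses its own entity type names.

```python
# Hypothetical dictionary; a real extraction engine uses far richer
# linguistic rules than a lookup table.
ENTITY_DICTIONARY = {
    "india": "COUNTRY",
    "independence day": "HOLIDAY",
    "august": "MONTH",
}


def extract_entities(text):
    """Return (token, type) pairs for dictionary entries found in the text,
    in order of appearance."""
    lowered = text.lower()
    found = []
    for phrase, entity_type in ENTITY_DICTIONARY.items():
        position = lowered.find(phrase)
        if position != -1:
            # Keep the original casing of the matched span.
            token = text[position:position + len(phrase)]
            found.append((position, token, entity_type))
    return [(token, entity_type) for _, token, entity_type in sorted(found)]


sentence = "India celebrates Independence day on 15th August"
for token, entity_type in extract_entities(sentence):
    print(entity_type, "->", token)
```

The result is a small structured table of tokens and types, which is the kind of output a text analysis process turns free text into.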
If you have read this far, you should have a clear understanding of Text Analysis.
If you have any doubts or questions, please leave a comment.

How to use the Fuzzy Search in SAP HANA

In this article we will talk about

What is Fuzzy Search?

Why is Fuzzy Search important?

Real-World Example of Fuzzy Search Based Applications.

How to Implement Fuzzy Search in SAP HANA?

What is Fuzzy Search?


Fuzzy search is also known as approximate string matching.
It is the technique of finding strings that match a pattern approximately (rather than
exactly).
It is a type of search that will find matches even when users misspell words or enter only
partial words for the search.
Purpose:
With the help of fuzzy search, queries containing misspellings and typos still return relevant results.

A Real World Example:


If a user types "SAP HANA Tutorl" into Yahoo or Google (both of which use fuzzy matching), a
list of hits is returned along with the question, "Did you mean 'SAP HANA Tutorial'?"

Fuzzy Search in SAP HANA:

In SAP HANA, you can call the fuzzy search by using the CONTAINS predicate with the FUZZY
option in the WHERE clause of a SELECT statement.
Syntax:
SELECT * FROM <tablename>
WHERE CONTAINS (<column_name>, <search_string>, FUZZY (0.8))

A search with FUZZY(x) returns all values that have a fuzzy score greater than or equal to x.

The SCORE() Function


The fuzzy search algorithm calculates a fuzzy score for each string comparison. The higher the
score, the more similar the strings are. A score of 1.0 means the strings are identical. A score of
0.0 means the strings have nothing in common.

You can request the score in the SELECT statement by using the SCORE() function.

You can sort the results of a query by score in descending order to get the best records first (the
best record is the record that is most similar to the user input). When a fuzzy search of multiple
columns is used in a SELECT statement, the score is returned as an average of the scores of
all columns used.

So not only does it find a "fault tolerant" match, it also puts a score behind it.

Example:
When searching with 'SAP', a record like 'SAP AG' gets a high score, because the term 'SAP'
exists in the text. A record like 'BSAP Corp' gets a lower score, because 'SAP' is only a part of
the longer term 'BSAP Corp'.
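
The behaviour described above (a score between 0.0 and 1.0, a threshold, descending order, and averaging over several columns) can be imitated in plain Python. This sketch uses `difflib.SequenceMatcher.ratio()` as a stand-in score. It is not SAP HANA's fuzzy score, and the records with their city values are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in score in [0.0, 1.0]; 1.0 means identical strings.
    This is difflib's ratio, not SAP HANA's fuzzy score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_filter(records, columns, terms, threshold):
    """Keep records whose average per-column score is >= threshold,
    best match first."""
    scored = []
    for record in records:
        scores = [similarity(record[column], term)
                  for column, term in zip(columns, terms)]
        score = sum(scores) / len(scores)  # average over all searched columns
        if score >= threshold:
            scored.append((score, record))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical records, invented for illustration.
records = [
    {"name": "SAP", "city": "Walldorf"},
    {"name": "SAP AG", "city": "Waldorf"},
    {"name": "BSAP Corp", "city": "Walldorf"},
    {"name": "IBM Corp", "city": "Armonk"},
]
results = fuzzy_filter(records, ["name", "city"], ["SAP", "Walldorf"], 0.7)
for score, record in results:
    print(round(score, 2), record)
```

As in the example above, 'SAP AG' ends up ahead of 'BSAP Corp', and the unrelated record falls below the threshold and is dropped.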

Create the table and data:


-- REPLACE <Schema_Name> WITH YOUR SCHEMA NAME
CREATE COLUMN TABLE <Schema_Name>.COMPANIES(
ID INTEGER PRIMARY KEY,
COMPANY_NAME SHORTTEXT(200) FUZZY SEARCH INDEX ON);

INSERT INTO <Schema_Name>.COMPANIES VALUES (1, 'SAP');
INSERT INTO <Schema_Name>.COMPANIES VALUES (2, 'SAP in Walldorf');
INSERT INTO <Schema_Name>.COMPANIES VALUES (3, 'SAP AG');
INSERT INTO <Schema_Name>.COMPANIES VALUES (4, 'ASAP Corp');
INSERT INTO <Schema_Name>.COMPANIES VALUES (5, 'BSAP orp');
INSERT INTO <Schema_Name>.COMPANIES VALUES (6, 'IBM Corp');

Perform the search on one column:


SELECT SCORE() AS score, * FROM <Schema_Name>.COMPANIES
WHERE CONTAINS(COMPANY_NAME,'SAP',
FUZZY(0.7,'textSearch=compare,bestMatchingTokenWeight=0.7'))
ORDER BY score DESC;

The output of the fuzzy search contains 5 entries. Based on the fuzzy search threshold (0.7 in
this case), it also returns similar values, such as "SAP AG" and "BSAP orp".

A Real-World Example of Fuzzy Search:

Use Case
A call center agent who receives an order by phone needs to know the customer number or, in
the case of a new entry, the system has to warn the agent about a potentially duplicate entry.
A name may be misspelled, or different people may have the same name spelled
differently. For example, "Jimi Hendricks" can be misspelled as "Jimy
Hendricks" or "Jimi Hendrix". The address can also be spelled differently, for example
"Berliner Platz 43", "Berliner Plats 43" or "Berliner Platz".

Without fuzzy search, the system can only find exact matches, that is, entries that are
100% identical. With fuzzy search, the system can find the misspelled entries too.

Create table and some data:


-- REPLACE <Schema_Name> WITH YOUR SCHEMA NAME
create column table <Schema_Name>."CUSTOMERS"(
"CUSTOMER_ID" VARCHAR (5) not null default '',
"FIRST_NAME" VARCHAR (20) null default '',
"LAST_NAME" VARCHAR (20) null default '',
"STREET" VARCHAR (20) null default '',
"CITY" VARCHAR (20) null default '',
"COUNTRY" VARCHAR (20) null default '',
"POSTAL_CODE" VARCHAR (20) null default '',
primary key ("CUSTOMER_ID"));

insert into <Schema_Name>."CUSTOMERS" values('00001','Jimi','Hendricks','Berliner Platz 43','Munchen','Germany','80805');
insert into <Schema_Name>."CUSTOMERS" values('00002','Jimy','Hendricks','Berlinr Platz 43','Munchen','Germany','80805');
insert into <Schema_Name>."CUSTOMERS" values('00003','Jimi','Hendrix','Berliner Plats 43','Munchen','Germany','80805');
insert into <Schema_Name>."CUSTOMERS" values('00004','Jimy','Feuer','Berliner','Munchen','Germany','80805');
insert into <Schema_Name>."CUSTOMERS" values('00006','Sven','Ottlieb','Walserweg 21','Aachen','Germany','52066');
insert into <Schema_Name>."CUSTOMERS" values('00007','Philip','Cramer','Maubelstr. 90','Brandenburg','Germany','14776');
insert into <Schema_Name>."CUSTOMERS" values('00008','Renate','Messner','Magazinweg 7','Frankfurt','Germany','60528');
insert into <Schema_Name>."CUSTOMERS" values('00009','Alexander','Feuer','Heerstr. 22','Leipzig','Germany','04179');
insert into <Schema_Name>."CUSTOMERS" values('00010','Antonio','Moreno','Mataderos 2312','Mexico','Mexico','05023');
insert into <Schema_Name>."CUSTOMERS" values('00011','Thomas','Hardy','120 Hanover','London','UK','WA1 1DP');
insert into <Schema_Name>."CUSTOMERS" values('00012','Christina','Berglund','Berguvsvagen 8','Lulea','Sweden','S-958 22');

Without Fuzzy Search:


Suppose you want to search for a customer with the first name "Jimi".
SQL Query:
SELECT * FROM <Schema_Name>."CUSTOMERS"
WHERE CONTAINS(FIRST_NAME, 'Jimi')
ORDER BY "CUSTOMER_ID" DESC;

The output contains only the entries whose first name exactly matches "Jimi".

Now let us try the fuzzy search function.


SQL Query:
SELECT SCORE() AS score, * FROM <Schema_Name>."CUSTOMERS"
WHERE
CONTAINS(FIRST_NAME, 'Jimi', FUZZY(0.7))
ORDER BY score DESC;

The output of the fuzzy search contains 4 entries. Based on the fuzzy search threshold (0.7 in
this case), it also returns similar values, in this case "Jimy".

We can also do a fuzzy search on 2 columns, for example First Name and Last Name.
SQL Query:
SELECT SCORE() AS score, * FROM <Schema_Name>."CUSTOMERS"
WHERE
CONTAINS(FIRST_NAME, 'Jimi', FUZZY(0.7))
and CONTAINS(LAST_NAME, 'Hendricks', FUZZY(0.7))
ORDER BY score DESC;

The output contains 3 entries. Based on the fuzzy search factor (which is 0.7 in this case), it will
also consider the similar names. In this case "Jimy Hendricks" and "Jimi Hendrix".
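
When the searched columns vary, the statement can be assembled programmatically. The helper below is a hypothetical sketch (the name `build_fuzzy_query` is ours) that produces one fuzzy CONTAINS predicate per column and returns the search terms separately. It assumes that your SAP HANA client accepts `?` bind parameters inside CONTAINS, which you should verify for your driver version. The schema prefix is omitted for brevity.

```python
import re

# Only plain, unquoted identifiers are accepted, to avoid SQL injection
# through table or column names.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_fuzzy_query(table, criteria, threshold=0.7):
    """Build a SELECT with one fuzzy CONTAINS predicate per column.

    criteria maps column names to search terms. Returns (sql, params);
    the terms are returned separately so they can be bound to the `?`
    placeholders instead of being pasted into the SQL text.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    for name in [table, *criteria]:
        if not _IDENTIFIER.match(name):
            raise ValueError("unexpected identifier: %r" % name)
    predicates = ["CONTAINS(%s, ?, FUZZY(%s))" % (column, threshold)
                  for column in criteria]
    sql = ("SELECT SCORE() AS score, * FROM %s WHERE %s ORDER BY score DESC"
           % (table, " AND ".join(predicates)))
    return sql, tuple(criteria.values())


sql, params = build_fuzzy_query(
    "CUSTOMERS", {"FIRST_NAME": "Jimi", "LAST_NAME": "Hendricks"})
print(sql)
print(params)
```

With an open cursor, the result would be passed on as `cur.execute(sql, params)`.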

In the article "SAP HANA Text Analysis - One of the coolest features of SAP HANA" we explained
what Text Analysis is and why it is so important for business nowadays.
In this article we will show you how you can easily implement Text Analysis in SAP
HANA.

Use-case:
Suppose I am planning to buy a new iPhone 5 and I want to know what reviews say about it
on the internet. I want to get a pulse of the iPhone 5 before I buy it, not just from the
critics but from actual users like me. I also want to search blogs, news and social media
to find out whether people's reviews are positive, negative or neutral.
Let's see how we can do this with the help of SAP HANA Text Analysis.

Prerequisites:
Download unstructured data (iPhone-News.pdf)

To save time, I have created a PDF file which contains news and blog articles on the iPhone
5. Download it from here.

Create Table in SAP HANA


Create a table in SAP HANA which will contain this unstructured data. Replace
<SCHEMA_NAME> with your schema.
CREATE COLUMN TABLE <SCHEMA_NAME>."IPHONE_NEWS" (
"File_Name" NVARCHAR(20),
"File_Content" BLOB ,
PRIMARY KEY ("File_Name"));

Upload pdf file to SAP HANA using Python


Use the Python code below to upload the PDF file to SAP HANA.

Note: Check below article to configure Python before running the Python code.

import dbapi  # Python client module shipped with the SAP HANA client

# Assume the HANA host is abcd1234 and the instance number is 00 (SQL port 30015),
# the SAP HANA user is USER1 and the password is Password1
conn = dbapi.connect('abcd1234', 30015, 'USER1', 'Password1')
# Check whether the database connection was successful
print(conn.isconnected())
# Open a cursor
cur = conn.cursor()
# Open the file in read-only binary mode and read its content;
# the file is closed automatically at the end of the with block
with open('iPhone-News.pdf', 'rb') as pdf_file:
    content = pdf_file.read()
# Save the content to the table - replace SCHEMA1 with your schema
cur.execute("INSERT INTO SCHEMA1.IPHONE_NEWS VALUES(?,?)", ('iPhone-News.pdf', content))
print('pdf file uploaded to HANA')
# Close the cursor
cur.close()
# Close the connection
conn.close()

After executing the above Python script, the PDF data will be uploaded to the HANA table.

Implement Text Analysis in SAP HANA:


The most impressive thing about Text Analysis is how easy it is to implement.
The only thing we need to do is run the following statement:
Create FullText Index "PDF_FTI" On
<SCHEMA_NAME>."IPHONE_NEWS"("File_Content")
TEXT ANALYSIS ON
CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER';

This will create a full text index called "PDF_FTI" (you can use any name) on the BLOB
column "File_Content" of the table "IPHONE_NEWS".
When this statement is executed, a new column table called $TA_PDF_FTI
($TA_<Index_Name>) is created that contains the result of the Text Analysis process.

Note: If you do not see this table under your schema, try refreshing the schema.

That's it. Text Analysis is now implemented; SAP HANA does everything else.

Further Analysis:
Two columns of the table $TA_PDF_FTI are very important for us.
TA_TOKEN
This column contains the extracted entity or element (for example, an identifiable
person, place, topic, organization, or sentiment).
TA_TYPE
This is the category the entity falls under, for example PERSON, PLACE or PRODUCT.

To find out people's reviews and sentiments about the iPhone, we can query the table
$TA_PDF_FTI like this.
SELECT "TA_TYPE",
       ROUND("SENTIMENT_VALUE" / "TOTAL_SENTIMENT_VALUE" * 100, 2) AS "SENTIMENT_VALUE_PERCENTAGE"
FROM
(
    -- Number of extracted tokens per sentiment type
    SELECT "TA_TYPE", SUM("TA_COUNTER") AS "SENTIMENT_VALUE"
    FROM <SCHEMA_NAME>."$TA_PDF_FTI"
    WHERE "TA_TYPE"
    IN ('WeakPositiveSentiment','StrongPositiveSentiment','NeutralSentiment',
        'WeakNegativeSentiment','StrongNegativeSentiment','MajorProblem','MinorProblem')
    GROUP BY "TA_TYPE"
) AS TABLE1,
(
    -- Total number of sentiment tokens, used as the denominator
    SELECT SUM("TA_COUNTER") AS "TOTAL_SENTIMENT_VALUE"
    FROM <SCHEMA_NAME>."$TA_PDF_FTI"
    WHERE "TA_TYPE"
    IN ('WeakPositiveSentiment','StrongPositiveSentiment','NeutralSentiment',
        'WeakNegativeSentiment','StrongNegativeSentiment','MajorProblem','MinorProblem')
) AS TABLE2;

The query returns one row per sentiment type together with its percentage share.

The result shows that a larger percentage of people are giving this product a positive
review.
Good, now I can go ahead and buy my new iPhone 5.
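
The arithmetic behind the query above is easy to check outside the database. The following Python sketch computes the same percentage shares from (TA_TYPE, TA_COUNTER) pairs. The sample rows are invented for illustration and are not results from the iPhone data.

```python
from collections import defaultdict

SENTIMENT_TYPES = {
    "WeakPositiveSentiment", "StrongPositiveSentiment", "NeutralSentiment",
    "WeakNegativeSentiment", "StrongNegativeSentiment",
    "MajorProblem", "MinorProblem",
}


def sentiment_percentages(rows):
    """rows: iterable of (ta_type, ta_counter) pairs.
    Returns {ta_type: percentage share of all sentiment counters}."""
    totals = defaultdict(int)
    for ta_type, ta_counter in rows:
        if ta_type in SENTIMENT_TYPES:  # same filter as the WHERE clause
            totals[ta_type] += ta_counter
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {ta_type: round(value / grand_total * 100, 2)
            for ta_type, value in totals.items()}


# Hypothetical rows, invented for illustration.
rows = [
    ("StrongPositiveSentiment", 12),
    ("WeakPositiveSentiment", 18),
    ("WeakNegativeSentiment", 6),
    ("MinorProblem", 4),
    ("PRODUCT", 25),  # not a sentiment type, so it is ignored
]
percentages = sentiment_percentages(rows)
for ta_type, share in percentages.items():
    print(ta_type, share)
```

Because every share is divided by the same grand total, the percentages add up to 100 (apart from rounding), which is a quick sanity check for the SQL result as well.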

What's Next:
We can use this full text index table to get a lot of information other than just
sentiments.
Let's take a look at the structure of this table.

| Column Name | Key | Description | Data Type |
|---|---|---|---|
| File_Name | Yes | This is the primary key of my table. If you have more than one column in your primary key, the $TA table will include every single column. | Same as in source table. In this case: NVARCHAR(20) |
| TA_RULE | Yes | Stores the rule package that yielded the token. In my case: "Entity Extraction" | NVARCHAR(200) |
| TA_COUNTER | Yes | Counts all tokens across the document | BIGINT |
| TA_TOKEN | No | The token that was extracted (the "who", "what", "where", "when" and "how much") | NVARCHAR(250) |
| TA_LANGUAGE | No | You can either specify a language column when you create the fulltext index or it can be derived from the text. In my case it was derived from the text and is English (en) | NVARCHAR(2) |
| TA_TYPE | No | The token type, whether it is a "who", a "what", a "where", etc. | NVARCHAR(100) |
| TA_NORMALIZED | No | Stores a normalized representation of the token. This becomes relevant e.g. for German with umlauts, or ß/ss. Normalization with regard to capitalization alone would not be important enough to justify this column. | NVARCHAR(250) |
| TA_STEM | No | Stores the linguistic stemming information, e.g. the singular nominative for nouns, or the indicative for verbs. If text analysis yields several stems, only the first stem will be stored, assuming this to be the best match. | NVARCHAR(300) |
| TA_PARAGRAPH | No | The paragraph number where my token is located in the document | INTEGER |
| TA_SENTENCE | No | The sentence number where my token is located in the document | INTEGER |
| TA_CREATED_AT | No | Creation timestamp | TIMESTAMP |

Hope you liked this article. If you have any questions, please leave a comment.