Structured data has the advantage of being easily entered, stored, queried and
analyzed.
Unstructured Data:
The phrase "unstructured data" usually refers to information that doesn't reside in a
traditional row-column database.
Unstructured data files often include text and multimedia content. Examples include email messages, word processing documents, videos, photos, audio files, presentations,
webpages and many other kinds of business documents.
Digging through unstructured data can be cumbersome and costly. Email is a good example of
unstructured data. It's indexed by date, time, sender, recipient, and subject, but the body
of an email remains unstructured. Other examples of unstructured data include books,
documents, medical records, and social media posts.
Fuzzy Search
Let's look at them one by one.
Fuzzy Search:
Fuzzy search, also known as approximate string matching, is the technique of finding
strings that match a pattern approximately (rather than exactly).
It finds matches even when users misspell words or enter only partial words as the
search term.
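To make "approximately" concrete, here is a minimal Python sketch of one common way to measure how close two strings are: the Levenshtein edit distance. This only illustrates the general idea. It is not the algorithm SAP HANA uses internally, and the function names `edit_distance` and `similarity` are made up for this example.

```python
def edit_distance(a, b):
    """Return the Levenshtein distance between strings a and b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Map the edit distance to a score between 0 and 1 (1.0 means identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


print(edit_distance("Tutorl", "Tutorial"))
print(similarity("Tutorl", "Tutorial"))
```

A misspelled word such as "Tutorl" is only a couple of edits away from "Tutorial", so a fuzzy search can still treat it as a match.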
A Real World Example:
If a user types "SAP HANA Tutorl" into Yahoo or Google (both of which use fuzzy
matching), a list of hits is returned along with the question: Did you mean "SAP HANA
Tutorial"?
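A "Did you mean" suggestion like this can be sketched in a few lines of Python with the standard library's `difflib` module. The list `KNOWN_PHRASES` and the function `did_you_mean` are hypothetical names for this example. Real search engines use far more sophisticated techniques, so treat this as an illustration only.

```python
import difflib

# Hypothetical list of known search phrases; a real engine would use its own index.
KNOWN_PHRASES = [
    "SAP HANA Tutorial",
    "SAP HANA Text Analysis",
    "SAP HANA Fuzzy Search",
]


def did_you_mean(query, known_phrases, cutoff=0.8):
    """Return the closest known phrase, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(query, known_phrases, n=1, cutoff=cutoff)
    return matches[0] if matches else None


suggestion = did_you_mean("SAP HANA Tutorl", KNOWN_PHRASES)
if suggestion is not None:
    print('Did you mean "{}"?'.format(suggestion))
```

The `cutoff` parameter plays a role similar to the fuzzy threshold: raising it makes the suggestion stricter, lowering it makes it more tolerant.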
All that tech talk is fine, but how can Text Analysis help companies make more
money?
How many people are showing interest in buying this product?
Are there any negative comments or rumors going around about this product?
Market research and social media monitoring, i.e. what people are saying about
my brand or products
Voice of the Customer / Customer Experience Management
In SAP HANA, you run a fuzzy search by using the CONTAINS predicate with the FUZZY
option in the WHERE clause of a SELECT statement.
Syntax:
SELECT * FROM <tablename>
WHERE CONTAINS (<column_name>, <search_string>, FUZZY (0.8))
A search with FUZZY(x) returns all values that have a fuzzy score greater than or equal to x.
You can request the score in the SELECT statement by using the SCORE() function.
You can sort the results of a query by score in descending order to get the best records first (the
best record is the record that is most similar to the user input). When a fuzzy search of multiple
columns is used in a SELECT statement, the score is returned as an average of the scores of
all columns used.
So not only does it find a "fault tolerant" match, it also puts a score behind it.
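The threshold and the ordering by score can be imitated outside the database with a small Python sketch. The assumption here is that `difflib.SequenceMatcher.ratio()` stands in for the fuzzy score. SAP HANA calculates its score differently, so the numbers will not match what SCORE() returns. The sample company list is made up.

```python
from difflib import SequenceMatcher


def score(search_term, value):
    """Stand-in similarity score between 0 and 1 (not SAP HANA's algorithm)."""
    return SequenceMatcher(None, search_term.lower(), value.lower()).ratio()


def fuzzy_filter(values, search_term, threshold):
    """Keep values whose score is >= threshold, best match first."""
    scored = [(score(search_term, value), value) for value in values]
    hits = [(s, value) for s, value in scored if s >= threshold]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


# Hypothetical sample data
companies = ["SAP AG", "BSAP Corp", "Oracle", "SAP"]
for s, name in fuzzy_filter(companies, "SAP", 0.5):
    print(round(s, 2), name)
```

Lowering the threshold lets more distant values through, and sorting in descending order puts the record most similar to the user input first.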
Example:
When searching with 'SAP', a record like 'SAP AG' gets a high score, because the term 'SAP'
exists in the text. A record like 'BSAP Corp' gets a lower score, because 'SAP' is only a part of
the longer term 'BSAP Corp'.
The output of the fuzzy search contains 5 entries. Based on the fuzzy search factor (which is 0.7 in
this case), it also considers similar words, in this case 'SAP AG', 'BSAP Corp' etc.
Use Case
A call center agent who receives an order by phone needs to know the customer number or, in
the case of a new entry, the system has to warn the agent about a potentially duplicate entry.
A name may be misspelled, or different people may have the same name with different
spellings. For example, "Jimi Hendricks" can be misspelled as "Jimy Hendricks" or
"Jimi Hendrix". An address can also be spelled differently, for example
"Berliner Platz 43", "Berliner Plats 43" or "Berliner Platz".
Without fuzzy search, the system can find only exact matches, that is, entries that are
100% identical. With fuzzy search, the system can find the misspelled entries too.
An exact search for "Jimi" returns only one entry, the one that matches "Jimi" exactly.
The output of the fuzzy search contains 4 entries. Based on the fuzzy search factor (which is 0.7 in
this case), it also considers similar words, in this case "Jimy".
We can also run a fuzzy search on two columns, for example First Name and Last Name.
SQL Query:
SELECT SCORE() AS score, * FROM <Schema_Name>."CUSTOMERS"
WHERE
CONTAINS(FIRST_NAME, 'Jimi', FUZZY(0.7))
AND CONTAINS(LAST_NAME, 'Hendricks', FUZZY(0.7))
ORDER BY score DESC;
The output contains 3 entries. Based on the fuzzy search factor (which is 0.7 in this case), it
also considers similar names, in this case "Jimy Hendricks" and "Jimi Hendrix".
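The two-column behavior (each column must pass the threshold, and the overall score is the average of the column scores) can be sketched in Python as follows. As an assumption, `difflib.SequenceMatcher.ratio()` again stands in for SAP HANA's fuzzy score, so the actual scores in HANA will differ. The `CUSTOMERS` rows are hypothetical sample data.

```python
from difflib import SequenceMatcher


def column_score(search_term, value):
    """Stand-in similarity score between 0 and 1 (not SAP HANA's algorithm)."""
    return SequenceMatcher(None, search_term.lower(), value.lower()).ratio()


def search_customers(customers, first_name, last_name, threshold=0.7):
    """Return (score, customer) pairs where both columns pass the threshold."""
    results = []
    for customer in customers:
        first = column_score(first_name, customer["FIRST_NAME"])
        last = column_score(last_name, customer["LAST_NAME"])
        if first >= threshold and last >= threshold:
            # Overall score: average of the scores of all columns used
            results.append(((first + last) / 2, customer))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


# Hypothetical sample rows
CUSTOMERS = [
    {"FIRST_NAME": "Jimi", "LAST_NAME": "Hendricks"},
    {"FIRST_NAME": "Jimy", "LAST_NAME": "Hendricks"},
    {"FIRST_NAME": "Jimi", "LAST_NAME": "Hendrix"},
    {"FIRST_NAME": "John", "LAST_NAME": "Smith"},
]

for score, customer in search_customers(CUSTOMERS, "Jimi", "Hendricks"):
    print(round(score, 3), customer["FIRST_NAME"], customer["LAST_NAME"])
```

With this sample data the exact match comes first, followed by the two near matches, while the unrelated customer is filtered out.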
In the article "SAP HANA Text Analysis - One of the coolest features of SAP HANA" we
explained what Text Analysis is and why it is so important for business nowadays.
In this article we will show you how you can easily implement Text Analysis in SAP
HANA.
Use-case:
Suppose I am planning to buy a new iPhone 5 and I want to find reviews of it
on the internet. I want to get a pulse of the iPhone 5 before I buy it, not just from the
critics but from actual users like me. I also want to search blogs, news and social media
to find out whether people's reviews are positive, negative or neutral.
Let's see how we can do this with the help of SAP HANA Text Analysis.
Prerequisites:
Download unstructured data (iPhone-News.pdf)
To save time, I have created a PDF file which contains news and blog articles on the iPhone
5.
Note: Python must be configured to connect to SAP HANA before you run the Python code.
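# The dbapi module ships with the SAP HANA client. Newer clients provide it
# as "from hdbcli import dbapi"; adjust the import to match your installation.
import dbapi
# Assume the HANA host is abcd1234 and the instance number is 00 (SQL port 30015),
# the SAP HANA user is USER1 and the password is Password1
conn = dbapi.connect('abcd1234', 30015, 'USER1', 'Password1')
# Check whether the database connection was successful
print(conn.isconnected())
# Open a cursor
cur = conn.cursor()
# Read the whole PDF as bytes (read-only, binary mode); the file is closed automatically
with open('iPhone-News.pdf', 'rb') as pdf_file:
    content = pdf_file.read()
# Save the content to the table - replace SCHEMA1 with your schema
cur.execute("INSERT INTO SCHEMA1.IPHONE_NEWS VALUES(?,?)", ('iPhoneNews.pdf', content))
# Commit explicitly in case autocommit is disabled on the connection
conn.commit()
print('pdf file uploaded to HANA')
# Release the cursor and the connection
cur.close()
conn.close()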
After executing the above Python script, the PDF data is uploaded to the HANA table.
The next step is to create a full text index called "PDF_FTI" (you can use any name) on the BLOB
column "File_Content" of the table "IPHONE_NEWS".
When this index is created, a new column table called $TA_PDF_FTI ($TA_<Index_Name>)
is also created. It contains the result of our Text Analysis process.
Note: If you do not see this table under your schema, try refreshing the schema.
That's it. Text Analysis is now implemented; SAP HANA does everything else.
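The index creation described above could look roughly like the statement built by the Python sketch below. This is an assumption, not the original script: the helper names are made up, and the configuration name 'EXTRACTION_CORE_VOICEOFCUSTOMER' is assumed because the sentiment types queried later come from a voice-of-customer analysis. Check the CREATE FULLTEXT INDEX documentation for your HANA release before running it.

```python
def ta_table_name(index_name):
    """Name of the result table created for a full text index ($TA_<Index_Name>)."""
    return "$TA_" + index_name


def create_fulltext_index_sql(schema, table, column, index_name,
                              configuration="EXTRACTION_CORE_VOICEOFCUSTOMER"):
    """Build a CREATE FULLTEXT INDEX statement with text analysis enabled.

    The configuration name is an assumption; adjust it to your system.
    """
    return (
        'CREATE FULLTEXT INDEX "{index}" ON "{schema}"."{table}"("{column}") '
        "TEXT ANALYSIS ON CONFIGURATION '{config}'"
    ).format(index=index_name, schema=schema, table=table,
             column=column, config=configuration)


# SCHEMA1 is a placeholder; replace it with your schema
print(create_fulltext_index_sql("SCHEMA1", "IPHONE_NEWS", "File_Content", "PDF_FTI"))
print(ta_table_name("PDF_FTI"))
```

The generated statement can be run in the SQL console or passed to `cur.execute()` on an open connection.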
Further Analysis:
Two columns of the table $TA_PDF_FTI are especially important for us.
TA_TOKEN
This column contains the extracted entity or element (for example, an identifiable
person, place, topic, organization, or sentiment).
TA_TYPE
This is the category the entity falls under. For example PERSON, PLACE, PRODUCT
etc.
To find out people's reviews and sentiments about the iPhone, we can query the table
$TA_PDF_FTI like this.
SELECT "TA_TYPE",
ROUND("SENTIMENT_VALUE" / "TOTAL_SENTIMENT_VALUE" * 100, 2) AS "SENTIMENT_VALUE_PERCENTAGE"
FROM
(
SELECT "TA_TYPE", SUM("TA_COUNTER") AS "SENTIMENT_VALUE"
FROM <SCHEMA_NAME>."$TA_PDF_FTI"
WHERE "TA_TYPE"
IN ('WeakPositiveSentiment','StrongPositiveSentiment','NeutralSentiment',
'WeakNegativeSentiment','StrongNegativeSentiment','MajorProblem','MinorProblem')
GROUP BY "TA_TYPE"
) AS TABLE1,
(
SELECT SUM("TA_COUNTER") AS "TOTAL_SENTIMENT_VALUE"
FROM <SCHEMA_NAME>."$TA_PDF_FTI"
WHERE "TA_TYPE"
IN ('WeakPositiveSentiment','StrongPositiveSentiment','NeutralSentiment',
'WeakNegativeSentiment','StrongNegativeSentiment','MajorProblem','MinorProblem')
) AS TABLE2;
The result shows that a larger percentage of people are giving this product a positive
review.
Good, now I can go ahead and buy my new iPhone 5.
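The percentage calculation performed by the query can also be sketched in plain Python, which may help when you fetch the raw rows and want to post-process them yourself. The function name `sentiment_percentages` and the sample rows are made up for illustration. They are not actual results from the iPhone data.

```python
SENTIMENT_TYPES = {
    "WeakPositiveSentiment", "StrongPositiveSentiment", "NeutralSentiment",
    "WeakNegativeSentiment", "StrongNegativeSentiment",
    "MajorProblem", "MinorProblem",
}


def sentiment_percentages(rows):
    """Compute each sentiment type's share (in percent) of all sentiment counts.

    rows: iterable of (ta_type, ta_counter) tuples, e.g. fetched from the $TA table.
    """
    totals = {}
    for ta_type, counter in rows:
        if ta_type in SENTIMENT_TYPES:  # ignore non-sentiment entities
            totals[ta_type] = totals.get(ta_type, 0) + counter
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {t: round(v * 100.0 / grand_total, 2) for t, v in totals.items()}


# Hypothetical sample rows, not real output
sample_rows = [
    ("StrongPositiveSentiment", 3),
    ("WeakPositiveSentiment", 1),
    ("PRODUCT", 10),
    ("WeakNegativeSentiment", 1),
    ("StrongPositiveSentiment", 3),
]
print(sentiment_percentages(sample_rows))
```

As in the SQL version, rows whose type is not a sentiment category (such as PRODUCT) are left out of both the per-type sums and the total.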
What's Next:
We can use this full text index table to get a lot of information other than just
sentiments.
Let's take a look at the structure of this table.
Column Name   Key   Data Type       Description
-----------   ---   -------------   -----------
File_Name     Yes   -               This is the primary key of my table. If you have more than one column in your primary key, the $TA table will include every single column.
RULE          Yes   NVARCHAR(200)   -
COUNTER       Yes   BIGINT          -
TOKEN         No    NVARCHAR(250)   -
LANGUAGE      No    NVARCHAR(2)     You can either specify a language column when you create the fulltext index or it can be derived from the text. In my case it was derived from the text and is English (en).
TYPE          No    NVARCHAR(100)   -
NORMALIZED    No    NVARCHAR(250)   -
STEM          No    NVARCHAR(300)   -
PARAGRAPH     No    INTEGER         -
SENTENCE      No    INTEGER         -
CREATED_AT    No    TIMESTAMP       Creation timestamp
Hope you liked this article. If you have any question please leave a comment.