
Every language, including English, presents unique and difficult challenges
for search applications that aim to deliver relevant and precise results.
Rosette Base Linguistics (RBL) enables enterprise applications to effectively
search or process text in many languages by providing a complete set of
linguistic services. RBL enriches the original text in its native language for
best-in-class natural language processing, improving speed and accuracy.
As linguistics experts with deep understanding at the intersection of language
and technology, Basis Technology continually improves the Rosette product
family with language additions, feature updates, and the latest innovations from
the academic world.
40 Supported Languages
Search many languages with high accuracy
KEY FEATURES
- Simple API
- Fast and Scalable
- Industrial-strength Support
- Easy Installation
- Flexible and Customizable
- Integrations: Java, C++, or Web Services
- Platform: Unix, Linux, Mac or Windows
- Component of the Rosette SDK
- Customizable features such as user
dictionaries, orthographic normalization,
and script conversion
www.basistech.com
info@basistech.com
+1 617-386-2090
Start using RBL today
Try our free product evaluation
www.basistech.com
[Figure: example sentence with each word labeled by part of speech (verb, determiner, preposition, noun, conjunction, adjective, punctuation)]
Improve the speed and
accuracy of your search
application with advanced
linguistic analysis.
TOKENIZATION

Many search tools use bigrams to understand
languages written without spaces between
words. This results in a larger index size and
a reduction in relevancy. RBL, in contrast,
accurately identifies and separates each
word through advanced statistical modeling.
The resulting token output (also known as
segmentation) minimizes index size, enhances
search accuracy, and increases relevancy.
DECOMPOUNDING
RBL breaks down compound words into
sub-components and delivers each individual
element to be indexed. This is especially useful
for increasing search relevancy in languages
such as German and Korean.
WESTERN EUROPE
- Catalan*
- Czech
- Danish
- Dutch
- English
- Finnish*
- French
- German
- Greek
- Italian
- Norwegian
- Portuguese
- Spanish
- Swedish
EASTERN EUROPE
- Albanian*
- Bulgarian*
- Croatian*
- Estonian*
- Hungarian
- Latvian*
- Polish
- Romanian
- Russian
- Serbian*
- Slovak*
- Slovenian*
- Turkish
- Ukrainian*
Advanced Morphological Features
Available Languages
LEMMATIZATION
Most search engines rely on a crude method of
chopping off characters at the end of a word in
the hope of removing unimportant differences.
This method, called stemming, often results
in extra recall and poor precision. Instead,
RBL finds the true dictionary form of each
word, known as a lemma, by using vocabulary,
context, and advanced morphological analysis.
Indexing the root form increases search
relevancy and slims the search index by
not indexing all inflected forms. Alternative
lemmas are also made available to supplement
indexing.
PART OF SPEECH TAGGING
As part of the lemmatization process, statistical
modeling is used to determine the correct
part of speech, even with ambiguous words.
Each token is then tagged for enhanced
comprehension and search relevancy.
MIDDLE EAST
- Arabic
- Hebrew
- Pashto
- Persian
- Urdu
ASIA
- Chinese, Simplified
- Chinese, Traditional
- Indonesian
- Japanese
- Korean
- Malay*
- Thai
Example: German
Samstagmorgen is a compound word formed
with Samstag (Saturday) and morgen (morning).
Decompounding allows for an appropriate match
when searching for "Samstag".
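
A minimal sketch of the idea, assuming a small hand-written lexicon: the compound is split into the fewest known components, and both the whole word and its parts are returned as index terms. The names `decompound`, `index_terms`, and `LEXICON` are hypothetical and are not part of the RBL API; real German decompounding also has to handle linking elements, which this sketch ignores.

```python
# Toy dictionary-based decompounder (illustration only, not RBL's method).
LEXICON = {"samstag", "morgen", "tag"}  # assumed toy vocabulary


def decompound(word, lexicon, min_len=3):
    """Split a word into the fewest lexicon entries that cover it exactly."""
    word = word.lower()
    # best[i] holds the shortest list of components covering word[:i].
    best = [None] * (len(word) + 1)
    best[0] = []
    for end in range(1, len(word) + 1):
        for start in range(0, end - min_len + 1):
            part = word[start:end]
            if best[start] is not None and part in lexicon:
                candidate = best[start] + [part]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    # Fall back to the unsplit word when no full cover exists.
    return best[len(word)] or [word]


def index_terms(word, lexicon):
    """Return the original word plus its components, without duplicates."""
    terms = [word.lower()]
    for part in decompound(word, lexicon):
        if part not in terms:
            terms.append(part)
    return terms


print(decompound("Samstagmorgen", LEXICON))
print(index_terms("Samstagmorgen", LEXICON))
```

With the components indexed alongside the compound, a query for "Samstag" can match a document that contains only "Samstagmorgen".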
Example: English
Linguistic analysis is useful for every language;
lemmatization for English improves recall and
precision.
NOUN PHRASE EXTRACTION
Certain nouns, especially proper names, can
be very tricky to identify as a single entity.
RBL groups the nouns and their modifiers,
which is useful in document clustering and
concept extraction.
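
As a rough illustration of grouping nouns with their modifiers, the sketch below collects runs of adjectives and nouns from tokens that already carry part-of-speech tags. The tag names, the sample sentence, and the function `noun_phrases` are assumptions made for this example; they do not describe RBL's tag set or interface.

```python
# Toy noun phrase chunker over pre-tagged tokens (illustration only).
NOUN_TAGS = {"NOUN", "PROPN"}   # assumed tag names
MODIFIER_TAGS = {"ADJ"}


def noun_phrases(tagged_tokens):
    """Return maximal runs of modifiers and nouns that end in a noun."""
    phrases = []
    run = []
    # A sentinel with a non-phrase tag flushes the final run.
    for token, tag in list(tagged_tokens) + [("", "END")]:
        if tag in NOUN_TAGS or tag in MODIFIER_TAGS:
            run.append((token, tag))
            continue
        # Drop trailing modifiers that have no noun to attach to.
        while run and run[-1][1] not in NOUN_TAGS:
            run.pop()
        if run:
            phrases.append(" ".join(tok for tok, _ in run))
        run = []
    return phrases


SENTENCE = [
    ("Basis", "PROPN"), ("Technology", "PROPN"), ("provides", "VERB"),
    ("advanced", "ADJ"), ("linguistic", "ADJ"), ("analysis", "NOUN"),
    (".", "PUNCT"),
]
print(noun_phrases(SENTENCE))
```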
SENTENCE DETECTION
The start and end of each sentence are
automatically identified, even when
punctuation use is ambiguous.
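
To show why punctuation is ambiguous, here is a small rule-based sketch in which a period ends a sentence only when the word it closes is not a known abbreviation. The abbreviation list and the function `split_sentences` are hypothetical, and this simple rule is a stand-in for illustration, not a description of how RBL detects sentences.

```python
import re

# Assumed toy abbreviation list; a real system needs far more coverage.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}


def split_sentences(text):
    """Split on ., ! or ? followed by whitespace, skipping abbreviations."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]+(?=\s|$)", text):
        end = match.end()
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation
        sentences.append(text[start:end].strip())
        start = end
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences


TEXT = "Dr. Smith visited Cambridge. She met Mr. Jones. Did it go well? Yes!"
for sentence in split_sentences(TEXT):
    print(sentence)
```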
CHALLENGE                                      QUERY      STEM    LEMMA
Two unrelated words may share a stem.          animals    anim    animal
                                               animated   anim    animate
Stemming may deliver unintended results.       several    sever   several
Irregular verbs and nouns stump the stemmer.   spoke      spoke   speak (v.), spoke (n.)
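
The contrast between stemming and lemmatization can be sketched with a toy suffix stripper and a tiny hand-written lemma table. Both are assumptions for illustration: `naive_stem` is not any particular published stemmer, and the `LEMMAS` lookup with its part-of-speech tags stands in for the vocabulary, context, and morphological analysis that RBL is described as using.

```python
# Toy suffix-stripping stemmer versus a lookup-based lemmatizer.
SUFFIXES = sorted(["ated", "als", "ing", "al", "ed", "s"], key=len, reverse=True)


def naive_stem(word):
    """Chop the longest listed suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lemma table keyed by (word, part of speech).
LEMMAS = {
    ("animals", "NOUN"): "animal",
    ("animated", "VERB"): "animate",
    ("several", "ADJ"): "several",
    ("spoke", "VERB"): "speak",
    ("spoke", "NOUN"): "spoke",
}


def lemma(word, pos):
    """Look up the dictionary form; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)


for word, pos in LEMMAS:
    print(word, pos, naive_stem(word), lemma(word, pos))
```

The part-of-speech key is what lets "spoke" resolve to "speak" as a verb and stay "spoke" as a noun, which a suffix stripper cannot do.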
WEST COAST
171 Second Street
San Francisco, CA
94105
FEDERAL
2553 Dulles View Dr.
Suite 450
Herndon, VA
20171
HEADQUARTERS
One Alewife Center
Cambridge, MA
02140
EUROPE
Furzeground Way
Middlesex UB11 1BD,
UK
ASIA
9-6 Nibancho,
Chiyoda-ku
Tokyo 102-0084,
Japan
Compatibility: Search Engines, Code Base, Platform Support
Example: Chinese
Consider the problem of indexing "Beijing University Biology Department"
and a subsequent search for "student".

INDEX: Beijing University Biology Department
SEARCH: (Student)

BIGRAMMING
Every overlapping pair of characters is indexed. The index terms include
Beijing, (non-word), University, (Student), and Biology.
Result: "Student" incorrectly hits Beijing University Biology Department.

RBL MORPHOLOGICAL TOKENIZATION
Only actual words are indexed, so (Student) is not an index term.
Result: "Student" correctly misses Beijing University Biology Department.
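
The effect can be reproduced in a few lines. This sketch assumes the Chinese string 北京大学生物系 for "Beijing University Biology Department" and 学生 for "student"; the source shows only the English glosses. Dictionary-based forward maximum matching is used here as a simple stand-in for word segmentation and is not the statistical model the text attributes to RBL. The names `bigrams`, `segment`, and `LEXICON` are hypothetical.

```python
# Bigram indexing versus word segmentation on an unspaced Chinese string.
TEXT = "北京大学生物系"   # assumed rendering of the indexed phrase
QUERY = "学生"            # assumed rendering of "student"

# Assumed toy lexicon. It contains "student", yet segmentation never emits it.
LEXICON = {"北京", "大学", "生物", "系", "学生"}


def bigrams(text):
    """Return every overlapping pair of adjacent characters."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def segment(text, lexicon, max_len=4):
    """Forward maximum matching: take the longest lexicon word at each step."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop advances.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


print("bigram terms:", bigrams(TEXT))
print("bigram hit:", QUERY in bigrams(TEXT))
print("word terms:", segment(TEXT, LEXICON))
print("word hit:", QUERY in segment(TEXT, LEXICON))
```

The bigram index holds six terms for seven characters, one of which is the spurious 学生, while the segmented index holds four real words.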
* Limited Support
© 2014 Basis Technology Corporation. Basis Technology
Corporation, Rosette and Highlight are registered trademarks of
Basis Technology Corporation. Big Text Analytics is a trademark of
Basis Technology Corporation. All other trademarks, service marks,
and logos used in this document are the property of their respective
owners. (2014-04-04-RBL)
