

ICDAR 2015:

ICDAR is the premier international forum for researchers and practitioners in the document analysis community for
identifying, encouraging and exchanging ideas on state-of-the-art technology in document analysis. It is
sponsored by the International Association for Pattern Recognition (IAPR) and technically co-sponsored by the IEEE
Computer Society, TC-10 (Graphics Recognition), IEEE Region 8, TC-11 (Reading Systems) and the Tunisian Chapter of
the IEEE Computer Society.

ICDAR hosts the Robust Reading Competition, and the 2015 edition brought a number of changes. First, a new
challenge on Incidental Scene Text was introduced, based on a new dataset of 1,670 images acquired with
Google Glass. Secondly, an End-to-End task was introduced that measures how well a system simultaneously localizes
and recognizes the words in an image or video sequence. Finally, the Video Text datasets were updated as well,
bringing the total to 49 sequences comprising 25,824 frames.


COCO-Text is a new large-scale dataset for the detection and recognition of text in natural images.
It contains 63,686 images and 145,859 text instances, annotated with 3 fine-grained text attributes.
Every text instance has a bounding box for localization, and legible text also has a transcription. An image can
contain multiple text instances, and each instance is categorized as machine printed or handwritten, legible or
illegible, and English script or non-English script.
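
The three attributes described above can be pictured as fields on a per-instance record. The following is only a minimal sketch: the `TextInstance` class, its field names and the sample values are hypothetical and do not reproduce the official annotation schema or API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TextInstance:
    """One annotated text region (hypothetical schema, not the official format)."""
    image_id: int
    bbox: Tuple[float, float, float, float]  # x, y, width, height
    machine_printed: bool                    # False means handwritten
    legible: bool                            # False means illegible
    english: bool                            # False means non-English script
    transcription: Optional[str] = None      # only expected for legible text


def legible_english(instances: List[TextInstance]) -> List[TextInstance]:
    """Keep instances that are legible, English and transcribed."""
    return [t for t in instances if t.legible and t.english and t.transcription]


# Made-up sample records, for illustration only.
sample = [
    TextInstance(1, (10, 20, 80, 30), True, True, True, "STOP"),
    TextInstance(1, (5, 5, 40, 12), False, False, True),
    TextInstance(2, (0, 0, 60, 25), True, True, False, "Кафе"),
]
usable = legible_english(sample)

# Average number of text instances per image implied by the quoted totals.
instances_per_image = 145_859 / 63_686
```

With the totals quoted above, the average works out to a little over two text instances per image.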

ICDAR 2017:

ICDAR 2017 was the 14th IAPR International Conference on Document Analysis and Recognition, held in
Kyoto, Japan. A number of competitions were held, including the Robust Reading Challenge on Multi-Lingual
Scene Text Detection, the Competition on Post-OCR Text Correction, the Competition on Layout Analysis for
Challenging Medieval Manuscripts, Page Object Detection and many more.


EMNIST is a dataset of handwritten letters and digits derived from the NIST Special Database 19. The samples
are converted to a 28x28 pixel image format and a dataset structure that directly matches
the MNIST dataset.

The dataset is provided in 2 formats, and both versions contain identical information. The first version is
in Matlab format, while the second is in the same binary format as the original MNIST dataset.
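
The MNIST binary layout (often called IDX) starts with a 4-byte magic number: two zero bytes, a data-type code, and the number of dimensions. It is followed by one big-endian 32-bit size per dimension and then the raw data. The sketch below reads such a blob for unsigned-byte data only. The function name `parse_idx` is an illustration, not part of any EMNIST tooling, and compressed files would need to be decompressed first.

```python
import struct
from typing import Tuple


def parse_idx(blob: bytes) -> Tuple[Tuple[int, ...], bytes]:
    """Parse an unsigned-byte IDX blob into (dimensions, raw data bytes)."""
    zero, dtype, ndim = struct.unpack(">HBB", blob[:4])
    if zero != 0 or dtype != 0x08:
        raise ValueError("not an unsigned-byte IDX blob")
    # One big-endian 32-bit integer per dimension follows the magic number.
    dims = struct.unpack(">" + "I" * ndim, blob[4:4 + 4 * ndim])
    data = blob[4 + 4 * ndim:]
    expected = 1
    for d in dims:
        expected *= d
    if len(data) != expected:
        raise ValueError("payload size does not match header")
    return dims, data


# Tiny in-memory example: 2 blank images of 28x28 pixels.
header = struct.pack(">HBBIII", 0, 0x08, 3, 2, 28, 28)
dims, pixels = parse_idx(header + bytes(2 * 28 * 28))
```

Each image then occupies 784 consecutive bytes of the returned data, one byte per pixel.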

There are six different splits in the EMNIST dataset:

- EMNIST ByClass: 814,255 characters. 62 unbalanced classes.
- EMNIST ByMerge: 814,255 characters. 47 unbalanced classes.
- EMNIST Balanced: 131,600 characters. 47 balanced classes.
- EMNIST Letters: 145,600 characters. 26 balanced classes.
- EMNIST Digits: 280,000 characters. 10 balanced classes.
- EMNIST MNIST: 70,000 characters. 10 balanced classes.
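
For the balanced splits, the number of samples per class follows directly from the figures listed above. The short sketch below does that arithmetic. The `EMNIST_SPLITS` table and the helper are illustrative names, and the calculation assumes each total covers the whole split, with training and test data together.

```python
# (total characters, number of classes, balanced?) as listed above.
EMNIST_SPLITS = {
    "ByClass":  (814_255, 62, False),
    "ByMerge":  (814_255, 47, False),
    "Balanced": (131_600, 47, True),
    "Letters":  (145_600, 26, True),
    "Digits":   (280_000, 10, True),
    "MNIST":    (70_000, 10, True),
}


def samples_per_class(split: str) -> int:
    """Return the per-class sample count for a balanced split."""
    total, classes, balanced = EMNIST_SPLITS[split]
    if not balanced:
        raise ValueError(f"{split} is unbalanced; class sizes vary")
    return total // classes


per_class = {
    name: samples_per_class(name)
    for name, (_, _, balanced) in EMNIST_SPLITS.items()
    if balanced
}
```

This gives 2,800 samples per class for Balanced, 5,600 for Letters, 28,000 for Digits and 7,000 for MNIST. The unbalanced splits have no single per-class figure.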

MSRA-TD500 is a publicly released database for evaluating text detection algorithms on
natural images. It contains 500 natural images captured in indoor and outdoor scenes using a
pocket camera.

The dataset has 2 parts: a training set and a testing set. The training set contains 300 images selected
randomly from the original dataset, and the testing set contains the remaining 200 images. All the images are
fully annotated. The basic unit in this dataset is the text line rather than the word, which is the unit used in the
ICDAR datasets.
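
A random 300/200 partition of this kind can be sketched as follows. This only illustrates the procedure: the file names are placeholders, the seed is arbitrary, and the split published with the dataset should be used when results are meant to be comparable.

```python
import random
from typing import List, Tuple


def random_split(items: List[str], n_train: int, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle a copy of the items and cut it into a training and a testing part."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = items[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder names standing in for the 500 images.
image_names = [f"IMG_{i:04d}.JPG" for i in range(500)]
train, test = random_split(image_names, n_train=300)
```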


CTW is a very large dataset of Chinese text in the wild. While OCR in document images is well
studied and many commercial tools are available, the detection and recognition of text in natural images
remains a challenging problem, especially for more complicated characters such as Chinese text.

CTW is a newly created dataset which contains about one million Chinese characters, covering 3,850
unique ones, annotated by experts in over 30,000 street view images. In exact figures, the dataset contains 32,285
high-resolution images, 1,018,402 character instances, 3,850 character categories and 6 kinds of attributes. The
results reported on this dataset are a top-1 character recognition accuracy of about 80.5%, a character detection
score of 70.9% and a text line detection AED of 22.1. The dataset is publicly available along with the source code
and trained models.


The Synthetic Word dataset is used for natural scene text recognition. Its images are generated synthetically, yet
they are realistic enough to replace real data, which provides an effectively unlimited amount of training data. This
makes it possible to train new models for word recognition. There are 3 models, each reading words in an entirely
different way: 90k-way dictionary encoding, character sequence encoding and bag-of-N-grams encoding. Generating the
data is fast and simple, and it requires zero data acquisition cost. For applications such as scanning
printer-generated documents, the synthetic text dataset may be useful.
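
The three ways of representing a word can be illustrated on a toy scale. The sketch below only shows what each target encoding looks like for one word. The four-word `LEXICON` stands in for the 90k-word dictionary, and the alphabet, the function names and the N-gram length limit are assumptions made for the example, not the models' actual configuration.

```python
from typing import List, Set

LEXICON: List[str] = ["cafe", "hotel", "open", "stop"]  # stand-in for the 90k dictionary
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"       # assumed character set


def dictionary_encoding(word: str) -> int:
    """Whole word as one class: its index in the lexicon."""
    return LEXICON.index(word)


def char_sequence_encoding(word: str) -> List[int]:
    """Word as a sequence of per-position character classes."""
    return [ALPHABET.index(c) for c in word]


def ngram_bag_encoding(word: str, max_n: int = 2) -> Set[str]:
    """Word as an unordered set of its character N-grams up to length max_n."""
    return {
        word[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(word) - n + 1)
    }


word_class = dictionary_encoding("stop")
char_classes = char_sequence_encoding("stop")
ngrams = ngram_bag_encoding("stop")
```

The dictionary encoding can only output words in the lexicon, while the character sequence and N-gram encodings can describe words outside it.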


Uber-Text is also a large-scale dataset for OCR. It contains street-level images collected from
car-mounted sensors. The characteristics of this dataset are:

1) Street-side images with their text region polygons and the corresponding transcriptions.
2) There are 9 categories that indicate the type of text, such as business name, street number and street name.
3) It contains up to 110k images.
4) There are about 4.84 text instances per image on average.

The dataset is split into training, validation and testing subsets and is open source. Each subset is divided in 1k and 4k


SVT is the abbreviation of Street View Text. The dataset is taken from Google Street View, and its
images are often low resolution. Two characteristics stand out when dealing with outdoor
street-level imagery: image text comes mostly from business signage, and business names
are easily available through geographic business searches. These characteristics make this dataset distinct
from the other datasets. The goal of SVT is to identify the words from nearby businesses. The SVT dataset has only
word-level annotations. It should be used for cropped lexicon-driven word recognition and full-image lexicon-driven
word detection and recognition.
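
Lexicon-driven recognition means the recognizer's raw output is constrained to a short list of candidate words, such as the names of nearby businesses. One simple way to do that, shown here as an assumption rather than as the method used with SVT, is to pick the lexicon word with the smallest edit distance to the raw output. The business names and the raw string are made up for the example.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def snap_to_lexicon(raw: str, lexicon: List[str]) -> str:
    """Return the lexicon word closest to the raw output, ignoring case."""
    return min(lexicon, key=lambda w: edit_distance(raw.lower(), w.lower()))


lexicon = ["Starbucks", "Subway", "Walgreens"]  # hypothetical nearby businesses
best = snap_to_lexicon("STARBUCK5", lexicon)
```

Here a raw reading with one wrong character is still mapped to the intended business name, which is why a small per-image lexicon makes the task much easier than open-vocabulary recognition.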

The Chars74K dataset is used for character recognition in natural images. It covers both English and
Kannada symbols. The English symbols use the Latin script (excluding accents) and Hindu-Arabic
numerals. The English set consists of 62 classes (0-9, A-Z and a-z), with 7,705 characters obtained
from natural images, 3,410 hand-drawn characters and 62,992 characters synthesized from
computer fonts. In total there are over 74k images.
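
As a quick check of these figures, counting the listed symbols (0-9, A-Z and a-z) gives 62 classes, and the three character counts sum to just over 74k. The label numbering in the sketch below is an illustrative choice and is not claimed to match the dataset's own class indices.

```python
import string

# Digits first, then uppercase, then lowercase (ordering is an assumption).
CLASSES = string.digits + string.ascii_uppercase + string.ascii_lowercase
LABEL_OF = {ch: i for i, ch in enumerate(CLASSES)}

# Character counts for the English set as quoted above.
english_counts = {"natural": 7_705, "hand_drawn": 3_410, "synthetic": 62_992}
total = sum(english_counts.values())
```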