Classifying and grouping document images into known categories is often a prerequisite step towards document understanding tasks, such as text recognition, document retrieval and information extraction. These tasks can be greatly simplified if we know a priori the genre or the layout type of the documents.

Fig. 1 VGG16 model (4-class output)

... files and then look up the values in the HOCR file. The results are stored in JSON format.

For email data extraction, we parse the body of the email, look for the keywords present in the body, and then take the values based on the context files.
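The email extraction step described above (parse the body, look for keywords, take the values) could be sketched roughly as follows. This is only an illustration using the Python standard library. The keyword map `CONTEXT_KEYWORDS`, the helper `extract_email_fields`, the sample message and the "keyword: value on one line" convention are all assumptions made for the example, since the format of the project's context files is not described here.

```python
import json
import re
from email import message_from_string

# Hypothetical stand-in for a "context file": output field -> keyword to look for.
CONTEXT_KEYWORDS = {
    "invoice_number": "Invoice No",
    "due_date": "Due Date",
}


def extract_email_fields(raw_email, keywords=CONTEXT_KEYWORDS):
    """Parse a plain-text email and pull 'keyword: value' pairs out of its body."""
    message = message_from_string(raw_email)
    body = message.get_payload()  # a str for simple, non-multipart messages
    fields = {}
    for field, keyword in keywords.items():
        # Assumed convention: the value follows the keyword and a colon on the same line.
        pattern = re.escape(keyword) + r"\s*:\s*(.+)"
        match = re.search(pattern, body, flags=re.IGNORECASE)
        if match:
            fields[field] = match.group(1).strip()
    return fields


# Made-up sample message, used only to exercise the sketch.
raw = (
    "From: billing@example.com\n"
    "Subject: Your invoice\n"
    "\n"
    "Hello,\n"
    "Invoice No: 4711\n"
    "Due Date: 2020-01-31\n"
)
result = extract_email_fields(raw)
print(json.dumps(result, indent=2))  # results serialized as JSON
```

Keeping the keywords in a separate mapping mirrors the idea of context files: new fields can be added without touching the parsing code. Multipart or HTML emails would need extra handling that this sketch leaves out.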
Two major paradigms have been extensively studied: document structure based approaches and text-content based approaches. This project follows both paradigms and studies document structure based classification as well as text-content based classification.

Document structure based classification:
We classified documents into 4 classes: Email, Form, Letter and Scientific Research Paper. The proposed model was trained on a dataset of 200 images for each class, with 60 images per class for validation and testing.

Conclusion
Document classification using VGG16 achieved an accuracy of about 85-90%.

Text-based classification gave an accuracy of about 87%. This approach is deployed and is currently being used by a client in the US.

This project can be very helpful for organizations that deal with very large sets of digital documents. Document classification and data extraction will help ease the process of fast data retrieval and also help in data analysis.

Results
• Text-content based classification
• Document structure based classification

Training flow for the text-based approach: training PDF documents -> PDF-to-image converter -> image documents -> text extraction using Tesseract -> extracted texts -> lists of unigrams, bigrams and trigrams of all classes -> trained model
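The per-class lists of unigrams, bigrams and trigrams in the training flow above can be built and used for classification roughly as sketched below. This is a minimal illustration, not the project's implementation: the helper names (`ngrams`, `build_profiles`, `classify`), the toy template texts and the overlap-based scoring are assumptions, and the values in `WEIGHTS` are placeholders rather than the weights used in the project.

```python
def ngrams(text, n):
    """Return the word n-grams of `text` as a list of tuples."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_profiles(templates):
    """Map class -> {n: set of n-grams} from {class: [template texts]}."""
    profiles = {}
    for label, texts in templates.items():
        profiles[label] = {
            n: {gram for text in texts for gram in ngrams(text, n)}
            for n in (1, 2, 3)
        }
    return profiles


# Placeholder weights for illustration only; not the weights used in the project.
WEIGHTS = {1: 0.2, 2: 0.3, 3: 0.5}


def classify(text, profiles, weights=WEIGHTS):
    """Pick the class whose n-gram lists overlap most with `text`."""
    scores = {}
    for label, profile in profiles.items():
        score = 0.0
        for n, weight in weights.items():
            grams = ngrams(text, n)
            if grams:
                # Fraction of the document's n-grams found in the class list.
                hits = sum(1 for gram in grams if gram in profile[n])
                score += weight * hits / len(grams)
        scores[label] = score
    return max(scores, key=scores.get), scores


# Toy "template documents" standing in for OCR output of real templates.
templates = {
    "email": ["dear team please find attached the report",
              "kind regards from the sales team"],
    "form": ["please fill in your name and date of birth",
             "signature of the applicant and date"],
}
profiles = build_profiles(templates)
label, scores = classify("please find attached the signed report", profiles)
print(label, scores)
```

Scoring each n-gram order separately and combining the results with weights lets longer, more specific phrases count differently from single words, which is the general idea behind weighting unigrams, bigrams and trigrams.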
Fig. 2 Training in text-based approach

• Text-content based classification and Data Extraction
Test flow: test PDF document -> PDF-to-image converter -> image document -> text extraction using Tesseract

There is still some chance of error and room for improvement in this project. Changes are still being made frequently to improve the outcome further.

Text-content based classification:
We converted PDF documents to images and used an OCR approach to get the text content of the documents. Using the n-gram approach, we created unigrams, bigrams and trigrams from 175 template documents [1] and used these to classify the documents by assigning 5% weight to