
Document Classification and Data Extraction

Project Guide : Prof. P Radha Krishna


Rishu Kumar (157251) Pranav Pawar (157243) Rahul Ramteke (157250)
Department of Computer Science and Engineering

Problem Statement

The problem consists of two parts:
• Classify a set of documents into predefined, trained classes.
• Extract relevant data from the documents (PDFs and e-mails) with the help of context files.

Abstract

Documents can be classified into various classes based on their text contents and/or their structural properties.

Classifying and grouping document images into known categories is often a prerequisite step towards document understanding tasks, such as text recognition, document retrieval and information extraction. These tasks can be greatly simplified if we know a priori the genre or the layout type of the documents.

Proposed Methodology

• Document structure based classification
Here a primary level of 'inter-domain' transfer learning is used: weights from a VGG16 architecture pre-trained on the ImageNet dataset are exported to train a document classifier on whole document images. [2]

Fig. 1 VGG16 model (4 class)

• Data Extraction
We convert PDF documents into hOCR files and then design an hOCR parser to look up keys and values according to the context file.

Fig. 4 Context File

The hOCR parser takes its parameters from the context files and then looks up the values in the hOCR file. The results are stored in JSON format.

• Data Extraction in E-mail
For e-mail data extraction we parse the body of the e-mail, look for the keywords present in the body, and then take the values based on the context files.
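The hOCR lookup step for PDF documents can be illustrated with a minimal sketch. It is not the project's parser: the context-file layout (`CONTEXT`, with hypothetical `key` and `max_words` entries), the sample hOCR fragment and the rule "take the words to the right of the key on the same line" are all assumptions made for illustration. Only the `ocrx_word` class and the `bbox` values in the `title` attribute follow the hOCR format as Tesseract emits it.

```python
import json
import re
from html.parser import HTMLParser

# Hypothetical context file content: field name -> key text and value length.
CONTEXT = {
    "policy_number": {"key": "Policy Number", "max_words": 1},
    "insured": {"key": "Insured", "max_words": 2},
}

# Small hand-written hOCR fragment used only for this example.
SAMPLE_HOCR = """
<div class='ocr_page'>
 <span class='ocr_line' title='bbox 10 10 400 30'>
  <span class='ocrx_word' title='bbox 10 10 80 30; x_wconf 96'>Policy</span>
  <span class='ocrx_word' title='bbox 90 10 170 30; x_wconf 95'>Number:</span>
  <span class='ocrx_word' title='bbox 180 10 260 30; x_wconf 93'>AB12345</span>
 </span>
 <span class='ocr_line' title='bbox 10 40 400 60'>
  <span class='ocrx_word' title='bbox 10 40 80 60; x_wconf 96'>Insured:</span>
  <span class='ocrx_word' title='bbox 90 40 150 60; x_wconf 94'>John</span>
  <span class='ocrx_word' title='bbox 160 40 220 60; x_wconf 94'>Smith</span>
 </span>
</div>
"""


class HocrWordParser(HTMLParser):
    """Collect the text and bounding box of every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []    # [{"text": str, "bbox": (x0, y0, x1, y1)}, ...]
        self._bbox = None  # bbox of the word span whose text is still pending

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or ""):
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if m:
                self._bbox = tuple(int(v) for v in m.groups())

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append({"text": data.strip(), "bbox": self._bbox})
            self._bbox = None


def extract_fields(hocr_text, context):
    """Return {field: value} using the words to the right of each key."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    words = parser.words
    tokens = [w["text"].rstrip(":").lower() for w in words]
    result = {}
    for field, rule in context.items():
        key_tokens = rule["key"].lower().split()
        n = len(key_tokens)
        result[field] = None
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] != key_tokens:
                continue
            _, key_y0, key_x1, key_y1 = words[i + n - 1]["bbox"]
            value = []
            for w in words[i + n:]:
                x0, y0, x1, y1 = w["bbox"]
                centre_y = (y0 + y1) / 2
                # Same line as the key and positioned to its right.
                if key_y0 <= centre_y <= key_y1 and x0 >= key_x1:
                    value.append(w["text"])
                if len(value) == rule.get("max_words", 1):
                    break
            result[field] = " ".join(value) or None
            break
    return result


# Results are serialised as JSON, as on the poster.
print(json.dumps(extract_fields(SAMPLE_HOCR, CONTEXT), indent=2))
```

With the sample fragment, the lookup yields `AB12345` for `policy_number` and `John Smith` for `insured`. A real context file would likely also need rules for values placed below the key or in a fixed region of the page.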
Two major paradigms have been extensively studied: document structure based approaches and text-content based approaches. This project follows both paradigms and studies document structure based classification as well as text-content based classification.

Document structure based classification:
We classified 4 classes of documents: Email, Form, Letter and Scientific Research Paper. The proposed model was trained on a dataset of 200 images for each class, with 60 images per class for validation and testing.

Text-content based classification:
We converted PDF documents to images and used an OCR approach to get the text content of the documents. Using the n-gram approach we created unigrams, bigrams and trigrams of 175 template documents [1] and used these to classify the documents by assigning 5% weight to unigrams, 30% to bigrams and 60% to trigrams.

After classification we move forward to data extraction from PDF documents and e-mails. We used context files to define the keys to search for and the relevant position of the values with respect to the keys in the corresponding document. hOCR files were generated for the PDF documents and then parsed to extract the data. For e-mail documents we used similar context files along with an e-mail parser and the nltk library to extract relevant data from e-mails.

• Text-content based classification and Data Extraction

Training PDF documents → PDF to image converter → Image documents → Text extraction using Tesseract → Extracted texts → List of unigrams, list of bigrams and list of trigrams of all classes → Trained model

Fig. 2 Training in Text based approach

Test PDF document → PDF to image converter → Image document → Text extraction using Tesseract → Extracted texts → Unigrams, bigrams, trigrams → Count number of matched unigrams, matched bigrams (30% weight) and matched trigrams for all classes → Compute score for all classes in the training dataset → Predicted class: class with max computed score

Fig. 3 Classification in Text based approach

Results

• Document structure based classification
Document classification using VGG16 achieved an accuracy of about 85-90%.

• Text-content based classification
Text based classification gave an accuracy of about 87%. This approach is deployed and currently being used by a client in the US.

Conclusion

This project can be very helpful for organizations that deal with very large sets of digital documents. Document classification and data extraction will help ease the process of fast data retrieval and also help in data analysis.

There are still some chances of error and room for improvement in this project. Frequent changes are still being made to improve the outcome further.

References

[1] http://appliedsystems.com/support/ACORD_Forms
[2] Document Image Classification with Intra-Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks
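The weighted n-gram scoring used for text-content based classification can be sketched as below. This is a minimal illustration under stated assumptions, not the deployed classifier: the tokenizer (a lowercase alphanumeric regex), the choice to count distinct matched n-grams, and the names `ngrams`, `train` and `classify` are hypothetical. The weights are the ones stated on the poster. Scores are compared only between classes, so the weights are used as given without normalisation.

```python
import re

# Weights for unigrams, bigrams and trigrams as stated on the poster.
WEIGHTS = {1: 0.05, 2: 0.30, 3: 0.60}


def ngrams(text, n):
    """Return the set of distinct word n-grams in the text (assumed tokenizer)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def train(templates):
    """Build {class: {n: set of n-grams}} from {class: [template text, ...]}."""
    model = {}
    for label, texts in templates.items():
        model[label] = {
            n: set().union(*(ngrams(t, n) for t in texts)) for n in WEIGHTS
        }
    return model


def classify(model, text):
    """Score every class by weighted matched n-gram counts; return the best."""
    scores = {}
    for label, grams in model.items():
        scores[label] = sum(
            weight * len(ngrams(text, n) & grams[n])
            for n, weight in WEIGHTS.items()
        )
    return max(scores, key=scores.get), scores


# Toy templates standing in for the OCR text of the template documents.
TEMPLATES = {
    "Form": ["Policy number and insured name"],
    "Letter": ["Dear sir thank you for your letter"],
}
MODEL = train(TEMPLATES)
LABEL, SCORES = classify(MODEL, "Please enter the policy number and insured address")
print(LABEL, SCORES)
```

In this toy run the test text shares four unigrams, three bigrams and two trigrams with the Form template and nothing with the Letter template, so Form receives the highest score.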

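The e-mail extraction step can be sketched in a similar way. The sketch assumes a hypothetical context layout (`EMAIL_CONTEXT`, mapping a field name to a `keyword`) and assumes the value is the rest of the line after the keyword and an optional `:` or `-` separator. It uses only Python's standard `email` and `re` modules; the poster mentions the nltk library, which is not used here.

```python
import email
import json
import re

# Hypothetical context file content: field name -> keyword to search for.
EMAIL_CONTEXT = {
    "claim_number": {"keyword": "Claim number"},
    "loss_date": {"keyword": "Date of loss"},
}

# Hand-written message used only for this example.
RAW_EMAIL = (
    "From: sender@example.com\n"
    "To: claims@example.com\n"
    "Subject: New claim\n"
    "\n"
    "Hello,\n"
    "Claim number: CLM-001\n"
    "Date of loss - 2019-03-14\n"
    "Regards\n"
)


def plain_text_body(msg):
    """Join all text/plain parts of the message into one string."""
    chunks = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            chunks.append(payload.decode(charset, errors="replace"))
    return "\n".join(chunks)


def extract_from_email(raw_message, context):
    """Return {field: value} for each keyword found in the e-mail body."""
    body = plain_text_body(email.message_from_string(raw_message))
    result = {}
    for field, rule in context.items():
        # Keyword, optional ':' or '-' separator, then the rest of the line.
        pattern = re.escape(rule["keyword"]) + r"\s*[:\-]?\s*(.+)"
        match = re.search(pattern, body, flags=re.IGNORECASE)
        result[field] = match.group(1).strip() if match else None
    return result


print(json.dumps(extract_from_email(RAW_EMAIL, EMAIL_CONTEXT), indent=2))
```

Only the body is searched, so keywords that appear in headers such as the subject line are ignored in this sketch.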