

Big Documents
Leveraging Big Data Technologies to
Extract Meaning from Document Content

Big Data has reached Big Adoption. According to a recent study by NewVantage Partners LLC, in 2016, 62.5% of Fortune 1000 senior executives surveyed reported that they have at least one instance of Big Data in production within their organizations. This is nearly double the number reported just three years earlier. In addition, more than a quarter of the firms surveyed reported that they will invest greater than $50 million in Big Data initiatives by 2017, up from just 5% who invested that much in 2014.

[ Ephesoft Insight | Big Documents ]

Popular Big Data use cases include optimizing sales efforts, studying customer behavior, optimizing
pricing, detecting security threats, fraud detection, and predictive analysis. Big Data applications are
designed to enable users to capture, store, analyze, query, visualize, chart, graph, and slice and dice
high volumes of information being pulled from various sources. Sources typically utilized in Big Data
applications can include:

• Internal business applications like ERP, CRM, HR, and marketing management systems
• Social media sites
• Information available over the public Web like competitor pricing, weather forecasts, and publicly
available financial information
• Log data from servers, applications, audits, call records, and mobile apps
• Sensor data from medical devices, GPS systems, satellites and devices connected to the emerging
Internet of Things

All these Big Data sources listed have one thing in common: they are likely to be delivered in some sort of
structured/labeled format. For example, information coming from a CRM system, like the customer
name, address, and buying history, will be clearly labeled. Likewise, information coming from a GPS
system will be formatted so that a Big Data application understands that it contains the coordinates
of a specific person or object being tracked. Historically, these types of data sets are what is known
as “structured data.”

It’s important to note that “structured data” historically accounts for a small minority of the data
an organization has access to. Back when the $25 billion enterprise content management (ECM)
market was launched in the early 2000s, it was estimated that only 20% of an organization’s data
could be classified as “structured.” The other 80%, which ECM targets, comes primarily in the form of
documents, both electronic and paper (through document imaging), as well as some other sources
like video, images, software code, etc.

In today’s world, IDC estimates that even with an increasing amount of digitally created content, which lends itself to being structured out of the gate (unlike paper content, which is inherently unstructured, at least when it comes to digital understanding), in 2013 only 22% of the information in the digital universe was sufficiently labeled and tagged so that it could be leveraged for useful analysis. IDC does expect the percentage of tagged data to rise to 37% by 2020, but with the overall amount of digital data being generated expected to grow by a factor of 10 during this same time period (from a total of 4.4 trillion gigabytes to 44 trillion GB), that still leaves quite a bit of unstructured data out there, which cannot be easily analyzed by Big Data applications.

So, what makes up this unstructured data, and how can it be made useful? A lot of it currently sits in ECM systems, as well as other document repositories. These include Windows file systems and file, sync, and share systems like Box and Dropbox, which continue to grow in popularity. Many organizations utilize multiple ECM systems and repositories to store their unstructured content. A 2015 AIIM survey of ECM users indicated that 52 percent have three or more ECM systems, while 22 percent have five or more. It’s not unusual for a Global 1000 organization to have a double-digit number of document repositories storing billions of documents. In the healthcare industry, the U.S. Veterans Administration reports that it adds more than a million new text-based notes, 1.2 million electronic orders, 2.8 million images, and one million vital signs each day to a database that already contains more than 16 billion clinical entries.

While ECM systems are designed to help users manage their unstructured files by applying metadata to help with processes like search and retrieval and records management, this metadata can be woefully insufficient for users seeking to do true Big Data-style analytics on their unstructured content. Metadata associated with a loan document, for example, might include the customer name, their address, and the date the loan was approved. And while this might be useful for locating a record for customer service or an audit, it doesn’t create much opportunity for true analytics. Buried within the content of that loan document might be valuable information such as whether the customer defaulted, what their income was, what the loan was used for, what their income-to-debt ratio was, and so on.
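To make the gap concrete, here is a minimal sketch contrasting the index metadata an ECM system typically holds for a loan document with the fields an analytics-oriented extraction could surface from the document body. All field names and values here are invented for illustration; they are not any vendor’s actual schema.

```python
# Hypothetical ECM index metadata for a loan document (invented values).
ecm_metadata = {
    "customer_name": "Jane Doe",
    "address": "123 Main St",
    "approval_date": "2016-03-14",
}

# Fields that could additionally be extracted from the document content.
extracted_fields = {
    "defaulted": False,
    "annual_income": 85000,
    "loan_purpose": "home improvement",
    "income_to_debt_ratio": 0.31,
}

# Only the extracted fields can answer analytics questions such as:
high_risk = extracted_fields["income_to_debt_ratio"] > 0.4
```

The metadata alone supports lookup; the extracted content supports questions like the risk check above.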

Healthcare is another area where metadata is often lacking. Patient charts, for example, are often stored as multi-page PDF files, with a bare minimum of indexing information like the patient name and a date. This locks in valuable data related to the patient’s diagnosis and treatment that could potentially be married with structured clinical data to improve treatment.

Why Big Documents Matter

Big Data applications are being deployed in multiple industries. According to IDC, manufacturing and banking are the largest markets, followed by federal government, professional services, telecommunications, and retail. Forecasts also call for Big Data spending in healthcare to grow more than 25% annually over the next five years.

In the financial services market, Big Data is predominantly being used in areas like improving customer intelligence, reducing risk, and meeting regulatory objectives. In healthcare, we are seeing Big Data used for applications like reducing fraud, waste, and abuse; predictive analysis to improve outcomes; and real-time monitoring of patients. In government, the U.S. federal government recently announced a Big Data Research and Development Strategic Plan, “which highlights emerging Big Data capabilities and provides guidance for developing or expanding federal Big Data research and development (R&D) plans.” Fifteen federal agencies participated in the development of this plan.

Clearly, adoption of Big Data initiatives is well underway in multiple mission-critical areas across multiple markets. But what about Big Documents? Are these Big Data initiatives complete without the inclusion of the high volume of unstructured content that could be leveraged? For example, is a bank’s anti-money laundering analytics application going to be truly effective without insights from millions of transaction records that may have been recorded on paper? Or how about a security check for government employment? Wouldn’t it be better executed with insights from unstructured content like reviews from previous employers and credit reports? Or, couldn’t an application for real-time monitoring of patients being fed by information input from medical devices be complemented through the addition of data from reports made by nurses and doctors?

These are the types of applications that Big Documents technology can address. So, what exactly is Big Documents, and how does it work?

Ephesoft Insight: A Big Documents Application

Big Documents represents the intersection of document imaging and Big Data. Document imaging
is the conversion of paper documents to electronic data. As the name indicates, imaging involves
creating document images from scanned paper. However, it’s the application of technology
like automated document recognition, character recognition (OCR/ICR), intelligent document
recognition (IDR), and machine learning that makes imaging relevant to Big Data. These types of
technology are literally able to transform the content of an imaged page into structured data that
can be digested and analyzed just like data coming from an ERP system or a machine log.

Take a contract for example. Contracts can run many pages in length and be filled with legalese.
But, the only pieces of data that might matter to a law firm putting together a suit are items such as
who signed the contract, how long it was for, how much it was for, and what would be considered
a breach. A scanner could be used to convert the paper contract to an electronic image and then
OCR/ICR applied to convert the text to an electronic format. IDR and machine learning can then be
applied in tandem to extract the specific information being sought. Once this information is extracted,
it can be analyzed and studied just like any other structured data.

The power of document imaging and automated data capture technologies like the ones described above is that they greatly reduce the time and manpower needed to extract desired data from
high volumes of paper documents. For example, Interfirst Mortgage Company, a Chicago-area
mortgage services specialist, utilized document imaging to help it improve employee productivity by
700% and reduce mortgage closing and HUD-1 (a U.S. government form) processing times by 67%, as
well as cut down on errors.

The data capture technology deployed by Interfirst was developed by Laguna Hills, CA-based
ISV Ephesoft. Ephesoft is a leader in the traditional document capture market with technology for automatically recognizing and extracting data from images. It is also a pioneer in the field of Big Documents.

In 2015, Ephesoft announced that it had leveraged open source Big Data technology like Hadoop,
Apache Spark, MLlib, MongoDB, and Apache Kafka to create a platform for automating classification
of, and data extraction from, repositories containing millions and even billions of documents. In 2016,
Ephesoft packaged its Big Data technology in a product called Ephesoft Insight.

Insight is designed to be leveraged by business analysts and includes tools for creating charts and
graphs for examining historical data sets as well as making predictions. Its ease of use is in line with
the design of most successful Big Data applications. According to research from Wikibon on the
ROI associated with Big Data applications, “Instead of focusing on enlightening the few…successful
players focused on changing operational systems for everybody.”

So, instead of confining Big Documents to a backroom analytics department, Insight is designed
to move it into the realm of the loan underwriters, the HR department, records managers, and
healthcare providers. The easy-to-use interface is the visible element of this democratization of Big Documents. But it’s what goes on behind the scenes, the combination of cutting-edge Big Data technology and market-leading document imaging software, that really powers Insight.

Built on a Big Data Platform

Let’s take a look at the elements of Big Data technology behind Insight’s ability to rapidly process
millions to billions of stored documents:

Hadoop: Hadoop is a framework for leveraging distributed clusters of off-the-shelf computer hardware to make them run like one large supercomputer. The result is powerful CPU functionality and a vast amount of storage at a fraction of the cost of a single supercomputer.

HDFS (Hadoop Distributed File System): HDFS is the storage element of the Hadoop framework. It’s Java-based, distributed, portable, redundant, and scalable. HDFS stores blocks of data distributed across the multiple computers in a Hadoop framework. To account for the failure rates associated with using off-the-shelf hardware, HDFS stores each data block on multiple hardware devices. To facilitate the processing of this distributed data, the Hadoop processing program assigns workloads to hardware based on the data being stored there, which results in localized (and faster) processing of data.
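The replication idea can be sketched in a few lines of plain Python. This is a toy model, not HDFS’s actual block-placement policy (which is rack-aware and far more sophisticated): split data into fixed-size blocks and store each block on several distinct nodes so that any single machine can fail without data loss.

```python
from itertools import cycle

def place_blocks(data: bytes, nodes: list, block_size: int = 4, replicas: int = 3):
    """Toy block placement: split data into blocks, replicate each on several nodes."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    ring = cycle(nodes)  # round-robin stand-in for a real placement policy
    placement = {}
    for idx in range(len(blocks)):
        # each block lands on `replicas` consecutive (distinct) nodes
        placement[idx] = [next(ring) for _ in range(replicas)]
    return blocks, placement

blocks, placement = place_blocks(b"0123456789", ["node-a", "node-b", "node-c", "node-d"])
# every block now lives on three different nodes, so losing one node
# still leaves two copies of each block available
```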

Apache Spark: Apache Spark is a faster way to process data within a Hadoop framework than
MapReduce, the original processing technology that Hadoop was based on. While MapReduce
operates in linear fashion and writes back to a disk after each function it performs, Apache
Spark enables multiple processing steps to be executed simultaneously, and it can also be run in memory. The bottom line is that Spark can run up to 100 times faster in memory and 10 times faster
running on a disk than MapReduce. This can be valuable when dealing with the large document
sets Insight is being targeted at, as well as when performing the complex document classification
and extraction algorithms Insight requires. Spark is growing in popularity and in 2015 was the most
actively developed open source project among data tools.
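The execution-model difference can be illustrated with a rough, pure-Python analogy (this is not Spark’s API): a MapReduce-style job materializes each intermediate result before the next stage reads it, while a Spark-style job chains transformations lazily and only computes when a result is actually requested.

```python
records = ["loan approved", "loan denied", "loan approved"]

# MapReduce-like: each stage fully materializes its output
# (in Hadoop, this would be a write to disk between stages).
upper = [r.upper() for r in records]                 # stage 1, materialized
approved = [r for r in upper if "APPROVED" in r]     # stage 2 reads stage 1

# Spark-like: transformations are pipelined lazily via generators;
# nothing executes until the final "action" (list) pulls results through.
pipeline = (r for r in (r.upper() for r in records) if "APPROVED" in r)
result = list(pipeline)  # the action that triggers execution
```

Both produce the same answer; the lazily pipelined version never builds the intermediate uppercased list, which is the spirit of why in-memory pipelining outruns stage-by-stage disk writes.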

MLlib (Machine Learning Library): Designed specifically to run on Spark, MLlib consists of a series of common machine learning algorithms and utilities. These include classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives. Spark features APIs that enable developers to connect to MLlib and utilize its functionality in their applications. This enables the developers to focus more on their projects and less on lower-level machine learning algorithms. Because of its tight relationship with Spark, MLlib is both scalable and fast. Like Spark, MLlib is an open source project that is widely contributed to.
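To ground what “classification” means in this context, here is a deliberately tiny, stdlib-only sketch of classifying documents by word-frequency profiles. This is not MLlib’s API (in a real deployment you would run MLlib’s classifiers over Spark data structures); it only illustrates the kind of task those algorithms perform at scale.

```python
from collections import Counter

def train(examples: dict) -> dict:
    """Build one word-frequency profile per class from labeled example texts."""
    return {label: Counter(w for text in texts for w in text.lower().split())
            for label, texts in examples.items()}

def classify(profiles: dict, text: str) -> str:
    """Pick the class whose profile has seen the document's words most often."""
    words = text.lower().split()
    return max(profiles, key=lambda label: sum(profiles[label][w] for w in words))

# Invented training examples for two document classes.
profiles = train({
    "invoice": ["invoice total amount due", "amount due on invoice"],
    "contract": ["parties agree to the terms", "breach of contract terms"],
})
label = classify(profiles, "total amount due by friday")
```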

MongoDB: MongoDB is a NoSQL database program designed specifically for managing documents. It can be highly distributed and in some instances is being run across more than 100 nodes in multiple data centers. It can sustain more than 100,000 database reads/writes per second and is highly scalable. It is currently in use by over a third of Fortune 100 companies, including several storing more than a billion documents in it. MongoDB supports field queries, range queries, and regular-expression searches.
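The three query styles just mentioned look like this in MongoDB’s query-document syntax. To keep the sketch self-contained (no pymongo or live server), the queries are written as plain dicts and evaluated by a minimal matcher; in a real deployment they would be passed to a collection’s `find()` method. The sample documents are invented.

```python
import re

docs = [
    {"type": "loan", "amount": 250000, "customer": "Jane Doe"},
    {"type": "loan", "amount": 90000, "customer": "John Roe"},
    {"type": "chart", "amount": 0, "customer": "Jane Doe"},
]

def matches(doc, query):
    """Tiny evaluator for a subset of MongoDB-style query documents."""
    for field, cond in query.items():
        if isinstance(cond, dict):  # operator form, e.g. {"$gte": ...}
            if "$gte" in cond and not doc[field] >= cond["$gte"]:
                return False
            if "$regex" in cond and not re.search(cond["$regex"], doc[field]):
                return False
        elif doc[field] != cond:    # exact field match
            return False
    return True

field_q = [d for d in docs if matches(d, {"type": "loan"})]                 # field query
range_q = [d for d in docs if matches(d, {"amount": {"$gte": 100000}})]     # range query
regex_q = [d for d in docs if matches(d, {"customer": {"$regex": r"^Jane"}})]  # regex search
```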

Apache Kafka: Kafka is an open source platform for managing streaming of real-time data feeds.
It fits well into a Hadoop framework because it can be run across multiple computers. As of 2016,
Kafka was in use by 35% of the Fortune 100. In Insight, it enables the data capture technology to
feed the data analytics tools in real time.

Next-Generation Document Capture

So, why does Insight require all this Big Data power? Because it’s doing a lot more processing than
traditional document imaging applications. Leveraging the above-mentioned stack of technology enables Insight to utilize more recognition elements, process higher volumes of documents, perform more advanced analytics, and integrate with more external data sources than typical document capture applications.

Let’s take a look at Insight’s capabilities in each of these areas:

Document Classification and Extraction: This is the heart of what Insight does; it takes Ephesoft’s heritage in document capture and expands upon it exponentially. While a typical capture application
might utilize anywhere from a couple to a handful of document elements to perform classification
and extraction, Insight can draw information from up to 21 elements. These include textual elements,
positioning on a page, the fonts and text sizes used, the way data is formatted, elements positioning
related to other elements, and the order in which elements are listed. Insight can also analyze blocks
of text, compare multiple elements in a document to a ‘truth set’ of data and determine which is the
best match, compare extracted text against elements in a database, and work with “fuzzy” (inexact or
close) matches of data.

Insight incorporates all these capabilities in its machine learning algorithms. This enables the system to
learn by example. Basically, a user identifies a few documents to be of a certain classification, and then
identifies elements on those documents they would like extracted. From that point on, each time Insight
encounters a document that it identifies as being in a defined class, it will automatically extract the
fields identified as included on that document. In a loan document, for example, these might include the borrower’s name, the date of the loan, the amount of the loan, what the loan is being used for, and so on.

If a document cannot be identified as belonging to a pre-defined class, or one or more of the identified
fields can’t be found, the document can be passed into an exception queue. As an exception, it could
be manually processed and Insight would have the ability to add to its knowledge set based on that
manual processing, so a similar document could be automatically processed the next time through.

Volume: Insight is designed to process millions, and even billions, of documents and provide users
with results that they can analyze and act upon in near real time. To accomplish this, it has the ability to
connect to multiple repositories through a RESTful API and simultaneously crawl them.

Integration with multiple data sources: Insight can also be integrated with multiple data sources
through its API. These data sources can be used to validate the values in fields being extracted.
For example, a list of addresses from the postal service could be used to validate addresses
being extracted from mortgage documents. Third-party data sources like online credit reports
or a list of individuals on a government watch list could also be incorporated in finance and/or
security applications.
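A minimal sketch of that validation step, assuming a reference list is available as a set of normalized strings (the addresses below are invented sample data, and real validation services expose richer APIs than a set lookup):

```python
# Invented reference data standing in for an external postal address list.
postal_addresses = {"123 MAIN ST", "45 ELM AVE", "9 RIVER RD"}

def validate_address(extracted: str) -> bool:
    """Normalize case and whitespace, then check against the reference list."""
    normalized = " ".join(extracted.upper().split())
    return normalized in postal_addresses

ok = validate_address("123  Main St")      # survives case/spacing variation
flagged = validate_address("123 Mian St")  # OCR transposition → fails validation
```

A failed lookup is exactly the kind of field that would be routed to an exception queue rather than passed downstream.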

Analytics: These third-party data sources could be incorporated in the analytics operations that can be built with Insight, as well. Unlike most capture applications, from which results are typically fed into a third-party application like an ECM or ERP system, Insight has its own analytics, which enable users to chart, graph, visualize, and make predictions with their results. For example, Insight’s analytics could be used to create a graph showing the effect that a prolonged economic depression in a certain area has on mortgages. They could also be used to check for anomalies and inconsistencies in bank deposits to detect money laundering.

Insight’s analytics can also be used to create “mind maps” that center on a key word or idea. Utilizing this mind-map structure, a user could choose to examine all documents related to a particular Social Security number, for example, with the SSN at the center of the analytical structure and each document including that SSN, as well as its metadata, shown as related to that SSN. If the user noticed two different names on documents related to that Social Security number, this might identify a discrepancy. The user could then click on one of the names, and it would become the center of the mind map, and all documents connected to that name would be displayed.
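The re-centering interaction can be modeled as a simple graph query over extracted fields. The documents and field names below are invented; this only illustrates the data-structure idea behind a mind map, not Insight’s implementation.

```python
from collections import defaultdict

# Invented extracted fields; note two different names share one SSN.
documents = [
    {"id": "doc1", "ssn": "000-00-0000", "name": "Jane Doe"},
    {"id": "doc2", "ssn": "000-00-0000", "name": "J. Doe"},
    {"id": "doc3", "ssn": "111-11-1111", "name": "J. Doe"},
]

def mind_map(center_field: str, center_value: str):
    """Documents tied to the center value, plus the other values they connect to it."""
    related = [d for d in documents if d[center_field] == center_value]
    neighbors = defaultdict(set)
    for d in related:
        for field, value in d.items():
            if field not in ("id", center_field):
                neighbors[field].add(value)
    return related, dict(neighbors)

# Center on an SSN: two names appear, surfacing the discrepancy.
related, neighbors = mind_map("ssn", "000-00-0000")
# "Click" one of the names to re-center the map on it.
by_name, _ = mind_map("name", "J. Doe")
```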

By presenting ideas in a non-linear manner, mind maps are designed to encourage a brainstorming approach. This also reflects the type of flexibility that Insight delivers.

The analytics UI is designed to enable users to perform personalized searches and visualizations that will meet the specific needs of their jobs, as well as their personalities. A compliance officer in a bank, for example, will be looking for a different set of data arranged in a different way than a marketing executive who might be working with the same set of loan documents. And two marketing executives might even approach the same loan documents in entirely different ways.

Insight’s flexibility is manifested through its UI and enabled by its powerful back-end
combination of Big Data and document capture technologies.

Potential Use Cases

So how can Big Documents work for you? It’s a new concept and Insight is a brand new
platform, so the use cases are still very much emerging. Insight is specifically targeted
at large sets of documents that can provide important data to be used in applications
similar to those in which users are leveraging Big Data. These include managing risk,
performing predictive analysis, and improving marketing.

Following is a look at applications in three document-intensive vertical markets where Insight could be leveraged:

Federal Government: The U.S. Federal Government is well known for the volume of
paperwork it generates. It was one of the earliest users of document imaging technology
and continues to put millions of documents per year into hundreds of diverse document
repositories. With an eye toward leveraging the documents stored in these repositories,
In-Q-Tel, a strategic investment firm which identifies innovative technology that could be
utilized by U.S. intelligence agencies, took a stake in Ephesoft in 2016.

Security is obviously a major concern for the Federal Government and assisting with
security checks is one potential use case for Insight. The U.S. government runs more than
two million background checks per year. This includes examining a number of documents
related to financial, employment, and criminal history, as well as potential foreign
influence. Insight can be used to accelerate this process by automating tasks like ensuring
that fields from various data sources match up or looking for keyword or text patterns that
might indicate undue foreign influence or a history of problems with former employers.

The federal government is also heavily involved in research and Ephesoft has volunteered
its software to help out in two areas:

Patents: Insight has been used to identify fields like patent numbers and dates on image-
based patent documents, as well as identify links across multiple documents. Insight’s
mind map visualization technology can be used to show how one patent is connected to
others based on references, citations, and abstracts.

Trade Data: Ephesoft is currently working with the US International Trade Administration, the
US Census Bureau, and the Bureau of Economic Analysis at Commerce to develop a public
knowledge base of combined trade data for use by American industry.

Financial Services: The financial services market is also known for a high volume of paper.
Because most of it is related to transactions, these documents are typically valuable and
highly regulated, and ECM has been a staple of financial services organizations for many
years. We’ve already mentioned some of the value that could be gleaned from extracting
and examining data on loan files. There are also potential applications for Insight in areas like
fraud prevention and uncovering money laundering.

Like it sounds, laundering is basically the process of cleaning up dirty, or illegally obtained, money. Organized crime and terrorist organizations will often create a paper trail to try to hide their sources of income. This can involve multiple invoices for the same service, carousel transactions for exploiting the VAT system in Europe, payments made by unrelated third-party companies, and phantom shipping charges. Insight could be used to automate data mining from transaction documents to detect organizations that may be running their money laundering operations through a bank, potentially saving the bank criminal penalties.

Improving customer services, mitigating risks, and ensuring regulatory compliance are a few
other ways in which financial services organizations could potentially leverage Insight.

Healthcare: Even with the recent increase in the adoption of electronic medical records
(EMR) systems, there can still be an enormous amount of paperwork in the medical industry,
not to mention historical records, which are almost all stored on paper or document images.
Being able to effectively mine these historical paper records could potentially lead to cures for illnesses and medical conditions. For patients currently undergoing treatment, being able to integrate doctors’ notes and lab results with electronic data being produced by monitoring
devices in real time could result in improved care. In addition, studies have estimated that
hundreds of billions of dollars each year are being wasted by inefficient billing practices.

Insight’s ability to mine data from documents and present it in a useful and actionable
format could prove invaluable to healthcare organizations looking to operate more
efficiently and provide better patient outcomes.

In addition to these vertical use cases, Insight could be used across industries to
improve records management. Effective records management policies are based on
consistency. This means that records are stored and destroyed in a regular and timely
manner, not on an ad hoc, plan-as-you-go basis. The first step in managing records,
however, is understanding what you have. Insight can be a key enabler for this.

Also, as records increasingly spread across multiple repositories, especially with the growth in use of file sharing systems like Box, Dropbox, Google Drive, Microsoft OneDrive, etc., there is an increasing amount of duplicate and outdated files being kept on servers. AIIM estimates that one-third to 70% of all content on unmanaged servers is ROT (redundant, obsolete, and trivial) files.

Discovering what an organization is storing, and keeping only that which is current and needs to be kept for compliance, customer service, and other business reasons, can be a huge factor in reducing both storage costs and potential risk. By providing a view into exactly what is being stored, Insight can be used to minimize ROT, cut storage costs significantly, and help ensure that files that were supposed to be deleted as part of a records retention policy do not have duplicates still in existence.
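One basic ROT-reduction step, flagging byte-identical duplicates across repositories, can be sketched with content hashing. The file paths and contents below are invented; a real crawler would also need near-duplicate detection, since scanned copies of the same document rarely match byte for byte.

```python
import hashlib

# Invented sample files from different repositories.
files = {
    "box/contract_v2.pdf": b"final signed contract",
    "dropbox/contract_final.pdf": b"final signed contract",  # identical bytes
    "fileshare/old_notes.txt": b"draft notes",
}

seen, duplicates = {}, []
for path, content in files.items():
    digest = hashlib.sha256(content).hexdigest()  # fingerprint of the content
    if digest in seen:
        duplicates.append((path, seen[digest]))   # duplicate + its original
    else:
        seen[digest] = path
```

Each flagged pair is a candidate for deletion under a retention policy, directly addressing the duplicate-file problem described above.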

Summary: Document Repositories Belong in the Data Lake

Hadoop has become a popular platform as organizations seek to create Data Lakes they can fully leverage in their Big Data applications. Hadoop is attractive because it’s scalable, fast,
and can efficiently manage vast data storehouses. And while electronically generated
data continues to increase exponentially every year, it’s important to remember that a lot
of important historical data, as well as important information being created today, resides
on documents. Documents have a long history serving as the primary record for business
transactions and that legacy cannot be ignored.

Unlike traditional document imaging applications, Ephesoft Insight is built on technology that can scale, has a UI that is accessible to business users, and can be simultaneously
integrated with multiple third-party applications and repositories. These qualities, along with
Insight’s powerful document classification and extraction capabilities, make it a natural fit for
expanding Big Data applications.

After all, are you really getting all you need from Big Data if you are ignoring the up to 80% of
your content being stored in document repositories?

About Ephesoft
Based in Laguna Hills, CA, Ephesoft was founded in 2010 as the first vendor of open source
document capture software. It still incorporates open source technology in its platform,
including Hadoop and other elements of its Insight Big Data Framework. Ephesoft Transact
is employed by over 500 organizations worldwide for process automation and intelligent
document recognition applications. Insight was launched in 2016, after the concept was
announced in 2015. In 2016, driven by interest in Insight, In-Q-Tel made an equity investment
in Ephesoft. In-Q-Tel looks for innovative technologies that can be leveraged by the U.S.
Federal government.

Global Headquarters
Ephesoft, Inc.
23041 Avenida De La Carlota, 100
Laguna Hills, CA 92653
United States
Phone: +1-949-335-5335
Email:

UK/EMEA Headquarters
Ephesoft UK Ltd.
6-8 Market Place, Reading
Berkshire RG1 2EG
United Kingdom
Phone: +44 1183282620
Email:

German Headquarters
Ephesoft GmbH
Tiergartenstr. 11
35619 Braunfels
Germany
Phone: +49 6442 706 5488
Email:

Localized Italia Office
Ephesoft Italia, Srl
Piazza IV Novembre 7
20125 Milan
Italy
Phone: +39 (02) 8088 6345
Email:

For a demonstration and solution presentation visit today.
© 2016 Ephesoft. Smart Capture is a registered trademark. All Rights Reserved.