3. RELATED WORK
Beyond the prior systems the Redeye Analysis Workbench has descended from, there is a variety of work that we have drawn from for inspiration and direction. In general these works fall into two categories: document triage and digital forensics.

While tools currently exist for many of these stages, which will be discussed in the following section, the computer profile and crime potential phases are lacking dedicated tools. At these stages we see an opportunity for document triage with a focus on helping an analyst make sense out of documents they have already processed through the first two stages.
Figure 1 - Search View

technologies: (1) the limited applicability of these applications to legacy or obsolete systems and media and (2) the inability for these technologies to recreate the initial state and layout of the data.

4. USE CASES
As shown in our related work, there is a gap in the workflow of forensic document analysts between their document recovery and cataloging tools and their analysis tools. Currently this gap is filled with extensive manpower that manually filters the uninteresting from the interesting documents for further analysis. This is where document triage work has the potential to improve the workflow of analysts. This section will discuss two use cases we observed and how a document triage system improves that workflow.

4.1 Actor Search
In the sponsor's previous workflow, a major portion of an analyst's time was spent manually searching through documents for information about entities and actors. This information could be geographical data, names, or contact information. This information was then collated to identify aliases and assemble profiles of the actors involved in the case. The process as it previously stood is described with the following narrative:

Julius and Dan are analysts working for a federal law enforcement agency. They are currently working on a case with suspected money laundering aspects. In order to pursue this case, the agency has obtained a warrant and conducted a search of the homes and offices of a suspected money launderer. This search yielded tens of thousands of documents, both scanned from paper originals and retrieved using FTK from hard disk images taken during the search. These documents are then indexed into a database that allows users to browse through the files one at a time. Julius and Dan divide the set of documents between them and then begin to look through each one. If Microsoft Office or other pre-installed software cannot open a document, the file is passed by. If a document is not in English, they determine based on appearance if it is important enough to send to a translator. Scanned documents are simple TIF files in which the text cannot be searched once opened. As relevant material is found, the locations of the original files on the disk image are kept track of so the files can later be copied into a folder for import into Analyst's Notebook where further analysis can then occur.

As the narrative shows, the analyst's job was fraught with manual tasks that could be automated. By contrast, the following narrative shows the workflow we imagined when using Redeye.

Julius and Dan have now moved to using Redeye as a document triage tool. When documents are acquired under a warrant, the Redeye Ingestion tool is directed to their location and begins to process the documents. Instead of splitting the documents amongst themselves and looking at all of them, Julius and Dan now interact with the collection via search, looking by relevant names, places, or other keywords. Relevant documents are saved in a sub-collection that can later be exported with one click and are ready for import into Analyst's Notebook. Scanned documents are also now searchable due to the integration of Optical Character Recognition in the ingestion pipeline. If foreign language documents are found, the system provides a rough machine translation that informs the analysts if they should send it to a linguist. Tasks that took Julius and Dan longer than a year can now be accomplished in weeks or days. Comments can now be stored on documents, allowing Julius and Dan to easily see each other's insights when documents are re-examined.

Using Redeye, the analysts' work has become markedly more efficient and interactions between analysts occur with greater frequency. In general, Redeye has altered the nature of their work from sifting through documents to searching for entities. Analysts are now free to ask more questions and revisit documents. Additionally, since searching can be performed across cases, they can build links between cases, something that was difficult-to-impossible previously.
Figure 2 - Advisor View

4.2 Term Matching
The second observed activity in the prior workflow was spent on term matching. The term matching task involves manually scanning through documents looking for relevant keywords from a list of terms. Unlike searching, the goal is not to find one, all or none but to find documents that match several critical phrases. This process is used to filter the potentially relevant documents out. This process can be imagined with the following narrative:

Julius and Dan are provided a several page list of important terms for the case at hand from a forensic accountant. The list includes terms that may indicate money laundering divided by categories as well as names, organizations, and locations of interest. None of the terms by themselves are indicative nor is the full, multi-page list of terms likely to be found in a single document. Instead Julius and Dan browse through documents looking for documents that contain several of the phrases. The location of these documents is recorded and later the documents are copied off to a folder for further analysis in Analyst's Notebook. Due to the inherently poorly defined nature of the human task of determining what a sufficient number of terms is, the process is not easily reproducible nor is it comprehensive, as it relies on Julius and Dan not overlooking particular documents or missing terms in a document.

Much like the previous search case, the term matching case is fraught with automatable processes. The use case we envisioned for automating this process draws from the prior work on Raptor.

Julius and Dan are provided a list of key terms in categories from a forensic accountant; however instead of manually looking through documents for these terms, Julius and Dan divide the list by category and load their term sets into the Redeye Analysis Workbench. After several minutes of waiting, the Workbench produces a set of result lists each corresponding to a category of terms. The documents in the set are ordered according to the similarity of the document to the terms in the category it matched against.

By providing these analytical tools to help filter down a large noisy set of documents to a more manageable size, we enable analysts to more quickly triage documents into sets that are relevant to the case and sets that are not. Thus, we believe that we have helped bridge the gap between the document acquisition phase and the evidence-reporting phase.

5. SYSTEM DESIGN
Redeye is a combination of three major components: an ingestion pipeline, an Apache SOLR repository [20], and an Eclipse RCP-based [16] user tool. Each component will be covered in this section.

5.1 Ingestion
Each case processed presents different needs and interests. This leads to a design wherein the ingestion process is broken up into a series of steps that can be arranged via a pipeline to match the specific requirements of each case. This pipeline is loosely coupled and backed by a temporary state management store implemented using MongoDB [17]. The objectives of the ingestion process are to extract all of the relevant metadata from files as well as perform pre-processing to aid in performance during the analysis stages. Once this process is complete, the analysis tools do not need to interact with the original files other than to display them if requested.

Some examples of the pipeline steps include: tree building, text extraction, entity extraction, metadata extraction, duplicate file consolidation, compressed file extraction, machine translation, tokenization, and population into SOLR. With the exception of tree building, which is a recursive crawl of the raw data that builds the initial tree structure in the state management store, the majority of the stages can be ordered, included/excluded, re-run, etc. as desired. The pipeline is exposed via a simple graphical user interface that serves as a harness by which the analyst can control which stages are run, in what order, and how many times.
5.3 Workbench
The workbench represents the user interface for analysts in Redeye. It is built in the Eclipse RCP framework [16] around three major tasks. Each task is supported with a perspective containing one or more views. The three tasks as mentioned in section four are search, advisor, and analysis. Search and Advisor each have a single view, while analysis has three views: List View, Graph Visualization, and Detail View. Each of the three analysis views represents an alternative way of looking at relevant documents identified in the search and advisor perspectives.
Figure 4 - Analysis View

5.3.3 Advisor View
In addition to the search view, the Redeye tool also provides the advisor view. The advisor view (Figure 3) is the implementation of the Raptor and the DTHSTR techniques as mentioned in our prior work.

The advisor view provides a large text entry box that can be populated by the user through direct entry, copying and pasting from another application, or by loading a text file already on the user's machine. The user can then control the number of documents returned per set with a slider from 1 to the number of coarse-selected documents. This number of coarse-selected documents is adjustable in the system properties but is considered an advanced feature that most users would not normally use.

Finally the user can push the build result sets button and the system begins processing the term list and comparing the terms to documents in the SOLR repository. The process which the advisor performs is described as follows. First, the input is split on empty lines to form sets of category documents. The first line in each category becomes the title of the category and the remaining lines become individual terms. Using a predefined threshold, the terms to be searched are evaluated on their discriminatory value and are either kept or trimmed from the search. Using these trimmed terms, the SOLR repository is searched for all documents that match one or more terms. The results are ranked by textual similarity to the search terms, and the top documents up to a certain number, the coarse selection number set in the system preferences, are kept. Second, for each document retrieved by the coarse filter, the TF-ICF weighted term vector is compared to the TF-ICF weighted term vector of the category using a standard cosine similarity metric. This provides a ranking of documents that is cut off after a certain number of documents, the fine selection number set by the slider. Each of the resulting document sets is then saved to data sets for further analysis in the system.

5.3.4 List View
Once pre-triage filtering has been conducted using either the search or the advisor views, the user can then switch to the list view. The list view, as shown in Figure 4, consists of five major components: the list, properties, entities, comments, and the timeline.

The list shows the main information about documents in the selected document set. In order to simplify the information presented, the list is paginated into sets of a user-specified size. Each document shown has a title, a number of status indicators, a snippet from the content of the document and the path to the original document. The status indicators help show various information about the document. These indicators are whether the original document is available, if the document has been flagged as important, if there are any comments about the document and, if hidden documents are set to be shown, if the document has been hidden.

Once a document is selected, a number of views are populated with information about that document. The first of them is the properties view. The properties view shows metadata extracted about the file itself such as when it was created, if it has text, and if it was encrypted. If the file is an email, properties from the email header are displayed in addition to properties about the file itself. These include subject, sender, recipient, and date sent.

The entities view, in contrast to the properties view, shows metadata about the document instead of the file. This is currently restricted to entities extracted during ingestion using a number of regular expressions. Entities include phone numbers and email addresses, among others.

The last view is the timeline view. The timeline view is a customized instance of the SIMILE JavaScript timeline inside an embedded web browser [21]. The timeline is hosted via an embedded instance of the Jetty web server. Additionally two J2EE web services provide both the JSON dataset to populate
the timeline and a callback feature to feed information about document selection from the timeline back into the Redeye Workbench.

In addition to this view of the data, we also provide two additional views: one for viewing clusters of documents, and another for delving into the terms and entities in detail. Next we will discuss the cluster visualization and then the detail view.

5.3.5 Cluster Visualization
One of the major features of one of the predecessor tools of the Redeye Workbench, Piranha, is a hierarchical clustering of pages based on top terms and phrases within the document [1]. This was presented as a radial graph. This same technique has been adapted into Redeye along with a new implementation of the cluster visualization. Unlike the visualization in Piranha, the Redeye tool allows smooth zooming and panning without pauses between actions. This allows a more natural investigation of the graph [22]. Analysts, as part of their training, must verify the importance of every document to a case. Traditionally this meant wading individually through millions of documents for a few relevant ones. The cluster visualization allows the analyst to still look at each one but in a more intelligent manner. Instead of looking at each document in depth, they can first look at the clusters to triage which sets of documents are more likely to contain pertinent information. This enables the analyst to not only prioritize their search for relevant information, but allows them a means, by viewing the top terms of a document or a cluster, to more rapidly perform an exhaustive search, as the top terms help inform the analyst of how extensively they need to review documents in a cluster. An example of this view can be seen in Figure 5.

Figure 5 - Cluster Visualization

We have devised a novel graph layout algorithm around the desired properties of the graph. The properties we wish to guarantee are that all document nodes are on the same radius of the graph and that the root of the tree is located in the center. Additionally we sought to minimize label overlap and edge length without relying on a dynamic layout algorithm, such as force atlas [23]. Our solution is to flip the problem of laying out a graph upside down: lay out the leaves first and then walk up the tree. This differs from other layouts that start from the root and work their way down to the leaves [24].

In order to perform this, we first need to know the leaves in order of their height. This is performed by first walking the tree using a post-order traversal and storing each node into a stack. After this is complete, we iterate through the stack, setting the height of each node to either zero or the maximum height of the node's children plus one. These height-labeled nodes are placed onto another stack that is then iterated through to calculate the angle of the node on its level.

If a node has no children, the angle of the node is equal to the following:

[equation not recovered]

where n is the number of leaf nodes, i is the count of leaf nodes already laid out, h is the height of the current node, and H is the maximum height of all nodes. If a node does have children, the angle is equal to the average angle of its children.
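The height and angle steps of the leaves-first layout described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The tree is a plain dict of child lists (a hypothetical representation), a single post-order pass replaces the two explicit stacks, and each leaf is assumed to sit at the evenly spaced angle 2πi/n, which may differ from the paper's own leaf formula.

```python
import math
from typing import Dict, List, Tuple

Tree = Dict[str, List[str]]  # node -> children; hypothetical representation


def post_order(tree: Tree, root: str) -> List[str]:
    """Return nodes so that every node follows all of its descendants."""
    out: List[str] = []

    def visit(node: str) -> None:
        for child in tree.get(node, []):
            visit(child)
        out.append(node)

    visit(root)
    return out


def layout_angles(tree: Tree, root: str) -> Tuple[Dict[str, int], Dict[str, float]]:
    """Compute (height, angle in radians) per node, leaves first."""
    order = post_order(tree, root)
    n = sum(1 for node in order if not tree.get(node))  # number of leaves
    height: Dict[str, int] = {}
    angle: Dict[str, float] = {}
    laid_out = 0  # i: leaves already placed
    for node in order:
        children = tree.get(node, [])
        if not children:
            height[node] = 0
            # ASSUMPTION: even spacing around the circle.
            angle[node] = 2 * math.pi * laid_out / n
            laid_out += 1
        else:
            # Height is the tallest child plus one; angle is the mean of
            # the children's angles, as stated in the text.
            height[node] = max(height[c] for c in children) + 1
            angle[node] = sum(angle[c] for c in children) / len(children)
    return height, angle


tree = {"root": ["a", "b"], "a": ["d1", "d2"], "b": ["d3", "d4"]}
height, angle = layout_angles(tree, "root")
```

Because children always precede their parent in post-order, each parent's angle can be averaged in the same pass that places the leaves.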
Finally, we calculate a radius for each node in the graph. The formula of the radius is the following:

[equation not recovered]

where d is the maximum distance in pixels that a node can be from the center of the graph and B is the beta value at h with two tuning parameters [25]. We have experimentally determined satisfactory values of 1 and 2 for these parameters.

Figure 6 - Detail View

has documents from. This allows the detail view to be restricted to results from a particular collection.

The Top and All tabs are similar in that they consist of three tabs, each of which is dedicated to words, phrases, or entities. The primary difference between the two is that "Top" only displays the top 500 of each in order of importance in the collection, measured by a TF-ICF weighted term count, while "All" displays every word/phrase/entity in alphanumeric order.
positive and we are currently evaluating further use cases in which we would be able to extend Redeye to help reduce the demands of document pre-processing on the analyst. We are currently considering four possible areas of future work.

Our sponsor has indicated a strong interest in semi-automatic generation of social network graphs from documents. This area of work would extend existing work on social network extraction to help provide a visual and analyst-updatable representation of the graph of actors involved in an investigation.

Another area is alias resolution. With the kinds of cases the sponsor deals in, individuals may be using several alternate names, companies may have various shell names, and both may use more than one email address, phone number or mailing address. By analyzing clues in documents, the analysts currently resolve several virtual identities into one physical identity or organization. Providing tools and techniques to speed the process could provide a significant speed-up in the analysis of documents involved in a case.

Furthermore, there may be value in investigating visualization of the document collection in bulk. This kind of visualization could help analysts rapidly narrow down documents of interest or potentially allow them to spot abnormal documents in a collection that may contain important facts about a case.

Finally, we would like to apply more advanced named entity extraction techniques to help identify actors, locations, addresses and other important pieces of information contained in documents. Due to the changing domain of cases and the often highly technical nature of the domains, traditional corpus-based named entity recognition (NER) techniques may not prove adequate to provide sufficiently high levels of performance on multiple cases. Instead, we are looking into developing novel semi-supervised and unsupervised NER techniques in order to address these concerns in dealing with technical cross-domain collections.

In conclusion, we have identified a need for support in forensic document analysis, examined an existing workflow in use by a law enforcement agency and singled out two use cases where computer support could improve their workflow, and designed a digital-library-inspired tool for managing their document collections. We have shown that a digital libraries approach to forensic document analysis meets the analysts' need and provided a direction for further work in this area.

7. ACKNOWLEDGEMENTS
This document was prepared by Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, Tennessee 37831-6285; managed by UT-Battelle, LLC, for the US Department of Energy under contract number DE-AC05-00OR22725.

This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.

8. REFERENCES
[1] J. W. Reed, T. E. Potok, and R. M. Patton, "A multi-agent system for distributed cluster analysis," in Proceedings of Third International Workshop on Software Engineering for Large-Scale Multi-Agent Systems (SELMAS'04) Workshop in conjunction with the 26th International Conference on Software Engineering Edinburgh, Scotland, UK: IEE, 2004, pp. 152-5.
[2] J. W. Reed, Y. Jiao, T. E. Potok, B. A. Klump, M. T. Elmore, and A. R. Hurson, "TF-ICF: A new term weighting scheme for clustering dynamic data streams," in Machine Learning and Applications, 2006. ICMLA'06. 5th International Conference on, 2006, pp. 258-263.
[3] R. M. Patton, W. McNair, C. T. Symons, J. N. Treadwell, and T. E. Potok, "A Text Analysis Approach to Motivate Knowledge Sharing via Microsoft SharePoint," in System Science (HICSS), 2012 45th Hawaii International Conference on, 2012, pp. 3670-3678.
[4] S. Bae, R. Badi, K. Meintanis, J. Moore, A. Zacchi, H. Hsieh, C. Marshall, and F. Shipman, "Effects of display configurations on document triage," Human-Computer Interaction-INTERACT 2005, pp. 130-143, 2005.
[5] C. C. Marshall and F. M. Shipman III, "Spatial hypertext and the practice of information triage," in Proceedings of the eighth ACM conference on Hypertext, 1997, pp. 124-133.
[6] F. M. Shipman, H. Hsieh, J. M. Moore, and A. Zacchi, "Supporting personal collections across digital libraries in spatial hypertext," in Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries, 2004, pp. 358-367.
[7] S. Bae, D. H. Kim, K. Meintanis, J. M. Moore, A. Zacchi, F. Shipman, H. Hsieh, and C. C. Marshall, "Supporting document triage via annotation-based multi-application visualizations," in Proceedings of the 10th annual joint conference on Digital libraries, 2010, pp. 177-186.
[8] G. Buchanan and T. Owen, "Improving skim reading for document triage," in Proceedings of the second international symposium on Information interaction in context, 2008, pp. 83-88.
[9] F. Loizides and G. R. Buchanan, "Performing document triage on small screen devices. part 1: structured documents," in Proceedings of the third symposium on Information interaction in context, 2010, pp. 341-346.
[10] G. Cantrell, D. Dampier, Y. S. Dandass, N. Niu, and C. Bogen, "Research toward a Partially-Automated, and Crime Specific Digital Triage Process Model," Computer and Information Science, vol. 5, p. 29, 2012.
[11] G. Buchanan, "Rapid document navigation for information triage support," in Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, 2007, pp. 503-503.
[12] J. Payne, J. Solomon, R. Sankar, and B. McGrew, "Grand challenge award: Interactive visual analytics palantir: The future of analysis," in Visual Analytics Science and Technology, 2008. VAST '08. IEEE Symposium on, 2008, pp. 201-202.
[13] IBM Corporation. (2012). IBM i2 Analyst's Notebook datasheet. Available: http://public.dhe.ibm.com/common/ssi/ecm/en/zzd03127usen/ZZD03127USEN.PDF
[14] i2 Limited, "i2 Analyst's Notebook 8: product overview," i2 Limited, 2009.
[15] J. L. John, "Adapting existing technologies for digitally archiving personal lives," iPRES 2008, p. 48, 2008.
[16] D. Rubel, "The heart of eclipse," Queue, vol. 4, pp. 36-44, 2006.
[17] 10gen Incorporated. (2013). mongoDB. Available: http://www.mongodb.org/
[18] Apache Software Foundation. (2012). Apache POI - the Java API for Microsoft Documents. Available: http://poi.apache.org
[19] Microsoft Corporation. (2012). IFilter interface. Available: http://msdn.microsoft.com/en-us/library/ms691105(v=vs.85).aspx
[20] Apache Software Foundation. (2012). Apache Solr. Available: http://lucene.apache.org/solr/
[21] Massachusetts Institute of Technology. (2009). SIMILE Widgets Timeline. Available: http://www.simile-widgets.org/timeline/
[22] I. Herman, G. Melançon, and M. S. Marshall, "Graph visualization and navigation in information visualization: A survey," Visualization and Computer Graphics, IEEE Transactions on, vol. 6, pp. 24-43, 2000.
[23] M. Bastian, S. Heymann, and M. Jacomy, Gephi: An Open Source Software for Exploring and Manipulating Networks, 2009.
[24] G. M. Draper, Y. Livnat, and R. F. Riesenfeld, "A survey of radial methods for information visualization," Visualization and Computer Graphics, IEEE Transactions on, vol. 15, pp. 759-776, 2009.
[25] K. Pearson, "Mathematical Contributions to the Theory of Evolution. XIX. Second Supplement to a Memoir on Skew Variation," Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, vol. 216, pp. 429-457, January 1, 1916.