tainers for complex architectures, end-users can download and run a preconfigured version of New/s/leak locally with a single command. Being able to process data locally, even without any connection to the internet, is a vital prerequisite for journalists working with sensitive data. All necessary source code and installation instructions can be found on our GitHub project page.9

New/s/leak connects directly to Hoover's index to read full texts and metadata for its own information extraction pipeline. Through this close integration with Hoover, New/s/leak can offer information extraction for a wide variety of data formats. In many cases, this drastically reduces or even completely eliminates the work needed to clean and preprocess large datasets beforehand.
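Concretely, the pipeline can stream documents straight from the Elasticsearch index that Hoover maintains. A minimal sketch with the official Python client is shown below; the index name ("hoover-testdata") and the field layout ("text" plus flat metadata fields) are illustrative assumptions, not Hoover's actual schema:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://localhost:9200")

# Stream every document from the (hypothetical) Hoover collection index.
for hit in scan(es, index="hoover-testdata", query={"query": {"match_all": {}}}):
    source = hit["_source"]
    full_text = source.get("text", "")  # extracted full text; assumed field name
    metadata = {k: v for k, v in source.items() if k != "text"}
    # ... hand full_text and metadata to the information extraction pipeline
```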
In many cases, journalists follow some hypothesis to test in their investigative work. Such an investigation can involve looking for mentions of already known terms or specific entities in the data. This can be realized via dictionary lists provided to the initial information extraction process. New/s/leak annotates every mention of a dictionary term with the respective list type. Dictionaries can be defined in a language-specific fashion, but can also be applied across documents of all languages in the corpus. Extracted dictionary entities are displayed along with extracted named entities in the visualization.
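A minimal sketch of this dictionary tagging, assuming plain term lists per list type (the list types and terms here are invented examples; the production pipeline performs this step inside its UIMA workflow):

```python
import re

# Hypothetical dictionaries: list type -> user-provided terms.
DICTIONARIES = {
    "organization_watchlist": ["Acme Offshore Ltd", "Global Shell Corp"],
    "project_codenames": ["Operation Nightfall"],
}

def annotate_dictionaries(text):
    """Yield (start, end, surface form, list type) for every dictionary hit."""
    for list_type, terms in DICTIONARIES.items():
        for term in terms:
            for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
                yield match.start(), match.end(), match.group(), list_type
```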
In addition to self-defined dictionaries, we annotate email addresses, telephone numbers, and URLs with regular expression patterns. This is especially useful for email leaks, both to reveal communication networks between persons and to filter for content related to specific email accounts.
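The patterns below illustrate the idea; they are simplified stand-ins rather than the exact expressions used in New/s/leak:

```python
import re

# Simplified annotation patterns for pattern-based entity types.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url":   re.compile(r"https?://[^\s\"'<>]+"),
    "phone": re.compile(r"\+?\d[\d ()/.-]{6,}\d"),
}

def annotate_patterns(text):
    """Yield (start, end, surface form, type) for every pattern match."""
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            yield match.start(), match.end(), match.group(), label
```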
5.3 Temporal Expressions

Tracking documents by their creation date or by the temporal events they mention can provide valuable information during investigative research. Unfortunately, many document sets (e.g. collections of scanned pages) do not come with a specific document creation date as structured metadata. To offer a temporal selection of contents to the user, we extract mentions of temporal expressions. This is done by integrating the HeidelTime temporal tagger (Strötgen and Gertz, 2015) into our UIMA workflow. HeidelTime provides automatically learned rules for temporal tagging in more than 200 languages. Extracted timestamps can be used to select and filter documents.
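Once normalized timestamps are attached to documents, temporal selection becomes a simple range query. The sketch below assumes each document carries the ISO-normalized dates HeidelTime found in its text; the data layout is hypothetical:

```python
from datetime import date

# Hypothetical result of temporal tagging: documents with normalized dates.
documents = [
    {"id": "doc-1", "timestamps": [date(2009, 4, 1), date(2010, 1, 15)]},
    {"id": "doc-2", "timestamps": [date(2015, 7, 3)]},
]

def filter_by_period(docs, start, end):
    """Keep documents that mention at least one date within [start, end]."""
    return [d for d in docs if any(start <= t <= end for t in d["timestamps"])]

hits = filter_by_period(documents, date(2009, 1, 1), date(2010, 12, 31))
```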
5.4 Keyterm Extraction

To further summarize document contents in addition to named entities, we automatically extract keyterms and phrases from documents. For this, we implemented a keyterm extraction library for the 40 languages also supported in the previous step.15 Our approach is based on a statistical comparison of document contents with generic reference data. Reference data for each language is retrieved from the Leipzig Corpora Collection (Goldhahn et al., 2012), which provides large representative corpora for language statistics. We employ log-likelihood significance, as described by Rayson et al. (2004), to measure the overuse of terms (i.e. keyterms) in our target documents compared to the generic reference data. Consecutive sequences of keyterms in a target document are concatenated into key phrases if they regularly occur in that exact order; regularity is determined with the Dice coefficient. This simple method reliably extracts multiword units such as "stock market" or "machine learning". Since it also extracts named entities that occur significantly often in a document, there can be a substantial overlap between the two types. To allow for a separate display of named entities and keyterms, we filter out keyterms that have already been annotated as named entities. The remaining top keyterms are used to create a brief summary of each document for the user and to generate keyterm networks for document browsing.
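A minimal sketch of both statistics, assuming raw frequency counts are available for the target document and the reference corpus (lt-keyterms itself is a separate library whose internals may differ):

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Rayson-style log-likelihood: overuse of a term vs. reference data."""
    expected_t = size_target * (freq_target + freq_ref) / (size_target + size_ref)
    expected_r = size_ref * (freq_target + freq_ref) / (size_target + size_ref)
    ll = 0.0
    if freq_target > 0:
        ll += 2 * freq_target * math.log(freq_target / expected_t)
    if freq_ref > 0:
        ll += 2 * freq_ref * math.log(freq_ref / expected_r)
    return ll

def dice(freq_bigram, freq_a, freq_b):
    """Dice coefficient: regularity of two keyterms appearing as a sequence."""
    return 2 * freq_bigram / (freq_a + freq_b)

# "leak" appears 42 times in a 10,000-token document but only 5 times
# in a 1,000,000-token reference corpus -> high keyness score.
keyness = log_likelihood(42, 10_000, 5, 1_000_000)
```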
11 http://icu-project.org/apiref/icu4j
12 https://wikipedia.org
13 https://developers.google.com/freebase
14 A list of the 40 languages covered by Polyglot-NER can be found at https://tinyurl.com/yaju7bf7
15 https://github.com/uhh-lt/lt-keyterms
7 Case Study
16 A video of the procedure can be found at: http://youtu.be/96f_4Wm5BoU