Introduction
Understanding e-discovery
Meta-data
The defining difference in the nature of the information available for discovery
is the existence of meta-data in digital documents. Meta-data is information
about the information or, as it is commonly termed, data about data. Looking
at a concrete example such as a Microsoft Word word processing file, there will
be, in addition to the pages of text, a number of other data saved with the file.
These data include the creation date and time of the file, the date and time of
last modification and the date of the last access by the computer to the file.
Meta-data of this type is available not only for Microsoft Word files but for
files in general. Word in addition has specific meta-data fields for the name of
the file creator and their contact details.
1 hmi.ucsd.edu/pdf/HMI_2009_ConsumerReport_Dec9_2009.pdf
Depending on how the software is set up it may also contain information
about previous revisions and in one version the Internet address (IP) of the
machine on which the file was created.
Such a wealth of information is a boon for the discovering party, far exceeding
the information available from paper copies of the same document.
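
As a minimal sketch of the file system meta-data described above, the following Python function reads the size and the three timestamps kept for a file. The function name `file_timestamps` is hypothetical, and the meaning of `st_ctime` is an assumption that depends on the platform (creation time on Windows, last meta-data change on Unix-like systems). Application level fields such as the Word author name are not covered here.

```python
import os
from datetime import datetime, timezone


def file_timestamps(path):
    """Return the size and file system timestamps of one file (UTC)."""
    st = os.stat(path)  # reads meta-data only; the file is not opened

    def to_utc(seconds):
        return datetime.fromtimestamp(seconds, tz=timezone.utc)

    return {
        "size_bytes": st.st_size,
        "modified": to_utc(st.st_mtime),
        "accessed": to_utc(st.st_atime),
        # Platform dependent: creation time on Windows,
        # last meta-data change on Unix-like systems.
        "created_or_changed": to_utc(st.st_ctime),
    }
```

A call such as `file_timestamps("report.doc")` (the file name is illustrative) returns a dictionary that can be logged alongside the document itself.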
Consider also the case of email. A printout of the email will at most
include details of the sender and the recipient, whereas the original email file will
contain meta-data which includes the source email address, who else received
copies (even bcc or blind carbon copies) and the Internet route that the email
traveled.
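
The header meta-data mentioned above can be illustrated with Python's standard `email` package. This is only a sketch: the message text, the addresses and the function name `header_summary` are invented for the example. Note also that a Bcc header, where it survives at all, is normally found only in the sender's stored copy.

```python
from email import message_from_string
from email.utils import getaddresses

# Invented sample message; example.org / example.com are placeholder domains.
RAW = """\
Received: from mail.example.org (mail.example.org [192.0.2.10])
\tby mx.example.com; Mon, 1 Mar 2010 09:15:02 +0000
Received: from desk42.example.org (desk42.example.org [192.0.2.77])
\tby mail.example.org; Mon, 1 Mar 2010 09:15:01 +0000
From: Alice <alice@example.org>
To: Bob <bob@example.com>
Cc: Carol <carol@example.com>
Subject: Staff levels
Date: Mon, 1 Mar 2010 09:15:00 +0000

Report attached.
"""


def header_summary(raw):
    """Extract sender, recipients and the recorded mail route from a raw email."""
    msg = message_from_string(raw)

    def addrs(name):
        return [addr for _, addr in getaddresses(msg.get_all(name, []))]

    return {
        "from": addrs("From"),
        "to": addrs("To"),
        "cc": addrs("Cc"),
        "bcc": addrs("Bcc"),
        # Each Received header is one hop; the most recent hop is listed first.
        "route": [" ".join(h.split()) for h in msg.get_all("Received", [])],
    }
```

None of the `route` information would appear on a typical printout of the same message.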
While not deprecating traditional discovery methods, it can be seen that
digital discovery is potentially a much richer source of discovery information.
This form of e-discovery is potentially both quick and simple. The most basic
format is for a set of search criteria to be applied to the applicable media and a search
made for documents that meet those criteria. Generally the criteria specify
single or multiple instances of document date ranges, file names, document
types, directory names and similar readily available data included in the media
file structure information. Selection criteria such as this will yield a potential
universe of documents for further analysis.
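
A selection of this kind might be sketched in Python as follows. The function `select_files` and its parameters are hypothetical names chosen for illustration. The sketch relies only on `os.stat`, which reads the file structure information without opening the documents themselves.

```python
import os
from datetime import datetime
from fnmatch import fnmatch


def select_files(root, extensions=(), name_pattern="*",
                 modified_from=None, modified_to=None):
    """Yield paths under root whose file system meta-data meets every criterion."""
    wanted = {ext.lower() for ext in extensions}
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Document type, judged by the file extension.
            if wanted and os.path.splitext(name)[1].lower() not in wanted:
                continue
            # File name pattern, compared without regard to case.
            if not fnmatch(name.lower(), name_pattern.lower()):
                continue
            # Date range on the last modification time (local time).
            modified = datetime.fromtimestamp(os.stat(path).st_mtime)
            if modified_from and modified < modified_from:
                continue
            if modified_to and modified > modified_to:
                continue
            yield path
```

For example, `select_files("/mnt/evidence", extensions=[".doc"], modified_from=datetime(2010, 1, 1))` would list word processing files modified since the start of 2010, where the mount point is an assumed location of a read-only copy.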
A more extensive and more time-consuming procedure is to perform full
text searching on the applicable media. Full text searching allows the criteria to
include searching within documents for the appearance of the search terms. In
this case, for instance, a word processing document in which a specific name
appears may be included in the potential universe of discoverable documents.
One of the problems with the full text search is that there are generally
only two ways to perform searches at this level of detail. The first requires that a full
text searching application index all the available media. Subsequent searches on
the index will quickly identify documents that conform to the inputted search
criteria. In this way various permutations of the criteria may be experimented
with quite quickly.
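
The indexing approach can be illustrated with a very small inverted index. This is a sketch under simplifying assumptions: the documents are already available as plain text, terms are split on non-alphanumeric characters, and the names `build_index` and `search` are hypothetical. Commercial indexing applications are considerably more elaborate.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def build_index(documents):
    """Map each lower-cased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in TOKEN.findall(text.lower()):
            index[term].add(doc_id)
    return index


def search(index, *terms):
    """Return the ids of documents containing every term (a simple AND query)."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()
```

The media is read once by `build_index`; each later call to `search` consults only the index, which is why different combinations of criteria can be tried quickly.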
The second approach is to use a text search tool to search through the files
on the disk for a matching text string or value.
This method has a significant disadvantage. Each text or value search must
be performed discretely on the same data in real time. In essence it means that
the same data is searched over and over for the various criteria, truly a
time-consuming exercise.
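
By way of contrast, a direct scan might look like the following sketch, in which the hypothetical function `scan_for_text` reads every file under a directory for each search term. It assumes the term is stored as plain text in the given encoding, so it would not find text inside compressed or encrypted files. Because it opens each file, it should be run against a read-only copy rather than the original media.

```python
import os


def scan_for_text(root, needle, encoding="utf-8"):
    """Return paths under root whose raw bytes contain needle; reads every file."""
    target = needle.encode(encoding)
    hits = []
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    if target in fh.read():
                        hits.append(path)
            except OSError:
                continue  # unreadable file: skipped in this sketch
    return sorted(hits)
```

Searching for ten different names means ten complete passes over the same data, which is the disadvantage noted above.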
Although there are a number of software tools that may assist in this process,
they are generally not installed on business systems as a matter of course.
In practical use the second approach leans toward a more forensic analysis
of the data set.
This type of discovery uses the tools and methods of the e-forensic specialist,
where one or more bit image or exact copies of the media are created. This
process does not change the data on the target machine in any way and produces
an exact copy of the data on some alternate media. It should be recognised that
the physical media does not have to be identical as it is the data and its logical
structure that is being reproduced in this situation.
Given the forensic approach to discovery, the evidential purity of the data
may be ensured by both chain of evidence procedures and adherence to standard
computer forensic principles. These principles include the generation of
checksums which ensure that the data has not been changed either intentionally
or inadvertently.
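
A checksum comparison of this kind can be sketched with Python's `hashlib` module. The choice of SHA-256 is an assumption made for the example, as the text does not name an algorithm, and the function names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(original_path, copy_path):
    """Return True when both files produce the same digest."""
    return sha256_of(original_path) == sha256_of(copy_path)
```

In practice the digest of the image would be recorded at the time of acquisition and recomputed whenever the copy is used, so that any later change becomes apparent.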
In addition the forensic approach opens the opportunity for discovery to be
extended to deleted, hidden and other data not accessible to the user.
Clearly a forensic discovery of all media or data may raise issues of scope
and privacy that are outside the scope of this paper.
File naming conventions in e-discovery
The advent of long file names in general use has enabled users to name files
in a more meaningful and relevant way. Prior to the widespread adoption of the Microsoft
Windows operating system the naming of files and directories (folders) was
constrained to eight characters, a dot and a further three characters. This gave rise
to rather obscure naming conventions and made identification of target files
considerably more difficult.
The long file name (LFN) convention now allows a user to name a file in any
meaningful manner within the constraint of 255 characters in total. For historical
reasons the three character extension after the dot is still used to identify the type of
document, such as .doc for Microsoft Word word processing files. Other than this
the user is able to give the document any meaningful name. For instance
under the old 8.3 naming convention the 2010 staff levels report for human
resources may have been called hrstaf10.doc. Clearly a long file name such as
"2010 staff levels report for human resources" is not only more meaningful but
significantly enhances the discoverability of the particular document.
The use of the 3 (or sometimes 4) character file extension to identify the
type of file, or more specifically the application that is associated with that file
type, also affords an opportunity for documents to be discovered based on their
file type. As an example the discovery of engineering drawings may be facilitated
by searching for the .dwg extension if the discovered party is using AutoCAD for
the drafting efforts.
The computer discovery process is in most respects the same as the paper
discovery process. In the first instance the scope of the potential universe must be
defined. Generally, but not exclusively, this entails limits to date and time,
originator or recipient of the document and keywords. In computer discovery there
is the additional parameter of the data hosting machine identity. This may be
one or more personal computers located on an individual's desktop, a laptop
or in some cases a home computer that is used to perform work related
activities. These are not however the only data repositories. File servers, read/write
archival devices such as backup tapes, CD/DVD and USB devices are all potential
sources of discoverable data. Other devices that conform to the
strict definition of e-discovery devices, such as cellular phones and personal digital
devices, are not considered in this short introductory paper. The practitioner
should nevertheless be aware that the potential universe of discoverable data may include
these other devices.
porate network. Copies of those emails may be found on either or both of the
parties' computers. There is also the possibility that the email server may have a
recoverable copy of the email that was forwarded through it. Additionally the
emails may have been backed up one or more times before being deleted from
the system.
Likewise a particular document, regardless of type, may be included as part
of a backup or archival routine for an extended period of time, until it is deleted
from the file server. In this instance the central repository copy has been
removed; however the copies that have been saved to the backup media are still
in existence.
It is opportune to consider that most sites cycle their backup media, since
their main function is to provide a mechanism for the recovery of current infor-
mation in the event of a failure or malfunction of the computer system. This
means that in general, depending on the cycle, data may be retained for any-
where from a week to months.
This does not necessarily mean that data from say a previous year is lost.
Quite often backups at critical corporate times are archived on a more permanent
basis. This may be limited to the end of financial year process. In other
enterprises the end of month processing archives are retained for a number of
years.
There are applications that will sort through digital files and find those that
are duplications. In this instance duplication means quite specifically those files
that have exactly the same file contents. Consider an email that is sent
from one person to another within an internal network. The content of
the email, that is the text portion that was sent, will be the same. However the two
electronic documents are not identical. The header information, which is in part
hidden from the user, is materially different and thus from a duplication point
of view the two documents are discrete and different.
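
The point can be demonstrated with two invented copies of one message. In this sketch the hypothetical function `digests` returns one SHA-256 digest for the whole stored message and one for the text portion alone. The sample messages, addresses and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from email import message_from_string

# Invented copies of one message: as stored by the sender and by the recipient.
SENDER_COPY = """\
From: alice@example.org
To: bob@example.org
Subject: Staff levels
Message-ID: <1234@example.org>

Report attached.
"""

RECIPIENT_COPY = """\
Received: from desk42.example.org by mail.example.org; Mon, 1 Mar 2010 09:15:01 +0000
From: alice@example.org
To: bob@example.org
Subject: Staff levels
Message-ID: <1234@example.org>

Report attached.
"""


def digests(raw):
    """Return (digest of the whole message, digest of the body text only)."""
    body = message_from_string(raw).get_payload()
    whole = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    text_only = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return whole, text_only
```

Comparing `digests(SENDER_COPY)` with `digests(RECIPIENT_COPY)` gives different values for the whole message but the same value for the body, which mirrors the distinction drawn above.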
The other way to potentially identify duplications is to compare the user
identifiable file parameters. Things such as the same file name, file size and
other meta-data may point to potential duplicate documents. From an e-forensic
viewpoint, the only reliable way to identify duplicate files is to perform what is
called a hash on the files.
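
A minimal sketch of duplicate detection by hashing follows. It assumes SHA-256 as the hash function, which the text does not specify, and reads each file whole, which suits a small example rather than very large media. The function name `find_duplicates` is hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicates(root):
    """Group files under root that have byte-for-byte identical contents."""
    by_digest = defaultdict(list)
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            by_digest[digest].append(path)
    # Only digests shared by two or more files indicate duplicates.
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Files are grouped purely on content, so two copies are matched even when their names and dates differ.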
Hashing
duplicate files and may assist in discovery by establishing associativity on both
persons' private data.
Retention policies
There are applications that will securely delete this hidden data, but in
general they require that the user actively install and use the product. The
presence of secure deletion applications is not in itself generally cause to
imply spoilage, as many organisations use secure deletion in much the same way
that they use encryption, that is to protect corporate information from those
who might exploit that information.
Encryption
Compression
A similar situation to the encryption issue is posed by the use of data compres-
sion utilities. Once again it is not the actual data compression that is material
but rather that the party conducting the e-discovery may not be aware that
compression was used and potential evidence may be overlooked.
A number of full text search applications are able to open and read the text
within commonly used compression schemes. It should be noted however that
the search routines native to the operating system are not generally able to
perform this function.
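
As an illustration of searching inside one common compression scheme, the following sketch uses Python's standard `zipfile` module to look for a text string in the members of a ZIP archive. The function name `search_zip` is hypothetical, and other formats, nested archives and password protected archives are outside this sketch.

```python
import zipfile


def search_zip(archive_path, needle, encoding="utf-8"):
    """Return names of archive members whose decompressed bytes contain needle."""
    target = needle.encode(encoding)
    with zipfile.ZipFile(archive_path) as archive:
        return [
            info.filename
            for info in archive.infolist()
            if not info.is_dir() and target in archive.read(info)
        ]
```

A plain byte scan of the archive file itself would generally miss the same text, because the stored data is compressed.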
There has been a shift over the years in the way that data is associated and
in repository design. In the earlier days of personal computers the trend was
toward storing the data with the application that created it. This was particularly
common with word processing and spreadsheet files. One of the driving
forces behind this was that the media structures were rather simple and the need
was to have the application and the data near and in many cases on the same
media.
As networks came into broader use there was a trend to store data in a
central repository where it could be managed and backed up more easily. To
afford a level of security, users were assigned usernames and passwords that
provided access to specific regions of the storage media. Often each user was
also allocated storage for their own work files in what was termed their home
directory. Users could selectively allow others access to individual files on a user
or group basis.
A final shift in the paradigm was the move toward project centric data
storage. All the data files for a particular task, function or project were stored
centrally together. In this design the application that created the data was in
many ways irrelevant, as was the specific user that created or modified the file.
A project repository could contain word processing, spreadsheet, engineering
drawing, in fact anything that related to the project.
Discovery by generations
Looking at the specific needs related to each data storage paradigm, we can see
that each one imposes different discovery techniques. We should however be
aware that most sites utilise a mixture of paradigms either intentionally or by
default.
In Generation 1, the machine based storage, discovery is centralised around
specific computers that may have been used for associated tasks during the
discovery time frame. In this regard the discovery is based on custodian principles
and is clearly a subset of the available machines. It is a rather simple task in
this type of structure to identify the documents of interest. From a business
viewpoint the potential for disruption is contained and manageable.
The second generation, user based storage, tends to concentrate the discovery
effort on the basis of the users who would have reason and authority to create
and edit the files associated with the discovery effort. Because the majority
of the files would be held in a central file server environment, selection of
those files would be based on the media structures that the discovered enterprise
maintains.
We should not however only consider the centrally maintained data but
also include any local files located on machines used by those users who are
identified in the discovery criteria. Additionally personal assistants and the like
of those users should also be included in the potential discovery set as they often
manipulate data and files for their superiors.
The third generation, project based data, once again concentrates the data in
a central repository and additionally classifies data on the basis of the project,
task or client or supplier. In this regard the potential universe is materially
constrained to those matters that can be readily identified as part of the issue
under discovery.
Although easier to identify the subset, there is still the matter of locally
stored data as covered above that must always be addressed.
other party what they are thinking or not thinking. Contrast this with other
forms, letter or even fax where the party has to compose the communication.
Another possible explanation is that email is a non-tangible form of commu-
nication and the sender may consider that it is in some way less real. Regardless
of the motivation, email messages are often a good source of discoverable infor-
mation.
Email discovery is similar to discovery of other documents. A set of criteria
is established which may include date range, sender or recipient names, subject
heading or content. In handling email it is essential to include the content of
any attachments in the discovery process.
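
Such criteria might be applied to a single stored message as in the sketch below, which uses Python's standard `email` package. The function `matches` and its parameters are hypothetical. The sketch assumes that the Date header carries a time zone offset and that `start` and `end` are time zone aware. Only text parts, including text attachments, are searched; binary attachments such as word processing files would need format specific text extraction that is not shown.

```python
from email import message_from_bytes, policy
from email.utils import parsedate_to_datetime


def matches(raw_bytes, sender, start, end, keyword):
    """Test one raw email against sender, date range and keyword criteria."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    if sender.lower() not in str(msg["From"]).lower():
        return False
    sent = parsedate_to_datetime(str(msg["Date"]))
    if not (start <= sent <= end):
        return False
    # Gather the subject plus every text part, attachments included.
    texts = [str(msg["Subject"] or "")]
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            texts.append(part.get_content())
    return any(keyword.lower() in text.lower() for text in texts)
```

Walking every part of the message, rather than reading only the visible body, is what brings the attachment content into the search.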
Spoilage by discovery
One of the significant issues is that spoilage may result from the very process
of a type 1 or document centric discovery. Although there are precautions that
may be taken, in general the very act of opening a document or even searching
for it will change at the very least the last access date, which may be significant
in any future e-forensic endeavor.
To overcome this limitation a number of techniques have been developed
to enable the investigator to conduct either a full-text index or search and
optionally extract those files that adhere to the discovery criteria.
Once again the method is determined by the type of system that is being
discovered. A stand alone PC with no connection to a wider network (which
is rare in a commercial environment) is a much easier subject for non-intrusive
discovery. Basically the method entails booting the machine with an alternative
operating system, usually Linux, and mounting the data storage devices
as read only so that the meta-data is not compromised.
The text index or even the discovered documents may then be stored on self
configuring removable media such as a USB memory device.
In adoption of this approach the forensic purity must be maintained and
both chain of custody and a record of the procedure must be kept.
Where the data is potentially located on a central file server or distributed
in a peer to peer manner a different approach must be adopted. Where the
data is located on a file server, mail server or other mass storage device then the
investigator should consider the use of backup media to conduct the discovery.
This may pose a number of problems in regard to offsite storage and data
media format. Both historically and currently there are quite a number of
varying media formats and data storage schemes.
The main deciding factor is going to be the time frame of the documents
being discovered. Where the matter relates to incidents in the past few years
then the data storage technology is going to be readily available to restore the
backup media to a forensically clean machine. By contrast where the matter
relates to events that are a decade or more old then the archived data may be
unreadable by current devices.
One example of this situation is where a firm has archived reel to reel tape
backups and the original tape drive is no longer in use or available. On the
positive side there are ways to retrieve this data but you should be aware of the
time and resource issues in such a discovery endeavor. This was the very
circumstance with the Exxon case where the so called smoking gun was discovered
on an old archive tape dating back to the 1980s.
Depending on the jurisdiction, there are parties who have the power to compel
a person to divulge or bypass a password or encryption scheme. The challenge
for the discoverer is that electronic indexing applications may not be able to
`look into' the document for text searching. This of course translates to the
discovering party not being aware of potential evidence.
Within commercial environments a given user may have access to or use a number
of discrete computer work stations including both laptop and home based
systems. In circumstances such as these, where the parties specifically involved
in the discovery may be identified, it is possible to narrow the search to specific
machines and personal directories on the central servers. This does not preclude
the searching of other machines or file server areas but may provide sufficient
material to justify a fuller discovery effort.