
e-discovery from Scratch

David Teisseire, CISSP

March 19, 2010

Introduction

Mention discovery to most legal practitioners and they immediately conjure up
images of rummaging through dusty and mold-encrusted boxes of documents
that have been stored in a warehouse or storage loft somewhere. Such delights
as mouse droppings, crushed dried insects and the biologically unidentifiable
add to the lure and excitement of conventional paper-based discovery.
In recent years, there has been an exponential rise in the quantity of data
being stored on electronic media. It has been suggested that only 30% of all
electronic data is ever printed to hard copy and that 97% of all data is held
in electronic form [1]. Such a situation creates a potential to use computer-based
discovery systems to extract data for discovery. This then is the field of computer
discovery or, as it is more commonly called, e-discovery.
Whereas the legal professional might consider e-discovery to be just another
source of discovery, it is in fact a potentially rich source of information not
available from traditional paper based discovery.

Understanding e-discovery
Meta-data

The defining difference in the nature of the information available for discovery
is the existence of meta-data in digital documents. Meta-data is information
about the information or, as it is commonly termed, data about data. Looking
at a concrete example such as a Microsoft Word word processing file, there will
be, in addition to the pages of text, a number of other data saved with the file.
This data includes the creation date and time of the file, the date and time of
last modification and the date of the last access by the computer to the file.
Meta-data of this type is not only available for Microsoft Word files but for
files in general. Word in addition has specific meta-data fields for the name of
the file creator and their contact details.
[1] hmi.ucsd.edu/pdf/HMI_2009_ConsumerReport_Dec9_2009.pdf

Depending on how the software is set up, it may also contain information
about previous revisions and, in one version, the Internet (IP) address of the
machine on which the file was created.
Such a wealth of information is a boon for the discovering party, far exceeding
the information available from paper copies of the same document.
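As a rough illustration of the file system meta-data described above, the following Python sketch reads the time stamps held for a single file. The function name `file_timestamps` and the dictionary keys are assumptions made for this example. Note that what the operating system reports as the "creation" time differs between platforms, as the comment in the code points out.

```python
import os
from datetime import datetime, timezone


def file_timestamps(path):
    """Return the file system time stamps for path as UTC datetimes."""
    info = os.stat(path)  # reads meta-data only; the file is not opened

    def to_utc(seconds):
        return datetime.fromtimestamp(seconds, tz=timezone.utc)

    return {
        "size_bytes": info.st_size,
        "last_modified": to_utc(info.st_mtime),
        "last_accessed": to_utc(info.st_atime),
        # Creation time on Windows; on most Unix systems this is the time
        # of the last meta-data change rather than the creation time.
        "created_or_changed": to_utc(info.st_ctime),
    }
```

Application level meta-data, such as the author fields that Word stores inside the document, is not visible to this call and needs a tool that understands the document format.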
Consider also the case of email. A printout of the email will at the most
include details of the sender and the recipient, whereas the original email file will
contain meta-data which includes the source email address, who else received
copies (even bcc or blind carbon copies) and the Internet route that the email
traveled.
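To make the email case concrete, here is a minimal sketch using Python's standard `email` package. The sample message, the addresses and the function name `header_summary` are invented for illustration. Each mail relay normally adds a `Received` header above the existing ones, so reading them in reverse gives an approximation of the route from origin to destination. Whether blind copy recipients are visible generally depends on which copy of the message is examined.

```python
from email import message_from_string

# Hypothetical raw message; real messages usually carry many more headers.
RAW_MESSAGE = """\
Received: from mail.example.org by mx.example.com; Mon, 1 Mar 2010 09:15:02 +0000
Received: from desktop.example.org by mail.example.org; Mon, 1 Mar 2010 09:15:00 +0000
From: alice@example.org
To: bob@example.com
Cc: carol@example.com
Subject: Staff levels
Date: Mon, 1 Mar 2010 09:14:58 +0000

Draft attached.
"""


def header_summary(raw_text):
    """Extract the sender, recipients and relay route from a raw message."""
    msg = message_from_string(raw_text)
    return {
        "from": msg["From"],
        "to": msg.get_all("To", []),
        "cc": msg.get_all("Cc", []),
        # Relays prepend Received headers, so reverse for origin-first order.
        "route": list(reversed(msg.get_all("Received", []))),
    }
```

None of the `Received` lines would normally appear on a printout of the same email.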
While not depreciating traditional discovery methods, it can be seen that
digital discovery is potentially a much richer source of discovery information.

Classes of computer discovery

Before looking at the specifics of an e-discovery session, we should consider the
two broad classes of e-discovery: document centric and media centric, and when
they may be appropriate.
The first class treats the e-discovery in much the same way as paper based
discovery. Stored electronic documents will be selected based on pre-determined
criteria, from the available universe of current and archival electronic data. If
an electronic document meets the selection criteria, it will be included in the
potential universe of discoverable documents. In this sense, the discovery is
document centric.
The other major type is what could be termed a forensic type discovery. In
this type the data is evaluated in a non-intrusive manner and may include the
evaluation of not only stored data but deleted and hidden information as well.
In this situation the discovery is media centric.
Within each type there are a variety of nuances, but essentially the choice
has to come down to one or the other of these broad classes. Of interest, it may
not be possible to change from a document centric to a media centric investigation
part way through a discovery, as the question of evidential purity may be raised
due to the initial document discovery.
The very process of document discovery may change the meta-data associated
with the documents. As mentioned above, meta-data is data about the
data such as the time and date stamps associated with a file. This data includes
amongst other things the creation time/date, time and date last modified and
last accessed. It can be seen that the very process of copying a file will, unless
precautions are taken, change the last access date at the very least. This change
may be material in some cases.

Document centric e-discovery

This form of e-discovery is potentially both quick and simple. The most basic
format is for a set of search criteria to be applied to the applicable media and a
search made for documents that meet those criteria. Generally the criteria specify
single or multiple instances of document date ranges, file names, document
types, directory names and similar readily available data included in the media
file structure information. Selection criteria such as these will yield a potential
universe of documents for further analysis.
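A selection of this kind can be sketched in a few lines of Python. The function name `select_by_metadata` and its parameters are assumptions for this example, and the date test uses the last modified time only. A real tool would support more criteria and would record what it examined.

```python
import fnmatch
import os
from datetime import datetime


def select_by_metadata(root, patterns, modified_from, modified_to):
    """Return paths under root whose name matches any pattern and whose
    last modified time falls between modified_from and modified_to."""
    selected = []
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            if not any(fnmatch.fnmatch(name.lower(), p.lower()) for p in patterns):
                continue
            path = os.path.join(folder, name)
            # os.stat reads the directory entry only; the file is not opened.
            modified = datetime.fromtimestamp(os.stat(path).st_mtime)
            if modified_from <= modified <= modified_to:
                selected.append(path)
    return sorted(selected)
```

For example, `select_by_metadata("/data", ["*.doc", "*.xls"], datetime(2009, 1, 1), datetime(2009, 12, 31))` would list the word processing and spreadsheet files last modified during 2009 under a hypothetical `/data` directory.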
A more extensive and more time consuming procedure is to perform full
text searching on the applicable media. Full text searching allows the criteria to
include searching within documents for the appearance of the search terms. In
this case, for instance, a word processing document in which a specific name
appears may be included in the potential universe of discoverable documents.
One of the problems with the full text search is that there are generally
only two ways to perform searches at this level of detail. The first requires a full
text searching application to index all the available media. Subsequent searches on
the index will quickly identify documents that conform to the search criteria
entered. In this way various permutations of the criteria may be experimented
with quite quickly.
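The indexing approach can be illustrated with a very small inverted index. This is a sketch only: the names `build_index` and `search` are assumptions, the documents are supplied as plain text, and commercial indexing applications handle many document formats, word stemming and phrase searches that are ignored here.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lower-cased word to the set of document names containing it.

    documents is a mapping of document name to its extracted text.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, *terms):
    """Return the documents that contain every one of the given terms."""
    matches = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*matches) if matches else set()
```

The media is read once to build the index. Each later query is then a dictionary lookup, which is why different combinations of terms can be tried quickly.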
The second approach is to use a text search tool to search through the files
on the disk for a matching text string or value.
This method has a significant disadvantage. Each text or value search must
be performed discretely on the same data in real time. In essence it means that
the same data is searched over and over for the various criteria, truly a time
consuming exercise.
Although there are a number of software tools that may assist in this process,
they are generally not installed on business systems as a matter of course.
In practical use the second approach leans toward a more forensic analysis
of the data set.

Media centric e-discovery

This type of discovery uses the tools and methods of the e-forensic specialist,
where one or more bit image or exact copies of the media are created. This
process does not change the data on the target machine in any way and produces
an exact copy of the data on some alternate media. It should be recognised that
the physical media does not have to be identical, as it is the data and its logical
structure that is being reproduced in this situation.
Given the forensic approach to discovery, the evidential purity of the data
may be ensured by both chain of evidence procedures and adherence to standard
computer forensic principles. These principles include the generation of
checksums which ensure that the data has not been changed either intentionally
or inadvertently.
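The checksum principle can be sketched as follows. SHA-256 is used here as an assumption, since the paper does not name a particular algorithm, and the function names are invented for the example. The file is read in chunks so that a large disk image does not have to fit in memory.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 checksum of the file at path as a hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_is_verified(source_path, copy_path):
    """True when the copy has exactly the same checksum as the source."""
    return sha256_of(source_path) == sha256_of(copy_path)
```

In practice the checksum would be recorded when the image is taken and recalculated whenever the image is later used, so that any change in between is detected.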
In addition the forensic approach opens the opportunity for discovery to be
extended to deleted, hidden and other non user accessible data.
Clearly an all of media or data forensic discovery may raise issues of scope
and privacy in the discovery that are outside the scope of this paper.

File naming conventions in e-discovery

The advent of long file names in general use has enabled users to name files
in a more meaningful and relevant way. Prior to the widespread adoption of the
Microsoft Windows operating system the naming of files and directories or folders
was constrained to eight characters, a dot and a further 3 characters. This gave rise
to rather obscure naming conventions and made identification of target files
considerably more difficult.
The long file name (LFN) convention now allows a user to name a file in any
meaningful manner within the constraint of 255 characters in total. For historical
reasons the three character extension after the dot is still used to identify the type of
document, such as .doc for Microsoft Word word processing files. Other than this
the user is able to give the document any meaningful name. For instance,
under the old 8.3 naming convention the 2010 staff levels report for human
resources may have been called hrstaf10.doc. Clearly a long file name such as
2010 staff levels report for human resources is not only more meaningful but
significantly enhances the discoverability of the particular document.
The use of the 3 (or sometimes 4) character file extension to identify the
type of file, or more specifically the application that is associated with that file
type, also affords an opportunity for documents to be discovered based on their
file type. As an example the discovery of engineering drawings may be facilitated
by searching for the .dwg extension if the discovered party is using AutoCAD for
its drafting efforts.

Setting discovery universe

The computer discovery process is in most respects the same as the paper
discovery process. In the first instance the scope of the potential universe must be
defined. Generally, but not exclusively, this entails limits to date and time,
originator or recipient of the document and keywords. In computer discovery there
is the additional parameter of the data hosting machine identity. This may be
one or more personal computers located on an individual's desktop, a laptop
or in some cases a home computer that is used to perform work related
activities. These are not however the only data repositories. File servers, read/write
archival devices such as backup tapes, CD/DVD and USB devices are all potential
sources of discoverable data. Other devices that conform to the
strict definition of e-discovery devices, such as cellular phones and personal digital
devices, are not considered in this short introductory paper, but the practitioner
should be aware that the potential universe of discoverable data may include
these other devices.

Data duplication: blessing or curse

The very nature of computerised storage within a commercial environment
generally means that there may be duplications and replications of discoverable
data. As a simple example two employees may exchange emails within the
corporate network. Copies of those emails may be found on either or both of the
parties' computers. There is also the possibility the email server may have a
recoverable copy of the email that was forwarded through it. Additionally the
emails may have been backed up one or more times before being deleted from
the system.
Likewise a particular document, regardless of type, may be included as part
of a backup or archival routine for an extended period of time, until it is deleted
from the file server. In this instance the central repository copy has been
removed, however the copies that have been saved to the backup media are still
in existence.
It is opportune to consider that most sites cycle their backup media, since
the main function of backups is to provide a mechanism for the recovery of current
information in the event of a failure or malfunction of the computer system. This
means that in general, depending on the cycle, data may be retained for
anywhere from a week to months.
This does not necessarily mean that data from say a previous year is lost.
Quite often backups at critical corporate times are archived on a more permanent
basis. This may be limited to the end of financial year process. In other
enterprises the end of month processing archives are retained for a number of
years.

What to do about duplications

There are applications that will sort through digital files and find those that
are duplications. In this instance duplication means quite specifically those files
that have exactly the same file contents. Consider an email that is sent
from one person to another within an internal network. The content of
the email, that is the text portion that was sent, will be the same. However the two
electronic documents are not identical. The header information that is in part
hidden from the user is materially different and thus from a duplication point
of view the two documents are discrete and different.
The other way to potentially identify duplications is to compare the user
identifiable file parameters. Things such as the same file name, file size and
other meta-data may point to potential duplicate documents. From an e-forensic
viewpoint, the only reliable way to identify duplicate files is to perform what is
called a hash on the files.

Hashing

Hashing is a one way mathematical calculation of a numeric value from the
contents of the file. This value is probabilistically unique for every discrete file
on the system. From this, any two or more files that return the same hash value are
essentially duplicate and identical files. This process is sensitive to data changes
such that the deletion, addition or modification of one character in the file will make
the hash checksum change significantly. We can see that using the hash value
method, the instance above of two people exchanging emails will create
non-duplicate files and may assist in discovery by establishing an association
between both persons' private data.
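As an illustration of hash based de-duplication, the sketch below groups items whose contents are byte for byte identical. The use of SHA-256 and the name `group_duplicates` are assumptions for this example, and the contents are passed in as bytes to keep the example self-contained.

```python
import hashlib
from collections import defaultdict


def group_duplicates(contents):
    """Group names whose contents hash to the same value.

    contents is a mapping of file name to the file's bytes. Only groups
    with more than one member, that is actual duplicates, are returned.
    """
    by_hash = defaultdict(list)
    for name, data in contents.items():
        by_hash[hashlib.sha256(data).hexdigest()].append(name)
    return [sorted(names) for names in by_hash.values() if len(names) > 1]
```

Two copies of the same report fall into one group, while a copy with a single extra character does not, which is the sensitivity to change described above.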

Performing a type 1 e-discovery

The discovered party would authorise their IT department or staff to perform
the type 1 searches of their own data repositories. From this they may produce
a listing of the discoverable documents to meet the specified criteria.
These documents may then be sent to archival media for distribution to the
parties concerned.
Increasingly in other jurisdictions, notably the US, there has been a trend
to use a court appointed or mutually agreed third party agent to perform the
e-discovery for both parties. The advantage of this is a lessened chance
of bias or conflict of interest.
In addition the independent third party is able to apply strict rules of evidence
gathering and chain of custody of the data, and may act as an expert witness
in regard to the collection and validity of the collected data.
The third party may also provide some degree of protection against intentional
spoilage of data by either material change or destruction of files.

Retention policies

The majority of organisations will or should have an active document retention
policy. This is more correctly a document destruction policy, where the criteria
for the destruction of specific and certain classes of documents are detailed and
implemented. Electronic data is in this regard no different to hard copy paper
documents and should be subject to the same policies. It is of interest that
when an order for discovery is implemented, all deletion or destruction of
backup and archival data should cease to ensure that spoilage does not occur.
Because data is often kept in a variety of locations within the company
computer environment, there is a chance that a significant electronic document
may have escaped the destruction process.

The issue of deleted and other hidden data

When a paper document is destroyed by shredding or other secure destruction
process, then that document for practical purposes no longer exists. Digital
data is not always physically destroyed. On a desktop computer for instance,
the sending of a file to the recycle bin or the deleting of a file in an application
does not destroy the digital data. Instead it makes that data invisible to
the user. The data is often not overwritten or destroyed until some indeterminate
time in the future.
Essentially then, depending on the terms of the discovery, deleted documents
and hidden files may be recovered for potential inclusion in the discoverable
universe. The recovery of this data may in some cases point to an intentional
spoilage issue.

There are applications that will securely delete this hidden data, but in
general they require that the user actively install and use the product. The
presence of a secure deletion application is not in itself generally cause to
imply spoilage, as many organisations use secure deletion in much the same way
that they use encryption, that is to protect corporate information from those
who might exploit that information.

Encryption

In any e-discovery the problems surrounding encryption must be addressed.
This issue is not so much about encryption applied by the discovered entity,
since they will know how to decrypt their own documents, but rather in regard
to staff and employees who have applied encryption on their own account.
Once again it is not so much an issue of breaking the encryption but rather
the failure to know that encryption has been applied. Consider the case where
a full text search has been conducted on the relevant universe of data. If
encryption has been unknowingly applied to a part of that universe then the full
text search will not be able to flag those encrypted files that match the search
criteria. The result is that potentially significant files may be overlooked in the
discovery process.
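One commonly used heuristic for flagging files that a text search cannot see into is to measure how random their bytes appear. This technique is not described in the paper and is offered only as a sketch. The function names and the threshold of 7.5 bits per byte are assumptions. Compressed files score highly as well, so a high value means only that the file deserves a closer look.

```python
import math
from collections import Counter


def shannon_entropy(data):
    """Return the entropy of data in bits per byte, from 0.0 to 8.0."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_opaque(data, threshold=7.5):
    """Heuristic only: True when the bytes look close to random."""
    return shannon_entropy(data) >= threshold
```

Ordinary text typically scores well below the threshold, whereas encrypted data tends to sit close to 8 bits per byte.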

Compression

A similar situation to the encryption issue is posed by the use of data compression
utilities. Once again it is not the actual data compression that is material
but rather that the party conducting the e-discovery may not be aware that
compression was used and potential evidence may be overlooked.
A number of full text search applications are able to open and read the text
within commonly used compression schemes. It should be noted however that
the search routines native to the operating system are not generally able to
perform this function.
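The following sketch shows the idea for one common compression scheme, the ZIP archive, using Python's standard `zipfile` module. The function name `search_zip` is an assumption, and the members are treated as plain text, which would not be adequate for word processing or other binary formats.

```python
import io
import zipfile


def search_zip(archive_bytes, term):
    """Return the names of archive members whose text contains term.

    The comparison is case-insensitive. Members are decoded as UTF-8 and
    undecodable bytes are ignored, so this suits plain text members only.
    """
    needle = term.lower()
    hits = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            text = archive.read(info).decode("utf-8", errors="ignore")
            if needle in text.lower():
                hits.append(info.filename)
    return hits
```

A search of the raw archive bytes for the same term would normally find nothing, because the member contents are stored in compressed form.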

Data storage paradigms

There has been a shift over the years in the way that data is associated and
in repository design. In the earlier days of personal computers the trend was
toward storing the data with the application that created it. This was particularly
common with word processing and spreadsheet files. One of the driving
forces behind this was that the media structures were rather simple and the need
was to have the application and the data near and in many cases on the same
media.
As networks came into broader use there was a trend to store data in a
central repository where it could be managed and backed up more easily. To
afford a level of security, users were assigned usernames and passwords that
provided access to specific regions of the storage media. Often each user was
also allocated storage for their own work files in what was termed their home
directory. Users could selectively allow others access to individual files on a user
or group basis.
A final shift in the paradigm was the move toward project centric data
storage. All the data files for a particular task, function or project were stored
centrally together. In this design the application that created the data was in
many ways irrelevant, as was the specific user that created or modified the file.
A project repository could contain word processing, spreadsheet and engineering
drawing files, in fact anything that related to the project.

Discovery by generations

Looking at the specific needs related to each data storage paradigm, we can see
that each one imposes different discovery techniques. We should however be
aware that most sites utilise a mixture of paradigms either intentionally or by
default.
In Generation 1, the machine based storage, discovery is centralised around
specific computers that may have been used for associated tasks during the
discovery time frame. In this regard the discovery is based on custodian principles
and is clearly a subset of the available machines. It is a rather simple task in
this type of structure to identify the documents of interest. From a business
viewpoint the potential for disruption is contained and manageable.
The second generation, user based storage, tends to concentrate the discovery
effort on the basis of the users who would have reason and authority to create
and edit the files associated with the discovery effort. Because the majority
of the files would be held in a central file server environment, selection of
those files would be based on the media structures that the discovered enterprise
maintains.
We should not however only consider the centrally maintained data but
also include any local files located on machines used by those users who are
identified in the discovery criteria. Additionally personal assistants and the like
of those users should also be included in the potential discovery set as they often
manipulate data and files for their superiors.
The third generation, project based data, once again concentrates the data in
a central repository and additionally classifies data on the basis of the project,
task, client or supplier. In this regard the potential universe is materially
constrained to those matters that can be readily identified as part of the issue
under discovery.
Although easier to identify the subset, there is still the matter of locally
stored data as covered above that must always be addressed.

Email

Email is considered one of the most fertile sources of discoverable documents.
It is a curious thing that people tend to be more outspoken and less guarded when
communicating via email. There are various possible reasons for this. One is
the immediacy of email, where a party can just jot off an email and tell the
other party what they are thinking or not thinking. Contrast this with other
forms, such as a letter or even a fax, where the party has to compose the communication.
Another possible explanation is that email is a non-tangible form of communication
and the sender may consider that it is in some way less real. Regardless
of the motivation, email messages are often a good source of discoverable
information.
Email discovery is similar to discovery of other documents. A set of criteria
is established which may include date range, sender or recipient names, subject
heading or content. In handling email it is essential to include the content of
any attachments in the discovery process.
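The criteria described above might be applied to a message along the following lines. This is a sketch under several assumptions: the function names are invented, the message is an `email.message.EmailMessage` object, its `Date` header carries a time zone offset, and only text attachments are searched. Attachments in word processing or other binary formats would need their text extracted first.

```python
from email.utils import parsedate_to_datetime


def message_text(msg):
    """Join the text of the body and of every text attachment."""
    parts = []
    for part in msg.walk():
        # Multipart containers are skipped; only text/* parts are read.
        if part.get_content_maintype() == "text":
            parts.append(part.get_content())
    return "\n".join(parts)


def matches(msg, start, end, sender, keyword):
    """True when msg was sent within [start, end] by sender and the
    keyword appears in the subject, the body or a text attachment."""
    sent = parsedate_to_datetime(str(msg["Date"]))
    searchable = str(msg["Subject"]) + "\n" + message_text(msg)
    return (
        start <= sent <= end
        and sender.lower() in str(msg["From"]).lower()
        and keyword.lower() in searchable.lower()
    )
```

Messages read from disk can be turned into `EmailMessage` objects with `email.message_from_bytes(data, policy=email.policy.default)`. The `start` and `end` values should be time zone aware datetimes so that they can be compared with the parsed `Date` header.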

Spoilage by discovery

One of the significant issues is that spoilage may result from the very process
of a type 1 or document centric discovery. Although there are precautions that
may be taken, in general the very act of opening a document or even searching
for it will change at the very least the last access date, which may be significant
in any future e-forensic endeavor.
To overcome this limitation a number of techniques have been developed
to enable the investigator to conduct either a full-text index or search and
optionally extract those files that adhere to the discovery criteria.
Once again the method is determined by the type of system that is being
discovered. A stand-alone PC with no connection to a wider network (which
is rare in a commercial environment) is a much easier target for non-intrusive
discovery. Basically the method entails booting the machine with an alternative
operating system, usually Linux, and mounting the data storage devices
as read only so that the meta-data is not compromised.
The text index or even the discovered documents may then be stored on
self-configuring removable media such as a USB memory device.
In adoption of this approach the forensic purity must be maintained and
both chain of custody and a record of the procedure must be kept.
Where the data is potentially located on a central file server or distributed
in a peer to peer manner a different approach must be adopted. Where the
data is located on a file server, mail server or other mass storage device then the
investigator should consider the use of backup media to conduct the discovery.
This may pose a number of problems in regard to offsite storage and data
media format. Both historically and currently there are quite a number of
varying media formats and data storage schemes.
The main deciding factor is going to be the time frame of the documents
being discovered. Where the matter relates to incidents in the past few years
then the data storage technology is going to be readily available to restore the
backup media to a forensically clean machine. By contrast where the matter
relates to events that are a decade or more old then the archived data may be
unreadable by current devices.
One example of this situation is where a firm has archived reel to reel tape
backups and the original tape drive is no longer in use or available. On the
positive side there are ways to retrieve this data, but you should be aware of the
time and resource issues in such a discovery endeavor. This was the very
circumstance with the Exxon case, where the so-called smoking gun was discovered
on an old archive tape dating back to the 1980s.

User accounts and passwords

Depending on the jurisdiction, there are parties who have the power to compel
a person to divulge or bypass a password or encryption scheme. The challenge
for the discoverer is that an electronic indexing application may not be able to
'look into' a protected document for text searching. This of course translates to the
discovering party not being aware of potential evidence.

Deleted and other invisible data

One of the problems of the e-discovery process is that it is not by definition
a process that extracts deleted or hidden data from the target machine. This
process more specifically comes under the province of e-forensics.

Users using multiple machines

Within commercial environments a given user may have access to or use a number
of discrete computer work stations including both laptop and home based
systems. In circumstances such as these, where the parties specifically involved
in the discovery may be identified, it is possible to narrow the search to specific
machines and personal directories on the central servers. This does not preclude
the searching of other machines or file server areas but may provide sufficient
material to justify a fuller discovery effort.
