George Washington University, CSci175 Information Policy

Document Retention Policies, Law and Issues
Impacts and issues in the software development process

Michael Corsello
10/18/2008
Abstract
Document retention has become an area of increasing importance, marked by a dramatic increase in
regulation of the organizational policies that standardize retention practices for documents and for
content in general.

This paper describes some of the relevant regulation covering document retention and details the
impacts and issues in the specialized area of software development. The development of software
generates source code and documents that detail the design and the process by which the software is
developed, raising the question of what constitutes a document and which content must be retained
for regulatory purposes. Furthermore, the software being developed will itself be subject to content
retention regulation, which acts as a set of hidden requirements that may dramatically increase the
overall cost of developing software applications.

In the practice of software development, there is little distinction between documents and content, as
one is simply a semantically constrained subset of the other. Once the definition of document is
broadened, content becomes largely indistinguishable from a document. For this reason, this paper
treats the two concepts as interchangeable.
CSci 175 Information Policy Michael Corsello

Background
Document retention is a subset of the larger concept of content retention. Content retention is the
collection of policies and standardized practices governing the collection, storage, tracking, security,
retention and disposal of any content. Any data produced in the course of conducting business that is
pertinent to the business is content. Content retention is in turn a portion of the larger concept of
content management, which covers the disposition of content from creation to destruction.

Business case for content retention


Content retention is of critical importance to any organization because of the legal implications of
non-compliance. Beyond the legal implications, standardization of content management practices
forces an organization to address the how, what, when, where and why of all information the
organization owns. Guaranteeing compliance with regulation requires some measure of standardization
of content management, including the identification of which content to keep, for what purpose and for
how long.

Regulations on retention
In recent years the requirements imposed on organizations by the governments of the world have
dramatically increased with respect to content retention. Public failures of organizational practice,
such as the Enron scandal and the Veterans Affairs data loss, are partially responsible for the new
regulations. At the federal level, laws and standards that impact the content retention policies of
organizations include:

Clinger-Cohen (National Defense Authorization Act for Fiscal Year 1996)
Sarbanes-Oxley (Sarbanes-Oxley Act of 2002)
HIPAA (Health Insurance Portability and Accountability Act)
DoD 5015.02-STD (Electronic Records Management Software Applications Design Criteria Standard)

In addition to these regulations defining explicit requirements on content practices, there are many
other regulations that directly or indirectly require standardization of content management practices.

Retention practices
Content retention overall involves the practice of “holding” or retaining content from the time it is
captured or created to the time it is “released” or destroyed. This concept introduces two sides of the
paradigm of retention: to retain and to destroy. These will continually play against one another in this
paper.

Standard time period


Each type of content has a definable purpose for an organization over a standard period, and the use
of the retained content brings value to the organization during that period. Once the content is no
longer of positive value to the organization, it should be disposed of. The standardization and
documentation of this duration, and of the practices involving the transition of the content between
these states, is the primary goal of retention.

Fall 2008 Booz Allen Cohort 2

Disposal processes
Once a document has outlived its standard period of benefit it will be disposed of. The disposal of
content must also be standardized and documented. Disposal involves the process of discovering
expired content among the full corpus of content and the process of removing expired content from the
organization's storage repositories. The disposal process should also document the results of the
destruction to provide confidence that the content is unrecoverable once disposed of.
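
The retention period and disposal steps described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the content types, the retention periods, the `ContentItem` structure and the in-memory `repository` are all hypothetical, and the periods shown are placeholders rather than legal guidance. The sketch discovers expired content, removes it, and records a disposal log as the documented result.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical retention schedule: content type -> standard retention period.
# The periods are placeholders for illustration, not legal guidance.
RETENTION_SCHEDULE = {
    "meeting_notes": timedelta(days=365),
    "test_results": timedelta(days=180),
    "order_record": timedelta(days=7 * 365),
}


@dataclass
class ContentItem:
    item_id: str
    content_type: str
    created: date


def find_expired(items, today):
    """Return the items whose standard retention period has elapsed."""
    expired = []
    for item in items:
        period = RETENTION_SCHEDULE[item.content_type]
        if item.created + period <= today:
            expired.append(item)
    return expired


def dispose(repository, today):
    """Remove expired items from the repository and return a disposal log."""
    log = []
    # Iterate over a snapshot so that deleting from the dict is safe.
    for item in find_expired(list(repository.values()), today):
        del repository[item.item_id]
        log.append({
            "item_id": item.item_id,
            "content_type": item.content_type,
            "disposed_on": today.isoformat(),
        })
    return log


# Hypothetical repository contents.
repository = {
    "n1": ContentItem("n1", "meeting_notes", date(2006, 5, 1)),
    "t1": ContentItem("t1", "test_results", date(2008, 9, 1)),
    "o1": ContentItem("o1", "order_record", date(2005, 1, 15)),
}
disposal_log = dispose(repository, today=date(2008, 10, 18))
```

In this sketch only the meeting notes have outlived their period, so they are the only item removed and logged. A real process would also have to address backup copies and proof that the content is unrecoverable, which a dictionary deletion does not demonstrate.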

Content Retention
To appreciate the complexity of content retention the individual concepts of content and retention must
be understood.

What is a document
Prior to a discussion on content and document retention, it is critical to understand what each of these
concepts truly represents. Content can be any information of any type, structured or unstructured. This
concept of content can include something as simple as a single word. When content is placed in a
context such as an order form, that content becomes a record. A record that is stored to some
persistent media is a document. In that manner, an order submitted in a web form is a document once
saved to a database or printed out. This makes the structure of the persistence mechanism the actual
form of the saved content and therefore a critical issue to the developers of software persisting such
content.

In a software application such as an online shopping site, which will persist the orders as records in a
database, the persisted structure of the data representing the “document” will in no way resemble the
format of the “document” presented to the user. This presents a number of considerations to a
developer:

What must be retained?
How is data to be removed?
What is the beginning and ending of the document?

For the example of an order, several pieces of information comprise the document:

Customer information
Billing information
Shipping information
Order items
Metadata
o Date of transaction
o Date of shipment
o Date of arrival
o Disposition of order (returns)

In an information system, this information is not stored together as a comprehensive document, but
instead as fragments of related data that are connected via “keys”. All of these issues as to the nature
of a document must be addressed via retention policies to ensure the proper portions of data are
disposed of when appropriate.
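
To make the fragmentation concrete, the following sketch stores one order as rows in several tables linked by keys, using Python's standard `sqlite3` module. The table layout, column names and sample data are assumptions made for illustration only. The hypothetical `purge_order` function shows that disposing of one “document” means deleting from every table that holds one of its fragments.

```python
import sqlite3

# Hypothetical schema: one order "document" is spread across several tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                     transaction_date TEXT);
CREATE TABLE order_items (item_id INTEGER PRIMARY KEY, order_id INTEGER,
                          description TEXT);
CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER,
                        ship_date TEXT);
""")
conn.execute("INSERT INTO customers VALUES (1, 'A. Customer')")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, "2001-03-01"), (11, 1, "2008-09-30")])
conn.executemany("INSERT INTO order_items VALUES (?, ?, ?)",
                 [(100, 10, "widget"), (101, 10, "gadget"), (102, 11, "widget")])
conn.executemany("INSERT INTO shipments VALUES (?, ?, ?)",
                 [(1000, 10, "2001-03-03"), (1001, 11, "2008-10-02")])


def purge_order(conn, order_id):
    """Delete every fragment keyed to one order; return rows removed per table."""
    removed = {}
    with conn:  # run all deletes in a single transaction
        for table in ("order_items", "shipments", "orders"):
            cur = conn.execute(
                f"DELETE FROM {table} WHERE order_id = ?", (order_id,))
            removed[table] = cur.rowcount
    return removed


removed = purge_order(conn, 10)
remaining_orders = [row[0] for row in conn.execute("SELECT order_id FROM orders")]
```

The customer row is deliberately left in place because another order still refers to it. Whether the customer information is part of the order document, and therefore when it may be removed, is exactly the question of where a document begins and ends.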

What is retention
Retention is the entire lifecycle of content on persistent media. The concept of retention must include
the eventual destruction of the content from the media and potentially the destruction of the media
itself when no longer viable for re-use. For the media itself, retention must also cover applicable reuse
of the media once the content it contained is disposed of. For paper, this must include scenarios such as
secondary use of paper for fax machines. It is obviously critical to ensure that sensitive documents
printed on paper are not re-used for fax paper once out of date.

All of the practices regarding keeping, re-using and disposal of content and media are within the scope
of content retention. Because retention is so tightly coupled to the practices of content management,
the two areas are largely interchangeable, though management involves other practices as well.

Backups and Continuity of Operations (COOP) are not retention


There is a critical distinction between retention of content and disaster recovery. In general, any
disaster recovery or continuity practices are separate from retention practices. However, it is critical to
understand that disaster recovery content is still subject to use by enforcement agencies as a source of
data. Given this point, it is critical to include such content in the retention planning process to ensure
proper destruction of such content to prevent secondary avenues of data exploitation.

All content must be evaluated on a level playing field to ensure proper handling and disposition.
Any information can illustrate both good and bad points depending upon who has the information and
how they obtained it. Therefore, it is quite important that all information that can be disposed of be
disposed of as soon as possible to minimize the potential liability. This includes the destruction of
information on backup and COOP media in addition to all production media. Backup planning should
also forbid the use of backups as a standard mechanism for restoring content lost through user error.
That practice would count as an accepted form of content retrieval and would thereby make backups
production media as well. As part of the retention plan, backup and COOP content must be restricted
to use during disasters resulting in hardware failure.

Configuration and Content Management (CM)


The practices of configuration and content management in an organization are not specific to software
development but do have specialty areas in software development organizations. Both forms of CM
involve the management of content produced in the course of operations. The primary concern in CM
is the tendency to want to retain content. In CM practices, content is generally versioned over time to
record its history. From a retention perspective, this must be balanced with the need to purge content
as its usefulness diminishes over time.

Documentation of retention practices


All practices must be standardized and documented to provide a public and formal proof of internal
practices. The documentation must also be distributed to, and actively practiced by, members of the
organization. Without documentation, practice is much more difficult to prove or justify, and actual
compliance is much more difficult to measure.

Software Development Practices


The development of software is a complex process that is more similar to the process of invention than
that of manufacturing. The development of a software system will involve several technical
specialization areas to ensure the system built addresses a business need, meets all regulatory
requirements, is relatively free of defects, is easy to use and understand and generally functions as
specified and desired.

High-level introduction
The process of software development involves several phases during which a specific portion of the total
system becomes defined. The basic development phases include inception, elaboration, construction
and transition. Prior to starting a development project, the customer and software provider commit to
a contract. This “pre-inception” phase involves the aggregation of business processes to automate, the
scoping of the effort, identification of the key stakeholders, baselining of an anticipated timeline for
completion and a rough cost estimate.

Inception
The inception phase of the development effort involves the creation of a set of requirements that depict
the business processes to automate, all applicable regulations and policies, any performance constraints
and the general constraints on the overall construction. Once completed, this will result in several
formal documents including meeting notes, possibly audio or video of meetings, rough sketches and
business documentation from the client. All of this content is managed through the CM processes.

Elaboration
During the elaboration phase the content produced in the inception phase is analyzed to produce a
workable design for the system. The design may also include prototype code for demonstrating design
concepts. Again, several formal documents are produced and all content is managed through the CM
processes.

Construction
In the construction phase, production-quality software is constructed, tested and validated against the
documents produced in the earlier phases to ensure compliance with the stated requirements. Again,
formal documents are produced, as well as the source code for the system itself. The CM processes
are used to manage all of this content. Technically, at the end of this phase, the construction is
complete and all deliverables are provided to the client. Therefore, it could be argued that there is
minimal value in the retention of any content produced under this contract at this time.

Transition
The transition phase involves the integration of the new software into the client business and the
continual maintenance and upgrading of the software over time. If the same company has the
maintenance contract, the entire body of content may be useful to evolve the software. The
management of all content over time during the ongoing transition phase is still performed under the
CM processes.

The general theme of CM in the software development lifecycle (SDLC) is to retain content, including all
revisions to all source code, throughout the life of the project and beyond. In many cases, content is
applicable to multiple contracts and as such is desirable as a source to expedite content creation.
Unfortunately, the contracts themselves often do not discuss the legality of content reuse at all, and
unrealistic time expectations alone drive the reuse of such content. The balance between the value of
reuse and the liability of perpetual or non-standardized retention is not generally recognized.

Process documents
Over the course of the SDLC there are several documents produced to support the construction of the
software system.

Business processes
Generally business process documentation is produced by the client and delivered as-is to the software
development team. These processes are entered into the CM library (CML) for retention as
documentation of the processes to be automated. This serves as accountability for the development
team back to the client to ensure compliance of software. These process documents are generally
accompanied in bulk by a signed inventory sheet depicting the versions and delivery of these documents
to the development team.

If any changes occur to the formal processes followed by the client during development, any resulting
changes to the software being developed can be “at cost” to the client by referencing changes to these
documents.

Meeting notes
Notes are stored in the CML for each meeting throughout the development process. Audio or video is
often captured for meetings such as requirements elicitation meetings. The size of audio and video
content is an issue for CML storage, but when captured it is also stored.

Requirements
The requirements process generates several documents. The primary requirements documents are
the Software Requirement Specification (SRS) and the Requirements Traceability Matrix (RTM). These
two documents form the basis for all work performed during the SDLC and are the most critical to
retain.

Design documents
Based upon the content of the SRS, the software design depicts the expected structure and function
of the application to build. The design consists of one or more documents collectively known as the
Software Design Document (SDD). The SDD is mapped to the SRS in the RTM, where each design
artifact in the SDD is mapped to the requirements in the SRS that the artifact will partially or
completely realize.
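
The mapping held in an RTM can be sketched as a simple lookup structure. The requirement and artifact identifiers below are hypothetical, and a real RTM is typically a maintained document or tool rather than a Python dictionary. The sketch shows the two questions the mapping answers: which requirements no design artifact realizes, and which artifacts realize a given requirement.

```python
# Hypothetical requirement identifiers from an SRS.
srs_requirements = {"REQ-1", "REQ-2", "REQ-3"}

# Hypothetical RTM: design artifact in the SDD -> SRS requirements it realizes.
rtm = {
    "SDD-login-module": {"REQ-1"},
    "SDD-order-schema": {"REQ-1", "REQ-2"},
}


def unrealized_requirements(requirements, rtm):
    """Return requirements that no design artifact is mapped to."""
    covered = set().union(*rtm.values()) if rtm else set()
    return requirements - covered


def artifacts_for(requirement, rtm):
    """Return the design artifacts that partially or completely realize a requirement."""
    return sorted(a for a, reqs in rtm.items() if requirement in reqs)


missing = unrealized_requirements(srs_requirements, rtm)
login_artifacts = artifacts_for("REQ-1", rtm)
```

Because every artifact is traceable to requirements and back, the same structure could in principle identify which design content becomes disposable when a requirement is retired.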

Design as a process takes a significant amount of time and is argued to be of little practical value in
an “Agile” development methodology. Skipping detailed design, however, risks reducing the
accountability and tracking of requirements unless the design decisions are otherwise documented.

Testing documents
Each portion of the software application must be tested to ensure it works to design specification and to
requirement. The testing process, the tests performed and the results of each round of testing are
documented and stored in the CML. Once all tests pass and the system is delivered, the results of the
incremental tests leading up to a passing score are of little business value.

Configuration and Content Management (CM)


The overall process of managing content produced in the SDLC is called configuration management.
Configuration content is any portion of content that results in a specific configuration, where a
configuration is the manner in which an operationally deployed application is structured, configured
and works. Configuration management in this sense is a specialized subset of content management
(also CM) concerned with the software and system development cycles.

The library
All configuration content is stored in a repository known collectively as the configuration management
library or CML. The CML includes all content across the entire lifetime of the project. The CML is
responsible for the maintenance of proper naming standards (and their enforcement), versioning of
content and accountability for access and dissemination of content. The only official source of content
in a development project is the CML.

Responsible parties
The configuration manager and their team manage the CML. A client representative generally will have
visibility into the content within the CML. The Information Assurance Officer (IAO) will also have
visibility into the CML and oversight to ensure the management of the CML follows the defined content
management policies. Finally, all contributing personnel are responsible for submitting content to the
CM team for inclusion in the CML.

Source code retention


One key type of content for any software development project is the source code for the application
being developed. The source code is the actual content that is or becomes the application itself. There
is continual modification and refactoring of source code over the lifetime of the project. The
management of the source code takes place in a special repository called a source code management
system (SCM).

Source code concepts


Source code is represented in text files that contain text written in a computer programming language.
For many languages this code is then compiled into binary executables such as an “exe” or “dll” file.
Other languages such as HTML (hypertext markup language) are used directly and not compiled.

There are several concerns with the management of such files:

What is managed, source or compiled libraries?
How are changes tracked over time?
What is a change?
Are daily variations tracked?
Is the SCM part of the CML?

In general, the source is the only thing that is content managed over time. However, compiled files are
tracked via the testing process to ensure only tested files make it to use by other developers and to
production. Changes are tracked in the SCM automatically by “deltas”, that is, by saving what has
changed with each edit. These deltas must still be managed to determine which changes are significant
and make it through the testing process. A single edit to a source code file is not significant enough to
track on its own. Instead, changes are defined more by progress over time than by individual edits.
Likewise, daily check-ins of code do not represent changes in this sense, but instead provide a means
of sharing code between developers to aid productivity.
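
The idea of a delta can be shown with Python's standard `difflib` module. This is an illustration only: actual SCM products store deltas in their own internal formats, and the two revisions below are invented sample content. The sketch computes what changed between two revisions of a file rather than keeping two full copies.

```python
import difflib

# Two hypothetical revisions of the same source file, as lists of lines.
revision_1 = [
    "def total(items):",
    "    return sum(items)",
]
revision_2 = [
    "def total(items):",
    "    # ignore missing prices",
    "    return sum(i for i in items if i is not None)",
]

# The delta records only what changed between the two revisions.
delta = list(difflib.unified_diff(
    revision_1, revision_2, fromfile="rev1", tofile="rev2", lineterm=""))

# Separate added and removed lines, skipping the "+++" and "---" file headers.
added = [line[1:] for line in delta
         if line.startswith("+") and not line.startswith("+++")]
removed = [line[1:] for line in delta
           if line.startswith("-") and not line.startswith("---")]
```

Every delta of this kind is itself retained content, which is why the full edit history of a file accumulates in the SCM unless a disposal practice removes it.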

Overall, due to the nature of the SCM it should NOT be part of the CML, but should instead be governed
by the CM team to ensure proper management of the source in the SCM. The only source code that
should be tracked in the CML is baselines, or releases that have meaning to the schedule or otherwise
to the client. These should be stored outside of the SCM to keep them distinct from the code in the
SCM. In practice, this is rarely done and the SCM is considered a key part of the CML. This is largely
because of how an SCM works.

Versions and Baselines


As changes are made to source code, each action of “checking in” code results in a new “change set” or
revision of the files checked in. These change sets are revisions in time and are selected by time. To
view the entire application at once (often many thousands of individual files), a time is selected from the
calendar to represent the view of the system to acquire. This view shows the state of all files in the
system at that point in time. It is used in practice as a means of “rolling back” when a change is made
that is later found to be less than beneficial.

As development proceeds, a label may be placed in the SCM on the current state of all files in the SCM
at that point in time. This label indicates a version for the code base and is often a release milestone.
Given the SCM has this power “built-in” it is often simply adopted as the de facto means of managing
source code content.
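
Change sets, point-in-time views and labels can be sketched as follows. This is a simplified model under stated assumptions: the file names, timestamps and the label name are hypothetical, file contents are reduced to version strings, and file deletions are not modeled. It is not how any particular SCM product is implemented.

```python
from datetime import datetime

# Each check-in is a change set: (timestamp, {path: new content}).
# All names and dates here are hypothetical.
change_sets = [
    (datetime(2008, 9, 1, 9, 0), {"app.py": "v1", "util.py": "v1"}),
    (datetime(2008, 9, 8, 14, 30), {"app.py": "v2"}),
    (datetime(2008, 9, 20, 11, 15), {"util.py": "v2", "report.py": "v1"}),
]


def view_at(change_sets, when):
    """Return the state of all files as of `when` by replaying change sets in order."""
    state = {}
    for timestamp, files in sorted(change_sets, key=lambda cs: cs[0]):
        if timestamp > when:
            break
        state.update(files)
    return state


# A label names a point in time, for example a release milestone.
labels = {"release-1.0": datetime(2008, 9, 10)}

baseline = view_at(change_sets, labels["release-1.0"])
latest = view_at(change_sets, datetime(2008, 10, 1))
```

Selecting the labeled time yields the release baseline, while an earlier time yields a view suitable for rolling back. Note that the view depends on every change set before it being retained, which is part of why the SCM history tends never to be purged.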

Retention or disposal
Since source code is the application and it evolves over time based upon changes made to the source
there is a high value placed upon the code itself. The source code in the SCM is considered to be the
primary source of value in all development projects and may often be reused in part or in whole across
projects. While there are issues of intellectual property rights at stake with source code, the time
demands for completing a development project often outweigh any considerations for replicating effort
for similar work product.

Due to this high perceived value and the low cost of storing this content, it is rarely ever disposed
of until it is entirely out of date. This often results in the retention of source code, including all edits
ever made during the development process, for years beyond the conclusion of a project.

Since the cost of developing software is so high and the demands upon development teams are
generally quite unrealistic, many sloppy processes exist and are poorly followed. The proper disposal
of software development content, including source code, is of high importance and is rarely done
properly. There is a tremendous opportunity for investigating how the software was actually developed
in response to a disappointed client, as nearly all projects leave circumstantial evidence around to be
retrieved many years beyond its practical usefulness.

The entire process of software development has never been required to address the issue of content
disposal practices, as IT professionals are primarily concerned with retaining information. Overall, a
guidance package is required to illustrate the liabilities of not actively planning, scheduling and following
a standard process for content retention and disposal.

New software applications are created to solve specific practical problems in business. These solutions
generally are not planned based upon legal or liability implications of how the applications are used. It
will become increasingly important to ensure that software applications are developed based upon a
dynamic set of uses that can be modified to adapt to unplanned purposes.

Software Implementation Requirements


Software applications meet a set of requirements defined when planning the application. Any unknown
or unanticipated requirements at that time are likely to be unsupported by the application when
completed. For commercial applications there are no clients directing the requirements. Instead, all
requirements are anticipated needs for expected or current clients using other products.

For emerging regulatory requirements, few applications can currently support those demands. That
results in a requirement to modify existing applications to support the requirements or to fulfill those
requirements outside of the system (often manually). Defining what content in an application must be
regulated and purged is a challenge when the client will often not understand the legal implications of
the application data storage.

A major area of development in software applications is the use of data for new purposes such as data
mining and analysis. This will have growing legal implications as advanced analytics become easier to
produce. As applications increasingly centralize data into consolidated databases, these databases
may implicitly violate regulation through data aggregation, because regulators poorly understand the
technology and technology implementers poorly understand the regulation. The centralization of data
and security is a major force driving application architecture and enabling enterprise analytics. This
convergence maintains high levels of performance given increased data volumes, at the cost of
separation of concerns.

Politically and socially the free sharing of information is at the forefront of progress in the area of
information technology. However, the issues of security, privacy and piracy are most likely of higher
importance. An increasing number of sophisticated attackers are attempting to compromise
systems and information for profit. Regulation to protect this information and privacy must be in step
with technology to ensure both can be realistically implemented and enforced given the workforce
and tools available. Regulation that is too difficult, technical or costly to implement will not be
implemented, and additional skilled workers will not become available, as education systems are
already being streamlined to increase the rate of production.

Software is arguably the most complex undertaking of mankind, with technology being implemented in
dependent layers. Each layer of technology relies on the one below it, and each lower layer is older
than the one above it. Older technologies tend to place less emphasis on security and multi-user
synchronization. Therefore, we should have no expectation of “fixing” our problems any time soon
without unrealistic costs. Over time the best solution will be to replace technologies so that the
required capabilities are implemented before they become regulatory requirements.

Summary and Conclusions


Content retention is a complex topic that has impacts on all aspects of the software development
process. The actual practice of developing software is impacted by the content created during the SDLC
and by retention of that content after completion of the efforts. The applications produced are also
impacted by content retention issues in a much more significant way than is currently addressed, in that
application developers will be required to construct applications that enforce compliance with retention
policies and regulations.

Document retention legislation has a significant impact on the software industry and the personnel
responsible for the construction of applications overall. The skills of developers in the industry are
already stretched with much higher demand than supply of skilled workers. Clients do not understand
the implications of implementing compliant systems and allowances for time costs will likely not be
acceptable. The practical reality of the need for document retention practices will remain
overshadowed by the practical costs of doing so for some time.

Information sharing is of increasing importance to the businesses using information technologies. The
developing “need to share” culture will further expand the issues of content retention and the
technological implementations needed to address the social implications of this content sharing. Overall,
technology is focused on opening up information and capabilities for widespread use, while little
attention is paid to illegitimate or illegal use of this information.

In summary, the concepts of content management including retention and disposal are in need of
immediate attention from technologists and policy makers alike. The emerging trends around sharing,
privacy, security and discovery must be addressed to ensure a sustainable approach is defined and
followed by technology implementers and users alike.
