The End of Term (EOT) Web Archive project was formed to archive U.S. government websites at the end of President George W. Bush’s administration. The
stated goal of this project was “to execute a comprehensive harvest of the Federal Government
domains (.gov, .mil, .org, etc.) in the final months of the Bush administration, and to document
the changes in the federal government websites as agencies transition to the Obama
administration.” (End of Term Web Archive, 2018) Initiated by the Library of Congress (LOC),
project partners included the California Digital Library (CDL), George Washington University
(GWU), Internet Archive (IA), the University of North Texas Libraries (UNT), Stanford
University Libraries (SUL), and the U.S. Government Printing Office. The partners joined forces
with members of the International Internet Preservation Consortium (IIPC) and the National
Digital Infrastructure and Preservation Program (NDIIPP). While IA hosts the public access copy
of the archive, the LOC holds the preservation copy, and the UNT holds an additional copy for
data analysis. Later, project participants resumed their efforts to document changes in the government web during the transitions to President Barack Obama’s second term in 2012 and again in 2016. The archive includes websites in the legislative, executive, and judicial branches of government, selected with attention to subject (i.e. end of presidential term), creator (i.e. government agencies), genre (i.e. government records or policy issues), and domain (i.e. .gov). The objective of the EOT’s web harvest is to
preserve and “document the federal government’s presence on the web during the transition of
Presidential administrations and to enhance the existing collections of the partner institutions.”
1 STEPHENSON
(Ashenfelder, 2016) The primary challenge was to identify and select U.S. government content to
archive. While the EOT partners work from some known lists of government URLs, the lists are
not entirely comprehensive. Thus, in addition to archiving all .gov URLs they’ve previously
identified, “anything nominated by volunteers will become ‘priority’ URLs that get a bit more
attention during the archiving process.” (Manus, 2012) Volunteers can nominate any number of websites for consideration, but the archive requested particular focus on a few topic areas: judicial branch websites; important content or subdomains on very large websites (e.g., NASA.gov) that might be related to presidential policy; and government content on non-government domains (e.g., .com or .edu). (Grotke & Hartman, 2012) According to the EOT
archive, in order to “identify, prioritize and describe the thousands of U.S. Government web
hosts, the University of North Texas built the Nomination Tool” (End of Term Web Archive,
2018), which “enables collaborative collection development for web archiving.” Partner institutions crawled sets of sites at varying frequencies, covering areas of the government web of particular interest to
their organizations. Information specialists, including librarians and political researchers, were
employed to assist with selecting and prioritizing the websites targeted by the crawl, which “included sites identified as being potentially greater at risk of rapid change or
disappearance.” (End of Term Web Archive, 2018) The prioritized URLs were initially collected
in December of 2008, and again after the inauguration in January of 2009. A final broad,
comprehensive crawl was performed in Spring and Fall of 2009 to document any changes that
had occurred. Each project partner participated in transferring the content they had collected to
form a single consolidated archive. All of the partner institutions collected content using the
Heritrix web crawler, which was developed by the Internet Archive with support from the IIPC.
The IA also “reconfigured existing in-house tools to automatically generate metadata records for
the over 6,000 websites in the End of Term Web Archive.” Highlighting the collaborative nature of the effort, the CDL provided input on the Dublin Core (DC) format (i.e. title, date, and description), while IA generated the records and thumbnail images one finds when browsing the archive.
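A minimal Dublin Core record of the kind described above can be sketched in a few lines. The three fields (title, date, description) follow the text; the element layout and sample values are illustrative assumptions, not the Internet Archive’s actual tooling.

```python
import xml.etree.ElementTree as ET

# The standard Dublin Core element namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def make_dc_record(title: str, date: str, description: str) -> str:
    """Build a minimal Dublin Core record with the three fields the
    project description mentions: title, date, and description."""
    record = ET.Element("record")
    for field, value in (
        ("title", title),
        ("date", date),
        ("description", description),
    ):
        el = ET.SubElement(record, f"{{{DC_NS}}}{field}")
        el.text = value
    return ET.tostring(record, encoding="unicode")

# Hypothetical example entry for an archived site.
xml = make_dc_record(
    "NASA",
    "2008-12-01",
    "Archived copy of nasa.gov captured for the End of Term Web Archive.",
)
```

A record like this is what a thumbnail-and-description browse interface can be generated from.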
Strategically, the Library of Congress set up a central transfer server for receiving
NDIIPP content. The “server was for ‘pull’ transfers by the Library from other institutions”
(Lazorchak, 2011). An “rsync server” was also set up at this time and made web archive content
available for other institutions, like the University of Maryland, to “pull” from the Library. The
flow of content in both directions was successful, while also establishing “transfer history with
the Internet Archive, which had the bulk of the ‘End of Term’ content.” In short, the Library initiated each transfer by pulling content from a partner to its transfer server and verifying the files received for quality assurance (i.e. that files were complete and free of corruption); it then copied the content to a long-term storage server and to the rsync server, from which partner institutions could “pull” the content for themselves. As each partner pulled content from the rsync server, the Library deleted it to make room for the next transfer.
(Lazorchak, 2011)
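The pull-verify-copy workflow just described can be sketched in Python. This is a hypothetical illustration only: the directory layout, the manifest of expected MD5 checksums, and the rsync invocation are assumptions for the sketch, not the Library’s actual scripts.

```python
import hashlib
import shutil
import subprocess
from pathlib import Path

def pull_from_partner(partner_url: str, transfer_dir: Path) -> None:
    """Pull content from a partner's server to the transfer server.
    The rsync source path is hypothetical; each partner exposed its own."""
    subprocess.run(["rsync", "-av", partner_url, str(transfer_dir)], check=True)

def verify_files(transfer_dir: Path, manifest: dict[str, str]) -> bool:
    """Quality assurance: confirm every expected file arrived and is
    uncorrupted by comparing checksums against a manifest
    (filename -> expected MD5 digest)."""
    for name, expected in manifest.items():
        path = transfer_dir / name
        if not path.is_file():
            return False  # incomplete transfer
        if hashlib.md5(path.read_bytes()).hexdigest() != expected:
            return False  # corrupted file
    return True

def stage_for_partners(transfer_dir: Path, storage_dir: Path, rsync_dir: Path) -> None:
    """Copy verified content to long-term storage and to the rsync
    server, from which partners can pull it for themselves."""
    for path in transfer_dir.iterdir():
        if path.is_file():
            shutil.copy2(path, storage_dir / path.name)
            shutil.copy2(path, rsync_dir / path.name)
```

Once every partner has pulled a batch from the rsync directory, it can be deleted to free space for the next transfer, as the text describes.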
The transfer of web content began in May of 2009, and the process was completed by
mid-2010. All transfers were performed over the network using Internet2. To address the
challenges presented by transferring and aggregating the extensive EOT content, all “partners
organized their content in ‘bags’… based on ‘BagIt,’ a specification for packing digital
content for transfer and storage, developed collaboratively by NDIIPP partners.” (Lazorchak,
2011) In practice, a complete bag is one that holds all of its content, whereas a “holey” bag,
alternatively, is empty of its contents but contains a crucial file, “fetch.txt,” that includes the
URL location of every file that makes up the complete bag. Thus, a “complete bag is the target of
transfer and the holey bag is the means of transferring it.” Essentially, “a source institution
provides pairs of complete and holey bags on its server.” The transferring partner will download
the holey bag first, then use its fetch.txt file to retrieve each file of the complete bag from the source
server, then “fill” the holey bag at its end. Bags are not strictly a tool of transfer, but they also
serve to hold content on storage and access servers. By manipulating holey bags for transfer, institutions can keep content organized on their own servers without having to re-organize or resize it. Based on this principle, the LOC developed an open-source tool called BagIt
Library, which aided in the transfer of content and manipulation of bags. They subsequently
shared the tool with all participating partners to bag their own content and transfer other partners’
content from the Library. They also released a desktop version for users called Bagger.
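The bag structure and the holey-bag “fetch” mechanism described above can be sketched in Python. This is an illustrative sketch of the specification’s core pieces, not the LOC’s BagIt Library or Bagger: the payload names are invented, and the file formats follow the BagIt specification (a bag declaration, a payload directory, a checksum manifest, and a fetch.txt of “URL LENGTH FILENAME” lines).

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen

def make_bag(bag_dir: Path, payload: dict[str, bytes]) -> None:
    """Write a minimal complete bag: a bag declaration (bagit.txt),
    a data/ payload directory, and an MD5 manifest."""
    (bag_dir / "data").mkdir(parents=True, exist_ok=True)
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"
    )
    manifest_lines = []
    for name, content in payload.items():
        (bag_dir / "data" / name).write_bytes(content)
        digest = hashlib.md5(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "manifest-md5.txt").write_text("\n".join(manifest_lines) + "\n")

def fill_holey_bag(bag_dir: Path) -> None:
    """Complete a holey bag: each fetch.txt line gives the URL, length,
    and target path of a missing payload file; download each into place."""
    for line in (bag_dir / "fetch.txt").read_text().splitlines():
        url, _length, filename = line.split(maxsplit=2)
        target = bag_dir / filename
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(urlopen(url).read())  # pull from the source server
```

The holey bag is cheap to transfer first because it carries only tag files; the payload itself is then pulled file by file from the source server and verified against the manifest.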
Installation is said to be minimal, and the tool proved successful and easy to use. According to
Laura Graham, a Digital Media Project Coordinator at the Library of Congress, “When we began
this project, we were much less daunted by total content size than by the scheduling and tracking
issues” (Lazorchak, 2011), but their “simple, straightforward methods served those
requirements,” and “were greatly aided by the organization of all content into bags and the use of the BagIt Library.”
According to the project’s website, under the Freedom of Information Act (FOIA), the public has the right to
access unprivileged information and government documents from any federal agency. The
Congress, President, and Supreme Court have all recognized the FOIA as a vital part of our
democracy, and as such, it is reasonable to deduce that all content archived by the partnering
institutions for the EOT project is legally sound. Furthermore, the involvement of the Library of
Congress and U.S. Government Printing Office would indicate that copyright infringement or
breach of ethics would not be an issue of concern for the project’s archiving practices.
The selection, acquisition, and archiving of content were deemed a major accomplishment for all
partnering institutions when the initial End of Term Web Archive was officially completed in
2010. Team members later reconvened for both of President Obama’s terms, and when reviewing the EOT archive’s website today, one can browse by presidential timeframe: 2008-2009, 2012-2013, and 2016-2017. In addition to the impressive scale of the collection and the influential scope of collaboration behind this project, new developments in web harvesting and access technologies emerged from this monumental effort to preserve culturally and historically important, at-risk government digital content during periods of presidential transition. Any challenges posed during the process have certainly “been met with innovation” (End of Term Web Archive, 2018) along the way and “resulted in considerably more than the archive alone.”
References
Ashenfelder, M. (2016, August 31). Nominations Sought for the U.S. Federal Government End
of Term Web Archive. Library of Congress. Retrieved from
https://blogs.loc.gov/thesignal/2016/08/nominations-sought-for-the-u-s-federal-
government-end-of-term-web-archive/
End of Term Web Archive. (2018). Project Background. Retrieved from
http://eotarchive.cdlib.org/background.html
Grotke, A., & Hartman, C. (2012, July 9). 2012 End of Term Web Archive: Call for
Volunteers. Free Government Information. Retrieved from
https://freegovinfo.info/node/3739
Lazorchak, B. (2011, July 26). The ‘End of Term’ Was Only the Beginning. Library of
Congress. Retrieved from https://blogs.loc.gov/thesignal/2011/07/the-end-of-term-was-
only-the-beginning/
Manus, S. (2012, August 17). Collaborating to Identify Government or Election-Related
Websites to Preserve. Library of Congress. Retrieved from
https://blogs.loc.gov/thesignal/2012/08/collaborating-to-identify-government-or-election-
related-websites-to-preserve/