
Name: Crystal Stephenson

Assignment 2: Report on Web Archiving Practices


Course: LIS6515 Web Archiving
Due: November 1, 2018
“The End of Term Archive Project”
The End of Term Web Archive was founded in 2008 as a collaborative project designed to

archive U.S. government websites at the end of President George W. Bush’s administration. The

stated goal of this project was “to execute a comprehensive harvest of the Federal Government

domains (.gov, .mil, .org, etc.) in the final months of the Bush administration, and to document

the changes in the federal government websites as agencies transition to the Obama

administration.” (End of Term Web Archive, 2018) Initiated by the Library of Congress (LOC),

project partners included the California Digital Library (CDL), George Washington University

(GWU), Internet Archive (IA), the University of North Texas Libraries (UNT), Stanford

University Libraries (SUL), and the U.S. Government Printing Office. The partners joined forces

with members of the International Internet Preservation Consortium (IIPC) and the National

Digital Infrastructure and Preservation Program (NDIIPP). While IA hosts the public access copy

of the archive, the LOC holds the preservation copy, and the UNT holds an additional copy for

data analysis. Later, project participants resumed their efforts to document changes in the

government web during the transition to President Barack Obama’s second term in 2012, and

again at the end of his presidency in 2016.


Following a thematic model of selection criteria, the EOT archive focused on federal

websites in the legislative, executive, and judicial branches of government with deference to

subject (i.e. end of presidential term), creator (i.e. government agencies), genre (i.e. government

records or policy issues), and domain (i.e. .gov). The objective of the EOT’s web harvest is to

preserve and “document the federal government’s presence on the web during the transition of

Presidential administrations and to enhance the existing collections of the partner institutions.”

1 STEPHENSON
(Ashenfelder, 2016) The primary challenge was to identify and select U.S. government content to

archive. While the EOT partners work from some known lists of government URLs, the lists are

not entirely comprehensive. Thus, in addition to archiving all .gov URLs they’ve previously

identified, “anything nominated by volunteers will become ‘priority’ URLs that get a bit more

attention during the archiving process.” (Manus, 2012) Volunteers can nominate any number of

websites for consideration, but a few topic areas of focus requested by the archive include

judicial branch websites, important content or subdomains on very large websites (e.g.,

NASA.gov) that might be related to presidential policy, and government content on

non-government domains (e.g., .com or .edu). (Grotke & Hartman, 2012) According to the EOT

archive, in order to “identify, prioritize and describe the thousands of U.S. Government web

hosts, the University of North Texas built the Nomination Tool” (End of Term Web Archive,

2018), which “enables collaborative collection development for web archiving, and has since

been used in other archiving efforts.”


Beginning in August of 2008, the archive team performed a broad, comprehensive crawl

of sites at varying frequencies, crawling areas of the government web of particular interest to

their organizations. Information specialists, including librarians and political researchers, were

employed to assist with the selection and prioritization of selected websites to be focused on in

the crawl, which “included sites identified as being potentially greater at risk of rapid change or

disappearance.” (End of Term Web Archive, 2018) The prioritized URLs were initially collected

in December of 2008, and again after the inauguration in January of 2009. A final broad,

comprehensive crawl was performed in the spring and fall of 2009 to document any changes that

had occurred. Each project partner participated in transferring the content they had collected to

form a single consolidated archive. All of the partner institutions collected content using the

Heritrix web crawler, which was developed by the Internet Archive with support from the IIPC.

The IA also “reconfigured existing in-house tools to automatically generate metadata records for

the over 6,000 websites in the End of Term Web Archive.” Highlighting the collaboration of

efforts, the CDL provided input on the Dublin Core (DC) format (e.g., title, date, and description),

while IA generated the records and thumbnail images one finds when browsing the archive.
Strategically, the Library of Congress set up a central transfer server for receiving

NDIIPP content. The “server was for ‘pull’ transfers by the Library from other institutions”

(Lazorchak, 2011). An “rsync server” was also set up at this time and made web archive content

available for other institutions, like the University of Maryland, to “pull” from the Library. The

flow of content in both directions was successful, while also establishing “transfer history with

the Internet Archive, which had the bulk of the ‘End of Term’ content.” In short, the workflow

for transfers was initiated by the Library, which pulled content from a partner to its transfer

server, verifying the files received for quality assurance (i.e. complete files and free of

corruption), followed by copying the content to a long-term storage server and the rsync server,

whereby partner institutions could transfer and “pull” for themselves. As each partner pulled the

content from the rsync server, the Library deleted it to make room for the next transfer.

(Lazorchak, 2011)
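
The verification step in this workflow can be illustrated with a short sketch. The source does not describe how the Library actually checked received files, so the following is only an assumption: it supposes that each transfer arrives with a manifest listing a SHA-256 checksum and a relative path per line. The function names `sha256_of` and `verify_transfer` are hypothetical and are not part of any tool named in the report.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archive files are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(root: Path, manifest_text: str) -> list[str]:
    """Return relative paths that are missing or whose checksum differs.

    Assumed manifest format (one entry per line): "<sha256-hex> <relative path>".
    """
    problems = []
    for line in manifest_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        expected, rel_path = line.split(None, 1)
        rel_path = rel_path.strip()
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected.lower():
            problems.append(rel_path)
    return problems
```

Under these assumptions, an empty list returned by `verify_transfer` would mean every file arrived complete and uncorrupted, at which point the content could be copied on to long-term storage.
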
The transfer of web content began in May of 2009, and the process was completed by

mid-2010. All transfers were performed over the network using Internet2. To address the

challenges presented by transferring and aggregating the extensive EOT content, all “partners

organized their content in ‘bags’… based on ‘BagIt (PDF),’ a specification for packing digital

content for transfer and storage, developed collaboratively by NDIIPP partners.” (Lazorchak,

2011) In practice, a complete bag is one that is holding all its content, whereas a “holey” bag,

alternatively, is empty of its contents but contains a crucial file, “fetch.txt,” that includes the

URL location of every file that makes up the complete bag. Thus, a “complete bag is the target of

transfer and the holey bag is the means of transferring it.” Essentially, “a source institution

provides pairs of complete and holey bags on its server.” The transferring partner will download

the holey bag first, then use its fetch.txt file to retrieve the complete bag’s files from the source

server and “fill” the holey bag at its own end. Bags are not strictly a tool of transfer; they also

serve to hold content on storage and access servers. By manipulating holey bags for transfer,

institutions can organize content on their own servers without having to re-organize or resize it

for transfer. Based on this principle, the LOC developed an open source tool called BagIt

Library, which aided in the transfer of content and manipulation of bags. They subsequently

shared the tool with all participating partners to bag their own content and transfer other partners’

content from the Library. They also released a desktop version for users called Bagger.

Installation is said to be minimal, and the tool proved successful and easy to use. According to

Laura Graham, a Digital Media Project Coordinator at the Library of Congress, “When we began

this project, we were much less daunted by total content size than by the scheduling and tracking

issues” (Lazorchak, 2011), but their “simple, straightforward methods served those

requirements,” and “were greatly aided by the organization of all content into bags and the use of

a common set of tools.”
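
The role of fetch.txt in a holey bag can be made concrete with a small sketch. It relies on the BagIt convention that each line of fetch.txt holds a URL, a length in bytes (or "-" when unknown), and a file path inside the bag, separated by whitespace. The code below is not the BagIt Library or Bagger; `FetchEntry`, `parse_fetch_txt`, and `missing_files` are hypothetical names chosen for illustration, and the sketch only reports what still needs to be downloaded rather than performing any network transfer.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class FetchEntry(NamedTuple):
    url: str                # where the file can be retrieved from
    length: Optional[int]   # size in bytes, or None when given as "-"
    path: str               # location inside the bag, e.g. data/x.warc.gz


def parse_fetch_txt(text: str) -> list[FetchEntry]:
    """Parse the contents of a fetch.txt file into entries."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        url, length, path = line.split(None, 2)
        size = None if length == "-" else int(length)
        entries.append(FetchEntry(url, size, path.strip()))
    return entries


def missing_files(bag_root: Path, entries: list[FetchEntry]) -> list[FetchEntry]:
    """List the entries whose files are not yet present in the bag."""
    return [entry for entry in entries if not (bag_root / entry.path).exists()]
```

In this sketch, a holey bag is one for which `missing_files` returns every entry, and the bag is "filled" once that list is empty.
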


While the legalities of copyright or ethical concerns are not explicitly stated on the

project’s website, under the Freedom of Information Act (FOIA), the public has the right to

access unprivileged information and government documents from any federal agency. The

Congress, President, and Supreme Court have all recognized the FOIA as a vital part of our

democracy, and as such, it is reasonable to deduce that all content archived by the partnering

institutions for the EOT project is legally sound. Furthermore, the involvement of the Library of

Congress and U.S. Government Printing Office would indicate that copyright infringement or

breach of ethics would not be an issue of concern for the project’s archiving practices.

The selection, acquisition and archiving were deemed a major accomplishment for all

partnering institutions when the initial End of Term Web Archive was officially completed in

2010. Team members later resumed the effort in 2012 and again in 2016, and when

reviewing the EOT archive’s website today, you can browse by presidential timeframes, 2008-

2009, 2012-2013, and 2016-2017. In addition to the impressive scale of the collection and the

influential scope of collaboration behind this project, new developments in web harvesting and

access technologies were bred from this monumental effort to preserve at-risk government digital

content during periods of presidential transitioning for cultural and historical importance. Any

challenges posed during the process have certainly “been met with innovation” (End of Term Web

Archive, 2018) along the way and “resulted in considerably more than the archive alone.”

References
Ashenfelder, M. (2016, August 31). Nominations Sought for the U.S. Federal Government End
of Term Web Archive. Library of Congress. Retrieved from
https://blogs.loc.gov/thesignal/2016/08/nominations-sought-for-the-u-s-federal-government-end-of-term-web-archive/
End of Term Web Archive. (2018). Project Background. Retrieved from
http://eotarchive.cdlib.org/background.html
Grotke, A., & Hartman, C. (2012, July 9). 2012 End of Term Web Archive: Call for
Volunteers. Free Government Information. Retrieved from
https://freegovinfo.info/node/3739
Lazorchak, B. (2011, July 26). The ‘End of Term’ Was Only the Beginning. Library of
Congress. Retrieved from
https://blogs.loc.gov/thesignal/2011/07/the-end-of-term-was-only-the-beginning/
Manus, S. (2012, August 17). Collaborating to Identify Government or Election-Related
Websites to Preserve. Library of Congress. Retrieved from
https://blogs.loc.gov/thesignal/2012/08/collaborating-to-identify-government-or-election-related-websites-to-preserve/
