Preserving Government Websites with ‘End of Term Presidential Harvest’
By: Valerie Nye
In the normal transfer of power from one presidential administration to another, U.S. government websites change. Historically, sites are altered or disappear altogether during this transition, and no agency or department is legally responsible for preserving them.
To avoid the loss of historic information and internet content, the End of Term Presidential Harvest has become a regular activity undertaken by librarians and archivists across the country. The goal of the harvest is to preserve government websites that were created during the outgoing administration’s term in office. Harvests of this nature took place in 2008, 2012, and again in 2016. The harvests target as many federal .gov and .mil domains as possible. Federal content on .edu and .com websites, along with social media content created by the administration, is also preserved.
The End of Term Presidential Harvest 2016 is a collaborative project undertaken by the California Digital Library, Internet Archive, Library of Congress, University of North Texas Libraries, and U.S. Government Publishing Office. The following is an interview with Mark Phillips, the associate dean for Digital Libraries at the University of North Texas.
VN: How did you get involved in this project?
MP: I began working with the End of Term (EOT) Project at the very beginning in 2008. At the time, a group of U.S. institutions that belonged to the International Internet Preservation Consortium (IIPC) were at the General Assembly meeting in Canberra, Australia, when we heard that the U.S. National Archives and Records Administration (NARA) would not be doing a comprehensive crawl of the federal domain at the end of the Bush administration. They had performed this crawl in 2004 at the end of the first Bush term.
The group of U.S. institutions there in Canberra decided that we should work together to complete a 2008 End of Term Presidential Harvest for what became the transition from the Bush to the Obama administration.
We successfully completed the crawling, the sharing of data, and the staging for access and preservation of the first 16 TB EOT crawl in the summer of 2009, then closed up shop until 2012, when we got together to capture a snapshot of the .gov domain at the end of the first Obama administration. We added a few additional partners for this set of crawls, and things went well.
In 2016, we started planning again for the end of the Obama administration into what is now the Trump administration. This round has had quite a bit more publicity, which I think highlights the value and importance of web archives like the EOT.
VN: Have you worked on this type of project in the past (any of the past harvests)? If so, can you tell me about it and how this project is similar or different?
MP: The University of North Texas (UNT) Libraries have been archiving websites since 1997 with the CyberCemetery, a web archive of defunct federal websites for agencies, commissions, committees and initiatives. The CyberCemetery was where we first got started in archiving websites. In addition to that, we’ve been archiving the unt.edu domain twice a year since 2005 as part of the University Archives.
The work with the End of Term project really pushed us to work with larger infrastructure for harvesting, processing and providing access to archived web content. We now have over 250 TB of Web content that we preserve and provide access to as part of the UNT Libraries’ Digital Collections.
VN: Have there been any changes to the project scope since the November elections?
MP: I don’t think the scope for the EOT team has changed since we got started this time in January of 2016. We knew that either way there would be a transition from the Obama administration to another, and that would cause change no matter what. Once the election was over, we continued our normally planned crawling in the time leading up to the inauguration.
What did change was the massive interest in the project that followed the election. Librarians, researchers, technologists and citizens became worried that important information on government websites would be removed by the new administration and began to ask us how they could help. A number of projects like the Guerilla Archiving Event in Toronto or the DataRescue projects began to organize events, and we worked with the different groups that were interested in submitting URLs or data to the EOT project for crawling. That is something we hadn’t seen in the past.
We’ve operated a URL Nomination Tool during the 2008 and 2012 EOT projects, and we had an instance of this tool set up again in 2016. In the first two EOT projects, we received nominations from about 30 people. In 2016, we have nominations from 379 people. That’s serious growth in the number of people who were concerned about federal web content going away.
VN: How many people have been involved in the project to date?
MP: While the EOT project continues to grow with new institutions and volunteers, the core group of institutions crawling, sharing and replicating the data is pretty small. The most involved is the Internet Archive (IA), which takes on the lion’s share of the crawling. The Library of Congress and UNT Libraries also do a significant amount of crawling, with a narrower focus than IA. This year, George Washington University is crawling the 7,000+ Twitter and Tumblr sites with a tool they developed called the Social Feed Manager. This is the first time that we’ve really incorporated a comprehensive crawl of social media sites. Finally, we’ve crawled a massive amount of FTP content, which is primarily comprised of data and datasets. We are all excited about preserving more of this content.
VN: Can you briefly describe how people are working to capture websites? Are they working on a daily basis? Are they students? Are they volunteers?
MP: First off, all of the partner institutions in the EOT project are volunteering their time, infrastructure and storage resources to complete this project. We’ve talked about approaching granting agencies to help support this project but haven’t ever gotten around to that part of the work. After the EOT season ends, we all go back to the other things that we do in our day jobs, even though for many that is still web archiving.
With four primary crawling institutions (IA, LOC, UNT, GWU), we all work with our own infrastructure to crawl content nominated by people using the URL Nomination Tool, or from bulk lists of domains and URLs that we’ve worked to compile over the past year. We are using a combination of tools including Heritrix, which is an open source archival web crawler, and Social Feed Manager. When the harvesting completes we will make the content available using OpenWayback, a tool maintained by the International Internet Preservation Consortium (IIPC).
VN: I am looking at the report for past harvests. It looks like the EPA has more than 700 pages archived in 2016 but had only one archived in 2012 and five archived in 2008. I just want to make sure I am reading the data correctly and that the definition of “page” hasn’t changed since the past harvests.
MP: The information in the URL Nomination Tool is used by the crawling institutions as places to start our crawling. We call these “seed URLs,” and our crawlers will download the content from the seed URL, extract links from that page, and continue to follow and download content until they run out of new content or reach a limit set by the crawlers. In 2008 and 2012, the EPA (probably epa.gov) was likely nominated at a fairly high level. With all of the interest in preserving climate change data this EOT, projects like DataRefuge have been feeding a much larger number of URLs from the EPA and associated agencies.
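The seed-URL process described above can be sketched as a simple breadth-first crawl. This is a minimal illustration only, not the EOT infrastructure: real harvesting uses Heritrix, which handles politeness delays, robots.txt, and WARC output. The `fetch` callable and the `.gov`/`.mil` scope rule are assumptions made to keep the sketch self-contained.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl starting from a list of seed URLs.

    `fetch` is a caller-supplied function taking a URL and returning
    (content, links) -- a stand-in for the network layer, so this
    sketch stays self-contained. A production crawler like Heritrix
    does the fetching and writes WARC records instead of a dict.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)
    archived = {}
    while queue and len(archived) < max_pages:
        url = queue.popleft()
        content, links = fetch(url)
        archived[url] = content  # a real crawler writes a WARC record here
        for link in links:
            absolute = urljoin(url, link)  # resolve relative links
            host = urlparse(absolute).hostname or ""
            # stay within scope, like the EOT focus on .gov and .mil
            if absolute not in seen and host.endswith((".gov", ".mil")):
                seen.add(absolute)
                queue.append(absolute)
    return archived
```

Given one nominated seed, the crawler discovers and archives every in-scope page it can reach, which is why a single high-level nomination like epa.gov can yield many archived pages.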
VN: Do you have any sense about data that has been lost before it was archived?
MP: It is really hard to tell what has been lost, and even harder to say why. Even though we do a good job of capturing a snapshot of the federal domain once every four years, that is ages on the Web. Just think about how often your library, university, city or favorite website changes its look or its content management system. Each of those changes causes URLs to no longer resolve, or “break.” If we had the resources, we would love to start doing the EOT style of harvesting every two years, or every year, or ideally continuously. But as of right now, once every four years is all we are able to do.
What we do have now is three web archives — 2008, 2012 and 2016 — and we would be interested in working with researchers to better understand what has changed across our snapshots of the federal web. Questions we are curious about, in addition to “what is lost,” include: What is new? What has changed but is still there? What are the websites that existed in 2008 that none of us remember because they may have been gone for almost a decade? We are hoping researchers begin to look to web archives to answer questions that they are asking.
VN: Is there anything you wish the press were looking at in regard to this project that I have missed? If so, what?
MP: For those that have worked on the 2008, 2012 and now the 2016 archives, much of what we’ve done this time is what we had been planning on doing since January of 2016. We would have done this work had Clinton won instead of Trump. The thing that is different and wonderful in one sense is the amount of community engagement we’ve seen with the EOT and other projects like DataRefuge that will hopefully empower libraries, research departments and others to download, preserve and provide access to datasets that they find important to their work.
VN: Is there anything else you would like to add?
MP: Another thing we’ve done this time around for the End of Term: when we saw the large number of URLs in the URL Nomination Tool that were for PDFs, we decided to download a copy of all of those publications and add them to a collection in the UNT Digital Library. We are calling this the End of Term Publications collection, and it currently has over 1,200 publications nominated as part of the EOT project. The problem we have, and a new opportunity for people to get involved, is that none of these publications came with metadata records. We’ve already created metadata for 285 of these items, but that’s not quite one-fourth of the reports. We could really use some help creating simple descriptive records for these items. It might just be a way for someone to contribute to the preservation of and access to federal publications. If anyone is interested in working on metadata for these publications, they can take a look at our documentation for this project.
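A "simple descriptive record" of the kind volunteers would create is often expressed in Dublin Core. The sketch below builds a minimal record with Python's standard library; the field values are hypothetical examples, not actual records from the EOT Publications collection, and the UNT Digital Library's exact metadata format may differ.

```python
import xml.etree.ElementTree as ET

# Dublin Core element set namespace
DC_NS = "http://purl.org/dc/elements/1.1/"

def make_dc_record(title, creator, date, description):
    """Build a minimal Dublin Core descriptive record as an XML string."""
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")
    for tag, value in [("title", title), ("creator", creator),
                       ("date", date), ("description", description)]:
        el = ET.SubElement(record, f"{{{DC_NS}}}{tag}")
        el.text = value
    return ET.tostring(record, encoding="unicode")

# Hypothetical publication, for illustration only
xml = make_dc_record(
    "Climate Change Indicators in the United States",
    "United States. Environmental Protection Agency",
    "2016",
    "Report nominated through the EOT URL Nomination Tool.",
)
```

Even four fields like these make a publication findable in a digital library, which is why volunteer-created records are so valuable at this scale.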
Valerie Nye is the library director at the Institute of American Indian Arts in Santa Fe. She has been active in local and national library organizations, recently serving on ALA Council, the New Mexico Library Association, and the New Mexico Consortium of Academic Libraries. Val has co-written or co-edited four books, including True Stories of Censorship Battles in America’s Libraries, published by ALA Editions in 2012. True Stories is a compilation of essays written by librarians who have experienced challenges to remove material held in their libraries’ collections. She has an MLIS from the University of Wisconsin-Madison. In her time away from the library she enjoys road trips in convertibles and kayaking on lakes. Contact her at firstname.lastname@example.org.