January 2004
TABLE OF CONTENTS
Introduction
Community Profile
Bibliographic Content Structure
  Dot Gov Domain Metrics
  Dot Gov Attributes
    Versions and Editions
    Official-ness
    Integrity
  Discovery and Tracking Issues
  Implications for a LOCKSS Documents Application
Social and Economic Aspects of Collection Development
  Preservation Architectures for Government Documents
  Roles for Depository Libraries
  Roles for the Government Printing Office
  Roles for Others
  Sustainability
Legal Aspects of Collection Development
Conclusion
Introduction
Community Profile
Congress established the Federal Depository Library Program (FDLP) in 1860 to ensure
that the United States public has access to its government’s information. Authorized by
44 US Code Section 1902, the program involves the acquisition, format conversion, and
distribution of depository materials and the coordination of Federal depository libraries in
the 50 states, the District of Columbia and U.S. territories. The mission of the FDLP is to
disseminate information products from all three branches of the Government to
approximately 1200 libraries nationwide. Libraries that have been designated as
Federal depositories maintain these information products as part of their existing
collections and are responsible for assuring that the public has free access to the
material provided by the FDLP. Depository libraries represent a mix of library types,
including research libraries (both public and private), local public libraries, law libraries
and state library agencies. Of these 1200 libraries, fifty-two have “Regional” status:
these libraries automatically receive every document distributed under the program and
are expected to maintain access to the material in perpetuity.
1 The summary of that meeting is available on the project website at http://lockss-docs.stanford.edu
FDLP institutions serve a diverse array of local community interests, ranging from
scholarly research communities to citizens as government information consumers (both
individually and collectively). They acquire and preserve this content both to ensure
efficient and free access to important content and to keep it available for future
citizens and scholars.
Bibliographic Content Structure
In this portion of the report we consider the critical technical aspects of web-based
government information content that must be addressed in any potential LOCKSS-based
implementation.
The development of electronic media, digital formats, and the internet has added
significantly to the variety of government information content and structure. Both of these
genres—publications and archival sources—have their equivalent in the electronic
arena. However, at least four new government information genres have emerged to
challenge access and preservation strategies: (1) an array of diverse content on
portable digital media (tape, diskette, CD-ROM, DVD, PDA, etc.); (2) databases (spatial,
numeric, etc.); (3) electronic administrative records (e-mail, transaction records, etc.);
and (4) software applications.
A study conducted by the California Digital Library with funding from the Andrew W.
Mellon Foundation explores the characteristics of the web-based Federal information
domain in great detail.2 In general, the study reveals that the domain of web-based
Federal information can be characterized as:
2 Web-based Government Information: Evaluating Solutions for Capture, Curation, and Preservation.
Project Report – August 7, 2003. Unpublished.
A few metrics of the “dot gov” domain identified in this study can serve to illustrate these
themes.
• Federal websites occupy approximately one-half to one percent of the “surface web”
(available to the public)
• There has been a thirteen-fold increase in the size of the “dot gov” domain between
1992 and 2003
• The half-life of a Federal web resource is four months
• The vast majority of files on Federal websites are in one of two filetypes: html or pdf
• The dot gov domain occupies as much as 85 percent of the so-called “deep web”
(hidden behind firewalls and password-protection screens)
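To make the half-life figure above concrete, the following minimal sketch assumes simple exponential decay. That model is an assumption made here for illustration; the study summary does not state one, and the function name is hypothetical.

```python
def fraction_remaining(months: float, half_life_months: float = 4.0) -> float:
    """Share of web resources still unchanged or reachable after `months`.

    Assumes exponential decay with the four-month half-life quoted above;
    the decay model itself is an illustrative assumption, not a study finding.
    """
    return 0.5 ** (months / half_life_months)


for months in (4, 8, 12, 24):
    print(f"after {months:2d} months: {fraction_remaining(months):.1%} remain")
```

Under that assumption, only about one-eighth of the resources present today would still be in place a year from now, which is why capture has to be frequent rather than occasional.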
These findings are borne out by an unpublished analysis conducted by Jill Vassilakos-Long.
Her study focused on the realm of published content as cataloged through the GPO’s
Catalog of Government Publications and indicated that electronic records fell into a small
number of common formats: 88% were PDF files, with the remainder in HTML, text, or
word-processed formats.
Official-ness
A second important issue for government information relates to its “official” character.
The legal and academic communities as well as the general public depend upon the
ability to identify the “official” version of a Federal government document. “Official-ness”
in the print environment is in general a settled question deriving from broadly accepted
customs in the areas of citation and imprint. In the web environment, characterized by
easily mutable and replicable content, the question of what constitutes the “official”
version of a government document is not settled. The first question is whether any
electronic or web-based version of a document can be described as “official”; this debate
continues within the legal community where the print version remains the “legal” version.
Assuming one agrees that electronic or web-based versions can qualify as official, what
are the constraints under which this designation can be made? Is the official version
the one that resides on the originating agency’s website? Is it equivalent in official
character to versions that might be produced on tangible electronic media or republished
on another agency’s server, such as GPO Access?
Integrity
A third critical issue relates to file and content integrity. According to Charles Cullen: “An
authentic object is one whose integrity is intact—one that is and can be proven or
accepted to be what its owners say it is. It matters little whether the object is
handwritten, printed, or in digital form.”3 The digital format poses particular challenges
to assuring the reader that content has been corrupted neither through bit degradation
nor through content manipulation. There are a number of technical approaches to
maintaining file integrity: digital preservation science proposes strategies involving
routine file backups, refreshment, and migration. However, the issue is not only technical
but also social: establishing authenticity will depend not only upon the development of
technical routines that guarantee chains of custody but also upon the inclusion of “third
parties” in the chain.4
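One widely used technical routine for detecting bit degradation or manipulation is a fixity check: record a cryptographic digest when a file is acquired and recompute it later. The sketch below is illustrative only. The report does not prescribe a particular algorithm, and the choice of SHA-256 and the function names are assumptions.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of a file's bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, recorded_digest: str) -> bool:
    """True if the bytes still match the digest recorded at acquisition."""
    return digest(data) == recorded_digest


# Hypothetical document content, used only for illustration.
original = b"Sample government document text."
recorded = digest(original)

print(is_intact(original, recorded))         # unchanged copy
print(is_intact(original + b" ", recorded))  # a single added byte
```

A check of this kind shows only that a file has not changed since the digest was recorded. It says nothing about who recorded the digest or whether the file was authentic to begin with, which is where the social side of the chain of custody comes in.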
Discovery and Tracking Issues
The metrics of the dot gov domain demonstrate the size, complexity and volatility of
web-based government information. The attribute issues discussed above point to the
need to maintain file integrity and authenticity. A third set of issues relates to the
identification, cataloging and monitoring of the content. The same document may
appear at several different URLs over time. Conversely, new content may be
substituted at the same URL for existing content without notification or any change
to the file name.
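Both tracking problems described above, a document that moves to a new URL and a URL whose content is silently replaced, can be detected by comparing content fingerprints between two crawls. The following is a minimal sketch under stated assumptions: each crawl is represented as a mapping from URL to file bytes, and the function names and example URLs are hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Content-based identifier that does not depend on the URL."""
    return hashlib.sha256(content).hexdigest()


def compare_snapshots(old: dict[str, bytes], new: dict[str, bytes]):
    """Compare two crawls (URL -> content) and return (moved, replaced).

    moved:    (old_url, new_url) pairs where identical content changed location
    replaced: URLs whose content changed while the URL stayed the same
    """
    old_by_hash = {fingerprint(c): url for url, c in old.items()}
    moved, replaced = [], []
    for url, content in new.items():
        h = fingerprint(content)
        if url in old and fingerprint(old[url]) != h:
            replaced.append(url)
        elif url not in old and h in old_by_hash and old_by_hash[h] not in new:
            moved.append((old_by_hash[h], url))
    return moved, replaced


# Hypothetical crawl data for illustration.
old = {
    "http://agency.example/report.pdf": b"2003 annual report",
    "http://agency.example/faq.html": b"FAQ, first version",
}
new = {
    "http://agency.example/archive/report.pdf": b"2003 annual report",
    "http://agency.example/faq.html": b"FAQ, second version",
}
moved, replaced = compare_snapshots(old, new)
print(moved)
print(replaced)
```

Exact-match fingerprints only catch byte-identical copies. A document that is re-exported or lightly reformatted when it moves would need a looser comparison, which this sketch does not attempt.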
Implications for a LOCKSS Documents Application
The bibliographic content structure analysis raises possibilities for a LOCKSS technical
solution as well as some challenges. The LOCKSS program deals well with a number
of the key aspects of the content structure. LOCKSS works well with static web filetypes
transmitted via http such as html and pdf, demonstrated to be the predominant filetypes
in the dot gov domain. In addition, LOCKSS holds the promise of assuring content
integrity and integrating third parties in the chain of custody in a way that could help to
address many of the attribute issues identified above.
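The idea that independent third parties can help assure integrity can be illustrated by comparing the copies held at several institutions and flagging any copy that disagrees with the majority. The sketch below is a deliberately simplified illustration and not the actual LOCKSS protocol, which this report does not describe. The function names and library labels are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Optional


def sha256_hex(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def majority_digest(copies: dict) -> Optional[str]:
    """Digest held by a strict majority of holders, or None if there is none."""
    if not copies:
        return None
    counts = Counter(sha256_hex(c) for c in copies.values())
    digest, votes = counts.most_common(1)[0]
    return digest if votes > len(copies) / 2 else None


def disagreeing_holders(copies: dict) -> list:
    """Names of holders whose copy differs from the majority copy."""
    consensus = majority_digest(copies)
    if consensus is None:
        return []  # no majority, so no copy can be singled out
    return sorted(n for n, c in copies.items() if sha256_hex(c) != consensus)


# Hypothetical holdings of the same document at three libraries.
copies = {
    "library-a": b"Report text",
    "library-b": b"Report text",
    "library-c": b"Rep0rt text",  # damaged or altered copy
}
print(disagreeing_holders(copies))
```

The more independent holders there are, the harder it becomes for damage or tampering at one site to go unnoticed, which is the property the distributed depository model already relies on in print.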
However, the dot gov domain also poses some challenges to the existing LOCKSS
program in terms of version control and discovery. LOCKSS thus far has worked with
content that is published in new editions or issues on a relatively regular schedule,
and the publishing location is clearly identified as part of the LOCKSS publisher
plug-in development effort. A clear possibility for solving some of this dilemma would
be to collaborate with a central government agency such as the Government Printing
Office, with which responsibility for bibliographic control and dissemination of
government information is formally vested.
3 “Authentication of Digital Objects: Lessons from a Historian’s Research.” In Authenticity in a Digital
Environment. May 2000. Washington, DC: Council on Library and Information Resources.
4 See for example Duranti and MacNeil, “The Protection of the Integrity of Electronic Records: An
Overview of the UBC-MAS Research Project.” Archivaria 42:46-67.
Social and Economic Aspects of Collection Development
In this portion of the report we consider the social and economic issues governing
development of web-based government information collections specifically as they relate
to a potential LOCKSS documents implementation.
These two approaches are not mutually exclusive. The Texas Electronic Depository
program involves a mix of both models, with copies of digital web-based state
government content being placed in electronic depository collections for public access
and certain institutions agreeing to preserve the files in their locally developed and
managed digital repositories. The group believes that it would be in the interest of the
depository library community to rely upon and develop both approaches in order to
provide the best insurance policy for long-term access to Federal information content.
Roles for Depository Libraries
An implementation of LOCKSS for the Federal depository library community implies the
potential for supporting new roles for depository libraries beyond preservation and
access. Depository libraries are a diverse and representative group of organizations. The
institutions are often leaders and innovators across a range of library services, from
reference and resource discovery to collection building. They constitute a locus for the
application of technological innovation.
The Federal Depository Library Program (FDLP) is based upon a concept of resource
sharing, expressed both in shared ownership and management and in the
encouragement of resource sharing arrangements at the local and regional levels. The
argument supporting this distributed approach in the print environment (cost savings
through cost sharing) is certainly at least as valid and appropriate in the distributed
digital library environment.
One value of the FDLP is the concept of heterogeneous access points sustaining
diverse community needs and interests. For example, in the print realm local libraries
select Federal content and agree to provide an array of public and technical
services sustaining access to and preservation of this content. Within the existing
guidelines, there is significant room for local variation. The program does not specify a
particular classification scheme or organization of reference services, nor does it
specify binding styles or stack access policies. In a networked environment,
LOCKSS might support the same ability of local libraries to acquire information
resources related to local interests and to organize them according to local need. This
would allow for the creation of new local modes of content access, new databases and
enhanced cataloging for digital content. The resulting specialized resources would have
the benefit of being accessible to anyone, not just the local depository community.
Another key role for libraries has been as memory organizations that help through their
local stewardship of government content to ensure the authenticity of this content. The
very distributed nature of this content within local physical repositories poses
tremendous barriers to any attempt to tamper with the content (by falsification or
destruction) in a systematic fashion. In this respect, memory organizations serve as
trusted repositories for both the public and academic communities.
Just as a LOCKSS implementation for Federal documents has the potential to affirm,
reinforce and broaden the public benefits of existing roles for libraries, it might open a
range of new roles for depositories. As libraries manage the digital content, they are
positioned to create new knowledge in many forms. For example, one library might
extract or generate text files related to a particular set of documents of interest to its
local community. Searches within these files could be integrated with other text
searches in the library’s digital collection ensuring a broad cross-fertilization of
government and non-governmental content. Searchable interfaces to these text files
could become a valuable enhanced access point for this content, potentially available to
all members of the depository community.
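As a rough illustration of the kind of integrated search described above, the sketch below builds a simple inverted index over text extracted from both government and non-governmental items and answers keyword queries against it. The document identifiers, sample texts, and function names are hypothetical, and a production service would more likely use an existing search engine.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase a text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(texts: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in texts.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return ids of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))


# Hypothetical extracted text from government and local collections.
texts = {
    "gov-001": "Water quality report for Travis County",
    "gov-002": "Federal water policy overview",
    "local-007": "Travis County historical society newsletter",
}
index = build_index(texts)
print(sorted(search(index, "Travis County")))
print(sorted(search(index, "water")))
```

Because government and local items share one index, a single query surfaces both, which is the cross-fertilization the paragraph above describes.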
In addition to creating new forms of access to richer digital collections, the availability of
files in the LOCKSS caches to citizens and scholars could empower new and innovative
interactions with the content beyond those that are possible with either print collections
or static page image files for that matter.
Roles for the Government Printing Office
As mentioned in the preceding section of the report, there are several potential roles for
GPO in a LOCKSS-based network of web-based government information. In large part
these derive directly from GPO’s existing mission as a key Federal agency responsible
for the dissemination and bibliographic control of government information. GPO is
uniquely positioned to facilitate a LOCKSS documents implementation. Specifically, the
adoption by GPO of the LOCKSS model, including establishment of a LOCKSS server,
might help solve several issues. GPO could:
• leverage its relationship with various Federal agencies and partnerships with
depository libraries to develop agency-specific LOCKSS plug-ins
• crawl agency web sites
• normalize formats across agencies
• authenticate captured content
• apply digital signatures as appropriate
• disseminate the content and associated metadata through the LOCKSS network
Sustainability
Building a community of LOCKSS partners from the existing FDLP program raises
several issues. What are some of the key reasons that libraries might join a LOCKSS-
based network? These might include some of the following:
Legal Aspects of Collection Development
Assuming LOCKSS were broadly adopted within the parameters of the existing FDLP,
regional libraries would be required to retain the content in perpetuity, and selective
libraries would have the option of withdrawing content once 5 years had elapsed since
its distribution. In addition, GPO can, upon the request of an agency, require the return
or destruction of FDLP content.
Although depository collections are built around the needs of the Congressional district
community in which the institution is sited, depository collections should be available to
all members of the public. GPO has developed an inspection program that ensures that
all collections and ancillary services are available to all members of the public. The
inspection regime focuses on a number of areas including: reference services;
bibliographic control; promotion/outreach; continuing education; collection development;
and physical facilities.
Conclusion
This report finds sufficient need and interest within the Federal Depository Library
community for a government documents implementation of the LOCKSS preservation
program. Such an implementation would complement existing planning in GPO for
digital preservation and extend the current distributed archiving model of the FDLP into
the digital realm. Such an implementation would require fuller technical development of
the LOCKSS model to accommodate the distinct nature of web-based government
content as well as the social, economic and legal aspects of depository documents
distribution.