Digital Archive Profiles

Draft

Walter Henry
March 14, 1997

"... to ensure that no valued digital information is lost to future generations, repositories claiming to serve an archival function must be able to prove that they are who they say they are by meeting or exceeding the standards and criteria of an independently-administered program for archival certification [emphasis added]"

-- ArchTF Report, p.8

"Repositories claiming to be digital archives in a changing and uncertain environment must be able to prove that they are who they say they are, and that they can deliver on the preservation promise."

-- Donald J. Waters. "The Implications of the Draft Report of the Task Force on Digital Archiving"

The phrase "they are who they say they are", with all its delicious ambiguity, seems to lie at the very heart of the call for certification. In traditional archives, we know who we're dealing with: we can see their buildings, their catalogs, even the eyes of their librarians. In a very real sense we can say we "know who they are" and what they stand for. In a distributed digital archive network, this basis for trust is no longer a given. What is needed, whether or not it involves formal certification, is a mechanism by which interested parties can know not just "who they are" but "what exactly is the preservation promise they are making". As the network of digital archives will be a diverse one, it is not a simple matter to answer these questions. As a foundation, we propose what could be considered a negotiation mechanism: a means by which Archives disclose their intentions and by which the community, provided with an explicit statement of "the promise", can determine whether or not the promise is adequately fulfilled. Leaving aside for now all questions of implementation (architecture, data representation, etc.), we propose the development of a standard format for communicating metadata about a digital archive's preservation practices, and a (probably distributed) database to provide access to that metadata.

Digital Archive Profiles

A Digital Archive Profile (DAP) is the set of assertions made by an entity wishing to be considered a digital archive. [* Henceforth, "the entity wishing to be considered a digital archive" will be referred to as the "Archive", whereas "archive" will refer to the collections themselves.] These assertions describe in greater or lesser detail

In the aggregate, these assertions serve as a profile describing the Archive's "promise" to the community. It is roughly analogous in intent to a collection development conspectus record.

This proposal posits the creation of a distributed database of DAPs, roughly analogous to the DNS, by means of which the community can determine an Archive's intentions (what they say they'll do). For reference purposes, each DAP will be given a DAP-UID, and the cataloging records for each digital object to which the DAP applies will include a reference to that DAP-UID. Because an Archive has many digital objects and does not make identical assertions for all of them, each Archive will provide numerous DAPs, each covering a subset of the Archive's collections.

Access to the DAPs for a particular Archive's collections would be fully adequate "certification" to some members of the community. For example, if a scholar at Stanford wonders whether a particular digital object at Berkeley is being adequately preserved, she queries a bibliographic database to discover the object's DAP-UID (i.e. what profile applies to the object). She then uses that DAP-UID to query a DAP server, to find out what promises Berkeley has made about the set of objects including the one she's concerned about. She discovers that Berkeley has asserted

a long-term commitment to preserve the object as a high-resolution page image, as an uncompressed TIFF 6.0 file;

that there is also a plain text (OCR) version;

that both are stored offline on magnetic tape under environmental conditions meeting a given NISO standard;

that for each, the media will be refreshed within 5 years;

that metadata about the object will be preserved for the same period, etc.

The scholar is satisfied that the digital object will be preserved in a form that will meet her expectation of scholarly need, and trusting Berkeley to honor its commitment, worries no longer about the object.

She then carries out a query for another object whose DAP, while similar to that of the first object, differs in one important respect: Berkeley promises to keep the TIFF format file for only 5 years, and after that may opt to retain only the plain text version. As this does not meet the scholar's expected need, she does not consider the object to be adequately preserved and looks elsewhere (or asks another Archive to take on the commitment to preserve the object).

In either case the Archive has acted responsibly and the user (whether individual or organization) has meaningful information.
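
The two-step query in the scenario above can be sketched as follows. This is only an illustration: the in-memory dictionaries stand in for the bibliographic database and the DAP server, and the object identifiers, the DAP-UID syntax, and the assertion names (`tiff_long_term`, `media_refresh_years`) are hypothetical, since this paper does not define a DAP format.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """One DAP: a set of assertions covering a subset of an Archive's objects."""
    dap_uid: str
    archive: str
    assertions: dict = field(default_factory=dict)


# Hypothetical stand-in for a bibliographic database: object -> DAP-UID.
CATALOG = {
    "object-1": "dap:berkeley:0001",
    "object-2": "dap:berkeley:0002",
}

# Hypothetical stand-in for a DAP server: DAP-UID -> profile.
DAP_SERVER = {
    "dap:berkeley:0001": Profile(
        "dap:berkeley:0001", "Berkeley",
        {"tiff_long_term": True, "media_refresh_years": 5},
    ),
    "dap:berkeley:0002": Profile(
        "dap:berkeley:0002", "Berkeley",
        {"tiff_long_term": False, "media_refresh_years": 5},
    ),
}


def lookup_profile(object_id):
    """Resolve an object to its profile: catalog record, then DAP server."""
    return DAP_SERVER[CATALOG[object_id]]


def meets_need(profile, required):
    """True if the profile makes every assertion the user requires."""
    return all(profile.assertions.get(k) == v for k, v in required.items())


scholar_need = {"tiff_long_term": True}
for object_id in CATALOG:
    profile = lookup_profile(object_id)
    verdict = "adequate" if meets_need(profile, scholar_need) else "look elsewhere"
    print(object_id, profile.dap_uid, verdict)
```

In this sketch the decision about adequacy stays with the user: the Archive only discloses its assertions, and each user compares them against her own requirements.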

Whether or not a formal program of certification is developed, the DAP provides a basis for determining exactly what it is an Archive has committed to do. Given an adequately structured DAP format (an issue not addressed in this paper), formal verification that an Archive is honoring its commitments could in part be done by automated means.

NB. While this paper suggests that DAPs could serve as a mechanism for effecting formal certification, we in no way wish to imply an endorsement of the notion of certification. See Deanna Marcum's Responses to the RLG/CPA Report for perspectives on this matter.

Level of Profile Detail

In order to encourage the widest possible participation in the distributed digital archive network, it is important to reduce barriers to entry. Many repositories of digital objects contain valuable cultural property but lack the resources to serve as full-blown "classical" digital Archives. Nevertheless, these Archives serve important functions, including holding material in the short term that may be valuable in the long term, and which might in time be transferred (i.e. responsibility for their preservation might be transferred) to Archives with greater resources.

The level of detail in the profile, and the level of commitment the profile represents, may range from minimal, as in the case of some "mom-and-pop" archives, which may commit only to preserving a set of objects for a short time (perhaps in the hope that a larger Archive will take over responsibility for migration and long-term preservation), to very detailed and ambitious commitments in the case of large institutions, national repositories, etc. What the DAP provides is a means by which an Archive can declare its intentions, clearly and in a standard format.

It is assumed that an Archive will, in most circumstances, develop a limited number of skeletal profiles describing the range of its digitization/preservation activities, and apply one of those skeletal profiles to each object it wishes to include in the archive.

In a sense, the Digital Archive Profile can--perhaps should--be viewed as a set of attributes adhering to a given set of objects. That is, while the intent of the Digital Archive Profile is to provide metadata about a collection (or more usually, a subset of a collection), the picture is readily inverted and the same metadata can be seen to pertain to any object in the profiled collection. Indeed, this could as well be called a Digital Object Profile, and a digital archive could be defined as the set of objects having such a profile. It is assumed that item-level cataloging for digital objects will include an identifier for an archive-level record encoding the DAP (the latter residing in the distributed DAP system, but not necessarily in the local catalog).

When envisioning a virtual digital archive, it is important to consider that, to a far greater extent than is the case with a conventional archive, what makes the collection an "archive" is, in fact, immaterial; it is the intention of the organization that constitutes an archive. That is, an organization's determination to preserve digital objects A, B, F and Z makes them, in the context of the ARCHTF recommendation, an archive. Whether A, B, F, and Z also constitute a meaningful collection from a subject, historical, literary, or other content-based perspective is another matter entirely. When viewed by a subject specialist, a given digital object might well be part of several different archival contexts, but it would be part of a set of objects to which a single DAP applies [* This is an oversimplification, and may well not be a supportable proposition. For example, someone viewing object B as part of a legal archive might consider an assertion about the object's provenance to be inadequate, while another viewing B as a component of a literary collection might consider the provenance complete. Archivists will immediately imagine many other complications.]

Assertions

An Archive may choose to make any number of assertions and the ARCHTF report describes many of the types of assertions that would be meaningful in the context of a network of distributed digital archives. Some examples, intended just to suggest the scope of the issues involved:

Conceptually, assertions can be situated in a space defined by two axes: Weak--Strong and Soft--Hard, defined below. [* For a formal model these definitions would have to be made more rigorous]

Weak assertions. Relatively little is claimed, therefore it will be easy for the Archive to fulfill its commitment.

Examples:
We are Stanford University Libraries
We have an object A.
The object is a plain ASCII text with no markup beyond punctuation and spaces between words.
We have no plans to migrate this object to new media.
The TTL for this assertion is 1 second.
Some kind of cataloging exists

Strong assertions. Relatively much is claimed, and it is correspondingly more difficult for the Archive to fulfill its commitment.

Examples:
We are Stanford University Libraries and our public key is ZXYXW
We maintain an object A in 4 formats: TEI 2.0 (identified by Formal Public Identifier); TeX; plain ASCII text with no markup beyond punctuation and spaces between words; page image in TIFF 6.0
The object is stored offline on magnetic tape stored under conditions meeting standard NISO Z39.xx
The digital signature (MDA5) for the uncompressed TEI version is 12345
The digital signature (MDA5) for the uncompressed TeX version is 23456
The digital signature (MDA5) for the uncompressed plain ASCII text version is 34567
The digital signature (MDA5) for the uncompressed TIFF 6.0 version is 45678
Degraded versions of the TIFF 6.0 version are publicly available at a resolution of 400 dpi, with JPEG compression
The TTL for this assertion is 5 years.

Soft assertions. These are difficult to verify by automated means.

Examples:
The object is stored offline on magnetic tape stored under conditions meeting standard NISO Z39.xx
We maintain a detailed paper record describing the provenance of object A from the time of its creation to the present.

Hard assertions. These are, at least in principle, easy to verify by automated means.

Examples:
The digital signature (MDA5) for the uncompressed TEI version is 12345
Degraded versions of the TIFF 6.0 version are publicly available at a resolution of 400 dpi, with JPEG compression
The digital signature (MDA5) for the degraded versions is 9876545
The digital signature (MDA5) for the uncompressed TEI version is 12345 etc.
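
As a rough illustration of the two axes, an assertion could be modeled as a piece of text with a position on each axis. The numeric scores and the 0.5 threshold below are purely illustrative assumptions, not part of this proposal; a formal model would need the more rigorous definitions noted above.

```python
from dataclasses import dataclass


@dataclass
class Assertion:
    text: str
    strength: float  # 0.0 = weak, 1.0 = strong (illustrative score)
    hardness: float  # 0.0 = soft, 1.0 = hard (illustrative score)


def route(assertions, threshold=0.5):
    """Split assertions into those a program could check and those needing people."""
    automated = [a for a in assertions if a.hardness >= threshold]
    human = [a for a in assertions if a.hardness < threshold]
    return automated, human


profile = [
    # Listed above as both strong and hard.
    Assertion("The digital signature (MDA5) for the uncompressed TEI version is 12345",
              strength=0.9, hardness=0.9),
    # Listed above as both strong and soft.
    Assertion("The object is stored offline on magnetic tape stored under "
              "conditions meeting standard NISO Z39.xx",
              strength=0.9, hardness=0.1),
    # Listed above as soft; the strength score is a guess.
    Assertion("We maintain a detailed paper record describing the provenance of object A",
              strength=0.7, hardness=0.1),
]

automated, human = route(profile)
print(len(automated), "assertion(s) for automated checking")
print(len(human), "assertion(s) for human evaluation")
```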

Naturally, most assertions fall somewhere between the extremes of both axes. For example, the following is a fairly weak, slightly hard assertion:

Each of the 4 objects is compressed

because it is an easy promise for the Archive to keep and, while not easy to verify (since the verifier doesn't know what compression scheme is used, it must try many schemes), it can in principle be checked by automated means. [* This is an interesting case in that the assertion may be "provable" but not "falsifiable"; that is, a verifier may in fact find that the object has been compressed, because the verifier is able to find a scheme to uncompress it; but it may happen that the verifier is not able to find an uncompressor even though the object was, in fact, compressed. This possibility raises some interesting problems.]
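
The "try many schemes" approach, and its one-sided outcome, can be sketched as follows. The set of schemes tried here (those in the Python standard library) is an arbitrary assumption for illustration; a real verifier would need a much larger repertoire, including image-specific schemes.

```python
import bz2
import gzip
import lzma
import zlib

# Schemes this hypothetical verifier knows how to undo.
DECOMPRESSORS = {
    "zlib": zlib.decompress,
    "gzip": gzip.decompress,
    "bz2": bz2.decompress,
    "lzma": lzma.decompress,
}


def find_compression(data):
    """Return the name of the first scheme that decompresses `data`, else None.

    A name proves the assertion "the object is compressed". None does NOT
    disprove it: the object may use a scheme this verifier does not know.
    """
    for name, decompress in DECOMPRESSORS.items():
        try:
            decompress(data)
        except Exception:
            continue  # not this scheme; try the next one
        return name
    return None


sample = b"The object is a plain ASCII text. " * 20
print(find_compression(gzip.compress(sample)))
print(find_compression(bz2.compress(sample)))
print(find_compression(sample))
```

The last call returns `None`, which the verifier can only report as "inconclusive", not as "the assertion is false".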

In contrast, a considerably stronger and much harder assertion (stronger because the Archive commits to restricting itself to a particular compression scheme for the TTL, even when it might be convenient for the Archive to change to another scheme) might be:

Each of the 4 objects is stored using CCITT Type 4 Compression

The "hardness" of some assertions may be difficult to determine in some contexts. For example, in a network context, the weak assertion

We are Stanford University Libraries

may generally be easily verified with familiar networking tools, but may be subject to spoofing. Therefore, verification of this ostensibly hard assertion might require corroboration by verifying other elements of the profile, such as digital signatures, and perhaps by reference to external trusted agencies.

For an external certifying agency to verify soft assertions, expert human evaluation would be necessary. While some soft assertions, if supported by online catalog records, may be verified remotely, other soft assertions might require site visits as well as examination of the Archive's documentation.

The verification of hard assertions, however, could be done by automated means and, assuming the Archive is willing to cooperate in the process by providing the certifying agency with network access to the entire collection (including, under suitable assurances of confidentiality, those areas of the collection not normally made public, such as "master" image files normally stored offline), could be done from remote sites. [* In the interest of clarity and simplicity, this document glibly ignores many real situations that will introduce considerable complexity into the scheme. For example, an Archive may store a "master" or "archival" copy of an object, but disseminate degraded versions (e.g. lower resolution, JPEG compressed images). In this case, hard assertions about the master tell us nothing about the disseminated versions. Automated verification may then involve a considerable degree of human-to-human cooperation.]
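
A minimal sketch of such an automated check, for the hard assertion that a stored file has a given digest, might look like this. It assumes that "MDA5" in the examples refers to the MD5 message digest and that the asserted value is published as a hexadecimal string; the function name and sample content are hypothetical.

```python
import hashlib


def verify_digest(content, asserted_hex, algorithm="md5"):
    """True if the digest of `content` matches the value asserted in the DAP."""
    actual = hashlib.new(algorithm, content).hexdigest()
    return actual == asserted_hex.lower()


# Hypothetical master file, and the digest the Archive would publish for it.
master = b"<TEI.2><text><body><p>Object A</p></body></text></TEI.2>"
asserted = hashlib.md5(master).hexdigest()

print(verify_digest(master, asserted))         # unchanged master
print(verify_digest(master + b" ", asserted))  # altered by a single byte
```

A bare digest shows only that the content is unchanged. Tying the assertion to the Archive itself would additionally require something like a signature made with the Archive's key, which this sketch does not attempt.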

The Minimal Profile

There exists a set of required assertions (one implicit in the creation of the DAP) without which a DAP is rendered useless. This report identifies three such assertions, but others will no doubt emerge as work on this model continues. The DAP-UID could be considered a fourth required assertion, but might be assigned by a naming authority other than the Archive.

Minimal profiles are, obviously, of limited value, and Archives with strong commitments to preservation are expected to provide richer profiles. One approach to formal certification might describe a Required Profile and mandate adherence to it for a significant portion of an Archive's holdings.

  1. Time To Live (TTL)

    This is an assertion that all the other assertions in this DAP will be valid until the TTL expires. This assertion is important because it provides the community with a way of determining the extent of the Archive's commitment, and of deciding whether some action is required (such action being something as simple as retrieving a new DAP record or something as drastic as initiating a "rescue" procedure as described in the ARCHTF report).

  2. Identity of Archive

    This is neither as trivial nor as simple as it may seem on the face of it. The identity of an Archive is a first clue to its trustworthiness, especially to end-users. Moreover, as organizations establish inter-institutional partnerships, as businesses form temporary "strategic alliances", and as responsibilities for preservation (in "traditional" libraries and archives) begin to cross historical organizational boundaries, the question "who is this?" becomes increasingly important and increasingly difficult to answer. My confidence in an Archive identified as "IBM/Apple Peaceful Coexistence Group" might be tempered by caution.

  3. We have something we are preserving

    Implicit in the act of creating a DAP, this asserts nothing stronger than that the Archive is holding some digital object at least until the TTL expires. This minimal assertion tells us little except that we have a fixed amount of time before the object may disappear.
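
A minimal profile along these lines could be sketched as a small record. The field names, the `issued` timestamp, and the one-year warning window are assumptions made for illustration. The third assertion is not stored at all, since it is implied by the existence of the record.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class MinimalProfile:
    archive_identity: str          # required: who is making the promise
    issued: datetime               # when the assertions were made (assumed field)
    ttl: timedelta                 # required: how long the assertions hold
    dap_uid: Optional[str] = None  # may be assigned by a separate naming authority

    def expires(self):
        return self.issued + self.ttl

    def status(self, now, warning=timedelta(days=365)):
        """Classify the profile as 'valid', 'expiring', or 'expired' at `now`."""
        if now >= self.expires():
            return "expired"
        if self.expires() - now <= warning:
            return "expiring"
        return "valid"


dap = MinimalProfile(
    archive_identity="Stanford University Libraries",
    issued=datetime(1997, 3, 14),
    ttl=timedelta(days=5 * 365),
)
for when in (datetime(1998, 1, 1), datetime(2001, 6, 1), datetime(2003, 1, 1)):
    print(when.date(), dap.status(when))
```

An "expiring" or "expired" status is the cue for the community to act, whether by retrieving a fresh DAP record or by considering a rescue.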

