Conservation DistList Archives

Subject: Preservation of electronic formats

From: Pete Jermann <pjermann>
Date: Monday, August 16, 1993
                      A Report by Peter Jermann on

                 THE PRESERVATION OF ELECTRONIC FORMATS
                    presented by Dr. Michael Spring

                                 at the
                    Preservation Intensive Institute
                            August 1-6, 1993
                        University of Pittsburgh

We were bitted, byted, and nibbled until we grumbled.  We were ASCIIed
and UNICODEd until we groaned.  We were TIFFed, JPEGed, and SGMLed until
we screamed.  And we were RLE, Huffman, and LZW compressed until we
exclaimed "Shoot me, please!"  But when the volcano of information that
is Dr. Michael Spring stopped, the swirling lava of detail became
stone, the world became calm, ... and we saw meaning.  The purpose of
this report is not to summarize the week but to reflect on the lessons
learned.  The reflections that follow are the outcome of one of the
three discussion groups that met on the final day of class, in
combination with my own ideas.  Where these reflections seem correct,
please credit the discussion group; where they seem misdirected, please
blame me.

As people concerned about the implications of electronic technology for
preservation, we need to understand three basic concepts: 1) all digital
information is coded information; 2) digital technology can be used as a
tool for preservation; and 3) regardless of whether we exercise the
options presented by number 2, we have to cope with analog and digital
electronic records that already exist or will be produced.  A more
detailed consideration of each topic follows.

1. Digital information is coded information.

The first concept, that all digital information is coded, means that in
order to preserve digital information and provide future access to this
information, we must also preserve the key that translates the code.
Digital information is merely a series of 1's and 0's (bits) gathered
into parcels of 8 (bytes) that are gathered into collections called
files.  The meaning of these bits, bytes, and files can be, and often
is, arbitrarily determined by the information's creator.  Though we may
come to excel in preserving the physical media on which digital
information is stored, without the key to decipher the information
preserved, future access will require the services of a cryptographer.
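
The point can be illustrated with a minimal Python sketch.  The five
byte values and the choice of code pages below are assumptions made for
the example; they do not come from the lecture.  The same bytes yield a
word, a string of accented letters, or a single large number, depending
entirely on the key used to read them.

```python
# The same five bytes, read with three different "keys".
data = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])

as_ebcdic = data.decode("cp037")    # IBM EBCDIC (US/Canada) code page
as_latin1 = data.decode("latin-1")  # ISO 8859-1, a superset of ASCII
as_number = int.from_bytes(data, byteorder="big")  # one unsigned integer

print(as_ebcdic)         # readable text under the EBCDIC key
print(ascii(as_latin1))  # accented letters under the Latin-1 key
print(as_number)         # a single integer under the numeric key
```

Nothing in the bytes themselves says which reading is the intended one;
that knowledge has to be preserved alongside them.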

The solution for those faced with the responsibility of transferring
information to or preserving information in electronic formats is an
awareness of and support for standards that define the meaning of
digital information.  We need to know who creates standards and how we
can influence their development.  We need to be aware of existing
standards such as the ASCII standard (American Standard Code for
Information Interchange) for text and TIFF (Tagged Image File Format)
for graphics, as well as the hundreds of proprietary formats established
by software vendors.  Finally, we need to look to the future and support
both the development and the use of emerging universal standards such as
UNICODE (a superset of ASCII intended to encode the characters of the
world's national scripts, English and non-English alike), TIFF, and SGML
(Standard Generalized Markup Language, a standard used to describe
documents that may include textual data, image data, or other data in
predefined formats).

These reflections led to the following recommendations:

    a)  A national repository should be established to preserve both
    public and proprietary standards for interpreting digital
    information. This recommendation is based on the importance of this
    information for the preservation of digital information and the
    realization that such a task would be impossible for any individual
    library to assume.

    b) As a profession, librarians and preservation professionals need
    to develop a forum (journal, electronic journal...) where digital
    standards can be discussed and explained in terms comprehensible to
    members of the profession.

2. Digital technology as a preservation tool.

Digital technology can be used as a tool for preserving information
currently in non-digital format.  The uses of this technology include
analog to digital conversion (for sound and video recordings), image to
digital (for documents, books, photos, etc.) and text to digital
(OCR/ICR - optical or intelligent character recognition).

In order to understand conversions to digital format, we must understand
how the digital record relates to the original.  What do we gain and
what do we lose? All digital conversions are based on series of discrete
samples of the original information, whether an analog recording, a page
from a book, or a photograph.  The completeness with which the original
information is captured is determined by the spacing of these samples,
in either time or space (e.g. samples per second, dots per inch), and
by the quality of each sample taken.  The more samples
taken and the higher the quality of the sample, the higher the potential
resolution of a digitally reproduced copy.

The quality of the sample directly relates to the size of the scale by
which each sample is measured. For example, if a color photograph is
scanned into digital format, we could sample at three quality levels
ranging from low to high. We can sample it as a black and white image
(using only two values for any given sample), a gray scale image (up to
256 different values of gray per sample) or a full color image (up to
16.7 million different color values per sample).
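
The arithmetic behind these figures can be sketched in a few lines of
Python.  The page size (8.5 by 11 inches), the 300 dots per inch, and
the helper name raw_scan_bytes are assumptions chosen for illustration,
and the calculation ignores any file header or row padding that a real
image format would add.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_sample):
    """Uncompressed size in bytes of a scan (hypothetical helper)."""
    samples = round(width_in * dpi) * round(height_in * dpi)
    return samples * bits_per_sample // 8


# The three quality levels described above, as bits per sample.
LEVELS = {"black and white": 1, "gray scale": 8, "full color": 24}

for name, bits in LEVELS.items():
    values = 2 ** bits  # number of distinct values one sample can take
    size = raw_scan_bytes(8.5, 11, 300, bits)
    print(f"{name}: {values:,} values per sample, {size:,} bytes")
```

Under these assumptions a single page grows from roughly one megabyte
as a black and white image to roughly twenty-five megabytes in full
color, which is why the choice of scale matters as much as the choice
of resolution.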

Once we understand the relationship between the original information and
its digitized copy we need to understand the limits and costs of the
technology. What are the costs of information input, information
storage, and information output?  What information might be lost, or
enhanced?  What are the limitations of image to digital or audio to
digital conversions?  Once information is digitized how do we store it?

The quantity of information digitized imposes its own limitations.
The more information we save (higher sampling rate and/or more values
per sample) the higher the cost of processing and storing that
information. Storage requirements can be tempered by a variety of data
compression schemes.  As preservation specialists we must understand
that compression algorithms can be lossless (no information lost on
decompression) or lossy (decompressed information differs from the
original, uncompressed information).  What do we gain and what do we lose
in such schemes?
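
The distinction can be made concrete with a small Python sketch.  The
run-length encoder below is a simplified illustration of RLE, not the
scheme used by any particular file format, and the quantize function is
only a stand-in for the kind of detail a lossy method discards.  The
sample row of gray values is invented for the example.

```python
def rle_encode(data):
    """Collapse runs of equal bytes into (count, value) pairs."""
    runs = []
    for value in data:
        if runs and runs[-1][1] == value:
            runs[-1] = (runs[-1][0] + 1, value)
        else:
            runs.append((1, value))
    return runs


def rle_decode(runs):
    """Rebuild the original bytes from (count, value) pairs."""
    return b"".join(bytes([value]) * count for count, value in runs)


def quantize(data, step):
    """Round each value down to a multiple of step (discards detail)."""
    return bytes((value // step) * step for value in data)


row = bytes([255] * 12 + [254] * 5 + [0] * 3)  # 20 gray-scale samples

lossless_runs = rle_encode(row)
restored = rle_decode(lossless_runs)

coarse = quantize(row, 16)        # the lossy step: 255 and 254 merge
lossy_runs = rle_encode(coarse)

print(len(lossless_runs), restored == row)
print(len(lossy_runs), rle_decode(lossy_runs) == row)
```

The lossless version returns exactly the bytes that went in.  The lossy
version stores fewer runs, but the difference between the two lightest
grays cannot be recovered afterward.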

Once our information is digitized and compressed, how do we organize the
data (see discussion on standards in part 1 above), and how do we
retrieve it?  What are the limitations of converting a graphic image of
text, such as the scanned image of a page from a book, into keyword
searchable, character based information?  What are the advantages or
disadvantages of CD-ROM?  How stable is the physical medium?

Only when we understand the technology involved in the hardware and
software can we make decisions concerning the uses of digital
technology as a preservation tool.  These decisions must be guided by
answers to the following questions:

    - By whom and how will the information be used?

    - Can digital technology achieve the quality required by these
    perceived users and uses at a cost we can afford?

    - Will access to a digital copy increase or decrease demand on the
    original?

    - Should textual information be digitized as an image, as is
    currently done with microfilm, as text that can be indexed and
    searched on a computer, or should it be digitized in both formats?

    - How can we index or catalog the digitized information?

The answers to these questions will help us to answer questions like the
following:

     - Is digital technology the answer to this particular application
     or should more traditional means be used?

     - At what resolution should an image be scanned or an analog audio
     recording sampled?

     - Should basic black and white printed text be scanned as a black
     and white image or as a gray scale image?

     - Should we use a lossy or lossless compression to store our data?

3. Coping with electronic records

As preservation specialists we must learn to cope with existing
electronic records as well as those we produce through our preservation
efforts. Preservation of electronic formats requires that we know and
understand a) the logical format or code by which the information is
translated to human terms; b) the technology that can read the
information on the particular medium on which the electronic record
exists; and c) the life of the medium on which the information is
stored.

    a) The importance of the encoding scheme or format of digital
    information places two requirements on preservation specialists.
    First, it requires that as digital information is collected we must
    acquire knowledge of the format in which it is stored. This
    knowledge must be inextricably tied to the electronic record through
    cataloging or other means. Further, it is necessary to ensure that
    the specifics of the format are preserved somewhere (see part 1
    above).  If the format is peculiar to the records in hand, as may be
    the case with a custom-designed software application, then we must
    also obtain a detailed record of the encoding of that format and
    ensure that it remains tied to the data.  Second, we must support
    the development and use of universal standards so that the problems
    associated with the existing standards lessen with time.

    b) Whereas the format tells us how information is arranged within a
    digital file, it tells us nothing about the mechanism required to
    read the digital information from the particular medium on which it
    is recorded.  Information regarding this reading technology, like
    the format or coding information, should be tied to the electronic
    record through cataloging or other means. We need to know and
    document the hardware or combination of hardware and software that
    enables us to read the electronic record from the particular medium
    in our possession.  We cannot assume that all similar media require
    similar technology to read. Magnetic tapes, for example, can contain
    digital or analog information and can only be read by an appropriate
    machine. Floppy disks are an example of media that can be physically
    identical yet incompatible. Disks formatted on an Apple computer are
    not easily read on IBM compatible computers, nor are IBM formatted
    disks easily read on Apple computers.

    c) Finally we need to understand that the life of the medium
    (magnetic tape, floppy disk, CD-ROM, etc.) on which electronic
    information is stored depends on a combination of two factors. The
    first is the rate of the medium's physical decay.  How long is the
    medium capable of maintaining its information intact?  The second
    factor is the life expectancy of the technology used to write to and
    read from that medium. Should this technology disappear the
    information on the medium becomes inaccessible.
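
One way to picture the three requirements is as fields in a catalog
record.  The Python sketch below is only an illustration: the class
name, the field names, and the sample entries are all hypothetical and
do not follow any existing cataloging standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElectronicRecord:
    """Hypothetical catalog entry for one electronic record."""
    title: str
    logical_format: str      # a) the code that translates the data
    reading_technology: str  # b) hardware/software needed to read it
    medium: str              # c) the physical carrier
    year_written: int        # c) a starting point for judging decay


def needing_migration(records, obsolete_technology):
    """List every record that depends on a disappearing technology."""
    return [r for r in records
            if r.reading_technology == obsolete_technology]


# Invented sample holdings, for illustration only.
holdings = [
    ElectronicRecord("Faculty minutes", "ASCII text",
                     "5.25-inch floppy drive", "floppy disk", 1986),
    ElectronicRecord("Campus photographs", "TIFF",
                     "CD-ROM drive", "CD-ROM", 1992),
    ElectronicRecord("Budget tables", "proprietary spreadsheet",
                     "5.25-inch floppy drive", "floppy disk", 1988),
]

at_risk = needing_migration(holdings, "5.25-inch floppy drive")
print([record.title for record in at_risk])
```

Because the format and the reading technology travel with each record,
a question such as "what do we hold that needs this drive?" becomes a
simple lookup rather than an inspection of every item on the shelf.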

This combination of factors affecting the life of an electronic medium
requires that the preservation specialist be diligent on two fronts.  He or
she must act in a traditional sense and monitor the condition of the
artifact on which the electronic information is stored.  When the
artifact can no longer sustain the information, like a brittle book
unable to support its printed message, it must be copied or its
electronic image refreshed on the existing medium.

Unlike a brittle book, however, the preservation of electronic media
requires that the specialist monitor the technology that placed the
information on the electronic medium. The obsolescence of the brittle
book's printing technology has no impact on its preservation.  The
obsolescence of an electronic reading technology can mean loss of access
to the information stored with that technology.  Consequently, in
addition to monitoring the artifact the preservation specialist must
monitor both the reading technology connected with the artifact and
emerging technologies that will supersede that technology.  It becomes
his or her responsibility to migrate data to the newer technology before
the old technology disappears.

Fortunately, a significant advantage of digital electronic data (though
not analog data) is its ability to be refreshed, without loss, on its
existing medium or transferred, also without loss, over a wire from one
computer to another regardless of hardware and/or software differences.
If the reading technology is properly documented (see part b above), any
data produced by a given technology can be quickly identified en masse
and transferred to a newer medium.
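
The claim that a digital copy can be made without loss is one that can
be checked rather than assumed.  The Python sketch below copies a file
and compares a SHA-256 digest of the original with a digest of the
copy.  The function names, file names, and sample data are assumptions
for the example, and a temporary directory stands in for the old and
new media.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source, destination):
    """Copy source to destination and confirm the digests agree."""
    shutil.copyfile(source, destination)
    before, after = sha256_of(source), sha256_of(destination)
    if before != after:
        raise OSError(f"copy of {source} does not match the original")
    return after


with tempfile.TemporaryDirectory() as workdir:
    old_copy = Path(workdir) / "old_medium.dat"  # hypothetical names
    new_copy = Path(workdir) / "new_medium.dat"
    old_copy.write_bytes(bytes(range(256)) * 4)  # invented sample data
    checksum = refresh(old_copy, new_copy)
    identical = old_copy.read_bytes() == new_copy.read_bytes()
    print(identical, checksum[:16])
```

Recording the digest in the catalog at the time of transfer would also
give later custodians a way to confirm that the data has not changed
since.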

NOTE: Special thanks to Michael Spring, Shannon Zachary, Karen
Motylewski, Barclay Ogden and my wife, Mary Jermann, for reviewing the
draft of this essay and offering comments, criticisms and encouragement,
all of which have made it better than it otherwise would have been.

Pete Jermann
Preservation Officer
Friedsam Memorial Library
St. Bonaventure University
St. Bonaventure, NY 14778
(716) 375-2324

                                  ***
                  Conservation DistList Instance 7:20
                  Distributed: Monday, August 16, 1993
                        Message Id: cdl-7-20-001
                                  ***
Received on Monday, 16 August, 1993
