Introductory Notes on Digital Imaging and Preservation

Bob Savage
Media Preservation
Stanford University Libraries
April 4, 2001

We do two types of digital reformatting.

  1. Reformatting to prevent the types of deterioration to which the native format is prone, such as dye fading of color photographs, crackling of albumen prints, "mirroring" of photolytic silver images, and flaking and breaking of glass plate negatives. We could call this true preservation reformatting, with the understanding that in true reformatting we would consider the digital version the new preservation master.
  2. Creation of digital surrogates, which extend the life of the original by diminishing handling, or exposure to air and light. In the case of paper-based images, we are most likely thinking about this latter type of preservation work. We are loath to dispose of the original because, even though it is slowly decomposing, it retains so much artifactual value, and we can think of most photographic prints as paper objects, the conservation issues of which are well understood. Where we might see a movement to favor the digital versions over the originals would be when we start to confront color prints of more recent vintage; it is quite conceivable that in 25 years someone will compare the scan we make today of a color photograph in the archive (say, of a staff picnic in 1980) to the original and find that the original's colors have faded so much that it is no longer satisfactory.

With respect to videotapes, motion picture film, records, digital originals, and so forth: although these aren't the subject of this talk, I wanted to point out that, for these media types, many of the issues I discuss will be the same, or at least analogous. In these cases, however, we are more likely to be doing true reformatting (not just providing digital surrogates), because the original media are so susceptible to problems.

Our first priority, when undertaking a digital reformatting project, is not to create another problem as bad as the one we already have.

To do this we look at three areas:

Media Integrity
Format Obsolescence
Information Fidelity

Media Integrity
This refers to the inherent characteristics of the host medium, for example an optical disc such as the compact disc. Every digital format has a relatively low life expectancy1 (less than 100 years), as opposed to microfilm, which is estimated to have an LE of 500 years; and, of course, the Stanford University Libraries hold paper that is already older than that. I should point out, in this context, that relatively low LE ratings are common among 20th-century media formats.

On the positive side, with digital files we can get around this problem because we can make perfect "clones" of the digital information, unlike the copies we make in the analog world, which often result in signal loss. So there are some pretty straightforward solutions to digital media instability: regularly refreshing the host media, and keeping redundant copies (backups). In addition, with compact discs, for example, we could monitor something like the Block Error Rate (BLER), which measures the number of data blocks per second that have one or more bad symbols. That way, when we see the BLER rise too high, we could refresh the host.
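To make the idea of a "perfect clone" concrete, here is a minimal sketch, in Python, of how a refreshed copy can be verified bit for bit against its source by comparing checksums. The file names are hypothetical, and this is only an illustration of the principle, not a description of our workflow:

    import hashlib

    def sha1_of(path, chunk_size=1 << 20):
        """Return the SHA-1 hex digest of a file, read in chunks."""
        digest = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Hypothetical paths: a master file and the copy written to fresh media.
    original = sha1_of("masters/stanford_family_001.tif")
    refreshed = sha1_of("refresh_2001/stanford_family_001.tif")
    print("perfect clone" if original == refreshed else "mismatch: recopy from a backup")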

But I want to point out that for our unique materials, for example the scans of photographs of the Stanford Family, the Preservation department does not recommend compact discs. When I started, we had a model inherited from the days of microform reformatting, where master copies were deposited in salt mines, and our intention at that point was to send CDs of our TIFFs to Kansas. As I said just a moment ago, microfilm has a much higher life expectancy, and it would be a mistake to follow this model with digital media; but this really brings me to the second area.

Format Obsolescence
We can infer from historical experience, or even our immediate experience of rapid change in the realm of computer technology, that big changes are going to happen in the next 50 years. It is, therefore, incumbent upon us to plan for these changes.

A rather obvious example of format obsolescence from recent history would be a 5¼" floppy containing files in some old proprietary format such as XYWrite. From this example we can see that format obsolescence has two parts:

  1. Physical Media—e.g. the floppy diskette itself. The implication of this is a regular program of data migration. In fact, it is currently believed that media format obsolescence is a greater risk than media integrity, not because the various media types (e.g. CD-ROM) are especially resilient to the traditional factors of physical, chemical, and biological degradation, but because the risk of losing data to format obsolescence is so great. This is the reason the Preservation department favors a central repository of all of the libraries' digital information: if it is scattered across hundreds of CDs, DVDs, zip disks, or what have you, we are creating a massive reformatting project for ourselves in about 20 to 30 years, an expense we can do without.
  2. File Structure—e.g. XYWrite, et al. The solution here is to use only file formats based on open standards, so that we are not locked out of our own files. In digital imaging there is a reasonable standard in place: TIFF (Tagged Image File Format). The specification is open, meaning we could actually provide a text file describing how the file should be read, or even provide code for reading the file format in a high-level language such as ANSI C, which could be compiled on whatever computing platforms exist at the time of need (a brief sketch of reading such a file header follows this list).

    Of course this implies an ongoing reformatting program as well, but file formats based on open standards are far more stable than media formats, which require functioning hardware. Also, the effort required to reformat data files does not increase in direct relation to the number of files, as it would in a media reformat. Imagine a large number of files, say 100,000 TIFFs. To reformat the whole lot we probably only have to execute one "batch" command, whereas a reformatting project of 100,000 floppy disks is almost inconceivable due to the labor cost of hand-manipulating that many disk insertions, to say nothing of the tedium that would result.
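As a small illustration of what an open specification buys us: the first eight bytes of every TIFF are documented in the published TIFF 6.0 specification (a byte-order mark, the magic number 42, and the offset of the first image directory). The sketch below reads that header; it is written in Python rather than ANSI C purely for brevity, and the directory name is hypothetical. Because the format is self-describing, checking or converting 100,000 masters becomes one loop rather than 100,000 manual operations:

    import struct
    from pathlib import Path

    def read_tiff_header(path):
        """Return (byte order, first IFD offset) per the TIFF 6.0 specification."""
        with open(path, "rb") as f:
            header = f.read(8)
        if header[:2] == b"II":
            endian = "<"   # little-endian ("Intel") byte order
        elif header[:2] == b"MM":
            endian = ">"   # big-endian ("Motorola") byte order
        else:
            raise ValueError("%s: not a TIFF file" % path)
        magic, ifd_offset = struct.unpack(endian + "HI", header[2:8])
        if magic != 42:
            raise ValueError("%s: bad TIFF magic number %d" % (path, magic))
        return header[:2].decode("ascii"), ifd_offset

    # One "batch" pass over a hypothetical directory of master images.
    for tiff in sorted(Path("masters").glob("*.tif")):
        print(tiff.name, read_tiff_header(tiff))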

Information Fidelity
When we create the image file we need to create a file which describes the source signal as accurately as possible. One aspect of this is, of course, resolution. We always say we want the maximum resolution we are capable of achieving, but how high a resolution do we need? In the Media Preservation Unit we have based our needs for "resolving power" on the RLG specifications for microfilming. Those specs are based on the ability to distinguish a lowercase e, which in turn requires the ability to distinguish 5 lines in a horizontal direction. Since a photograph probably won't contain an e, we use a standard reference material supplied by the National Institute of Standards and Technology (NIST) called SRM 1010a, which shows sets of line pairs at different sizes. We selected a resolution of 1200 DPI as the minimum resolution needed to distinguish 12 line pairs per millimeter.
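The arithmetic behind that figure is not spelled out here, but one plausible route to a number in this range, assuming the usual rule of two samples per line pair (the Nyquist criterion) plus a factor-of-two margin because line edges rarely land exactly on the sample grid, runs as follows:

    MM_PER_INCH = 25.4

    target_lp_per_mm = 12                          # line pairs we must distinguish
    lp_per_inch = target_lp_per_mm * MM_PER_INCH   # about 305 line pairs per inch
    nyquist_dpi = 2 * lp_per_inch                  # about 610 samples per inch, bare minimum
    practical_dpi = 2 * nyquist_dpi                # about 1220, i.e. roughly 1200 DPI in practice

    print("theoretical minimum: %.0f DPI; with margin: %.0f DPI" % (nyquist_dpi, practical_dpi))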

Our experiments with SRM 1010a led us to the conclusion that this was too subjective a method of evaluating resolving power. It also led us to ask the question, "How much resolution is too much?" There is a point where one scans in so much detail about a photograph that one can clearly see the shape of the silver particles themselves, but we've calculated that we are nowhere near this resolution. What we discovered is that scanner manufacturers treat DPI as a selling point and make excessive, although legally accurate, claims. The key is that they state how many samples per inch they are capable of taking, but they don't say how big the samples taken are.

Fortunately for us, the FBI had already determined the need for objective analysis of visual sharpness, and developed a system using a Modulation Transfer Function (MTF) and a sine wave target. We are now starting follow-up studies with the program and target that the FBI developed to measure the quality of fingerprint scanners. What we hope to get out of this is, first of all, a better understanding of where the "sweet spot" is at which we get the maximum resolving power without generating additional false information, and secondly, a better tool for evaluating future acquisitions of digital imaging devices.
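The FBI's actual program is not reproduced here, but the idea behind an MTF measurement can be sketched: scan a sine-wave target, measure how much of the contrast (modulation) in each frequency band survives the scan, and report the ratio. The pixel values below are invented purely for illustration:

    def modulation(samples):
        """Contrast of a sine patch: (max - min) / (max + min)."""
        return (max(samples) - min(samples)) / (max(samples) + min(samples))

    target_modulation = 0.90   # contrast printed on the target itself (assumed)

    # Hypothetical pixel values across one coarse and one fine sine patch.
    low_freq_scan = [235, 180, 40, 20, 45, 185]      # coarse detail: contrast largely survives
    high_freq_scan = [160, 140, 110, 100, 115, 145]  # fine detail: contrast starts to wash out

    for name, patch in [("low frequency", low_freq_scan), ("high frequency", high_freq_scan)]:
        mtf = modulation(patch) / target_modulation
        print("%s: MTF = %.2f" % (name, mtf))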

But the creation of the file is only half the story. Seeing doesn't take place on the printed page, or on a computer monitor; it takes place in the mind. Seeing, especially seeing color, is an event, and we all know this because we have seen that the color of a building will look quite different in late afternoon sunlight than it will at noon, and that an article of clothing that appears to be one color when viewed indoors, in artificial light, appears to be a different color when seen outside in natural light.

In the first half of the last century an international committee was formed which developed some standards for describing color. The CIE, the Commission Internationale de l'Éclairage, or International Commission on Illumination, established, amongst other things, a standard observer, which is a sort of theoretical viewing event from which many of the variables that affect viewing something can be factored out, and the ISO, the International Organization for Standardization, adopted the work of the CIE. More recently the International Color Consortium (ICC) defined a file format for characterizing imaging devices in accordance with the work of the CIE. When we scan an image, we supply an ICC profile of the capturing device, which basically explains how the device differed from the standard observer. An end user could then take the digital image, the ICC profile, and an ICC profile for their output device (either a monitor or a printer) and hope to faithfully recreate a seeing event similar to the CIE standard observer viewing the original photograph.
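To make the end user's half of that workflow concrete, here is a sketch using the Pillow imaging library's ImageCms module; the scan, the capture profile, and the monitor profile named below are hypothetical. The point is only that the image travels with a description of how the capturing device differed from the standard observer, and the viewer supplies the matching description of their own output device:

    from PIL import Image, ImageCms

    scan = Image.open("stanford_family_001.tif")

    # Profile supplied with the scan (how our scanner saw color) and a
    # profile describing the viewer's own monitor.
    scanner_profile = ImageCms.getOpenProfile("scanner_capture.icc")
    monitor_profile = ImageCms.getOpenProfile("viewer_monitor.icc")

    # Transform the pixel values so that, on that monitor, the viewing event
    # approximates the standard observer looking at the original print.
    rendered = ImageCms.profileToProfile(
        scan, scanner_profile, monitor_profile,
        renderingIntent=ImageCms.INTENT_PERCEPTUAL,
    )
    rendered.save("stanford_family_001_display.tif")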

Metadata

I wanted to finish by talking about metadata, which is really tied up with all three areas I have discussed. The ICC profiles I just mentioned, for example, are important metadata components, but metadata is also important for avoiding data loss due to media integrity problems or format obsolescence. Apparently, older records for both the general collection and special collections use very broad terms like "motion picture", which could mean 16mm color polyester film, 35mm B&W acetate film, or even PAL-format VHS, and so on. When we determine that we need to reformat all the U-matic videotapes, how do we locate them? Similarly, in the digital realm, if we have holdings on CD, we need to know that, as well as whether they are ISO-9660, use the Joliet extensions, or follow the Red Book specifications. We also need to know whether the files are encoded as TIFF, GIF, or PDF. General information like "computer file" is not going to be enough for us to manage all of the reformatting and data migration tasks that we face.
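As a small illustration of what happens when that metadata was never recorded: we are reduced to sniffing each file's published "magic number" to find out what we hold. A sketch along these lines (the directory name is hypothetical; the signatures are the documented ones for TIFF, GIF, and PDF) would at least produce a starting inventory:

    from pathlib import Path

    SIGNATURES = {
        b"II*\x00": "TIFF (little-endian)",
        b"MM\x00*": "TIFF (big-endian)",
        b"GIF8":    "GIF",
        b"%PDF":    "PDF",
    }

    def identify(path):
        """Guess a file's format from its first four bytes."""
        head = path.read_bytes()[:4]
        return SIGNATURES.get(head, "unknown: needs a proper catalog record")

    for item in sorted(Path("holdings").iterdir()):
        if item.is_file():
            print(item.name, "->", identify(item))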

Notes

1. LE as per ANSI/NAPM IT9.13-1996: "The length of time that information is predicted to be retrievable in a system under extended term storage conditions."

