
[padg] FW: Google Book Project



FYI. More news on the Google Books digitization project.

April 30, 2007

Google Books: What's Not to Like?
http://blog.historians.org/articles/204/google-books-whats-not-to-like


By Robert Townsend

The Google Books project promises to open up a vast amount of older
literature, but a closer look at the material on the site raises real
worries about how well it can fulfill that promise and what its real
objectives might be.

Over the past three months I spent a fair amount of time on the site as part of a research project on the early history of the profession, and from a researcher's point of view I have to say the results were deeply disconcerting.

Yes, the site offers up a number of hard-to-find works from the early 20th century with instant access to the text. And yes, for some books it offers a useful keyword search function for finding a reference that might not be in the index. But my experience suggests the project is falling far short of its central promise of exposing the literature of the world, and is instead piling mistake upon mistake with little evidence of basic quality control. The problems I encountered fit into three broad categories: the quality of the scans is decidedly mixed, the information about the books (the metadata, in info-speak) is often erroneous, and the public domain is curiously restricted.

Poor Scan Quality

My reading of the materials was not scientific or comprehensive, by any means, but a significant number of the books I encountered included basic scanning errors. For instance, the site currently offers a version of the Report of the Committee of Ten from 1893 (the start of the great curriculum chase for the secondary schools). It offers a catalog of scanning errors, as Google has double-scanned pages (page 3 appears twice, for instance), pulled in pages improperly so they are now unreadable (page 147 between pages 164 and 166), and cut off some pages (page 146, for example).

I've digitized a number of the AHA's old publications and appreciate that scanners don't always work as they should and pages can often get jammed. But even fairly rudimentary quality controls should catch those problems before they go live online. After years of implementing those kinds of quality checks here, precisely because friends in the library community took me to task about their necessity, I find it passing strange that so many libraries are joining in Google's headlong rush to digitize without similar quality requirements.

Faulty Metadata

Beyond the fundamental quality of the scanning, a more significant problem is the incredibly poor descriptive information attached to many of the books on the site (the metadata). This is particularly evident in the serial publications, where having the proper name and date of a publication is particularly important. Take for example a volume of the History Teacher's Magazine that is labeled as a volume of Social Studies (the name the magazine took in 1934) and dated as published in 1953 (even though it seems to be from 1917).

These kinds of problems have two unfortunate effects. First, they make it more difficult to place a particular work in time and thus actually locate a particular item discovered by using Google Books. Second, in many instances you will be unable to inspect public domain items more closely, because the erroneous date places the information on the wrong side of the copyright line.

Truncated Public Domain

These problems are exacerbated by Google's rather peculiar views on copyright. While taking an expansive view of copyright for recent works, it has taken a very narrow view of books that actually are in the public domain. As I have always understood it (and the U.S. Copyright Office confirms), works by the U.S. government are not eligible for U.S. copyright protection. But Google locks all government documents published after 1923 behind the same wall as any other copyrighted work. Among other things, that locks up works that should be in the public domain, such as the AHA's Annual Report (published by the Government Printing Office from 1890 to 1993) and circulars from the U.S. Bureau of Education. This problem is compounded by the often errant data about when these materials were published, which places these works even further beyond reach.

For more than a year now, Siva Vaidhyanathan, a cultural historian and media scholar at New York University, has been objecting that the rush to digitize is moving far in advance of considered thought. His concerns seemed rather abstract when I first heard them last year, but working with Google Books over the past few months made his objections seem much more tangible and worrying.

What particularly troubles me is the likelihood that these problems will just be compounded over time. From my own modest experience here at the AHA, I know how hard it is to go back and correct mistakes online when the imperative is always to move forward, to add content and inevitably pile more mistakes on top of the ones already buried one or two layers down. With Google adding more than 3,000 new books each day, the growth in the number of mistakes seems that much higher.

The problem of quality control only exacerbates my most basic worry about the larger rush to digitize every scrap of information: that we are adding to the pile much faster than the technology can advance to extract the information in a useful or meaningful way. When I have asked people who know a lot more about the technology than I do about this problem, they tend to wave their hands and mumble about brilliant scientists and technological progress. Forgive me if I remain unconvinced. Even as someone fairly proficient with Boolean search terms, I find a lot of the results from Google Books (and Google more generally) just page after page of useless and irrelevant information. I find it increasingly hard to believe that Google can add tens of thousands of additional books each month to the information pile, many containing basic mistakes in content and metadata, and the information results will actually grow better over time.

So I have to ask, what's the rush? In Google's case the answer seems clear enough. Like any large corporation with a lot of excess cash, the company seems bent on scooping up as much market share as possible, driving competition off the board and increasing the number of people seeing (and clicking on) its highly lucrative ads. But I am not sure why the rest of us should share the company's sense of haste. Surely the libraries providing the content, and anyone else who cares about a rich digital environment, need to worry about the potential costs of creating a universal library that is filled with mistakes and an impenetrable smog of information. Shouldn't we ponder the costs to history if the real libraries take error-filled digital versions of particular books and bury the originals in a dark archive (or the dumpster)? And what is the cost to historical thinking if the only substantive information one can glean out of Google is precisely the kind of narrow facts and dates that make history classes such a bore? The future will be here soon enough. Shouldn't we make sure we will be happy when we get there?


Robert Townsend
Assistant Director, Research and Publications
American Historical Association


