
Re: [ARSCLIST] National Recording Preservation Board (NRPB) Study



Steven Barr wrote:
> Jon Noring

> I said that there was a difference between discographic and cataloguing
> databases (the former would be more common for ARSC listeners). A
> catalog database refers to specific individual copies of a phonorecord,
> and as such must provide information concerning the copy held by the
> cataloguing party (price paid, specific location and possibly internal
> identification code, condition, any damage, usw.) along with such
> discographic data as is desired. OTOH, a discographic database provides
> data general to any and all copies of a phonorecord (catalog number,
> matrix number, credits *as they appear on the label*, takes known,
> actual artist data if known, usw.). Note that a catalog database, if
> made publicly available (via the Internet, usually) can serve as
> a discographic data source to the extent that is included.

Definitely, and my comment on the "WEMI" system of FRBR was intended
to cover this -- but I didn't explain how in my prior reply. I just
assumed those interested would look at the PDF document explaining
FRBR and WEMI.

WEMI stands for "Work-Expression-Manifestation-Item".

Discographical data covers the first three letters (WEM) with a strong
focus on Manifestation and Expression. An Item is an actual artifact that
is held by a repository or a collector.

Let's look at an example (data from Rust):

Work:          "St. Louis Blues" by W.C. Handy
Expression:    "St. Louis Blues" mx 14424-1, recorded by Clarence Williams 1933-12-06
Manifestation:  Vocalion 2676  (also Brunswick A-86050)
Item:           Copy of Vocalion 2676 in the collection of John Doe

Each level of WEMI has its particular metadata, which of course
intersects with the other levels. Catalog info tends to be Item
oriented, but obviously contains metadata of a discographic nature
that originates in the other levels. True discographic data lies in
the Expression and Manifestation levels. Song composition info lies in
the Work level. WEMI sort of ties the three major components together,
the three being song/composition, discographical info, and catalog
metadata. If one focuses on the discographic level, it is important to
provide the necessary hooks to allow it to tie to both
song/composition and catalog databases.
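
As a rough sketch of what such "hooks" could look like, here are the
four levels modeled as linked records in Python. The class and field
names are hypothetical (they are not part of FRBR itself), and tying
each Manifestation to a single Expression is a simplification, since
a real disc carries two sides:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Work:
    work_id: str
    title: str
    composer: str

@dataclass(frozen=True)
class Expression:
    expression_id: str
    work_id: str          # hook into a song/composition database
    matrix: str
    performer: str
    recorded: str         # ISO date string

@dataclass(frozen=True)
class Manifestation:
    manifestation_id: str
    expression_id: str    # hook into the discographic level
    label: str
    catalog_number: str

@dataclass(frozen=True)
class Item:
    item_id: str
    manifestation_id: str  # hook from a catalog database
    holder: str
    condition: str = "unknown"

# The example from this message, keyed by made-up IDs.
WORKS = {"W1": Work("W1", "St. Louis Blues", "W.C. Handy")}
EXPRESSIONS = {
    "E1": Expression("E1", "W1", "14424-1", "Clarence Williams", "1933-12-06"),
}
MANIFESTATIONS = {
    "M1": Manifestation("M1", "E1", "Vocalion", "2676"),
    "M2": Manifestation("M2", "E1", "Brunswick", "A-86050"),
}
ITEM = Item("I1", "M1", "John Doe")

def resolve(item, manifestations, expressions, works):
    """Walk from a catalogued Item up to its Work via the ID hooks."""
    m = manifestations[item.manifestation_id]
    e = expressions[m.expression_id]
    w = works[e.work_id]
    return w, e, m, item

if __name__ == "__main__":
    w, e, m, i = resolve(ITEM, MANIFESTATIONS, EXPRESSIONS, WORKS)
    print(f"{i.holder}'s copy of {m.label} {m.catalog_number}: "
          f"{w.title} (mx {e.matrix})")
```

The point of the ID fields is only that each level can live in its
own database and still be joined to the levels above it.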

Certainly, the WEMI system has its limits, and it is sometimes
difficult to apply it cleanly (but then there is no such thing as a
system that works perfectly for everything under the sun).
Nevertheless, it helps one to better visualize how song compositions
tie in with recordings, who made the recordings, and how the
recordings relate to real world artifacts (i.e., catalog info).


> As far as Jon's further comments, I defer to him, since these get
> into areas where I lack experience and knowledge. What I was thinking
> of (try to figure out how to end THAT clause with other than a
> preposition...?) was a cataloguing database which could be made
> available, possibly through ARSC, to anyone who wanted it...as
> freeware or very inexpensively. It would also have to have a
> user interface that was essentially intuitive, since the
> objective (to me) would be to accumulate an archive of as
> much phonorecord data as possible which could eventually
> lead to an "ultimate database" of nearly all phonorecords
> that still exist.

Yes, being able to merge individual collection catalogs will help in
building a universal catalog, which by definition starts to become
discographical in a global sense, since it begins to form the
picture of all that exists.


> Admittedly, since these early analog recordings are almost
> the opposite of computers and digitized data, there will be
> collections that aren't...and probably never will be...entered
> into a computer-based digital database. However, the RDI was
> conceived as a sort of "ultimate database" which was based on
> several large collections of phonorecords...while it encountered
> difficulties due to the state of the digital art at its
> conception, I tend to think it could be accomplished in
> this era of 500GB hard drives...in fact, the objective
> of the (I hope temporarily) sidetracked Project Gramophone...
> that being to create an archive of the contents of every
> known 78rpm recording as sound files...is becoming more
> practical/possible as I type! I estimate about three
> million, give or take, 78's were issued...almost all
> double-sided...so we need six million 3-minute sound
> files (I'm not allowing for the fact that some recordings
> showed up on as many as 20 labels!). Assume 1MB each, and
> we need six terabytes of storage (or 6 500GB drives, or
> about $2000 worth). Here again, I defer to the experts...
> but I feel this is something worth debating/discussing
> here on ARSCLIST...?!

Well, as I've noted before, we have to be anal when we massively
digitize our sound recording heritage. We must "do it right". This
means 96 kHz/24-bit (two channel) with lossless compression (and, of
course, using the right equipment to do the transfers).

A 3 minute side would be (if my calculations are correct, assuming
50% lossless compression, which seems to be fairly universal based on
"entropy" factors) about 52 megs in size. (The 1 meg for a 3 minute
side would be a quite lossy MP3 or similar compression, which would
sound quite poor. There are higher-quality MP3 or similar compression
schemes, and in my opinion the artifacts become mostly inaudible
around 128 kbit/s, which means about 2-3 megs in size -- but in any
case the weird things MP3 does to the audio signal render the file
unsuitable as a source for archiving and restoration, in my opinion.)

O.k., with 6 million sides, that works out (if my math is correct) to
312,000 gigs, or 312 terabytes, or about 1/3 petabyte.
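
For anyone who wants to check the arithmetic, here is a minimal
sketch in Python. It assumes decimal units (1 meg = 1,000,000 bytes),
exactly 180 seconds per side, and the 50% lossless compression ratio
mentioned above; the function and variable names are just for
illustration:

```python
def pcm_bytes(sample_rate_hz, bits_per_sample, channels, seconds):
    """Size of uncompressed PCM audio in bytes."""
    return sample_rate_hz * (bits_per_sample // 8) * channels * seconds

SIDE_SECONDS = 180          # assumed length of one 78 rpm side
SIDES = 6_000_000           # estimate quoted above

raw_side = pcm_bytes(96_000, 24, 2, SIDE_SECONDS)
lossless_side = raw_side // 2            # assumed 50% lossless compression
total_bytes = lossless_side * SIDES

# Same side as a 128 kbit/s MP3, for comparison.
mp3_side = 128_000 // 8 * SIDE_SECONDS

if __name__ == "__main__":
    print(f"raw side:      {raw_side / 1e6:.2f} MB")
    print(f"lossless side: {lossless_side / 1e6:.2f} MB")
    print(f"whole corpus:  {total_bytes / 1e12:.2f} TB")
    print(f"128k MP3 side: {mp3_side / 1e6:.2f} MB")
```

Rounding the per-side figure up to 52 megs gives the 312 terabytes
above; without rounding it comes out a touch under that.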

As of last year, one could put together a "petabox" (e.g., Brewster
Kahle at the Internet Archive has been working on one) for a little
over $1,000,000, with five-year maintenance costs (e.g., electricity,
sysadmins, replacing bad drives, etc.) of about $3-5,000,000 (from
memory of talking with Brewster last year).

Of course, there are several revolutionary technologies now on the
horizon which may increase storage densities by 10 to 100 times for
the same cost. Assuming a 100 times improvement (I've even heard 1000
to 10,000 times improvement, but let's look more near term), we are
back in the realm of a few tens of thousands of dollars per year to
maintain the complete archive of all 78's ever pressed, in the highest
possible digital quality (unrestored). It'd be pretty easy to find the
money to maintain *that* collection, and to redundantly duplicate it
in several locations as well as storing it on tape. (And if we see a
10,000 fold improvement in storage density for the same cost, we are
talking about a portion of a single $100 hard drive -- or similar
size hardware such as holographic storage -- holding the entire 78 RPM
corpus in high-quality sound files.)
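
To put rough numbers on the hardware side only, here is a small
Python sketch. It assumes a petabox holds one petabyte (1000 TB) and
costs a flat $1,000,000, and that cost falls in direct proportion to
the density improvement; maintenance costs are not modeled, and the
names are made up for illustration:

```python
ARCHIVE_TB = 312                # size estimated above
PETABOX_COST_USD = 1_000_000    # "a little over $1,000,000", rounded down
PETABOX_CAPACITY_TB = 1000      # assumption: one petabox = one petabyte

def hardware_cost_usd(density_factor):
    """Hardware cost of the archive if storage gets density_factor times cheaper."""
    return ARCHIVE_TB * PETABOX_COST_USD / PETABOX_CAPACITY_TB / density_factor

if __name__ == "__main__":
    for factor in (1, 10, 100, 10_000):
        print(f"{factor:>6}x: ${hardware_cost_usd(factor):,.2f}")
```

Under those assumptions the 10,000 fold case lands in the tens of
dollars, which is where the "portion of a single $100 hard drive"
figure comes from.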

About Project Gramophone, that is still alive and kicking, but is in
"quiet" mode to see how several things work out. Both the copyright
issue, and the cost needed *to do it right* on a massive scale (and
this includes a high level of quality control and uniform transfer),
are impediments, but not show stoppers. We are working on several
angles to address both.

I am adamant that the digitizing must be state-of-the-art quality -- I
won't support anything less. I've noted in prior replies *why* it is
important To Do It Right (tm). It is better not to do the organized
massive digitization until it is possible to do it properly.
Unfortunately, this takes a significant pile of $$$. We are working on
this -- it is more institutional/financial than technical, although
there are still a few technical issues that need to be researched and
resolved.

Jon Noring

