Thursday, April 12, 2007

"How Google Books is changing academic research"

Jo Guldi is a Ph.D. candidate in California looking for information on roads in the 18th and 19th centuries. She had, in proper library research fashion, explored the contents of the Yale, Harvard, and British Libraries, California libraries, and ultimately the contents of a large number of North American libraries via interlibrary loan.

Then she stumbled upon Google Books and turned up 20 new sources related to her research in a single search. That's 20 sources that her library research had given her no hint of.

How did this happen? I see two contributing factors, one fairly trivial, one major. The trivial one was her search. According to her blog post, she did a search for "roads", a single keyword across the vast collection of Google Books. While she doesn't say so, you can easily limit your search to fully available sources, which generally means out of copyright--mostly pre-1920 or so. This limits the search to the works most likely to be of use to her, 19th century texts, since those are both out of copyright and in good enough shape to be scanned. In short, she was lucky, and lucky with a search that few librarians would have recommended in a non-subject-specific index. (Confession: I actually do this sort of search in general databases occasionally. When you are having trouble, sometimes it's worth doing single-keyword searches to see if one particular keyword is limiting your search too much. At least you get a sense of the extent of the universe in which your more complex searches are operating.)

However, I don't think that is the real message of her search. Google Books also searches the full text of the books (both in and out of copyright). No library catalog can do that. And, as librarians know all too well, there is no way that a typical library catalog record can capture every topic covered by a book. A library catalog is, in a way, a simulation of a library. It models the library in a way that makes it easier to manipulate, but the information available is degraded or simplified. (If a simulation has all the information contained in the system it is simulating, it is that system, not a model. Simulations work because non-essential information is stripped out. However, sometimes you are wrong about what is non-essential. Library catalogs work fairly well for many needs, but they are not suited for every single need.) Google Books increases the amount of information within the simulation, and increases it to an amazing extent.
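
To make the catalog-versus-full-text difference concrete, here is a minimal sketch. The records, titles, and both search functions are made up for illustration; this is not how any real catalog or Google Books works internally. It only shows how a search over the simplified record can miss a book that a search over the text would find.

```python
from dataclasses import dataclass


@dataclass
class Book:
    title: str
    subjects: list[str]  # what a cataloger recorded about the book
    full_text: str       # (a snippet of) what the book actually says


# Made-up records, for illustration only.
BOOKS = [
    Book("A Treatise on Roads", ["Roads"],
         "The making of roads requires careful drainage."),
    Book("A Tour Through the Northern Counties", ["Voyages and travels"],
         "The roads beyond the town were ruinous."),
    Book("Letters of a Country Parson", ["Correspondence"],
         "The harvest was poor this year."),
]


def catalog_search(books, keyword):
    """Search only the simplified model: titles and subject headings."""
    kw = keyword.lower()
    return [b.title for b in books
            if kw in b.title.lower()
            or any(kw in s.lower() for s in b.subjects)]


def full_text_search(books, keyword):
    """Search titles plus everything the book says."""
    kw = keyword.lower()
    return [b.title for b in books
            if kw in b.title.lower() or kw in b.full_text.lower()]
```

With these sample records, catalog_search(BOOKS, "roads") returns only "A Treatise on Roads", while full_text_search(BOOKS, "roads") also returns the travel book, whose catalog record never mentions roads at all.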

It is still a simulation of a library. Copyrighted books are not available fully. You lose some physical knowledge of the book. It doesn't contain the range of books in a full research library (though Google would like it to, I'm sure). Google has not replicated the metadata produced by the experts (librarians) who cataloged the books in the first place. However, for what it does contain, it's hard to argue that it's not a better simulation of a library than your average library catalog is. The major loss is that metadata, which Google attempts to replace with tables of contents, indexes, and a "Key words and phrases" section.

For those unfamiliar with library metadata, I'm mostly referring to subject headings. Subject headings, and other controlled vocabulary systems, allow the assignment of one or more subject terms to a book, article, etc., with the specific terms being consistent over the entirety of the system, and not dependent on whether or not the author used that term. It's related to the way libraries shelve books by subject, so that all the gardening books are together, or all the versions and criticisms of Hamlet end up on the same shelf.

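A rough sketch of the controlled-vocabulary idea, with a hypothetical term-to-heading table and made-up catalog entries (real subject heading systems are far larger and more structured than this): whatever word the searcher or the author uses, it is first mapped to one authorized heading, and the catalog is searched by that heading.

```python
# Hypothetical mapping from everyday terms to a single authorized heading.
AUTHORIZED = {
    "roads": "Roads",
    "highways": "Roads",
    "turnpikes": "Roads",
    "letters": "Correspondence",
    "correspondence": "Correspondence",
}

# Made-up catalog: title -> set of assigned subject headings.
CATALOG = {
    "The King's Highway": {"Roads"},
    "Turnpike Trusts of England": {"Roads"},
    "Letters of a Country Parson": {"Correspondence"},
}


def find_by_subject(catalog, term):
    """Translate the searcher's term to its heading, then match on the heading."""
    heading = AUTHORIZED.get(term.lower())
    if heading is None:
        return []  # term is not in the controlled vocabulary
    return sorted(title for title, headings in catalog.items()
                  if heading in headings)
```

In this toy catalog, find_by_subject(CATALOG, "highways") and find_by_subject(CATALOG, "turnpikes") return the same two books, even though neither title uses the word "roads". That consistency is what keyword matching on the text alone cannot promise.
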
Google's "Key words and phrases" imitates this, but is strictly dependent on the words used in the actual book. I'm assuming that this is an automated process, so it is probably related to the frequency of use and the placement of the words. I would guess that a word used over and over again will score high for inclusion on this list, as will words used in the title, as chapter headings, etc. This leads to some interesting choices: a book of General Sherman's letters includes "affectionately yours" as a key phrase. It's an interesting fact of letter conventions at the time, but not likely to be a good search phrase. (In case you are curious, the phrase "affectionately yours" turns up 1730 books in Google Book Search, mostly, though not entirely, books of letters. OK, I can see a use for this search, but how often do you need to know the relative frequency of different salutations in different eras? Getting back to the subject, we do lose the specificity of the LC subject heading "Correspondence"--used with a name to specify the letters of this person.)
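
My guess about frequency and placement can be sketched in a few lines. To be clear, this is only an illustration of the guess, not Google's actual method, which I don't know; the function name, the title_boost weight, and the sample text are all invented.

```python
import re
from collections import Counter


def key_phrases(title, text, top_n=2, title_boost=5):
    """Score two-word phrases by how often they occur, with extra weight
    for phrases that appear in the title (a guessed heuristic)."""
    def bigrams(s):
        words = re.findall(r"[a-z']+", s.lower())
        return [" ".join(pair) for pair in zip(words, words[1:])]

    scores = Counter(bigrams(text))
    for phrase in bigrams(title):
        scores[phrase] += title_boost
    return [phrase for phrase, _ in scores.most_common(top_n)]


# Invented sample: three short letters, each with the same closing.
SAMPLE_TITLE = "Sherman Letters"
SAMPLE_TEXT = (
    "Dear brother, the army moves tomorrow. Affectionately yours. "
    "Dear sister, we reached the river. Affectionately yours. "
    "Dear brother, all is well. Affectionately yours."
)
```

Running key_phrases(SAMPLE_TITLE, SAMPLE_TEXT) gives "sherman letters" followed by "affectionately yours": the closing wins a place on the list purely by repetition, which is the kind of choice a cataloger assigning "Correspondence" would never make.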

I hope that Google will see the advantage of not tossing out decades of metadata that would only increase the usefulness of their system. Just add subject headings to the keywords already in use. It does lead to an interesting question of who owns library catalog records. Could Google buy them from someone? And from whom? LC? OCLC? Individual libraries as the books are scanned? Are they copyrighted, or is an individual record similar to an entry in a phone book?

To jump back to Jo for a final moment, she also points out one other service that Google Books performs: multiple access. She found a book that was invaluable to her research. She had kept it for months, paying overdue fines and all, to ensure access to a book that she had had such a hard time getting hold of. Here comes Google Books with instant access, for as long as she needs it, and in no competition with any other researcher. (Previously, I had mentioned a tip for using Google Books for copyrighted books before putting in interlibrary loan requests.)

I'm not sure where Google Books is going, or even where academic research is going, but I do conclude that Google Books should be one of the several avenues of library research that good scholars pursue.

