6.03.2009

It's all about discovery

These are some thoughts that came out of reflecting on a presentation by Dan Clancy, engineering director at Google: Google Book Search Project: Present Status and Next Steps for the Google Book Search Project, presented at Archiving 2009, May 2009.


Some impressive statistics were revealed at the beginning of this talk – I think these numbers should cause the library world to sit up and take notice (if it hasn't done so already). Of the 10 million items included in Google Book Search, users preview 81% of the content contributed by partners and 78% of the public domain content every month. The daily numbers are equally impressive: 40% of partner content and 17% of public domain content get previewed every single day. Most of the traffic comes from Google.com.

That really puts a spin on the notion of the long tail. The backlist is heavily used because of discovery. It is all about discovery. Having full text available and searchable makes it discoverable.

I was interested in what Clancy had to say about quality assessment (QA). This is an issue that has plagued our group since we started our first film-to-digital books project, Beyond the Shelf. We treated those images as if they were precious objects. We scanned and did QA on about 1,000 books. We really wanted the page images to look great. That speaks to our obsessive-compulsive nature, I think. We wanted perfection. Eventually it hit us in the face that this approach carried a huge cost, and that cost was quantity. We had fantastic discussions about cost, scalability, and feasibility as they related to the number of searchable page images we could make available.

Clancy observed that as projects push toward 99.9% confidence in the quality of the image/OCR, each additional “9” adds an order of magnitude in cost.

This reflects our experience as well. While we really want excellence across the board, at some point we have to concede to quantity and develop a good workflow for correcting errors as they are reported.
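Clancy's "each nine costs an order of magnitude" point is easy to sanity-check with a back-of-the-envelope model. The little sketch below is my own illustration, not figures from the talk: it simply assumes that each additional "nine" of confidence multiplies the relative QA cost by ten.

import math

BASE_COST = 1.0  # relative cost of reaching about 90% confidence (one "nine")

def relative_qa_cost(confidence):
    # Rough model: each extra "nine" of confidence multiplies the cost by ten.
    nines = -math.log10(1.0 - confidence)  # 0.9 -> 1, 0.99 -> 2, 0.999 -> 3
    return BASE_COST * 10 ** (nines - 1)

for c in (0.9, 0.99, 0.999):
    print("%.3f confidence -> about %.0fx the cost" % (c, relative_qa_cost(c)))
# 0.900 -> about 1x, 0.990 -> about 10x, 0.999 -> about 100x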

Clancy said that Google Book Search is doing just this: they strive for as good an initial capture as possible and have developed good QA to catch errors. They also fix problems as they are reported. They have committed to: a. keep making the software smarter, b. keep taking user input, and c. fix things as needed. Fixing errors as they are reported is cheaper because each one is a small, contained problem.

I think this is really the bottom line for those of us creating digital content. It is about quantity, and about developing methods and processes to create the content faster and cheaper. Here at Kentucky we have learned to do one of the hardest types of content – newspapers (here is a link to information about our NDNP participation). We are looking at developing efficiencies and balancing quality against quantity – is there a happy medium? Newspapers present the added challenges of small fonts, reading-order problems, publishing errors (metadata problems for the most part), and so on. If we can make newspaper digitization faster and better, it is all to the good.
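To make that quality-versus-quantity balancing a bit more concrete, here is a minimal sketch of one way a triage step could work. It is only an illustration, not our actual NDNP workflow: it assumes ALTO XML output of the kind NDNP batches use, where each String element can carry an optional word-confidence (WC) attribute, and the file path and the 0.80 threshold below are placeholders.

import xml.etree.ElementTree as ET

ALTO_NS = "{http://www.loc.gov/standards/alto/ns-v2#}"  # ALTO 2.x namespace

def page_confidence(alto_path):
    # Average the per-word OCR confidence (the optional WC attribute, 0.0-1.0)
    # across every String element on the page.
    tree = ET.parse(alto_path)
    scores = [float(s.get("WC")) for s in tree.iter(ALTO_NS + "String") if s.get("WC")]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical usage: send low-confidence pages to manual review,
# let everything else pass straight through.
if page_confidence("batch_0001/page_0001.alto.xml") < 0.80:
    print("flag this page for review")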

As we examine what else to digitize, it seems to me that we can look through our collections for the unique items and start there. There is no point in duplicating the efforts of the libraries already partnering with Google. As we mine our collections, I believe we will discover a great deal of unique material in our special collections. Adding those items to the corpus of digital, discoverable content will be good for everyone. Let’s get going!
