The Data Driven Library

Once again, this is my first post in quite a long time. But I've been busy. I think I will keep a sporadic log here of the ways in which we develop new Web services at the Boston College Libraries, using a combination of LogiInfo, Java Web apps, Web services, Perl CGI, Ajax, perhaps some ASP.NET and, frankly, whatever else works.

We are calling this endeavor the Aerie Project. The purpose of the project is simple and open-ended: to create a framework/(dare I say it?) portal/dashboard/iLibrary to deliver online library services that take into account the student/faculty/staff member's context - and to decouple these services from the Aerie framework and reuse them in other environments to meet the library's overall service goals. Simple, right? Everyone's doing it. Web/Library 2.0 and all.

For me, one of the inspirations for this project was Lorcan Dempsey's 2003 article The Recombinant Library. In it, Dempsey presents an excellent distillation of the development of information "portals" and the problems they are meant to address, as well as some of the problems such portals themselves present. With much foresight, he muses on the possibilities presented by decoupling services from specific library systems and making them available in other contexts - for example, in course management systems or campus portals/intranets.

He says, "The major development issue facing libraries today is how to create a network environment which is rich in services and which meshes with user behavior in useful and convenient ways."

Indeed. That is it in a nutshell, and, despite the fact that Dempsey's article is more than five years old, this contention still holds true.

The difference, perhaps, is that we - people, not just librarians - are much, much more aware of the novel ways that decoupled, granular data services can be combined and reused. Think mashups, etc. When Dempsey was writing in 2003, Web Services were, of course, just emerging.

In any case, when I first read the article, the conceptual framework Dempsey provided really helped me to begin to think coherently (I hope) about possible ways to provide richer, more flexible services that can be woven into user behavior more easily.

If you think about it, we - as a University Library - actually know a hell of a lot about our users. Not only do we know what books they have checked out (now and in the past), we know their current schedule, their major(s)/minor(s), what degree(s) they are pursuing, where they are from, where they currently live, and what events are going on in their department/major area of study. If I took the time to think some more, I could probably come up with a lot more stuff we know (or can infer).

So, what do we do with all of this juicy info?

This is just the first set of thoughts and ideas. There is a lot more...

A Great Recommender System
And a Fantastic Application of Collective Intelligence

BibTip is a signal example of the potential inherent in harnessing collective intelligence to serve the needs of the library. BibTip uses Andrew Ehrenberg's "Repeat Buying Theory" as a framework to statistically analyze user search behavior. Repeat Buying Theory is a highly successful and well-tested statistical framework for describing the regularity of consumers' repeat-buying behavior within a distinct period of time.

The developers of BibTip at Karlsruhe University in Germany very skillfully adapted Ehrenberg's Repeat Buying Theory to the session-based search behavior of library OPAC users. The key is that BibTip only records the inspection of the full details of an individual bib record selected from a larger list of search results. It does not "follow" the user. In this framework, clicking on and reading the full details of a given record is viewed as an economic choice. That is, the choice of one record over all of the others in a given list is very similar to an individual's choice to purchase one thing over another during a given trip to the store. There is a real cost in time (i.e., an economic cost) for the user each time he/she selects and views a record. It can be assumed that the "search cost" to a user is high enough that he/she is willing to view the details of only those records that are truly of interest. Users, in effect, are self-selecting. That is, users with common interests will select the same documents, and, since recommendations are only provided to users from the full details view, we can surmise that recommendations are only offered to interested users.

In order to build relationships among given documents, BibTip analyzes record pairs. For each record X that has been viewed in the full details view of the OPAC, a "purchase history" is built. This is simply a list of all of the sessions in which record X has been viewed. Record X is then compared with all other records (Y) which have been viewed in the same session as X. For each pair of records (X,Y) that have been viewed in the same session, a second purchase history is built. The number of sessions in which record X and another record Y were both viewed is statistically analyzed, and the probability of a “co-inspection” of records X and Y in a given session is calculated. A recommendation for record X (that is, "users who liked X also liked…") is created when record Y has been viewed in the same session more often than can be expected from random selections (in statistics, the recommended record would be an "outlier").
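The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not BibTip's actual code: the input format (a dict of session ids mapped to the record ids opened in the full details view) and the function names `build_histories` and `co_inspection_counts` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_histories(sessions):
    """Build per-record and per-pair "purchase histories".

    sessions: dict mapping a session id to the record ids whose full
    details were viewed in that session (hypothetical input format).
    """
    record_history = defaultdict(set)  # record X -> sessions in which X was viewed
    pair_history = defaultdict(set)    # (X, Y) -> sessions in which both were viewed
    for session_id, records in sessions.items():
        unique = sorted(set(records))  # repeat views within one session count once
        for record in unique:
            record_history[record].add(session_id)
        for x, y in combinations(unique, 2):
            pair_history[(x, y)].add(session_id)
    return record_history, pair_history


def co_inspection_counts(pair_history, x):
    """For record x, count the sessions it shares with every other record."""
    counts = {}
    for (a, b), shared_sessions in pair_history.items():
        if a == x:
            counts[b] = len(shared_sessions)
        elif b == x:
            counts[a] = len(shared_sessions)
    return counts
```

Calling `build_histories` on a session log and then `co_inspection_counts(pair_history, "X")` gives, for record X, the raw co-inspection count of every record Y seen alongside it. These counts are only the input to the statistical step; they say nothing yet about which pairs are more than noise.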

This “repeat buying theory” is remarkably good at automatically determining relevant recommendations for a given item. This is because the theory actually models the "noise" created by random clicks on records in a list of search results - that is, groups of records that are clicked on, but which are not actually related. Diving into a record quickly and backing out quickly falls well within the repeat-buying theory model. Recommendations are based upon those records that fall outside of regular random co-browsing - the outliers. To quote Dr. Geyer-Schulz (one of the developers of BibTip):

"Ehrenberg's theory faithfully models the noise part of buying processes. That is, repeat-buying theory is capable of predicting random co-purchases of consumer goods. Intentionally bought combinations of consumer goods--a six-pack of beer, spareribs, potatoes, and barbecue sauce for dinner, for example--are outliers. In this sense, Ehrenberg's theory acts as a filter to suppress noise (stochastic regularity) in buying behavior." [From: Andreas Geyer-Schulz, Andreas Neumann and Anke Thede. An Architecture for Behavior-Based Library Recommender Systems. Information Technology and Libraries 22(4), p.169 (2003).]

That is, *most* of the given transactions are noise. Search terms and strategies are irrelevant. The co-browsing of records that lies outside of the usual background noise is the browsing that needs to be examined for potential recommendations.
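To make the "noise filter" idea a bit more concrete, here is a rough Python sketch. It assumes the noise is modeled with a logarithmic series distribution (the distribution usually associated with Ehrenberg's repeat-buying model), fitted to the co-inspection counts of a single record X by matching the mean. The function names, the `alpha` cutoff, and the fitting method are my own assumptions for illustration; I am not claiming this is how BibTip itself is implemented.

```python
import math


def lsd_pmf(r, q):
    """P(R = r) for a logarithmic series distribution, r = 1, 2, 3, ..."""
    return -(q ** r) / (r * math.log1p(-q))


def lsd_mean(q):
    """Mean of the logarithmic series distribution with parameter q in (0, 1)."""
    return -q / ((1.0 - q) * math.log1p(-q))


def fit_q(sample_mean, tol=1e-9):
    """Find q whose LSD mean matches sample_mean (must be > 1) by bisection."""
    lo, hi = 1e-12, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if lsd_mean(mid) < sample_mean:  # the mean grows with q
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def tail_probability(r, q):
    """P(R >= r): chance of r or more co-inspections arising from noise alone."""
    return 1.0 - sum(lsd_pmf(k, q) for k in range(1, r))


def recommend(counts, alpha=0.05):
    """Return the records whose co-inspection count with X looks like an outlier.

    counts: dict mapping record Y -> number of sessions shared with X.
    alpha:  hypothetical cutoff; a smaller value gives fewer, stricter results.
    """
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    if mean <= 1.0:  # every pair was seen only once: nothing but noise
        return []
    q = fit_q(mean)
    flagged = [y for y, r in counts.items() if tail_probability(r, q) < alpha]
    return sorted(flagged, key=lambda y: counts[y], reverse=True)
```

With a record whose neighbors were almost all co-inspected once or twice and one neighbor co-inspected a dozen times, only that last neighbor should survive the filter. The single clicks and the occasional double are exactly what the fitted noise model expects.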

It takes some time for enough data to be collected so that good recommendations are available for a substantial part of a collection, but what is the hurry? Of course, the longer you have the algorithm running, the better your recommendations become. The more users you have, the better your recommendations become. But, time is on our side in this case ;-)

Thursday, April 2, 2009

The Boston College Libraries, LogiInfo and Contextual Delivery of Library Services

Friday, May 16, 2008

BibTip Recommender System: How it works