Tuesday, March 11, 2008

Vendor-Supplied Versus Open Source

The products you saw Friday 3/7/08 and at our last meeting do not replace our library catalog MadCat. Instead they require us to export data from MadCat, process it, possibly merge it with data from one or more other sources, then feed that data on a regular basis into a powerful indexer and search environment for our patrons to use.

So a main goal is faster, easier, more powerful and more flexible searching and retrieving of data. Another goal is having the ability to change as new features and new ideas and new methods of presenting data become available.

Are we going to be stuck with a rigid look and feel that is simply ‘newer’? Or could we output our data in a way that, as new ideas come along or new mobile devices become more available, our data can be readily adapted to have a new look and work in a different way that suits our rapidly changing needs?

And can our output of data be pre-sorted and relevancy-ranked according to criteria we possibly have control over?

So, the question is, do we pay up front for a vendor-supplied solution where these “paths” of exporting data have already been set up for us, and the look and feel is only moderately under our control? Or do we use vendors who provide the infrastructure but use APIs to let us design the exact interface we want? An API (application programming interface) is a source code interface that an operating system or environment provides to support requests for services to be made of it by computer programs. (This is a Wikipedia quote from a Computerworld article.)
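To make the API idea concrete: with an API-based approach, our own code sends structured requests to the vendor's search service and decides entirely for itself how to display the results. Here is a minimal sketch in Python; the endpoint URL, parameter names, and response fields are all invented for illustration and do not belong to any real vendor's API.

```python
import json
from urllib.parse import urlencode

# Hypothetical search endpoint -- a real vendor API would document its own.
BASE_URL = "https://catalog.example.edu/api/search"

def build_search_url(query, page=1, per_page=20):
    """Build a request URL for a keyword search against the (hypothetical) API."""
    params = {"q": query, "page": page, "rows": per_page, "format": "json"}
    return BASE_URL + "?" + urlencode(params)

def render_results(response_text):
    """Turn a JSON response into whatever display we choose -- the whole point
    of an API is that this presentation layer is entirely ours to change."""
    data = json.loads(response_text)
    return ["{} ({})".format(r["title"], r["year"]) for r in data["results"]]

# A canned response stands in for a live API call:
sample = '{"results": [{"title": "Walden", "year": 1854}]}'
print(build_search_url("thoreau"))
print(render_results(sample))
```

The display function is the part we would rewrite freely as new ideas or new devices come along; the request format is the part the vendor keeps stable.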

AquaBrowser, Primo, Endeca, and WorldCat Local are all vendor-supplied, and all have this capability and have already been tested in large institutions like ours. The Digital Library Federation is also working on a list of features any API from an ILS should be prepared to support.

Another option is to use WorldCat Local’s API, which gives us the ability to use WorldCat Local’s underlying structure but write our own public interface using ‘calls’ back to the data. David Walker showed his interface code based on the WorldCat Local API at Code4Lib very recently, so this capability is functioning at some level right now. Clearly OCLC recognizes the importance of offering multiple options for differing types of organizations, and of allowing local control and innovation on a stable underlying base.

Either way we choose, we’ll need staff to set up the processing from MadCat and other sources. And if we go the Open Source route, which could potentially offer us the most flexibility, we have to make a staff investment to be able to make the changes and implement new features. Some of this cost of change might be lessened with a vendor-supplied solution, but then, depending on the vendor, we could be right back where we are now: running on a dinosaur catalog infrastructure while the web world changes so rapidly around us.

If we go the Open Source route, Steve Meyer recently reminded me of a quote by Richard Stallman: “Think free as in free speech, not as in free beer.” (I should add that I am somewhat misusing Stallman’s quote here. He considers the Open Source movement a very watered-down version of the Free Software Movement, and he really wasn’t a supporter of it; he wants software to really be free for ethical reasons, not just the practical reasons behind the Open Source movement.)

The main point I want to make is that whatever solution we take is going to cost people, $$, and time. So the most important decision I think we can make is to choose a platform and a path where we keep the doors open to make at least look-and-feel changes, and even underlying structure changes, very easily and very rapidly. The data needs to be in our control. I mean, aren’t you sick of asking ‘can’t we do xxxx with MadCat?’ and having someone like Curran Riley or Edie Dixon say ‘No, we can only change yyyy, not xxxx’? :-)

But one thing to keep in mind is our size. It’s far more work and effort to do this level of relevancy ranking ‘on the fly’ on large sets of data. And that’s why either a vendor or Open Source solution really needs us to export and pre-process the data. We have on the order of 6 million bibliographic records, and we want to mesh this with data from additional sources. Steve Meyer was recently telling me that he had learned from our very knowledgeable DoIT LIRA staff that once you get over about 1 million records, the processing and work needed to index that many records is a completely different beast and FAR more complicated.

VuFind, the only Open Source project we’ve demoed so far in the last forum, hasn’t yet tackled this technical issue of scaling to millions of records. It is currently indexing well under 1 million records. However, it is built on an underlying Open Source structure (using Lucene and Solr from the Apache Software Foundation) which has the ability to handle larger databases. We do have excellent technical staff here at DoIT, but we’ll need more if we choose to undertake this level of work.
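To make the export-and-index step concrete, here is a minimal sketch in Python of flattening one exported bib record into the kind of field/value document a Solr-style indexer expects. The record layout and field choices are simplified and made up for illustration; this is not VuFind's actual import pipeline, just the general shape of the pre-processing work.

```python
import json

def marc_to_solr_doc(record):
    """Flatten a (simplified) exported bib record into a Solr-style document.
    A real MARC export has hundreds of fields; these three are illustrative."""
    return {
        "id": record["bib_id"],
        "title": record.get("245", ""),   # MARC 245 = title statement
        "author": record.get("100", ""),  # MARC 100 = main entry, personal name
    }

# One record from a hypothetical MadCat export:
exported = {"bib_id": "b1234567", "245": "Walden", "100": "Thoreau, Henry David"}
doc = marc_to_solr_doc(exported)

# A batch of such documents would then be posted to the indexer's update
# handler; here we just show the payload that would be sent.
payload = json.dumps([doc])
print(payload)
```

Doing this once per night for 6 million records, plus merging in other data sources, is exactly the kind of batch pre-processing that makes fast ranked searching possible at query time.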

And other Open Source projects are also coming along using an underlying structure that can support our needs. One of these is the eXtensible Project, as Karen pointed out.

Thank you.


Anonymous said...

Have you also considered Auto-Graphics' product AGent Search? From what I can read about it and tell from working a bit with iCONN, it is pretty customizable and allows searching across many resources, e.g. catalog and databases.

Kate Ganski

Sue D. said...

No, I haven't looked at this product. Thanks for pointing it out. I'll follow up!