Obviously, searching can get pretty complex, and many pitfalls can prevent a user from achieving success. So how does it get done in the non-Web world, and can we learn anything from it?
In the real world, reference librarians and other information professionals often make the difference. In fact, without them, civilization would grind to a halt. They are better than anyone else at finding information because they break up what seems to be a huge, complex information need into simpler, more digestible components. They do this by conducting a reference interview, which is designed to learn more about the information need and its context (unless, of course, you're just looking for the bathroom or the copiers!). Before you get spooked by the term reference interview, consider that you probably have been through quite a few of them yourself. When you go to the library and ask someone behind the reference desk a question, they'll probably respond with an open question, such as "Can you tell me a little more about how you'll be using this information?" The interview will often continue with more specific questions, such as "Do you need this information for business (or school, a dissertation, personal enjoyment, etc.)?" "Do you need it right away (or can we take some time to do some more involved searching or interlibrary loan for it)?" "Are you looking for something at no cost (or would you like us to do a literature search in some commercial databases like LEXIS/NEXIS or DIALOG)?" "Are you looking for a few items (or do you need all there is)?" and so on. These interactive iterations help the librarian understand what you're looking for, and may also help you better understand your own needs by forcing you to articulate them. In effect, both you and the librarian engage in associative learning about the information need. Associative learning comes naturally to humans, but is extremely difficult for software systems to handle.
Can a web site do what a reference librarian does? Well, sort of, but not quite. We've already covered a sample of the variation found in users and their information needs, and we know that well-architected sites can largely address these needs. If we can determine the major needs of our sites' users and take steps to address them, then perhaps we'll cover 80% of all possible search queries. That would be wonderful, as most sites probably don't do half that well. But that other 20%, the really tricky stuff, can't be handled by automated means like a web site. You really do need humans to help out in those situations, because only humans are really good at figuring out context and knowing the right questions to ask. Don't hold your breath for this issue to be solved by an automated approach, such as with an intelligent agent. Instead, consider making someone in your organization (maybe the librarian, if your organization employs one) responsible for handling the tough queries, and make sure your site actively seeks feedback and directs it to those human information specialists.
6.5 Indexing the Right Stuff
So, let's get back to whether you need a search engine. Let's assume that you do intend to slap a search engine on top of your web site. Shouldn't be a problem, right? Just point the indexer at the directory where all the pages live, and, voilà! Searchable site!
Of course, you knew it wasn't that simple. Searching only works well when the stuff that's being searched is the same as the stuff that users want. This means you may not want to index the entire site. We'll explain.
6.5.1 Indexing the Entire Site
Search engines are frequently used to index an entire site without regard for the content and how it might vary - every word of every page, whether it contains real content or help information, advertising, navigation menus, and so on.
However, searching works much better when the information space is defined narrowly and contains homogeneous content. In other words, the more you search through indices that combine apples and oranges, the worse your retrieval results will be. After all, when you search a site, you're probably looking for apples only, not oranges. As already discussed, a site's content is usually a mix of apples, oranges, kumquats, bell peppers, chainsaws, and Barbie dolls to begin with. So, when you tell your search engine to index your entire site, the site's users will be performing searches against all kinds of stuff - navigation, destination, and other kinds of pages - all at once. What they retrieve can often be ugly.
Let's try an example to see what happens. Searching Netscape's site for plug-ins, what do we find? Exactly 100 documents. Of these:
• 58 documents are Welcome to Netscape Navigator version X.X pages for just about every version of Netscape Navigator and include information about plug-ins.
• 16 documents are in German (a language I don't read).
• 6 documents contain the potentially relevant term application in their titles, but 5 of these 6 have exactly the same title (Netscape Handbook: Application Features).
• 2 documents actually contain plug-in in their titles.
• 18 other assorted documents may be relevant, but are not labeled in a way that indicates whether this is the case.
Analyzing these search results, we find two common problems. First, we are presented with documents that clearly don't belong. If the site had been selectively indexed with audience differences in mind, 16% of the results would not have been displayed at all. Second, regarding relevant documents, it's not clear why we need 58 versions of the same type of document. It would have been useful to index pages more selectively, such as files relevant to Windows or Macintosh users, or recent versions versus older versions of the software. Are very many people still interested in old Netscape Beta versions? So, our search is less successful than it could have been; it gave us a lot of irrelevant documents, and too many that could be relevant.
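The arithmetic behind this breakdown is easy to check. The short Python sketch below tallies the five groups of results and computes what share of the whole each one represents. The counts come from the list above; the category labels are shorthand invented for this sketch and do not come from Netscape's site.

```python
# Counts copied from the breakdown of the search results; the category
# labels are hypothetical shorthand, invented for this sketch.
results = {
    "welcome_pages": 58,
    "german": 16,
    "application_in_title": 6,
    "plugin_in_title": 2,
    "other": 18,
}

total = sum(results.values())


def share(category):
    """Percentage of all results that fall into one category."""
    return 100 * results[category] / total


print(total)                      # 100
print(share("german"))            # 16.0
print(share("plugin_in_title"))   # 2.0
```

Seen this way, only 2% of the results announce their relevance in their titles, while 16% could have been excluded outright by a language-specific index.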
Our search performed poorly because all the content in the site was indexed together. By doing so, the site's architects chose to ignore two very important things: that the information in their site isn't all the same, and that it makes good sense to respect the lines already drawn between different types of content. For example, it's clear that German and English content are vastly different and that their audiences overlap very little (if at all), so why not create separately searchable indices along those divisions?
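To make the idea of separately searchable indices concrete, here is a minimal Python sketch. It assumes each page has already been labeled with a language and a kind (navigation or destination); the page data, the `build_indices` and `search` functions, and the labels are all hypothetical and are not taken from any particular search engine. The sketch builds one small inverted index per language and leaves navigation pages out entirely.

```python
from collections import defaultdict

# Hypothetical pages: (url, language, kind, text). All values are made up
# for illustration only.
PAGES = [
    ("/en/plugins.html", "en", "destination", "Installing plug-ins for Navigator"),
    ("/de/plugins.html", "de", "destination", "Plug-ins für Navigator installieren"),
    ("/en/index.html", "en", "navigation", "Home Products Plug-ins Support"),
]


def build_indices(pages, skip_kinds=("navigation",)):
    """Return {language: {term: set of urls}}, one inverted index per language."""
    indices = defaultdict(lambda: defaultdict(set))
    for url, language, kind, text in pages:
        if kind in skip_kinds:
            continue  # navigation pages are not what users search for
        for term in text.lower().split():
            indices[language][term].add(url)
    return indices


def search(indices, language, term):
    """Look up a single term in the index for one language only."""
    return sorted(indices.get(language, {}).get(term.lower(), set()))


indices = build_indices(PAGES)
print(search(indices, "en", "plug-ins"))  # ['/en/plugins.html']
print(search(indices, "de", "plug-ins"))  # ['/de/plugins.html']
```

An English-speaking user searching the "en" index never sees the German page or the navigation page, even though both mention plug-ins. A real indexer would handle tokenization, stemming, and ranking far more carefully; the point here is only the partitioning.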
The site designers at Netscape are already doing this, in a limited way. They have put a lot of effort into helping you download the right version of the software from the nearest location. To download the software, you get asked several questions (not unlike those in a reference interview). As shown in Figure 6.15, the site asks the user:
• What operating system does your computer use?
• What language do you speak?
• Which of our products do you need?
The result is a list of links to download sites that provide the user the right information (i.e., software appropriate to the user's platform), taking into account his or her geographic location and language. Why not apply this same careful approach to matching users with the right information to the entire site, instead of just to this specific situation?
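One way to picture this question-driven matching is as a filter over labeled content. The Python sketch below is an illustration under stated assumptions, not a description of how Netscape's site works: the `Item` class, the `narrow` function, the field names, and the catalog entries are all hypothetical. Each answer the user gives removes the items that don't fit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """A hypothetical piece of site content, labeled the way the questions ask."""
    title: str
    platform: str
    language: str
    product: str


# Made-up catalog entries, for illustration only.
CATALOG = [
    Item("Navigator for Windows (English)", "windows", "en", "navigator"),
    Item("Navigator for Macintosh (English)", "macintosh", "en", "navigator"),
    Item("Navigator for Windows (German)", "windows", "de", "navigator"),
]


def narrow(catalog, **answers):
    """Keep only the items that match every answer the user has given so far."""
    return [
        item
        for item in catalog
        if all(getattr(item, field) == value for field, value in answers.items())
    ]


matches = narrow(CATALOG, platform="windows", language="de", product="navigator")
print([item.title for item in matches])  # ['Navigator for Windows (German)']
```

Because `narrow` accepts any subset of answers, the same function works after one question or after all three, which mirrors how a reference interview narrows the need step by step.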