Miscellaneous Supporting Software Tools
3 Information Retrieval Tools
A wide range of open source search engines and information retrieval tools is available on the web from SourceForge (http://www.sf.net/). These systems fall into two main groups: those that use inverted files and those that use database systems. The sections below look at some of the most popular search engines.
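To make the term "inverted file" concrete, the following is a minimal sketch in Python of an in-memory inverted index with a Boolean AND search. It is not taken from any of the tools described here; the function names and the sample documents are assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search_all(index, query):
    """Return ids of documents containing every query word (Boolean AND)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample collection, for illustration only.
docs = {
    1: "Open source search engines",
    2: "Search tools that use inverted files",
    3: "Database systems for retrieval",
}
index = build_inverted_index(docs)
print(sorted(search_all(index, "search engines")))
```

A database-backed engine would typically keep the same word-to-document mapping in tables rather than in a dedicated file structure.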
3.1 Ht://Dig
Description: The ht://Dig system is a complete World Wide Web indexing and searching system for a domain or intranet. It is not meant to replace Internet-wide search systems; instead, it is meant to cover the search needs of a single company, campus, or even a particular subsection of a web site.
Special Features: Some of the special features are:
• Intranet searching
• Robot exclusion is supported
• Boolean expression searching
• Configurable search results
• Email notification of expired documents
• Searches on subsections of the database
(Go to http://www.htdig.org/require.html for full feature set)
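Robot exclusion, listed among the features above, means that an indexer consults a site's robots.txt rules before fetching a page. The sketch below shows the idea using Python's standard urllib.robotparser module; it does not reflect how ht://Dig itself implements the check, and the rules, host name, and user-agent string are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real indexer would download it
# from the site being indexed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_text):
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


checker = make_checker(ROBOTS_TXT)
for url in (
    "http://intranet.example/docs/index.html",
    "http://intranet.example/private/report.html",
):
    # "example-indexer" is a made-up user-agent name.
    print(url, checker.can_fetch("example-indexer", url))
```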
History: ht://Dig was developed at San Diego State University as a way to search the various web servers on the campus network.
Project Sponsors/Administrators: San Diego State University
Dependency: C++ Compiler, libstdc++ (for building from source)
Supported Platforms: Linux, UNIX, BSD, Solaris, HP/UX
License: GNU GPL
Availability: http://www.htdig.org/mirrors.html, http://www.htdig.org/where.html
Further Information: Project Home Page: http://www.htdig.org/
3.2 Swish-E
Description: Simple Web Indexing System for Humans - Enhanced (SWISH-E) is a fast, powerful, flexible, free, and easy to use system for indexing collections of Web pages or other files.
Special Features: Please refer to http://swish-e.org/current/docs/README.html#Key_features for the full feature set. Some of the major features are:
• Quickly index a large number of documents in different formats including text, HTML, and XML
• Use “filters” to index other types of files such as PDF, gzip, or Postscript.
• Includes a web spider for indexing remote documents over HTTP. Follows Robots Exclusion Rules (including META tags).
• Can use an external program to supply documents to Swish-e, such as an advanced spider for your web server or a program to read and format records from a relational database.
• Document “properties” (some subset of the source document, usually defined as META tags or XML elements) may be stored in the index and returned with search results.
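The idea of document "properties" in the last item can be sketched as follows: selected META values are pulled out of a page so that they can be stored next to the index entry and shown with each hit. This is an illustration written with Python's standard html.parser module, not Swish-e's own code or configuration; the class name, function name, and sample page are assumptions.

```python
from html.parser import HTMLParser


class MetaPropertyParser(HTMLParser):
    """Collect the content of selected <meta name="..."> tags."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = {name.lower() for name in wanted}
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = (attr_map.get("name") or "").lower()
        if name in self.wanted:
            self.properties[name] = attr_map.get("content") or ""


def extract_properties(html, wanted):
    """Return a dict of the wanted META properties found in the page."""
    parser = MetaPropertyParser(wanted)
    parser.feed(html)
    parser.close()
    return parser.properties


# Hypothetical page, for illustration only.
page = """<html><head>
<title>Annual report</title>
<meta name="author" content="J. Doe">
<meta name="keywords" content="budget, library">
<meta name="generator" content="some editor">
</head><body>text</body></html>"""

print(extract_properties(page, ["author", "keywords"]))
```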
History: Developed by people at the University of California (Berkeley and San Francisco) and elsewhere.
Project Sponsors/Administrators: Roy Tennant (UC, Berkeley)
Dependency: (To build from source) GCC (C++ Compiler), and some other optional packages. Please refer to http://swish-e.org/dev/docs/INSTALL.html for the latest requirements.
Supported Platforms: Sun/Solaris, UNIX, BSD, Linux, OS X, Windows
License: GNU GPL, or LGPL
Availability: http://swish-e.org/Download/
Further Information:
1. Project Homepage: http://swish-e.org/
2. How to Index Anything: http://www.linuxjournal.com/article.php?sid=6652
3.3 ASPseek
Description: ASPseek is Internet search engine software developed by SWsoft. It consists of an indexing robot, a search daemon, and a CGI search frontend. It can index as many as a few million URLs and search for words and phrases, use wildcards, and do Boolean searches. Search results can be limited to a given time period, site, or Web space (a set of sites) and sorted by relevance (PageRank is used) or by date.
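Wildcard searching of the kind mentioned above can be pictured as expanding the pattern against the words in the index and merging the matching URL sets. The sketch below uses Python's standard fnmatch module; it is not ASPseek's implementation, and the index contents and function names are assumptions made for illustration.

```python
from fnmatch import fnmatchcase


def expand_wildcard(pattern, vocabulary):
    """Return the indexed words matching a pattern with * and ? wildcards."""
    pattern = pattern.lower()
    return sorted(word for word in vocabulary if fnmatchcase(word, pattern))


def wildcard_search(pattern, index):
    """Return the union of URLs for every word that matches the pattern."""
    hits = set()
    for word in expand_wildcard(pattern, index):
        hits |= index[word]
    return hits


# Hypothetical word -> URL index, for illustration only.
index = {
    "search": {"http://a.example/1", "http://b.example/2"},
    "searching": {"http://a.example/3"},
    "seek": {"http://b.example/4"},
}

print(expand_wildcard("search*", index))
print(sorted(wildcard_search("search*", index)))
```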
Special Features: ASPseek is optimized for multiple sites (threaded index, async DNS lookups, grouping results by site, Web spaces), but can be used for searching one site as well. ASPseek can work with multiple languages/encodings at once (including multibyte encodings such as Chinese) due to Unicode storage mode. Other features include stopwords and ispell support, a charset and language guesser, HTML templates for search results, excerpts, and query words highlighting.
History: Developed and maintained by SWsoft.
Project Sponsors/Administrators: SWsoft (http://www.sw-soft.com/)
Dependency: C++ STL, RDBMS
Supported Platforms: Linux
License: GNU GPL
Availability: Binary packages: http://www.aspseek.org/packages.php, Source packages: http://www.aspseek.org/download.php
Further Information: Project Home Page: http://www.aspseek.org/
3.4 Harvest: A Distributed Search System
Description: Harvest is a system for collecting information and making it searchable through a web interface. Harvest can collect information from the Internet and from intranets using HTTP, FTP, and NNTP, as well as from local files on hard disks, CD-ROMs, and file servers.
Special Features: Supported formats, in addition to HTML, currently include TeX, DVI, PS, full text, mail, man pages, news, troff, WordPerfect, RTF, Microsoft Word/Excel, SGML, C sources, and many more. Stubs for PDF support are included in Harvest and use Xpdf or Acroread to process PDF files. Adding support for a new format is easy because of Harvest's modular design.
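One common way to get this kind of per-format modularity is a registry that maps file types to text extractors, so that supporting a new format means registering one more function. The sketch below illustrates that general pattern only; it is not Harvest's actual mechanism, and every name in it is hypothetical.

```python
import re
from pathlib import PurePosixPath

# Hypothetical registry: file extension -> text extractor function.
HANDLERS = {}


def handles(*extensions):
    """Decorator that registers an extractor for the given extensions."""
    def register(func):
        for ext in extensions:
            HANDLERS[ext.lower()] = func
        return func
    return register


@handles(".txt")
def extract_plain(data):
    """Decode plain text, replacing undecodable bytes."""
    return data.decode("utf-8", errors="replace")


@handles(".html", ".htm")
def extract_html(data):
    """Crude tag stripping, for illustration only."""
    text = data.decode("utf-8", errors="replace")
    return re.sub(r"<[^>]+>", " ", text)


def extract_text(path, data):
    """Pick an extractor by extension; return None for unsupported formats."""
    handler = HANDLERS.get(PurePosixPath(path).suffix.lower())
    return handler(data) if handler else None


print(extract_text("notes.txt", b"hello"))
print(extract_text("manual.pdf", b""))
```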
History: Unknown
Project Sponsors/Administrators: Developers: Kang-Jin Lee, Javier Masa Marin, Harald Weinreich
Dependency: Apache, Perl, GCC (C Compiler), Bison, Flex
Supported Platforms: UNIX, Linux
License: GNU GPL
Availability: http://sourceforge.net/project/showfiles.php?group_id=27808, http://harvest.sourceforge.net/harvest/doc/download.html
Further Information: Project Home Page: http://harvest.sourceforge.net/
3.5 Zebra Server
Description: Zebra is a high-performance, general-purpose structured text indexing and retrieval engine. It reads structured records in a variety of input formats (e.g., email, XML, MARC) and allows access to them through exact Boolean search expressions and relevance-ranked free-text queries.
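Relevance ranking, as opposed to exact Boolean matching, orders the matching records by a score. The description above does not say which ranking scheme Zebra uses, so the sketch below substitutes a plain TF-IDF score as a generic illustration; the function names and sample records are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, docs):
    """Return (doc_id, score) pairs sorted by descending TF-IDF score."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for term in set(tokenize(query)):
        doc_freq = sum(1 for counts in term_counts.values() if term in counts)
        if doc_freq == 0:
            continue  # the term occurs in no record
        idf = math.log(1 + n_docs / doc_freq)
        for doc_id, counts in term_counts.items():
            if counts[term]:
                scores[doc_id] = scores.get(doc_id, 0.0) + counts[term] * idf
    # Highest score first; ties are broken by record id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical records, for illustration only.
records = {
    "rec1": "MARC record for a library catalogue",
    "rec2": "XML record describing an email archive",
    "rec3": "library email email list",
}

for doc_id, score in rank("email library", records):
    print(doc_id, round(score, 3))
```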
Special Features: Zebra supports large databases (more than ten gigabytes of data, tens of millions of records). It supports incremental, safe database updates on live systems. You can access data stored in Zebra using a variety of Index Data tools (e.g. YAZ and PHP/YAZ) as well as commercial and freeware Z39.50 clients and toolkits.
History: Unknown
Project Sponsors/Administrators: Index Data (http://indexdata.dk/)
Dependency: YAZ Toolkit, [To build from source: C++ Compiler (GCC or VC++)]
License: GNU GPL
Availability: Source and binary: http://indexdata.dk/zebra/
Further Information: Project Home Page: http://indexdata.dk/zebra/
3.6 SiteSearch
Description: The OCLC SiteSearch software provides a comprehensive solution for managing distributed library information resources in a World Wide Web environment. It offers tools that integrate electronic resources under one web interface, provide flexible access to resources, and build text and image databases locally.
Special Features: Unknown
History: Unknown
Project Sponsors/Administrators: OCLC, Inc
Dependency: Java
Supported Platforms: Platform Independent
License: SiteSearch Open Source License Terms
Availability: http://www.sitesearch.oclc.org/project/showfiles.php?group_id=16381
Further Information: Project Home Page: http://www.sitesearch.oclc.org/