Thursday, April 26, 2007

File system indexing

I was looking for ways to index an entire collection of websites to build a local repository. The idea is that people can download and dump their favourite websites, such as tldp.org (The Linux Documentation Project), Wikipedia and so on, and create an index over them that makes searching across these repositories easy.

wget is a small Linux utility that lets you recursively download websites onto your hard drive. For the indexing itself, we can use Zebra, which is a high-performance, general-purpose structured text indexing and retrieval engine.
Zebra also handles large databases (more than ten gigabytes of data, tens of millions of records). You can download Zebra at http://ftp.indexdata.dk/pub/zebra/idzebra-2.0.12.tar.gz and install it after installing dependencies such as YAZ. For example:
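
A rough sketch of the dump-and-build steps might look like this (the target URL, recursion depth and install prefix are just illustrative; adjust them to whatever you are mirroring):

    # Mirror a site into the current directory, staying on the same host
    wget --recursive --level=5 --page-requisites --convert-links \
         --no-parent --wait=1 http://tldp.org/

    # Build and install Zebra from the tarball (assumes YAZ is already installed)
    tar xzf idzebra-2.0.12.tar.gz
    cd idzebra-2.0.12
    ./configure
    make
    sudo make install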

Zebra documentation is available at http://www.indexdata.dk/zebra/doc/; read the sections on administering Zebra in particular. Zebra supports different types of indexing (see the zebra.cfg options). For indexing your dumped websites, use indexing with file record IDs, which also supports incremental updates to your repository; a sketch follows below.
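
As a minimal sketch, a zebra.cfg for this kind of setup could look something like the following. The register size, directory layout and the plain-text record type are only assumptions for illustration; check the administration guide for the filters that suit your data:

    # zebra.cfg (minimal sketch)
    profilePath: .:/usr/local/share/idzebra-2.0/tab
    recordType: text           # plain-text filter; pick an XML/HTML filter if you need structure
    recordId: file             # use the file path as the record ID, enabling incremental updates
    storeKeys: 1               # keep index keys so changed or deleted files can be updated
    register: ./register:5G    # where the index files live and how much space they may use

Then build (or incrementally refresh) the index from the directory holding your dumped sites:

    zebraidx update /path/to/dumped/sites

(If you have shadow registers configured, follow this with zebraidx commit to make the changes live.)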

You can access data stored in Zebra using a variety of Index Data tools (e.g. YAZ and PHP/YAZ) as well as commercial and freeware Z39.50 clients and toolkits.
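
Once the index is built, a quick way to try it out is to start the Z39.50 server and point the YAZ command-line client at it. The port number here is just an example:

    # Start the Zebra Z39.50 server (run from the directory containing zebra.cfg)
    zebrasrv @:9999

    # In another terminal, connect with the YAZ client and search
    yaz-client localhost:9999
    Z> find linux
    Z> show 1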






