bookgears ...book data grabber  


Last update 30 Mar 2007

BookGears Homepage

BookGears is the data grabber for Bookpiles. 

There are thousands of public libraries on the web. Each has thousands of books to search for. Nearly every library shows its book data in a different format. How can you get this book data into a book organizer without typing?

BookGears is the data grabber for library web pages and can send the extracted data to the book organizer Bookpiles.

BookGears uses data about each library and its web pages, above all regular expressions, to extract the book data from those pages.



Creating those regular expressions is sometimes not easy, but it has to be done only once for each library. The exception is when a web designer changes the web page; then the regular expressions have to be adjusted.

Does this work for all library web pages?

It works for most web pages, but not for all.

Sometimes book data is incomplete or poorly structured. Some libraries use a lot of JavaScript on their pages. Regular expressions often don't work then.

 

A regular expression is a string used as a pattern that describes a set of strings.

 

One example

On a web page the following book data is shown.


 

The web page consists of HTML text. For this sample, it is the following HTML text:

<TH NOWRAP ALIGN=RIGHT VALIGN=TOP>Personal Name:</TH>
<TD dir="ltr">
<A HREF="/cgi-bin/Pwebrecon.cgi?SC=Author&SEQ=20060921031040&PID=26950&SA=Brueck,+Dave.">
Brueck, Dave.</A></TD>
</TR>

 

The regular expression

 Personal Name:</TH>.*?<A HREF.*?">(.*?)</A></TD>

describes the search for the author's name.

 

The characters with a special meaning are:

.           any character

*           a quantifier: zero or more repetitions of the preceding element

?           placed after *, makes the search non-greedy: it matches as few characters as possible

( )         a group, used to capture a substring
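
The difference between a greedy and a non-greedy search is easiest to see side by side. The following is a minimal sketch in Python; the language and the made-up HTML fragment are assumptions for illustration only and say nothing about how BookGears itself is implemented.

```python
import re

# Made-up HTML fragment with two links, for illustration only.
html = '<A HREF="a.html">First</A> and <A HREF="b.html">Second</A>'

# Greedy: .* takes as many characters as possible, so the group
# runs from the first "> up to the LAST </A>.
greedy = re.search(r'">(.*)</A>', html).group(1)

# Non-greedy: .*? takes as few characters as possible, so the group
# stops at the FIRST </A>.
lazy = re.search(r'">(.*?)</A>', html).group(1)

print(greedy)  # First</A> and <A HREF="b.html">Second
print(lazy)    # First
```

Without the ? the group swallows both links, which is why the expression above uses .*? everywhere.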
 

The search can be described as follows:

On the web page, search for the string    Personal Name:</TH>   then read the following characters up to  <A HREF, the start of an HTML link, and read further up to "> . Then read the next characters up to  </A></TD>, the end of the link.

The author's name is the text in the first (and only) group.
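
The search described above can be tried out with a few lines of Python. This is only a sketch: Python and its re module are assumptions chosen for illustration, not necessarily what BookGears uses. Because the HTML fragment spans several lines, Python needs the re.DOTALL flag so that . also matches line breaks; other regex engines may need a different setting.

```python
import re

# The HTML fragment from the example, as one multi-line string.
HTML = """<TH NOWRAP ALIGN=RIGHT VALIGN=TOP>Personal Name:</TH>
<TD dir="ltr">
<A HREF="/cgi-bin/Pwebrecon.cgi?SC=Author&SEQ=20060921031040&PID=26950&SA=Brueck,+Dave.">
Brueck, Dave.</A></TD>
</TR>
"""

# The regular expression from the example. re.DOTALL lets "." match
# line breaks too (an assumption specific to Python's re module).
AUTHOR_PATTERN = re.compile(
    r'Personal Name:</TH>.*?<A HREF.*?">(.*?)</A></TD>',
    re.DOTALL,
)

match = AUTHOR_PATTERN.search(HTML)

# Group 1 holds the link text; strip() removes the line break that
# precedes the name in the HTML source.
author = match.group(1).strip() if match else None

print(author)  # Brueck, Dave.
```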

This regular expression should work for all author names at this library, at least on the same type of page, in this case a detail page.
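
To illustrate both the reuse and its limits, here is a small Python sketch. The helper extract_author and both HTML fragments are hypothetical and made up for this illustration; they are not taken from a real library page. The first fragment has the same layout as the example with a different author, the second simulates a redesigned page.

```python
import re

# Same expression as in the example; re.DOTALL is a Python-specific
# assumption so that "." also matches line breaks.
AUTHOR_PATTERN = re.compile(
    r'Personal Name:</TH>.*?<A HREF.*?">(.*?)</A></TD>',
    re.DOTALL,
)

def extract_author(html):
    """Return the author's name, or None if the pattern does not match."""
    match = AUTHOR_PATTERN.search(html)
    return match.group(1).strip() if match else None

# Hypothetical detail page with the same layout but another author.
same_layout = (
    '<TH NOWRAP ALIGN=RIGHT VALIGN=TOP>Personal Name:</TH>\n'
    '<TD dir="ltr">\n'
    '<A HREF="/cgi-bin/Pwebrecon.cgi?SC=Author&SA=Doe,+Jane.">\n'
    'Doe, Jane.</A></TD>\n'
)

# Hypothetical redesign: the label is now "Author:", so the
# expression no longer finds its starting point.
changed_layout = (
    '<TH>Author:</TH>\n'
    '<TD><A HREF="/search?author=Doe,+Jane.">Doe, Jane.</A></TD>\n'
)

print(extract_author(same_layout))     # Doe, Jane.
print(extract_author(changed_layout))  # None
```

A result of None is the signal that the regular expression for this library has to be adjusted.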

 

I have started with a few libraries. If you use an interesting library and want to test BookGears, please let me know.

 


  


 

Current release: 0.81

License: GPL

 

Download BookGears 0.81

 

BookGears News

30 Mar 07

Version 0.81 released together with Bookpiles version 1.1.

22 Dec 06

Version 0.8 released together with  Bookpiles version 1.0.

1 Oct 06

Version 0.7 released together with  Bookpiles version 0.9.