|
BookGears is the data grabber for Bookpiles.
There are thousands of public libraries on the web, and each has thousands of
books to search for. Nearly every library shows its book data in a
different format. How can you get this book data into a book organizer
without typing?
BookGears is the data grabber for library web pages; it can send the
extracted data to the book organizer Bookpiles.
BookGears uses data about each library and its web pages, above all
regular expressions, to extract the book data from those pages.

Creating those regular expressions is sometimes not easy, but it
has to be done only once for each library.
The exception is when a web designer changes the web page; then the
regular expressions have to be adjusted.
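
As a rough illustration of "one set of regular expressions per library", here is a minimal Python sketch. It is not BookGears' actual code or data format: the profile name, the field names, the patterns and the sample HTML are all hypothetical, and Python is only an assumed language for the example.

```python
import re

# Hypothetical per-library profile. The library name, field names and
# patterns are invented for illustration; they are not BookGears data.
LIBRARY_PROFILES = {
    "example-library": {
        "title": r"<b>Title:</b>(.*?)<br>",
        "year": r"<b>Year:</b>(.*?)<br>",
    },
}


def extract_fields(library, html):
    """Apply every pattern stored for one library to one page.

    A value of None means the pattern no longer matches, for example
    because the page layout changed and the pattern must be adjusted.
    """
    result = {}
    for field, pattern in LIBRARY_PROFILES[library].items():
        match = re.search(pattern, html, re.DOTALL)
        result[field] = match.group(1).strip() if match else None
    return result


# Made-up page content for the made-up library above.
sample = "<b>Title:</b> Example Book<br><b>Year:</b> 2006<br>"
print(extract_fields("example-library", sample))
print(extract_fields("example-library", "<p>redesigned page</p>"))
```

The second call shows what a redesigned page looks like from the program's point of view: every field comes back as None, which is the signal that the expressions for that library need to be adjusted.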
Does this work for all library web pages?
It works for most web pages, but not for all.
Sometimes the book data is incomplete or poorly structured. Some libraries
use a lot of JavaScript on their pages, and regular expressions often
don't work then.
A regular expression is a string used as a pattern
for a set of strings.
One example
On a web page the following book data is shown.

The web page consists of HTML text. For
the sample above, it is the following HTML text:
<TH NOWRAP ALIGN=RIGHT VALIGN=TOP>Personal Name:</TH>
<TD dir="ltr">
<A HREF="/cgi-bin/Pwebrecon.cgi?SC=Author&SEQ=20060921031040&PID=26950&SA=Brueck,+Dave.">
Brueck, Dave.</A></TD>
</TR>
The regular expression
Personal Name:</TH>.*?<A HREF.*?">(.*?)</A></TD>
describes the search for the author's name.
The characters with a special meaning are:
.      any character
*      a repetition: zero or more of the preceding item
?      after *, a non-greedy search: match the smallest number of characters that satisfies the expression
( )    a group, used to capture a substring
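
The difference between the greedy and the non-greedy search is easiest to see side by side. The following is a small Python sketch; the HTML snippet with two links is invented for the illustration and does not come from any library page.

```python
import re

# Hypothetical snippet with two links, for illustration only.
html = '<A HREF="/a">First</A> <A HREF="/b">Second</A>'

# Greedy: ".*" takes as many characters as possible, so the group
# runs all the way to the last </A> in the text.
greedy = re.search(r'">(.*)</A>', html).group(1)

# Non-greedy: ".*?" takes as few characters as possible, so the group
# ends at the first </A> after the opening tag.
lazy = re.search(r'">(.*?)</A>', html).group(1)

print(greedy)
print(lazy)
```

With the greedy form the group swallows the first closing tag and the whole second link; with the non-greedy form it contains only the text of the first link, which is what a data grabber normally wants.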
The search can be described as follows:
On the web page, search for the string Personal Name:</TH>, then
read the next characters until <A HREF, an HTML web link.
Read further until ">.
Read the next characters until </A></TD>, the end of the link.
The author's name is the text in the first (and only) group.
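
The search described above can be tried out directly. This is a minimal Python sketch, not BookGears' own code. One assumption is needed: in Python, "." does not match a line break by default, so the sketch sets the re.DOTALL option to let ".*?" run across the line breaks in the sample HTML. Which regular expression engine and options BookGears itself uses is not stated here.

```python
import re

# The HTML text of the sample, copied from above.
page = '''<TH NOWRAP ALIGN=RIGHT VALIGN=TOP>Personal Name:</TH>
<TD dir="ltr">
<A HREF="/cgi-bin/Pwebrecon.cgi?SC=Author&SEQ=20060921031040&PID=26950&SA=Brueck,+Dave.">
Brueck, Dave.</A></TD>
</TR>'''

# The regular expression from above. re.DOTALL is an assumption made
# for Python so that "." also matches the line breaks in the page.
AUTHOR = re.compile(
    r'Personal Name:</TH>.*?<A HREF.*?">(.*?)</A></TD>',
    re.DOTALL,
)

match = AUTHOR.search(page)
# Group 1 is the first (and only) group; strip() removes the line
# break that sits between the link tag and the name.
author = match.group(1).strip() if match else None
print(author)
```

Run on the sample, this prints the author's name, Brueck, Dave.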
This regular expression should work for all author names in this library, at
least for the same type of page, here a detail page.
I have started with a few libraries. If you use an interesting
library and want to test BookGears, please let me know.
|