

anthracite example: small business database




In this Anthracite example, we show how a small businessperson making a product could use Anthracite to collect information about stores from listings on a website. This example presumes that the website is a public domain listing, that the site allows automated data extraction and/or that the copyright holder licenses this use. In our example we're going to use a sample database of fictitious business listings.


To get a sense of how easy this is to set up, here's the layout for the first step of the process, collecting the URLs. Any one of the three "Result Objects" at the bottom holds enough information to move on to step two.





But, as with any Anthracite Web Mining process, we recommend you begin by spending a little time analyzing your source(s) with a pencil and paper handy before creating the system on the computer.

In this case, we started with a web page that provides links to a directory of hypothetical retail stores and wholesalers:

http://www.metafy.com/anthracite/samples/database/database.php


When we click on the individual store links, we are taken to what appears to be a database-generated page built from a fixed template. Looking at the URL for the page, we see that the directory entry pages are keyed to a "page id" (pageid); for example, this is the page for Page ID 005:


http://metafy.com/anthracite/samples/database/database.php?pageid=005
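
To make the "keyed to a pageid" observation concrete, here is a minimal Python sketch (not part of Anthracite) that pulls the pageid value out of such a URL. The function name page_id_from_url is our own invention for illustration.

```python
from urllib.parse import urlparse, parse_qs

def page_id_from_url(url):
    """Return the value of the 'pageid' query parameter, or None if absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("pageid")
    return values[0] if values else None

example_url = "http://metafy.com/anthracite/samples/database/database.php?pageid=005"
print(page_id_from_url(example_url))  # prints 005
```

The ID is kept as a string so that leading zeros such as "005" survive.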

The template that is generated for each store appears to be based on a table layout, which should make our work even more straightforward when we're examining the data on the directory entry pages.

With just this information, we're ready to begin, although we'll take a moment to examine the table layouts later.

Takeaway: We know that there is a list of all the stores that we might be interested in and that the list has URLs that point at pages describing individual stores. We need the details about the individual stores.

Thus, we're going to design a two-step process that first collects the links for the stores, and second, examines the individual store pages for the information we want.

First, the list of URLs of the stores.



Step One: Collecting URLs of Stores

There are several ways to work with lists of URL sources in Anthracite, and so this process will create lists suitable for three of them (see Tools - Sources): URL List File, Links on Page, and the URL Generator.

The "URL Generator" can take a list of the unique elements of a URL, generate the full list of URLs, and then load all of them. In this example, we know that the stores are at URLs that end with "pageid=001" and the like.

Thus, if we can create a list of just the number after the "pageid=" part of the URL, we can feed that to the URL generator to load the pages for the stores we're interested in.
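
As a rough illustration of what a URL generator does with such a list, here is a short Python sketch. It is not Anthracite's implementation; the template string, the function name generate_urls, and the sample IDs are assumptions made for this example.

```python
# Hypothetical template: the {} marks the changing portion of the URL.
URL_TEMPLATE = "http://metafy.com/anthracite/samples/database/database.php?pageid={}"

def generate_urls(page_ids, template=URL_TEMPLATE):
    """Substitute each non-empty page ID into the changing part of the URL."""
    return [template.format(pid.strip()) for pid in page_ids if pid.strip()]

sample_ids = ["001", "002", "005"]  # made-up IDs for illustration
for url in generate_urls(sample_ids):
    print(url)
```

Blank lines in the ID list are skipped, so a trailing linefeed in an exported file would not produce a broken URL.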

The "Links on Page" processor will load all of the pages that a particular page links to. If we have a web page with the links (and we do, see above), we can use that; alternatively, we can make our own web page of the links we want and use it instead.

The "URL List File" processor will take a list of URLs you want loaded for processing and load the data from each of them.

So, in our sample, we've got three final "Result Objects" that hold each of these three types of output.



The overview image above shows the entire set of processes for collecting links.

We're going to begin with the objects in the upper left corner of this process layout:



The Source Object in this process loads the webpage mentioned above that has the entire list of stores on it (http://www.metafy.com/anthracite/samples/database/database.php).

The output of that is passed to an "Extract Links" Processor object created from the presets on the toolbar. It simply grabs the text between HTML anchor tags (the "<a href" and "</a>"), which we know will be links.
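
For readers who think in code, the link extraction can be approximated in Python as follows. This is only a sketch of the idea, using a regular expression on made-up sample HTML; it is not how Anthracite works internally, and a real HTML parser would cope better with unusual markup.

```python
import re

# Match everything from '<a href' through the next '</a>'.
ANCHOR_RE = re.compile(r"<a\s+href.*?</a>", re.IGNORECASE | re.DOTALL)

def extract_links(html):
    """Return every anchor element as a list of strings (an array of links)."""
    return ANCHOR_RE.findall(html)

# Hypothetical page fragment, not the real sample page.
sample_html = """
<p><a href="database.php?pageid=001">Store One</a></p>
<p><a href="/about.html">About</a> and <a href="database.php?pageid=002">Store Two</a></p>
"""
links = extract_links(sample_html)
print(len(links))  # prints 3
```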


Then, from looking at early results, we see that these links are all "relative" links that point at the server. To be able to load them, we need their "absolute" addresses, so we use Anthracite's "Fix Links" processor object to point the URLs to the correct host.
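
The relative-to-absolute conversion can be sketched in Python with the standard library's urljoin. The relative links shown are assumptions about what the page might contain, and fix_link is a name made up for this example, not an Anthracite function.

```python
from urllib.parse import urljoin

# The page the links were collected from acts as the base address.
PAGE_URL = "http://www.metafy.com/anthracite/samples/database/database.php"

def fix_link(href, base=PAGE_URL):
    """Resolve a relative link against the page it was found on."""
    return urljoin(base, href)

print(fix_link("database.php?pageid=005"))
print(fix_link("/about.html"))
```

Links that are already absolute pass through unchanged, so the same step can be applied to every link without checking first.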



After the "Extract Links" process above, Anthracite has been treating each of the individual links as an element in an array (see "Terminology"). We also picked up all the other links that might have been on the page, including advertisements and navigation links. We know that the store links have "pageid" in them, so we're going to use the UNIX command "grep" to keep just those links out of everything that we've collected. However, to make it go faster, we're going to send grep one document of links instead of potentially a thousand individual links to look at. So we "Join" the array of links together with the UNIX linefeed character LF ("\n"), and then pass the result to the grep command.
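
The join-then-filter idea looks roughly like this in Python. The sketch mimics grep with a plain substring test instead of calling the real command, and the sample links are made up for illustration.

```python
def join_and_grep(links, pattern="pageid"):
    """Join the array into one LF-separated document, then keep matching lines."""
    document = "\n".join(links)
    return [line for line in document.split("\n") if pattern in line]

# Hypothetical array of links after the Fix Links step.
sample_links = [
    '<a href="http://www.metafy.com/anthracite/samples/database/database.php?pageid=001">Store One</a>',
    '<a href="http://www.metafy.com/about.html">About</a>',
    '<a href="http://www.metafy.com/anthracite/samples/database/database.php?pageid=002">Store Two</a>',
]
store_links = join_and_grep(sample_links)
print(len(store_links))  # prints 2
```

Like grep, this works line by line, so it assumes no single link contains a linefeed of its own.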




As you can see in the overview above, we also branch after the grep command to save our results at this point, and also to create just a list of the URLs, instead of the full anchor tags for the links.

To do this, we use Text Between again, to collect what's between the "href=" and the end of the tag, which after the Fix Links processing above and "grep'ing for the pageid", we know is an absolute link to one of the store pages.
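
A "text between" operation is easy to picture in code. The Python sketch below is our own approximation, not Anthracite's processor; it assumes the href value is wrapped in double quotes, and the anchor tag is a made-up sample.

```python
def text_between(text, start, end):
    """Return the text after the first 'start' and before the next 'end', or None."""
    begin = text.find(start)
    if begin == -1:
        return None
    begin += len(start)
    finish = text.find(end, begin)
    if finish == -1:
        return None
    return text[begin:finish]

anchor = '<a href="http://www.metafy.com/anthracite/samples/database/database.php?pageid=005">Store Five</a>'
url = text_between(anchor, 'href="', '"')
print(url)
```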





Compare And Save!

Compare that five- or six-step process for getting a URL List and a Page of Links with the right-hand side of the overview above, where in two steps (after the common link extraction) we collect just the Page IDs to pass to the URL Generator:



and then, since the Text Between processor creates an array of results, we again "Join" them with the UNIX linefeed character LF ("\n") to produce a text document of results suitable for "Export" from the Results Object view (double-click on the Results Object, select the individual result entry, and then click "Export" to save a text file of your results).
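
The shorter Page ID route can be sketched the same way in Python. This is an illustration only: it assumes the IDs are purely numeric, and the function name page_id_list and the sample links are made up.

```python
import re

# Assumption: page IDs are runs of digits following 'pageid='.
PAGEID_RE = re.compile(r"pageid=(\d+)")

def page_id_list(links):
    """Collect the page ID from each link and join them with LF characters."""
    ids = []
    for link in links:
        match = PAGEID_RE.search(link)
        if match:
            ids.append(match.group(1))
    return "\n".join(ids)

sample_links = [
    '<a href="database.php?pageid=001">Store One</a>',
    '<a href="/about.html">About</a>',
    '<a href="database.php?pageid=005">Store Five</a>',
]
print(page_id_list(sample_links))
```

The resulting text, one ID per line, is the kind of document that could be saved to a file and used as a list source later.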

Next, step two...



Step Two: Using the URLs

A useful quick stop at this point is to proof our results. In the last step we generated three different sets of results, and that will help us now.

First, make sure we got what at least looks like the output we want in the "List of PageIDs" Result Object. Double-click the Result Object, "open" the Outline View by clicking on the disclosure triangles, and then double-click on the inner result to bring up the Anthracite built-in results viewer. You should have a list of IDs.



But proofing the list of IDs is easier if we use the "PageOLinks" Result Object (the result connected directly off the "grep" command), where we can preview a webpage full of what should be good, clickable links. Following a few of those does two useful things: first, it confirms that we've got useful URLs (and we can spot check the pageids), and second, it introduces us to the regularity of the template layout in which the information is displayed.


Just reviewing a few of the result pages shows us a couple things. We have to presume you've got at least a basic knowledge of HTML from this point on.

You might also want to read HTML Tables and Anthracite to get a feel for how Anthracite can get specific information from HTML tables.

We know that the source page is laid out using HTML Tables because of the rectangularity of its design and the use of common features such as borders and background colors. It also seems clear that the information we want, such as the contact name, is stored in some component of the table layout, like maybe "row 4 column 2" if we think of the table like a spreadsheet.

With this information, and a sample of just one of the store information pages, we can start to construct the second Anthracite Process, which will compile the information we want from these pages.

Using the Result Object with the list of PageIDs that we were just proofing above, choose "Export" and save the list of PageIDs to a file on the Desktop.


Envisioning The Process

Anthracite's Processors provide a variety of ways of getting and manipulating the data from webpages.

In this case, we know that the data we want is in tables on individual template-based webpages, and so with some research into how one of the tables is laid out, we can focus in and extract just the data from each of the pages in the list we made in step one.

Here's how it will look when we're done:




We'll use a URL Generator object to load each of the pages, create four Table Extractors to get each of the fields we want, clean them up, and then write the results into an HTML Report.



Note that we're using a List File source to determine what the URL Generator will insert in the changing portion of the URL. That List is the exported results from Step One, SmallBizPageIDList.txt.

By using the "Extract All Tables" process described in the linked HTML Tables document above, we see that Anthracite sees the data we want in Table 1 of the individual listing pages.

From there, we can also see that the Name, Contact, Address and Category fields are in column two of rows 1 to 4 of that table.
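
Thinking of the table "like a spreadsheet" can be illustrated with a small Python sketch that addresses cells by table, row, and column. This is not Anthracite's Table Extractor. The 1-based numbering, the row order of the four fields, the sample page, and the names TableParser and cell are all assumptions made for this example, and nested tables are not handled.

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect every table as a list of rows, each row a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self.tables[-1][-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a cell is kept; markup inside the cell is ignored.
        if self._in_cell:
            self.tables[-1][-1][-1] += data

def cell(tables, table, row, column):
    """Return one cell using 1-based table, row, and column numbers."""
    return tables[table - 1][row - 1][column - 1].strip()

# Made-up listing page in the spirit of the fictitious sample data.
sample_page = """
<table>
<tr><td>Name</td><td>Example Store</td></tr>
<tr><td>Contact</td><td>Pat Example</td></tr>
<tr><td>Address</td><td>1 Sample Street</td></tr>
<tr><td>Category</td><td>Retail</td></tr>
</table>
"""
parser = TableParser()
parser.feed(sample_page)
fields = ["Name", "Contact", "Address", "Category"]
record = {name: cell(parser.tables, 1, row, 2) for row, name in enumerate(fields, start=1)}
print(record)
```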

Then, to remove any extraneous HTML data that we don't want, we apply a "Strip Tags" processor to each extracted data field. It's important to note that since this will be the last step in the processing chain, we're going to name our processor object with the name of the data field before we pass it to the Report Object next.
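
As a rough picture of what stripping tags means, the following Python sketch keeps only the text content of an HTML fragment and tidies the whitespace. It is our own approximation, not Anthracite's "Strip Tags" processor, and the sample fragments are made up.

```python
from html.parser import HTMLParser

class _TextOnly(HTMLParser):
    """Remember the text between tags and discard the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(fragment):
    """Return the fragment's text with tags removed and whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())

print(strip_tags("<td><b>Example Store</b></td>"))  # prints Example Store
```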

The Report Object uses the multiple named inputs and a Snippet of HTML as a template to build a table of our results. You could also use these snippet report types to create other kinds of output here, such as comma-separated or XML.



From here, we need to enclose the HTML Report's rows with an outer HTML table tag, so we pass the Report's output through a "Wrap" processor object that does just that.
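
The snippet-plus-wrap idea can be sketched in Python like this. The row snippet, the function names build_report and wrap, and the sample record are assumptions for illustration; Anthracite's own snippet syntax may differ.

```python
from html import escape
from string import Template

# Hypothetical row snippet: one placeholder per named input.
ROW_SNIPPET = Template(
    "<tr><td>$Name</td><td>$Contact</td><td>$Address</td><td>$Category</td></tr>"
)

def build_report(records):
    """Fill the snippet once per record and return the rows, one per line."""
    rows = []
    for record in records:
        safe = {key: escape(value) for key, value in record.items()}
        rows.append(ROW_SNIPPET.substitute(safe))
    return "\n".join(rows)

def wrap(body, before="<table>\n", after="\n</table>"):
    """Enclose the report rows in an outer table tag."""
    return before + body + after

sample_records = [
    {"Name": "Example Store", "Contact": "Pat Example",
     "Address": "1 Sample Street", "Category": "Retail"},
]
print(wrap(build_report(sample_records)))
```

Swapping the snippet for a comma-separated line or an XML element would give the other output types mentioned above without changing the rest of the code.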

Finally, we store the output within this Anthracite document, although we've used the Report Object's quick and dirty "Export" function to provide this sample output here:

SmallBizSampleResult.html

Note that we have now converted data that was originally stored on multiple individual templated HTML pages into one consolidated table report.


Next steps that the user may want to explore from here would be filtering results using "Grep" to select just the "Retail" category, or exporting the results to a MySQL database.
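
To suggest what those next steps might look like outside Anthracite, here is a hedged Python sketch that filters records by category and stores them in a database. It swaps in SQLite from the Python standard library as a stand-in for MySQL, and the table name, column names, and sample records are all made up.

```python
import sqlite3

def retail_only(records):
    """Keep just the records whose Category field is 'Retail'."""
    return [record for record in records if record["Category"] == "Retail"]

def save_records(connection, records):
    """Create the (hypothetical) stores table if needed and insert the records."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS stores "
        "(name TEXT, contact TEXT, address TEXT, category TEXT)"
    )
    connection.executemany(
        "INSERT INTO stores VALUES (:Name, :Contact, :Address, :Category)",
        records,
    )
    connection.commit()

sample_records = [
    {"Name": "Example Store", "Contact": "Pat Example",
     "Address": "1 Sample Street", "Category": "Retail"},
    {"Name": "Example Supply", "Contact": "Lee Example",
     "Address": "2 Sample Street", "Category": "Wholesale"},
]
connection = sqlite3.connect(":memory:")  # in-memory database for the demo
save_records(connection, retail_only(sample_records))
print(connection.execute("SELECT name FROM stores").fetchall())
```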

 



last update: 07/07/2004
last update: 06/15/2004