

Processing HTML Tables with Anthracite


This is an overview of typical HTML Table processing tasks in Anthracite.



Examining All Tables on a Webpage

When designing a web mining extraction system, one of the first things you often need to know is how the source page uses tables for layout and data formatting.

A common way to quickly examine the table layout of a page is to:

1) Drag the URL from the Location bar of your browser into an Anthracite Document. (These samples use the King County, WA Election Results URL: http://www.metrokc.gov/elections/2004may/resPage2.htm)

2) Add a Processor Object from the Toolbar and then configure it to be a "Table" type and use the default "Extract All Tables" setting.



3) Add a Result Object from the Toolbar, and then Command-Click and Drag to create connections from the Source to the Processor to the Result.



4) Run your process.

5) Preview your results by double-clicking on the "Result Object", then opening the outline view to select the current result, and then clicking "Preview" to bring the results up in your browser.



What you see in the browser is a numbered list of all of the source page's tables. The next step is to use this information to refine the data collection.
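For readers who want to see the idea behind "Extract All Tables" in code, here is a minimal Python sketch using only the standard library. It is not how Anthracite works internally. The class name `TableCollector`, the function `extract_tables`, and the numbering rule (tables counted from 1 in the order their opening tags appear, with nested tables counted separately) are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect the raw HTML of every <table> element, in document order."""

    def __init__(self):
        # Keep entities such as &amp; untouched so the markup is preserved.
        super().__init__(convert_charrefs=False)
        self.tables = []   # one HTML string per table, in opening-tag order
        self._open = []    # indexes into self.tables for tables still open

    def _emit(self, text):
        # Append markup to every open table, so a nested table also
        # appears inside the HTML of its parent table.
        for index in self._open:
            self.tables[index] += text

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append("")
            self._open.append(len(self.tables) - 1)
        self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through once.
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        self._emit(f"</{tag}>")
        if tag == "table" and self._open:
            self._open.pop()

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")


def extract_tables(html):
    """Return a list of table HTML strings (list index 0 is Table #1)."""
    collector = TableCollector()
    collector.feed(html)
    collector.close()
    return collector.tables
```

Printing the list with `enumerate(extract_tables(page), start=1)` gives a numbered listing comparable to the preview described above.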



Examining A Specific Table

Now, suppose we want the "Ballots Cast/Registered Voters" (Turnout) number for Fire District 16. From the numbered list in the preview, we can see that it is in Table #5 on the Source Page.

Add a "Table" Processor Object (you can use copy/paste), and set it to extract just the specific Table Number 5.



Now, when we run the process, we get just the HTML for Table #5, in which we can see there is one row ("TR") and several data cells ("TD"s).
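Selecting one table by its number can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `nth_table` is hypothetical, numbering is assumed to start at 1, and the regular expression assumes the page contains no nested tables (a real page may need a proper HTML parser).

```python
import re

# Non-greedy match from an opening <table ...> to the first </table>.
# Assumption: tables are not nested inside one another.
TABLE_RE = re.compile(r"<table\b.*?</table\s*>", re.IGNORECASE | re.DOTALL)


def nth_table(html, number):
    """Return the HTML of table `number` (1-based), or None if it does not exist."""
    tables = TABLE_RE.findall(html)
    if 1 <= number <= len(tables):
        return tables[number - 1]
    return None
```

With the sample page loaded into a string `page`, `nth_table(page, 5)` would correspond to the "Table Number 5" setting used above.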





Getting Useful Info from HTML Table Cells

What we want are the two numbers in Cell #3 and Cell #5 (Ballots Cast and Total Registered).

Now we can configure two Table Processor Objects to get that data.



Here's the one for Cell #5:



Since we want just the numbers and not the HTML formatting, we're going to run these table results through a pair of "Strip Tags" processors.
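The cell selection and tag stripping described above can be sketched in Python like this. The function names `nth_cell` and `strip_tags` are hypothetical, cells are assumed to be numbered from 1, and the sample row markup is invented for illustration (only the two numbers come from the report output shown later on this page).

```python
import re
from html import unescape

# Capture the contents of each <td>...</td> cell.
# Assumption: cells do not contain nested tables.
CELL_RE = re.compile(r"<td\b[^>]*>(.*?)</td\s*>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def nth_cell(table_html, number):
    """Return the inner HTML of cell `number` (1-based), or None if absent."""
    cells = CELL_RE.findall(table_html)
    if 1 <= number <= len(cells):
        return cells[number - 1]
    return None


def strip_tags(fragment):
    """Remove HTML tags, decode entities, and trim surrounding whitespace."""
    return unescape(TAG_RE.sub("", fragment)).strip()


# Invented single-row table standing in for Table #5 of the source page.
row = (
    "<table><tr><td>Fire District 16</td><td>Ballots Cast</td>"
    "<td><b>5296</b></td><td>Registered</td>"
    "<td><font size=2>21521</font></td></tr></table>"
)

ballots_cast = strip_tags(nth_cell(row, 3))
total_reg = strip_tags(nth_cell(row, 5))
```

After this runs, `ballots_cast` and `total_reg` hold plain text with no markup, which is the form the later report step expects.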

Since this is the final processing step for these pieces of data, and we want to use the output in our own reports, we also change the Title of each Object to describe its data. This makes the data easy to reference in the Report Object in the next step.



In the Report Object, we now use the data we've collected. Set the type to "Template Report" and the template type to "Custom Snippet" and then enter a snippet of HTML for how you'd like your report formatted.

For example, for the output below we used:

FD-16 Ballots Cast: <b>__BallotsCast__</b><br>\nFD-16 Registered Voters: <b>__TotalReg__</b><br>\n

And as it appears in the configured Report Object:



Then use a "Wrap" Processor Object to put the Snippet Report's output into a table frame.




Finally, when the process is run and the results are viewed, we have:

<table border=1><tr><td>FD-16 Ballots Cast: <b>5296</b><br>
FD-16 Registered Voters: <b>21521</b><br>
</td></tr></table>


Which should look like this when previewed:

FD-16 Ballots Cast: 5296
FD-16 Registered Voters: 21521
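The template and wrap steps above can be mimicked outside Anthracite with a short Python sketch. It assumes that a placeholder is an Object Title surrounded by double underscores, as in the snippet shown earlier, and that `\n` in the snippet stands for a newline. The function names `fill_snippet` and `wrap` are hypothetical.

```python
import re

# A placeholder is a name made of letters and digits between double underscores.
PLACEHOLDER_RE = re.compile(r"__([A-Za-z0-9]+)__")


def fill_snippet(template, values):
    """Replace each __Name__ placeholder with values['Name'].

    Placeholders with no matching value are left unchanged.
    """
    def replace(match):
        return values.get(match.group(1), match.group(0))

    return PLACEHOLDER_RE.sub(replace, template)


def wrap(body, prefix="<table border=1><tr><td>", suffix="</td></tr></table>"):
    """Surround the report output with a simple one-cell table frame."""
    return prefix + body + suffix


template = (
    "FD-16 Ballots Cast: <b>__BallotsCast__</b><br>\n"
    "FD-16 Registered Voters: <b>__TotalReg__</b><br>\n"
)
report = wrap(fill_snippet(template, {"BallotsCast": "5296", "TotalReg": "21521"}))
print(report)
```

The printed HTML matches the wrapped result shown above.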


It should also be clear that we can configure reports in this same way to provide additional formats of output (such as CSV).
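As a rough illustration of a CSV-style output, the same two values could be written with Python's standard `csv` module. This is a sketch outside Anthracite, not its report format; the function name `to_csv` and the choice of the Object Titles as column headers are assumptions.

```python
import csv
import io


def to_csv(records, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv(
    [{"BallotsCast": "5296", "TotalReg": "21521"}],
    ["BallotsCast", "TotalReg"],
)
```

Here `csv_text` contains a header line followed by one data line for Fire District 16.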

You can also use the Titles of the Processor Objects (e.g., "BallotsCast", "TotalReg") as the names for fields in your MySQL database when using the MySQL export object.





last update: 6/16/04

Copyright © 2004, All Rights Reserved, Metafy LLC