anthracite screen casts
These screen movies demonstrate a variety of ways of using Anthracite, from loading a single URL from the current page in your browser using drag and drop to spidering output in two simple steps. Follow along with the screen-by-screen descriptions of each movie below as a how-to for using these web mining techniques in your own processes.
Drag & Drop URL Loading (6MB via Dot Mac)
Table Extraction (12MB via Dot Mac)
Using Reports (25MB via Dot Mac)
URL Path Export (9.5MB via Dot Mac) New in Version 1.5!
Explanations:
Drag & Drop URL Loading (6MB via Dot Mac)
The first movie shows how easy it is to drag a URL from the current location field of the browser and drop it into Anthracite, ready to load. Connect up a result object to the source with a command-drag, run the document by clicking on the Run button, and "you've got data!" Double-click on the result object, then double-click on the latest item to view the raw output, and then preview it in the browser to see it fully formatted.
Table Extraction (12MB via Dot Mac)
This movie picks up where the last one ends: once data has been loaded, the next step is to process it, in this example by extracting a column from a table. The first step is the same as before, loading the data from the URL, this time examining the raw output and observing that it is formatted in HTML tables. Next, a Table Processor is added to the document, set to extract all tables from the source, and then added to the existing process by dragging the new object onto the line that connects the two previous objects. After running this configuration, checking the output shows that two tables were found. The Table Processor is reconfigured to extract only table one and to convert it from HTML to an array for further analysis. The output after this run shows that the table is four columns wide and 27 rows long. The processor is reconfigured again, this time turning off the HTML-to-array conversion and specifying that only column four be extracted. In the final run, just the fourth column has been extracted, and it is in HTML format, ready for further processing.
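For readers who want to see the same idea outside of Anthracite, the table steps above can be sketched in Python using only the standard library. This is a rough illustration, not how Anthracite works internally: the names `TableCollector`, `extract_column`, and `SAMPLE_HTML` are hypothetical, the sample markup is invented, nested tables are not handled, and the sketch returns plain cell text rather than HTML.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> in a document as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside a table cell is kept.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_column(html, table_number, column_number):
    """Return one column of one table, counting tables and columns from 1."""
    parser = TableCollector()
    parser.feed(html)
    table = parser.tables[table_number - 1]
    return [row[column_number - 1] for row in table if len(row) >= column_number]


# Invented sample data: a four-column table followed by a second, unrelated table.
SAMPLE_HTML = """
<table>
<tr><th>Name</th><th>City</th><th>State</th><th>Phone</th></tr>
<tr><td>Acme</td><td>Austin</td><td>TX</td><td>555-0100</td></tr>
<tr><td>Bolt</td><td>Boise</td><td>ID</td><td>555-0101</td></tr>
</table>
<table><tr><td>footer</td></tr></table>
"""

if __name__ == "__main__":
    print(extract_column(SAMPLE_HTML, table_number=1, column_number=4))
```

As in the movie, the sketch first finds all tables, then narrows the result to a single table and a single column.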
Using Reports (25MB via Dot Mac)
The Report Object is the primary way of formatting results in Anthracite, including formatting repeating data into structured form, such as HTML table rows. This movie shows the step-by-step development of a document that builds an HTML table report from the results of a Google API query, ending with an HTML page on which more than 10 Google results are shown at once. Note that following along with this example requires a Google API Key, available here.
In the first step, a Google API Source Object is configured to search for "Metafy" and load 10 results in raw format (simply returning the data result of the Google query, not the contents of the pages found). Examining the output after running this process reveals that the data is a "Plist" format, semicolon-separated list with easily identifiable key-value pairs for the URL, the Title, and a Snippet of each page found. In the second step, the Anthracite document is configured with three "Text Between" processor objects to extract this information from the results. Running this Anthracite document produces 10 results from each text processor.
In the next step, a Report Object is added and connected to each of the text processors. The "Custom Snippet" report type is configured by adding a snippet of HTML to the edit field, using a double-underscore notation to specify by name where each of the previous results should be inserted inside a single HTML table row. Because this report generates only the rows, a "Wrap" processor is added to put opening and closing table tags around them. A Results Object is added to hold the output, and then the process is run, showing a table with three columns: one for the URL, one for the Title, and one for the Snippet returned from the Google search.
To improve on this, the Google Search is reconfigured to load 25 results instead of only 10, and then the Report Object is reconfigured to combine the URL and Title into a single clickable link in the report. The Anthracite document is re-run, and the outputs are previewed in the browser, showing a webpage with 25 Google results, instead of the usual limitation of ten.
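The report pipeline described above (extract text between markers, fill a row template, wrap the rows in table tags) can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the function names `text_between`, `build_report`, and `wrap` are hypothetical, the `RAW` sample is invented rather than real Google API output, and the `__name__` placeholder spelling is a guess at the double-underscore notation, whose exact syntax is not given here.

```python
import html


def text_between(text, start, end):
    """Return every substring found between a start marker and an end marker."""
    found = []
    position = 0
    while True:
        begin = text.find(start, position)
        if begin == -1:
            break
        begin += len(start)
        finish = text.find(end, begin)
        if finish == -1:
            break
        found.append(text[begin:finish])
        position = finish + len(end)
    return found


def build_report(row_template, **columns):
    """Fill the row template once per result, replacing __name__ placeholders."""
    rows = []
    for values in zip(*columns.values()):
        row = row_template
        for name, value in zip(columns, values):
            # Escape the value so it is safe to place inside HTML.
            row = row.replace("__" + name + "__", html.escape(value))
        rows.append(row)
    return "\n".join(rows)


def wrap(body, before, after):
    """Put opening and closing text around the generated rows."""
    return before + "\n" + body + "\n" + after


# Invented sample in a semicolon-separated key-value style (not real API output).
RAW = (
    'URL = "http://example.com/a"; title = "First page"; snippet = "About A"; '
    'URL = "http://example.com/b"; title = "Second page"; snippet = "About B";'
)

# Hypothetical row template combining URL and title into one clickable link.
ROW = '<tr><td><a href="__url__">__title__</a></td><td>__snippet__</td></tr>'

report = wrap(
    build_report(
        ROW,
        url=text_between(RAW, 'URL = "', '";'),
        title=text_between(RAW, 'title = "', '";'),
        snippet=text_between(RAW, 'snippet = "', '";'),
    ),
    "<table>",
    "</table>",
)

if __name__ == "__main__":
    print(report)
```

Keeping the row template separate from the wrapping step mirrors the movie: the report generates only the repeating rows, and a separate step supplies the enclosing table tags.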
URL Path Export (9.5MB via Dot Mac)
Anthracite 1.5 adds powerful URL path exporting capabilities, demonstrated in this movie. Starting with a hypothetical example of a database of small business listings displayed individually on template pages, a "Links On Page" Source Object is configured to load each individual listing from the links on the main index list page. The movie demonstrates the nifty trick of using the built-in context menu option "Open URL" when a URL is selected to preview the source pages. The main listing page and two detail pages are reviewed, highlighting the source URL path for the pages, including the "get" arguments after the question mark, which load the distinct records by the "pageid" parameter. Next, a File Export Object is added to the document and configured to write output to a folder on the current user's Desktop, to use the "Source URL Full Path" naming option, and not to combine output (so we get individual files for each page, instead of one document with all the pages). The objects are connected using a control drag from the source to the export, and then the Anthracite document is run. A folder is created on the Desktop containing output from the source "www.metafy.com", inside of which are nested folders corresponding to the source URL path from which the documents were loaded, ending in a folder containing a file for each of the links loaded from the original source page. These files are then reviewed in the browser. The source page is spidered in only two steps, using a simple visual metaphor that does not require scripting.
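To make the folder layout concrete, here is a small Python sketch of one way a source URL could be mapped to a nested export path. It is an illustration under assumptions, not Anthracite's actual naming rules: the function `export_path`, the example URL path `/listings/detail.php`, and the way the query string is folded into the file name are all hypothetical. Only the host "www.metafy.com" and the "pageid" parameter come from the movie description.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def export_path(export_root, url):
    """Map a source URL to a nested path: root / host / URL path folders / file."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    if not segments:
        # A bare host such as http://example.com/ still needs a file name.
        segments = ["index"]
    if parts.query:
        # Assumed convention: fold the "get" arguments into the file name so
        # that each record (for example each pageid) gets its own file.
        suffix = parts.query.replace("&", "_").replace("=", "-")
        segments[-1] = segments[-1] + "_" + suffix
    return PurePosixPath(export_root, parts.netloc, *segments)


if __name__ == "__main__":
    for page_id in (1, 2):
        # Hypothetical detail-page URLs that differ only in the pageid argument.
        url = "http://www.metafy.com/listings/detail.php?pageid=%d" % page_id
        print(export_path("export", url))
```

Because the query string takes part in the file name, pages that share a template but differ in "pageid" end up as separate files in the same folder, which matches the uncombined output described above.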
Last Modified: 5/30/06
Copyright © 2006, All Rights Reserved, Metafy LLC