About the Sample Documents
The sample documents are found on the Anthracite Disk Image distribution in the "Goodies -> Samples -> Examples" folder path. Sample sources are also included on the disk.
CO House Phones
Bible Sample "Bold Red Jesus"
SEC Examples
Structured Text to Tables
COHousePhones
CO House Phones is a "scraper" for contact information for Colorado's members of the US House of Representatives. The source page (108th House Phone Directory) is a nicely formatted HTML layout with graphics and a column down the middle listing all the representatives in alphabetical order.
In this example, we want to extract just those members who are from Colorado and convert the output into a Comma Separated Value (.csv) list that is compatible with a spreadsheet program. This is accomplished in seven steps, represented by the seven icons in the document window.
1. The Source Object ("house phone dir") just loads the example page we're working with (linked above).
2. The "Table 5" Processor Object extracts table 5 from the source of the sample page, and converts the table to an array of values. This step eliminates irrelevant source data and focuses on just the list of Congresspeople, then converts that data into an in-memory spreadsheet-style representation.
3. The "striptags" Processor Object strips all HTML tags from the document source, making it plain text (to eliminate font, italic and bold tags).
4. The "join comma" Processor Object converts each row of the in-memory array into a single text string in which the column fields are separated by commas. The result is still an in-memory array, now holding one line of text per row.
5. The "join \n" Processor Object then takes that in-memory array of text strings and combines them with the linefeed (LF) character, making the data essentially a text file of comma-separated lines. In fact, at this point if you captured the results, you'd have a complete comma-separated list of the 108th House of Representatives Members (and you might want that!).
6. For this example, we just want the Colorado portion of the whole list, so we're going to use good old UNIX's tried and true "grep" command to filter that portion out. Looking at the source data, we see that this can be done, as one might expect, by filtering on "CO", so we use the command "grep CO" (specifying the path and arguments separately in the edit window).
7. Finally, we choose to store the result within Anthracite for now, so our scraping ends with a "Result Object" that stores the final output of each process run. The included sample has some stored result data.
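To make steps 3 through 6 concrete, here is a minimal Python sketch of the same transformations. It is not how Anthracite implements them; the function names are hypothetical, and the `rows` array is made-up data standing in for the output of the "Table 5" step.

```python
import re

# Made-up stand-in for the array extracted from "Table 5" (an assumption,
# not the real directory data).
rows = [
    ["<b>Jane Doe</b>", "CO", "1st", "555-0101"],
    ["<i>Richard Roe</i>", "CA", "2nd", "555-0102"],
    ["Pat Poe", "CO", "3rd", "555-0103"],
]

def strip_tags(cell):
    """Step 3: remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", cell)

def join_columns(table, sep=","):
    """Step 4: turn each row of fields into one comma-separated string."""
    return [sep.join(row) for row in table]

def join_rows(lines, sep="\n"):
    """Step 5: combine the row strings into one block of text."""
    return sep.join(lines)

def grep(text, pattern):
    """Step 6: keep only lines containing the pattern, roughly like `grep CO`."""
    return "\n".join(line for line in text.splitlines() if pattern in line)

cleaned = [[strip_tags(cell) for cell in row] for row in rows]
csv_text = join_rows(join_columns(cleaned))
colorado = grep(csv_text, "CO")
print(colorado)
```

Note that a plain substring filter like this keeps any line containing "CO" anywhere, not only in the state column, so it is worth checking the source data first, as step 6 does.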
Bible Sample "Bold Red Jesus"
This sample document demonstrates Anthracite searching through the text of the King James Bible (approx 4.5MB) to find all occurrences of the name "Jesus" and produce a report showing the context of the name's use by excerpting 1K of text around the term, and then highlighting the name in bold, red letters. This is done in four steps: load the source, find text near the name, highlight the name, and store the results.
1. The "kingjames_bible_pd.txt" Source Object loads the sample text from the included samples on the disk image file (you must have the Anthracite distribution volume mounted or change the file location to point at a local or net-based copy).
2. The "tn:Jesus" Processor Object finds the name "Jesus" in the source text, and then excerpts 1024 characters centered around the search term. The output of this is an array of text chunks of 1K size.
3. The "fr:BoldRedJesus" Processor Object performs the Find and Replace function on each of the incoming text chunks and uses HTML tags to highlight the search term in bold face and red color.
4. The "New Results Object" stores the output of the process chain.
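Steps 2 and 3 can be sketched in a few lines of Python. This is only an illustration of the idea, not Anthracite's implementation: the function names are hypothetical, the `<b>` and `<font>` tags are an assumption about what "bold face and red color" might look like in HTML, and the way excerpts are clipped at the start of the text is a guess.

```python
import re

def excerpts_near(text, term, size=1024):
    """Return one chunk of up to `size` characters centered on each match."""
    chunks = []
    for match in re.finditer(re.escape(term), text):
        center = (match.start() + match.end()) // 2
        start = max(0, center - size // 2)  # clip at the start of the text
        chunks.append(text[start:start + size])
    return chunks

def highlight(chunk, term):
    """Wrap each occurrence of the term in bold, red HTML markup."""
    replacement = '<b><font color="red">' + term + "</font></b>"
    return chunk.replace(term, replacement)

# Tiny synthetic input instead of the 4.5MB source text.
sample = ("x" * 2000) + "Jesus wept." + ("y" * 2000)
report = [highlight(chunk, "Jesus") for chunk in excerpts_near(sample, "Jesus")]
print(len(report), "excerpt(s)")
```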
SEC Examples (SEC10Qs and SECStepTwo)
See the first item on the online examples page.
Structured Text To Tables
In this example, we're going to take formatted addresses stored in a text file on the Internet and convert them into both comma-separated (.csv) and HTML Report formats. This is a good, simple example of the power of Anthracite to help solve business problems with Internet data. A real-world example of this type of system is described at Metafy Dogfood. This is the kind of task that previously would have required custom programming or scripting, but now you can do it yourself quickly and easily.
First, we start with the URL of the source data, in this case, http://www.metafy.com/anthracite/samples/data/SampleAddressList.txt. This is a dummy plain-text document that shows a typical layout for data such as directory listings.
From there, the data is sent to seven different processor objects, which use Regular Expressions to extract just the lines of interest from the source content. This is also explained in the Metafy Dogfood example, but a sample Regex would be:
Phone: .+$
This collects any line that has "Phone:" followed by anything up to the end of the line. In the next step (find/replace) we remove the word "Phone" from what we collected.
In the final step for each element we've extracted, we name the object after the data it collects, e.g., "Phone", "Name", "City", "Zip", etc. This is done for compatibility with databases and ease of constructing exports. That is, if you have a MySQL database with a table for this data, and fields named "Phone", "Name", "City", "Zip", etc., the data will be automatically inserted into those fields. Likewise, when we make a "Custom Snippet" report in the next phase, we use those field names to identify where we want to place the data in our report output.
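As a rough illustration of the extract-then-replace pair, here is the same regular expression applied in Python. The layout of `sample_text` is an assumption made for this sketch, not a copy of SampleAddressList.txt, and the variable names are hypothetical.

```python
import re

# Assumed layout for illustration only.
sample_text = """Name: Jane Doe
Address: 123 Main St
Phone: 555-0101

Name: Pat Poe
Address: 456 Oak Ave
Phone: 555-0103
"""

# re.MULTILINE makes `$` match at the end of every line,
# not only at the end of the whole text.
phone_lines = re.findall(r"Phone: .+$", sample_text, flags=re.MULTILINE)

# Find/replace step: drop the "Phone: " label, keeping only the number.
phones = [re.sub(r"^Phone: ", "", line) for line in phone_lines]
print(phones)
```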
Before we examine the CSV output file, let's look at the "TemplateSnippet" Report Object. Here, we're going to use Anthracite's built-in template system to generate an HTML table that holds the data from the structured text file, to help us proof the work we're doing.
In the "Custom Snippet" template type, we enter the following HTML snippet, which creates just the Rows for the table we want to create, where each row has one of the addresses from the data we extracted above. To insert data from the upstream processors, we use two underscores around the names of the input objects.
<tr><td>__Name__</td><td>__Address__</td> <td>__City__, __State__ __Zip__</td><td>__Phone__</td> <td>__Code__</td></tr>
Then we take the output of that Report Object and use a "Wrap" Processor Object to put the main table tags around the generated rows.
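The snippet-plus-wrap idea can be sketched in Python as follows. This is an approximation for illustration, not Anthracite's template engine: `fill_template` and `wrap` are hypothetical names, the record is made-up data, and escaping the values with `html.escape` is our own addition.

```python
import html
import re

ROW_TEMPLATE = (
    "<tr><td>__Name__</td><td>__Address__</td>"
    "<td>__City__, __State__ __Zip__</td><td>__Phone__</td>"
    "<td>__Code__</td></tr>"
)

def fill_template(template, record):
    """Replace each __Field__ placeholder with the matching record value."""
    def lookup(match):
        # Unknown field names become empty strings (an assumption).
        return html.escape(str(record.get(match.group(1), "")))
    return re.sub(r"__([A-Za-z]+)__", lookup, template)

def wrap(rows, before="<table>\n", after="\n</table>"):
    """Put the main table tags around the generated rows."""
    return before + "\n".join(rows) + after

records = [
    {"Name": "Jane Doe", "Address": "123 Main St", "City": "Denver",
     "State": "CO", "Zip": "80202", "Phone": "555-0101", "Code": "A1"},
]
table_html = wrap([fill_template(ROW_TEMPLATE, r) for r in records])
print(table_html)
```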
Finally, for this branch of the processing chain, we store that output in a Results Object, where it will be saved within this Anthracite document.
Now a quick step back to the Export Object (green cone) at the bottom center of this example document, where we are going to write out the comma-separated (.csv) file that converts the webpage into a spreadsheet-compatible file.
We've configured this File Export object to write a file using the object's name ("StructuredDataToCSV"), and we've set the "Extension" to ".csv", which also specifies how the output should be formatted.
Two other important settings to note are "Combined Output" which will put all the results into one file (instead of different files for each address), and "Overwrite Existing Files" which will write the data to the same file each time, deleting what was there before.
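For comparison, here is roughly what those two settings amount to if you were writing the file yourself in Python. This is a sketch under assumptions, not a description of the File Export object: the `export_csv` function, the column order in `FIELDS`, and the header row are all our own choices.

```python
import csv
import os
import tempfile

# Assumed column order, taken from the field names used in this example.
FIELDS = ["Name", "Address", "City", "State", "Zip", "Phone", "Code"]

def export_csv(records, directory, name="StructuredDataToCSV", extension=".csv"):
    """Write all records to one file, replacing any existing file of that name."""
    path = os.path.join(directory, name + extension)
    # One file for every record mirrors "Combined Output"; mode "w" truncates
    # an existing file, mirroring "Overwrite Existing Files".
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)
    return path

records = [
    {"Name": "Jane Doe", "Address": "123 Main St", "City": "Denver",
     "State": "CO", "Zip": "80202", "Phone": "555-0101", "Code": "A1"},
]
out_dir = tempfile.mkdtemp()
path = export_csv(records, out_dir)
path = export_csv(records, out_dir)  # a second run replaces the first file
print(path)
```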
Here are samples of the output of this Anthracite process for you to compare to the input data sample:
Output as Structured Text To Table (as HTML)
Output as StructuredDataToCSV.csv (as .csv file)
Still to come (but try them anyway!):
AnthraciteTesting
AppleScript Export - Mail
AppleScript Source - Safari
AppleScript-EMailsFromEmail
CurlHTTPSTest
EasiestDocument
MajorBugsRDF
MySQLTesting
RDF Example
SEC-Apple10Qs
SysAdmin-MRTGExcerpt
TableTests
UNIX Command - Perl Touch.anth
UNIX Command Source
URLListFile