last modified: 05/21/04
04/08/03
anthracite examples: using cron to generate a log
In this example, we're going to use Anthracite to extract data from four different web pages and then write the information into a MySQL database.
Here's what the completed process tree looks like:
There are quite a few objects in the tree, but they are divided into six branches, each of which is easy to understand. The next four paragraphs describe the four sources in the document, from which all six process branches start.
Starting on the left, the "Site 4K Meter" loads a password-protected webpage of status information about the webserver and its usage. From this page, we use a Text Between processor named "mfc excerpt" to extract just the subset of information we want (the text between the website name and the next horizontal rule tag), and then a second Text Between processor to get just one piece of data from that (the number at the end of the sentence). The second processor is named "avgwebunique" corresponding to the name of a field in the MySQL database into which we want this data to be stored. That data is then passed to the export object, discussed below.
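To make the two-stage extraction concrete, here is a minimal shell sketch of the "text between" idea. It is only an illustration, not how Anthracite implements its Text Between processor: the function name text_between, the sample page text, and the marker strings are all assumptions made up for this example.

```shell
# text_between TEXT START END
# Prints the part of TEXT after the first START and before the next END.
# Returns 1 and prints nothing if START does not occur in TEXT.
# (Hypothetical helper for illustration; not part of Anthracite.)
text_between() {
  case $1 in
    *"$2"*) ;;
    *) return 1 ;;
  esac
  _rest=${1#*"$2"}        # drop everything up to and including the first START
  _result=${_rest%%"$3"*} # drop the first END and everything after it
  printf '%s\n' "$_result"
}

# Made-up sample standing in for the status page.
page='Site: example.com Average unique visitors per day: 1234.<hr>Other stats'

# Stage 1 ("mfc excerpt"): text between the website name and the next <hr> tag.
excerpt=$(text_between "$page" 'example.com' '<hr>')

# Stage 2 ("avgwebunique"): the number at the end of the sentence.
avgwebunique=$(text_between "$excerpt" 'per day: ' '.')

printf '%s\n' "$avgwebunique"
```

Chaining two narrow extractions like this keeps each marker pair simple, which is the same reason the process tree uses two processors rather than one.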
Next, second from the left, is a UNIX Command object that uses the command-line program "curl" (included with MacOS X) to load an HTTPS (secure) webpage from a remote server. [KSL: Anthracite does not currently support HTTPS URLs directly, so an external tool such as curl must be used to access secure pages.] The decrypted data (the source of the webpage as you'd see it in your web browser) is passed to a Table Processor object, which reads the value from row 2, column 2 of table1 on that page. That data is HTML formatted, so it is then run through a "Strip" processor to remove the tags. That Strip processor is named "simuls" to represent the field of the MySQL record into which we want the result to be put.
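The fetch and tag-stripping steps of this branch can be approximated at the command line as follows. This is a rough sketch under stated assumptions: the URL is a placeholder, the function names fetch_page and strip_tags are invented for this example, and the table lookup (row 2, column 2) is not reproduced here. The sed expression is a simple approximation of tag removal and is not Anthracite's Strip processor.

```shell
# Fetch a secure page with curl. -s silences the progress meter and
# -f makes curl fail on HTTP errors. The URL is a placeholder.
fetch_page() {
  curl -sf 'https://secure.example.com/status.html'
}

# Remove anything that looks like an HTML tag from each line of stdin.
# Naive: assumes a tag never spans more than one line.
strip_tags() {
  sed -e 's/<[^>]*>//g'
}

# Example with a made-up table cell instead of a live page:
simuls=$(printf '%s\n' '<td><b>42</b></td>' | strip_tags)
printf '%s\n' "$simuls"

# With a real page the two steps would be chained:
#   fetch_page | strip_tags
```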
Following that is a source named "cursong" that loads a webpage script that computes the current song playing on a specific internet radio station. The data from that webpage script is "raw" with no formatting, so the source object is named after the field to put the data in, and the source is connected directly to the MySQL export object.
The final source, farthest to the right and named "BunkerNowPlaying", loads a webpage script that computes information about the playlist for a specific internet radio station. Its output passes through three processor pairs (six processor objects in total, forming three functional paths) to extract data. The first pair uses two Text Between processors to extract the remaining time of the current song. The second pair extracts the name of the current playlist from a link to it that appears in the source document. The third pair uses Text Between and Strip-Trim to extract and clean the remaining time in the current playlist.
The output of these six processor paths is sent to a MySQL Export object.
This object is configured to place the incoming data into a table called "mfclog" in the "test" database as a single entry (all the processed data from the four sources is combined into one database insertion).
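For readers who think in SQL, the single insertion is roughly equivalent to one INSERT statement that carries a value from every branch. The sketch below builds such a statement in the shell. It is an assumption-laden illustration, not what the MySQL Export object actually sends: the helper names sql_quote and build_insert are invented, only four of the field names mentioned in this example are used, the sample values are made up, and the timestamp column is assumed to be filled in by MySQL itself.

```shell
# Wrap a value in single quotes, doubling any single quotes inside it.
# Naive: assumes the value contains no backslashes.
sql_quote() {
  printf "'%s'" "$(printf '%s' "$1" | sed "s/'/''/g")"
}

# Build one INSERT for the mfclog table from four values.
build_insert() {
  printf 'INSERT INTO mfclog (avgwebunique, simuls, cursong, cursongremain) VALUES (%s, %s, %s, %s);\n' \
    "$(sql_quote "$1")" "$(sql_quote "$2")" "$(sql_quote "$3")" "$(sql_quote "$4")"
}

# Made-up sample values, one per branch.
stmt=$(build_insert 1234 17 "It's a Song" '5:04')
printf '%s\n' "$stmt"

# To run it against the "test" database with the mysql client:
#   build_insert 1234 17 "It's a Song" '5:04' | mysql test
```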
[Note about using remote databases]
To illustrate the output of this Anthracite document, here's an abbreviated sample from the MySQL database:
mysql> select * from mfclog LIMIT 1;
+----------------+-----------------------------------+---------------+
| timestamp      | cursong                           | cursongremain | ...
+----------------+-----------------------------------+---------------+
| 20030406103545 | 1499:001499_vn_con_in_d__andant   | 5:04          | ...
+----------------+-----------------------------------+---------------+
So, now we've constructed this process chain and it writes data to the MySQL database as we desire.
The next step is to use this Anthracite document in a recurring manner.
To do this, we'll use another built-in capability of MacOS X, "cron."
A complete discussion of cron is beyond the scope of this document (type "man cron" at a Terminal prompt), and there are some GUI-based cron tools available for MacOS X if you are uncomfortable with the UNIX command line approach.
Anthracite comes with built-in support for the MacOS X command line that enables you to use it with cron.
[Warning: the commands below presume Anthracite is installed in the Applications directory.]
For example, from the UNIX command line (accessed via the Terminal program), you can type:
% /Applications/Anthracite.app/Contents/MacOS/Anthracite -f /PATH/TO/ANTH/FILE -r
where "/PATH/TO/ANTH/FILE" is replaced with the path to an actual saved Anthracite document.
If you want your Anthracite process to run every minute, such as in the example above where we're compiling a log of information about what's being played on an internet radio station, use a crontab entry like this one:
*/1 * * * * /Applications/Anthracite.app/Contents/MacOS/Anthracite -f /Users/testing/Anthracite/MFCLogDB.anth -r
and your process will be run regularly while the computer is turned on.
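One way to install that entry without opening an editor is to print the current crontab, append the new line, and feed the result back to the crontab command. The sketch below is a suggestion rather than part of Anthracite: the helper name with_entry is invented, and it adds the line only when it is not already present so that repeated runs do not create duplicates.

```shell
entry='*/1 * * * * /Applications/Anthracite.app/Contents/MacOS/Anthracite -f /Users/testing/Anthracite/MFCLogDB.anth -r'

# Read an existing crontab on stdin and print it back, adding the line
# given as $1 only if no identical line is already there.
with_entry() {
  _current=$(cat)
  if [ -n "$_current" ]; then
    printf '%s\n' "$_current"
  fi
  # -F: fixed string, -x: whole line must match, -q: no output
  if ! printf '%s\n' "$_current" | grep -Fxq -- "$1"; then
    printf '%s\n' "$1"
  fi
}

# To install for the current user (not run here):
#   crontab -l 2>/dev/null | with_entry "$entry" | crontab -
# To check the result afterwards:
#   crontab -l
```

Typing "crontab -e" and pasting the line by hand achieves the same result.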