easy rss creation with anthracite
How To Generate RSS Feeds from Any Web Site or Data using Anthracite
Anthracite is the easiest way to create RSS feeds from any website, and it's flexible enough to do it using a variety of different methods. This document will walk you through several of them, including creating an RSS feed from SEC Edgar Filings, from a webpage of "Top 10" links, and from Google Search Results.
Google Results
SEC Edgar Filings
Working Assets Top 10 Cartoons
Once you've generated an RSS Feed document with Anthracite, copy it to your web server (or have Anthracite save it to your web server) where you can "subscribe" to it just like any other feed using your favorite News Reader. We're going to use NetNewsWire Lite and the built-in personal webserver running on the local machine in these examples. To try Sample Output from Anthracite, jump to the end of the SEC Edgar recipe below.
Anthracite makes it so easy to generate RSS Feeds, you should be able to set one up using any of these examples in only a few minutes.
Then, by using iCal to automate your Anthracite processes, you can schedule your feeds to be updated regularly, for example, every morning.
NOTE: These examples require Anthracite version 1.0.7 or newer.
For a refresher on what the Anthracite Objects do and how they are configured, see Anthracite Tools Documentation.
To Download the software for a free two-week trial, click here.
To Purchase Anthracite (only $99USD), click here.
Examples:
Apple Edgar Filings | Working Assets Most Sent | Google Results to RSS Feed
Google Search Results to RSS Feed
As you can see from the screen shot, this example demonstrates just how easy it can be to create an RSS Feed from other information, such as the results of a Google search.
To use the Google API, you must have a Google API Key.
As part of the first step, enter the key they send you into a Source Object configured to the Google API subtype, and it will be saved in your Keychain so that you do not have to re-enter it each time.
1. Configure the Source Object
Configure the Google API Source object by simply entering the search term you'd like to get results for, and configuring any other details of the Google API Source Object, such as the number of results you want returned.
In our example, we're searching for "Italian Lounge Music MP3s" and we want only 25 results to keep the run quick (Google allows up to 10,000 results per day).
We're also going to use the Raw XML result returned from the Google API and simply extract the summary that Google provides. Alternatively, we could use Anthracite to go out and fetch the content of each of the pages that the Google query returned, and use the source of the loaded pages to build the RSS feed instead.
2. Extract the Data from the Result
Our RSS Feed will be quite simple: we just want the link, the title, and the summary description that Google returns.
In the screenshot shown, there are six processor objects divided into three processing paths, two objects each per the three fields we're going to extract.
Starting on the left, the "URL" Processor Object is a Text Between processor that looks for the data between the XML "URL" tags in the result. The result of that process is passed to another Text Between processor that extracts just the "http" link portion of the tag.
In the middle, the "Snippet" processor chain does basically the same thing, first extracting the "Snippet" tags from the Google XML, and then using a "Strip Tags" processor to remove the wrapper XML tags which we're not going to use, leaving just the description info already conveniently in the HTML Encoding format.
On the right, the "Title" process is just like the middle process, first use a Text Between to extract a specific tag (in this case "title") and then use a Strip Tags processor to remove the tag, leaving just the data, which will then be automatically tagged by Anthracite.
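To make the three processing paths concrete, here is a rough Python sketch of the same idea outside Anthracite. The text_between and strip_tags functions are our own stand-ins for the Text Between and Strip Tags processors, not Anthracite's implementation, and the SAMPLE XML is a hypothetical fragment; the real Google API response may be laid out differently.

```python
import re

def text_between(text, start, end, inclusive=False):
    """Return every span found between the start and end markers.

    With inclusive=True the markers themselves are kept in each result.
    """
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    matches = re.finditer(pattern, text, flags=re.DOTALL)
    return [m.group(0) if inclusive else m.group(1) for m in matches]

def strip_tags(text):
    """Remove anything that looks like a markup tag, then trim whitespace."""
    return re.sub(r"<[^>]+>", "", text).strip()

# Hypothetical result fragment (assumption: not the actual Google API XML).
SAMPLE = """
<item>
  <URL xsi:type="xsd:string">http://example.com/lounge</URL>
  <snippet xsi:type="xsd:string">Downloadable &lt;b&gt;lounge&lt;/b&gt; tracks</snippet>
  <title xsi:type="xsd:string">Italian Lounge</title>
</item>
"""

def extract_fields(xml):
    # Left path: isolate the URL element, then keep only the link after the
    # opening tag's closing bracket (the second Text Between in the screenshot).
    links = [chunk.split(">", 1)[1].strip()
             for chunk in text_between(xml, "<URL", "</URL>")]
    # Middle and right paths: extract the whole element, then strip the wrapper tags.
    descriptions = [strip_tags(chunk)
                    for chunk in text_between(xml, "<snippet", "</snippet>", inclusive=True)]
    titles = [strip_tags(chunk)
              for chunk in text_between(xml, "<title", "</title>", inclusive=True)]
    return [{"link": l, "description": d, "title": t}
            for l, d, t in zip(links, descriptions, titles)]

fields = extract_fields(SAMPLE)
```

Note that the snippet text stays HTML-encoded after the tags are stripped, which matches the behavior described for the "Snippet" chain.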
3. Configure the File Export
This step is extremely easy, but before we cover it, here's a quick refresher on one aspect of how Anthracite works.
In the current example, we've decided we want to extract three fields, and we've now built processor chains to manipulate the source information to get what we want.
In the final step, we're going to connect the processing chains to an Export Object to write out the file containing the RSS Feed.
Anthracite Report and Export Objects "figure out" what data you want to use based on how you have named the Inputs to these objects. It's very important to remember this whenever working with Anthracite.
So, in this example, we've got three Processor Objects that connect to the Export Object: "link", "description", and "title".
These are the names of the data fields that will be written in the RSS Feed, corresponding perfectly to what we need to include in our feed to make it work.
This same concept applies when working with Report Objects and the other Exports, such as the MySQL Database and AppleScript.
Back on track now, the Export Object only needs a few minor details to be configured and then we'll be ready to run.
Set the Name of the Export Object to the base filename that you want. For example, if you want your feed to be in a file named "MyOutput.rdf", set the object's name to "MyOutput" and choose the "Use Object's Name" option for file naming.
Since we need to serve this Feed to our News Reader client, we're going to want it to be on the webserver, so the Export Directory is set to "/Library/WebServer/Documents" to make use of the built-in Apache webserver on MacOS X.
Be sure to set the "Extension" of the file to ".rdf" (an XML file type used for RSS feeds) to tell Anthracite you want it to convert the data into an RSS Feed.
Make sure that "Combined Output" is checked and that it's okay to Overwrite Existing Files.
Now, you're ready to run. Once the Anthracite process completes, switch to your News Reader program (in our example we're using NetNewsWire Lite), and "Subscribe" to your feed on your webserver, for example, "http://localhost/MyOutput.rdf".
You should then have a collection of "news stories" each representing a Google Search result.
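If you're curious what an export of this kind boils down to, the following Python sketch wraps named fields ("link", "description", "title") in item elements. It is only an illustration: the build_feed function is hypothetical, and we assume a simple RSS 2.0 layout here, which may not be the exact format Anthracite writes for the ".rdf" extension.

```python
import xml.etree.ElementTree as ET

def build_feed(channel_title, channel_link, items):
    """Build a minimal RSS 2.0 document from dicts keyed by field name."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = channel_title
    for fields in items:
        item = ET.SubElement(channel, "item")
        # The element names come straight from the field names, mirroring
        # how the named inputs decide what is written to the feed.
        for name in ("title", "link", "description"):
            ET.SubElement(item, name).text = fields[name]
    return ET.tostring(rss, encoding="unicode")

# Hypothetical sample data standing in for one search result.
ITEMS = [{
    "title": "Italian Lounge",
    "link": "http://example.com/lounge",
    "description": "Downloadable lounge tracks",
}]

feed = build_feed("Italian Lounge Music MP3s", "http://localhost/MyOutput.rdf", ITEMS)
```

Writing the returned string to "/Library/WebServer/Documents/MyOutput.rdf" would then make it available at the localhost URL used above.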
SEC Edgar Apple Computer Filings:
http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000320193&owner=include
Step One: Get the Page, Extract the Rows
In Step One, we're going to configure the source to load the URL above, which contains the CIK code for Apple Computer, and will thus return a page of results with Apple's latest filings, from which we want the first ten to be our RSS Feed.
After loading the HTML of that page, we examined the tables that comprise the page and found that the latest filings are in Table 8, so we configure a Tables Processor Object to extract that for us.
We also noticed that the links are relative links, so we converted them to be absolute links using the Fix Links Processor Object.
Then, with the individual filing documents stored in rows in this table, we used a Text Between Processor Object to extract all the data between "TR" tags.
Finally in this step, we excerpt items 2 thru 11 of the table, to skip the header row and then collect just the ten most recent filings.
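For readers who think in code, Step One can be approximated with the Python sketch below. It starts from a table that has already been isolated (we don't reproduce the "Table 8" lookup), and the function names, the sample table, and the "/Archives/..." paths are all hypothetical stand-ins rather than Anthracite internals or real SEC markup.

```python
import re
from urllib.parse import urljoin

def fix_links(html, base_url):
    """Rewrite relative href values as absolute URLs."""
    def absolutize(match):
        return 'href="' + urljoin(base_url, match.group(1)) + '"'
    return re.sub(r'href="([^"]*)"', absolutize, html, flags=re.IGNORECASE)

def table_rows(table_html):
    """Return the contents of every TR element in the table."""
    return re.findall(r"<tr\b[^>]*>(.*?)</tr>", table_html,
                      flags=re.DOTALL | re.IGNORECASE)

def excerpt(items, first, last):
    """Keep items first..last, counting from 1 and including both ends."""
    return items[first - 1:last]

BASE = "http://www.sec.gov/cgi-bin/browse-edgar"

# Hypothetical table: one header row followed by twelve filing rows.
table = "<table><tr><th>Form</th></tr>" + "".join(
    f'<tr><td><a href="/Archives/doc{n}.txt">10-Q</a></td></tr>'
    for n in range(1, 13)
) + "</table>"

# Items 2 thru 11: skip the header row, keep the ten most recent filings.
latest = excerpt(table_rows(fix_links(table, BASE)), 2, 11)
```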
Step Two: Extract Specific Data from the Rows
After step one (the end of the straight line of objects at the top of the screenshot) we've got a collection of ten rows from Table 8 on the webpage. In this step, we're going to extract just what we want from the fields in each row to fill in the information in our RSS Feed.
First, we have to extract the individual table data cells from the rows, so we use a Text Between Processor Object to get all the text in "TD" tags.
We then want to get at data in the individual columns of the original table, which will now be in the array of individual cells for each row that we have after the last step, so we use an Excerpt Processor Object on the array to get Columns 1, 3 and 4 (we're skipping the file size info).
From Column One, we get both the link to the document and the type of filing that it represents (e.g., a "10-Q"). The link is extracted with a Text Between processor that finds data between the "href=" and end of an anchor tag. The form type is between the opening and closing of the anchor tag, and uses Text Between and Strip Tags to extract and clean-up just the info we want.
From Column Three we get the title of the filing, and after the excerpt process, we only need to strip the tags that are still there.
From Column Four we get the date of the filing, which we want to include in the RSS Feed, and it too only requires a Strip Tags Processor Object to clean up the remnant tags and be ready to use.
Now we're done with Step Two and have groomed just the specific data we want to include in our RSS Feed. We're ready to use a Report Object to format the data for output, and an Export Object to write it out.
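The per-row extraction in Step Two might look like this in Python. Again, this is a sketch under assumptions: parse_row is a hypothetical helper, the sample row is invented to match the column layout described above (link and form type, file size, title, date), and the field names are chosen to line up with the report placeholders used in the next step.

```python
import re

def strip_tags(text):
    """Remove anything that looks like a markup tag, then trim whitespace."""
    return re.sub(r"<[^>]+>", "", text).strip()

def parse_row(row_html):
    """Pull link, form type, title and date out of one table row."""
    cells = re.findall(r"<td\b[^>]*>(.*?)</td>", row_html,
                       flags=re.DOTALL | re.IGNORECASE)
    # Columns 1, 3 and 4; column 2 (file size) is skipped.
    first, third, fourth = cells[0], cells[2], cells[3]
    link = re.search(r'href="([^"]*)"', first, flags=re.IGNORECASE).group(1)
    anchor_text = re.search(r"<a\b[^>]*>(.*?)</a>", first,
                            flags=re.DOTALL | re.IGNORECASE).group(1)
    return {
        "link": link,
        "formType": strip_tags(anchor_text),
        "title": strip_tags(third),
        "date": strip_tags(fourth),
    }

# Hypothetical row (assumption: not copied from the SEC page).
ROW = ('<td><a href="http://www.sec.gov/Archives/doc1.txt"><b>10-Q</b></a></td>'
       '<td>120 KB</td>'
       '<td><font size="2">Quarterly report</font></td>'
       '<td>2004-08-05</td>')

filing = parse_row(ROW)
```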
Step Three: Format the RDF Report and Export
In this final step we're going to use the data that we've extracted in the previous steps, format the inner body of the RSS Feed, and write it out to disk.
Each of the Processor Objects that are the final steps of the processor chains up to this point are connected to the Report Object, which we've configured to be a "Template" subtype and use the "Custom Snippet" method where we enter the template snippet directly in the edit field.
Based on how the Report Object works (see the important note above), we can use the named inputs to insert the data that we want in our output report, in this case, it will be the individual items in the RSS Feed.
Knowing just a little about how RSS works is helpful here, but suffice it to say it's a markup language like HTML and it uses tagged text to represent the various parts of a feed and story. Here's what the Snippet Report Object is using to generate this feed:
<item>\n <title>Form __formType__: __title__</title> <link>__link__</link> <description>Filed __date__</description> </item>\n
It's basically three tags (title, link and description) wrapped inside an "item" tag. Anthracite automatically replaces each double-underscore placeholder, such as "__link__", with the data from the input object whose name matches the placeholder name.
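The substitution itself is simple enough to sketch in a few lines of Python. The render function below is our own illustration of placeholder replacement, not Anthracite's code; we assume the "\n" in the snippet stands for a line break, and the sample values are made up.

```python
import re

# The snippet from above, with "\n" treated as a real line break (assumption).
TEMPLATE = ("<item>\n <title>Form __formType__: __title__</title> "
            "<link>__link__</link> "
            "<description>Filed __date__</description> </item>\n")

def render(template, fields):
    """Replace each __name__ placeholder with the value of the matching field."""
    def replace(match):
        name = match.group(1)
        # Placeholders with no matching input are left as they are.
        return fields[name] if name in fields else match.group(0)
    return re.sub(r"__(\w+?)__", replace, template)

# Hypothetical values for a single filing.
item = render(TEMPLATE, {
    "formType": "10-Q",
    "title": "Quarterly report",
    "link": "http://www.sec.gov/Archives/doc1.txt",
    "date": "2004-08-05",
})
```

Values are inserted verbatim in this sketch, so any field containing characters such as "&" or "<" would need to be encoded before it is substituted.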
Finally, we have to write out the resulting report to a file so that we can subscribe to it in our news reader.
To do this, we use an Export Object configured to write into the local machine's web server directory (/Library/WebServer/Documents), we specify the ".rdf" extension, and combine the output and overwrite any existing files.
Now we can run the process, and it spends most of its time waiting for the SEC Edgar server, then performs the processes and writes out the file. If everything is successful, you should have a file like these sample results (you can subscribe to this with NetNewsWire using this URL):
http://www.metafy.com/anthracite/docs/examples/easiest_rss_feed_generator/EdgarAAPLLatest.rdf
Working Assets Most Sent:
http://www.workingforchange.com/mostsent.cfm
1. First, configure the source object to load the desired page (http://www.workingforchange.com/mostsent.cfm). While you're developing a process, we recommend you load the entire page into a result, then drag your result text back into Anthracite to create a Static Text Source Object that is a cached version of the page. This will result in faster development (and friendlier usage of the server) as you do not need to reload the data each time you try a process variation.
2. Examine the resulting HTML using the "Extract All Tables" option of the Table Processor Object. You'll find that Table 6 has the specific info that we want (a list of the stories), and that each item is a single cell in a row in the table. With this info, we set up the first three objects, which simply extract all the individual elements from Table 6.
In the next set of steps, we're going to extract the specific information we're interested in for our RSS Feed.
3. The fields we want to use are Link, Title, Author, Date and SourceName. All the data we got in step two is within each cell, but now within "font" tags. We use another "Text Between" to get those items, and then an "Excerpt Array" processor to get columns 1 thru 3 into separate processing paths.
4. Next we clean up each field individually. Column One has the Anchor (link) tags, so that's where we'll get the URL and Title for our RSS items. Column Two has the date the story was published. And Column Three has the author and the publication source.
Column One, with the anchor tags, needs to have an Extract Links process run on it first to just get the anchors out, then we do a text between to extract the URL (between the double-quotes) and the Title of the link (between the tag parts).
Column Two is easy to process, we just need to strip the tags and other extraneous whitespace characters (tabs and returns) from the text and we're left with just the data.
Column Three takes a little more effort than the first. The data we want is in the format "First Last | Source" but has the same extraneous spaces that Column Two did, so first we clean that up, then we use the "Text Near" processor twice, once to get the text before the vertical bar, and once for the text after it. We then need to eliminate the vertical bar from the text, and in the case of the Source, we add the text "A story from" while we're doing it.
For the last step of processing each column, we name the object by the name of the RSS field that we want to create, so for example, the URL data comes from the object "link" and the title from "title".
This way, when these objects are connected to the Export Object in the next step, they'll be formatted automatically.
5. For the last development step, we need to configure the Export Object, which will do the heavy lifting of formatting the RDF/RSS output for us using the names of the inputs that we set up at the end of step 4.
We configure the directory path location and the name, as well as set the ".rdf" extension and make sure that we combine all the output into one file and overwrite any existing files (a feed is always being updated).
6. Finally, we can Run The Process and Examine the Results. The result should be a file that you can subscribe to with your news reader to see each of the "Top 10" items from the source page as an individual story.
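As a small illustration of the Column Three clean-up from step 4, here is one way to split a "First Last | Source" byline in Python. The split_byline function and its key names are hypothetical, and it uses a plain string split rather than anything resembling the "Text Near" processor; the sample byline is invented.

```python
def split_byline(raw):
    """Split 'First Last | Source' into author and source fields."""
    # Collapse tabs, returns and runs of spaces into single spaces.
    cleaned = " ".join(raw.split())
    # Text before the vertical bar is the author, text after it is the source.
    author, _, source = cleaned.partition("|")
    return {
        "author": author.strip(),
        "sourceName": "A story from " + source.strip(),
    }

# Hypothetical byline with the kind of stray whitespace described in step 4.
byline = split_byline("\n\t Jane Doe  |\r\n  Example News \t")
```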
Last Modified: 8/10/2004
Copyright © 2004, All Rights Reserved, Metafy LLC