Using Anthracite To Scan For Text in Large Quantities of Web-Based Documents

Goal: find the paragraph near certain text.
(That is, find the chunk of text surrounding the place where a document says "Risks" or "Time to", as in "from time to time".)


This example helps you quickly scan through large quantities of SEC filings for particular keywords.

Step One

Load today's index file and extract just the links for "10-Q" filings.


Step One Figure One shows the completed process chain.

At the top, we have loaded the daily index page as a source.


In this case it was http://www.sec.gov/Archives/edgar/daily-index/form.20030226.idx
(which may have expired by now; here's a cached copy, or just look in the directory for today's index or any recent one).

Step One Figure Two shows how we configured it to filter only the lines starting with "10-Q" using a regular expression.
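For readers who want to see the same filter outside Anthracite, here is a minimal Python sketch. The function name filter_form_lines and the sample lines are made up for illustration, and the pattern "^10-Q" is an assumption about what the regular expression in the figure looks like.

```python
import re


def filter_form_lines(index_text, pattern=r"^10-Q"):
    """Return only the index lines whose start matches the pattern.

    The default pattern is an assumption; it keeps every line that
    begins with "10-Q", which also includes amendments such as "10-Q/A".
    """
    regex = re.compile(pattern)
    return [line for line in index_text.splitlines() if regex.search(line)]


if __name__ == "__main__":
    # Hypothetical index lines, not taken from a real index file.
    sample = (
        "10-K    Example Corp A    edgar/data/1/a.txt\n"
        "10-Q    Example Corp B    edgar/data/2/b.txt\n"
        "10-Q/A  Example Corp C    edgar/data/3/c.txt\n"
    )
    for line in filter_form_lines(sample):
        print(line)
```

If amendments are not wanted, a stricter pattern such as r"^10-Q\s" keeps only lines where "10-Q" is followed by whitespace.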



We then send just the lines output by the regular expression to a Processor Object that extracts the URL ["extract link": get the text between "edgar" and ".txt", inclusive].


Then we modify the links so that they are absolute references (in the initial index document we loaded, they are relative)
[find/replace "edgar/" "http://www.sec.gov/Archives/edgar/"].
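The "extract link" and find/replace steps can be approximated in Python as follows. This is only a sketch of the idea, not how Anthracite does it internally; the function names extract_link and make_absolute and the sample line are hypothetical.

```python
import re

# Prefix used in the find/replace step described above.
BASE = "http://www.sec.gov/Archives/edgar/"


def extract_link(line):
    """Return the text from "edgar" through ".txt" inclusive, or None."""
    match = re.search(r"edgar.*?\.txt", line)
    return match.group(0) if match else None


def make_absolute(link):
    """Replace the leading "edgar/" with the absolute prefix (first hit only)."""
    return link.replace("edgar/", BASE, 1)


if __name__ == "__main__":
    # Hypothetical index line, for illustration only.
    line = "10-Q  Example Corp  1000002  20030226  edgar/data/1000002/0000000000-03-000002.txt"
    print(make_absolute(extract_link(line)))
```

The search is case-sensitive, so a company name such as "Edgar Corp" does not confuse it, but a lowercase "edgar" earlier in the line would.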

Finally, we use an Export Object to save these URLs to a file for use in the next step.

Step One Figure Three shows how the output is configured to save just the URLs as a text file.


Running this process chain ("Run Process" from the File menu, or Command-R) results in the output of this file on the desktop:

"SECDailyLinks.txt"

which contains just the URLs of today's 10-Q documents from the index (.idx) file.
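The hand-off between the two steps is just a plain text file of URLs. A rough Python equivalent, assuming one URL per line (the exact layout Anthracite writes is not shown here), could look like this; save_urls and load_urls are hypothetical helper names.

```python
from pathlib import Path


def save_urls(urls, path):
    """Write one URL per line to a text file (assumed layout)."""
    Path(path).write_text("\n".join(urls) + "\n", encoding="utf-8")


def load_urls(path):
    """Read the file back, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    urls = [
        "http://www.sec.gov/Archives/edgar/data/1/a.txt",
        "http://www.sec.gov/Archives/edgar/data/2/b.txt",
    ]
    save_urls(urls, "SECDailyLinks.txt")
    print(load_urls("SECDailyLinks.txt"))
```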

Step Two

Load all of today's 10-Qs and look through them for the specified terms.


Step Two Image One shows the completed array of objects in the process chain.


First, we load all of the URLs in the list we created in Step One using the "URL List" source type.

We then pass these documents to two different processors. On the left, "Time To" looks for 1024 bytes of text around the term "time to" (as in "from time to time").

On the right, "Risks" does the same, but looks for "risks" in the documents.
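To make the "text near" idea concrete, here is a small Python sketch. It rests on several assumptions that are not confirmed above: that the 1024 is a total window split evenly before and after the match, that matching ignores case, and that each occurrence gets its own snippet. It also counts characters rather than bytes. The function name text_near is hypothetical.

```python
import re


def text_near(document, term, window=1024):
    """Return one snippet of surrounding text per occurrence of term.

    Assumption: half of the window goes before the match and half after,
    clipped at the start and end of the document.
    """
    half = window // 2
    snippets = []
    for match in re.finditer(re.escape(term), document, flags=re.IGNORECASE):
        begin = max(0, match.start() - half)
        end = min(len(document), match.end() + half)
        snippets.append(document[begin:end])
    return snippets


if __name__ == "__main__":
    text = "The company may, from time to time, revise these estimates."
    for snippet in text_near(text, "time to", window=20):
        print(repr(snippet))
```

Snippets from nearby occurrences can overlap in this version; merging them would be a reasonable refinement.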

Step Two Image Two shows how we configured the "Risks" processor.


Because some of these filings can actually be in HTML (despite their ".txt" endings), we use a "Strip Tags" processor to clean up any HTML or other tags and be sure that we just extract the text.
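A comparable tag-stripping step can be sketched with Python's standard html.parser module. This illustrates the idea only and is not a description of how the "Strip Tags" processor works; strip_tags and _TextCollector are hypothetical names.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    """Return the markup with tags removed and entities such as &amp; decoded."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.parts)


if __name__ == "__main__":
    print(strip_tags("<p>Risks &amp; <b>uncertainties</b></p>"))
```

Note that this simple version keeps the contents of script and style elements as text, which may or may not matter for filings.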


Step Two Image Three shows the contents of a "Results" object after a successful run of this process chain.

Looking at the Results Object sheet, you can see that 6 documents were loaded. The first had no occurrences of either "time to" or "risks". The second and third appear to have had several (18345 and 14217 bytes of data respectively), and the final three had only a little matching data.

[NOTE: Once tags are stripped, the quantity of data in the results may not match the quantity specified in "text near".]


Clicking "Preview" in the Results Object sheet, as shown in Step Two Image Three, brings up this preview document of results in your browser:


Risks_Output_Sample.html
