
anthracite processor objects

Anthracite Processors do the work of extracting information from sources you've loaded and preparing that information for output.

There are currently 16 processors available, many of which can perform a wide variety of tasks. Additional processors are in development, and the existing ones will be further enhanced over time.

[ Find/Replace ] [ Strip ] [ Wrap ] [ Fix Links ]
[ Unify Linefeeds ] [ HTML Entities ] [ Text Between ] [ Text Near ] [ Excerpt ]
[ REGEX ] [ Tables ] [ Split / Join ] [ UNIX Command ] [ Summarize ] [ Xpath ] [ Conditional ]

1. Find/Replace

The Find/Replace Processor Object tool provides basic "global search and replace" on text, similar to what you may be used to in a word processor. Text to be found must match exactly, with capitals and spaces included.
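
As a rough illustration of the exact-match behavior described above, here is a minimal Python sketch. The function name find_replace is hypothetical and is not part of Anthracite; it only mirrors a case-sensitive global search and replace.

```python
def find_replace(text, find, replace):
    """Replace every exact, case-sensitive occurrence of `find` in `text`."""
    return text.replace(find, replace)


# "Cat" is left alone because the capital letter does not match "cat".
print(find_replace("Cat and cat", "cat", "dog"))
```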

 

2. Text Between

The Text Between Processor Object extracts all text from its input that is found between specified starting and ending delimiters, which may or may not be included in the result and are optionally case-sensitive. Among many other uses, it makes it easy to extract most tagged text. The "Extract Links" preset is an example of how you can use the Text Between tool to extract all of the hyperlinks on a page into an array of results, simply by finding all the text between the starting string "<a " and the ending string "</a>" (ignoring case).
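
The link extraction described above can be sketched in Python as follows. This is only an approximation of the idea, not Anthracite's implementation; the function name text_between and its parameter names are assumptions made for the example.

```python
import re


def text_between(text, start, end, include_delimiters=False, ignore_case=False):
    """Return every run of text found between `start` and `end` as a list."""
    flags = re.DOTALL | (re.IGNORECASE if ignore_case else 0)
    # Escape the delimiters so they are matched literally, not as patterns.
    pattern = re.escape(start) + "(.*?)" + re.escape(end)
    return [
        m.group(0) if include_delimiters else m.group(1)
        for m in re.finditer(pattern, text, flags)
    ]


html = '<p><A HREF="a.html">One</A> and <a href="b.html">Two</a></p>'
links = text_between(html, "<a ", "</a>", include_delimiters=True, ignore_case=True)
print(links)
```

With include_delimiters left off, the same call returns only the text inside each pair of delimiters.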

 

3. Text Near

Text Near is a Processor Object tool that excerpts text conditionally, based on the presence of matching text. In the example pictured here, Text Near will find 1K of characters centered on the word "risk." Text Near can also return the text before or after the matching text, and can be set to ignore the case of the text to be matched.
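
A minimal Python sketch of this kind of excerpting is shown here. The function name text_near, the where parameter, and the decision to split the character count evenly around the match are all assumptions; the actual tool may count characters differently.

```python
import re


def text_near(text, match, chars=1024, ignore_case=False, where="around"):
    """Return an excerpt for each occurrence of `match` in `text`.

    where="around" takes chars // 2 characters on each side of the match
    (an assumed reading of "centered"); "before" and "after" take `chars`
    characters on one side only and leave the match itself out.
    """
    flags = re.IGNORECASE if ignore_case else 0
    excerpts = []
    for m in re.finditer(re.escape(match), text, flags):
        if where == "before":
            excerpts.append(text[max(0, m.start() - chars):m.start()])
        elif where == "after":
            excerpts.append(text[m.end():m.end() + chars])
        else:
            half = chars // 2
            excerpts.append(text[max(0, m.start() - half):m.end() + half])
    return excerpts


print(text_near("low RISK today", "risk", chars=8, ignore_case=True))
```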

 

4. UNIX Command

The UNIX Command Processor tool is among the most powerful tools in the Anthracite toolkit because it allows you to process data using ANY UNIX command line program that accepts input on Standard In and writes its results to Standard Out. As the preceding sentence may suggest, you must have some familiarity with the UNIX command line (on Mac OS X this is generally reached via the Terminal program) and understand how to configure UNIX command line programs to meet your requirements.

See also the Additional Documentation about using UNIX command line programs in Anthracite, which includes specific configuration examples.

To explain the example shown, this processor object is configured to run the UNIX command "grep" on the input data, looking for occurrences of "http:" at the beginning of the line. Also, the input data will be forced into UNIX linefeed mode for compatibility.
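
For readers who want to see the same idea outside Anthracite, the following Python sketch pipes text through grep in roughly the way the example describes. The function name unix_command is hypothetical, the sample URLs are made up, and the sketch assumes grep is installed and on your PATH.

```python
import subprocess


def unix_command(data, argv):
    """Send `data` to a command's Standard In and return its Standard Out."""
    # Force UNIX linefeeds first so line-oriented tools see one line per row.
    unix_data = data.replace("\r\n", "\n").replace("\r", "\n")
    completed = subprocess.run(argv, input=unix_data, capture_output=True, text=True)
    return completed.stdout


sample = "http://example.com/a\r\nsee http://example.com/b\r\nhttp://example.com/c\r\n"
# Only the lines that begin with "http:" are kept.
print(unix_command(sample, ["grep", "^http:"]), end="")
```

Passing the command as a list of arguments, rather than as one shell string, avoids having the shell reinterpret characters in the pattern.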

WARNING: It is theoretically possible for you to cause damage to or disrupt the operation of your system using the UNIX Command object, such as by erasing important files or rebooting the system. Please make sure that you understand what the commands you are running are capable of and how the settings may affect their operation. Metafy/Anthracite is not responsible for anything, intentional or accidental, that you or your users do with these tools.

 

5. REGEX (Regular Expressions)

The REGEX Processor Object is an interface to the powerful text matching capabilities of Regular Expressions, which enable you to match patterns in text even when the actual text changes, such as looking for all phone numbers or e-mail addresses in a page of text you've loaded. A complete explanation of the capabilities and syntax of Regular Expressions is beyond the scope of this user guide; however, some examples of regular expressions that you can use and modify will be provided.

In the example pictured here, we are searching for phone numbers. The Regular Expression "[0-9]{3}-[0-9]{3}-[0-9]{4}" specifies matching a pattern of any three digits followed by a dash followed by another three digits followed by another dash and then four digits, such as 800-555-1212.
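
The same pattern can be tried in Python's re module, as in this short sketch. The sample text and the function name find_phone_numbers are invented for illustration, and Anthracite's Regex Engine may differ from Python's in details not shown here.

```python
import re

# Three digits, a dash, three digits, a dash, then four digits.
PHONE = re.compile(r"[0-9]{3}-[0-9]{3}-[0-9]{4}")


def find_phone_numbers(text):
    """Return every substring of `text` that matches the phone pattern."""
    return PHONE.findall(text)


print(find_phone_numbers("Call 800-555-1212 or 212-555-0100 before 5pm."))
```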

The checkbox settings for "Extended," "Ignore Case" and "Use Newlines" provide compatibility with the capabilities of the Regex Engine. [MORE INFO TK]

Please also see the Additional Documentation for Using Regular Expressions in Anthracite.

 

6. Table Processor

The Table Processor Object is designed to extract information from HTML tables, down to the granularity of rows and columns or individual cells.

To extract a specific table on a webpage, click in the round "radio button" next to "Extract Specific Table" and then type in the number of the table you want.

The extracted table(s) can be treated as a block of HTML text or converted into a spreadsheet-style array of rows and columns. To convert HTML table cells into an array of spreadsheet-like cells, check the "Convert HTML to Array" checkbox.

To specify the extraction of a specific row, column, or cell (a row and column pair), check the "Extract Specific" checkbox and enter the number of the row, column, or the row & column of the cell you want.

[NOTE: Specifying a table number, row number or column number that is not valid for a given table will result in an error/warning message in the console or log, and will produce no output for the invalid request].
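
To make the "Convert HTML to Array" idea more concrete, here is a small Python sketch built on the standard html.parser module. The class and function names are hypothetical, the sketch does not handle tables nested inside other tables, and it uses Python's 0-based indexes rather than assuming how the tool numbers tables, rows and columns.

```python
from html.parser import HTMLParser


class TableToArray(HTMLParser):
    """Collect each (non-nested) HTML table as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._cell = None  # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None


def extract_tables(html):
    parser = TableToArray()
    parser.feed(html)
    parser.close()
    return parser.tables


page = ("<table><tr><th>Name</th><th>Qty</th></tr>"
        "<tr><td>Apples</td><td>3</td></tr></table>")
tables = extract_tables(page)
print(tables[0])        # the whole first table as rows and columns
print(tables[0][1][0])  # a single cell: second row, first column
```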

NOTE: Although this tool has been tested extensively on diverse HTML samples from the Internet, it may not be able to accommodate certain types of malformed or incorrect table tags. During the initial field testing phase, it is important to let us know about problems you encounter using this tool, including the URL of the source documents.

Usage Notes: Due to the diverse uses of table layouts on many web pages, this tool may also require some trial and error on your part to extract the specific table or information you want. For example, the table number of the specific table you wish to extract may not be what you anticipate from viewing the page in your browser due to the use of "table within table" nested layouts. You may need to do a "first pass" process on a document and then preview the results of "Extract All Tables" to determine the specific table you want. Sometimes the output of "Extract All Tables" will make it look as if there are many duplicates of certain tables on a given web page. This is simply the result of each "table within a table" from the layout being treated as a unique table. When this is the case, you may wish to work with the table that is the "most nested" within the set of tables in the layout. Often, that table will be the first of several similar looking tables and will have the fewest extraneous parts.

 

7. Strip Processor

The Strip Processor Object gives you the ability to remove specific characters or types of characters from input text.

Strip Tags - Remove all tags (text between "<" and ">") in the input text

Strip Returns - Remove all return characters ( \r CR, \n LF, & \r\n CRLF ) from the input text

Strip Tabs - Remove all tab characters (\t) from the input text

Strip Character - Strip a particular character from the input text (specify the single character to be stripped in the text field to the right of the Strip Character text)

Trim - Remove all whitespace characters from the beginning and end of the input text

Strip Blank Lines - Remove all occurrences of blank lines (e.g., \n\n becomes \n) from the input text
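
The options listed above can be approximated with a few one-line Python functions. These are illustrative sketches only, with hypothetical names; in particular, strip_blank_lines here assumes UNIX (\n) linefeeds and does not treat lines containing only spaces as blank.

```python
import re


def strip_tags(text):
    """Remove everything between "<" and ">", including the brackets."""
    return re.sub(r"<[^>]*>", "", text)


def strip_returns(text):
    """Remove CR and LF characters (which also covers CRLF pairs)."""
    return text.replace("\r", "").replace("\n", "")


def strip_tabs(text):
    return text.replace("\t", "")


def strip_character(text, character):
    return text.replace(character, "")


def trim(text):
    """Remove whitespace from the beginning and end only."""
    return text.strip()


def strip_blank_lines(text):
    """Collapse runs of consecutive linefeeds into a single linefeed."""
    return re.sub(r"\n{2,}", "\n", text)


print(strip_tags("<b>Hi</b> there"))
print(repr(strip_blank_lines("a\n\n\nb\n")))
```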

 

8. Wrap

The Wrap Processor Object allows you to place processed text between starting and ending text that you specify. For example, as shown, this Wrap object would place any input data it receives between the "data" start and end tags, as you might use for preparing text for XML export.
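
In Python terms, the wrapping step amounts to simple concatenation, as in this sketch. The function name wrap is hypothetical, and the tags "<data>" and "</data>" are an assumed spelling of the "data" start and end tags mentioned above.

```python
def wrap(text, start, end):
    """Place `text` between the given starting and ending text."""
    return start + text + end


print(wrap("42", "<data>", "</data>"))
```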

 

9. Fix Links

If you have processed HTML from a source that makes use of relative paths for links and images, you can use the Fix Links Processor Object to repair them relative to the page or host from which they came.

[KSL: currently only "Host" relative method might work, and then does not currently seem to work for all relative links]
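
The general technique of resolving relative links can be sketched with Python's standard urllib.parse module. This is not how Anthracite implements Fix Links; the function name fix_link, the method parameter, and the example URLs are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlsplit


def fix_link(link, page_url, method="page"):
    """Resolve `link` against the page it came from, or against its host."""
    if method == "host":
        parts = urlsplit(page_url)
        base = f"{parts.scheme}://{parts.netloc}/"
    else:
        base = page_url
    # urljoin leaves links that are already absolute unchanged.
    return urljoin(base, link)


page = "http://example.com/docs/tools.html"
print(fix_link("img/logo.gif", page))                 # relative to the page
print(fix_link("img/logo.gif", page, method="host"))  # relative to the host
```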

 

10. Unify Linefeeds

The Unify Linefeeds Processor Object converts several types of linefeeds in the input text into one standardized linefeed form that you specify from the menu. For example, if you would like to convert all '\r\n' (CRLF, Microsoft Windows format returns) into '\r' (CR, Apple Macintosh format returns), simply select the '\r' (CR) option from the menu. This tool is frequently useful when sending data into or getting text out of the various UNIX command objects.
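
A minimal Python sketch of this conversion follows. The function name unify_linefeeds is hypothetical; the sketch treats CRLF, CR and LF as the linefeed forms to convert, which is an assumption about what "several types" covers.

```python
import re


def unify_linefeeds(text, target="\n"):
    """Convert CRLF, CR and LF linefeeds to the single `target` form."""
    # CRLF is listed first so a pair is replaced once, not twice.
    return re.sub(r"\r\n|\r|\n", lambda _match: target, text)


mixed = "one\r\ntwo\nthree\rfour"
print(repr(unify_linefeeds(mixed, "\r")))
```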

 

11. Excerpt

Excerpt allows you to sub-select a numeric range of characters or array elements. Enter the starting index and ending index, and specify whether you want to excerpt strings or array elements. As the examples that follow show, indexes count from 0 and the range includes both the starting and the ending index. For example, if your source data were:

The quick brown fox jumped over the lazy dog.

and you requested 4 to 18 with the String excerpt method, you'd end up with:
quick brown fox
as the result.

If your source were an array, such as:
{ "The", "quick", "brown", "fox", "jumped", "over" }
and you requested to use the Array excerpt method, and specified 1 to 3, you would end up with:
{ "quick", "brown", "fox" }
in your output.

Finally, if your source is an array (such as the example above), and you request 0 to 2 with the String Method, you'd get:
{ "The", "qui", "bro", "fox", "jum", "ove" }
as your resulting output.
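
The three cases above can be reproduced with the following Python sketch. The function name excerpt and the method parameter are hypothetical; the 0-based, inclusive indexing is inferred from the examples rather than taken from any other description of the tool.

```python
def excerpt(source, start, end, method="string"):
    """Sub-select the inclusive range start..end, counting from 0."""
    if method == "array":
        return source[start:end + 1]
    if isinstance(source, list):
        # String method on an array: excerpt each element separately.
        return [item[start:end + 1] for item in source]
    return source[start:end + 1]


words = ["The", "quick", "brown", "fox", "jumped", "over"]
print(excerpt("The quick brown fox jumped over the lazy dog.", 4, 18))
print(excerpt(words, 1, 3, method="array"))
print(excerpt(words, 0, 2))
```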


 

12. Split/Join

Split and Join allow you to convert Strings into Arrays and vice versa. If you pass a string as input, it will be Split into an Array using the delimiter string you enter. Likewise, if an Array is passed in, it will be joined together using the specified delimiter.
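
In Python, the same input-dependent behavior might look like this sketch. The function name split_join is hypothetical.

```python
def split_join(value, delimiter):
    """Split a string into a list, or join a list into a string."""
    if isinstance(value, str):
        return value.split(delimiter)
    return delimiter.join(value)


print(split_join("red,green,blue", ","))
print(split_join(["red", "green", "blue"], ", "))
```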

 

13. Summarize

Summarize allows you to process input text using the Apple Text Summarization engine. You specify the number of sentences you wish to have in your result, and the Processor Object will reduce any amount of input to that length.

 

14. XPath


XPath is a powerful tagged text processing language. Anthracite lets you use XPath commands to process input text data, and lets you configure the use of Tidy processing and the result format.

A complete XPath tutorial is beyond the scope of this software documentation (several can be found online and in print), but here is a simple example of the syntax to give you a general idea of how it can be used:
		//a
		
will return all anchor objects (links) in an HTML document

and
		//item/description
		
will return the description elements from all "item" objects in an XML file.

Slightly more complex examples include:

		/html/body/table[1]/tr[1]/td[1]
		
which will return table one, row one, cell one from the body of an HTML document.

and
		//a[ends-with(@href,".mp3")]
		
which will return all anchor objects whose href attribute ends with ".mp3".
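
If you would like to experiment with expressions like these outside Anthracite, Python's standard xml.etree.ElementTree module supports a limited subset of XPath. The sketch below is an assumption-laden illustration: the sample documents and function names are invented, the input must be well-formed XML, and because ends-with() is an XPath 2.0 function that ElementTree does not provide, the ".mp3" test is done in Python instead.

```python
import xml.etree.ElementTree as ET

FEED = """<rss><channel>
  <item><title>One</title><description>First item</description></item>
  <item><title>Two</title><description>Second item</description></item>
</channel></rss>"""

PAGE = """<html><body>
  <a href="song.mp3">Song</a>
  <a href="notes.html">Notes</a>
</body></html>"""


def item_descriptions(xml_text):
    """Rough equivalent of //item/description."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall(".//item/description")]


def mp3_links(xml_text):
    """Rough equivalent of //a[ends-with(@href,".mp3")]."""
    root = ET.fromstring(xml_text)
    return [a.get("href") for a in root.findall(".//a")
            if a.get("href", "").endswith(".mp3")]


print(item_descriptions(FEED))
print(mp3_links(PAGE))
```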

15. Conditional



The Conditional Processor Object is a powerful integrated tool for filtering and transforming text input. This processor performs a string or regular expression comparison on the input it receives and then passes, blocks or modifies the input as configured.

Input text can be compared for exact equality with a string, for containment of a string, or for matching against a regular expression. The output can be passed as is, blocked, or replaced using the regular expression's parenthesized matches (back-references).

Instead of the entire input string, the processor can also just pass or modify the matched portion.

To filter input into two different output paths, use two Conditional Processors, and set one to be the opposite of the other (e.g., "Doesn't Contain" with the same text to compare).
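
The pass, block and replace behavior, along with the two-path filtering idea, can be sketched in Python as follows. The function name conditional and its parameters (mode, negate, replacement) are hypothetical and do not correspond to the tool's actual setting names; returning None stands in for blocked input.

```python
import re


def conditional(text, compare, mode="contains", negate=False, replacement=None):
    """Pass, block or rewrite `text` depending on a comparison.

    mode is "equals", "contains" or "regex". Blocked input returns None.
    With mode="regex" and a replacement, back-references such as \\1 are
    filled in from the parenthesized matches.
    """
    if mode == "equals":
        matched = text == compare
    elif mode == "contains":
        matched = compare in text
    else:
        matched = re.search(compare, text) is not None
    if matched == negate:
        return None
    if replacement is not None and mode == "regex" and not negate:
        return re.sub(compare, replacement, text)
    return text


lines = ["high risk loan", "safe deposit", "risk-free rate"]
# Two opposite conditionals split the input into two output paths.
with_risk = [l for l in lines if conditional(l, "risk") is not None]
without_risk = [l for l in lines if conditional(l, "risk", negate=True) is not None]
print(with_risk, without_risk)
print(conditional("2006-05-21", r"(\d{4})-(\d{2})-(\d{2})", mode="regex",
                  replacement=r"\2/\3/\1"))
```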

 

 

16. HTML Entities

Use the HTML Entities processor to convert text to or from HTML entities.
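
For comparison, Python's standard html module performs a similar two-way conversion. This sketch only illustrates the general idea; the wrapper names are hypothetical, and the exact set of characters that Anthracite converts is not assumed here.

```python
import html


def to_entities(text):
    """Convert characters such as &, < and > to HTML entities."""
    return html.escape(text)


def from_entities(text):
    """Convert HTML entities back to the characters they stand for."""
    return html.unescape(text)


print(to_entities("Fish & <chips>"))
print(from_entities("&copy; 2006 &amp; beyond"))
```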




Last Updated: 5/21/2006