FAQ: Frequently Asked Questions
Important: Have you downloaded the latest version and experienced an immediate error that the trial has expired? This is most likely caused by the two-week period having expired since your previous trial use. Please read this User Support Forum note on this issue.
General:
How do I cancel a process?
How do I automate a process?
Is this a recursive spider?
How do I use a list of URLs to spider pages?
File Type Support:
Why Doesn't Anthracite Support Files of Type XYZ?
Does Anthracite Support Searching Text In PDF Files?
What About Processing Image Data?
MySQL Specific Support:
Why does Anthracite detect my MySQL server but not the client?
Miscellaneous:
When Will The Windows Version Be Available?
Is "Web Scraping" Legal?
Where'd You Come Up With The Idea?
What Good Is It?
Doesn't My Word Processor Do This?
How Is Metafy Pronounced?
What's The Name About?
What Is The Easter Egg?
General
Q: How do I cancel a process?
A: To cancel a process, press the "Escape" key. You can also control how long your processes wait for data from a remote URL before they time out; see the "Data" preferences.
Q: How do I automate a process?
A: As of version 1.0.6, you can use AppleScript to run Anthracite Documents; an example script is included with the distribution. This means that you can schedule Anthracite to run from iCal. You can also automate Anthracite using the UNIX cron command. For more information, please read the special FAQ on "Cron Jobs".
Q: Is this a Recursive Spider?
A: No. Although Anthracite is capable of loading large quantities of data, it is not inherently designed to operate as a "recursive link traversal" spider. For example, the "Links on Page" Source Object can be used to load one webpage and then follow all of the links on that page, but it will not currently continue to links beyond that. This is one of the most requested features, and will be added in a future version. Update: As of version 1.5, Anthracite also supports exporting loaded data using the URL path of the source, so a URL such as "http://www.host.com/sample/test.html" will be saved on disk as the file "test.html" in a folder named "sample" in a folder named "www.host.com".
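To make the URL-to-folder mapping concrete, here is a minimal Python sketch of the same idea. It is not Anthracite's own code: the function name url_to_relative_path and the "index.html" fallback for URLs without a file name are our assumptions for illustration only.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def url_to_relative_path(url: str) -> PurePosixPath:
    """Map a URL to a host/folder/file path, as in the example above."""
    parts = urlsplit(url)
    # Drop empty segments and "." / ".." so the path stays inside the host folder.
    segments = [s for s in parts.path.split("/") if s not in ("", ".", "..")]
    # Assumption: a URL that names no file falls back to "index.html".
    if not segments or parts.path.endswith("/"):
        segments.append("index.html")
    return PurePosixPath(parts.netloc, *segments)


print(url_to_relative_path("http://www.host.com/sample/test.html"))
# prints: www.host.com/sample/test.html
```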
Q: How do I use a list of URLs to spider pages?
A: There are several ways. If you have the list of URLs in a text file, the easiest is probably the "URL List File" Source Object. If your list is more like one URL with a variety of arguments (or a list of files like "file_01.html", "file_02.html", etc.), then you might consider using the URL Generator. If the URLs are all links on a webpage, you can also use the "Links on Page" source, or you might even write an AppleScript. There's also an FAQ addendum to this topic on the User Forums.
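If you would rather build a numbered list of URLs yourself, the idea behind a sequence like "file_01.html", "file_02.html" can be sketched in a few lines of Python. This is only an illustration, not how the URL Generator works internally; the function name generate_urls, the {n} placeholder and the host name are assumptions.

```python
from typing import List


def generate_urls(template: str, start: int, stop: int, width: int = 2) -> List[str]:
    """Fill the {n} placeholder with zero-padded numbers from start to stop."""
    return [template.format(n=str(n).zfill(width)) for n in range(start, stop + 1)]


# Hypothetical host and path, used only for the demonstration.
for url in generate_urls("http://www.host.com/sample/file_{n}.html", 1, 3):
    print(url)
# prints:
# http://www.host.com/sample/file_01.html
# http://www.host.com/sample/file_02.html
# http://www.host.com/sample/file_03.html
```

The printed list can be redirected into a text file, which is the kind of input the answer above describes for a list of URLs.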
File Type Support
Q: Why Doesn't Anthracite Support Files of Type XYZ?
A: The most likely answer is, "We don't know about file type XYZ...yet." There are two possible solutions: 1) send a request for file type XYZ along with information on how we can parse it, or 2) use the UNIX Source Object to craft your own parser for the file type. Data from diverse sources can be piped into Anthracite using the STDOUT of any command.
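As a rough sketch of the second option, the Python script below converts a made-up file type into plain text on STDOUT, which is the kind of output a command needs to produce to be piped onward. Everything about the format is an assumption for illustration: "XYZ" here is simply a file of key=value lines with "#" comments, and the script name xyz_to_text.py is hypothetical.

```python
import sys


def xyz_to_text(raw: str) -> str:
    """Convert made-up "key=value" lines into tab-separated plain text."""
    out = []
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and "#" comments (both part of the made-up format).
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that do not contain "="
            out.append(f"{key.strip()}\t{value.strip()}")
    return "\n".join(out)


if __name__ == "__main__":
    # Usage: python3 xyz_to_text.py input.xyz
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            print(xyz_to_text(handle.read()))
```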
Q: Does Anthracite support searching text in PDF Files?
A: Yes! Beginning with Anthracite version 1.4, PDF support is built in. Simply specify a PDF URL in your Source Object and Anthracite will use Apple's PDF Kit to extract any text in the file (this applies to PDFs with actual text content, as opposed to PDFs that contain only scans and pictures). Then use the output of that Source Object to feed a processing chain as you normally would. For an older example based on pdftotext, see "Anthracite Example: Summarizing Legislation".
Q: What About Processing Image Data?
A: Sorry, Anthracite is not currently designed to process image or binary data, but that may be a future direction for the software. In the meantime, you can certainly try it using UNIX Commands for Sources and Processors, but there are no guarantees that it will work (don't forget the onerous license). Of course, this question also suggests that one has not yet found the easter egg hidden in the application (and that's the only hint).
MySQL Specific Support
Q: Why does Anthracite detect my MySQL server but not the client?
A: The MySQL Installer for Mac OS X installs both a server component ("mysqld") and a command-line client program ("mysql"). The default location for the command-line client is /usr/local/mysql/bin/mysql, and this is where Anthracite looks for the program. If you are using a third-party MySQL package (such as CocoaMySQL or OpenOSX's OpenWeb), you may need to create a symbolic link to your custom client location, or update your MySQL installation to use the standard paths. An upcoming update to Anthracite will allow users to specify a custom MySQL client location.
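The symbolic link mentioned above can be created in several ways; the following is a minimal Python sketch of one of them. The function name link_mysql_client is hypothetical, and the location of your own client depends entirely on the third-party package you installed, so it is passed in rather than guessed.

```python
import os


def link_mysql_client(custom_client: str,
                      expected_path: str = "/usr/local/mysql/bin/mysql") -> None:
    """Create a symbolic link at expected_path that points to custom_client."""
    if not os.path.isfile(custom_client):
        raise FileNotFoundError(f"no MySQL client found at {custom_client}")
    if os.path.lexists(expected_path):
        # A client or link is already in the standard place; leave it alone.
        return
    os.makedirs(os.path.dirname(expected_path), exist_ok=True)
    os.symlink(custom_client, expected_path)
```

Call the function with the full path to your package's "mysql" program. Writing under /usr/local usually requires administrator rights.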
Miscellaneous
Q: When will the Windows version be available?
A: We've certainly thought about making a product that is compatible with Windows, but no schedule is set. Anthracite makes extensive use of the capabilities of UNIX built into Mac OS X, and so a direct port will be difficult, but other ideas are a brewin'.
Q: Is "web scraping" legal?
A: You should always check the Terms of Use for any website you wish to use to determine whether it allows automated access (for example, Google does not, except via its API). Remember that Anthracite is specifically licensed for use only with materials for which you have the proper rights, so you should really ask this question of your attorney. There is also a lot of public domain material out there (e.g., government laws and publications), and your "fair use" rights may give you wide latitude in how you use certain materials. There is also a movement to help clarify license usage terms for copyrighted material, with explicit clearances granted for certain types of uses; check out the Creative Commons.
Q: Where'd you come up with the idea?
A: I was doing some work for a company that had an idea to keep track of and display real-time info on what was happening on a large number of websites that were constantly changing, and it was clear that a room full of interns (even in India) wasn't going to be a practical solution. While more traditional tools (in particular, Perl with regular expressions) were a possibility, it was not worth paying a programmer full-time to write scripts, especially since the scripts were likely to need regular adjustments to keep up. At the same time, teaching even smart college interns how to write regular expressions didn't seem like the right answer either, and that's when the light bulb went on: "You know, a general purpose tool to help extract info from websites and reformat the data for storage in a database would be really, really useful right about now..."
Q: What good is it? (aka, "So what?")
A: Data mining by web scraping is an emerging power tool for infonauts of all persuasions. O'Reilly recently introduced a book titled "Spidering Hacks" that features 100 techniques for using data from diverse internet sources. The LA Times called data mining one of the 5 technologies of the future. You can make use of it today to help your work in other areas, ranging from web design and publishing, journalism, and legal, financial and market research to system administration and even literature education and personal entertainment. Once you start working with Anthracite, we're confident you'll discover an endless number of practical uses for it, from automatically generating daily morning updates of business and legal research that would have taken hours by hand, to making a dynamic homepage for your office with updates from all your key sites, to simply reformatting a table from a webpage into a database. Track news, collect links that match particular keywords for later review, summarize webpages found by searches: there's really no end to what you can do with information in text, HTML and other formats.
Q: Doesn't my word processor do this?
A: While the most powerful word processing programs certainly have some of the same capabilities as Anthracite, and every one of them should be able to do a global find and replace, Anthracite does something different from a standard interactive text editor (one in which you type in and change the text). It enables a user to build text processing systems that also connect to the internet and databases, and it uses a visual metaphor for describing the changes you want to apply to a variety of types of documents, as opposed to direct manipulation of the contents. With Anthracite you don't have to make manual edits to documents, like picking out the phone numbers from an incoming e-mail. Instead, you describe where the phone numbers are or what they "look" like, and have Anthracite process the e-mail for you and then, for example, put the phone numbers into a database. Anthracite doesn't require that you have a word processor, but it also makes a great companion to one.
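To give a feel for what describing what phone numbers "look" like means, here is a small Python sketch that uses a regular expression. It is an illustration of the idea only, not Anthracite's visual interface. The pattern assumes North American style numbers such as 555-123-4567 or (555) 123-4567, and the names PHONE_PATTERN and find_phone_numbers are ours.

```python
import re

# Assumption: numbers look like 555-123-4567, 555.123.4567 or (555) 123-4567.
PHONE_PATTERN = re.compile(r"(?:\(\d{3}\)\s*|\b\d{3}[-.\s])\d{3}[-.]\d{4}\b")


def find_phone_numbers(text: str) -> list:
    """Return every substring of text that matches the phone number pattern."""
    return PHONE_PATTERN.findall(text)


message = "Call me at (555) 123-4567 or 555-987-6543 before noon."
print(find_phone_numbers(message))
# prints: ['(555) 123-4567', '555-987-6543']
```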
Q: How is Metafy pronounced?
A: "Meta-Fi", with a long "I" sound, like the "Hi-Fi" in stereo.
Q: What's the Name About?
A: Anthracite is the top grade of coal that is mined.
Q: What is the Easter Egg?
A: If we told you, what would the fun be? Besides, there's at least one clue on the website.
last update: 6/11/06