More experienced programmers will understand better why things work, but any Perl programmer will set this book down feeling empowered to turn the web into their own valet. No longer do you need to check multiple sites looking for interesting information. Instead, you can readily author code to do that for you and alert you when items of interest are found.
You can use these tools to free up personal time, to harvest information to inform business decisions, to automate tedious web application testing, and a zillion other things. The author's clear exploration of the relevant Perl modules leaves the reader with a good depth of understanding of what these modules do, when you might want to use which module, and how to use them for real-world tasks. Before reading the book, I knew of these modules, but they were a rather intimidating pile. I'd used a few of them on occasion for rather limited projects, but was reluctant to invest the time required to read all of the documentation from the whole collection.
Mountains of method-level documentation do not a tutorial make. If you know Perl and you're sick of 'working the web' to get information and you want the web to work for you instead, then you need this book. I had a personal project that was on the back burner for a couple of years because it just sounded too hard. The weekend after I finished this book, I wrote what I had previously thought to be the hard part of that project and it was both easy and fun. This book makes hard things not just possible, but actually easy.
Currently, three things have to happen for a user to upload a file via a form: first, the form has to be submitted with the POST method; second, the form element's enctype attribute has to be set to "multipart/form-data"; and third, the form has to contain an input element whose type is "file". Suppose, for example, that you were automating interaction with an HTML form that contained such a file-upload field. Submitting the "normal" fields is routine, but the file-upload part involves some doing. For the "normal" fields (the first and third fields), the header basically expresses that this is ordinary data for a particular field name, and the body expresses the form data.
Take a look at the header again. Note that LWP also tells the remote host the basename of the file we're uploading, along with a guessed Content-Type for it (the guess comes from the LWP::MediaTypes module, which falls back to a generic type for extensions it has never heard of). In case you want to change the name that LWP presents to the remote server, you can provide that name as a second item in the arrayref. Note, however, that when LWP constructs and sends the request, it currently has to read into memory all the files you're sending in this request.
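Putting the arrayref syntax together, a form-data POST might be sketched like this. The URL and field names are hypothetical; the arrayref's second item is the name shown to the remote server, and the undef plus Content option is the in-memory trick described below:

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Hypothetical URL and field names, for illustration only.
my $req = POST 'http://example.int/cgi-bin/upload.cgi',
  Content_Type => 'form-data',
  Content      => [
    author => 'jsmith',
    # An arrayref instead of a plain value marks this field as a file
    # upload.  The first item would normally be a local filename; the
    # second overrides the name shown to the remote server.  With undef
    # as the first item and a Content option, the "file" comes from a
    # string in memory instead of from disk.
    notes  => [ undef, 'notes.txt',
                'Content-Type' => 'text/plain',
                Content        => "Hello from a string in memory\n" ],
  ];

print $req->header('Content-Type'), "\n";   # multipart/form-data; boundary=...
```

Printing `$req->content` shows the full multipart body, with one part per field.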
If you're sending a multi-megabyte MP3 file, this might be a problem! One especially neat trick is that you don't even need to have a file to upload to send a "file upload" request: to send content from a string in memory instead of from a file on disk, pass undef in place of the filename, plus the string via a Content option. Limits on Forms. The examples in this chapter use approaches to form-data submission that work well for almost all form systems that you'd run into, namely, systems where the form data is meant to be keyed into HTML forms that do not change.
In most cases, the extent of change is merely a hidden form variable containing a session ID. These you can code around by using LWP to download the form, extracting the session ID or other hidden fields, and submitting those along with your other values. In the few remaining cases where the form in question is predictable enough for a program to manipulate it, but unpredictable enough that your program needs to carefully scrutinize its contents each time before choosing what form data to submit, you may be able to put to good use either of the two CPAN modules that provide an abstracted interface to forms and the fields in them. One of these parses HTML source that you give it and builds an object for each form, each form containing an object for each input element in the form.
The other module is quite similar, except in the form of input it takes. In practice, however, those modules are needed in very few cases, and the simpler strategies in this chapter will be enough for submitting just about any form on the Web and processing the result. This chapter's examples include extracting links from a bookmark file and extracting temperatures from Weather Underground. The preceding chapters have been about getting things from the Web. But once you get a file, you have to process it.
However, most of the interesting processable information on the Web is in HTML, so much of the rest of this book will focus on getting information out of HTML specifically. In this chapter, we will use a rudimentary approach to processing HTML source: regular expressions. This technique is powerful, and most web sites can be mined in this fashion. We present the techniques of using regular expressions to extract data and show you how to debug those regular expressions.
Perl & LWP
Automating Data Extraction. Suppose we want to extract information from an Amazon book page. The first problem is getting the HTML. Browsing Amazon shows that the URL for a book page can be built from the book's ISBN, so fetching the Perl Cookbook's page, for example, is just a matter of requesting the right URL. The final program appears in Example 6-1. It would be trickier, but more useful, to have the program accept book titles instead of just ISBNs. A more elaborate version of this basic program is one of O'Reilly's actual market research tools.
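A sketch of that first step follows. The ASIN-style URL layout used here is an assumption standing in for whatever Amazon actually uses, and the fetch itself would use LWP::Simple:

```perl
use strict;
use warnings;

# Build a book-page URL from an ISBN.  This URL pattern is a stand-in
# for the real site's layout, which may well have changed.
sub book_url {
  my $isbn = shift;
  $isbn =~ tr/-//d;        # drop hyphens: 1-56592-243-3 -> 1565922433
  return "http://www.amazon.com/exec/obidos/ASIN/$isbn";
}

my $url = book_url('1-56592-243-3');   # the Perl Cookbook's ISBN
print "$url\n";

# Fetching the page is then one call (requires LWP::Simple):
#   use LWP::Simple;
#   my $html = get($url) or die "Couldn't fetch $url";
```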
Regular Expression Techniques Web pages are designed to be easy for humans to read, not for programs. Humans are very flexible in what they can read, and they can easily adapt to a new look and feel of the web page. But if the underlying HTML changes, a program written to extract information from the page will no longer work.
Your challenge when writing a data-extraction program is to get a feel for the amount of natural variation between pages you'll want to download. The following are a set of techniques for you to use when creating regular expressions to extract data from web pages. If you're an experienced Perl programmer, you probably know most or all of them and can skip ahead to Section 6.
Anchor Your Match An important decision is how much surrounding text you put into your regular expression. Put in too much of this context and you run the risk of being too specific — the natural variation from page to page causes your program to fail to extract some information it should have been able to get. Similarly, put in too little context and you run the risk of your regular expression erroneously matching elsewhere on the page.
Whitespace. Many HTML pages have whitespace added to make the source easier to read, or as a side effect of how they were produced. For example, there may be spaces around the number you're trying to extract. You could check exactly where the whitespace falls, or you could simply be flexible in what you accept; you can even construct a character class to represent "any whitespace but newlines." Minimal and Greedy Matches. If you want to extract everything between two tags, there are two approaches: a minimal match, which stops at the first candidate for the closing tag, and a greedy match, which runs on to the last. Capture. To extract information from a regular expression match, surround part of the regular expression in parentheses.
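These three techniques (a "whitespace but not newline" character class, minimal versus greedy matching, and capturing parentheses) combine in a few lines. The sample line here is invented for illustration:

```perl
use strict;
use warnings;

# An invented sample line, with decorative whitespace around the number:
my $line = "<td> <b>  4,070  </b> in <b>Books</b> </td>\n";

# "Any whitespace but newlines": subtract \n from \s via a negated class.
my $ws = qr/[^\S\n]/;

# The minimal .*? stops at the first </b>; a greedy .* runs on to the
# last one.  The parentheses capture what matched in between.
my ($rank)   = $line =~ m{<b>$ws*(.*?)$ws*</b>};
my ($greedy) = $line =~ m{<b>$ws*(.*)$ws*</b>};

print "minimal: $rank\n";     # minimal: 4,070
print "greedy:  $greedy\n";   # greedy:  4,070  </b> in <b>Books
```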
With the /g flag, a match in scalar context continues from where the last match left off; use this to extract information one match at a time. A /g match in list context returns all matches at once. Develop from Components. There are many reasons to break regular expressions into components: it makes them easier to develop, debug, and maintain. Use Multiple Steps. A common conceit among programmers is to try to do everything with one regular expression. Don't be afraid to use two or more; this has the same advantages as building your regular expression from components. For example, the front page of the O'Reilly Network carries article summaries such as: "Although the source code for the Shared Source CLI wasn't yet available, the BOF offered a preview of what's to come, as well as details about its implementation and the motivation behind it."
This suggests a main loop that matches each article block in turn. Most dynamic web sites are generated from templates, and the comments in the HTML help the people who maintain the templates keep track of the various sections. Extracting the URL, title, and summary is straightforward. It's even a simple matter to strip the tags from the summary and use the standard Text::Wrap module to reformat it so it's easy to read. Troubleshooting. Both when developing and maintaining data extraction programs, things can go wrong. Suddenly, instead of an article summary, you see a huge mass of HTML, or you don't get any output at all.
Several things might cause this. There are two basic types of problems: false positives and false negatives. A false positive is when your regular expression identifies something it thinks is the information you're after, but it isn't really. For example, if the O'Reilly Network used the item template and summary format for things that aren't articles, the summary extraction program would report headlines that aren't really headlines.
There are two ways to deal with false positives. You can tighten your regular expression to prevent the uninteresting piece of HTML from matching.
The other way to prevent a false positive is to inspect the results of the match to ensure they're relevant to your search. A false negative is when your program fails to find information that it is looking for. There are also two ways to fix this. The first is to relax your regular expression. The second is to make another pass through the document with a separate regular expression or processing technique, to catch the data you missed the first time around.
For example, extract into an array all the things that look like news headlines, then remove the first element from the array if you know it's always going to be an advertisement instead of an actual headline. Often the hardest part of debugging a regular expression is locating which part isn't matching or is matching too much.
There are some simple steps you can take to identify where your regular expression is going wrong. First, print the text you're matching against. Print it immediately before the match, so you are totally certain what the regular expression is being applied to. You'd be surprised at the number of subtle ways the page your program fetches can differ from the page for which you designed the regular expression. Second, put capturing parentheses around every chunk of the regular expression to see what's matching. This lets you find runaway matches, i.e., places where the regular expression matches far more than you intended, typically because the thing being quantified was too general.
If the regular expression you've created isn't matching at all, repeatedly take the last chunk off the regular expression until it does match. The last bit you removed was causing the match to fail, so inspect it to see why. When a match runs away instead, ask of each quantified part: how much could it match?
When Regular Expressions Aren't Enough. In particular, nested structures (for example, lists containing lists, with any amount of nesting possible) and comments are tricky. While you can use regular expressions to extract the components of the HTML and then attempt to keep track of whether you're in a comment or to which nested array you're adding elements, these types of programs rapidly balloon in complexity and become maintenance nightmares. In such cases, move to a proper parser, such as HTML::TokeParser or HTML::TreeBuilder (demonstrated in the following chapters), and forego your regular expressions.
I'm told that this isn't the same format as is used in newer Netscapes. But, antiquarian that I am, I still use Netscape 4, whose bookmark file announces of itself: "It will be read and overwritten." There are three important things we should note here, among them that the URLs are already absolute, so we don't have to bother with making URLs absolute (not yet, at least), and that each bookmark sits on a line of its own. This practically begs us to use a Perl regexp! The example shows such a program. A less convenient document might use different schemes or different hosts, or spread its links across lines; suppose, in fact, that a representative section of some other document looks just that messy. None of the bookmark file's conveniences would hold there!
Regexps are still usable, though; it's just a matter of applying them to a whole document instead of to individual lines, and also making the regexp a bit more permissive. The example shows this basic idea fleshed out to include support for fetching a remote document, matching each link in it, making each absolute, and calling a checker routine (currently a placeholder) on each one.
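A sketch of that idea: one permissive, case-insensitive regexp applied to the whole document, with the URI module doing the absolutizing. The base URL and document here are made up:

```perl
use strict;
use warnings;
use URI;

my $base = 'http://www.example.int/bookmarks/';   # hypothetical base URL
my $doc  = q{
  <A HREF="one.html">Page One</A>
  <a href = "../two.html">Page Two</a>
};

# Case-insensitive, tolerant of whitespace around the = sign, and
# applied with /g to the whole document rather than line by line:
my @abs;
while ( $doc =~ m{<a\s[^>]*href\s*=\s*"([^"]+)"}ig ) {
  push @abs, URI->new_abs($1, $base)->as_string;
}
print "$_\n" for @abs;
```

Each match is handed to `URI->new_abs`, so relative links like `../two.html` come out as full URLs ready for a checker routine.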
Now that we're satisfied that our program is matching and absolutizing links correctly, we can drop in the check_url routine from the earlier example. Next, let's write a program to tell us which of the two O'Reilly offices, Cambridge and Sebastopol, is warmer and by how many degrees. First, we fetch the pages with the temperatures.
A quick look around the Weather Underground site indicates that the best way to get the temperature for a place is to fetch a URL that names that place. The program begins by fetching those pages. Viewing the source to one of the pages reveals the relevant portion, and the complete program is shown in the example. Working this way, you're forced to worry about spaces and newlines, single and double quotes, HTML comments, and a lot more. The next step up from a regular expression is an HTML tokenizer. In this chapter, we'll use HTML::TokeParser, which parses HTML source into a series of tokens. A program that extracts information by working with a stream of tokens doesn't have to worry about the idiosyncrasies of entity encoding, whitespace, quotes, and trying to work out where a tag ends.
An HTML::TokeParser object gives you one token at a time, much as a filehandle gives you one line at a time from a file. The HTML can be tokenized from a file or string. The tokenizer decodes entities in attributes, but not entities in text. You create a token stream object with the constructor, then loop, processing every token in the document. For start-tags, the lowercase attribute names are the keys of the attribute hash. The first three values of a token are the most interesting ones, for most purposes. Most programs that process HTML simply ignore comments.
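A minimal sketch of such a loop, run over a scrap of inline HTML:

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = q{<h1>Hi <i>there</i></h1><!-- note -->};
my $stream = HTML::TokeParser->new(\$html) or die "Couldn't build stream";

# Each token is an arrayref whose first element says what kind it is:
# "S" start-tag, "E" end-tag, "T" text, "C" comment, "D" declaration.
while ( my $token = $stream->get_token ) {
  if    ( $token->[0] eq 'S' ) { print "start: $token->[1]\n" }
  elsif ( $token->[0] eq 'E' ) { print "end:   $token->[1]\n" }
  elsif ( $token->[0] eq 'T' ) { print "text:  $token->[1]\n" }
  # comments and declarations fall through, ignored
}
```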
Declarations rarely occur in HTML, and when they do, they are rarely of any interest; almost all programs that process HTML ignore them. Most programs extracting information from HTML ignore processing instructions, too. Now let's use HTML::TokeParser to write useful programs. Many problems are quite simple and require only one token at a time.
Programs to solve these problems consist of a loop over all the tokens, with an if statement in the body of the loop identifying the interesting parts of the HTML. Checking Image Tags. Example 7-1 complains about any img tags in a document that are missing alt, height, or width attributes. You could use the Image::Size module from CPAN to check or insert the height and width attributes. You can also use HTML::TokeParser as a simple code filter: for instance, one that passes through every tag that it sees (by just printing its source as HTML::TokeParser passes it in), except for img start-tags, which get replaced with the content of their alt attributes. Token Sequences. Some problems cannot be solved with a single-token approach.
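Before moving on, the single-token image check described above might be sketched like this, run here over two inline img tags:

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = q{<img src="logo.png" alt="Logo"><img src="spacer.png" width="1">};
my $stream = HTML::TokeParser->new(\$html);
my @complaints;

while ( my $token = $stream->get_token ) {
  # Only img start-tags interest us; skip everything else.
  next unless $token->[0] eq 'S' and $token->[1] eq 'img';
  my $attr = $token->[2];          # hash of lowercase attribute names
  for my $needed (qw(alt width height)) {
    push @complaints, "img ($attr->{src}) is missing $needed"
      unless defined $attr->{$needed};
  }
}
print "$_\n" for @complaints;
```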
Often you need to scan for a sequence of tokens, as we saw in Chapter 4. To solve this, we need to check the next few tokens while being able to put them back if they're not what we expect. To put tokens back into the stream, use the unget_token method. For example, to solve our Amazon problem, we read ahead: if the text that follows mentions the sales rank, then the number nearby is the sales rank. If any of the tests fail, put the tokens back on the stream and go back to processing.
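A toy sketch of that read-ahead and unget_token pattern, using an invented snippet in the style of the Amazon page (the real page's markup will differ):

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = q{<b>Sales Rank: </b> 4,070 <b>Pages: </b> 646};
my $stream = HTML::TokeParser->new(\$html);
my $rank;

while ( my $token = $stream->get_token ) {
  next unless $token->[0] eq 'S' and $token->[1] eq 'b';
  my $text = $stream->get_token;          # peek at what follows the <b>
  if ( $text and $text->[0] eq 'T' and $text->[1] =~ /Sales Rank/ ) {
    $stream->get_token;                   # consume the </b> end-tag
    my $num = $stream->get_token;         # the text token after it
    $rank = $num->[1] if $num and $num->[0] eq 'T';
  }
  else {
    # Not the pattern we wanted: put the token back and keep looping.
    $stream->unget_token($text) if $text;
  }
}
print "rank: $rank\n" if defined $rank;
```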
BBC Headlines. Suppose, for example, that your morning ritual is to have the help come and wake you at about 11 a.m. On one tray there's a croissant, some pain au chocolat, and of course some café au lait, and on the other tray, your laptop with a browser window already open on each story from the front page of BBC News. However, the help have been getting mixed up lately and opening the stories on The Guardian's web site, and that's a bit awkward, since clearly The Guardian is an after-lunch paper.
You'd say something about it, but one doesn't want to make a scene, so you just decide to write a program that the help can run on the laptop to find all the BBC story URLs. There are lots of headlines in code such as this, and studying it, you realize how to find the story URLs. The pattern is typical: when we see a B-h3 token, we get another token and see if it's an A-href token.
It's not (it's the text token "Top stories"), so we put it back into the stream (useful in case some other pattern we're looking for involves that being the first token), and we keep looping. Later, we see another B-h3, we get another token, and we inspect it to see if it's an A-href token. This time it is, so we process its href value and resume looping. There's no reason for us to put that A-href back, so the next iteration of the loop will resume with the next token, the text "Bank of England mulls rate cut". If, at any point, we see an unexpected token or hit the end of the stream, we restore what we've pulled off (held in a temporary array), and continue to try other rules.
But if all the expectations in this rule are met, we make it to the part that processes this bunch of tokens (here it's just a single line, which prints the URL), and then call next Token to start another iteration of this loop, without restoring the tokens that have matched this pattern. Each such rule, then, can pull from the stream however many tokens it needs to either match or reject the pattern it's after.
Either it matches and starts another iteration of this loop, or it restores the stream to exactly the way it was before this rule started pulling from it. However, the if block for the next pattern (which requires looking two tokens ahead) shows how the same framework can be accommodating: add it right after the first if block ends.
Bundling into a Program. With all that wrapped up in a pure function, scan_bbc_stream, we can test it by first saving the contents of the BBC front page locally. As I was writing it and testing bits of it, I could run and re-run the program, scanning the same local file. Response objects just happen to offer a method that returns a reference to the content. To actually complete the task of getting the printed URLs to each open a new browser instance, well, this depends on your browser and OS, but for my MS Windows laptop and Netscape, a short Perl program will do it. HTML::TokeParser Methods. Example 7-1 illustrates that often you aren't interested in every kind of token in a stream, but care only about tokens of a certain kind.
We will explain these methods in detail in the following sections. The first is get_text, which returns the text at the current point in the stream; if the stream doesn't begin with text there, it returns an empty string. If you specify a tag, you get all the text up to the next time that tag occurs, or until the end of the file if that tag never occurs. For however many text tokens are found, their text values are taken, entity sequences are resolved, and they are combined and returned. All the other sorts of tokens seen along the way are just ignored. This sounds complex, but it works out well in real use.
For example, imagine you've got a snippet of markup in hand. Note that get_text never introduces whitespace where it's not there in the original, so if the text runs straight into a tag, none appears in the output. Some tags receive special treatment: by default, an img or applet tag is rendered as the value of its alt attribute. For further information on altering and expanding this feature, see perldoc HTML::TokeParser, in the documentation for the get_text method. If you just want to turn off such special treatment for all tags, empty out the object's textify hash. In no other case does an object require us to access its internals directly like this, because it has no method for more normal access.
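That internals-poking move, emptying the textify hash, looks like this in practice (the snippet parsed here is invented):

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<a href="x.ram"><img src="play.gif" alt="LISTEN">Monday show</a>';

# Default behavior: get_text renders the img tag as its alt value.
my $p1 = HTML::TokeParser->new(\$html);
$p1->get_tag('a');
my $with_alt = $p1->get_text('/a');
print "$with_alt\n";          # LISTENMonday show

# Emptying the textify hash turns that special treatment off:
my $p2 = HTML::TokeParser->new(\$html);
$p2->{'textify'} = {};
$p2->get_tag('a');
my $without = $p2->get_text('/a');
print "$without\n";           # Monday show
```

Note also that no space appears between "LISTEN" and "Monday", because get_text never introduces whitespace that isn't in the original.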
For more information on this particular syntax, see perldoc perlref's documentation on hash references. Returning to our news example: the caveat that get_text does not introduce any new whitespace applies also to get_trimmed_text. For get_tag, the first two values are the most interesting ones, for most purposes. Note that the tag name(s) that you provide as parameters must be in lowercase. If get_tag reads to the end of the stream and finds no matching tag tokens, it will return undef. For example, a get_tag call can look for img start-tags. In our news example, its task was to find links to stories, in either of the two kinds of patterns; but it ignores the actual link text, which starts with the next token in the stream.
If we want that text, we could get the next token by calling get_text, yielding "Bank of England mulls rate cut" and its URL. For some applications, this makes no difference, but for neatness' sake, let's keep headlines to one line each, using the standard Text::Wrap module to wrap them at 72 columns. There's a trickier problem that occurs often with get_text or get_trimmed_text: what if the HTML we're parsing splits a headline across several text tokens?
But we don't want only the first text token in the headline, we want the whole headline. Also, if you're taking the output of get_text or get_trimmed_text and sending it to a system that understands only US-ASCII, a module such as Text::Unidecode might be called for, to turn an ö into an o. That isn't an HTML::TokeParser matter at all, but it is the sort of problem that commonly arises when extracting content from HTML and putting it into other formats.
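Pulling a headline's URL and text with get_tag and get_trimmed_text, as discussed above, might look like this; the markup is invented, but in the shape the chapter describes:

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = qq{<h3><a href="/story1.html">Bank of England\n  mulls rate cut</a></h3>};
my $stream = HTML::TokeParser->new(\$html);

# get_tag skips ahead to the next <a> start-tag (tag names must be given
# in lowercase); get_trimmed_text then collects the text up to </a>,
# with runs of whitespace collapsed to single spaces.
my ($url, $text);
if ( my $tag = $stream->get_tag('a') ) {
  $url  = $tag->[1]{'href'};
  $text = $stream->get_trimmed_text('/a');
}
print "$text\n  $url\n";
```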
But in real life, you do not proceed tidily from the problem to an immediate and fully formed solution. Ideally, the task of data extraction is simple. In practice, however, you write programs bit by bit and in fits and starts, and data extraction specifically involves a good amount of trying one pattern, finding that its matching is too narrow or too broad, trying to amend it, possibly having to backtrack and try another pattern, and so on. Moreover, even equally effective patterns are not equal; some patterns are easier to capture in code than others, and some patterns are more temporary than others.
In this section, I'll try to make these points by walking through the implementation of a data extraction task, with all alternatives considered, and even a misstep or two. Fresh Air is on NPR stations each weekday, and on every show, different guests are interviewed. The show's web site lists which guests appear on the show each day and has links to the RealAudio files for each segment of each show.
If your particular weekday schedule doesn't have you listening to Fresh Air every night or afternoon, you would find it useful to have a program tell you who had been on in the past month, so you could make a point of listening to the RealAudio files for the guests you find interesting.
Such a data-extraction program could be scheduled with crontab to run on the first or second day of every month, to harvest the past month's program data. Getting the Data. The first step is to figure out what web pages we need to request to get the data in any form. With the BBC extractor, it was just a matter of requesting the single front-page URL; here there is no one page that lists everything. Instead, you can view the program description for each show, one day at a time.
Moreover, the URL for each such page contains the show's date (the page for July 2, for instance, has that date right in the URL). Harvesting all the data is a simple matter of iterating over all the days of the month (or whatever period you want to cover), skipping weekends (because the program listings are only for weekdays), and substituting the proper date numbers into that URL.
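That date loop might be sketched like so. The URL pattern here is a stand-in (the real site's layout is different), but the weekday-skipping logic carries over:

```perl
use strict;
use warnings;
use POSIX qw(strftime mktime);

# Build one listing URL per weekday of a given month.
sub weekday_urls {
  my ($year, $month) = @_;              # e.g. (2001, 7)
  my @urls;
  for my $day (1 .. 31) {
    my $time = mktime(0, 0, 12, $day, $month - 1, $year - 1900);
    next unless defined $time;
    # mktime silently rolls over bad dates (e.g. June 31 -> July 1):
    next unless (localtime $time)[4] == $month - 1;
    my $wday = (localtime $time)[6];
    next if $wday == 0 or $wday == 6;   # skip Sunday and Saturday
    push @urls, strftime(
      "http://www.example.int/shows/%Y/%m/%d.html", localtime $time);
  }
  return @urls;
}

my @july = weekday_urls(2001, 7);
print scalar(@july), " weekday listings\n";   # 22 weekday listings
print "$july[0]\n";
```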
Once each page is harvested, the data can be extracted from it.
Already the outlines of the program's design are becoming clear: scanning the content isn't a distinct enough task that it has to be part of the same block of code as the code that actually harvests the URL. Instead, it can simply be a routine that is given a new stream from which it is expected to extract data. Moreover, that is the hard part of the program, so we might as well do it first (the stuff with date handling and URL interpolation is much less worrisome, and can be put off until last).
So, to figure out the format of the data we want to harvest, consider a typical program listing page in its rendered form in a browser. We establish that this page (shown in the figure) is "typical" by flipping through the listings and finding that they all pretty much look like that.
That stands to reason, as the URL tells us that they're being served dynamically, and all through the same server-side program. So we have good reason to hope that whatever code we work up that extracts successfully from one typical page will work for all of them. The only remarkable difference is in the number of segments per show. Also, the descriptions can run to several paragraphs, or sometimes much shorter.
His fiction and non-fiction work has appeared in The New Yorker and the New York Times Magazine. He's also the author of two other novels and a selection of short stories. She's cast films, as well as acted in films and on television. Below that, the page's navigation offers "Archived Shows: Select a show by date" and "Guests: Find a show by guest or commentator," plus the station's Philadelphia, PA contact information; the figure shows the Fresh Air web page. We don't want the "Listen to" part, since it'd be pointlessly repetitive to have a whole month's worth of listings where every line starts with "Listen to".
We can completely ignore all that code and just try to figure out how to extract the "Listen to" links. Sifting through the HTML source, we see that those links are represented with code in which most lines begin with at least two spaces. First Code. Because we want links, let's start by getting all the links. Doing so also nets us things we don't want, such as the "Fresh Air Online" index link and the mailto: link. Narrowing In. Now, we could try excluding every kind of thing we know we don't want: we could exclude the mailto: link, for instance. However, tomorrow the people at Fresh Air might add something new to their general template. It is a valid approach to come up with criteria for the kinds of things we don't want to see, but it's usually easier to come up with criteria to capture what we do want to see.
So this is what we'll do. We could characterize the links we're after in several ways: the link text starts with "Listen to", and the URL they point to has a particular look; notably, the URL's scheme is http, it's on the site's own www server, and the path points to an audio file. Now, of these, the first criterion is most reminiscent of the sort of things we did earlier with the BBC news extractor. But in either case, you'll have skipped over all the tokens between the current point in the stream and the next tag you find, and once you've skipped them, you can't get them back. This is feasible, but it's sounding like the hardest of the criteria to formalize, at least under HTML::TokeParser. But testing whether a tag sequence contains another is easy with HTML::TreeBuilder, as we see in later chapters.
So we'll try to make do without this one criterion and consider it a last resort. It's also a problem with get_text and get_trimmed_text, unless you did something particularly perverse, such as reading a huge chunk of the stream with get_token and then stuffing it back in with unget_token while still keeping a copy around.
If you're even contemplating something like that, it's a definite sign that your program is outgrowing what you can do with HTML::TokeParser, and you should either write a new searcher method that's like get_text but that can restore tokens to the buffer, or (more likely) move on to a parsing model based on HTML::TreeBuilder. The next criteria (numbers 3 and 4 in the list above) are easy to formalize. These involve characteristics of the URL, so we simply add a line for each to our while loop. Currently, we can say "it's interesting only if the URL ends in .ram".
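Spelled out as code, those per-criterion lines might look like this. The host and the .ram rule are assumptions about the current site layout, kept one per line so each is easy to loosen later:

```perl
use strict;
use warnings;

# Return true only for links that pass every criterion.  The host and
# path rules are assumptions about the site's current layout.
sub interesting_url {
  my $url = shift;
  return 0 unless $url =~ m{^http://}i;                     # scheme
  return 0 unless $url =~ m{^http://www\.example\.int/}i;   # host (assumed)
  return 0 unless $url =~ m{\.ram$}i;                       # RealAudio path
  return 1;
}

print interesting_url('http://www.example.int/ramfiles/20010702.ram'), "\n"; # 1
print interesting_url('mailto:someone@example.int') + 0, "\n";               # 0
```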
But what if, tomorrow, some code like the following is added to the normal template? On the other hand, if we do check those additional facts about the URL, and tomorrow all the audio files change in some way that our checks reject, then we'll be annoyed that we made our link extractor check those additional things about the URL.
It could even be something served across a protocol other than HTTP! In other words, no part of the URL is reliably stable. On the one hand, National Public Radio is not normally characterized by lavish budgets for web design (and redesign, and re-redesign), so you can expect some measure of stability.
But on the other hand, you never know! Rewrite for Features. My core approach in these cases is to pick some set of assumptions and stick with it, but also to assume that the assumptions will fail. So I write the code so that when it does fail, the point of failure will be easy to isolate. I do this with debug levels, also called trace levels. Consider this expanded version of our code: because the DEBUG constant is declared with value 0, all the tests of whether DEBUG is nonzero are obviously always false, and so all these lines are never run; in fact, the Perl compiler removes them from the parse tree of this program, so they're discarded the moment they're parsed.
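The pattern looks like this (a minimal sketch, not the chapter's full program):

```perl
use strict;
use warnings;
use constant DEBUG => 0;

# Because DEBUG is a compile-time constant equal to 0, Perl's compiler
# sees "0 and ..." and discards each of these statements entirely, so
# they cost nothing when tracing is off.
DEBUG and print "About to parse stream\n";
DEBUG > 1 and print "Extra-verbose tracing here\n";

print "Debug level: ", DEBUG, "\n";   # Debug level: 0
```

Raising the constant to 1 or 2 and re-running the program turns the trace lines back on, with no other changes to the code.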
However, using all caps for the constant's name is just a matter of convention. When we deploy the above program with some code that harvests the pages instead of working from the local test page, the DEBUG lines will continue to do nothing. But suppose that, months later, the program just stops working. That is, it runs, but prints nothing, and we don't know why. Has some part of the format changed? Turning DEBUG up, we see trace lines such as "About to parse stream with base http://...", and we can see what's happening on those "ramfiles" links: the program is not rejecting their host, because they are still on the www server it expects.
But it rejects their paths. Why don't our ramfiles paths match that regexp anymore? Ah ha: because they no longer end in the extension the regexp demands. This is evident at the end of the lines beginning "Path is no good." Similarly, if the audio files moved to a different server, we'd be alerted to their host being "no good" now, and we could adjust the regexp that checks that. We had to make some fragile assumptions to tell interesting links apart from uninteresting ones, but having all these DEBUG statements means that when the assumptions no longer hold, we can quickly isolate the problem.
Images and Applets. Speaking of assumptions: where does the stray "Listen to Current Show" text in our output come from? Try the same search on the source, and you'll see it is the alt text of an image, which get_text renders as if it were ordinary text. That might be a useful feature normally, but it's bothersome now. So we turn it off by adding a line just before our while loop starts reading from the stream; see the HTML::TokeParser manpage, where you can also read about how to do things with the textify feature other than just turn it off. With that change made, the link's text becomes an empty string. That is a false value in Perl, so it causes a fallthrough, and the output begins with "Listen to Monday - July 2" and its URL. Link Text. Now that everything else is working, remember that we didn't want all this "Listen to" stuff starting every single link.
Moreover, remember that the presence of a "Listen to" at the start of the link text was one of our prospective criteria for whether it's an interesting link. We didn't implement that, but we can implement it now. And incidentally, you might notice that with all these little changes we've made, our program now works perfectly! Live Data. All the program needs to actually pull data from the Fresh Air web site is to comment out the code that reads the local test file and substitute some simple code to get the data for a block of days.
Here is the whole program source, with those changes and additions; feel free to skip past it. Incidentally, other approaches were possible. For example, instead of using the various tricks to keep the first image-ALT link from printing, we could simply have kept a count of the good links seen so far in the current stream and ignored the first one. Our actual solution is more proper in this case, but sometimes counting items is the best or only way to get a problem solved.
More importantly, we could have done without all the code that tests the link URL and used one regexp to implement our last criterion, i.e., that the link text begins with "Listen to". But, as with our earlier consideration of how much of the URL to check, it comes down to a question of tradeoffs. The answer depends on how concise you want the code to be, how much time you want to spend thinking up assumptions, and, most importantly, what happens if it breaks. I've crontabbed this program to harvest Fresh Air listings every month and mail me the results, so if it breaks, I'll get some sort of anomalous output mailed to me (whether with too few links or too many), and it's no big deal, because, working or not, it's just so I can listen to interesting radio programs.
But your data extraction program may instead serve many people who will be greatly inconvenienced if it stops working properly. You have to decide on a case-by-case basis whether your program should be more likely to clam up and miss interesting data in new formats, or pass through new kinds of data despite the risk that they might be irrelevant or just plain wrong.
In particular, the token model obscures the hierarchical nature of markup. Nested structures, such as lists within lists or tables within tables, are difficult to process as just tokens. Such structures are best represented as trees, and the HTML::Element class does just this. This chapter teaches you how to use the HTML::TreeBuilder module to construct trees from HTML, and how to process those trees to extract information.
In the language of trees, each part of the tree (such as the html element, an li element, or a piece of text like "ice cream") is a node. There are two kinds of nodes in an HTML tree: element nodes and text nodes.
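As a first taste of the tree model (a sketch; the next chapter develops it properly), HTML::TreeBuilder parses nested markup into element nodes, and look_down finds them at any depth, which is exactly what flat token streams make hard:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $html = '<ul><li>one</li><li>two <ul><li>deep</li></ul></li></ul>';
my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# look_down searches the whole tree, however deeply nested:
my @items = $tree->look_down(_tag => 'li');
my $count = scalar @items;
print "$count li elements\n";   # 3 li elements

$tree->delete;   # free the tree's memory when done
```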