Grok the Web

A Programmer's Guide to the New Software Development Paradigm

by Andrew Schulman


Chapter 4

Snarfing Forms

Last revised: April 16, 1997


Starting from the importance of "View Document Source" for learning how to control and program the web, this chapter shows how to go behind the HTML forms one sees every day in order to create your own new forms. This requires a small but important conceptual leap: form and CGI aren't tied. There can be a form on one machine (yours) interfacing to a CGI on another. The author of an HTML form need not be the same as (or even know, or even have the permission of) the author of the underlying CGI. For this to work, you must uncover the "implicit API" provided by the CGI program. Because it's only implicit, this API can change at any time, so the chapter discusses brittleness ("bit-rot") problems too. A key point is the importance of <INPUT TYPE=hidden>.

(Examples: http://free.submit-it.com/cgi-bin/submit.pl; also anthology of forms at Quarterdeck site.)

Interesting thing on forms (e.g., "mailto" in ACTION) at http://hakatai.mcli.dist.maricopa.edu/director/tips/shocktip/cgi.html

If you've used the web for any length of time, you've saved away bookmarks to sites you visit often. Or, if you use Windows, maybe you have some "Internet shortcuts" on your desktop. If there are just a few sites you use all the time, this is remarkable: you don't have to remember that long URL, and you can give it your own name.

But if you accumulate more than a handful, this gets unwieldy. The next step is probably to make your own custom home page, not on the web for others to look at, but on your hard disk so that you can get to it easily. It's surprising how few people do this (which accounts for the huge traffic to www.netscape.com, the default home page for Netscape browsers: about 12% of net traffic, according to a WWW6 paper!). Show them how. Let's say that every morning you visit different sites with news of the computer business. Rather than wade through bookmarks, put them all on your own HTML page, not on the web but on your hard disk. For example:

<B>News</B> *
<A HREF="http://nytsyn.com/live/Latest_columns/">NY Times Syndicate</A> *
<A HREF="http://www.zdnet.com/home/filters/news.html">Ziff-Davis</A> *
<A HREF="http://www.techweb.com/techweb/newsroom/newsroom.html">CMP TechWeb</A> *
<A HREF="http://www.sjmercury.com/business/">SJMerc</A> *
<A HREF="http://www.news.com">c|net</A> *
<A HREF="http://interactive3.wsj.com/edition/current/summaries/front.htm">WSJ</A> *
<A HREF="http://cnnfn.com/">cnnfn</A><P>
It's surprising how few people realize they can use HTML on their hard disk, for their own use, without ever placing the page on the web, just for organizing their own stuff. Note that URLs can even refer to things on your own machine: file:/// URLs. This raises interesting possibilities and problems (security, AUX: bug, etc.).

So, you've now got your own private Yahoo, in a way. Now you add the search engines you frequently use:

<B>Search</B> *
<A HREF="http://www.altavista.digital.com">AltaVista</A> *
<A HREF="http://metacrawler.cs.washington.edu/">MetaCrawler</A> *
<A HREF="http://www.yahoo.com">Yahoo</A><P>
Clicking on the AltaVista link takes you, of course, to AltaVista, where you fill out the search form (which is the HTML interface to a CGI program running at AltaVista), and go.

What could be simpler? Plenty! If you do this often, why click on a link to go to a form? Why not just have the form itself, located on your own machine?

<FORM method=GET 
action="http://www.altavista.digital.com/cgi-bin/query">
<INPUT TYPE=hidden NAME=pg VALUE=q>
<INPUT TYPE=hidden NAME=what VALUE=web>
<INPUT TYPE=hidden NAME=fmt VALUE=".">
<INPUT NAME=q size=30 maxlength=200 VALUE="">
<INPUT TYPE=submit VALUE="AltaVista">
</FORM>

<FORM METHOD=GET ACTION="http://search.yahoo.com/bin/search">
<INPUT SIZE=30 NAME=p> <INPUT TYPE=submit VALUE="Yahoo">
</FORM>
<P>
This requires a small but important conceptual leap: form and CGI aren't tied. You can have the form on one machine and the CGI on another. You can change the form, within certain limits, replacing some fill-in fields with hard-wired information. (You can even dispense with the form entirely in some cases, and drive the CGI entirely with a URL. See below.)

Show them how to move the AltaVista form to a new web page on their own machine, then how to modify the form. Note the problem: if the underlying CGI changes, the site will presumably change its own HTML form to correspond, but yours will break. This is the price of HTML/CGI separation: brittleness.
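
As a rough illustration of uncovering a form's "implicit API," here is a minimal Python sketch that reads saved HTML source and lists each form's ACTION, METHOD, and named fields. The names FormSniffer and sniff_forms are made up for this example, and only INPUT elements are handled (SELECT and TEXTAREA are ignored), so treat it as a starting point rather than a complete tool.

```python
from html.parser import HTMLParser


class FormSniffer(HTMLParser):
    """Collect each form's ACTION, METHOD, and named INPUT fields."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per <FORM> seen
        self._current = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)   # HTMLParser lowercases tag and attribute names
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "GET").upper(),
                "fields": [],
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None and attrs.get("name"):
            # Unnamed inputs (such as a plain submit button) send nothing.
            self._current["fields"].append({
                "name": attrs["name"],
                "type": (attrs.get("type") or "text").lower(),
                "value": attrs.get("value") or "",
            })

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def sniff_forms(html_source):
    """Return a list of form descriptions found in html_source."""
    sniffer = FormSniffer()
    sniffer.feed(html_source)
    sniffer.close()
    return sniffer.forms


# The AltaVista form shown earlier in this chapter.
ALTAVISTA_FORM = """
<FORM method=GET
action="http://www.altavista.digital.com/cgi-bin/query">
<INPUT TYPE=hidden NAME=pg VALUE=q>
<INPUT TYPE=hidden NAME=what VALUE=web>
<INPUT TYPE=hidden NAME=fmt VALUE=".">
<INPUT NAME=q size=30 maxlength=200 VALUE="">
<INPUT TYPE=submit VALUE="AltaVista">
</FORM>
"""

for form in sniff_forms(ALTAVISTA_FORM):
    print(form["method"], form["action"])
    for field in form["fields"]:
        print("   ", field["type"], field["name"], repr(field["value"]))
```

Running the same function over a freshly saved copy of the site's page, and comparing the field names with the ones your own form sends, is one way to notice that the underlying CGI has changed.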

Show a few other examples: Yahoo, stock ticker, phone book.

<FORM METHOD="POST" ACTION="http://qs-alt.secapl.com/cgi-bin/qs">
<input type=hidden name="gif" value="1"> 
<input type=hidden name="time" value="0000000855270834"> 
<a href=http://www.secapl.com/secapl/quoteserver/ticks.html>
<b>Ticker Symbols</a> : </b> 
<i>(Up to 5 tickers may be entered separated by spaces)</i><br> <dd>

<INPUT NAME="tick" size=30 maxlength=50> <b><font color=0000ff>
<input type="submit" value=" Get Quotes "></font></b>
</FORM>
A lot of stuff here, but all we care about is:
<FORM METHOD="POST" ACTION="http://qs.secapl.com/cgi-bin/qs"> 
<INPUT name="tick" size=25 maxlength=50 type=submit value="MSFT"> 
Click here to check Microsoft stock
</FORM>
Take the case of the stock ticker, though: now you've got the form on your site, interfacing to a program on their site. Amazing! Each morning you come in, type "MSFT" into a form on your machine, and it goes and fetches Microsoft info from their machine. Let's take a minute to savor this. But let's say you only ever check "MSFT" and "NSCP". Why type them in every morning? (Or, if you're like some people, every hour.) Why not just have your own custom link, not a form, that fetches these particular stocks?
<A HREF="http://quote.yahoo.com/quotes?SYMBOLS=MSFT&detailed=t">Microsoft stock quote</A>

<A HREF="http://qs.secapl.com/cgi-bin/qs?tick=MSFT">Microsoft stock quote</A>
Show how to snarf a form and turn it into a URL when you know in advance how you're going to fill in all the fields. (Of course, you can then put these on an HTML page not just for your own benefit, but for others on the web: give examples.) Discuss the difference between METHOD=POST and METHOD=GET: GET puts the field values in the URL's query string, while POST sends them in the body of the request. In most cases, a POST can be turned into a GET. But a site could have set things up so that its CGI only works with POST.
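
The form-to-URL step can be sketched in a few lines of Python using the standard urllib module. The function names form_to_url and form_to_post are hypothetical, the sketch assumes the ACTION URL has no query string of its own, and nothing is actually sent over the network. Whether a given CGI accepts the GET version of a POST form is something you can only find out by trying it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def form_to_url(action, fields):
    """Return the URL a METHOD=GET form would request for these field values.

    Assumes the ACTION has no query string of its own.
    """
    return action + "?" + urlencode(fields)


def form_to_post(action, fields):
    """Build (but do not send) the request a METHOD=POST form would make."""
    body = urlencode(fields).encode("ascii")
    return Request(action, data=body)  # supplying data makes urllib use POST


# GET: the field values travel in the URL, so the result can be used as a link.
print(form_to_url("http://qs.secapl.com/cgi-bin/qs", {"tick": "MSFT NSCP"}))

# POST: the same encoded values travel in the request body instead.
request = form_to_post("http://qs.secapl.com/cgi-bin/qs", {"tick": "MSFT NSCP"})
print(request.get_method(), request.full_url, request.data)
```

Note that urlencode turns the space between the two ticker symbols into a plus sign, which is how a browser encodes form fields too.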

Okay, so now you need to know how to snarf forms and change them. Do this with the stock-ticker example again, then with more complicated ones: TechWeb, DejaNews, etc. Brittleness problem again (actual example of changes at TechWeb).

A lot of the "magic" here involves <INPUT TYPE="hidden">. How to turn open fields into hidden pre-filled-in fields.
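
One way to picture the hidden-field trick is a small Python sketch that takes the field values you would otherwise type in and writes out a form in which every field is hidden and pre-filled, leaving only a submit button. The function name prefilled_form is made up for this example, and the stock-ticker values are the ones used earlier in this chapter.

```python
from html import escape


def prefilled_form(action, method, fields, button_label):
    """Return HTML for a form whose every field is hidden and pre-filled."""
    lines = ['<FORM METHOD="%s" ACTION="%s">' % (escape(method), escape(action))]
    for name, value in fields.items():
        # escape() protects quotes and ampersands inside attribute values.
        lines.append('<INPUT TYPE="hidden" NAME="%s" VALUE="%s">'
                     % (escape(name), escape(value)))
    lines.append('<INPUT TYPE="submit" VALUE="%s">' % escape(button_label))
    lines.append("</FORM>")
    return "\n".join(lines)


print(prefilled_form("http://qs.secapl.com/cgi-bin/qs", "POST",
                     {"tick": "MSFT"}, "Check Microsoft stock"))
```

The output can be pasted straight into your own page on your hard disk. The only thing left for the user to do is click the button.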

More general point here about snarfing code on web: "View document source" is your friend, use it often.

FedEx invites snarfing its form, if I understand the following correctly: "Many large FedEx customers have put links to FedEx's Internet track and trace system into their websites so that employees, even customers, can track a shipment without ever visiting FedEx on the web. Over half of all electronic tracking requests are initiated from customer websites, he said" (http://www.yahoo.com/headlines/970425/tech/stories/fedex_1.html).