Grok the Web

A Programmer's Guide to the New Software Development Paradigm

by Andrew Schulman


Chapter 1

The Crude Beginnings of a New Operating System

Last revised: April 15, 1997


Introduction
Why Would You Want to Do That?
sidebar: Rich Text and Email Bombs
sidebar: Shopping Carts and Cookies
URLs: Handles to Data
sidebar: ISBN
sidebar: A FORM and Its ACTION are Distinct
HTML
Distributed Computation: CGI
Creating Synapses: HTTP
A New Type of Application
A Lesson for the Software Industry

Now that web addresses such as http://www.ups.com appear on the sides of buses and trucks, and http://www.msnbc.com and www.cnn.com (and, bizarrely, www.heavensgate.com) show up on the nightly TV news, these odd-looking web incantations are almost becoming familiar and taken for granted, like phone numbers. There are probably millions of people who know the term "URL", and perhaps a million or so for whom even HTML tags such as <IMG SRC> and <A HREF> hold no terrors. I recently overheard a bus driver give out his URL to one of his passengers; he was very casual in his use of the seemingly-arcane Uniform Resource Locator. He then proceeded to discuss the use of interlaced GIF files in an HTML <IMG> tag.

Rather than laugh at the seeming incongruity of a bus driver involving himself in the details of web-page construction, I hope that you'll instead agree this isn't incongruous at all: it is great! Web programming is accessible to millions of people in a way that, say, Windows programming never was, and never could be. HTML coding is so simple that you might well dispute whether it is programming at all.

But in this chapter, I hope to have you stare at some fairly ordinary-looking HTML code, long enough to see, not only that it is programming, but also that it represents something very important: the beginnings of a new platform for software development.

To most readers of this book, the following source code for a web page likely seems no big deal:

<HTML>
<A HREF="http://www.amazon.com/exec/obidos/ISBN=1568843054">
<IMG SRC="http://www.idgbooks.com/images/smallcovers/1-56884-305-4.gif"
ALT="Click here to order"></A>
</HTML>
This web page (Example 1-1) is written, of course, in HTML, the Hypertext Markup Language. The <IMG> tag causes a web browser to display an inline image; in this case, it happens to be a GIF (Graphics Interchange Format) file of the cover of my previous book, located at the publisher's web site, www.idgbooks.com. (I'm going to be mentioning my previous book a lot in this chapter, but merely as an example that I know intimately; many other examples would work just as well, and I don't intend this as any sort of advertisement for Unauthorized Windows 95 -- especially when, as you'll see, I now find the intricacies of Windows programming a lot less interesting and important than I once did.)

If someone clicks on the image, the <A HREF> tag takes them to an order form for the book at an online bookstore, www.amazon.com.

As I said, this HTML code is No Big Deal. And when displayed by a web browser, it certainly doesn't look like much either. It's just a picture of a book cover; if the mouse hovers over the picture, the note "Click here to order" (from the <IMG ALT> attribute) briefly appears, at least in some browsers (see Figure 1-1). If the reader has turned off graphics in their browser, that note is all that appears.

Figure 1-1: from
http://www.sonic.net/~undoc/book/ex11.html
Figure 1-1

But look again. This HTML page is located on one machine (www.sonic.net); the image is located on a second machine (www.idgbooks.com); clicking on the image takes you to a third machine (www.amazon.com). Using any number of tools (described later in this chapter), one can readily find out that one of these machines is running the NCSA/1.5.2 web server, another is running Apache/1.1.1, and the third is running Netscape-Commerce/1.12. They have little relation to each other. Yet they have been lashed together to form some sort of odd hybrid.
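How does one "readily find out" what server software a site runs? A web server will often announce itself in the Server: header of its reply, so a tool need only ask. The sketch below (in Python, with a canned response standing in for the network so it is self-contained; the real version would open a TCP connection to the host on port 80) shows the idea behind such tools:

```python
# Sketch of a server-identification tool. A canned response stands in for
# an actual network round trip, so the example is self-contained.

def head_request(host, path="/"):
    # HEAD asks for a resource's headers only, with no body
    return "HEAD %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)

def server_header(response_text):
    # Scan the response headers for the Server: line
    for line in response_text.split("\r\n"):
        if line.lower().startswith("server:"):
            return line.split(":", 1)[1].strip()
    return None

canned = ("HTTP/1.0 200 OK\r\n"
          "Server: Apache/1.1.1\r\n"
          "Content-Type: text/html\r\n"
          "\r\n")
print(server_header(canned))  # prints: Apache/1.1.1
```

Services such as the Netcraft survey (used later in this chapter) do essentially this on your behalf.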

That all these comprise a single hyper-document may not be apparent from this example. After all, a reader must explicitly click on the image to go to the order form. So perhaps the following variation (Example 1-2) will make clearer what is going on here:

<HTML>
<FRAMESET FRAMEBORDER=no BORDER=0 COLS=135,*>
<FRAME SRC="http://www.idgbooks.com/images/smallcovers/1-56884-305-4.gif">
<FRAME SRC="http://www.amazon.com/exec/obidos/ISBN=1568843054">
</FRAMESET>
</HTML>
Using the <FRAMESET> and <FRAME> tags introduced by Netscape, and now supported by other web browsers such as the Microsoft Internet Explorer (MSIE), this small HTML page on one machine visually joins together the components from the other two machines. As can be seen in Figure 1-2, the "FRAMEBORDER=no BORDER=0" attributes make the join seamless. The "COLS=135,*" allocates space for the image in the first (borderless) frame; the amazon.com order form nicely flows into whatever space we've left over for it in the second frame. {{This works less smoothly since amazon.com redesigned its site: its pages now lay themselves out with TABLE (as a GETURL inspection instructively shows -- they do not use FRAME), leaving less real estate in our frame for the book description. The older, simpler layout is still reachable by putting "change-style" in the URL before the ISBN: /change-style/ISBN turns the graphics off, and change-style/ISBN/t/ turns them back on. So the example still works.}}

Figure 1-2: from
http://www.sonic.net/~undoc/book/ex12.html
Figure 1-2

Thus, two components from two different vendors, with no coordination between them, have been happily married together -- or, at least, hustled together in a shotgun wedding. How can this work without explicit coordination? In a way, that's the subject of this book, but briefly the answer is: industry standards. So long as these vendors run software that adheres to standards such as the Hypertext Transfer Protocol (HTTP), things work out pretty well, most of the time.

To put it one way, what these few lines of HTML represent is a "compound document," with its components located on different machines, running different brands of web server, and probably running different operating systems as well. Furthermore, as we'll see later, one of the components is not even a file, but the dynamically-generated output of a program. Thus, we not only have a compound document in Example 1-2, but also distributed computation: both of these were supposed to be "hard" problems, yet here they appear in the form of hypertext/hypermedia links, which even many non-programmers feel comfortable constructing -- which my bus driver enjoys constructing.

To put it another way, we have here the crude beginnings of a platform for a new type of software. Ok, make that very crude, because as we'll see there are plenty of problems. And those industry standards are odd things: every vendor seems to have its own version. But we really are witnessing the beginnings of a new type of software: less generic than the software we've become used to, tailored to a specific task and a specific body of content, with more emphasis on that content; software that is truly "document-centric," to use a phrase that Microsoft popularized but could never seem to follow through on. And the ratio of effort to effect seems amazingly low, at least to someone coming from a C and Windows programming background, where, notoriously, it took one hundred lines of code just to display the string "hello world!"

Why Would You Want to Do That?

But I'm getting ahead of myself here. Many computer books, including the ones I've written, tend to explain how to do something, without ever explaining why you would want to do it in the first place. In my books on Windows programming, I would explain how to call undocumented Windows functions such as PrestoChangoSelector() or TabTheTextOutForWimps() (both genuine names, by the way). A natural question would be "Why would you want to do that?" Indeed, it was sometimes difficult to explain why someone would want to call a function named TabTheTextOutForWimps(). A lot of times, I knew about solutions, without knowing what problems they were good for.

Now I'm writing about very widely documented HTML tags such as <IMG> and <FRAME>, but all the same, "Why would you want to do that?" remains a good question. So what if two separate documents can be lashed together with frames into something that looks like a single document? Perhaps this is just another solution in search of a problem, another hammer to which everything looks like a nail?

Well, this time around I actually had the problem before I knew the solution. Let me step back and explain.

One of the frustrations of being an author is that there are always people who apparently would like to buy your book, but who can't seem to find it in any bookstore. When you're an author on email, you hear about it directly, like: "I want to get your book, but none of the bookstores near me carry it. Could you tell me where to find it, or better yet, send me a copy? I'll send you a check by return mail." I receive a fair number of emails like this: as if the publisher's warehouse were located in my garage! As if an author has anything to do with selling his or her own book. That's the publisher's job, right? Well, as we'll see, the web has a way of turning this sort of assumption on its head. In fact, these customers were absolutely right to ask me; now I need to support them.

I got tired of sending out individual emails telling people about different places they could try to find my previous book, so I decided to put up a web page with this information. At first, the page just had some names of likely bookshops, with their web addresses (Example 1-3):

{{B&N new site accessible via ISBN? Time, 14 April 1997; no, only AOL: http://biz.yahoo.com/prnews/97/03/18/aol_bks_x_1.html}}

<HTML>
A few places online where you can purchase <I>Unauthorized Windows 95</I> and related books:
<UL><LI><A HREF="http://www.amazon.com/exec/obidos/ats-query-page">amazon.com</A>
<LI><A HREF="http://www.books.com/scripts/search1.exe">Book Stacks</A> 
<LI><A HREF="http://www.compubooks.com/bin/shop/compubooks/e/shop.html#search">CompuBooks</A> 
<LI><A HREF="http://www.staceys.com">Stacey's</A> (San Francisco CA)
<LI><A HREF="http://www.softproeast.com/">Softpro</A> (Burlington MA; Denver CO)
<LI><A HREF="http://www.clbooks.com/cgi-bin/searchform">Computer Literacy</A> (San Jose CA)
<LI><A HREF="http://www.quantumbooks.com/find.html">Quantum Books</A> (Cambridge MA)
<LI><A HREF="http://www.powells.portland.or.us/cgi-bin/mk-search.pl">Powell's Technical Books</A> (Portland OR)
<LI><A HREF="http://www.compbook.co.uk/cgi-win/search.exe/single">Computer Bookshop</A> (London)
<LI><A HREF="http://secure.bookshop.co.uk/search.htm">Internet Book Shop</A> (UK)
<LI><A HREF="http://www.lmet.fr/Rechercht.html">Le monde en 'tique</A> (Paris)
<LI><A HREF="http://www.buchkatalog.de/kod-bin/isuche.exe?lang=deutsch&dbname=Buchkatalog&PARAM=LNKUSERID&Aktion=Suche">KNO-K&V Buch Katalog</A> (Germany)
<LI><A HREF="http://www.hotline.com.au/hotkey.html">Hotline Books</A> (Australia)
<LI><A HREF="http://www.yahoo.com/Business_and_Economy/Companies/Books/Computers/">
Other online stores for computer books</A> (from Yahoo!)
</UL>
</HTML>
I'd email them the web address of this page. If more people used HTML-enabled mail readers like the Netscape Messenger Mailbox, or if more vendors would support HTML-based email (as Microsoft does in Outlook Express 4.0), I could email them the actual web page, as shown in Figure 1-3: note how my little email message is joined to the contents of the web page I'm sending, and how all the links in the email message are clickable.

Figure 1-3: ex13.html in mail message
Figure 1-3

Rich Text and Email Bombs

HTML-based email is one example of how HTML is becoming a "lowest common denominator" file format; perhaps it will eventually replace plain-vanilla ASCII. The possibilities for "richer" text bring with them additional worries, however.

For example, the ability to construct "unsafe URLs" (see chapter 2), coupled with the implicit URL loading that's triggered by the <IMG> tag, means we'll not only be seeing rich-text email, but also email "bombs." As a particularly nasty example, Windows users might consider the following line of HTML:

<IMG SRC="file://AUX">

A web browser or HTML-enabled email reader, on encountering the <IMG SRC> tag, will try to load the image whose URL is given in the SRC= attribute. A file:// URL points to a file on the recipient's local file system (again, see chapter 2). Unfortunately, Microsoft's MS-DOS Programmer's Reference notes that "If the auxiliary device is not present or not ready to receive or send data, a program that reads or writes data to the device may hold indefinitely." Since someone receiving this one-line HTML page is unlikely to have an AUX device ready to deliver data, much less an image, "hold indefinitely" is precisely what the browser or emailer does. In Windows 95, this can halt everything, leaving the unfortunate recipient with no choice but to turn their machine off and on again. Given that the Windows 95 file system is aggressively cached, this can cause serious data loss. Richard Smith of Phar Lap Software has been beating the drums about this one, and hopefully PC browser vendors will do something to plug this nasty security hole. (Yes, an HTML page's ability to crash the user's browser, and possibly the operating system, is a security hole.)

Add the facts that HTML can now include JavaScript code, and that the <APPLET>, <EMBED> and <OBJECT> tags can automatically load some piece of ("safe," yeah, right) binary code, and you might think twice about the mere act of opening an email from an unknown or untrusted source in an HTML-enabled mail reader, because merely opening such a document means executing the unknown code.

{{Not to be confused with "Good Times" email virus hoax/urban legend. See http://www.physics.uiuc.edu/~weitzen/humor/hoax.html, an FAQ which includes: "Is an email virus possible? The short answer is no, not the way Good Times was described. The longer answer is that this is a difficult question that's open to nitpicking." Goes on to acknowledge that "There are some email programs that can be set to automatically download a file attachment, decode it, and execute the file attachment. If you use such a program, you would be well advised to disable the option to automatically execute file attachments." That's essentially what we're dealing with here.}}

HTML may well become the ASCII of the 21st century (the first few years of it, anyway), and that will mean far more "powerful" documents, but blurring the old false division between documents and programs does come at a price: increased risk. Only those who believe in Santa Claus and free lunches will be surprised at this. Perhaps HTML-based email readers (and all HTML browsers, for that matter) will require an option to disable all implicit loading (IMG, FRAME, APPLET, etc.) and instead present the user with a selection menu.

{{http://www.infoworld.com/cgi-bin/displayTC.pl?/reviews/970324antivirus.htm: Norton AntiVirus for Internet Email Gateways; description at http://www.symantec.com/nav/wpnavieg.pdf}}

From the links on this page, I figured people could easily find the book. About the only assistance I provided was to try to give them a link for each bookstore's search form, rather than for the bookstore's home page (for example, http://www.compbook.co.uk/cgi-win/search.exe/single rather than just http://www.compbook.co.uk). Not exactly door-to-door service; it was more like dropping them off at the corner, and waving vaguely in the direction of my book.

Still, it seemed rather obvious how to fill out a search form at an online bookstore. Figure 1-4, for example, shows part of the search form at Le monde en 'tique, a wonderful computer bookstore in Paris:

Figure 1-4: from http://www.lmet.fr/Rechercht.html
Figure 1-4

Filling out a form like this generally brings you to a list of books by an author you've specified, or of books whose title includes words you've specified. From there, you click on the book you want, and you're taken to a page with an order form, and sometimes with reviews or descriptive material. For example, Figure 1-5 shows part of a page for my book at the Computer Literacy chain of bookstores in Silicon Valley; back in Figure 1-2, the right frame shows a similar page from amazon.com's large online bookstore.

Figure 1-5: from http://www.clbooks.com/cgi-bin/displayinfo
Figure 1-5

Having just said that such pages typically contain an "order form," you can see from both Figure 1-2 and Figure 1-5 that, in fact, there's really just a button that says "Add this book to your shopping basket" or "Put in shopping cart," or perhaps something cute like "Put this in my book bag."

Shopping Carts and Cookies

Why isn't there an order form located on the page for each book? Because all these online stores want to make it relatively easy for you to buy more than one book at a time. This requires the ability to browse around, select a book or video cassette or CD or whatever to buy, browse around some more, pick more items to buy, and when you're done shopping, to proceed to the check-out counter.

This presents an interesting technical problem, because HTTP, the Hypertext Transfer Protocol upon which much of the web is currently based, is a "stateless" protocol: the server treats each request as a self-contained transaction, keeping no memory ("state") of any earlier request from the same client. A client, such as a web browser, connects to a server, asks for a page, gets the page, and then in HTTP/1.0 is even disconnected by the server. Even if the server does keep the connection open for additional requests from the client (as in HTTP/1.1 and the "Connection: Keep-Alive" option of HTTP/1.0), the server is still not supposed to "remember" anything from one request to another. Meanwhile, a virtual shopping cart clearly requires memory, or "state": whatever you put in the cart must somehow survive from one request to the next.

By now, the solution to the shopping-cart problem is common, and shows the flexibility of Internet standards. An application can either carry "state" around in hidden fields of an HTML form (chapter 4 will discuss "<INPUT TYPE=hidden>" in detail), or it can use the "persistent cookies" solution introduced by Netscape (and now supported by other vendors, including Microsoft). "Cookies" are really just an extension to HTTP headers. So shopping carts did not require a major overhaul of web standards; they could be built right on top of stateless HTTP, by making clever use of HTML or by adding a few new HTTP headers.
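Both tricks amount to the same thing: the server hands the client a token, and the client hands it back with each later request. A sketch in Python, with hypothetical field and cookie names (real stores pick their own):

```python
# Two ways to fake "state" on top of stateless HTTP. The "cart-id" name
# is hypothetical; the value is the sort of transaction ID a store assigns.

def hidden_field(name, value):
    # State tucked into the HTML form itself; the browser sends it back
    # (unseen by the user) when the form is next submitted
    return '<INPUT TYPE=hidden NAME="%s" VALUE="%s">' % (name, value)

def set_cookie_header(name, value):
    # State as an extra HTTP response header (Netscape's extension) ...
    return "Set-Cookie: %s=%s; path=/" % (name, value)

def cookie_header(name, value):
    # ... which the browser then echoes back on every later request
    return "Cookie: %s=%s" % (name, value)

cart = "2503-2540908-490803"   # a transaction ID, as in Figure 1-6
print(hidden_field("cart-id", cart))
print(set_cookie_header("cart-id", cart))
print(cookie_header("cart-id", cart))
```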

{{Privacy concerns over cookies: http://www.news.com/News/Item/0,4,8770,00.html.}}

{{Hidden form fields regarded as kludge: Orfali/Harkey, "Client/Server Programming with JAVA and CORBA," ch. comparing to CGI, p. 226: "the kludge is to use hidden fields within a form to maintain state on the client side.... in essence, the CGI program stores the state of the transaction in the forms it sends back to the client instead of storing it in its own memory. What do you think of this workaround? We did warn you that it was going to be a real work of art." But what actually so bad? Must be the sheer simplicity they dislike?}}

Hidden fields and cookies paper over HTTP's statelessness, but statelessness is hardly the protocol's only well-known problem. It could be argued that HTTP has become a victim of its own success: it was too simple and didn't "scale well," is not a "good network citizen," and is now in part responsible for the "World Wide Wait." According to http://www.w3.org/pub/WWW/Protocols/HTTP/1.0/HTTPPerformance.html, "The effects of HTTP/1.0's use of TCP on the Internet have resulted in major problems caused by congestion and unnecessary overhead." Yet the solutions are by no means obvious. Another paper, http://www.w3.org/pub/WWW/Protocols/HTTP/Performance/Pipeline.html, reports on tests done with persistent connections, pipelined requests, and data compression, using as sample data a combination of the Microsoft and Netscape home pages ("Microscape"!). The results are somewhat surprising; for example: "An HTTP/1.1 implementation that does not implement pipelining will perform worse (have higher elapsed time) than an HTTP/1.0 implementation using multiple connections." In other cases, the improvements were merely modest. While completely new binary protocols such as HTTP-NG have been proposed, any successor to HTTP is likely to be a text-based superset of HTTP.
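To see what "pipelined requests" means at the byte level, note that an HTTP/1.1 client may write several requests back-to-back on one connection, without waiting for the first response. A sketch (the host and paths below are hypothetical):

```python
# Pipelining at the byte level: several GET requests in one buffer,
# written on one connection. Under HTTP/1.1 the connection is persistent
# by default, so no "Connection: Keep-Alive" header is needed.

def pipelined_requests(host, paths):
    # Build a single buffer holding one GET request per path
    buf = ""
    for path in paths:
        buf += "GET %s HTTP/1.1\r\nHost: %s\r\n\r\n" % (path, host)
    return buf

buf = pipelined_requests("www.example.com", ["/index.html", "/logo.gif"])
print(buf.count("GET "))  # prints: 2 -- two requests, one write
```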

Incidentally, if you want to learn more about web technologies such as cookies or HTTP/1.1, a good place to start is the web itself. In particular, the Yahoo! site has imposed some organization on the chaos of the web, by introducing logically-structured URLs. For example, you can find some key links to information about cookies at:

http://www.yahoo.com/Computers_and_Internet/Internet/World_Wide_Web/HTTP/Protocol_Specification/Persistent_Cookies/

This page includes a link to a tutorial "How to make cookies and shopping cart" (http://www.ids.net/~oops/tech/make-cookie.html), by a company that has made a business out of supplying shopping-cart software to other vendors on the web (http://www.rent-a-cart.com).

Figure 1-6 shows part of a shopping basket at amazon.com. (The number "2503-2540908-490803" in the URL is not a credit card number, incidentally, but a transaction ID assigned automatically by the store.) Somewhat awkwardly, you can use your browser's "Back" button to step back and buy more books. Eventually, you say you're done, and you press a button that says something like "Buy items now." In the case of amazon.com and other stores, this takes you to a secure server (note the "https://" in the URL in Figure 1-7), which provides a form into which you can enter your credit-card number (though, as seen in Figure 1-7, it's also easy enough to instead phone in your credit-card number). When a URL contains https:// rather than http://, the Secure Sockets Layer (SSL) encrypts all inbound and outbound packets; we'll look at SSL in more detail later.

Figure 1-6: from http://www.amazon.com/exec/obidos/shopping-basket
Figure 1-6

Figure 1-7: from https://www.amazon.com/exec/obidos/order2
Figure 1-7

If you've made it this far (it's definitely a more complex process than one would like), you enter the shipping address, press a last confirmation button, and you're done. You soon receive an email confirmation, your credit card is debited, you receive another email when the items are shipped (this email will often include a package-tracking number; see below), and with any luck, a couple of days later you receive the items themselves. Sometimes I've had two-day turnaround (order over the web on Monday night, and the package arrives Wednesday); other times I've had to wait six weeks.

I've tried to give a fairly complete picture here of the user's experience of online commerce. It's definitely more complicated than it should be: all those fields to enter and buttons to press. Using the browser's Back button as a major navigational device feels unnatural. Still, online shopping often beats a trip to the local bookstore or CD shop (for example, you can search for books whose title you only vaguely remember, without trying to get an overworked clerk to help you).

More important, from my perspective as an author trying to support my books online, I could tap into this whole process with just a few lines of HTML. As Figure 1-8 shows, it can even be made to appear seamlessly as part of my own page (the entire right side is part of the form for sending credit-card information to amazon.com's secure server; incidentally, if you're wondering why the "https://" URL does not show up in the browser's location bar, see the sidebar on "Naive URL Re-Use" in chapter 2).

Figure 1-8: from http://www.sonic.net/~undoc/book/ex12.html
Figure 1-8

So why would I, someone who has messed around for the past ten years in the hidden, undocumented internals of MS-DOS and Windows, want to get involved with such a high-level, almost non-programming activity as HTML coding? It was things like those credit-card images in Figure 1-7 that did it. Once I saw how easy it was to tap into the hard work that others had done, and adapt it for my own applications, I was hooked. It became clear that web sites have the potential to be reusable software components (see chapter 5, "The Tools Approach to the Web").

True, the results are somewhat primitive, and the web has many problems. But these only make things more interesting: problems are, of course, merely opportunities in disguise -- so long as the fundamentals work well. The display in Figure 1-8, enabled by the few lines of code in Example 1-2, shows, I think, how well the fundamentals of the web do indeed work.

If you've developed software before, you should be absolutely thrilled, amazed, and excited that the ridiculously simple code in Example 1-2 produces the display in Figure 1-8. True, the code for the credit-card form on the right side of Figure 1-8 appears nowhere in Example 1-2; all I've done is link to someone else's code, which I've neglected to show. Am I therefore minimizing the actual amount of work that's required to use HTML to build an application such as the one shown in Figure 1-8? Not at all. Aside from the fact that building forms in HTML is trivial, the real point is precisely that I didn't have to concern myself with the work involved in producing the form: it behaves as a truly reusable software component, even re-formatting itself to fit the frame into which I've coerced it.

URLs: Handles to Data

As you'll recall, my attempt to nudge along sales of my book on the web had got as far as posting the list of bookstores in Example 1-3. As I said, I would take potential customers and drop them off at the corner, and from there they could hopefully find their own way to my book.

At an online store called Book Stacks Unlimited, though, I noticed this at the bottom of the page for one of my books:

Link URL = http://www.books.com/scripts/view.exe?isbn~1568843054

The phrase "Link URL" was highlighted; clicking on it produced this explanation:

If you have a reference about a book on your Web site, you can now link it to our bookstore. Why would you want to do this? Well if you're a publisher or an author, it's the perfect way to sell your titles online without having to deal with any of the ordering or shipping details.

But what if you're neither the publisher or author, and you just happen to have put something up on your site about a good book? Partner with Book Stacks and we will share 8% of all of the resulting sales with you.

I'll get into that 8% business later; for now, I just want to focus on the fact that, using the ISBN number (see below), one could link into the store's database, directly to the page for almost any book in print. How odd, that each book in print had its own "home page," so to speak, on the web. One could have a link, not just to a bookstore, but directly to an order form for a specific book. I didn't have to drop potential customers off at the corner; I could give them door-to-door service.

Perhaps this all seems obvious, but to me it was a revelation. I've seen others have this same sort of blinding revelation, the day it dawns on them how powerful URLs really are. Note that it takes only about 50 bytes of text to pinpoint my book. This same pointer works from any web-capable machine in the world. The HTML in Example 1-1, which additionally pinpoints a graphic of the book, located on a completely different machine, takes a total of perhaps 200 bytes. Once you have laboriously found your way to some resource on the web that you want to reuse, it is frequently trivial to then save away a handle to that specific resource: not only save it away as a "bookmark" in your web browser (yawn), but far more significantly, incorporate it as a link, image, form, or frame into your own projects.

Now, I know there are notorious problems with URLs, and we'll examine these in chapter 2, but first take a moment to consider how well they often do work. The following line is quite amazing, if you stop to think about it:

Buy it!

<A HREF="http://www.books.com/scripts/view.exe?isbn~1568843054">Buy it!</A>

Nor was this ability to link directly into very specific parts of their database unique to the Book Stacks site. Several other online bookstores provided the same capability. ISBN linking didn't work at every store listed in Example 1-3, but it worked at enough of them that I could now drive potential buyers directly to the book at several different stores:

http://www.amazon.com/exec/obidos/ISBN=1568843054

http://www.viamall.com/softpro/1-56884-305-4.html

http://www.hotline.com.au/cgi-bin/title_search?keyword=1568843054 (Australia)

You'll immediately notice that, while all four of these URLs refer to the same ISBN, 1-56884-305-4, the syntaxes are otherwise quite different. Evidently, there are few rules for the formation of URLs, once you get past the protocol://hostname part. Each site has set up its system differently. This is to be expected, but it means that rules learned for one system don't help much when trying to link into a different one. As we'll discuss in more detail in chapter 2, "educated guesses" can be a poor way of generating URLs.
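One way to cope with this variety is to treat each store's URL syntax as a template keyed by ISBN. A sketch, using the store URLs quoted above (treat the patterns as a snapshot: any site can change its URL scheme without notice):

```python
# Each store's URL syntax, captured as a template keyed by ISBN.

def hyphenate_isbn(isbn10):
    # One store wants the hyphenated form; this particular split
    # (group-publisher-title-check) is specific to this publisher prefix
    return "%s-%s-%s-%s" % (isbn10[0], isbn10[1:6], isbn10[6:9], isbn10[9])

TEMPLATES = {
    "amazon.com":  "http://www.amazon.com/exec/obidos/ISBN=%(plain)s",
    "Book Stacks": "http://www.books.com/scripts/view.exe?isbn~%(plain)s",
    "Softpro":     "http://www.viamall.com/softpro/%(hyphens)s.html",
}

def store_urls(isbn10):
    forms = {"plain": isbn10, "hyphens": hyphenate_isbn(isbn10)}
    return dict((store, tmpl % forms) for store, tmpl in TEMPLATES.items())

for store, url in store_urls("1568843054").items():
    print("%s: %s" % (store, url))
```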

In fact, one might suspect that the efficiency of the URLs for my book, and what little regularity they have, owe more to the ISBN numbering scheme (see sidebar) than to the URL scheme.

ISBN

The International Standard Book Number (ISBN) consists of ten digits and is made up of four parts: Group identifier, Publisher identifier, Title identifier, and Checksum digit. (The Group identifier, assigned by the ISBN Agency in Berlin, stands for language and country groups; for instance, English-speaking countries use 0 or 1.) {{Is ORA prefix 1565 or 15659?}} The ISBN system was established in 1968; according to http://www.reedref.com/Standards/isbn.html, "virtually every item sold in a bookstore requires an ISBN as increasing numbers of publishing systems base their entire inventory on the ISBN."
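The checksum digit, incidentally, can be verified mechanically: each of the first nine digits is weighted from 10 down to 2, and the check digit brings the weighted sum up to a multiple of 11, with the letter X standing in for a check value of ten. A quick sketch:

```python
# ISBN-10 check-digit computation: weight the first nine digits 10 down
# to 2; the check digit makes the total a multiple of 11 ('X' = ten).

def isbn10_check_digit(first_nine):
    total = sum(int(d) * w for d, w in zip(first_nine, range(10, 1, -1)))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)

print(isbn10_check_digit("156884305"))  # prints: 4 -- so 1-56884-305-4 checks out
```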

If it seems remarkable that each book in print has an address on the web, it's additionally noteworthy that www.amazon.com recently appears to have added about a million out of print books to its catalog. This includes even some books published before the introduction of ISBN numbers in 1968: for example, http://www.amazon.com/exec/obidos/ISBN=0849534313 has a publication date of June 1923!? I asked info@amazon.com about this, and was told: "We used a variety of data sources and selected only books for which ISBNs were available as a method of discriminating easier to fill out of print titles. If we had chosen books w/o ISBNs, the task would be substantially harder to acquire and receive these titles."

{{Placed order for some OOP books on March 17, as of XXX hadn't received any notice. Placed orders for these same books with ABE on April 20, next day received email about three of them: two (Edelman, Disney) being sent to me, another not available (Boktin; but other copies at ABE). B&N also says carry OOP books? Time, 17 April 1997}}

{{Books out of print at http://www.reedref.com/bop/. The Lane Egypt ISBN comes from Bowker, see http://www.reedref.com/bop/BOPformsrch.html}}

{{OOP books at http://www.abebooks.com/cgi/abe.exe/routera^_pr=inventoryKeys^phase=1; listed in http://www.bookwire.com/index/antiquarian-booksellers.html. Looks excellent! Also http://www.antiquarian.com/bookworm/: Antiquarian BookWorm. Also InterLoc (http://daniel.interloc.com): last week went "public access," previously had been limited to dealers and libraries. Huge database! Placing search form in one frame, results in another (including email form) makes interactive.}}

But it really isn't just items with ISBN numbers that work this way. As another example, I mentioned earlier that after one of these online stores ships an item to you, it will typically send you an email containing a package-tracking number. If the store ships via United Parcel Service (UPS), for example, you can go to the form at http://www.ups.com/tracking/tracking.html, enter the package-tracking number, and find out where your package is ("what do they mean, they already delivered it?! oh, there it is, under the newspapers on the back porch"). The HTML source for the form looks like this:

<!-- from http://www.ups.com/tracking/tracking.html -->
<FORM METHOD="POST" ACTION="http://wwwapps.ups.com/tracking/tracking.cgi">
<P><I>Tracking number:</I> <INPUT TYPE="TEXT" NAME="tracknum" SIZE="40"></P>
<P><INPUT TYPE="submit" VALUE="Track this package"> <INPUT TYPE="reset" VALUE="Clear form and start over"></P>
</FORM>
In other words, this form takes the text field named "tracknum", and POSTs it to http://wwwapps.ups.com/tracking/tracking.cgi. That's it.

The question is, can we dispense with this form and instead construct a URL directly to our package? For reasons we'll get into in chapter 2, taking data meant to be POSTed and instead tacking it onto the end of the URL (the equivalent of METHOD=GET) doesn't always work. In this case, however, it does; you can place the tracking number directly in a URL:

Where's my package?!

<A HREF="http://wwwapps.ups.com/tracking/tracking.cgi?tracknum=1Z742E220310270799"> Where's my package?!</A>
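Constructing such a URL by hand amounts to appending a URL-encoded name/value pair to the form's ACTION address. Here is a minimal sketch in Python; the address and the "tracknum" parameter name come straight from the UPS form shown above, and the tracking number is just the example from the text:

```python
# Build a "tracking URL" by hand: ACTION address + "?" + encoded form data.
from urllib.parse import urlencode

base = "http://wwwapps.ups.com/tracking/tracking.cgi"
params = {"tracknum": "1Z742E220310270799"}
url = base + "?" + urlencode(params)
print(url)
# http://wwwapps.ups.com/tracking/tracking.cgi?tracknum=1Z742E220310270799
```

This is exactly the transformation from METHOD=POST to METHOD=GET: the same data, moved from the request body into the URL itself.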

Figure 1-9

{{Fedex similar, e.g., http://www.fedex.com/cgi-bin/track_it?trk_num=4364020496&dest_cntry=U.S.A.&ship_date=040797}}

A FORM and Its ACTION are Distinct

You may be wondering why the form is located at http://www.ups.com, yet the URL I manufactured goes to http://wwwapps.ups.com. These really are different machines: http://www.netcraft.com/cgi-bin/Survey/whats?host=www.ups.com&port=80 reveals that www.ups.com is running Netscape-Enterprise/2.01, while http://www.netcraft.com/cgi-bin/Survey/whats?host=wwwapps.ups.com&port=80 reveals that wwwapps.ups.com is running Netscape-Communications/1.1.

Note that the location of a form -- the "front-end" or user interface to a program -- and the location of the program that acts as the "back end" to this form, need have nothing in common. Because the ACTION= attribute of a <FORM> tag often contains a relative URL, it is frequently forgotten that ACTION= can be a full URL. This is an important point for chapter 4, "Snarfing Forms," which will show how a form at one site can act as the user interface for software running at a completely different site.

For now, it's worth mentioning one example: if METHOD=POST, a form's ACTION can even be a "mailto:" URL; submitting the form then simply sends an email message to the specified address. For example:

<FORM ACTION="mailto:andrew@ora.com" 
	ENCTYPE="application/x-www-form-urlencoded"
	METHOD=POST>
var1: <INPUT TYPE=text NAME="var1">
var2: <INPUT TYPE=text NAME="var2"> <INPUT TYPE=submit> </FORM>
Typing "this is a test" into the var1 text field, and "666" into the var2 text field, will result in something like the following mail message:
From: John Doe 
X-Mailer: Mozilla 4.0b2 (Win95; I)
MIME-Version: 1.0
To: andrew@ora.com
Subject: Form posted from Mozilla
Content-type: application/x-www-form-urlencoded

var1=this+is+a+test&var2=666
While this is actually sometimes useful -- you can have a form POST data to yourself to see exactly how the posted data would look to a program -- the point here is merely that, quite obviously in this example, the front-end form and its back-end ACTION are distinct.
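The body of that mail message is ordinary application/x-www-form-urlencoded data, so you can reproduce it with any URL-encoding routine. A sketch, using the same field names and values as the example above (note how the spaces become "+" signs):

```python
# Reproduce the urlencoded form body from the mail message above.
from urllib.parse import urlencode

body = urlencode({"var1": "this is a test", "var2": "666"})
print(body)   # var1=this+is+a+test&var2=666
```

Whether the encoded string travels over HTTP or SMTP, the format is the same; that is the sense in which the form and its ACTION are distinct.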

Now, I don't know about you, but it strikes me as very strange that every package has its own globally-accessible URL, its own "home page" on the web, as it were (see Figure 1-9). {{See more like this in chapter 2: "An Address for Everything" on URLs}}

{{rough from here on}}

I want to make sure the reader understands that I'm not showing them how to "surf" the web; that is not the purpose of URLs here. A URL is like the address of a function that returns data. "Returns"? Or just displays? You may dispute this, along with the whole assumption of this chapter that sending back a document to display in a browser is somehow equivalent to returning a value to a caller. I seem to be claiming that, for example, www.altavista.digital.com is really a world-wide altavista() function, callable via some sort of RPC. Yes, that is exactly what I'm claiming. Note that HTTP does not display a document; it really does return data to a client, and what the caller does with that data is its business. Browsers happen to be the clients we're most familiar with, but a client is not necessarily a browser. Browsers are clients that display, using HTML tags as instructions on how to display. Another piece of software, however, might fetch the same document (in the same way a browser does) and then do something entirely different with it -- such as extract a single piece of information. See the Clinton Wong book.
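The point that a client need not be a browser can be sketched in a few lines. Here the network fetch is deliberately left out and replaced with a canned page, since the point is only that the returned document is data a program can pick apart; the HTML fragment is invented for illustration, loosely following the amazon.com pages used later in this chapter:

```python
import re

def extract_price(page_html):
    """Treat a fetched document as a return value: pull one field out of it."""
    # Anchored pattern: a line consisting solely of "List: $<number>".
    m = re.search(r"^List: \$([0-9.]+)$", page_html, re.MULTILINE)
    return float(m.group(1)) if m else None

# A canned stand-in for what a real client would fetch over HTTP:
page = "<TITLE>Some Book</TITLE>\nList: $39.99\nAvailability: in stock\n"
print(extract_price(page))  # 39.99
```

A browser renders this page; this "client" instead returns 39.99 to its caller. Same document, different use.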

Named pipes?

An example should make this clearer: a DOS batch file that, given an ISBN, returns the book's price. It uses the geturl utility from chapter 5, together with grep. To anticipate, geturl works like this:

usage: geturl [options] <http://whatever or -stdin>
  options:
  -noloc : don't do HTTP relocations (default on)
  -base <addr> : use addr as base for all relative URLs
  -head : do HTTP HEAD (default GET)
  -post <data> : do HTTP POST of data
  -input <file> : get all HTTP headers from file
  -stdin : get URLs from stdin
  -split : break HTML output into lines on tags
C:\>type isbn2price.bat
@echo off
geturl -split http://www.amazon.com/exec/obidos/ISBN=%1 | grep List:

C:\>isbn2price 1568843054
List: $39.99
Since the word "List:" might appear elsewhere in the text (in reader annotations, for example; see below), for robustness we should probably make this egrep "^List: \$[0-9.]*$" instead. (See the O'Reilly book on regular expressions.) If you don't have grep, you can use the DOS find command, though grep is better:
geturl -split http://www.amazon.com/exec/obidos/ISBN=%1 | \windows\command\find "List:"
Note that geturl.exe automatically follows Location: redirections.
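The value of anchoring the pattern with ^ and $ can be demonstrated directly. Here is a sketch using Python's regex engine in place of egrep; the annotation line is an invented example of the kind of false match the anchors are meant to reject:

```python
import re

# Anchored: the entire line must be "List: $<digits and dots>".
anchored = re.compile(r"^List: \$[0-9.]+$")

real_line = "List: $39.99"
annotation = "A reader wrote: the List: price here is too high"

print(bool(anchored.match(real_line)))    # True
print(bool(anchored.match(annotation)))   # False
```

An unanchored search for "List:" would have matched both lines; the anchors keep the program from being fooled by stray occurrences in page text.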

There's no error handling here (bad ISBNs, out-of-print books, and so on aren't handled). We can use egrep to search for multiple strings at once. I'm using -split to separate all HTML tags from the text, but the result still contains some long lines (e.g., for the out-of-print book below): {{Use ^ and $ to make egrep search more robust}}

C:\>type isbn2price.bat
@echo off
geturl -split http://www.amazon.com/exec/obidos/ISBN=%1 | \
   egrep "(List:)|(Sorry)|(Availability)|(wrong)"
echo.

C:\>isbn2pri 1
Something seems to have gone wrong...

C:\>isbn2pri 0198226357
Availability: This item is out of print, but if you place an order we
may be able to find you a used copy within 2-6 months. We can't
guarantee a specific condition, binding, or edition. If we find a
copy, we will notify you via e-mail and request your approval of the
price. We'll also notify you if we can't find a copy.  PLEASE NOTE:
Each out of print item is shipped and billed separately.

C:\>isbn2pri 0060922753
List: $12.50
Availability: This item usually shipped within 2-3 days.
Maybe for this chapter we should stop here, and refer to the C code in chapter 5. To really make clear that data on the web is a reusable component, we need to show a C function that takes an ISBN and returns a price, built on top of the new http function in http.c. But that requires putting redirection handling inside http().

Here is a more sophisticated version: an isbn2price() function in AWK. It could as easily have been done in Perl or C. {{Make search patterns more robust with ^ and $}}

In AWK, grep can be implemented in a few lines {{show grep.awk?}}, so obviously any grep pattern can be turned into an AWK program.

C:\>isbn2pri 1568843054
ISBN #1568843054: List price: $39.99 (US)

C:\>isbn2pri 0140177388
ISBN #0140177388: List price: $5.95 (US)

C:\>isbn2pri 038508031X
ISBN #038508031X: List price: $11.00 (US)

C:\>isbn2pri 0849534313
ISBN #0849534313: Out of print

C:\>isbn2pri 1
ISBN #1: Incorrect ISBN format?
# isbn2price.awk

BEGIN {
    CANT_FIND = -1;
    INTERNAL_ERROR = -2;
    OUT_OF_PRINT = -3;
    SOMETHING_WRONG = -4;
    }

function isbn2price(isbn)
{
    cmd = "geturl http://www.amazon.com/exec/obidos/ISBN=" isbn;
    while (cmd | getline)
    {
        if ($0 ~ /^List: \$/)
        {
            nf = split($2, arr, "$");
            return arr[2];
        }
        else if ($0 ~ /^Sorry,/)
            return CANT_FIND;
        else if ($0 ~ /^Availability: This item is out of print, but/)
            return OUT_OF_PRINT;
        else if ($0 ~ /Something seems to have gone wrong.../)
            return SOMETHING_WRONG;
    }

    # still here
    print "List error: amazon.com format may have changed!" > stderr;
    return INTERNAL_ERROR;
}

function print_price(isbn)
{
    printf("ISBN #%s: ", isbn);
    price = isbn2price(isbn);
    if (price == INTERNAL_ERROR) print "Internal error";
    else if (price == CANT_FIND) print "Couldn't locate book";
    else if (price == OUT_OF_PRINT) print "Out of print"; 
    else if (price == SOMETHING_WRONG) print "Incorrect ISBN format?";
    else printf("List price: $%.02f (US)\n", price);
}

BEGIN {
    if (ARGC < 2)
        exit;
    for (i=1; i<ARGC; i++)
        print_price(ARGV[i]);
    ARGC = 1;
    }
Notice that a pattern/action language is good for this kind of work: you expect certain strings, and specify what to do when you get them. (Compare the language Expect, which is designed for building applications like this.)
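The same expect-certain-strings structure carries over to other languages. Here is a hedged Python sketch of the pattern/action loop from isbn2price.awk, operating on page text that a real client would have fetched with something like geturl; the sample inputs are invented but mimic the amazon.com responses shown above:

```python
import re

CANT_FIND, INTERNAL_ERROR, OUT_OF_PRINT, SOMETHING_WRONG = -1, -2, -3, -4

def isbn2price(page_text):
    # Mirror the AWK pattern/action pairs: scan line by line,
    # act on the first recognized pattern.
    for line in page_text.splitlines():
        m = re.match(r"List: \$([0-9.]+)", line)
        if m:
            return float(m.group(1))
        if line.startswith("Sorry,"):
            return CANT_FIND
        if line.startswith("Availability: This item is out of print"):
            return OUT_OF_PRINT
        if "Something seems to have gone wrong" in line:
            return SOMETHING_WRONG
    return INTERNAL_ERROR   # none matched: page format may have changed

print(isbn2price("List: $39.99\n"))                           # 39.99
print(isbn2price("Something seems to have gone wrong...\n"))  # -4
```

Each pattern is paired with an action, and falling off the end of the loop is itself a meaningful result (the site's format has probably changed).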

Name/title to ISBN:

C:\>type amazsearch.bat
@echo off
geturl -post author=%1&author-mode=full&title=%2&title-mode=word \
   http://www.amazon.com/exec/obidos/ats-query/ | egrep "ISBN(=|:)"

C:\>amazsearch Schulman Unauthorized
<a href="/exec/obidos/ISBN=1568841698/0712-5505930-011155">Unauthorized Windows
95 : A Developer's Guide to Exploring the Foundations of Windows 'Chicago'</a>;
<a href="/exec/obidos/ISBN=1568843054/0712-5505930-011155">Unauthorized Windows
95 : Developer's Resource Kit/Book and 2 Disks</a>;
<a href="/exec/obidos/ISBN=1568847076/0712-5505930-011155">Unauthorized Windows
95 CD-ROM</a>;

C:\>amazsearch Kisseloff Box
<a href="/exec/obidos/ISBN=0140252657/2304-8300827-761158">The Box : An Oral His
tory of Television, 1920-1961</a>;
<a href="/exec/obidos/ISBN=0670864706/2304-8300827-761158">The Box : An Oral His
tory of Television, 1929-1961</a>;

C:\>amazsearch Waldmeir Mammon
ISBN: 0933951647<br>
{{Need to output something more useful if response is actual page, not list of pages?}}
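Stripping the ISBN back out of those anchor tags is itself a one-line pattern. A sketch (the href format is exactly the one shown in the output above; the trailing number after the ISBN is the temporary shopping-cart ID, which we simply ignore):

```python
import re

html = ('<a href="/exec/obidos/ISBN=1568841698/0712-5505930-011155">'
        'Unauthorized Windows 95</a>;')
# ISBNs are digits, possibly ending in X (the ISBN check character).
isbns = re.findall(r"ISBN=([0-9X]+)", html)
print(isbns)   # ['1568841698']
```

With the ISBNs in hand, each one can be fed right back into the isbn2price lookup: one site's output becomes another query's input.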

Of course, this program is now dependent upon the format of amazon.com's pages. {{Especially risky because, as we'll see later, any user can add annotations to an amazon page. An annotation might include, either accidentally or maliciously, the very patterns our software is depending on!}}

But such dependencies are nothing new: compare DLL "versionitis". Microsoft is quite strict about linkages in COM and so on, but the fact is that this never worked too well with DLLs (WinSock, etc.). These are loosely coupled systems. We do need to do something about changes to the underlying site, though: perhaps add an error page with a mailto: URL?

{{Make following into sidebar??}}

In case this seems contrived, here's another, genuine example: the Amazon Top 50 computer books list (http://www.amazon.com/exec/obidos/bestsellers/computer50). {{SHOW screen shot}} I wanted to know the publisher rankings, and could throw an app together very quickly.

The source code for the top 50 list looks like this (the number after the ISBN is a temporary shopping-cart number, assigned via redirection):

<B>1.</B> <a href="/exec/obidos/ISBN=1568302894/1270-0422490-440341">Creating Killer Web Sites : The Art of Third-Generation Site Design</a>;
 David Siegel; Paperback; $27.00; <i>Descriptive information available.</i><p>
<B>2.</B> <a href="/exec/obidos/ISBN=0062514792/1270-0422490-440341">What Will Be : How the New World of Information Will Change Our Lives</a>;
 Michael L. Dertouzos; Hardcover; $15.00; <i>Descriptive information available.</i><p>
<B>3.</B> <a href="/exec/obidos/ISBN=1562057154/1270-0422490-440341">Designing Web Graphics .2</a>;
 Lynda Weinman; Paperback; $33.00; <i>Descriptive information available.</i><p>
<B>4.</B> <a href="/exec/obidos/ISBN=1565921496/1270-0422490-440341">Programming Perl</a>;
 Larry Wall, et al; Paperback; $23.97; <i>Descriptive information available.</i><p>
...
First, a little program to generate the list of publishers, in order. On the first pass, get the list of URLs:
geturl -split http://www.amazon.com/exec/obidos/bestsellers/computer50 | \
   grep ISBN= > amaz50.lst
amaz50.lst looks like this:
<a href="/exec/obidos/ISBN=0062514792/2076-6105754-531263">
<a href="/exec/obidos/ISBN=1568302894/2076-6105754-531263">
<a href="/exec/obidos/ISBN=1565921496/2076-6105754-531263">
<a href="/exec/obidos/ISBN=0471117099/2076-6105754-531263">
<a href="/exec/obidos/ISBN=0201633612/2076-6105754-531263">
...
Now submit this list to geturl -stdin and get back 50 pages, one for each book. The list contains relative URLs, so use GETURL's -base option to resolve each of them against www.amazon.com; GETURL can peel off the A HREF:
geturl -base http://www.amazon.com -stdin < amaz50.lst
In each page there is information like the following. {{Show extract from amaz50.big for Programming Perl.}}

We want the "Published by" line. (Later, we'll want the price too.) Since "Published by" might appear in annotations, grep for "Published by" at the beginning of a line, with <BR> at the end:

geturl -split http://www.amazon.com/exec/obidos/bestsellers/computer50 | \
   grep ISBN= > amaz50.lst
geturl -base http://www.amazon.com -stdin < amaz50.lst | \
   grep "^Published by .* <br>"
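The whole two-pass pipeline (fetch the list page, pull out the book URLs, fetch each book page, keep only the anchored "Published by" lines) can be sketched end-to-end. Here the fetch step is stubbed with canned pages, since the point is the data flow; the page fragments are invented but follow the formats shown above:

```python
import re

# Stub: maps URL path -> canned page text (a real client would fetch these).
pages = {
    "/exec/obidos/ISBN=1565921496/x":
        "<TITLE>Programming Perl</TITLE>\nPublished by O'Reilly & Associates <br>\n",
    "/exec/obidos/ISBN=1568302894/x":
        "<TITLE>Creating Killer Web Sites</TITLE>\nPublished by Hayden Books <br>\n",
}
list_page = "".join('<a href="%s">' % path for path in pages)

# Pass 1: pull the book URLs out of the best-seller list page.
urls = re.findall(r'href="([^"]*ISBN=[^"]*)"', list_page)

# Pass 2: fetch each page, keep only anchored "Published by ... <br>" lines.
publishers = []
for u in urls:
    for line in pages[u].splitlines():
        m = re.match(r"Published by (.*) <br>$", line)
        if m:
            publishers.append(m.group(1))

print(publishers)   # ["O'Reilly & Associates", 'Hayden Books']
```

This is the same structure as the geturl/grep pipeline: one program's output (a list of URLs) piped straight into the next fetch.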
{{But standard publisher prefixes! E.g., ORA is 56592. Silly to fetch doc from amazon for each one?}}

The output looks like the following: an in-order list of the publishers of the top 50 computer books:

Published by Harpercollins<br>
Published by Hayden Books<br>
Published by O'Reilly & Associates<br>
Published by John Wiley & Sons<br>
Published by Addison-Wesley Pub Co<br>
Published by Harvard Business School Pr<br>
Published by Ap Professional<br>
Published by Addison-Wesley Pub Co<br>
Published by Microsoft Pr<br>
Published by O'Reilly & Associates<br>
Published by Prentice Hall<br>
Published by Microsoft Pr<br>
Published by O'Reilly & Associates<br>
Published by Harpercollins (Paper)<br>
...
We need a more robust program that also gets the prices, so I did this in AWK. See a50.awk; it still needs a way to sort the last three tables. It produces HTML output (see also chapter 7 on the web browser as a display engine): you get a GUI via printf!

Now I have a simple sort function, but I'm still having problems. There must be some major bug in a50.awk!

function sort_table(table) {
	local x, tx, tmp, arr, DELIM, y;
	DELIM = "!@#@!";  # some random pattern we don't expect to see in table
	for (x in table) {
		tx = table[x];
		tmp[tx] = (tx in tmp) ? (tmp[tx] DELIM x) : x;
		}
	for (x in tmp) {
		split(tmp[x], arr, DELIM);
		for (y in arr)
			sorted[x][y] = arr[y];
		}
	return sorted;
	}

function do_html_table(title, array) {
	local x, y, srt;
    print "<H1>", title, "</H1>";
    print "<TABLE BORDER=1>";
	srt = sort_table(array);
	REVERSE = 8;
	SORTTYPE += REVERSE;
    for (x in srt)
		for (y in srt[x]) {
			print srt[x][y], x > stderr;
        	print "<TR><TD ALIGN=RIGHT>", srt[x][y], "<TD>", x, "</TR>";
			}
    print "</TABLE>";
    }
# a50.awk

# TODO: figure out a good way to SORT table!
function do_html_table(title, array) {
    print "<H1>", title, "</H1>";
    print "<TABLE BORDER=1>";
    for (x in array)
        print "<TR><TD ALIGN=RIGHT>", array[x], "<TD>", x, "</TR>";
    print "</TABLE>";
    }

BEGIN {
    amaz50big = "amaz50.big";
    if (filesize(amaz50big) == -1) {
        cmd = "geturl -split ";
        cmd += "http://www.amazon.com/exec/obidos/bestsellers/computer50";
        cmd += " | grep ISBN= | geturl -base http://www.amazon.com -stdin";
        cmd += " > " amaz50big;
        system(cmd);
        }

    while (getline < amaz50big) {
        if ($0 ~ /^Published by .*<br>/i) {
            sub("Published by ", "", $0);
            sub("<br>", "", $0);
            publisher[++i] = $0;
            count[$0]++;
            }
        else if ($0 ~ /^<NOBR>List: \$[0-9.]*<\/NOBR>/i) {
            sub(/<\/?NOBR>/i, "", $2);
            sub(/\$/, "", $2);
            price[++j] = $2;
            }
        else if ($0 ~ /^<TITLE>.*<\/TITLE>$/i) {
            gsub(/<\/?TITLE>/i, "", $0);
            title[++k] = $0;
            }
        }
    close(amaz50big);

    if ((i != j) || (i != k)) {
        print "Something wrong! Publisher/price mismatch!";
        exit;
        }
    top = i;

    print "<H1>Top 50 computer books</H1>";
    print "<TABLE BORDER=1>";
    for (x in publisher)
        print "<TR><TD ALIGN=RIGHT>", x, "<TD>", publisher[x], "<TD>", 
              price[x], "<TD>", title[x], "</TR>";
    print "</TABLE>";

    # figure out "weight" for each publisher (ranking, price)
    for (x in publisher) {
        p = publisher[x];
        rank = (top - x) / 10;
        weight[p] += rank;
        wtpr[p] += (rank * (price[x] / 10));
        }

    do_html_table("Number of books per publisher", count);
    do_html_table("Weighted rankings", weight);
    do_html_table("Weighted rankings, with price", wtpr);
    }
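The weighting scheme at the end of a50.awk reduces to a few lines in any language. A sketch with invented sample data, using the same formulas as the AWK code (rank = (top - position)/10, and a price-weighted variant using price/10):

```python
# Invented sample: (publisher, price) in best-seller order; position 1 = top.
books = [("Hayden Books", 27.00), ("Harpercollins", 15.00),
         ("O'Reilly & Associates", 23.97), ("Hayden Books", 33.00)]

top = len(books)
count, weight, wtpr = {}, {}, {}
for pos, (pub, price) in enumerate(books, start=1):
    rank = (top - pos) / 10.0          # earlier in the list = higher rank
    count[pub] = count.get(pub, 0) + 1
    weight[pub] = weight.get(pub, 0) + rank
    wtpr[pub] = wtpr.get(pub, 0) + rank * (price / 10.0)

# Publishers sorted by weighted ranking, descending:
print(sorted(weight, key=weight.get, reverse=True))
```

The choice of /10 scale factors is arbitrary (as it is in a50.awk); what matters is that a publisher with several books high on the list outweighs one with a single low entry.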
Show sample output after running a50 > a50.html, and compare with the earlier shot of the amazon page. (Note that the tables could retain the links from the original amazon file.) Then load a50.html into a browser; it would be nice for this step to be automatic (see chapter xxx on "local CGI").

The next step is to host this app on the web itself (by porting it to Unix).

{{Can also apply to Amazon Top 100 (not just computer books), or any list at amazon? http://www.amazon.com/exec/obidos/subst/amazon500-1.html is top 100, has link to amazon500-2.html, etc.}}

{{end possible sidebar}}

Treating the web as a file system (though often there is no actual file at the server; see CGI below), we can apply tools to it and write programs to manipulate it, just as one writes programs to manipulate files on the user's hard disk.

For more on programming like this, see Clinton Wong, Web Client Programming with Perl: Automating Tasks on the Web. Also see the avsubmit example later. We'll look at this further in chapter 2 on URLs, and in chapter 5.

The realization here is that a web site returns data; it doesn't display it. WOW! That data can be manipulated in other ways, besides merely displaying it. Many others have had the same realization. For example, Jon Udell, Byte magazine's executive editor for new media, wrote a great article on this topic for the November 1996 issue of Byte (of course, the article is available on the web: http://www.byte.com/art/9611/sec9/art1.htm):

On-Line Componentware: I use AltaVista to build BYTE's Metasearch application and realize that every Web site is a software component.

Software components can turn up in the unlikeliest places. In our May 1994 cover story ("Componentware," http://www.byte.com/art/9405/sec5/sec5.htm), for instance, we pointed out that object-oriented programming (OOP) technology had failed to produce a rich harvest of plug-and-play software objects. However, we showed that Visual Basic custom control (VBX) technology -- a hastily conceived mechanism for Visual Basic plug-ins -- had, to everyone's surprise, jump-started a thriving component-software industry.

Fast-forward to 1996. I want to prototype a Web-search application that embraces BYTE and five fellow McGraw-Hill publications. I have only a few hours to spend on the task. What component can I pull off the shelf and use? Java or ActiveX components? They're coming, but they're not here yet. Distributable search engines? They exist, but deployment across six Web sites will take more than the allotted few hours.

As I drove home from work, I suddenly knew where to find the right component for the job. It was sitting in plain view at http://www.altavista.digital.com/. That's right -- Digital Equipment's AltaVista, a public Web site, is also the software component that let me prototype the McGraw-Hill Metasearch application before I went to bed that night.

A powerful capability for ad hoc distributed computing arises naturally from the architecture of the Web.

{{Need summary here}}

HTML

{{Need subtitle; chapter 3 is "Compound Documents: HTML"}}


Distributed Computation: CGI


Creating Synapses: HTTP


A New Type of Application


A Lesson for the Software Industry