As we discussed earlier, every document on the World Wide Web has a unique address. (Imagine the chaos if they didn't.) The document's address is known as its Uniform Resource Locator (URL).[2]
[2] ``URL'' usually is pronounced ``you are ell,'' not ``earl.''
Several HTML tags include a URL attribute value, including hyperlinks, inline images, and forms. All use the same URL syntax to specify the location of a Web resource, regardless of the type or content of that resource. That's why it's known as a Uniform Resource Locator.
Since they can be used to represent almost any resource on the Internet, URLs come in a variety of flavors. All URLs, however, have the same top-level syntax:
scheme:scheme_specific_part
The scheme describes the kind of object the URL references; the scheme_specific_part is, well, the part that is peculiar to the specific scheme. The important thing to note is that the scheme is always separated from the scheme_specific_part by a colon (:) with no intervening spaces.
URLs are written using the displayable characters in the US-ASCII character set. If you need to use a character in a URL that is not part of this character set, you must encode the character using a special notation. The encoding notation replaces the desired character with three characters: a percent sign and two hexadecimal digits whose value corresponds to the position of the character in the ASCII character set.
This is easier than it sounds. One of the most common encoded special characters is the space character, whose position in the character set is 20 hexadecimal. To encode a space in a URL, replace it with %20:
http://www.kumquat.com/new%20pricing.html
This URL actually retrieves a document named new pricing.html from the server.
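The encoding rule is mechanical enough to try out. Here is a minimal Python sketch, assuming the standard library's urllib.parse module is available; the helper name percent_encode_char is ours, purely for illustration:

```python
from urllib.parse import quote, unquote

def percent_encode_char(ch: str) -> str:
    """Return the %XX form of a single US-ASCII character (illustrative helper)."""
    code = ord(ch)  # position of the character in the character set
    if code > 0x7F:
        raise ValueError("this sketch only handles US-ASCII characters")
    return "%{:02X}".format(code)

print(percent_encode_char(" "))        # %20
print(quote("new pricing.html"))       # new%20pricing.html
print(unquote("new%20pricing.html"))   # new pricing.html
```

The library's quote and unquote functions do the same job for whole strings, so you rarely need to encode characters by hand.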
In addition to the nonprinting characters, you'll need to encode reserved and unsafe characters in your URLs as well.
Reserved characters are those characters that have a specific meaning within the URL itself. For example, many URLs use the slash character (/) to separate elements of a pathname within the URL. If you need to include a slash in a URL that is not intended to be an element separator, you'll need to encode it as %2F:
http://www.calculator.com/compute?3%2F4
This URL actually references the resource named compute on the www.calculator.com server and passes the string 3/4 to it, as delineated by the question mark (?). Presumably, the resource is actually a server-side program that performs some arithmetic function on the passed value and returns a result.
Unsafe characters are those that have no special meaning within the URL, but may have a special meaning in the context in which the URL is written. For example, the double-quote character (") is used to delimit URLs in many HTML tags. If you were to include a double-quote directly in a URL, you would probably confuse the HTML browser. Instead, encode the double-quote as %22 to avoid any possible conflict.
Other reserved and unsafe characters that should always be encoded are shown in Table 6.1.
Character | Description | Usage | Encoding |
---|---|---|---|
; | Semicolon | Reserved | %3B |
/ | Slash | Reserved | %2F |
? | Question mark | Reserved | %3F |
: | Colon | Reserved | %3A |
@ | At sign | Reserved | %40 |
= | Equal sign | Reserved | %3D |
& | Ampersand | Reserved | %26 |
< | Less than sign | Unsafe | %3C |
> | Greater than sign | Unsafe | %3E |
" | Double quote | Unsafe | %22 |
# | Hash symbol | Unsafe | %23 |
% | Percent | Unsafe | %25 |
{ | Left curly brace | Unsafe | %7B |
} | Right curly brace | Unsafe | %7D |
\| | Vertical bar | Unsafe | %7C |
\ | Backslash | Unsafe | %5C |
^ | Caret | Unsafe | %5E |
~ | Tilde | Unsafe | %7E |
[ | Left square bracket | Unsafe | %5B |
] | Right square bracket | Unsafe | %5D |
` | Back single quote | Unsafe | %60 |
In general, you should always encode a character if there is some doubt as to whether it can be placed ``as-is'' in a URL. As a rule of thumb, any character other than a letter, number, or any of the characters $-_.+!*'(), should be encoded.
It is never an error to encode a character, unless that character has a specific meaning in the URL. For example, encoding the slashes in an http URL will cause them to be used as regular characters, not as pathname delimiters, breaking the URL.
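The rule of thumb can be sketched as a small Python function. This is only an illustration under our own assumptions: the names SAFE_CHARS and encode_by_rule_of_thumb are hypothetical, and the function is meant for a single URL component, not a whole URL.

```python
import string

# Characters the rule of thumb allows to appear as-is in a URL.
SAFE_CHARS = set(string.ascii_letters + string.digits + "$-_.+!*'(),")

def encode_by_rule_of_thumb(text: str) -> str:
    """Percent-encode every character outside the rule-of-thumb safe set."""
    encoded = []
    for ch in text:
        if ch in SAFE_CHARS:
            encoded.append(ch)
        elif ord(ch) <= 0x7F:
            # Two hex digits giving the character's position in ASCII
            encoded.append("%{:02X}".format(ord(ch)))
        else:
            raise ValueError("this sketch only handles US-ASCII characters")
    return "".join(encoded)

# Encoding one component is harmless:
print(encode_by_rule_of_thumb("3/4"))        # 3%2F4
print(encode_by_rule_of_thumb('say "hi"'))   # say%20%22hi%22

# Encoding a whole URL also encodes its delimiters and breaks it:
print(encode_by_rule_of_thumb("http://www.kumquat.com/new pricing.html"))
# http%3A%2F%2Fwww.kumquat.com%2Fnew%20pricing.html
```

The last call shows the caveat in action: the colon and slashes that gave the URL its structure are now ordinary characters.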
The http URL is, by far, the most common within the World Wide Web. It is used to access documents stored on an http server, and it has two formats:
http://server:port/path#fragment
http://server:port/path?search
Some of the parts are optional. In fact, the most common form of the http URL simply is:
http://server/path
designating the unique server and the directory path and name of a document.
The server is the unique Internet name or Internet Protocol (IP) numerical address of the computer system that stores the Web resource. Like us, we suspect you'll mostly use more easily remembered Internet names for the servers in your URLs.[3] The name consists of several parts, including the server's actual name and the successive names of its network domain, each part separated by a period. Typical Internet names look like www.ora.com or hoohoo.ncsa.uiuc.edu.[4]
[3] Each Internet-connected computer has a unique address: a numeric (IP) address, of course, because computers deal only in numbers. Humans prefer names, so the Internet folks provide us with a collection of special servers and software (Domain Name Service or DNS) that automatically resolve Internet names into IP addresses. InterNIC, a nonprofit agency, registers domain names mostly on a first-come, first-served basis, and distributes new names to DNS servers worldwide.
[4] In the United States and for some Canadian establishments, the three-letter suffix of the domain name identifies the type of organization or business that operates that portion of the Internet. For instance, ``com'' is a commercial enterprise; ``edu'' is an academic institution; and ``gov'' identifies a government-based domain. Outside the United States, a less-descriptive suffix is assigned, typically a two-letter abbreviation of the country name: ``jp'' for Japan and ``de'' for Deutschland, for instance. That convention indicates the traditional distribution of the Internet and presumably will change dramatically as the network proliferates in the rest of the world.
It has become something of a convention that webmasters name their servers www for quick and easy identification on the Web. For instance, O'Reilly & Associates's Web server's name is www, which along with the publisher's acronym-based domain name, becomes the very easily remembered website www.ora.com. Similarly, Sun Microsystems's Web server is named www.sun.com; Apple Computer's is www.apple.com and even Microsoft makes their Web server easily memorable as www.microsoft.com. The naming convention has very obvious benefits which you, too, should take advantage of if you are called upon to create a Web server for your organization.
You may also specify the address of a server using its numerical IP address. The address is a sequence of four numbers, zero to 255, separated by periods. Valid IP addresses look like 137.237.1.87 or 192.249.1.33.
It'd be a dull diversion to tell you now what the numbers mean or how to derive an IP address from a domain name, particularly since you'll rarely if ever use one in a URL. Rather, this is a good place to hyperlink: Pick up any good Internet networking treatise for rigorous detail on IP addressing, such as Ed Krol's The Whole Internet User's Guide and Catalog, published by O'Reilly & Associates.
The port is the number of the communication port through which the client browser connects to the server. It's a networking thing: servers do many things besides serve up Web documents and resources to client browsers, including electronic mail, FTP document fetches, filesystem sharing, and so on. Although all that network activity may come into the server on a single wire, it's typically divided into software-managed ``ports'' for service-specific communications, something analogous to boxes at your local post office.
The default URL port for Web servers is 80. Special secure Web servers (SHTTP or SSL) run on port 443. Most Web servers today use port 80; you need only to include a port number along with an immediately preceding colon in your URL if the target server does not use port 80 for Web communication.
When the Web was in its infancy many months ago, pioneer webmasters ran their Wild Wild Web connections on all sorts of port numbers. For technical and security reasons, system-administrator privileges are required to install a server on port 80. Lacking such privileges, these webmasters chose other, more easily accessible, port numbers.
Now that Web servers have become acceptable and are under the care and feeding of responsible administrators, documents being served on some port other than 80 or 443 should make you wonder if that server is really on the up and up. Most likely, the maverick server is being run by a clever user unbeknownst to the server's bona fide system administrators.
The document path is the UNIX-style hierarchical location of the file in the server's storage system. The pathname consists of one or more names separated by forward slashes (/). All but the last name represent directories leading down to the document; the last name is usually that of the document itself.
It has become a convention that for easy identification, HTML document names end with the suffix .html (they're otherwise plain ASCII text files, remember?). You can easily identify a PC-based server: DOS's restrictions on filenames mean you can have only the three-letter .htm name suffix for HTML documents.
Although the server name in a URL is not case-sensitive, the document pathname may be. Since most Web servers run on UNIX-based systems and UNIX filenames are case-sensitive, the document pathname usually will be case-sensitive, too. Web servers running on DOS machines are not case-sensitive, so neither are their document pathnames. But since it is impossible to know the operating system of the server you are accessing, always assume that the server has case-sensitive pathnames and take care to get the case correct when typing your URLs.
Certain conventions regarding the document pathname have arisen. If the last element of the document path is a directory, not a single document, the server usually will send back either a listing of the directory contents or the HTML index document in that directory. You should end the document name for a directory with a trailing forward slash (/) character, but in practice, most servers will honor the request even if the character is omitted.
If the directory name is just a forward slash alone or sometimes nothing at all, you will retrieve the first (top-level) HTML document or so-called home page in the uppermost root directory of the server. Every well-designed http server should have an attractive, well-designed ``home page''; it's a shorthand way for users to access your Web collection since they don't need to remember the document's actual filename, just your server's name. That's why, for example, you can type http://www.ora.com into Netscape's ``Open'' dialog and get O'Reilly's home page.
Another twist: if the first component of the document path starts with the tilde character (~), it means the rest of the pathname begins from the personal HTML directory in the home directory of the specified user on the server machine. For instance, the URL http://www.kumquat.com/~chuck/ would retrieve the top-level page from Chuck's document collection.
Different servers have different ways of locating documents within a user's home directory. Many search for the documents in a directory named public_html. UNIX-based servers are fond of the name index.html for home pages.
The fragment is a named identifier that points to some key section of a document. In URL specifications, it follows the server and pathname and is separated by the hash (#) symbol. A fragment identifier indicates to the browser that it should begin displaying the target document at the indicated fragment name. As we describe in more detail below, you insert fragment names into a document with the <a> tag and the name attribute. Like pathnames, a fragment name may be any sequence of characters.
The fragment name and the preceding hash symbol are optional; omit them when referencing a document without defined fragments.
Formally, the fragment element only applies to target files that are HTML documents. If the target of the URL is some other document type, the fragment name may be misinterpreted by the browser.
Fragments are useful for long documents. By identifying key sections of your document with a fragment name, you make it easy for readers to link directly to that portion of the document, avoiding the tedium of scrolling or searching through the document to get to the section that interests them.
As a rule of thumb, we recommend that every section header in your documents be accompanied by an equivalent fragment name. By consistently following this rule, you'll make it possible for readers to jump to any section in any of your documents. Fragments also make it easier to build tables of contents for your document families.
The search component of the http URL, along with its preceding question mark, is optional. It indicates that the path is a searchable or executable resource on the server. The content of the search component is passed to the server as parameters that control the search or execution function.
The actual encoding of parameters in the search component is dependent upon the server and the resource being referenced. The parameters for searchable resources are covered later in this chapter, when we discuss searchable documents. Parameters for executable resources are discussed in Chapter 8, Forms.
Here are some sample http URLs:
http://www.ora.com/catalog.html
http://www.ora.com/
http://www.kumquat.com:8080/
http://www.kumquat.com/planting/guide.html#soil_prep
http://www.kumquat.com/find_a_quat?state=Florida
The first example is an explicit reference to a bona fide HTML document named catalog.html that is stored in the root directory of the www.ora.com server. The second references the top-level home page on that same server. That home page may or may not be catalog.html. Sample three, too, assumes there is a home page in the root directory of the www.kumquat.com server, and that the Web connection is to the nonstandard port 8080.
The fourth example is the URL for retrieving the Web document named guide.html from the planting directory on the www.kumquat.com server. Once retrieved, the browser should display the document beginning at the fragment named soil_prep.
The last example invokes an executable resource named find_a_quat with the parameter named state set to the value Florida. Presumably, this resource generates an HTML response that is subsequently displayed by the browser.
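If you'd like to see those pieces pulled apart, here is a minimal Python sketch using the standard urllib.parse module. The function name describe_http_url and the shape of its result are our own choices for illustration; the default port of 80 and the empty-path-means-root behavior follow the description above.

```python
from urllib.parse import urlsplit

def describe_http_url(url: str) -> dict:
    """Break an http URL into the components described in this section."""
    parts = urlsplit(url)
    return {
        "server": parts.hostname,
        # An omitted port means the default Web port
        "port": parts.port if parts.port is not None else 80,
        # An empty path refers to the server's top-level document
        "path": parts.path or "/",
        "fragment": parts.fragment or None,
        "search": parts.query or None,
    }

for sample in (
    "http://www.ora.com/catalog.html",
    "http://www.kumquat.com:8080/",
    "http://www.kumquat.com/planting/guide.html#soil_prep",
    "http://www.kumquat.com/find_a_quat?state=Florida",
):
    print(describe_http_url(sample))
```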
The ftp URL is used to retrieve documents from an FTP (File Transfer Protocol)[5] server. It has the format:
ftp://user:password@server:port/path;type=typecode
[5] FTP is an ancient Internet protocol that dates back to the Dark Ages, around 1975 or so. It was designed as a simple way to move files between machines and remains popular and useful to this day. Some people who are unable to run a true Web server will place their documents on a server that speaks FTP instead.
FTP is an authenticated service, meaning that you must have a valid user name and password in order to retrieve documents from a server. However, most FTP servers also support restricted, nonauthenticated access known as anonymous FTP. In this mode, anyone can supply the username ``anonymous'' and be granted access to a limited portion of the server's documents. Most FTP servers also assume (but may not grant, of course) anonymous access if the user name and password are omitted.
If you are using an ftp URL to access a site that requires a user name and password, include the user and password components in the URL, along with the colon (:) and ``at'' sign (@). More commonly, you'll be accessing an anonymous FTP server, and the user and password components can be omitted.
If you keep the user component along with the ``at'' sign, but omit the password and the preceding colon, most browsers will prompt you for a password after connecting to the FTP server. This is the recommended way of accessing authenticated resources on an FTP server, since it prevents others from seeing your password.
We recommend you never place an ftp URL with a user name and password in any HTML document. The reasoning is simple: anyone can retrieve the document, extract the user name and password from the URL, log into the FTP server, and tamper with its documents.
The ftp server and port are bound by the same rules as the server and port in an http URL, as described above. The server must be a valid Internet domain name or IP address of an FTP server. The port specifies the port on which the server is listening for requests.
If the port and its preceding colon are omitted, the default port of 21 is used. It is necessary to specify the port only if the FTP server is running on some port other than 21.
The path component represents a series of directories, separated by slashes (/) leading to the file to be retrieved. By default, the file is retrieved as a binary file; this can be changed by adding the typecode (and the preceding ;type=) to the URL.
If the typecode is set to d, the path is assumed to be a directory. The browser will request a listing of the directory contents from the server and display this listing to the user. If the typecode is any other letter, it is used as a parameter to the FTP type command before retrieving the file referenced by the path. While some FTP servers may implement other codes, most servers accept i to initiate a binary transfer and a to treat the file as a stream of ASCII text.
Here are some sample ftp URLs:
ftp://www.kumquat.com/sales/pricing
ftp://bob@bobs-box.com/results;type=d
ftp://bob:secret@bobs-box.com/listing;type=a
The first example retrieves the file named pricing from the sales directory on the anonymous FTP server at www.kumquat.com. The second logs into the FTP server on bobs-box.com as user bob, prompting for a password before retrieving the contents of the directory named results and displaying them to the user. The last example logs into bobs-box.com as bob with the password secret and retrieves the file named listing, treating its contents as ASCII characters.
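The same three URLs can be taken apart in code. The following Python sketch is ours, not part of any browser: describe_ftp_url is a hypothetical helper, and it assumes the defaults described above (anonymous access when no user is given, port 21, and a binary transfer when no typecode is supplied).

```python
from urllib.parse import urlsplit, unquote

def describe_ftp_url(url: str) -> dict:
    """Break an ftp URL into user, password, server, port, path, and typecode."""
    parts = urlsplit(url)
    # The typecode, if any, trails the path after ";type="
    path, _, typecode = parts.path.partition(";type=")
    return {
        "user": parts.username or "anonymous",
        "password": parts.password,          # None means "prompt the user"
        "server": parts.hostname,
        "port": parts.port if parts.port is not None else 21,
        "path": unquote(path),
        "typecode": typecode or "i",         # assumed default: binary transfer
    }

print(describe_ftp_url("ftp://www.kumquat.com/sales/pricing"))
print(describe_ftp_url("ftp://bob@bobs-box.com/results;type=d"))
print(describe_ftp_url("ftp://bob:secret@bobs-box.com/listing;type=a"))
```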
The file URL specifies a file stored on a machine without indicating the protocol used to retrieve the file. As such, it has limited use in a networked environment. Its real benefit, however, is that it can reference a file on the user's machine, and is particularly useful for referencing personal HTML document collections, such as those ``under construction'' and not yet ready for general distribution, or HTML document collections on CD-ROM. It has the format:
file://server/path
The file server, like the http server described above, must be the Internet domain name or IP address of the machine containing the file to be retrieved. No assumptions are made as to how the browser might contact the machine to obtain the file; presumably the browser can make some connection, perhaps via a Network File System or FTP, to obtain the file.
If the server is omitted, or the special name localhost is used, the file is assumed to reside on the same machine upon which the browser is running. In this case, the browser simply accesses the file using the normal facilities of the local operating system. In fact, this is the most common usage of the file URL. By creating document families on a diskette or CD-ROM and referencing your hyperlinks using the file://localhost/ URL, you create a distributable, standalone document collection that does not require a network connection to use.
The file path is the location of the file to be retrieved on the desired server. The syntax of the path may differ based upon the operating system of the server; be sure to encode any potentially dangerous characters in the path.
The file URL is easy:
file://localhost/home/chuck/document.html
file:///home/chuck/document.html
file://marketing.kumquat.com/monthly_sales.html
The first URL retrieves /home/chuck/document.html from the user's local machine. The second is identical to the first, except we've omitted the localhost reference to the server; the server name defaults to the local machine. Do notice, however, that the third forward slash is still required in this alternate form.
The third example uses some protocol to retrieve monthly_sales.html from the marketing.kumquat.com server.
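A short Python sketch makes the local-versus-remote distinction concrete. The helper file_url_target is hypothetical; it simply treats an omitted server or the name localhost as "this machine," as described above.

```python
from urllib.parse import urlsplit, unquote

def file_url_target(url: str) -> tuple:
    """Return (server, path) for a file URL; server is None for the local machine."""
    parts = urlsplit(url)
    if parts.scheme != "file":
        raise ValueError("not a file URL")
    host = parts.netloc.lower()
    is_local = host in ("", "localhost")
    return (None if is_local else host, unquote(parts.path))

print(file_url_target("file://localhost/home/chuck/document.html"))
print(file_url_target("file:///home/chuck/document.html"))
print(file_url_target("file://marketing.kumquat.com/monthly_sales.html"))
```

The first two calls produce the same result, which is the point of the localhost shorthand.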
The news URL accesses either a single message or an entire newsgroup within the Usenet news system. It has two forms:
news:newsgroup
news:message_id
An unfortunate limitation in news URLs is that they don't allow you to specify a server for the newsgroup. Rather, users specify their news-server resource in their browser preferences. At one time, not long ago, Internet newsgroups were nearly universally distributed; all news servers carried all the same newsgroups and their respective articles, so one news server was as good as any. Today, the sheer bulk of disk space needed to store the daily volume of newsgroup activity is often prohibitive for any single news server, and there's also local censorship of newsgroups. Hence, you cannot expect that all newsgroups, and certainly not all articles for a particular newsgroup, will be available on the user's news server.
Moreover, many users' browsers may not be correctly configured to read news. We recommend you avoid placing news URLs in your documents except in rare cases.
There are several thousand newsgroups devoted to nearly every conceivable topic under the sun and beyond. Each group has a unique name, composed of hierarchical elements separated by periods. For example,
comp.infosys.www.announce
is the World Wide Web announcements newsgroup. To access this group, use the URL:
news:comp.infosys.www.announce
Every message on a news server has a unique message identifier (ID) associated with it. This ID has the form
unique_string@server
The unique_string is a sequence of ASCII characters; the server is usually the name of the machine from which the message originated. The unique_string must be unique among all the messages that originated from the server. A sample URL to access a single message might be:
news:12A7789B@news.kumquat.com
In general, message IDs are cryptic sequences of characters not readily understood by humans. Moreover, the lifespan of a message on a server is usually measured in days, after which the message is deleted and the message ID is no longer valid. The bottom line: single message news URLs are difficult to create, become invalid quickly, and are generally not used.
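Telling the two forms of news URL apart is straightforward, since only a message ID contains an ``at'' sign. A minimal Python sketch, with the hypothetical helper name classify_news_url:

```python
def classify_news_url(url: str) -> tuple:
    """Return ("newsgroup", name) or ("message", message_id) for a news URL."""
    scheme, sep, rest = url.partition(":")
    if sep != ":" or scheme.lower() != "news" or not rest:
        raise ValueError("not a news URL")
    # Message IDs have the form unique_string@server; newsgroup names do not
    kind = "message" if "@" in rest else "newsgroup"
    return (kind, rest)

print(classify_news_url("news:comp.infosys.www.announce"))
print(classify_news_url("news:12A7789B@news.kumquat.com"))
```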
The nntp URL goes beyond the news URL to provide a complete mechanism for accessing articles in the Usenet news system. It has the form:
nntp://server:port/newsgroup/article
The nntp server and port are defined similarly to the http server and port, described above. The server must be the Internet domain name or IP address of an nntp server; the port is the port on which that server is listening for requests.
If the port and its preceding colon are omitted, the default port of 119 is used.
The newsgroup is the name of the group from which an article is to be retrieved, as defined in the description of the news URL, above.
The article is the numeric id of the desired article within that newsgroup. Although the article number is easier to determine than a message id, it falls prey to the same limitations of single message references using the news URL, above. Specifically, articles do not last long on most nntp servers, and nntp URLs quickly become invalid as a result.
A sample nntp URL might be
nntp://news.kumquat.com/alt.fan.kumquats/417
This URL retrieves article 417 from the alt.fan.kumquats newsgroup on news.kumquat.com. Keep in mind that the article will only be served to machines that are allowed to retrieve articles from this server. In general, most nntp servers restrict access to those machines on that same local area network.
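For completeness, here is how the sample nntp URL breaks down in a Python sketch. The function describe_nntp_url is our own illustrative helper and assumes the default port of 119 when none is given.

```python
from urllib.parse import urlsplit

def describe_nntp_url(url: str) -> dict:
    """Break an nntp URL into server, port, newsgroup, and article number."""
    parts = urlsplit(url)
    # The path holds the newsgroup name and the numeric article id
    newsgroup, _, article = parts.path.lstrip("/").partition("/")
    return {
        "server": parts.hostname,
        "port": parts.port if parts.port is not None else 119,
        "newsgroup": newsgroup,
        "article": int(article),
    }

print(describe_nntp_url("nntp://news.kumquat.com/alt.fan.kumquats/417"))
```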
The mailto URL causes an electronic mail message to be transmitted to a named recipient. It has the format:
mailto:address
The address is any valid email address, usually of the form:
user@server
Thus, a typical mailto URL might look like:
mailto:cmusciano@aol.com
The telnet URL opens a telnet session with a desired server, allowing the user to log in and use the machine. Often, the connection to the machine automatically starts a specific service for the user; in other cases, the user must know the commands to type to use the system. The telnet URL has the form:
telnet://user:password@server:port/
The telnet user and password are used exactly like the user and password components of the ftp URL, described above. In particular, the same caveats apply regarding protecting your password and never placing it within a URL.
Just like the ftp URL, if you omit the password from the URL, the browser should prompt you for a password just before contacting the telnet server.
If you omit both the user and password, the telnet connection is made without supplying a user name. For some servers, telnet automatically connects to a default service when no user name is supplied. For others, the browser may prompt for a username and password when making the connection to the telnet server.
The telnet server and port are defined similarly to the http server and port, described above. The server must be the Internet domain name or IP address of a telnet server; the port is the port on which that server is listening for requests.
If the port and its preceding colon are omitted, the default port of 23 is used.
Gopher is a Web-like document retrieval system that achieved some popularity on the Internet just before the World Wide Web took off, completely replacing Gopher. Some Gopher servers still exist, though, and the gopher URL lets you access Gopher documents. The gopher URL has the form:
gopher://server:port/path
The gopher server and port are defined similarly to the http server and port, described above. The server must be the Internet domain name or IP address of a gopher server; the port is the port on which that server is listening for requests.
If the port and its preceding colon are omitted, the default port of 70 is used.
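The default ports quoted so far in this chapter can be gathered into one lookup. The Python sketch below is only an illustration: DEFAULT_PORTS and effective_port are names we made up, the table covers just the schemes whose defaults are stated above, and the gopher host name is hypothetical.

```python
from urllib.parse import urlsplit

# Default ports as quoted in this chapter, keyed by URL scheme.
DEFAULT_PORTS = {"http": 80, "ftp": 21, "nntp": 119, "telnet": 23, "gopher": 70}

def effective_port(url: str) -> int:
    """Return the explicit port in a URL, or the default for its scheme."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    # Raises KeyError for schemes that are not in the table above
    return DEFAULT_PORTS[parts.scheme]

print(effective_port("http://www.kumquat.com:8080/"))   # 8080
print(effective_port("telnet://bob@bobs-box.com/"))     # 23
print(effective_port("gopher://gopher.kumquat.com/"))   # 70
```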
The path can take one of three forms:
type/selector
type/selector%09search
type/selector%09search%09gopherplus
The type is a single character value denoting the type of the gopher resource. If the entire path is omitted from the gopher URL, the type defaults to 1.
The selector corresponds to the path of a resource on the gopher server. It may be omitted, in which case the top-level index of the gopher server is retrieved.
If the gopher resource is actually a gopher search engine, the search component provides the string for which to search. The search string must be preceded by an encoded horizontal tab (%09).
If the gopher server supports Gopher+ resources, the gopherplus component supplies the necessary information to locate that resource. The exact content of this component varies based upon the resources on the gopher server. This component is preceded by an encoded horizontal tab (%09). If you want to include the gopherplus component but omit the search component, you must still supply both encoded tabs within the URL.
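The three path forms can be split on their encoded tabs. The Python sketch below follows the forms exactly as written above; parse_gopher_path is a hypothetical helper, the sample paths are invented, and real Gopher selectors vary from server to server.

```python
from urllib.parse import unquote

def parse_gopher_path(path: str) -> dict:
    """Split a gopher URL path (without its leading slash) into its components."""
    if not path:
        # With no path at all, the type defaults to 1
        return {"type": "1", "selector": "", "search": None, "gopherplus": None}
    fields = path.split("%09")            # components are separated by encoded tabs
    if len(fields) > 3:
        raise ValueError("too many tab-separated components")
    gtype, _, selector = fields[0].partition("/")
    if len(gtype) != 1:
        raise ValueError("the type must be a single character")
    search = unquote(fields[1]) if len(fields) > 1 else ""
    gopherplus = unquote(fields[2]) if len(fields) > 2 else ""
    return {
        "type": gtype,
        "selector": unquote(selector),
        "search": search or None,         # empty means the component was omitted
        "gopherplus": gopherplus or None,
    }

print(parse_gopher_path("1/recipes%09lime"))
print(parse_gopher_path("0/docs/readme%09%09+"))
```

The second call shows the two-tab rule: the search component is empty, but both encoded tabs are present so the gopherplus component can still be found.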
URLs come in two flavors: absolute and relative. An absolute URL is the complete address of a resource and has everything your system needs to find a document and its server on the Web. At the very least, an absolute URL contains the scheme and all required elements of the scheme_specific_part of the URL. It may also contain any of the optional portions of the scheme_specific_part.
With a relative URL you provide an abbreviated document address that, when automatically combined with a ``base address'' by the system, becomes a complete address for the document. Within the relative URL, any component of the URL may be omitted. The browser automatically fills in the missing pieces of the relative URL using corresponding elements of a base URL. This base URL is usually the URL of the document containing the relative URL, but may be another document specified with the <base> tag. [<base>, 6.7.1]
A common form of a relative URL is missing the scheme and server name. Since many related documents are on the same server, it makes sense to omit the scheme and server name from the relative URL. For instance, assume the base document was last retrieved from the server www.kumquat.com. The relative URL, then:
another-doc.html
is equivalent to the absolute URL:
http://www.kumquat.com/another-doc.html
Table 6.2 shows how the base and relative URLs in the example are combined to form an absolute URL.
URL | Protocol | Server | Directory | File |
---|---|---|---|---|
Base URL | http | www.kumquat.com | / | |
Relative URL | ↓ | ↓ | ↓ | another-doc.html |
↓ | ↓ | ↓ | ↓ | ↓ |
Absolute URL | http | www.kumquat.com | / | another-doc.html |
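You can watch the same combination happen in code. This minimal sketch uses urljoin from Python's standard urllib.parse module, with the base URL assumed to be the top-level directory on www.kumquat.com as in the table:

```python
from urllib.parse import urljoin

base = "http://www.kumquat.com/"     # URL of the document containing the link
relative = "another-doc.html"        # the relative URL found in that document

# The scheme, server, and directory are filled in from the base URL
absolute = urljoin(base, relative)
print(absolute)                      # http://www.kumquat.com/another-doc.html
```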
Another common form of a relative URL omits the leading slash and one or more directory names from the beginning of the document pathname. The directory of the base URL is automatically assumed to replace these missing components. It's the most common abbreviation because most HTML authors place their collection of documents and subdirectories of support resources in the same directory path as the home page. For example, you might have a special/ subdirectory containing FTP files referenced in your HTML document. Let's say that the absolute URL for that HTML document is:
http://www.kumquat.com/planting/guide.html
A relative URL for the file README.txt in the special/ subdirectory, then, looks like this:
ftp:special/README.txt
You'll actually be retrieving:
ftp://www.kumquat.com/planting/special/README.txt
Visually, the operation looks like that in Table 6.3:
URL | Protocol | Server | Directory | File |
---|---|---|---|---|
Base URL | http | www.kumquat.com | /planting | guide.html |
Relative URL | ftp | ↓ | special | README.txt |
↓ | ↓ | ↓ | ↓ | ↓ |
Absolute URL | ftp | www.kumquat.com | /planting/special | README.txt |
Relative URLs are more than just a typing convenience. Because they are relative to the current server and directory, you can move the entire set of documents to another directory or even another server and never have to change a single relative link. Imagine the difficulties if you had to go into every source HTML document and change the URL for every link every time you move it. We'd loathe using hyperlinks! Use relative URLs wherever possible.
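Here is a small Python sketch of that portability, again using urljoin from the standard urllib.parse module. Two things are our assumptions rather than the text's: the relative link is written without a scheme, so it simply inherits the scheme of its base document, and the second server name is invented for the demonstration.

```python
from urllib.parse import urljoin

relative_link = "special/README.txt"   # unchanged in both cases

old_base = "http://www.kumquat.com/planting/guide.html"
new_base = "http://mirror.example.org/garden/planting/guide.html"  # hypothetical new home

print(urljoin(old_base, relative_link))
# http://www.kumquat.com/planting/special/README.txt
print(urljoin(new_base, relative_link))
# http://mirror.example.org/garden/planting/special/README.txt
```

The link itself never changes; only the base it is resolved against does.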