HTML: The Definitive Guide

6.2 Referencing Documents: The URL

As we discussed earlier, every document on the World Wide Web has a unique address. (Imagine the chaos if they didn't.) The document's address is known as its Uniform Resource Locator (URL).[2]

[2] ``URL'' usually is pronounced ``you are ell,'' not ``earl.''

Several HTML tags include a URL attribute value, including hyperlinks, inline images, and forms. All use the same URL syntax to specify the location of a Web resource, regardless of the type or content of that resource. That's why it's known as a Uniform Resource Locator.

Since they can be used to represent almost any resource on the Internet, URLs come in a variety of flavors. All URLs, however, have the same top-level syntax:

scheme:scheme_specific_part

The scheme describes the kind of object the URL references; the scheme_specific_part is, well, the part that is peculiar to the specific scheme. The important thing to note is that the scheme is always separated from the scheme_specific_part by a colon (:) with no intervening spaces.

Writing a URL

URLs are written using the displayable characters in the US-ASCII character set. If you need to use a character in a URL that is not part of this character set, you must encode the character using a special notation. The encoding notation replaces the desired character with three characters: a percent sign and two hexadecimal digits whose value corresponds to the position of the character in the ASCII character set.

This is easier than it sounds. One of the most common encoded special characters is the space character, whose position in the character set is 20 hexadecimal. To encode a space in a URL, replace it with %20:

http://www.kumquat.com/new%20pricing.html

This URL actually retrieves a document named new pricing.html from the server.
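The same encoding can be sketched in Python. This is only an illustration, assuming the standard urllib.parse module; the variable name is ours:

```python
from urllib.parse import quote, unquote

# Position of the space character in the ASCII set, in hexadecimal.
print(format(ord(" "), "02X"))   # 20

# Encode the document name, then decode it again.
encoded = quote("new pricing.html")
print(encoded)                   # new%20pricing.html
print(unquote(encoded))          # new pricing.html
```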

Handling reserved and unsafe characters

In addition to the nonprinting characters, you'll need to encode reserved and unsafe characters in your URLs as well.

Reserved characters are those characters that have a specific meaning within the URL itself. For example, many URLs use the slash character (/) to separate elements of a pathname within the URL. If you need to include a slash in a URL that is not intended to be an element separator, you'll need to encode it as %2F:

http://www.calculator.com/compute?3%2F4

This URL actually references the resource named compute on the www.calculator.com server and passes the string 3/4 to it, as delineated by the question mark (?). Presumably, the resource is actually a server-side program that performs some arithmetic function on the passed value and returns a result.
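As a rough sketch of how a program on the receiving end might recover the passed string, assuming Python's standard urllib.parse module (the variable names are illustrative):

```python
from urllib.parse import urlsplit, unquote

parts = urlsplit("http://www.calculator.com/compute?3%2F4")

# The question mark separates the resource from the string passed to it.
print(parts.path)            # /compute
print(parts.query)           # 3%2F4

# Decoding turns %2F back into a literal slash.
passed_value = unquote(parts.query)
print(passed_value)          # 3/4
```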

Unsafe characters are those that have no special meaning within the URL, but may have a special meaning in the context in which the URL is written. For example, the double-quote character (") is used to delimit URLs in many HTML tags. If you were to include a double-quote directly in a URL, you would probably confuse the HTML browser. Instead, encode the double-quote as %22 to avoid any possible conflict.

Other reserved and unsafe characters that should always be encoded are shown in Table 6-1.

Table 6-1: Reserved and Unsafe Characters and Their URL Encodings

Character  Description           Usage     Encoding
;          Semicolon             Reserved  %3B
/          Slash                 Reserved  %2F
?          Question mark         Reserved  %3F
:          Colon                 Reserved  %3A
@          At sign               Reserved  %40
=          Equal sign            Reserved  %3D
&          Ampersand             Reserved  %26
<          Less than sign        Unsafe    %3C
>          Greater than sign     Unsafe    %3E
"          Double quote          Unsafe    %22
#          Hash symbol           Unsafe    %23
%          Percent               Unsafe    %25
{          Left curly brace      Unsafe    %7B
}          Right curly brace     Unsafe    %7D
|          Vertical bar          Unsafe    %7C
\          Backslash             Unsafe    %5C
^          Caret                 Unsafe    %5E
~          Tilde                 Unsafe    %7E
[          Left square bracket   Unsafe    %5B
]          Right square bracket  Unsafe    %5D
`          Back single quote     Unsafe    %60

In general, you should always encode a character if there is some doubt as to whether it can be placed ``as-is'' in a URL. As a rule of thumb, any character other than a letter, number, or any of the characters $-_.+!*'(), should be encoded.

It is never an error to encode a character, unless that character has a specific meaning in the URL. For example, encoding the slashes in an http URL will cause them to be used as regular characters, not as pathname delimiters, breaking the URL.
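The rule of thumb can be approximated in Python. This is a sketch under stated assumptions: encode_for_url is a hypothetical helper of ours, built on the standard urllib.parse.quote function. One caveat: recent Python versions leave the tilde (~) unencoded, which differs from Table 6-1.

```python
from urllib.parse import quote

# Characters the rule of thumb allows besides letters and numbers.
RULE_OF_THUMB_SAFE = "$-_.+!*'(),"

def encode_for_url(text):
    """Encode one URL component (hypothetical helper, not a standard API).

    Slashes are encoded too, so use this only on text that is NOT meant
    to contain pathname delimiters.
    """
    return quote(text, safe=RULE_OF_THUMB_SAFE)

print(encode_for_url('say "hi"'))   # say%20%22hi%22
print(encode_for_url("3/4"))        # 3%2F4
print(encode_for_url("a&b=c"))      # a%26b%3Dc
```

Passing a whole http URL through such a helper would encode its slashes and colon, which is exactly the breakage described above; apply it to individual components only.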

The http URL

The http URL is, by far, the most common within the World Wide Web. It is used to access documents stored on an http server, and it has two formats:

http://server:port/path#fragment
http://server:port/path?search

Some of the parts are optional. In fact, the most common form of the http URL simply is:

http://server/path

designating the unique server and the directory path and name of a document.

The http server

The server is the unique Internet name or Internet Protocol (IP) numerical address of the computer system that stores the Web resource. Like us, we suspect you'll mostly use more easily remembered Internet names for the servers in your URLs.[3] The name consists of several parts, including the server's actual name and the successive names of its network domain, each part separated by a period. Typical Internet names look like www.ora.com or hoohoo.ncsa.uiuc.edu.[4]

[3] Each Internet-connected computer has a unique address; a numeric (IP) address, of course, because computers deal only in numbers. Humans prefer names, so the Internet folks provide us with a collection of special servers and software (the Domain Name System, or DNS) that automatically resolve Internet names into IP addresses. InterNIC, a nonprofit agency, registers domain names mostly on a first-come, first-served basis, and distributes new names to DNS servers worldwide.

[4] In the United States and for some Canadian establishments, the three-letter suffix of the domain name identifies the type of organization or business that operates that portion of the Internet. For instance, ``com'' is a commercial enterprise; ``edu'' is an academic institution; and ``gov'' identifies a government-based domain. Outside the United States, a less-descriptive suffix is assigned; typically a two-letter abbreviation of the country name: ``jp'' for Japan and ``de'' for Deutschland, for instance. That convention indicates the traditional distribution of the Internet and presumably will change dramatically as the network proliferates in the rest of the world.

It has become something of a convention that webmasters name their servers www for quick and easy identification on the Web. For instance, O'Reilly & Associates's Web server's name is www, which along with the publisher's acronym-based domain name, becomes the very easily remembered website www.ora.com. Similarly, Sun Microsystems's Web server is named www.sun.com; Apple Computer's is www.apple.com and even Microsoft makes their Web server easily memorable as www.microsoft.com. The naming convention has very obvious benefits which you, too, should take advantage of if you are called upon to create a Web server for your organization.

You may also specify the address of a server using its numerical IP address. The address is a sequence of four numbers, zero to 255, separated by periods. Valid IP addresses look like 137.237.1.87 or 192.249.1.33.

It'd be a dull diversion to tell you now what the numbers mean or how to derive an IP address from a domain name, particularly since you'll rarely if ever use one in a URL. Rather, this is a good place to hyperlink: Pick up any good Internet networking treatise for rigorous detail on IP addressing, such as Ed Krol's The Whole Internet User's Guide and Catalog, published by O'Reilly & Associates.

The http port

The port is the number of the communication port by which the client browser connects to the server. It's a networking thing: servers do many things besides serve up Web documents and resources to client browsers, such as electronic mail, FTP document fetches, filesystem sharing, and so on. Although all that network activity may come into the server on a single wire, it's typically divided into software-managed ``ports'' for service-specific communications--something analogous to boxes at your local post office.

The default URL port for Web servers is 80. Special secure Web servers (SHTTP or SSL) run on port 443. Most Web servers today use port 80; you need to include a port number, along with an immediately preceding colon, in your URL only if the target server does not use port 80 for Web communication.

When the Web was in its infancy many months ago, pioneer webmasters ran their Wild Wild Web connections on all sorts of port numbers. For technical and security reasons, system-administrator privileges are required to install a server on port 80. Lacking such privileges, these webmasters chose other, more easily accessible, port numbers.

Now that Web servers have become acceptable and are under the care and feeding of responsible administrators, documents being served on some port other than 80 or 443 should make you wonder if that server is really on the up and up. Most likely, the maverick server is being run by a clever user unbeknownst to the server's bona fide system administrators.

The http path

The document path is the UNIX-style hierarchical location of the file in the server's storage system. The pathname consists of one or more names separated by forward slashes (/). All but the last name represent directories leading down to the document; the last name is usually that of the document itself.

It has become a convention that for easy identification, HTML document names end with the suffix .html (they're otherwise plain ASCII text files, remember?). You can easily identify a PC-based server: DOS's restrictions on filenames mean you can have only the three-letter .htm name suffix for HTML documents.

Although the server name in a URL is not case-sensitive, the document pathname may be. Most Web servers run on UNIX-based systems, whose file names are case-sensitive, so the document pathname is case-sensitive, too. Web servers running on DOS machines are not case-sensitive, so neither is the document pathname. Since it is impossible to know the operating system of the server you are accessing, always assume that the server has case-sensitive pathnames and take care to get the case correct when typing your URLs.

Certain conventions regarding the document pathname have arisen. If the last element of the document path is a directory, not a single document, the server usually will send back either a listing of the directory contents or the HTML index document in that directory. You should end the document name for a directory with a trailing forward slash (/) character, but in practice, most servers will honor the request even if the character is omitted.

If the directory name is just a forward slash alone or sometimes nothing at all, you will retrieve the first (top-level) HTML document or so-called home page in the uppermost root directory of the server. Every well-designed http server should have an attractive, well-designed ``home page''; it's a shorthand way for users to access your Web collection since they don't need to remember the document's actual filename, just your server's name. That's why, for example, you can type http://www.ora.com into Netscape's ``Open'' dialog and get O'Reilly's home page.

Another twist: if the first component of the document path starts with the tilde character (~), it means the rest of the pathname begins from the personal HTML directory in the home directory of the specified user on the server machine. For instance, the URL http://www.kumquat.com/~chuck/ would retrieve the top-level page from Chuck's document collection.

Different servers have different ways of locating documents within a user's home directory. Many search for the documents in a directory named public_html. UNIX-based servers are fond of the name index.html for home pages.

The http document fragment

The fragment is a named identifier that points to some key section of a document. In URL specifications, it follows the server and pathname and is separated by the hash (#) symbol. A fragment identifier indicates to the browser that it should begin displaying the target document at the indicated fragment name. As we describe in more detail below, you insert fragment names into a document with the <a> tag and the name attribute. Like pathnames, a fragment name may be any sequence of characters.

The fragment name and the preceding hash symbol are optional; omit them when referencing a document without defined fragments.

Formally, the fragment element only applies to target files that are HTML documents. If the target of the URL is some other document type, the fragment name may be misinterpreted by the browser.

Fragments are useful for long documents. By identifying key sections of your document with a fragment name, you make it easy for readers to link directly to that portion of the document, avoiding the tedium of scrolling or searching through the document to get to the section that interests them.

As a rule of thumb, we recommend that every section header in your documents be accompanied by an equivalent fragment name. By consistently following this rule, you'll make it possible for readers to jump to any section in any of your documents. Fragments also make it easier to build tables of contents for your document families.

The http search parameter

The search component of the http URL, along with its preceding question mark, is optional. It indicates that the path is a searchable or executable resource on the server. The content of the search component is passed to the server as parameters that control the search or execution function.

The actual encoding of parameters in the search component is dependent upon the server and the resource being referenced. The parameters for searchable resources are covered later in this chapter, when we discuss searchable documents. Parameters for executable resources are discussed in Chapter 8, Forms.

Sample http URLs

Here are some sample http URLs:

http://www.ora.com/catalog.html
http://www.ora.com/
http://www.kumquat.com:8080/
http://www.kumquat.com/planting/guide.html#soil_prep
http://www.kumquat.com/find_a_quat?state=Florida

The first example is an explicit reference to a bona fide HTML document named catalog.html that is stored in the root directory of the www.ora.com server. The second references the top-level home page on that same server. That home page may or may not be catalog.html. Sample three, too, assumes there is a home page in the root directory of the www.kumquat.com server, and that the Web connection is to the nonstandard port 8080.

The fourth example is the URL for retrieving the Web document named guide.html from the planting directory on the www.kumquat.com server. Once retrieved, the browser should display the document beginning at the fragment named soil_prep.

The last example invokes an executable resource named find_a_quat with the parameter named state set to the value Florida. Presumably, this resource generates an HTML response that is subsequently displayed by the browser.
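To see the parts of these sample URLs pulled apart mechanically, here is a small sketch using Python's standard urllib.parse module. The describe function and its dictionary keys are our own illustrative names, not part of any library:

```python
from urllib.parse import urlsplit, parse_qs

def describe(url):
    """Split an http URL into the parts discussed in this section."""
    parts = urlsplit(url)
    return {
        "server": parts.hostname,
        # Fall back to the default Web port when none is given.
        "port": parts.port if parts.port is not None else 80,
        "path": parts.path,
        "fragment": parts.fragment,
        "search": parse_qs(parts.query),
    }

for sample in (
    "http://www.ora.com/catalog.html",
    "http://www.kumquat.com:8080/",
    "http://www.kumquat.com/planting/guide.html#soil_prep",
    "http://www.kumquat.com/find_a_quat?state=Florida",
):
    print(describe(sample))
```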

The ftp URL

The ftp URL is used to retrieve documents from an FTP (File Transfer Protocol)[5] server. It has the format:

ftp://user:password@server:port/path;type=typecode

[5] FTP is an ancient Internet protocol that dates back to the Dark Ages, around 1975 or so. It was designed as a simple way to move files between machines and remains popular and useful to this day. Some people who are unable to run a true Web server will place their documents on a server that speaks FTP instead.

The ftp user and password

FTP is an authenticated service, meaning that you must have a valid user name and password in order to retrieve documents from a server. However, most FTP servers also support restricted, nonauthenticated access known as anonymous FTP. In this mode, anyone can supply the username ``anonymous'' and be granted access to a limited portion of the server's documents. Most FTP servers also assume (but may not grant, of course) anonymous access if the user name and password are omitted.

If you are using an ftp URL to access a site that requires a user name and password, include the user and password components in the URL, along with the colon (:) and ``at'' sign (@). More commonly, you'll be accessing an anonymous FTP server, and the user and password components can be omitted.

If you keep the user component along with the ``at'' sign, but omit the password and the preceding colon, most browsers will prompt you for a password after connecting to the FTP server. This is the recommended way of accessing authenticated resources on an FTP server, since it prevents others from seeing your password.

We recommend you never place an ftp URL with a user name and password in any HTML document. The reasoning is simple: anyone can retrieve the document, extract the user name and password from the URL, log into the FTP server, and tamper with its documents.

The ftp server and port

The ftp server and port are bound by the same rules as the server and port in an http URL, as described above. The server must be a valid Internet domain name or IP address of an FTP server. The port specifies the port on which the server is listening for requests.

If the port and its preceding colon are omitted, the default port of 21 is used. It is necessary to specify the port only if the FTP server is running on some port other than 21.

The ftp path and transfer type

The path component represents a series of directories, separated by slashes (/) leading to the file to be retrieved. By default, the file is retrieved as a binary file; this can be changed by adding the typecode (and the preceding ;type=) to the URL.

If the typecode is set to d, the path is assumed to be a directory. The browser will request a listing of the directory contents from the server and display this listing to the user. If the typecode is any other letter, it is used as a parameter to the FTP type command before retrieving the file referenced by the path. While some FTP servers may implement other codes, most servers accept i to initiate a binary transfer and a to treat the file as a stream of ASCII text.

Sample ftp URLs

Here are some sample ftp URLs:

ftp://www.kumquat.com/sales/pricing
ftp://bob@bobs-box.com/results;type=d
ftp://bob:secret@bobs-box.com/listing;type=a

The first example retrieves the file named pricing from the sales directory on the anonymous FTP server at www.kumquat.com. The second logs into the FTP server on bobs-box.com as user bob, prompting for a password before retrieving the contents of the directory named results and displaying them to the user. The last example logs into bobs-box.com as bob with the password secret and retrieves the file named listing, treating its contents as ASCII characters.
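The components of an ftp URL can be separated along the same lines. The following is a sketch only, assuming Python's standard urllib.parse module; split_ftp_url is a hypothetical helper, and it treats a missing user name as anonymous access, as most servers are described as doing above:

```python
from urllib.parse import urlsplit

def split_ftp_url(url):
    """Break an ftp URL into user, password, server, path, and typecode."""
    parts = urlsplit(url)
    # The typecode, if present, trails the path after ";type=".
    path, _, typecode = parts.path.partition(";type=")
    return {
        "user": parts.username or "anonymous",
        "password": parts.password,      # None means "prompt, or anonymous"
        "server": parts.hostname,
        "path": path,
        "typecode": typecode or None,    # None means the default transfer
    }

print(split_ftp_url("ftp://www.kumquat.com/sales/pricing"))
print(split_ftp_url("ftp://bob@bobs-box.com/results;type=d"))
print(split_ftp_url("ftp://bob:secret@bobs-box.com/listing;type=a"))
```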

The file URL

The file URL specifies a file stored on a machine without indicating the protocol used to retrieve the file. As such, it has limited use in a networked environment. Its real benefit, however, is that it can reference a file on the user's machine, and is particularly useful for referencing personal HTML document collections, such as those ``under construction'' and not yet ready for general distribution, or HTML document collections on CD-ROM. It has the format:

file://server/path

The file server

The file server, like the http server described above, must be the Internet domain name or IP address of the machine containing the file to be retrieved. No assumptions are made as to how the browser might contact the machine to obtain the file; presumably the browser can make some connection, perhaps via a Network File System or FTP, to obtain the file.

If the server is omitted, or the special name localhost is used, the file is assumed to reside on the same machine upon which the browser is running. In this case, the browser simply accesses the file using the normal facilities of the local operating system. In fact, this is the most common usage of the file URL. By creating document families on a diskette or CD-ROM and referencing your hyperlinks using the file://localhost/ URL, you create a distributable, standalone document collection that does not require a network connection to use.

The file path

This is the path of the file to be retrieved on the desired server. The syntax of the path may differ based upon the operating system of the server; be sure to encode any potentially dangerous characters in the path.

Sample file URLs

The file URL is easy:

file://localhost/home/chuck/document.html
file:///home/chuck/document.html
file://marketing.kumquat.com/monthly_sales.html

The first URL retrieves /home/chuck/document.html from the user's local machine. The second is identical to the first, except we've omitted the localhost reference to the server; the server name defaults to the local server. Do notice, however, the extra forward slash is required for this alternate form.

The third example uses some protocol to retrieve monthly_sales.html from the marketing.kumquat.com server.
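A short sketch shows how a program might tell the local forms from the remote one. It assumes Python's standard urllib.parse module, and is_local_file_url is a name we made up for illustration:

```python
from urllib.parse import urlsplit

def is_local_file_url(url):
    """True when the server is omitted or is the special name localhost."""
    host = urlsplit(url).hostname
    return not host or host == "localhost"

for sample in (
    "file://localhost/home/chuck/document.html",
    "file:///home/chuck/document.html",
    "file://marketing.kumquat.com/monthly_sales.html",
):
    print(sample, is_local_file_url(sample), urlsplit(sample).path)
```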

The news URL

The news URL accesses either a single message or an entire newsgroup within the Usenet news system. It has two forms:

news:newsgroup
news:message_id

An unfortunate limitation in news URLs is that they don't allow you to specify a server for the newsgroup. Rather, users specify their news-server resource in their browser preferences. At one time, not long ago, Internet newsgroups were nearly universally distributed; all news servers carried all the same newsgroups and their respective articles, so one news server was as good as any. Today, the sheer bulk of disk space needed to store the daily volume of newsgroup activity is often prohibitive for any single news server, and there's also local censorship of newsgroups. Hence, you cannot expect that all newsgroups, and certainly not all articles for a particular newsgroup, will be available on the user's news server.

Moreover, many users' browsers may not be correctly configured to read news. We recommend you avoid placing news URLs in your documents except in rare cases.

Accessing entire newsgroups

There are several thousand newsgroups devoted to nearly every conceivable topic under the sun and beyond. Each group has a unique name, composed of hierarchical elements separated by periods. For example,

comp.infosys.www.announce

is the World Wide Web announcements newsgroup. To access this group, use the URL:

news:comp.infosys.www.announce

Accessing single messages

Every message on a news server has a unique message identifier (ID) associated with it. This ID has the form

unique_string@server

The unique_string is a sequence of ASCII characters; the server is usually the name of the machine from which the message originated. The unique_string must be unique among all the messages that originated from the server. A sample URL to access a single message might be:

news:12A7789B@news.kumquat.com

In general, message IDs are cryptic sequences of characters not readily understood by humans. Moreover, the lifespan of a message on a server is usually measured in days, after which the message is deleted and the message ID is no longer valid. The bottom line: single message news URLs are difficult to create, become invalid quickly, and are generally not used.

The nntp URL

The nntp URL goes beyond the news URL to provide a complete mechanism for accessing articles in the Usenet news system. It has the form:

nntp://server:port/newsgroup/article

The nntp server and port

The nntp server and port are defined similarly to the http server and port, described above. The server must be the Internet domain name or IP address of an nntp server; the port is the port on which that server is listening for requests.

If the port and its preceding colon are omitted, the default port of 119 is used.

The nntp newsgroup and article

The newsgroup is the name of the group from which an article is to be retrieved, as defined in the description of the news URL, above.

The article is the numeric id of the desired article within that newsgroup. Although the article number is easier to determine than a message id, it falls prey to the same limitations of single message references using the news URL, above. Specifically, articles do not last long on most nntp servers, and nntp URLs quickly become invalid as a result.

Sample nntp URLs

A sample nntp URL might be

nntp://news.kumquat.com/alt.fan.kumquats/417

This URL retrieves article 417 from the alt.fan.kumquats newsgroup on news.kumquat.com. Keep in mind that the article will only be served to machines that are allowed to retrieve articles from this server. In general, most nntp servers restrict access to those machines on that same local area network.

The mailto URL

The mailto URL causes an electronic mail message to be transmitted to a named recipient. It has the format:

mailto:address

The address is any valid email address, usually of the form:

user@server

Thus, a typical mailto URL might look like:

mailto:cmusciano@aol.com

The telnet URL

The telnet URL opens a telnet session with a desired server, allowing the user to log in and use the machine. Often, the connection to the machine automatically starts a specific service for the user; in other cases, the user must know the commands to type to use the system. The telnet URL has the form:

telnet://user:password@server:port/

The telnet user and password

The telnet user and password are used exactly like the user and password components of the ftp URL, described above. In particular, the same caveats apply regarding protecting your password and never placing it within a URL.

Just like the ftp URL, if you omit the password from the URL, the browser should prompt you for a password just before contacting the telnet server.

If you omit both the user and password, the telnet occurs without supplying a user name. For some servers, telnet automatically connects to a default service when no user name is supplied. For others, the browser may prompt for a username and password when making the connection to the telnet server.

The telnet server and port

The telnet server and port are defined similarly to the http server and port, described above. The server must be the Internet domain name or IP address of a telnet server; the port is the port on which that server is listening for requests.

If the port and its preceding colon are omitted, the default port of 23 is used.

The gopher URL

Gopher is a Web-like document retrieval system that achieved some popularity on the Internet just before the World Wide Web took off, completely replacing Gopher. Some Gopher servers still exist, though, and the gopher URL lets you access Gopher documents. The gopher URL has the form:

gopher://server:port/path

The gopher server and port

The gopher server and port are defined similarly to the http server and port, described above. The server must be the Internet domain name or IP address of a gopher server; the port is the port on which that server is listening for requests.

If the port and its preceding colon are omitted, the default port of 70 is used.
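The default ports given throughout this section can be gathered into one lookup. This is a minimal sketch, assuming Python's standard urllib.parse module; effective_port and DEFAULT_PORTS are illustrative names of ours, and the sample server names are made up:

```python
from urllib.parse import urlsplit

# Default ports as stated in this section for each scheme.
DEFAULT_PORTS = {"http": 80, "ftp": 21, "nntp": 119, "telnet": 23, "gopher": 70}

def effective_port(url):
    """Return the explicit port if present, else the scheme's default."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)   # None for unlisted schemes

print(effective_port("gopher://gopher.kumquat.com/"))                  # 70
print(effective_port("nntp://news.kumquat.com/alt.fan.kumquats/417"))  # 119
print(effective_port("http://www.kumquat.com:8080/"))                  # 8080
```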

The gopher path

The path can take one of three forms:

type/selector
type/selector%09search
type/selector%09search%09gopherplus

The type is a single character value denoting the type of the gopher resource. If the entire path is omitted from the gopher URL, the type defaults to 1.

The selector corresponds to the path of a resource on the gopher server. It may be omitted, in which case the top-level index of the gopher server is retrieved.

If the gopher resource is actually a gopher search engine, the search component provides the string for which to search. The search string must be preceded by an encoded horizontal tab (%09).

If the gopher server supports Gopher+ resources, the gopherplus component supplies the necessary information to locate that resource. The exact content of this component varies based upon the resources on the gopher server. This component is preceded by an encoded horizontal tab (%09). If you want to include the gopherplus component but omit the search component, you must still supply both encoded tabs within the URL.
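The three path forms, including the rule about supplying both encoded tabs, might be assembled as follows. This is a sketch that follows the forms exactly as listed above; gopher_path is a hypothetical helper and the sample selector and search strings are invented:

```python
from urllib.parse import quote

TAB = "%09"  # encoded horizontal tab

def gopher_path(gtype, selector="", search=None, gopherplus=None):
    """Build the path portion of a gopher URL (illustrative only)."""
    path = gtype + "/" + quote(selector, safe="/")
    if gopherplus is not None:
        # Both tabs are required, even when the search string is empty.
        path += TAB + quote(search or "") + TAB + quote(gopherplus)
    elif search is not None:
        path += TAB + quote(search)
    return path

print(gopher_path("1", "docs"))                   # 1/docs
print(gopher_path("7", "find", "kumquat trees"))  # 7/find%09kumquat%20trees
print(gopher_path("1", "docs", gopherplus="x"))   # 1/docs%09%09x
```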

Absolute and Relative URLs

URLs come in two flavors: absolute and relative. An absolute URL is the complete address of a resource and has everything your system needs to find a document and its server on the Web. At the very least, an absolute URL contains the scheme and all required elements of the scheme_specific_part of the URL. It may also contain any of the optional portions of the scheme_specific_part.

With a relative URL you provide an abbreviated document address that, when automatically combined with a ``base address'' by the system, becomes a complete address for the document. Within the relative URL, any component of the URL may be omitted. The browser automatically fills in the missing pieces of the relative URL using corresponding elements of a base URL. This base URL is usually the URL of the document containing the relative URL, but may be another document specified with the <base> tag. [<base>, 6.7.1]

Relative schemes and servers

A common form of a relative URL is missing the scheme and server name. Since many related documents are on the same server, it makes sense to omit the scheme and server name from the relative URL. For instance, assume the base document was last retrieved from the server www.kumquat.com. The relative URL, then:

another-doc.html

is equivalent to the absolute URL:

http://www.kumquat.com/another-doc.html

Table 6-2 shows how the base and relative URLs in the example are combined to form an absolute URL.

Table 6-2: Forming an Absolute URL

              Protocol  Server           Directory  File
Base URL      http      www.kumquat.com  /
Relative URL  ↓         ↓                ↓          another-doc.html
              ↓         ↓                ↓          ↓
Absolute URL  http      www.kumquat.com  /          another-doc.html
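The combination of base and relative URL in this example can be reproduced with Python's standard urllib.parse.urljoin function. This is only a sketch of what a browser does internally; the variable names are ours:

```python
from urllib.parse import urljoin

base_url = "http://www.kumquat.com/"
relative_url = "another-doc.html"

# The scheme, server, and directory are filled in from the base URL.
absolute_url = urljoin(base_url, relative_url)
print(absolute_url)   # http://www.kumquat.com/another-doc.html
```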

Relative document directories

Another common form of a relative URL omits the leading slash and one or more directory names from the beginning of the document pathname. The directory of the base URL is automatically assumed to replace these missing components. It's the most common abbreviation because most HTML authors place their collection of documents and subdirectories of support resources in the same directory path as the home page. For example, you might have a special/ subdirectory containing FTP files referenced in your HTML document. Let's say that the absolute URL for that HTML document is:

http://www.kumquat.com/planting/guide.html

A relative URL for the file README.txt in the special/ subdirectory, then, looks like this:

ftp:special/README.txt

You'll actually be retrieving:

ftp://www.kumquat.com/planting/special/README.txt

Visually, the operation looks like that in Table 6-3:

Table 6-3: Forming an Absolute FTP URL

              Protocol  Server           Directory          File
Base URL      http      www.kumquat.com  /planting          guide.html
Relative URL  ftp       ↓                special            README.txt
              ↓         ↓                ↓                  ↓
Absolute URL  ftp       www.kumquat.com  /planting/special  README.txt

Using relative URLs

Relative URLs are more than just a typing convenience. Because they are relative to the current server and directory, you can move the entire set of documents to another directory or even another server and never have to change a single relative link. Imagine the difficulties if you had to go into every source HTML document and change the URL for every link every time you move it. We'd loathe using hyperlinks! Use relative URLs wherever possible.
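To illustrate why relative links survive a move, the sketch below resolves the same two relative URLs against an original base and against a relocated one. It assumes Python's standard urllib.parse.urljoin; the mirror.kumquat.com server and /archive directory are invented for the example:

```python
from urllib.parse import urljoin

relative_links = ["soil.html", "special/README.txt"]

old_base = "http://www.kumquat.com/planting/guide.html"
# Hypothetical new home for the same document collection.
new_base = "http://mirror.kumquat.com/archive/planting/guide.html"

def resolve_all(base, links):
    """Resolve every relative link against one base URL."""
    return [urljoin(base, link) for link in links]

# The links themselves never change; only the base does.
print(resolve_all(old_base, relative_links))
print(resolve_all(new_base, relative_links))
```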

