What makes a cool URI?
A cool URI is one which does not change.
What sorts of URI change?
URIs don’t change: people change them.
There are no reasons at all in theory for people to change URIs (or stop
maintaining documents), but millions of reasons in practice.
In theory, the domain name space owner owns the domain name space and
therefore all URIs in it. Except insolvency, nothing prevents the domain name
owner from keeping the name. And in theory the URI space under your domain
name is totally under your control, so you can make it as stable as you like.
Pretty much the only good reason for a document to disappear from the Web is
that the company which owned the domain name went out of business or can no
longer afford to keep the server running. Then why are there so many dangling
links in the world? Part of it is just lack of forethought. Here are some
reasons you hear out there:
We just reorganized our website to make it better.
Do you really feel that the old URIs cannot be kept running? If so, you
chose them very badly. Think of your new ones so that you will be able to
keep them running after the next redesign.
We have so much material that we can’t keep track of what is out of date
and what is confidential and what is valid and so we thought we’d better just
turn the whole lot off.
That I can sympathize with – the W3C went through a period like that, when
we had to carefully sift archival material for confidentiality before making
the archives public. The solution is forethought – make sure you capture with
every document its acceptable distribution, its creation date and ideally its
expiry date. Keep this metadata.
Well, we found we had to move the files…
This is one of the lamest excuses. A lot of people don’t know that servers
such as Apache give you a lot of control over a flexible relationship between
the URI of an object and where a file which represents it actually is in a
file system. Think of the URI space as an abstract space, perfectly
organized. Then, make a mapping onto whatever reality you actually use to
implement it. Then, tell your server. You can even write bits of your server
to make it just right.
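As a rough illustration of that mapping, here is a minimal Python sketch. It is not how Apache does it internally: the alias table, the directory names and the function name uri_to_file are all made up for the example. The point is only that the left-hand side (the URI space) stays fixed while the right-hand side can change whenever files move.

```python
# Hypothetical alias table: stable URI prefix -> directory in use today.
# When files move, only the right-hand side is edited; the URIs stay put.
ALIASES = [
    ("/pubs/1998/", "/mnt/disk2/archive/1998/"),
    ("/pubs/", "/var/www/publications/"),
]


def uri_to_file(uri_path):
    """Map a URI path to its current file location, or None if no rule applies.

    Sketch only: a real server must also reject ".." segments and the like.
    """
    # Try longer prefixes first so that the most specific rule wins.
    for prefix, directory in sorted(ALIASES, key=lambda a: len(a[0]), reverse=True):
        if uri_path.startswith(prefix):
            return directory + uri_path[len(prefix):]
    return None
```

Moving the 1998 archive to another disk then means editing one line of the table, and no link anywhere breaks.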
John doesn’t maintain that file any more, Jane does.
Whatever was that URI doing with John’s name in it? It was in his
directory? I see.
We used to use a cgi script for this and now we use a binary program.
There is a crazy notion that pages produced by scripts have to be located
in a “cgibin” or “cgi” area. This is exposing the mechanism of how you run
your server. You change the mechanism (even keeping the content the same)
and whoops – all your URIs change.
For example, take the National Science Foundation:
NSF Online Documents
http://www.nsf.gov/cgi-bin/pubsys/browser/odbrowse.pl
the main page for starting to look for documents, is clearly not going to
be something to trust to being there in a few years. “cgi-bin” and
“odbrowse” and “.pl” all point to bits of how-we-do-it-now. By contrast, if
you use the page to find a document, you get first an equally bad
Report of Working Group on Cryptology and Coding Theory
http://www.nsf.gov/cgi-bin/getpub?nsf9814
for the document’s index page, but the html document itself by contrast is
very much better:
http://www.nsf.gov/pubs/1998/nsf9814/nsf9814.htm
Looking at this one, the “pubs/1998” header is going to give any future
archive service a good clue that the old 1998 document classification scheme
is in progress. Though in 2098 the document numbers might look different, I
can imagine this URI still being valid, and the NSF or whatever carries on
the archive not being at all embarrassed about it.
I didn’t think URLs have to be persistent – that was URNs.
This is probably one of the worst side-effects of the URN discussions.
Some seem to think that because there is research about namespaces which will
be more persistent, that they can be as lax about dangling links as they like
as “URNs will fix all that”. If you are one of these folks, then allow me to
disillusion you.
Most URN schemes I have seen look something like an authority ID followed
by either a date and a string you choose, or just a string you choose. This
looks very like an HTTP URI. In other words, if you think your organization
will be capable of creating URNs which will last, then prove it by doing it
now and using them for your HTTP URIs. There is nothing about HTTP which
makes your URIs unstable. It is your organization. Make a database which maps
document URN to current filename, and let the web server use that to actually
retrieve files.
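A database of that kind can be very small. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the table name, column names, identifier format and file paths are assumptions made for the example, not a description of any existing system.

```python
import sqlite3


def open_registry():
    """Create an in-memory registry of persistent identifiers (sketch only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        " identifier TEXT PRIMARY KEY,"  # the persistent name; never changes
        " filename TEXT NOT NULL)"       # where the file lives today
    )
    return conn


def register(conn, identifier, filename):
    # INSERT OR REPLACE lets a file move without its identifier changing.
    conn.execute(
        "INSERT OR REPLACE INTO documents (identifier, filename) VALUES (?, ?)",
        (identifier, filename),
    )


def lookup(conn, identifier):
    """Return the current filename for an identifier, or None if unknown."""
    row = conn.execute(
        "SELECT filename FROM documents WHERE identifier = ?", (identifier,)
    ).fetchone()
    return row[0] if row else None
```

The web server would call something like lookup() on each request; reorganizing the file system then only means calling register() again with the new location.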
If you have gotten to this point, then unless you have the time and money
and contacts to get some software design done, then you might claim the next
excuse:
We would like to, but we just don’t have the right tools.
Now here is one I can sympathize with. I agree entirely. What you need to
do is to have the web server look up a persistent URI in an instant and
return the file, wherever your current crazy file system has it stored away
at the moment. You would like to be able to store the URI in the file as a
check, and constantly keep the database in tune with actuality. You’d like to
store the relationships between different versions and translations of the
same document, and you’d like to keep an independent record of the checksum
to provide a guard against file corruption by accidental error. And web
servers just don’t come out of the box with these features. When you want to
create a new document, your editor asks you for a URI instead of telling
you.
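The independent checksum record, at least, is easy to sketch. The Python below is an illustration under assumptions: the record layout, the function names and the choice of SHA-256 are mine, and any strong hash would serve.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class DocumentRecord:
    uri: str       # the persistent URI, which never changes
    filename: str  # where the file happens to live today
    sha256: str    # checksum kept independently of the file itself


def checksum(data):
    """Hex SHA-256 digest of the document bytes."""
    return hashlib.sha256(data).hexdigest()


def make_record(uri, filename, data):
    return DocumentRecord(uri, filename, checksum(data))


def is_intact(record, data):
    """True if the bytes on disk still match the recorded checksum."""
    return checksum(data) == record.sha256
```

A periodic job could re-read each file and call is_intact() to catch accidental corruption before a reader does.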
You need to be able to change things like ownership, access, archive level,
security level, and so on, of a document in the URI space without changing
the URI.
Too bad. But we’ll get there. At W3C we use Jigedit functionality
(Jigsaw server used for editing) which does track versions, and we
are experimenting with document creation scripts. If you make tools, servers
and clients, take note!
This is an outstanding reason, which applies for example to many W3C pages
including this one: so do what I say, not what I do.
Why should I care?
When you change a URI on your server, you can never completely tell who
will have links to the old URI. They might have made links from regular web
pages. They might have bookmarked your page. They might have scrawled the URI
in the margin of a letter to a friend.
When someone follows a link and it breaks, they generally lose confidence
in the owner of the server. They are also frustrated, emotionally and
practically, in trying to accomplish their goal.
Enough people complain all the time about dangling links that I hope the
damage is obvious. I hope it is also obvious that the reputation damage is to
the maintainer of the server whose document vanished.
So what should I do? Designing URIs
It is the duty of a Webmaster to allocate URIs which you will be able
to stand by in 2 years, in 20 years, in 200 years. This needs thought, and
organization, and commitment.
URIs change when there is some information in them which changes. It is
critical how you design them. (What, design a URI? I have to design URIs?
Yes, you have to think about it.) Designing mostly means leaving information
out.
The creation date of the document – the date the URI is issued – is one
thing which will not change. It is very useful for separating requests which
use a new system from those which use an old system. That is one thing with
which it is good to start a URI. If a document is in any way dated, even
though it will be of interest for generations, then the date is a good
starter.
The only exception is a page which is deliberately a “latest” page for,
for example, the whole organization or a large part of it.
http://www.pathfinder.com/money/moneydaily/latest/
is the latest “Money daily” column in “Money” magazine. The main reason
for not needing the date in this URI is that there is no reason for the
persistence of the URI to outlast the magazine. The concept of “today’s
Money” vanishes if Money goes out of production. If you want to
link to the content, you would link to it where it appears separately in the
archives as
http://www.pathfinder.com/money/moneydaily/1998/981212.moneyonline.html
(Looks good. Assumes that “money” will mean the same thing throughout the
life of pathfinder.com. There is a duplication of “98” and an “.html” you
don’t need but otherwise this looks like a strong URI).
What to leave out
Everything! After the creation date, putting any information in the name
is asking for trouble one way or another.
- Author’s name – authorship can change with new versions. People
quit organizations and hand things on.
- Subject. This is tricky. It always looks good at the time but
changes surprisingly fast. I discuss this more below.
- Status – directories like “old” and “draft” and so on, not to
mention “latest” and “cool” appear all over file systems. Documents
change status – or there would be no point in producing drafts. The
latest version of a document needs a persistent identifier whatever its
status is. Keep the status out of the name.
- Access. At W3C we divide the site into “Team access”, “Member
access” and “Public access”. It sounds good, but of course documents
start off as team ideas, are discussed with members, and then go public.
A shame indeed if every time some document is opened to wider discussion
all the old links to it fail! We are switching to a simple date code
now.
- File name extension. This is a very common one. “cgi”, even
“.html”, is something which will change. You may not be using HTML for
that page in 20 years time, but you might want today’s links to it to
still be valid. The canonical way of making links to the W3C site doesn’t
use the extension. (See the footnote for how.)
- Software mechanisms. Look for “cgi”, “exec” and other give-away
“look what software we are using” bits in URIs. Anyone want to commit to
using perl cgi scripts all their lives? Nope? Cut out the .pl. Read the
server manual on how to do it.
- Disk name – gimme a break! But I’ve seen it.
So a better example from our site is simply
http://www.w3.org/1998/12/01/chairs
a report of the minutes of a meeting of W3C chair people.
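A URI of that shape is easy to issue by program. This is a minimal Python sketch, assuming the date is the date the URI is issued and that the short name is chosen by a person; the function name mint_path is hypothetical.

```python
from datetime import date


def mint_path(issued, name):
    """Build a path of the form /YYYY/MM/DD/name from the issue date."""
    return f"/{issued.year:04d}/{issued.month:02d}/{issued.day:02d}/{name}"
```

For example, mint_path(date(1998, 12, 1), "chairs") gives "/1998/12/01/chairs". Nothing about author, status, access level or file format appears in the result.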
Topics and Classification by subject
I’ll go into this danger in more detail as it is one of the more difficult
things to avoid. Typically, topics end up in URIs when you classify your
documents according to a breakdown of the work you are doing. That breakdown
will change. Names for areas will change. At W3C we wanted to change “MarkUp”
to “Markup” and then to “HTML” to reflect the actual content of the section.
Also, beware that this is often a flat name space. In 100 years are you sure
you won’t want to reuse anything? We wanted to reuse “History” and
“Stylesheets” for example in our short life.
This is a tempting way of organizing a web site – and indeed a tempting
way of organizing anything, including the whole web. It is a great medium
term solution but has serious drawbacks in the long term.
Part of the reason for this lies in the philosophy of meaning. Every term
in the language is a potential clustering subject, and each person can have a
different idea of what it means. Because the relationships between subjects
are web-like rather than tree-like, even people who agree on a web may
pick a different tree representation. These are my (oft repeated) general
comments on the dangers of hierarchical classification as a general
solution.
Effectively, when you use a topic name in a URI you are binding yourself
to some classification. You may in the future prefer a different one. Then,
the URI will be liable to break.
A reason for using a topic area as part of the URI is that responsibility
for sub-parts of a URI space is typically delegated, and then you need a name
for the organizational body – the subdivision or group or whatever – which
has responsibility for that sub-space. This is binding your URIs to the
organizational structure. It is typically safe only when protected by a date
further up the URI (to the left of it): 1998/pics can be taken to mean for
your server “what we meant in 1998 by pics”, rather than “what in 1998
we did with what we now refer to as pics.”
Don’t forget the domain name.
Remember that this applies not only to the “path” part of a URI but to the
server name. If you have separate servers for some of your stuff, remember
that that division will be impossible to change without destroying many many
links. Some classic “look what software we are using today” domain names are
“cgi.pathfinder.com”, “secure”, “lists.w3.org”. They are made to make
administration of the servers easier. Whether it represents divisions in your
company, or document status, or access level, or security level, be very,
very careful before using more than one domain name for more than one type of
document. Remember that you can hide many web servers inside one apparent web
server using redirection and proxying.
Oh, and do think about your domain name. If your name is not soap, will
you want to be referred to as “soap.com” even when you have switched your
product line to something else? (With apologies to whoever owns soap.com at
the moment.)
Conclusion
Keeping URIs so that they will still be around in 2, 20 or 200 or even
2000 years is clearly not as simple as it sounds. However, all over the Web,
webmasters are making decisions which will make it really difficult for
themselves in the future. Often, this is because they are using tools whose
task is seen as to present the best site in the moment, and no one has
evaluated what will happen to the links when things change. The message here
is, however, that many, many things can change and your URIs can and should
stay the same. They only can if you think about how you design them.
Footnote
How can I remove the file extensions…
…from my URIs in a practical file-based web server?
If you are using, for example, Apache, you can set it up to do content
negotiation. You keep the file extension (such as .png) on the file (e.g.
mydog.png), but refer to the web resource without it. Apache then checks the
directory for all files with that name and any extension, and it can also
pick the best one out of a set (e.g. GIF and PNG). (You do not have to put
different types of file in different directories; in fact the content
negotiation won’t work if you do.)
- Set up your server to do content negotiation
- Make references always to the URI without the extension
References which do have the extension on will still work but will not
allow your server to select the best of currently available and future
formats.
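To make the selection step concrete, here is a simplified Python sketch of choosing among files that share a name. It is not Apache's algorithm: the function names and the small extension table are assumptions, and it looks only at the q values in the Accept header, ignoring the finer precedence rules of real HTTP content negotiation.

```python
# Hypothetical extension-to-type table; a real server has a much longer one.
TYPES = {".png": "image/png", ".gif": "image/gif", ".html": "text/html"}


def parse_accept(header):
    """Parse an Accept header into {media_type: q}; q defaults to 1.0."""
    prefs = {}
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        q = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0  # treat an unparsable q as "not acceptable"
        prefs[media_type] = q
    return prefs


def choose_variant(name, files, accept):
    """Pick the file called name.<ext> whose type the client prefers most."""
    prefs = parse_accept(accept)
    best, best_q = None, 0.0
    for filename in sorted(files):
        stem, dot, ext = filename.rpartition(".")
        media_type = TYPES.get(dot + ext)
        if stem != name or media_type is None:
            continue
        major = media_type.split("/")[0]
        # Look for the exact type first, then "image/*", then "*/*".
        q = prefs.get(media_type, prefs.get(major + "/*", prefs.get("*/*", 0.0)))
        if q > best_q:
            best, best_q = filename, q
    return best
```

A request for mydog from a client sending "image/png, image/gif;q=0.5" is answered with mydog.png; a client that accepts only image/gif gets mydog.gif from the same URI.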
(In fact, mydog, mydog.png and mydog.gif are each valid web resources. mydog
is content-type-generic. mydog.png and mydog.gif are content-type-specific.)
Of course, if you are building your own server, then using a database to
relate persistent identifiers to their current form is a very clean idea —
though beware the unbounded growth of your database.
Hall of flame — story 1: Channel 7
During 1999, http://www.whdh.com/stormforce/closings.shtml
was a page I found documenting school closings due to snow. An alternative to
waiting for them to scroll past the bottom of the TV screen! I put a pointer
to it from my home page. Come the first big storm of 2000, and I check the
page. It says,
“Closings as of .
There are currently no closings in effect. Please check back when the
weather warrants”
Can’t be such a big storm. Funny the date is missing. But then if I go to
the home page of the site, there is a big button “school closings” which
takes me to http://www.whdh.com/stormforce/
which has a list of
many closed schools.
Well, maybe they changed the system which got the closings from the
definitive list – but they did not need to change the URI.
Hall of flame — story 2: Microsoft Netmeeting
One of the smarts which came with a growing dependency on the web was that
applications could have built-in links back to the manufacturer’s web site.
This has been used and abused to a great extent, but – you do have to keep
the URL the same. Just the other day I tried a link from Microsoft’s
Netmeeting 2/something client under a menu “Help/Microsoft on the Web/Free
stuff” and got an Error 404 – not found response from the server. They have
probably fixed it by now…
Historical note: At the end of the 20th century when this was written,
“cool” was an epithet of approval particularly among young people, indicating
trendiness, quality, or appropriateness. In the rush to stake out DNS
territory, the choice of domain name and URI path was sometimes
directed more toward apparent “coolness” than toward usefulness or longevity.
This note is an attempt to redirect the energy behind the quest for
coolness.