Stupid web robots

Why is it that most of the errors I see in my web logs are down to robots programmed by morons?  Even now, many years after the internet went public, many robot crawlers (like those unleashed by Yahoo) can't manage to work out the right URI to follow.  I'm forever seeing 404 errors that are entirely their fault.  For example…

They don't properly understand absolute references

They continually get something like href="/styles/default.css" wrong.  A link like that, on my website, means that you find the resource at “http://www.cameratim.com/styles.default.css”, no matter where on the site you currently are.  It doesn't matter whether you're in the /computing/ or /business/ sub-sections, there's just one /styles/ location, at the top.  It is not a “styles” subdirectory off the current location.  If an address path starts with a slash, it's an absolute address relative to the top-level directory.

When you're at “http://www.example.com/computing/” and you're referred to “/business/” (href="/business/"), you go to “http://www.example.com/business/”, not “http://www.example.com/computing/business/”.
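
If you want to see how this is supposed to work, any language with a proper URI library resolves these correctly.  Here's a minimal sketch using Python's standard urljoin function (the choice of language is mine, just for illustration; it's not anything the robots in question actually use), on the made-up example.com addresses above:

    from urllib.parse import urljoin

    # A reference starting with a slash replaces the whole path of the
    # current page, however deep into the site you happen to be.
    base = "http://www.example.com/computing/"
    print(urljoin(base, "/business/"))
    # http://www.example.com/business/
    print(urljoin(base, "/styles/default.css"))
    # http://www.example.com/styles/default.css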

They don't properly understand relative references

They continually get something like href="./sample.jpeg" or href="../example.text" wrong.  That's a link to a “sample.jpeg” resource in the current directory, or an “example.text” resource in a parent directory.  The dot–slash and double-dot–slash refer to something relative to the current location, you request the resource name written after them, you do not put dots and slashes into the requested URI as part of the address, nor do you try and get the resource from the top-level directory.

Likewise, they continually get something like href="samples/" wrong.  It's referring to a subdirectory of the current location.

If it doesn't start with a slash, it's a relative address, relative to the current URI path.  Paths starting with a dot–slash, or with no dots and slashes at all, refer to the current location (the dot–slash is optional).  Paths starting with two dots and a slash refer to the parent directory, and a sequence of them climbs further up the chain (grand-parents, and so on).  The requester is supposed to work all this out itself: strip off parent path segments where needed, prepend the rest of the current path (without any of the dots), and request the correct URI.

If you're at “http://www.example.com/goofy-robot/” and you're referred to “./yahoo” (href="./yahoo"), then you go to “http://www.example.com/goofy-robot/yahoo”.

If you're at “http://www.example.com/goofy-robot/” and you're referred to “msn” (href="msn"), then you go to “http://www.example.com/goofy-robot/msn”.

If you're at “http://www.example.com/idiotic/goofy-robots” and you're referred to “stupid” (href="stupid"), then you go to “http://www.example.com/idiotic/stupid” (there's no trailing slash after “goofy-robots”, so that last segment is replaced, not descended into).

If you're at “http://www.example.com/web-robots/cannot/get/it/right/” and you're referred to “../../../../programmed-by-fools” (href="../../../../programmed-by-fools"), then you go to “http://www.example.com/web-robots/programmed-by-fools”.
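
Again, this isn't hard: a standards-following URI resolver produces exactly those answers.  A short sketch, using the same Python urljoin function as above on this section's example.com addresses (which, remember, are made up for documentation purposes):

    from urllib.parse import urljoin

    # Dot-slash and bare names resolve within the current directory.
    print(urljoin("http://www.example.com/goofy-robot/", "./yahoo"))
    # http://www.example.com/goofy-robot/yahoo
    print(urljoin("http://www.example.com/goofy-robot/", "msn"))
    # http://www.example.com/goofy-robot/msn

    # No trailing slash on the base, so its last segment gets replaced.
    print(urljoin("http://www.example.com/idiotic/goofy-robots", "stupid"))
    # http://www.example.com/idiotic/stupid

    # Each ../ strips one more segment off the end of the base path.
    print(urljoin("http://www.example.com/web-robots/cannot/get/it/right/",
                  "../../../../programmed-by-fools"))
    # http://www.example.com/web-robots/programmed-by-fools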

No, there's nothing wrong with the syntax of any of the addresses I've used; they're all perfectly correct usage (they don't exist on that website, though; the example.com domain is set aside for documentation examples, so you don't upset real servers with bogus addresses), and there's no need to use one style or the other just for the benefit of web robots.  Anybody who says that you should only use absolute, or only relative, addressing is an idiot.  Both styles, absolute and relative, are completely correct, have clear and precise definitions without any possible alternative interpretation, and any robot that gets them wrong is utterly broken.  Robots have to follow the rules, not make up their own.

The same rules apply whether the server is actually traversing a file structure or generating content in response to the request.  And they apply whether it's a robot from a search engine or some add-on to a web browser that lets you take a copy of a website.  That software should be programmed properly by the people who wrote it.

They stupidly change addresses

This page is “http://www.cameratim.com/personal/soapbox/stupid-web-robots” without any other characters on the end of it.  No slashes should be added to it, yet I continually see 404 errors for pages where someone, or something, has stupidly appended a slash to the end of the address.  Nor should .html be added to the end of it, because that's not how I advertise the address (in any links from other pages).  That gives me the freedom to change the type of page stored on the server (from HTML to some other format, or from a single page to a directory of sub-pages), without having to change the address, or any links leading to it.

Use addresses as you're told they are, don't think that you know better and change them into what you think they ought to be.
