Stupid web robots

Why is it that most of the errors I see in my web logs are down to robots programmed by morons?  Even now, many years after the internet went public, many robot crawlers (like those unleashed by Yahoo) cannot manage to work out the right URI to follow.  I'm forever seeing 404 errors that are entirely their fault.  For example…

They don't properly understand absolute references

They continually get something like href="/styles/default.css" wrong.  A link like that, on my website, means that you find the resource at “http://www.cameratim.com/styles/default.css”, no matter where on the site you currently are.  It doesn't matter whether you're in the /computing/ or /business/ sub-sections, there's just one /styles/ location, at the top.  It is not a “styles” subdirectory off the current location.  If an address path starts with a slash, it's an absolute path, worked out from the top-level directory of the site.

When you're at “http://www.example.com/computing/” and you're referred to “/business/” (href="/business/"), you go to “http://www.example.com/business/”, not “http://www.example.com/computing/business/”.
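As a rough illustration of that rule, here is a minimal sketch using urljoin from Python's standard library.  The second page address is a made-up example, and this is only one way a robot might do the job, not a description of how any particular crawler works.

```python
from urllib.parse import urljoin

# Hypothetical pages a robot might currently be looking at.
pages = [
    "http://www.example.com/computing/",
    "http://www.example.com/business/some/deep/page.html",
]

# An absolute-path reference resolves to the same place from every page.
stylesheet_targets = [urljoin(page, "/styles/default.css") for page in pages]
for page, target in zip(pages, stylesheet_targets):
    print(page, "->", target)

# The /business/ example from the text above.
business_target = urljoin("http://www.example.com/computing/", "/business/")
print(business_target)
```

Both pages end up asking for the one stylesheet at the top of the site, and nothing gets tacked onto /computing/.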

They don't properly understand relative references

They continually get something like href="./sample.jpeg" or href="../example.text" wrong.  That's a link to a “sample.jpeg” resource in the current directory, or an “example.text” resource in a parent directory.  The dot–slash and double-dot–slash refer to something relative to the current location.  You request the resource name written after them; you do not put dots and slashes into the requested URI as part of the address, nor do you try to get the resource from the top-level directory.

Likewise, they continually get something like href="samples/" wrong.  It's referring to a subdirectory of the current location.

If it doesn't start with a slash, it's a relative address: relative to the current URI path.  Paths starting with a dot–slash, or with no dots and slashes at all, mean from within the current location (the dot–slash is optional).  Paths starting with a couple of dots (followed by a slash) refer to the parent directory, and sequences of them go further back up the chain (grand-parents, and so on).  The requester has to work out for itself which parent path segments to remove, append the new path information to what remains of the current location (without including any dots), and then request the correct URI.
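The steps just described can be sketched by hand.  This is a deliberately simplified illustration: the function name resolve is my own, it only handles path-style references (no query strings, fragments, or full URIs in the link), and a real robot would be better off using a tested library routine.

```python
from urllib.parse import urlsplit, urlunsplit

def resolve(base: str, ref: str) -> str:
    """Resolve a path-only reference against a base URI (simplified sketch)."""
    scheme, host, base_path, _query, _fragment = urlsplit(base)
    base_path = base_path or "/"

    if ref.startswith("/"):
        # Absolute path: ignore the current location entirely.
        merged = ref
    else:
        # Relative path: keep the base up to its last slash, then add the link.
        merged = base_path[: base_path.rfind("/") + 1] + ref

    # Work through the segments, removing "." and ".." as we go.
    segments = merged.split("/")
    out = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == ".":
            if is_last:
                out.append("")      # keep the trailing slash
        elif segment == "..":
            if len(out) > 1:
                out.pop()           # step up one level, never above the top
            if is_last:
                out.append("")      # keep the trailing slash
        else:
            out.append(segment)

    return urlunsplit((scheme, host, "/".join(out), "", ""))

print(resolve("http://www.example.com/computing/", "./sample.jpeg"))
print(resolve("http://www.example.com/computing/", "../example.text"))
print(resolve("http://www.example.com/computing/", "samples/"))
```

Note that no dots ever reach the server: they are worked out by the requester before the request is made.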

If you're at “http://www.example.com/goofy-robot/” and you're referred to “./yahoo” (href="./yahoo"), then you go to “http://www.example.com/goofy-robot/yahoo”.

If you're at “http://www.example.com/goofy-robot/” and you're referred to “msn” (href="msn"), then you go to “http://www.example.com/goofy-robot/msn”.

If you're at “http://www.example.com/idiotic/goofy-robots” and you're referred to “stupid” (href="stupid"), then you go to “http://www.example.com/idiotic/stupid”.

If you're at “http://www.example.com/web-robots/cannot/get/it/right/” and you're referred to “../../../../programmed-by-fools” (href="../../../../programmed-by-fools"), then you go to “http://www.example.com/web-robots/programmed-by-fools”.
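For anyone who wants to check those four examples, the following sketch feeds them through urljoin from Python's standard library.  The list name examples and the layout are my own choices; the addresses are the ones used above.

```python
from urllib.parse import urljoin

# (current location, href, where you should end up)
examples = [
    ("http://www.example.com/goofy-robot/", "./yahoo",
     "http://www.example.com/goofy-robot/yahoo"),
    ("http://www.example.com/goofy-robot/", "msn",
     "http://www.example.com/goofy-robot/msn"),
    ("http://www.example.com/idiotic/goofy-robots", "stupid",
     "http://www.example.com/idiotic/stupid"),
    ("http://www.example.com/web-robots/cannot/get/it/right/",
     "../../../../programmed-by-fools",
     "http://www.example.com/web-robots/programmed-by-fools"),
]

results = [urljoin(base, href) for base, href, _expected in examples]
for (base, href, expected), result in zip(examples, results):
    verdict = "ok" if result == expected else "WRONG"
    print(f"{verdict}: {href!r} from {base} -> {result}")
```

The third case is the one to watch: the current location has no trailing slash, so “goofy-robots” counts as a resource name rather than a directory, and the link replaces it.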

No, there's nothing wrong with the syntax of any of the addresses that I've used; they're all perfectly correct usage.  (They don't exist on that website, though: the example.com website is set up for documentation examples, so that you don't upset real servers with bogus examples.)  There's no need to use one type or another for the benefit of web robots.  Anybody who says that you should only use absolute or relative addressing is an idiot.  Both those styles, absolute and relative, are completely correct, have clear and precise definitions without any possible alternative interpretations, and any robot that gets that wrong is utterly broken.  Robots have to follow the rules, not make up their own.

The same rules apply whether it's actually traversing a file structure, or the server generates content based on the request.  And the same applies whether it's a robot from a search engine, or some add-on to a web browser that lets you take a copy of a website.  That software should be programmed properly, by those who wrote it.

