HTML (Tim's webpage authoring guide)

The most usual way to author webpages is to write them in HTML (an abbreviation for “HyperText Markup Language”). Other means are possible, but HTML is the most compatible and flexible approach.

What HTML is

It's a structural language, meaning that it is used to define which parts of the page are particular types of information (“marking up” content on the page as headings, paragraphs of text, lists of items, etc.). This allows user agents (e.g. web browsers) to render certain parts of the page in specific ways, and allows non-browser user agents that act as “content processors” (e.g. search engines and indexing tools) to assess the content of the page (e.g. draw up a list of the headings used on the page, giving you an outline of its contents).

Here's one reason to use HTML properly that many people don't think about: many websites rely on being listed in search engines for the general public to discover them. If a site uses HTML in a nonsensical manner, then search engines may not index it, or the listing may be so incoherent to people searching for information that they just ignore it.

It's not a page layout language; it gives meaning to the content. (That's the first misconception that authors really must unlearn, and it usually comes from confusing cause and effect.) Giving meaning to the content allows the information to be used in a multitude of ways, of which being displayed in a web browser is merely one. Having a page look different from simple plain text is a “side effect” of the processing that can be applied to HTML. How a page will actually “look” is determined by various things, such as:

The user's configuration of their browser.
The browser's support for the HTML elements used on the page, and the particular elements and attributes that the author used. Not all browsers support all elements, or all the possible extra attributes applied to some elements; and authors will often use elements and attributes that are non-standard, and only work on some browsers.
The manner that a browser supports the elements, and the particular browser used. While it may be defined that a browser should render individual items in a list, in such a way that it's clearly apparent which are different items; other things, like the space between each item, are a styling issue controlled by the particular browser, and sometimes modifiable with extra styling information from the author. As various styling issues don't have explicit definitions about how they should be done, different browsers will do things in different ways.
The extra styling information, provided by the author; whether the browser can use the information; and whether it's been configured to follow that information, or ignore it.
Errors on the page.

It's important to understand that presentation “style” is different from presentation “structure,” and that the same page displayed on different browsers will look different (other people use different browsers from yours, configure them in different ways, have different screen modes, etc.). Attempts to overcome this usually fail, sometimes very badly, often producing pages that don't work well, or at all, for many people. Authors then usually make the mistake of insisting that the user switch to the browser that they dictate, rather than rewriting their page so that it works on any browser. You cannot make any assumptions about how people will end up seeing your page, nor about how they should best see it; you will be wrong.

One of the main features of HTML is that it was designed to take much of the presentational look of the page out of the author's hands, and put it in the hands of the person viewing the page, so that people can read pages in the manner that's most suitable to them and their system. Compare that with the problems of trying to read plain-text and word processor files that someone else wrote, using colours, fonts, and margins which are often terribly unsuitable, and difficult to adjust.

A quick example of this (presenting the page in a manner that's suitable to the viewer), is to resize your browser window, or change the font size used in your display, and you'll notice that the text reflows to fill the space across the screen (so that you don't have to scroll both horizontally, and vertically, to read the page—like trying to read the page through a magnifying glass). This is because the layout of the page is not controlled by how the words were typed into the document, it's delineated by the HTML elements, and the browser tries to fit them into the available viewing space, as best as possible.

This is just one aspect of other document formats which has always been a problem (they've been formatted to suit someone else), and most of them didn't have any good solution to resizing the document. Many of the other document formats even required special viewer applications, and could only be viewed on certain types of computers. Attempting to override this characteristic (that the user comes first, and one document can be read on nearly anything), often creates pages which can be very difficult to read, for many people.

Another of the prime design aspects of HTML is that it's a common document format, that can be read on almost any type of computer. The format is no secret, the specifications are public knowledge, and it's designed not to have special requirements which would be hard to meet.

Relying on presentational style, to define certain aspects of the page (e.g. colouring certain words), means that your page will not make sense to people who cannot, or do not, utilise the styling information that you've used. The best way to determine whether any document makes sense (webpages, or otherwise), is to read it out loud. If the words don't convey the meaning, in themselves, then your message is lost.
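
As a rough sketch of the difference (the field names are invented for illustration, and the inline “style” attribute is just one assumed way of supplying colour), the first paragraph below only makes sense if the reader sees the colour, while the second carries the same meaning in its words and would survive being read out loud:

```html
<!-- Relies on colour: which fields are required is lost if the
     styling is ignored, unavailable, or cannot be seen. -->
<p><span style="color: red">Name</span>, Phone,
    <span style="color: red">E-mail</span></p>

<!-- The words carry the meaning; any styling is only an extra. -->
<p>Name (required), Phone, E-mail (required)</p>
```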

Relying on the current behaviour of some browsers, when they render certain HTML elements (e.g. that “blockquoted” text is often indented), is also setting yourself up for failure (that behaviour is not a defined standard, is easily changed, is handled differently by some browsers, and future versions of current browsers may behave differently, still). The “look” of something, is “suggested” by styling information extra to HTML (e.g. Cascading Style Sheets).

Even worse, is “abusing” HTML. For example, using “blockquote” because you wanted to indent something, but what you're marking up isn't actually a block of quoted text, is misrepresenting the data that it marks up; and makes machine assessment of your data impossible.
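
A minimal sketch of the distinction, assuming a style sheet declaration (“margin-left”) as the means of suggesting the indent; the quoted text and the amount of indent are arbitrary choices for the example:

```html
<!-- Appropriate: the content really is a block of quoted text. -->
<blockquote>
  <p>Two wrongs don't make a right.</p>
</blockquote>

<!-- Not a quotation: mark it up as the paragraph that it is, and
     merely suggest the indent through styling information. -->
<p style="margin-left: 2em">This paragraph is only meant to look
    indented; it isn't quoting anybody.</p>
```

A browser that ignores the styling still gets an honest paragraph, and a content processor looking for quotations isn't misled.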

You really must learn what HTML is, before you're going to be able to use it properly.

How HTML is formed

As already mentioned, HTML is a structural language, where different parts of the page (different types of things) are contained within different “elements.” The first element is the entire HTML document itself (everything). Inside that are the various sub-sections of the page. Firstly, the “head” element, which contains information used for the page, but not necessarily directly seen anywhere (this will become more apparent when you see some examples of what's used inside it). Then, the “body” element, which contains the information seen on the page. Both of those have elements inside them, too: such as the page “title” element inside the “head” element, and “heading” and “paragraph” elements inside the “body” element (where the page body contents belong).

e.g.

HTML
    head
        title
    body
        A Heading
        A paragraph.
        Another paragraph.
        Another Heading
        Yet another paragraph.

(Note: The above example is just a pictorial representation of what's been described in the last paragraph; the indentation shows which elements sit inside which.)

If you ever studied basic chemistry, you'll remember that elements form the basic parts of a complex object. HTML has a similar concept, where an HTML page is constructed from various elements (hence the naming of its component parts as “elements”). For HTML, though, it's a case of elements inside elements, rather like physically placing containers inside containers; and because of that, there are rules about which elements can fit inside other elements. Most of these are logically apparent, once you understand the concept behind the structure.

For instance, it's logical that headings and paragraphs are part of a page body, so that's where they go (inside the “body”); and it's logical that headings and paragraphs are separate (albeit related) things, so they're never placed inside each other, but are placed one after another (inside the body).
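
Written out as HTML, the structure described so far might look something like the following sketch. The element names “h1” and “h2” for the headings, the first-line document type declaration, and the title text are assumptions on my part, as none of them has been introduced in the text above:

```html
<!DOCTYPE html>
<html>
  <head>
    <!-- Information about the page, not shown in the page body. -->
    <title>Page title</title>
  </head>
  <body>
    <!-- Headings and paragraphs sit one after another inside the
         body, never inside each other. -->
    <h1>A Heading</h1>
    <p>A paragraph.</p>
    <p>Another paragraph.</p>
    <h2>Another Heading</h2>
    <p>Yet another paragraph.</p>
  </body>
</html>
```

The indentation is only there to make the containment easy to see; it has no effect on how the page is processed.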

Breaking the rules (of how to write HTML) produces unpredictable results, because browsers are designed (for the most part) to follow the rules. It's the only real way that several different software coders (e.g. Opera, Mozilla, Netscape, etc.) can produce different applications (i.e. their own web browsers) that can properly handle the same data (HTML) as other browsers. They're written (when designed properly) to follow the rules of the data that they're supposed to display (HTML), as there isn't any other common specification to follow (i.e. there isn't a set of rules for how to make a browser). Bending the rules (when you write HTML) where some of them are a bit flexible can also be unwise, for the same reasons (you're relying on different coders handling the variations in the same way, which isn't too likely).

In some cases, browsers will ignore or “work around” broken content; in other cases, broken content will fail very badly. What you can observe happening in your browser, with badly authored pages, is no indication of how they'll work for someone else. Different browsers do behave differently from each other, and there's nothing wrong with that. Different versions of the same browser behave differently, and even identical browsers can behave differently (because there are differences between the systems that they're installed on, and how they're configured). Write pages correctly, use error checkers to double-check your work, and forget about trying to pander to the flaws of some web browsers.

HTML elements

Most elements are formed by placing opening and closing tags at the beginning and end of the element (like brackets). The tags are formed by placing so-called “pointy brackets” around the element name (e.g. paragraphs are marked up as a “p” element), with the closing tag having a slash before the element name, inside those brackets. The entire thing (the opening tag, the enclosed content, and the closing tag) is what forms the “element”.

e.g.

<p>This is a paragraph
    element.</p>

Notice the opening “<p>” and closing “</p>” tags (the “paragraph element” tags), bracketing a paragraph of text, and how the closing “</p>” tag has a slash before the “p”.

There are a few elements which have only a single tag, rather than an opening and closing pair, and therefore don't contain anything. These are the so-called “empty elements”, such as the “br” element, to introduce a line break (where it's inserted), and the “img” element, to insert an image (where the img element's inserted). They're what you might call “exceptions to the rule”.

e.g.

<p>A paragraph, with some<br>line
    breaks<br>in a few<br>places.</p>

They're considered to be “empty” because they don't have any “content” between opening and closing tags.
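
The “img” element works the same way. In this sketch the file name and the alternative text are placeholders, and the “src” and “alt” attributes are used ahead of the explanation of attributes further on:

```html
<!-- "img" has no closing tag and no content between tags; the
     information it needs is carried by its attributes instead. -->
<p>Our logo: <img src="logo.png" alt="Example company logo"></p>
```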

Elements cannot “overlap”. This is an error condition, and produces unpredictable results. However, some elements can be placed “inside” other elements, depending on the circumstances.

Element nesting
Wrong:	`<b>bold <i>and</b> italic</i> text`
Okay:	`<i>An italicised sentence, with a <b>bold</b> word in it.</i>`
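
One way to repair the overlapping example, while keeping what the author presumably wanted (“and” in both bold and italics, “italic” in italics only), is to close the inner element first and then open a fresh one. This is only a sketch of the nesting; the wrapping “p” element is my addition:

```html
<!-- Overlapping (wrong): "b" is closed while "i" is still open.
     <b>bold <i>and</b> italic</i> text -->

<!-- Properly nested: "i" is closed before "b", then a new "i"
     element is opened for the remaining italic word. -->
<p><b>bold <i>and</i></b><i> italic</i> text</p>
```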

Element attributes

Various elements have optional, or necessary, “attributes,” to provide additional information (such as an alignment attribute on a paragraph, to centre the text). These attributes are placed inside the opening tag.

e.g.

<p align="center">This is a centred
    paragraph.</p>

In this example, the “p” element has an “align” attribute, and this align attribute, has a “value” of “center”. It could have had other values, instead, such as “left” or “right” to align the text to the left or right of the page.

That particular example showed an attribute which probably wasn't essential (the paragraph could probably still be completely coherent even if it wasn't centred on the page). An example of what could be described as an essential attribute, would be a web address inside a linking element (it's the attribute and its value that makes it a “link”).

Please note that there isn't an HTML term called an “essential” attribute. I've merely used the word to describe a type of situation where the element can't do anything useful without the presence of the particular attribute.

e.g.

Visit the <a
    href="homepage.html">homepage</a>.

This example is using the “a” element, with an “href” attribute; which contains the address that the link refers to (a “homepage” file).

Attributes often require the value to be enclosed within quotes, although there are situations where it isn't needed. However, remembering, or working out, when they're needed, is probably expending more effort than simply always “quoting” the attribute's value. And always quoting attributes should help ensure that some browsers don't foul up while attempting to render the element (browsers don't always follow the rules, so always quoting attributes is “playing it safe”).

Since the attribute uses quotes to contain the value, you strike a problem when you wish to make a quote symbol a part of the value. There are three solutions to that problem:

Use something else instead of quotes, like the ASCII “apostrophe”, inside the value:

e.g. title="The 'quoted' word."

Use apostrophes to contain the value, instead; and use quotes inside the value:

e.g. title='The "quoted" word.'

This technique is a part of the HTML specifications, though I've seen browsers foul up on it, particularly if there's a line break in the attribute (which is not a good idea, in itself).

Use the equivalent character reference code, inside the value, that represents the quote symbol:

e.g. title="The &quot;quoted&quot; word."

&quot; is a character reference code for the quotation symbol.

Character references

Because those “pointy brackets” (the less-than and greater-than signs) are used as part of the tags, it restricts your ability to use them directly, elsewhere in the page. If you attempt to do so, there's a very good chance that they'll be misinterpreted as an HTML tag (even though there are rules about what constitutes a tag, some browsers won't follow the rules, and just blindly assume, no matter what, that they're part of a tag). Likewise, if you want to use quotation symbols inside “quoted” content, or some other typographical symbols used for special purposes, you will strike problems with them being misinterpreted as HTML content, rather than displayable page content. Also, some symbols are difficult to type when there isn't a key for them on your keyboard (such as the © copyright and ™ trademark symbols), and impossible to directly include in a document when your current character encoding system doesn't have a code for the character you want.

So, to “display” such symbols you use an equivalent code for the character, instead. The HTML parser encounters the code while parsing the document, and merely “displays” the character in the page, rather than mistakenly “interpreting” it as part of HTML. These codes begin with an ampersand symbol (&) and end with a semi-colon (;), with the entity name (or numerical reference) in between them (usually the entity name is obviously related to the entity, such as "Eacute" for a letter "E" with an acute accent).

e.g. &Eacute; is the character entity reference for the É character.

Although it is possible to omit the closing semi-colon in some cases, remembering what those cases are, and hoping that the browser handles those cases properly, is more trouble than always using it.
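
As a short sketch of these codes in use (the sentences themselves are just filler), here are the references for the pointy brackets and for the accented letter mentioned above:

```html
<!-- The references below display as pointy brackets in the page,
     instead of being taken as the start or end of a tag. -->
<p>Paragraphs are marked up with &lt;p&gt; and &lt;/p&gt; tags.</p>

<!-- Displays a capital E with an acute accent at the start of the word. -->
<p>&Eacute;cole is the French word for school.</p>
```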

Because the ampersand is used as a code, you can see that there's a problem with trying to use the ampersand directly in some circumstances (it may be misinterpreted as forming part of a code, rather than being treated merely as the ampersand character). It is actually possible to use it, by itself, when it really is used by itself (such as with a blank space either side of it). It's also possible to directly use it when it's jammed against other characters which aren't allowed in character entity references (such as other symbols), though some user agents don't follow the rules well, and need you to use a character reference (for the ampersand), so they don't make a mistake.

When it's used in a situation where it can be misinterpreted as being part of a character reference, you use the &amp; character entity reference to “display” or “include” the ampersand, as an ampersand.

e.g.

Find <a
    href="/scripts/find?q=cats&amp;dogs">cats &
    dogs</a>.

Ampersands are actually allowed in URIs (e.g. as separators between parameters); this is just an issue to do with writing ampersands in HTML documents.
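
To illustrate the difference between what's written in the document and the address that results, here is a sketch using a made-up second parameter (“page”) on the same hypothetical search script:

```html
<!-- In the HTML source, the separator between the two parameters
     is written as a character reference... -->
<a href="/scripts/find?q=cats&amp;page=2">More results for cats</a>

<!-- ...but the address that the link actually refers to contains
     a plain ampersand:  /scripts/find?q=cats&page=2  -->
```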

Some character references
Symbol	Name	Entity
&	ampersand	&amp;
<	less-than sign	&lt;
>	greater-than sign	&gt;
"	quotation mark	&quot;
©	copyright symbol	&copy;
™	trademark logo	&trade;

Note that character references are case sensitive, so they must be typed exactly as specified. And not all browsers support all of the character codes (so avoid using obscure ones).

There are also numerical character reference codes, where instead of an entity “name” you type a reference “number” in the same manner (beginning with an ampersand, ending with a semi-colon, with the number between them). In the numerical form, a hash sign (#) goes between the ampersand and the number.
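
For comparison, here is a sketch of the named and the numerical form of the same character, the copyright symbol (character number 169); the company name is a placeholder:

```html
<!-- Named character entity reference. -->
<p>&copy; Example Company</p>

<!-- Numerical character reference for the same character: an
     ampersand, a hash sign, the number, and a semi-colon. -->
<p>&#169; Example Company</p>
```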

Different browsers have differing levels of support for the codes; some know more of the numerical references than they do of the named character entities. So, sometimes it can be better to use the numerical codes (if you're hoping to use one of the more unusual symbols). However, when a browser is asked to display a code that it doesn't know, it may do one of several things. Here are just some of the possibilities:

Display a special symbol to show that there's an unknown character, there.
Display the code, itself.
Do nothing that gives the reader any clue that there's a problem.

If the reader sees &copy; written into the middle of a sentence, they can make an educated guess that they should have seen a copyright symbol (and after multiple instances of seeing this, they're likely to remember it). Seeing &#169;, though, will mean nothing to them. Eventually, they may remember that a particular number means something, but there are a lot of numbers to remember, and they give no clue as to what they represent (you'd have to look up the specifications to work out what they referred to).

There's another problem with using numerical codes: some people use the wrong ones, because their browser incorrectly shows a symbol for the incorrect number, and they either never realise that they've done it wrong, or don't care because “it works for them” (other people's browsers may not behave the same way).

This problem is caused by some browser software authors (e.g. Microsoft) not understanding what the codes are about. Instead of realising that &#169; refers to character number 169 in the HTML character set, they've incorrectly assumed that it refers to character number 169 in the font currently being used on the page. The font being used may not have the desired symbol at that spot, you cannot “dictate” the font that the user will see the page in, and even attempting to “suggest” the font that worked for you is not the answer, as that's not how character references work: there is no correlation between local character table positions and HTML character positions.

Valid HTML

The old saying, “two wrongs don't make a right,” is just as applicable in that situation (above). To get a symbol displayed, use the right code in the first place, and don't rely on other people also having broken tools (to match yours). Trying to use the wrong code, or incorrectly authoring any other aspect of HTML, is akin to measuring something with a ruler that claims to be marked in centimetres, but actually has markings completely unrelated to centimetres, wrongly telling people that something is x centimetres long, then insisting that they also use a broken ruler, rather than giving them the correct information in the first place.

Always remember that your browser is no indication of the correctness of a page (browsers typically have their own errors, ignore some errors on pages, and try to work their way around others); other people's browsers may behave in a completely different manner.

If there's one thing that you must learn, it's this: get it right in the first place. It's the only way to make your pages work properly in as many browsers as possible. Yes, some browsers are broken, and the answer is not to break your code to suit them, but to let them fail. Eventually, the people who make those browsers may fix the errors in them, but only if they're forced to. And no matter what, there's no guarantee that their next browser will act the same as the current one, anyway. Hoping that a browser can fix your faults isn't intelligent; and it also means that browsers have to get more and more complex (and slower, and ultimately have more problems), as they've got to be programmed to handle several different incorrect ways of doing one particular task. Laziness and stupidity are not the way to do any job.

If you must make a page so that it works on older (or less featured) browsers, then do so by making a simpler page. Don't make a complex page, with all sorts of conditional testing, that's bound to make wrong assumptions; and will probably attempt to use special features, that cause even more problems.

To modernise the old “two wrongs” saying, you could state it like this: “Two stupids, don't make a sensible.”

So, how do you get it right? Well, you learn how to do it properly, in the first place; and you use error checkers on your HTML to look for any problems that you didn't notice.
