Hypertext Markup Language (HTML)             Tim Berners-Lee, CERN
Internet Draft                             Daniel Connolly, Atrium
IIIR Working Group                                       June 1993

Hypertext Markup Language (HTML)

A Representation of Textual Information and MetaInformation for
Retrieval and Interchange

Status of this Document

This document is an Internet Draft. Internet Drafts are working
documents of the Internet Engineering Task Force (IETF), its Areas,
and its Working Groups. Note that other groups may also distribute
working documents as Internet Drafts.

Internet Drafts are working documents valid for a maximum of six
months. Internet Drafts may be updated, replaced, or obsoleted by
other documents at any time. It is not appropriate to use Internet
Drafts as reference material or to cite them other than as a
"working draft" or "work in progress".

Distribution of this document is unlimited. The document is a draft
form of a standard for interchange of information on the network
which is proposed to be registered as a MIME (RFC1341) content
type. Please send comments to timbl@info.cern.ch or the discussion
list www-talk@info.cern.ch.

This is version 1.2 of this draft. This document is available in
hypertext on the World-Wide Web as
http://info.cern.ch/hypertext/WWW/MarkUp/HTML.html

Abstract

HyperText Markup Language (HTML) can be used to represent:

    Hypertext news, mail, online documentation, and collaborative
    hypermedia;
    Menus of options;
    Database query results;
    Simple structured documents with inlined graphics;
    Hypertext views of existing bodies of information.

The World Wide Web (W3) initiative links related information
throughout the globe. HTML provides one simple format for providing
linked information, and all W3 compatible programs are required to
be capable of handling HTML. W3 uses an Internet protocol
(Hypertext Transfer Protocol, HTTP), which allows transfer
representations to be negotiated between client and server, the
result being returned in an extended MIME message.
HTML is therefore just one, but an important one, of the
representations used with W3.

HTML is proposed as a MIME content type.

HTML refers to the URL specification of RFCxxxx.

Implementations of HTML parsers and generators can be found in the
various W3 servers and browsers, in the public domain W3 code, and
may also be built using various public domain SGML parsers such as
[SGMLS].

HTML is an SGML document type with fairly generic semantics
appropriate for representing information from a wide range of
applications. It is more generic than many specific SGML
applications, but is still completely device-independent.

IN THIS DOCUMENT

This document contains the following parts:

    Vocabulary used in this document, degrees of imperative.
    HTML and MIME, with discussion of character sets.
    HTML and SGML and the relationship between them, and
    Structured text: an introduction for beginners to SGML.
    HTML Elements: a list with description, example, and typical
    rendering.
    HTML Entities: entities used to describe characters.
    The HTML DTD: the text of the SGML DTD for HTML.
    Link relationship values: a provisional list. Not part of the
    standard.
    Registration Authority: the authority for extending lists of
    valid values.
    Acknowledgements and a change history of the document.
    References to related documents.
    Authors' addresses: contact information.

Vocabulary

This specification uses the words below with the precise meaning
given.

Representation
    The encoding of information for interchange. For example, HTML
    is a representation of hypertext.
Rendering
    The form of presentation of information to the human reader.

IMPERATIVES

may
    The implementation is not obliged to follow this in any way.
must
    If this is not followed, the implementation does not conform to
    this specification.
shall
    As "must".
should
    If this is not followed, though the implementation officially
    conforms to the standard, undesirable results may occur in
    practice.
typical
    Typical rendering is described for many elements. This is not a
    mandatory part of the standard but is given as guidance for
    designers and to help explain the uses for which the elements
    were intended.

NOTES

Sections marked "Note:" are not mandatory parts of the
specification but are for guidance only.

STATUS OF FEATURES

Mainstream
    All parsers must recognize these features. Features are
    mainstream unless otherwise mentioned.
Extra
    Standard HTML features which may safely be ignored by parsers.
    It is legal to ignore these and treat the contents as though
    the tags were not there (e.g. EM, and any undefined elements).
Obsolete
    Not standard HTML. Parsers should implement these features as
    far as possible in order to preserve back-compatibility with
    previous versions of this specification.

HTML AND MIME

The definition of the HTML content subtype is:

    MIME type name:       text
    MIME subtype name:    html
    Required parameters:  none
    Optional parameters:  charset

Character sets

The base character set (the SGML BASESET) for HTML is ISO Latin-1.
This is the set referred to by any numeric character references.
The actual character set used in the representation of an HTML
document may be ISO Latin-1, or its 7-bit subset which is ASCII.
There is no obligation for an HTML document to contain any
characters above decimal 127. It is possible that a transport
medium such as electronic mail imposes constraints on the number of
bits in a representation of a document, though the HTTP access
protocol used by W3 always allows 8-bit transfer.

When an HTML document is encoded using 7-bit characters, the
mechanisms of character references and entity references may be
used to encode characters in the upper half of the ISO Latin-1 set.
In this way, documents may be prepared which are suitable for
mailing through 7-bit limited systems.

INTRODUCTION

The HyperText Markup Language is defined in terms of the ISO
Standard Generalized Markup Language [SGML].
SGML is a system for defining structured document types and markup
languages to represent instances of those document types. Every
SGML document has three parts:

    An SGML declaration, which binds SGML processing quantities and
    syntax token names to specific values. For example, the SGML
    declaration in the HTML DTD specifies that the string that
    opens a tag is < and the maximum length of a name is 40
    characters.

    A prologue including one or more document type declarations,
    which specify the element types, element relationships and
    attributes, and references that can be represented by markup.
    The HTML DTD specifies, for example, that the HEAD element
    contains at most one TITLE element.

    An instance, which contains the data and markup of the
    document.

We use the term HTML to mean both the document type and the markup
language for representing instances of that document type.

All HTML documents share the same SGML declaration and prologue.
Hence implementations of the World-Wide Web generally only transmit
and store the instance part of an HTML document. To construct an
SGML document entity for processing by an SGML parser, it is
necessary to prefix the text from ``HTML DTD'' on page 10 to the
HTML instance. Conversely, to implement an HTML parser, one need
only implement those parts of an SGML parser that are needed to
parse an instance after parsing the HTML DTD.

Structured Text

An HTML instance is like a text file, except that some of the
characters are interpreted as markup. The markup gives structure to
the document.

The instance represents a hierarchy of elements. Each element has a
name, some attributes, and some content. Most elements are
represented in the document as a start tag, which gives the name
and attributes, followed by the content, followed by the end tag.
For example:
NAME
cat -- concatenatefiles
EXAMPLE
cat
The content of the above PRE element is:
A B element
The string `` cat -- concatenate''
An A element
The string ``\n''
Another B element
The string ``\n cat . After the comment
delimiter, all text up to the next occurrence of -- is ignored.
Hence comments cannot be nested. Whitespace is allowed between the
closing -- and >, but not between the opening <! and --.
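As a rough illustration of the comment rule above, the following
Python sketch removes comments from a string. The name
strip_comments and the regular expression are assumptions made for
this illustration and are not part of the specification; the sketch
covers only the simple form described here, not full SGML comment
declarations.

```python
import re

# Hypothetical helper, not part of the specification.
# The pattern matches "<!--" (no space allowed between "<!" and "--"),
# then as little text as possible up to a "--" that is followed by
# optional whitespace and ">".
_COMMENT = re.compile(r"<!--.*?--\s*>", re.DOTALL)


def strip_comments(text: str) -> str:
    """Return text with simple SGML-style comments removed."""
    return _COMMENT.sub("", text)
```

With this sketch, whitespace before the closing > is tolerated,
while a space between <! and -- means the text is not treated as a
comment at all.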
HTML Guide: Recommended Usage
There are a few other SGML markup constructs that are deprecated or
illegal.
Delimiter   Signals
<?          Processing instruction. Terminated by >.
LINE BREAKS
A line break character is considered markup (and ignored) if it is
the first or last piece of content in an element. This allows you
to write either
<PRE>some example text</PRE>
or
<PRE>
some example text
</PRE>
and these will be processed identically.
Also, a line that's not empty but contains no content will be
ignored altogether. For example, the element
first line
third line
fourth line
contains only the strings
first line
third line
fourth line.
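The first rule in this section (a line break that is the first or
last piece of content in an element is ignored) can be sketched in
Python as follows. The function name is hypothetical, only "\n" is
treated as a line break, and the second rule about lines that
contain no content is not modelled here.

```python
def trim_content_line_breaks(content: str) -> str:
    """Drop one leading and one trailing line break from element content.

    Illustrative only: assumes "\n" is the line break character.
    """
    if content.startswith("\n"):
        content = content[1:]
    if content.endswith("\n"):
        content = content[:-1]
    return content
```

Under these assumptions, the content "some example text" and the
content "\nsome example text\n" yield the same string.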
SPACES AND TABS
Space characters must be rendered as horizontal white space. In
HTML, multiple spaces should be rendered as single spaces.
The rendering of a horizontal tab (HT) character is not defined,
and HT should therefore not be used, except within a PRE (or
obsolete XMP, LISTING or PLAINTEXT) element.
Neither spaces nor tabs should be used to make SGML source layout
more attractive or easier to read.
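A minimal Python sketch of the rule that multiple spaces should be
rendered as single spaces. The function name collapse_spaces is an
assumption for illustration; the sketch deliberately leaves tabs
alone, since their rendering is not defined here, and it should not
be applied to the content of PRE elements.

```python
import re


def collapse_spaces(text: str) -> str:
    """Replace each run of two or more space characters with one space."""
    return re.sub(r" {2,}", " ", text)
```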
SUMMARY OF MARKUP SIGNALS
The following delimiters may signal markup, depending on context.
Delimiter Signals
LINK RELATIONSHIP VALUES
Status: This list is not part of the standard. It is intended to
illustrate the use of link relationships and to provide a framework
for further development.
Additions to this list will be controlled by the HTML registration
authority. Experimental values may be used on the condition that
they begin with "X-".
Link relationship values are NOT case sensitive. That is, "Made"
and "made" have the same meaning.
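The two rules above (case-insensitive values, and an "X-" prefix
for experimental values) might be applied as in the following
Python sketch. The function name, the return labels, and the choice
to match the "X-" prefix case-insensitively are assumptions for
illustration; the set of names is copied from the headings of this
provisional list.

```python
# Names taken from the headings of this (provisional) list, lowercased.
LISTED = {
    "useindex", "useglossary", "annotation", "reply", "embed",
    "precedes", "subdocument", "present", "search", "supersedes",
    "history", "made", "owns", "approves", "supports", "refutes",
    "includes", "interested",
}


def classify_rel(value: str) -> str:
    """Classify a REL value as 'experimental', 'listed' or 'unregistered'."""
    key = value.strip().lower()  # values are not case sensitive
    if key.startswith("x-"):     # assumption: prefix compared case-insensitively
        return "experimental"
    if key in LISTED:
        return "listed"
    return "unregistered"
```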
These values of the REL attribute of hypertext links have a
significance defined here, and may be treated in special ways by
HTML applications.
These relationships relate whole documents (objects), rather than
particular anchors within them. If the relationship value is used
with a link between anchors rather than whole documents, the
semantics are considered to apply to the documents.
In the explanations which follow, A is the source document of the
link and B is the destination document specified by the HREF
attribute.
A relationship marked "Acyclic" has the property that no sequence
of links with that relationship may be followed from any document
back to itself. These types of links may therefore be used to
define trees.
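The "Acyclic" property can be checked mechanically. The Python
sketch below uses a depth-first search over links of a single
relationship; the function name and the dictionary representation
(document name mapped to the list of documents it links to) are
assumptions made for this example only.

```python
def has_cycle(links: dict[str, list[str]]) -> bool:
    """Return True if some document can be reached from itself by
    following links of one relationship."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    state: dict[str, int] = {}

    def visit(node: str) -> bool:
        state[node] = GREY
        for target in links.get(node, []):
            seen = state.get(target, WHITE)
            if seen == GREY:  # link back onto the current path
                return True
            if seen == WHITE and visit(target):
                return True
        state[node] = BLACK
        return False

    return any(state.get(n, WHITE) == WHITE and visit(n) for n in links)
```

A set of links for which has_cycle returns False satisfies the
acyclic property for that relationship.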
Most relationships (except where noted) are between the objects
themselves rather than the subjects of the objects. Objects may be
documents, images, or people (with mailto: URIs, for example).
USEINDEX
B is a related index for a search by a user reading this document
who asks for an index search function.
A document may have any number of index links, causing several
indexes to be searched in a client-defined manner.
B must support SEARCH operations under its access protocol.
USEGLOSSARY
B is an index which should be used to resolve glossary queries in
the document. (Typically, a double-click on a word which is not
within an anchor).
A document may have any number of glossary links.
ANNOTATION
The information in B is additional to and subsidiary to that in A.
Annotation is used by one person to write the equivalent of "margin
notes" or other criticism on another's document, for example.
Example: The relationship between a newsgroup and its articles.
Acyclic.
REPLY
Similar to Annotation, but there is no suggestion that B is
subsidiary to A: A and B are on equal footings.
Example: The relationship between a mail message and its reply, a
news article and its reply.
Acyclic.
EMBED
If this link is followed, the node at the end of it is embedded
into the display of the source document.
Acyclic.
PRECEDES
In an ordered structure defined by the author, A precedes B; that
is, B follows A.
Acyclic.
A document may have only one link of this relationship, and/or
one link of the reverse relationship.
Note: May be used to control navigational aids, generate printed
material, etc. In conjunction with "subdocument", may be used to
define a tree such as a printed book made of hypertext documents.
The document can only have one such tree.
SUBDOCUMENT
B is a lower part of the author's hierarchy than A. Acyclic. See
also Precedes.
PRESENT
Whenever A is presented, B must also be presented. This implies
that whenever A is retrieved, B must also be retrieved.
SEARCH
When the link is followed, the node B should be searched rather
than presented. That is, where the client software allows it, the
user should immediately be presented with a search panel and
prompted for text. The search is then performed without an
intermediate retrieval or presentation of the node B.
SUPERSEDES
B is a previous version of A.
Acyclic.
HISTORY
B is a list of versions of A.
A reverse link must exist from B to A and to all other known
versions of A.
MADE
The person (etc.) described by node A is the author of B.
This information can be used for protection, for informing authors
of interest, for sending mail to authors, etc.
OWNS
The owner of an object carries responsibility for and authority
over the object.
This information may be used for finding people responsible for
incorrect information, etc. The creator (Made) and owner (Owns) of
an object may be different.
APPROVES
Approval of objects is a method of attributing value, or
reliability, to objects. One determination of the value of an
object is a function of the set of objects which approve it.
A reviewed journal, for example, may operate by approving articles.
This could be expressed by an approval link from the journal
itself to the article. In the view of the web as an
encyclopaedia, approval links allow one to filter information which
has a certain quality according to some standard.
SUPPORTS
A and B are objects representing assertions. The assertion A
supports the assertion B. This may be used to overlay a weak
semantics of argument onto the web. For example, giving such
relationships within discussions in this way will allow arguments
to be analysed by machine and followed by people with greater ease.
See also: refutes.
REFUTES
This is the opposite of "Supports", indicating that A is a
proposition which refutes proposition B.
INCLUDES
A includes B; B is part of A. For example, a person described by
document B is a part of the group described by document A.
Note: This relationship conveys semantics about objects described
by objects, rather than the documents themselves.
Acyclic.
INTERESTED
Person (etc) described by A is interested in node B.
This information can be used for notification of changes.
Typically, this is a request that, when object B changes in some
way, a new link is made to object A.
The phrase "object B changes" may be interpreted narrowly (as "B
itself changes") or widely (as "B or anything linked to it or
related to it closely changes"). The amount of change considered
worth notifying people about is also subject to interpretation,
varying from bit changes in the source to a "new edition" statement
by the publisher.
REGISTRATION AUTHORITY
The HTML Registration Authority is responsible for maintaining
lists of:
Relationship names for link and anchor elements
It is proposed that a WWW consortium, the Internet Assigned Numbers
Authority, or their successors take this role.
Unregistered values may be used for experimental purposes if they
start with "X-".
ACKNOWLEDGEMENTS
The HTML document type was designed initially at CERN in 1990 for
the World-Wide Web project. The DTD was written, and the
specification tightened up, by Dan Connolly. After much discussion
on the network and some enhancement, in particular the addition of
inline images introduced by the NCSA "Mosaic" software for WWW, it
was released as an Internet draft in 1993.
This version of the specification follows from certain minor
changes made at the WWW Wizard's Workshop in Cambridge, Mass., in
July 1993, in particular the introduction of
,
,  .
Tim BL
REFERENCES
SGML    ISO 8879:1986, Information Processing -- Text and Office
        Systems -- Standard Generalized Markup Language (SGML).

sgmls   An SGML parser by James Clark, derived from the ARCSGML
        parser materials which were written by Charles F. Goldfarb.
        The source is available on the ifi.uio.no FTP server in the
        directory /pub/SGML/SGMLS.

WWW     The World-Wide Web, a global information initiative. For
        bootstrap information, telnet info.cern.ch or find
        documents by ftp://info.cern.ch/pub/www/doc

URL     Universal Resource Locators. RFCxxx. Currently available by
        anonymous FTP from info.cern.ch in
        /pub/www/doc/url*.{ps,txt}
AUTHORS' ADDRESSES
This document was prepared with the help and advice of many people
across the net. Dan Connolly prepared the DTD and the section on
HTML and SGML whilst with Convex Computer Corporation of 3000
Waterview Parkway, Richardson, TX 75083. He is now with Atrium
Technologies, Inc., and is not a current editor of the document.
Tim Berners-Lee
Address: CERN
1211 Geneva 23
Switzerland
Telephone: +41(22)767 3755
Fax: +41(22)767 7155
email: timbl@info.cern.ch
Daniel Connolly
Address: Atrium Technologies, Inc.
5000 Plaza on the Lake, Suite 275
Austin, TX 78746
USA
email: connolly@atrium.com