
   1
   2
   3
   4 Network Working Group                                       F. Yergeau
   5 Internet Draft                                                G. Nicol
   6 <draft-ietf-html-i18n-04.txt>                                 G. Adams
   7 Expires 2 December 1996                                      M. Duerst
   8                                                            27 May 1996
   9
  10
  11          Internationalization of the Hypertext Markup Language
  12
  13
  14 Status of this Memo
  15
  16    This document is an Internet-Draft.  Internet-Drafts are working doc-
  17    uments of the Internet Engineering Task Force (IETF), its areas, and
  18    its working groups. Note that other groups may also distribute work-
  19    ing documents as Internet-Drafts.
  20
  21    Internet-Drafts are draft documents valid for a maximum of six
  22    months. Internet-Drafts may be updated, replaced, or obsoleted by
  23    other documents at any time.  It is not appropriate to use Internet-
  24    Drafts as reference material or to cite them other than as a "working
  25    draft" or "work in progress".
  26
  27    To learn the current status of any Internet-Draft, please check the
  28    1id-abstracts.txt listing contained in the Internet-Drafts Shadow
  29    Directories on ds.internic.net (US East Coast), nic.nordu.net
  30    (Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific
  31    Rim).
  32
  33    Distribution of this document is unlimited.  Please send comments to
  34    the HTML working group (HTML-WG) of the Internet Engineering Task
  35    Force (IETF) at <html-wg@w3.org>. Subscription address is <html-wg-
  36    request@w3.org>. Discussions of the group are archived at
  37    <URL:http://www.acl.lanl.gov/HTML_WG/archives.html>.
  38
  39
  40 Abstract
  41
  42    The Hypertext Markup Language (HTML) is a simple markup language used
  43    to create hypertext documents that are platform independent.  Ini-
  44    tially, the application of HTML on the World Wide Web was seriously
  45    restricted by its reliance on the ISO-8859-1 coded character set,
  46    which is appropriate only for Western European languages.  Despite
  47    this restriction, HTML has been widely used with other languages,
  48    using other coded character sets or character encodings, at the
  49    expense of interoperability.
  50
  51    This document is meant to address the issue of the
  52
  53
  54
  55                          Expires 2 December 1996        [Page 1]
  56 \f
  57 Internet Draft          HTML internationalization            27 May 1996
  58
  59
  60    internationalization (i18n, i followed by 18 letters followed by n)
  61    of HTML by extending the specification of HTML and giving additional
  62    recommendations for proper internationalization support.  A foremost
  63    consideration is to make sure that HTML remains a valid application
  64    of SGML, while enabling its use in all languages of the world.
  65
  66
  67 Table of contents
  68
   69    1.  Introduction
   70      1.1. Scope
   71      1.2. Conformance
   72    2. The document character set
   73      2.1. Reference processing model
   74      2.2. The document character set
   75      2.3. Undisplayable characters
   76    3. The LANG attribute
   77    4. Additional entities, attributes and elements
   78      4.1. Full Latin-1 entity set
   79      4.2. Markup for language-dependent presentation
   80    5. Forms
   81      5.1. DTD additions
   82      5.2. Form submission
   83    6. Miscellaneous
   84    7. HTML public text
   85      7.1. HTML DTD
   86      7.2. SGML declaration for HTML
   87      7.3. ISO Latin 1 character entity set
   88    Bibliography
   89    Authors' Addresses
  90
  91
  92 1.  Introduction
  93
  94    The Hypertext Markup Language (HTML) is a simple markup language used
  95    to create hypertext documents that are platform independent.  Ini-
  96    tially, the application of HTML on the World Wide Web was seriously
  97    restricted by its reliance on the ISO-8859-1 coded character set,
  98    which is appropriate only for Western European languages.  Despite
  99    this restriction, HTML has been widely used with other languages,
 100    using other coded character sets or character encodings, through var-
 101    ious ad hoc extensions to the language [TAKADA].
 102
 103    This document is meant to address the issue of the internationaliza-
 104    tion of HTML by extending the specification of HTML and giving addi-
 105    tional recommendations for proper internationalization support.  It
 106    is in good part based on a paper by one of the authors on multilin-
 107    gualism on the WWW [NICOL].  A foremost consideration is to make sure
 116    that HTML remains a valid application of SGML, while enabling its use
 117    in all languages of the world.
 118
 119    The specific issues addressed are the SGML document character set to
 120    be used for HTML, the proper treatment of the charset parameter asso-
 121    ciated with the "text/html" content type and the specification of
 122    some additional elements and entities.
 123
 124
 125 1.1 Scope
 126
 127    HTML has been in use by the World-Wide Web (WWW) global information
 128    initiative since 1990.  This specification extends the capabilities
 129    of HTML 2.0 (RFC 1866), primarily by removing the restriction to the
 130    ISO-8859-1 coded character set [ISO-8859-1].
 131
 132    HTML is an application of ISO Standard 8879:1986, Information Pro-
 133    cessing Text and Office Systems -- Standard Generalized Markup Lan-
 134    guage (SGML) [ISO-8879]. The HTML Document Type Definition (DTD) is a
 135    formal definition of the HTML syntax in terms of SGML.  This specifi-
 136    cation amends the DTD of HTML in order to make it applicable to docu-
 137    ments encompassing a character repertoire much larger than that of
 138    ISO-8859-1, while still remaining SGML conformant.
 139
 140    Both formal and actual development of HTML are advancing very fast.
 141    The features described in this document are designed so that they can
 142    (and should) be added to other forms of HTML besides that described
 143    in RFC 1866. Where indicated, attributes introduced here should be
 144    extended to the appropriate elements.
 145
 146
 147 1.2 Conformance
 148
 149    This specification changes slightly the conformance requirements of
 150    HTML documents and HTML user agents.
 151
 152 1.2.1 Documents
 153
 154    All HTML 2.0 conforming documents remain conforming with this speci-
  155    fication.  However, the extensions introduced here make valid
  156    certain documents that would not be HTML 2.0 conforming, in particular
 157    those containing characters or character references outside of the
 158    repertoire of ISO 8859-1, and those containing markup introduced
 159    herein.
 160
  172 1.2.2 User agents
 173
 174    In addition to the requirements of RFC 1866, the following require-
 175    ments are placed on HTML user agents.
 176
 177       To ensure interoperability and proper support for at least
 178       ISO-8859-1 in an environment where character encoding schemes
 179       other than ISO-8859-1 are present, user agents must correctly
 180       interpret the charset parameter accompanying an HTML document
 181       received from the network.
 182
 183       Furthermore, conforming user-agents are required to at least parse
 184       correctly all numeric character references within the range of ISO
 185       10646-1 [ISO-10646].
 186
 187       Conforming user-agents are required to apply the BIDI presentation
 188       algorithm if they display right-to-left characters.  If there is
 189       no displayable right-to-left character in a document, there is no
 190       need to apply BIDI processing.
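
   The last requirement can be illustrated with a minimal sketch in
   Python.  It assumes that "right-to-left character" means a character
   whose Unicode bidirectional category is R or AL, and it does not
   check whether a font can actually display the character.  The name
   needs_bidi is hypothetical.

```python
import unicodedata

# Bidirectional categories that Unicode assigns to right-to-left characters:
# "R" (for example Hebrew letters) and "AL" (Arabic letters).
RTL_CATEGORIES = {"R", "AL"}


def needs_bidi(text: str) -> bool:
    """Return True if the text contains at least one right-to-left character."""
    return any(unicodedata.bidirectional(ch) in RTL_CATEGORIES for ch in text)
```

   A user agent following this approach could skip BIDI processing
   whenever needs_bidi returns False for the whole document.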
 191
 192 2. The document character set
 193
 194 2.1. Reference processing model
 195
 196    This overview explains a reference processing model used for HTML,
 197    and in particular the SGML concept of a document character set. An
 198    actual implementation may widely differ in its internal workings from
 199    the model given below, but should behave as described to an outside
 200    observer.
 201
 202    Because there are various widely differing encodings of text, SGML
 203    does not directly address the question of how characters are encoded
 204    e.g. in a file. SGML views the characters as a single set (called a
 205    "character repertoire"), and a "code set" that assigns an integer
 206    number (known as "character number") to each character in the reper-
 207    toire.  The document character set declaration defines what each of
 208    the character numbers represents [GOLD90, p. 451].  In most cases, an
 209    SGML DTD and all documents that refer to it have a single document
 210    character set, and all markup and data characters are part of this
 211    set.
 212
 213    HTML, as an application of SGML, does not directly address the ques-
 214    tion of how characters are encoded as octets in external representa-
 215    tions such as files. This is deferred to mechanisms external to HTML,
 216    such as MIME as used by the HTTP protocol or by electronic mail.
 217
 218    For the HTTP protocol [RFC1945], the way characters are encoded is
 228    defined by the "charset" parameter[1] of the "Content-Type" field of
 229    the header of an HTTP response. For example, to indicate that the
 230    transmitted document is encoded in the "JIS" encoding of Japanese
 231    [RFC1468], the header will contain the following line:
 232
 233    Content-Type: text/html; charset=ISO-2022-JP
 234
 235    The HTTP protocol also defines a mechanism for the client to specify
 236    the character encodings it can accept. Clients and servers are
 237    strongly requested to use these mechanisms to assure correct trans-
 238    mission and interpretation of any document. Provisions that can be
 239    taken to help correct interpretation, even in cases where a server or
  240    client does not yet use these mechanisms, are described in section 6.
 241
 242    Similarly, if HTML documents are transferred by electronic mail, the
 243    character encoding is defined by the "charset" parameter of the "Con-
 244    tent-Type" MIME header line [RFC1521], and defaults to US-ASCII in
 245    its absence.
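
   As an illustration only, the following Python sketch extracts the
   charset parameter from a Content-Type field value and uses it to
   decode the received octets.  It relies on the standard email package
   to parse the field; the helper name charset_of and the sample body
   are assumptions made for this example.  The US-ASCII default is the
   one described above for electronic mail.

```python
from email.message import Message


def charset_of(content_type: str, default: str = "us-ascii") -> str:
    """Return the charset parameter of a Content-Type value, lowercased."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(failobj=default)


charset = charset_of("text/html; charset=ISO-2022-JP")

# Octets as they might arrive from the network (three Japanese characters).
body = "<P>\u65e5\u672c\u8a9e</P>".encode("iso-2022-jp")
text = body.decode(charset)
```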
 246
  247    If any other ways of transferring and storing HTML documents
 248    are defined or become popular, it is advised that similar provisions
 249    be made to clearly identify the character encoding used and/or to use
 250    a single/default encoding capable of representing the widest range of
 251    characters used in an international context.
 252
 253    Whatever the external character encoding may be, the reference pro-
 254    cessing model translates it to a representation of the document char-
 255    acter set specified in Section 2.2 before processing specific to
 256    SGML/HTML.  The reference processing model can be depicted as fol-
 257    lows:
 258
 259      [resource]->[decoder]->[entity ]->[ SGML ]->[application]->[display]
 260                             [manager]  [parser]
 261                                 ^          |
 262                                 |          |
 263                                 +----------+
 264
 265    The decoder is responsible for decoding the external representation
 266    of the resource to a representation using the document character set.
 267    The entity manager, the parser, and the application deal only with
 268    characters of the document character set.  A display-oriented part of
 269    the application or the display machinery itself may again convert
  284    characters represented in the document character set to some other
  285    representation more suitable for their purpose. In any case, the
  286    entity manager, the parser, and the application, as far as character
  287    semantics are concerned, are using the HTML document character set
  288    only.
  276
  270 -----------
  271   1 The term "charset" in MIME is used to designate a char-
  272 acter encoding, rather than a coded character set as the
  273 term may suggest.  A character encoding is a mapping (possi-
  274 bly many-to-one) of a sequence of octets to a sequence of
  275 characters taken from one or more character repertoires.
 289
 290    An actual implementation may choose, or not, to translate the docu-
 291    ment into some encoding of the document character set as described
 292    above; the behaviour described by this reference processing model can
 293    be achieved otherwise.  This subject is well out of the scope of this
 294    specification, however, and the reader is invited to consult the SGML
 295    standard [ISO-8879] or an SGML handbook [BRYAN88] [GOLD90] [VANH90]
 296    [SQ91] for further information.
 297
 298    The most important consequence of this reference processing model is
 299    that numeric character references are always resolved with respect to
 300    the fixed document character set, and thus to the same characters,
 301    whatever the external encoding actually used. For an example, see
 302    Section 2.2.
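
   The consequence described above can be sketched in Python, whose
   strings are sequences of Unicode code points and so can stand in for
   the document character set.  The function name decode_and_resolve is
   hypothetical, only decimal references are handled, and Python itself
   cannot represent character numbers above 1114111.

```python
import re

NCR = re.compile(r"&#([0-9]+);")


def decode_and_resolve(octets: bytes, charset: str) -> str:
    """Decode to the document character set, then resolve numeric references."""
    text = octets.decode(charset)  # decoder stage
    return NCR.sub(lambda m: chr(int(m.group(1))), text)  # parser stage


# The same markup, transmitted in two different external encodings.
markup = "\u041c&#1048;\u0420"
as_koi8 = markup.encode("koi8-r")
as_iso5 = markup.encode("iso-8859-5")
```

   Although the two octet sequences differ, both resolve &#1048; to the
   same character, because the reference is interpreted only after
   decoding.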
 303
 304 2.2. The document character set
 305
 306    The document character set, in the SGML sense, is the Universal Char-
 307    acter Set (UCS) of ISO 10646:1993 [ISO-10646], as amended.  Cur-
 308    rently, this is code-by-code identical with the Unicode standard,
 309    version 1.1 [UNICODE].
 310
 311         NOTE -- implementers should be aware that ISO 10646 is
 312         amended from time to time; 4 amendments have been adopted
 313         since the initial 1993 publication, none of which signifi-
 314         cantly affects this specification.  A fifth amendment, now
 315         under consideration, will introduce incompatible changes to
  316         the standard: 6656 Korean Hangul syllables allocated
 317         between code positions 3400 and 4DFF (hexadecimal) will be
 318         moved to new positions (and 4516 new syllables added), thus
 319         making references to the old positions invalid.  Since the
 320         Unicode consortium has already adopted a corresponding
 321         amendment for inclusion in the forthcoming Unicode 2.0,
 322         adoption of DAM 5 is considered likely and implementers
 323         should probably consider the old code positions as already
 324         invalid.  Despite this one-time change, the relevant stan-
 325         dard bodies appear to remain committed not to change any
 326         allocated code position in the future.  To encode Korean
 327         Hangul irrespective of these changes, the combining Hangul
  328         Jamo in the range 1100-11F9 can be used.
 329
 330    The adoption of this document character set implies a change in the
 331    SGML declaration specified in the HTML 2.0 specification (section 9.5
 340    of [RFC1866]).  The change amounts to removing the first BASESET
 341    specification and its accompanying DESCSET declaration, replacing
 342    them with the following declaration:
 343
 344      BASESET "ISO Registration Number 177//CHARSET
 345               ISO/IEC 10646-1:1993 UCS-4 with implementation level 3
 346               //ESC 2/5 2/15 4/6"
 347      DESCSET  0   9     UNUSED
 348               9   2     9
 349               11  2     UNUSED
 350               13  1     13
 351               14  18    UNUSED
 352               32  95    32
 353               127 1     UNUSED
 354               128 32    UNUSED
 355               160 2147483486 160
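
   One way to read the DESCSET above is as a table of ranges, each
   either mapped to the base set or marked UNUSED.  The Python sketch
   below transcribes that table and tests whether a given character
   number is assigned.  The names DESCSET and is_valid_char_number are
   illustrative and not part of any SGML tool.

```python
# (first character number, count, base number or None for UNUSED),
# transcribed from the DESCSET declaration above.
DESCSET = [
    (0, 9, None),
    (9, 2, 9),
    (11, 2, None),
    (13, 1, 13),
    (14, 18, None),
    (32, 95, 32),
    (127, 1, None),
    (128, 32, None),
    (160, 2147483486, 160),
]


def is_valid_char_number(n: int) -> bool:
    """Return True if character number n is assigned (not UNUSED)."""
    for first, count, base in DESCSET:
        if first <= n < first + count:
            return base is not None
    return False
```

   The last range starts at 160 and holds 2147483486 numbers, so the
   highest valid character number is 160 + 2147483486 - 1 = 2147483645.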
 356
 357    Making the UCS the document character set does not create non-
 358    conformance of any expression, construct or document that is conform-
 359    ing to HTML 2.0.  It does make conforming certain constructs that are
 360    not admissible in HTML 2.0.  One consequence is that data characters
 361    outside the repertoire of ISO-8859-1, but within that of UCS-4 become
 362    valid SGML characters.  Another is that the upper limit of the range
 363    of numeric character references is extended from 255 to 2147483645;
 364    thus, &#1048; is a valid reference to a "CYRILLIC CAPITAL LETTER I".
 365    [ERCS] is a good source of information on Unicode and SGML, although
 366    its scope and technical content differ greatly from this specifica-
 367    tion.
 368
 369         NOTE -- the above SGML declaration, like that of HTML 2.0,
 370         specifies the character numbers 128 to 159 (80 to 9F hex)
 371         as UNUSED.  This means that numeric character references
  372         within that range (e.g.  &#146;) are illegal in HTML.  Nei-
  373         ther ISO 8859-1 nor ISO 10646 contains characters in that
 374         range, which is reserved for control characters.
 375
 376    ISO 10646-1:1993 is the most encompassing character set currently
 377    existing, and there is no other character set that could take its
 378    place as the document character set for HTML. If nevertheless for a
 379    specific application there is a need to use characters outside this
 380    standard, this should be done by avoiding any conflicts with present
 381    or future versions of ISO 10646, i.e. by assigning these characters
 382    to a private zone. Also, it should be borne in mind that such a use
 383    will be highly unportable; in many cases, it may be better to use
 384    inline bitmaps.
 385
 396 2.3. Undisplayable characters
 397
 398    With the document character set being the full ISO 10646, the possi-
 399    bility that a character cannot be displayed due to lack of appropri-
 400    ate resources (fonts) cannot be avoided. Because there are many dif-
 401    ferent things that can be done in such a case, this document does not
 402    prescribe any specific behaviour. Depending on the implementation,
  403    this may also be handled by the underlying display system and not
 404    the application itself.  The following considerations, however, may
 405    be of help:
 406
 407    -  A clearly visible, but unobtrusive behaviour should be preferred.
  408       Some documents may contain many characters that cannot be
  409       rendered, and so showing an alert for each of them is not the right
 410       thing to do.
 411
 412    -  In case a numeric representation of the missing character is
 413       given, its hexadecimal (not decimal) form is to be preferred,
 414       because this form is used in character set standards [ERCS].
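
   As a small illustration of the second consideration, the Python
   sketch below formats a missing character as its hexadecimal code
   position.  The bracketed notation and the name placeholder are
   assumptions; this document does not prescribe any particular form.

```python
def placeholder(ch: str) -> str:
    """Format an undisplayable character as a hexadecimal code position."""
    return f"[{ord(ch):04X}]"
```

   For CYRILLIC CAPITAL LETTER I (decimal 1048) this yields "[0418]",
   which matches the way the character is listed in code charts.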
 415
 416 3. The LANG attribute
 417
 418    Language tags can be used to control rendering of a marked up docu-
 419    ment in various ways: glyph disambiguation, in cases where the char-
 420    acter encoding is not sufficient to resolve to a specific glyph; quo-
 421    tation marks; hyphenation; ligatures; spacing; voice synthesis; etc.
 422    Independently of rendering issues, language markup is useful as con-
 423    tent markup for purposes such as classification and searching.
 424
 425    Since any text can logically be assigned a language, almost all HTML
 426    elements admit the LANG attribute.  The DTD reflects this.  It is
 427    also intended that any new element introduced in later versions of
 428    HTML will admit the LANG attribute, unless there is a good reason not
 429    to do so.
 430
 431    The language attribute, LANG, takes as its value a language tag that
 432    identifies a natural language spoken, written, or otherwise conveyed
 433    by human beings for communication of information to other human
 434    beings. Computer languages are explicitly excluded.
 435
  436    The syntax and registry of HTML language tags are the same as those
  437    defined by RFC 1766 [RFC1766]. In summary, a language tag is composed
  438    of one or more parts: a primary language tag and a possibly empty
 439    series of subtags:
 440
 441         language-tag  = primary-tag *( "-" subtag )
 442         primary-tag   = 1*8ALPHA
 443         subtag        = 1*8ALPHA
 444
 452    Whitespace is not allowed within the tag and all tags are case-
 453    insensitive. The namespace of language tags is administered by the
 454    IANA. Example tags include:
 455
 456        en, en-US, en-cockney, i-cherokee, x-pig-latin
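
   The grammar above translates directly into a regular expression.
   The following Python sketch checks only the syntax of a tag, not
   whether it is registered with the IANA; the name is_language_tag is
   hypothetical.

```python
import re

# primary-tag *( "-" subtag ), each part being one to eight letters.
LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z]{1,8})*")


def is_language_tag(value: str) -> bool:
    """Return True if value matches the language-tag grammar."""
    return LANGUAGE_TAG.fullmatch(value) is not None
```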
 457
 458    In the context of HTML, a language tag is not to be interpreted as a
 459    single token, as per RFC 1766, but as a hierarchy. For example, a
 460    user agent that adjusts rendering according to language should con-
 461    sider that it has a match when a language tag in a style sheet entry
 462    matches the initial portion of the language tag of an element. An
 463    exact match should be preferred. This interpretation allows an ele-
 464    ment marked up as, for instance, "en-US" to trigger styles corre-
 465    sponding to, in order of preference, US-English ("en-US") or 'plain'
 466    or 'international' English ("en").
 467
 468         NOTE -- using the language tag as a hierarchy does not
 469         imply that all languages with a common prefix will be
 470         understood by those fluent in one or more of those lan-
 471         guages; it simply allows the user to request this commonal-
 472         ity when it is true for that user.
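
   The hierarchical interpretation can be sketched as a prefix match on
   whole subtags.  The Python function below is one possible reading,
   not a normative algorithm: it compares tags case-insensitively and
   returns the longest style sheet entry that matches the initial
   portion of the element's tag, so that an exact match wins when one
   exists.  The name best_match is hypothetical.

```python
from typing import Iterable, Optional


def best_match(element_lang: str, style_langs: Iterable[str]) -> Optional[str]:
    """Return the style sheet language tag that best matches element_lang."""
    wanted = element_lang.lower().split("-")
    best = None
    best_len = 0
    for candidate in style_langs:
        parts = candidate.lower().split("-")
        # A candidate matches if it equals the leading subtags of the element.
        if len(parts) > best_len and wanted[: len(parts)] == parts:
            best, best_len = candidate, len(parts)
    return best
```

   Matching on whole subtags rather than on raw string prefixes keeps a
   tag such as "eng" from matching an entry for "en".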
 473
 474    The rendering of elements may be affected by the LANG attribute.  For
 475    any element, the value of the LANG attribute overrides the value
 476    specified by the LANG attribute of any enclosing element and the
 477    value (if any) of the HTTP Content-Language header. If none of these
 478    are set, a suitable default, perhaps controlled by user preferences,
 479    by automatic context analysis or by the user's locale, should be used
 480    to control rendering.
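
   The precedence just described can be sketched as follows.  The
   function assumes the caller supplies the LANG values found on the
   element and on each enclosing element, innermost first, with None
   where the attribute is absent.  The name effective_lang and the
   fallback value "en" are arbitrary choices for this example; a real
   default would come from user preferences, context analysis or the
   user's locale.

```python
from typing import Optional, Sequence


def effective_lang(
    lang_chain: Sequence[Optional[str]],
    content_language: Optional[str] = None,
    default: str = "en",
) -> str:
    """Resolve the language that applies to an element.

    lang_chain lists LANG values from the element outward to the root.
    """
    for lang in lang_chain:
        if lang:
            return lang  # the innermost LANG attribute wins
    return content_language or default
```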
 481
 482 4. Additional entities, attributes and elements
 483
 484 4.1. Full Latin-1 entity set
 485
 486    According to the suggestion of section 14 of [RFC1866], the set of
 487    Latin-1 entities is extended to cover the whole right part of
 488    ISO-8859-1 (all code positions with the high-order bit set), includ-
 489    ing the already commonly used &nbsp;, &copy; and &reg;.  The names of
 490    the entities are taken from the appendices of SGML [ISO-8879].  A
 491    list is provided in section 7.3 of this specification.
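
   For illustration, the entity table shipped with Python's standard
   html.entities module can be filtered to the right part of
   ISO-8859-1.  That table reflects later versions of HTML rather than
   the list in section 7.3, so it is used here only as a convenient
   stand-in.

```python
from html.entities import name2codepoint

# Entities whose code position has the high-order bit set in ISO-8859-1.
latin1_entities = {
    name: code for name, code in name2codepoint.items() if 160 <= code <= 255
}
```

   In this table &nbsp;, &copy; and &reg; map to code positions 160,
   169 and 174 respectively.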
 492
 493 4.2. Markup for language-dependent presentation
 494
 495
 496 4.2.1. Overview
 497
 498    For the correct presentation of text in certain languages (irrespec-
 499    tive of formatting issues), some support in the form of additional
 508    entities and elements is needed.
 509
 510    In particular, the following features are dealt with:
 511
 512    -  Markup of bidirectional text, i.e. text where left-to-right and
 513       right-to-left scripts are mixed.
 514
 515    -  Control of cursive joining behaviour in contexts where the default
 516       behaviour is not appropriate.
 517
 518    -  Language-dependent rendering of short (in-line) quotations.
 519
 520    -  Better justification control for languages where this is impor-
 521       tant.
 522
 523    -  Superscripts and subscripts for languages where they appear as
 524       part of general text.
 525
 526    Some of the above features need very little additional support; oth-
 527    ers need more. The additional features are introduced below with
 528    brief comments only. Explanations on cursive joining behaviour and
 529    bidirectional text follow later.  For cursive joining behaviour and
 530    bidirectional text, this document follows [UNICODE] in that: i) char-
 531    acter semantics, where applicable, are identical to [UNICODE], and
 532    ii) where functionality is moved to HTML as a higher level protocol,
 533    this is done in a way that allows straightforward conversion to the
 534    lower-level mechanisms defined in [UNICODE].
 535
 536
 537 4.2.2. List of entities, elements, and attributes
 538
 539    First, a generic container is needed to carry the LANG and DIR (see
 540    below) attributes in cases where no other element is appropriate; the
 541    SPAN element is introduced for that purpose.
 542
 543    A set of named character entities is added for use with bidirectional
 544    rendering and cursive joining control:
 545
 546    <!ENTITY zwnj CDATA "&#8204;"--=zero width non-joiner-->
 547    <!ENTITY zwj  CDATA "&#8205;"--=zero width joiner-->
 548    <!ENTITY lrm  CDATA "&#8206;"--=left-to-right mark-->
 549    <!ENTITY rlm  CDATA "&#8207;"--=right-to-left mark-->
 550
 551    These entities can be used in place of the corresponding formatting
 552    characters whenever convenient, for example to ease keyboard entry or
 553    when a formatting character is not available in the character encod-
 554    ing of the document.
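
   The decimal values in these declarations correspond to the Unicode
   formatting characters U+200C through U+200F.  The short Python
   sketch below makes the correspondence explicit by looking up the
   character names; the dictionary name FORMATTING_ENTITIES and the
   helper expand are illustrative.

```python
import unicodedata

# Entity name -> character number, as declared above.
FORMATTING_ENTITIES = {"zwnj": 8204, "zwj": 8205, "lrm": 8206, "rlm": 8207}


def expand(name: str) -> str:
    """Return the formatting character that a named entity stands for."""
    return chr(FORMATTING_ENTITIES[name])


names = {name: unicodedata.name(expand(name)) for name in FORMATTING_ENTITIES}
```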
 555
 564    Next, an attribute called DIR is introduced, restricted to the values
 565    LTR (left-to-right) and RTL (right-to-left) and admitted by most ele-
 566    ments, for the indication of directionality in the context of bidi-
 567    rectional text (see 4.2.4 below for details).  Since any text and
 568    many other elements (e.g. tables) can logically be assigned a direc-
 569    tionality, almost all HTML elements admit the DIR attribute.  The DTD
 570    reflects this.  It is also intended that any new element introduced
 571    in later versions of HTML will admit the DIR attribute, unless there
 572    is a good reason not to do so.
 573
 574    A new element called BDO (BIDI Override) is introduced, which
 575    requires the DIR attribute to specify whether the override is left-
 576    to-right or right-to-left.  This element is required for bidirec-
 577    tional text control; for detailed explanations, see section 4.2.4.
 578
 579    The <Q> element is introduced to allow language-dependent rendering
 580    of short quotations depending on language and platform capability.
 581    As the following examples show, in particular the quotation marks
 582    surrounding the quotation are affected: "a quotation in English",
  583    `another, slightly better one', ,,a quotation in German'', << a quo-
  584    tation in French >>.  The contents of the <Q> element do not
  585    include quotation marks; they have to be added by the rendering pro-
  586    cess.
 587
 588         NOTE -- <Q> elements can be nested. Many languages use dif-
 589         ferent quotation styles for outer and inner quotations, and
 590         this should be respected by user-agents implementing this
 591         element.
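
   A rendering process along these lines might be sketched as follows.
   The quotation mark table is purely illustrative: the outer marks
   reuse the ASCII approximations from the examples above, while the
   inner marks are assumptions made for this sketch.  A <Q> element is
   modelled as a list whose items are strings or nested lists, and the
   name render_q is hypothetical.

```python
# Illustrative marks only: (outer open, outer close, inner open, inner close).
QUOTES = {
    "en": ('"', '"', "`", "'"),
    "de": (",,", "''", ",", "'"),
    "fr": ("<< ", " >>", '"', '"'),
}


def render_q(content, lang="en", depth=0):
    """Render a <Q> element, adding marks that depend on language and nesting."""
    primary = lang.lower().split("-")[0]
    marks = QUOTES.get(primary, QUOTES["en"])
    # Alternate between outer and inner marks at each nesting level.
    open_q, close_q = marks[0:2] if depth % 2 == 0 else marks[2:4]
    inner = "".join(
        item if isinstance(item, str) else render_q(item, lang, depth + 1)
        for item in content
    )
    return open_q + inner + close_q
```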
 592
 593    Many languages require superscripts for proper rendering: as an exam-
 594    ple, the French "Mlle Dupont" should have "lle" in superscript.  The
 595    <SUP> element, and its sibling <SUB>, are introduced to allow proper
 596    markup of such text.  <SUP> and <SUB> contents are restricted to
 597    PCDATA to avoid nesting problems.
 598
 599    Finally, in many languages text justification is much more important
 600    than it is in Western languages, and justifies markup.  The ALIGN
 601    attribute, admitting values of LEFT, RIGHT, CENTER and JUSTIFY, is
 602    added to a selection of elements where it makes sense (block-like).
 603    If a user-agent chooses to have LEFT as a default for blocks of left-
 604    to-right directionality, it should use RIGHT for blocks of right-to-
 605    left directionality.
 606
 607    In the DTD, the LANG and DIR attributes are grouped together in a
 608    parameter entity called attrs.  In addition, the ID and CLASS
 609    attributes from RFC 1942 [RFC1942] were added to attrs, as was done
  610    in the latter. The ID and CLASS attributes are required for use with
 611    style sheets, and RFC 1942 defines them as follows:
 612
 620    ID      Used to define a document-wide identifier. This can be used
 621            for naming positions within documents as the destination of a
 622            hypertext link. It may also be used by style sheets for ren-
 623            dering an element in a unique style. An ID attribute value is
 624            an SGML NAME token. NAME tokens are formed by an initial let-
 625            ter followed by letters, digits, "-" and "." characters. The
 626            letters are restricted to A-Z and a-z.
 627
 628    CLASS   A space separated list of SGML NAME tokens. CLASS names spec-
 629            ify that the element belongs to the corresponding named
 630            classes. It allows authors to distinguish different roles
 631            played by the same tag. The classes may be used by style
 632            sheets to provide different renderings as appropriate to
 633            these roles.
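
   The NAME token rule quoted above can be checked with a regular
   expression.  This Python sketch covers only the character rules
   stated in the definition; it does not enforce any length limit that
   an SGML declaration may impose.  The names is_valid_id and
   parse_class are hypothetical.

```python
import re
from typing import List

# An initial letter followed by letters, digits, "-" and "." characters.
NAME_TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9.-]*")


def is_valid_id(value: str) -> bool:
    """Return True if value is a single NAME token."""
    return NAME_TOKEN.fullmatch(value) is not None


def parse_class(value: str) -> List[str]:
    """Split a CLASS value into NAME tokens, rejecting malformed names."""
    names = value.split()
    bad = [name for name in names if not NAME_TOKEN.fullmatch(name)]
    if bad:
        raise ValueError(f"not NAME tokens: {bad}")
    return names
```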
 634
 635 4.2.3. Cursive joining behaviour
 636
  637    Markup is needed in some cases to force cursive joining behaviour in
 638    contexts in which it would not normally occur, or to block it when it
 639    would normally occur.
 640
 641    The zero-width joiner and non-joiner (&zwj; and &zwnj;) are used to
 642    control cursive joining behaviour.  For example, ARABIC LETTER HEH is
 643    used in isolation to abbreviate "Hijri" (the Islamic calendrical sys-
 644    tem); however, the initial form of the letter is desired, because the
 645    isolated form of HEH looks like the digit five as employed in Arabic
 646    script.  This is obtained by following the HEH with a zero-width
 647    joiner whose only effect is to provide context.  In Persian texts,
 648    there are cases where a letter that normally would join a subsequent
 649    letter in a cursive connection does not.  Here a zero-width non-
 650    joiner is used.
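
        The HEH example can be sketched in Python as follows.  This is
        a minimal illustration, assuming the entity table shipped in
        Python's html.entities module; hijri_abbreviation is a
        hypothetical helper name, and whether the initial form is
        actually displayed depends on the renderer and font.

```python
from html.entities import name2codepoint

ZWNJ = chr(name2codepoint["zwnj"])  # U+200C, same value as &#8204;
ZWJ = chr(name2codepoint["zwj"])    # U+200D, same value as &#8205;
ARABIC_HEH = "\u0647"               # ARABIC LETTER HEH


def hijri_abbreviation():
    """HEH followed by ZWJ, which supplies joining context only."""
    return ARABIC_HEH + ZWJ
```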
 651
 652 4.2.4. Bidirectional text
 653
 654    Many languages are written in horizontal lines from left to right,
 655    while others are written from right to left.  When both writing
 656    directions are present, one talks of bidirectional text (BIDI for
 657    short). BIDI text requires markup in special circumstances where
 658    ambiguities as to the directionality of some characters have to be
 659    resolved.  This markup affects the ability to render BIDI text in a
 660    semantically legible fashion.  That is, without this special BIDI
 661    markup, cases arise which would prevent *any* rendering whatsoever
 662    that reflected the basic meaning of the text. Plain text may contain
 663    this markup (joining or BIDI) in the form of special-purpose charac-
 664    ters; in HTML, these are supplemented by SGML markup.
 665
 666    BIDI is a complex issue, and implementers are advised to consult
 667    appropriate documentation such as [UNICODE]. Here, explanations are
 676    given only as far as they are needed to understand the necessity of
 677    the features introduced and to define their exact semantics.
 678
 679    The Unicode BIDI algorithm is based on a logical sequence of text
 680    characters and works mainly by reference to the implicit directional-
 681    ity of characters (e.g. Hebrew and Arabic characters are specified to
 682    be rendered from right to left, etc.).
 683
 684    The left-to-right and right-to-left marks (&lrm; and &rlm;) are used
 685    to disambiguate directionality of neutral characters. For example,
 686    when a double quote sits between an Arabic and a Latin letter, its
 687    direction is ambiguous; if a directional mark is added on one side
 688    such that the quotation mark is surrounded by characters of only one
 689    directionality, the ambiguity is removed. These characters are like
 690    zero width spaces which have a directional property (but no word/line
 691    break property).
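
        The ambiguity described above can be made visible by listing
        the directional class of each character.  The following Python
        sketch assumes the Unicode character database bundled with the
        interpreter (module unicodedata); bidi_classes is a
        hypothetical helper name.

```python
import unicodedata

LRM = "\u200E"  # LEFT-TO-RIGHT MARK
RLM = "\u200F"  # RIGHT-TO-LEFT MARK


def bidi_classes(text):
    """Return the bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Arabic ALEF, a neutral double quote, then Latin A: the quote sits
# between a right-to-left and a left-to-right character.
ambiguous = bidi_classes("\u0627" + '"' + "A")

# With RLM after the quote, both neighbours are right-to-left.
resolved = bidi_classes("\u0627" + '"' + RLM + "A")
```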
 692
 693    Nested embeddings of contra-directional text runs, due to nested quo-
 694    tations or to the pasting of text from one BIDI context to another,
 695    are also a case where the implicit directionality of characters is
 696    not sufficient and markup is required.  Also, it is frequently
 697    desirable to specify the basic directionality of a block of text.
 698    For these purposes, the DIR attribute is used.
 699
 700    On block-type elements, the DIR attribute indicates the base direc-
 701    tionality of the text in the block; if omitted it is inherited from
 702    the parent element.  The default directionality of the overall HTML
 703    document is left-to-right.
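
        The inheritance rule for block-type elements amounts to taking
        the innermost DIR value that is present.  A minimal Python
        sketch follows; base_direction is a hypothetical name, and its
        input is assumed to be the DIR values collected from the
        outermost ancestor down to the block itself, with None where
        the attribute is omitted.

```python
def base_direction(dir_values):
    """Resolve the base directionality of a block-type element."""
    direction = "ltr"  # default directionality of the overall document
    for value in dir_values:
        if value is not None:
            direction = value.lower()  # innermost explicit DIR wins
    return direction
```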
 704
 705    On inline elements, it makes the element start a new embedding level
 706    (to be explained below); if omitted the inline element does not start
 707    a new embedding level.
 708
 709         NOTE -- the PRE, XMP and LISTING elements admit the DIR
 710         attribute, indicating that the contents should not be con-
 711         sidered as preformatted with respect to bidirectional lay-
 712         out. The BIDI algorithm still needs to be applied to each
 713         line of text.
 714
 715    Following is an example of a case where embedding is needed, showing
 716    its effect:
 717
 718         Given the following Latin (upper case) and Arabic (lower
 719         case) letters in backing store with the specified embed-
 720         dings:
 721
 722         <SPAN DIR=LTR> AB <SPAN DIR=RTL> xy <SPAN DIR=LTR> CD
 723         </SPAN> zw </SPAN> EF </SPAN>
 724
 732         One gets the following rendering (with [] showing the
 733         directional transitions):
 734
 735         [ AB [ wz [ CD ] yx ] EF ]
 736
 737         On the other hand, without this markup and with a base
 738         direction of LTR one gets the following rendering:
 739
 740         [ AB [ yx ] CD [ wz ] EF ]
 741
 742         Notice that yx is on the left and wz on the right unlike
 743         the above case where the embedding levels are used.  With-
 744         out the embedding markup one has at most two levels: a base
 745         directional level and a single counterflow directional
 746         level.
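
        The two renderings above can be reproduced with a toy model,
        sketched here in Python.  It is not the Unicode BIDI algorithm:
        it assumes, as the example does, that lower case stands for
        right-to-left letters, and it handles only whole runs.  The
        function name render and the nested tuple format are
        hypothetical.

```python
def render(node):
    """Return the visual order of a run or of a (direction, children) pair."""
    if isinstance(node, str):
        # Toy convention: a lower-case run is right-to-left, so its
        # letters appear reversed on a left-to-right display line.
        return node[::-1] if node.islower() else node
    direction, children = node
    parts = [render(child) for child in children]
    if direction == "rtl":
        parts.reverse()  # an RTL level lays out its runs right to left
    return " ".join(parts)


embedded = ("ltr", ["AB", ("rtl", ["xy", ("ltr", ["CD"]), "zw"]), "EF"])
flat = ("ltr", ["AB", "xy", "CD", "zw", "EF"])
```

        Here render(embedded) gives "AB wz CD yx EF" and render(flat)
        gives "AB yx CD wz EF", matching the two cases shown.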
 747
 748    The DIR attribute on inline elements is equivalent to the formatting
 749    characters LEFT-TO-RIGHT EMBEDDING (202A) and RIGHT-TO-LEFT EMBED-
 750    DING (202B) of ISO 10646.  The end tag of the element is equivalent
 751    to the POP DIRECTIONAL FORMATTING (202C) character.
 752
 753    Directional override, as provided by the <BDO> element, is needed to
 754    deal with unusual short pieces of text in which directionality cannot
 755    be resolved from context in an unambiguous fashion. For example, it
 756    can be used to force left-to-right (or right-to-left) display of part
 757    numbers composed of Latin letters, digits and Hebrew letters.
 758
 759    The effect of <BDO> is to force the directionality of all characters
 760    within it to the value of DIR, irrespective of their intrinsic direc-
 761    tional properties.  It is equivalent to using the LEFT-TO-RIGHT OVER-
 762    RIDE (202D) or RIGHT-TO-LEFT OVERRIDE (202E) characters of ISO 10646,
 763    the end tag again being equivalent to the POP DIRECTIONAL FORMATTING
 764    (202C) character.
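
        The stated equivalences between markup and formatting
        characters can be sketched as a conversion routine.  The Python
        below is an illustration under assumptions: it relies on the
        standard html.parser module, treats only SPAN, Q and BDO as
        inline carriers of DIR, and the names DirToControls and
        dir_markup_to_controls are hypothetical.

```python
from html.parser import HTMLParser

LRE, RLE, PDF = "\u202A", "\u202B", "\u202C"
LRO, RLO = "\u202D", "\u202E"


class DirToControls(HTMLParser):
    """Replace DIR markup on inline elements by formatting characters."""

    INLINE = {"span", "q", "bdo"}

    def __init__(self):
        super().__init__()
        self.out = []
        self.stack = []  # True where the open element emitted a control

    def handle_starttag(self, tag, attrs):
        if tag not in self.INLINE:
            return
        direction = (dict(attrs).get("dir") or "").lower()
        if direction not in ("ltr", "rtl"):
            self.stack.append(False)  # no DIR: no new embedding level
            return
        if tag == "bdo":
            self.out.append(LRO if direction == "ltr" else RLO)
        else:
            self.out.append(LRE if direction == "ltr" else RLE)
        self.stack.append(True)

    def handle_endtag(self, tag):
        if tag in self.INLINE and self.stack and self.stack.pop():
            self.out.append(PDF)  # end tag pops the directional state

    def handle_data(self, data):
        self.out.append(data)


def dir_markup_to_controls(markup):
    parser = DirToControls()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```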
 765
 766         NOTE -- authors and authoring software writers should be
 767         aware that conflicts can arise if the DIR attribute is used
 768         on inline elements (including <BDO>) concurrently with the
 769         use of the corresponding ISO 10646 formatting characters.
 770         Preferably one or the other should be used exclusively; the
 771         markup method is better able to guarantee document struc-
 772         tural integrity, and alleviates some problems when editing
 773         bidirectional HTML text with a simple text editor, but some
 774         software may be more apt at using the 10646 characters.  If
 775         both methods are used, great care should be exercised to
 776         ensure proper nesting of markup and directional embedding
 777         or override; otherwise, rendering results are undefined.
 778
 788 5. Forms
 789
 790
 791 5.1. DTD additions
 792
 793    It is natural to expect input in any language in forms, as they pro-
 794    vide one of the few ways of obtaining user input.  While this is pri-
 795    marily a UI issue, there are some things that should be specified at
 796    the HTML level to guide behavior and promote interoperability.
 797
 798    To ensure full interoperability, it is necessary for the user agent
 799    (and the user) to have an indication of the character encoding(s)
 800    that the server providing a form will be able to handle upon submis-
 801    sion of the filled-in form.  Such an indication is provided by the
 802    ACCEPT-CHARSET attribute of the INPUT and TEXTAREA elements, modeled
 803    on the HTTP Accept-Charset header (see [HTTP-1.1]), which contains a
 804    space and/or comma delimited list of character sets acceptable to the
 805    server.  A user agent may want to advise the user of the contents of
 806    this attribute, or to restrict the user's ability to enter
 807    characters outside the repertoires of the listed character sets.
 808
 809         NOTE -- The list of character sets is to be interpreted as
 810         an EXCLUSIVE-OR list; the server announces that it is ready
 811         to accept any ONE of these character encoding schemes for
 812         each part of a multipart entity.  The client may perform
 813         character encoding translation to satisfy the server if
 814         necessary.
 815
 816         NOTE -- The default value for the ACCEPT-CHARSET attribute
 817         of an INPUT or TEXTAREA element is the reserved value
 818         "UNKNOWN".  A user agent may interpret that value as the
 819         character encoding scheme that was used to transmit the
 820         document containing that element.
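
        One way a user agent might act on ACCEPT-CHARSET is sketched
        below in Python.  This is an assumption about client behaviour,
        not a requirement of this document: the function names are
        hypothetical, Python codec names stand in for registered
        charset names, and the first listed encoding able to represent
        the text is chosen.

```python
import re


def parse_accept_charset(value):
    """Split a space and/or comma delimited list of charset names."""
    return [name for name in re.split(r"[,\s]+", value.strip()) if name]


def choose_charset(text, accept_charset, document_charset):
    """Pick ONE acceptable encoding that can represent text, or None."""
    for name in parse_accept_charset(accept_charset):
        if name.upper() == "UNKNOWN":
            # Reserved default: fall back to the document's own encoding.
            name = document_charset
        try:
            text.encode(name)
        except (LookupError, UnicodeEncodeError):
            continue  # unsupported codec or unrepresentable character
        return name
    return None
```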
 821
 822
 823 5.2. Form submission
 824
 825    The HTML 2.0 form submission mechanism, based on the "application/x-
 826    www-form-urlencoded" media type, is ill-equipped with regard to
 827    internationalization.  In fact, since URLs are restricted to ASCII
 828    characters, the mechanism is awkward even for ISO-8859-1 text.  Sec-
 829    tion 2.2 of [RFC1738] specifies that octets may be encoded using the
 830    "%HH" notation, but text submitted from a form is composed of charac-
 831    ters, not octets.  Lacking a specification of a character encoding
 832    scheme, the "%HH" notation has no well-defined meaning.
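
        The ambiguity is easy to demonstrate: the same character yields
        different "%HH" sequences under different character encoding
        schemes.  The Python sketch below uses urllib.parse.quote from
        the standard library; percent_encode is a hypothetical wrapper
        name.

```python
from urllib.parse import quote


def percent_encode(text, charset):
    """Encode text to octets in charset, then apply %HH notation."""
    return quote(text, safe="", encoding=charset)


# LATIN SMALL LETTER E WITH ACUTE under two encoding schemes.
latin1_form = percent_encode("\u00e9", "iso-8859-1")
utf8_form = percent_encode("\u00e9", "utf-8")
```

        A receiver that sees "%E9" or "%C3%A9" alone cannot tell which
        character was meant unless the encoding scheme is known.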
 833
 834    The best solution is to use the "multipart/form-data" media type
 835    described in [RFC1867] with the POST method of form submission.  This
 844    mechanism encapsulates the value part of each name-value pair in a
 845    body-part of a multipart MIME body that is sent as the HTTP entity;
 846    each body part can be labeled with an appropriate Content-Type,
 847    including if necessary a charset parameter that specifies the charac-
 848    ter encoding scheme.  The changes to the DTD necessary to support
 849    this method of form submission have been incorporated in the DTD
 850    included in this specification.
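
        A body of this kind might be assembled as in the following
        Python sketch.  It is a simplified illustration, not a complete
        RFC 1867 implementation: build_form_data is a hypothetical
        name, the boundary string is arbitrary, field names are assumed
        to be ASCII, and no check is made that the boundary is absent
        from the values.

```python
def build_form_data(fields, boundary="AaB03x"):
    """fields: list of (name, value, charset); returns the body octets."""
    delimiter = b"--" + boundary.encode("ascii")
    lines = []
    for name, value, charset in fields:
        lines.append(delimiter)
        header = 'Content-Disposition: form-data; name="%s"' % name
        lines.append(header.encode("ascii"))
        # Each body part is labeled with its own charset parameter.
        label = "Content-Type: text/plain; charset=%s" % charset
        lines.append(label.encode("ascii"))
        lines.append(b"")
        lines.append(value.encode(charset))
    lines.append(delimiter + b"--")
    lines.append(b"")
    return b"\r\n".join(lines)
```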
 851
 852    A less satisfactory solution is to add a MIME charset parameter to
 853    the "application/x-www-form-urlencoded" media type specifier sent
 854    along with a POST method form submission, with the understanding that
 855    the URL encoding of [RFC1738] is applied on top of the specified
 856    character encoding, as a kind of implicit Content-Transfer-Encoding.
 857
 858    One problem with both solutions above is that current browsers do not
 859    generally allow for bookmarks to specify the POST method; this should
 860    be improved.  Alternatively, the GET method could be used with the form
 861    data transmitted in the body instead of in the URL.  Nothing in the
 862    protocol seems to prevent it, but no implementations appear to exist
 863    at present.
 864
 865    How the user agent determines the encoding of the text entered by the
 866    user is outside the scope of this specification.
 867
 868         NOTE -- Designers of forms and their handling scripts
 869         should be aware of an important caveat: when the default
 870         value of a field (the VALUE attribute) is returned upon
 871         form submission (i.e. the user did not modify this value),
 872         it cannot be guaranteed to be transmitted as a sequence of
 873         octets identical to that in the source document -- only as
 874         a possibly different but valid encoding of the same
 875         sequence of text elements.  This may be true even if the
 876         encoding of the document containing the form and that used
 877         for submission are the same.
 878
 879         Differences can occur when a sequence of characters can be
 880         represented by various sequences of octets, and also when a
 881         composite sequence (a base character plus one or more com-
 882         bining diacritics) can be represented by either a different
 883         but equivalent composite sequence or by a fully precomposed
 884         character. For instance, the UCS-2 sequence 00EA+0323
 885         (LATIN SMALL LETTER E WITH CIRCUMFLEX ACCENT + COMBINING
 886         DOT BELOW) may be transformed into 1EC7 (LATIN SMALL LETTER
 887         E WITH CIRCUMFLEX ACCENT AND DOT BELOW), into
 888         0065+0302+0323 (LATIN SMALL LETTER E + COMBINING CIRCUMFLEX
 889         ACCENT + COMBINING DOT BELOW), as well as into other equiv-
 890         alent composite sequences.
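
        A form handling script can guard against such differences by
        comparing text only after bringing both sides to a common form.
        The Python sketch below assumes the Unicode normalization
        forms offered by the standard unicodedata module, which are
        not defined by this document; same_text is a hypothetical
        name.

```python
import unicodedata


def same_text(a, b):
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


partly_composed = "\u00ea\u0323"    # 00EA + 0323
precomposed = "\u1ec7"              # 1EC7
fully_decomposed = "e\u0302\u0323"  # 0065 + 0302 + 0323
```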
 891
 900 6. Miscellaneous
 901
 902    Proper interpretation of a text document requires that the character
 903    encoding scheme be known.  Current HTTP servers, however, do not gen-
 904    erally include an appropriate charset parameter with the Content-Type
 905    header.  This is bad behaviour[2], and as such strongly discouraged,
 906    but some preventive measures can be taken to minimize the detrimental
 907    effects.
 908
 909    In the case where a document is accessed from a hyperlink in an ori-
 910    gin HTML document, a CHARSET attribute is added to the attribute list
 911    of elements with link semantics (A and LINK), specifically by adding
 912    it to the linkExtraAttributes entity.  The value of that attribute is
 913    to be considered a hint to the User Agent as to the character encod-
 914    ing scheme used by the resource pointed to by the hyperlink; it
 915    should be the appropriate value of the MIME charset parameter for
 916    that resource.
 917
 918    In any document, it is possible to include an indication of the
 919    encoding scheme like the following, as early as possible within the
 920    HEAD of the document:
 921
 922     <META HTTP-EQUIV="Content-Type"
 923      CONTENT="text/html; charset=ISO-2022-JP">
 924
 925    This is not foolproof, but will work if the encoding scheme is such
 926    that ASCII characters stand for themselves at least until the META
 927    element is parsed.  Note that there are better ways for a server to
 928    obtain character encoding information, instead of the unreliable
 929    <META> above; see [NICOL2] for some details and a proposal.
 930
 931    For definiteness, the "charset" parameter received from the source of
 932    the document should be considered the most authoritative, followed in
 933    order of preference by the contents of a META element such as the
 934    above, and finally the CHARSET parameter of the anchor that was fol-
 935    lowed (if any).
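
        The order of preference can be sketched as follows in Python.
        Both function names are hypothetical, and the META scan is a
        deliberately naive assumption: it looks only at the first 1024
        octets, expects the attribute order shown in the example
        above, and relies on ASCII characters standing for themselves.

```python
import re

META_CHARSET = re.compile(
    rb'<META\s+HTTP-EQUIV\s*=\s*"?Content-Type"?\s+'
    rb'CONTENT\s*=\s*"[^">]*charset\s*=\s*([A-Za-z0-9._:+-]+)',
    re.IGNORECASE,
)


def sniff_meta_charset(octets):
    """Return the charset named in an early META element, or None."""
    match = META_CHARSET.search(octets[:1024])
    return match.group(1).decode("ascii") if match else None


def effective_charset(http_charset, meta_charset, link_charset):
    """Apply the preference order: source, then META, then anchor."""
    for candidate in (http_charset, meta_charset, link_charset):
        if candidate:
            return candidate
    return None
```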
 936
     -----------
       2 This bad behaviour is even encouraged by the continued
     existence of browsers that declare an unrecognized media
     type when they receive a charset parameter.  User agent
     implementers are strongly encouraged to make their soft-
     ware tolerant of this parameter, even if they cannot take
     advantage of it.

 937    When HTML text is transmitted directly in UCS-2 or UCS-4 form, the
 938    question of byte order arises: does the high-order byte of each
 939    multi-byte character come first or last?  For definiteness, this
 940    specification recommends that UCS-2 and UCS-4 be transmitted in big-
 956    endian byte order (high order byte first), which corresponds to the
 957    established network byte order for two- and four-byte quantities, to
 958    the Unicode recommendation for serialized text data and to RFC 1641.
 959    Furthermore, to maximize chances of proper interpretation, it is rec-
 960    ommended that documents transmitted as UCS-2 or UCS-4 always begin
 961    with a ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF or
 962    0000FEFF) which, when byte-reversed, becomes FFFE or FFFE0000, a
 963    character guaranteed never to be assigned.  Thus, a user-agent
 964    receiving FFFE as the first octets of a text would know that bytes
 965    have to be reversed for the remainder of the text.
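
        The byte-order check described above might look as follows for
        UCS-2.  This Python sketch is an illustration under
        assumptions: the function names are hypothetical, the UTF-16
        codecs stand in for UCS-2 (they agree as long as no surrogate
        code values occur), and UCS-4 is not handled.

```python
def detect_ucs2_byte_order(octets):
    """Return "big" or "little" from the leading FEFF signature."""
    if octets[:2] == b"\xfe\xff":
        return "big"
    if octets[:2] == b"\xff\xfe":
        return "little"  # FFFE seen: the bytes arrive reversed
    return "big"  # no signature: assume the recommended byte order


def decode_ucs2(octets):
    """Decode UCS-2 octets and drop the leading signature, if any."""
    order = detect_ucs2_byte_order(octets)
    codec = "utf-16-be" if order == "big" else "utf-16-le"
    text = octets.decode(codec)
    return text[1:] if text.startswith("\ufeff") else text
```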
 966
 967    There exist so-called UCS Transformation Formats that can be used to
 968    transmit UCS data, in addition to UCS-2 and UCS-4.  UTF-7 [RFC1642]
 969    and UTF-8 [UTF-8] have favorable properties (no byte-ordering prob-
 970    lem, different flavours of ASCII compatibility) that make them worthy
 971    of consideration, especially for transmission of multilingual text.
 972    Another encoding scheme, MNEM [RFC1345], also has interesting proper-
 973    ties and the capability to transmit the full UCS.  The UTF-1 trans-
 974    formation format of ISO 10646:1993 (registered by IANA as
 975    ISO-10646-UTF-1), has been removed from ISO 10646 by amendment 4, and
 976    should not be used.
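
        The two properties mentioned for UTF-7 and UTF-8 can be
        observed with the codecs shipped with Python, as in the sketch
        below.  The helper names are hypothetical, and the codecs are
        assumed to follow the cited definitions closely enough for
        this purpose.

```python
def is_ascii_transparent(text, codec):
    """True if ASCII text encodes to the same octets as in ASCII."""
    return text.encode(codec) == text.encode("ascii")


def seven_bit_clean(octets):
    """True if no octet has its high-order bit set."""
    return all(octet < 0x80 for octet in octets)


def round_trips(text, codec):
    """True if encoding and then decoding returns the original text."""
    return text.encode(codec).decode(codec) == text
```

        For instance, "HTML" is ASCII transparent in UTF-8, while a
        non-ASCII letter stays seven-bit clean in UTF-7 but not in
        UTF-8.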
 977
 978    The SOFT HYPHEN character (U+00AD) needs a little attention from
 979    user-agent implementers.  It is present in many character sets
 980    (including the whole ISO 8859 series and, of course, ISO 10646), and
 981    has semantics different from the plain HYPHEN.  If not used for
 982    hyphenation, the soft hyphen must be completely ignored.  For exam-
 983    ple, "rec&shy;ord" should display as "record", should match a search
 984    for "record", and should sort as "record".  Non-observance of these
 985    semantics effectively discourages its use on the World Wide Web, even
 986    with software that does support it.
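
        For searching and sorting, the required behaviour can be
        obtained by removing soft hyphens before comparison.  The
        Python sketch below is one possible approach; fold_soft_hyphens
        and matches are hypothetical names, and html.unescape is used
        only to expand the &shy; entity of the example.

```python
import html

SOFT_HYPHEN = "\u00AD"


def fold_soft_hyphens(text):
    """Remove every SOFT HYPHEN so that it is ignored in comparisons."""
    return text.replace(SOFT_HYPHEN, "")


def matches(haystack, needle):
    """Substring search that ignores soft hyphens on both sides."""
    return fold_soft_hyphens(needle) in fold_soft_hyphens(haystack)


word = html.unescape("rec&shy;ord")
```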
 987
 988 7. HTML Public Text
 989
 990 7.1. HTML DTD
 991
 992    This section contains a DTD for HTML based on the HTML 2.0 DTD of RFC
 993    1866, incorporating the changes for file upload as specified in RFC
 994    1867, and the changes deriving from this document.
 995
 996    <!--    html.dtd
 997
 998            Document Type Definition for the HyperText Markup Language,
 999            extended for internationalisation (HTML DTD)
1000
1001            Last revised: 96/05/27
1002
1003         Authors: Daniel W. Connolly <connolly@w3.org>
1012                     Francois Yergeau <yergeau@alis.com>
1013         See Also: html.decl, html-1.dtd
1014           http://www.w3.org/hypertext/WWW/MarkUp/MarkUp.html
1015    -->
1016
1017    <!ENTITY % HTML.Version
1018            "-//IETF//DTD HTML//EN"
1019
1020            -- Typical usage:
1021
1022                <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
1023                <html>
1024                ...
1025                </html>
1026            --
1027            >
1028
1029
1030    <!--============ Feature Test Entities ========================-->
1031
1032    <!ENTITY % HTML.Recommended "IGNORE"
1033         -- Certain features of the language are necessary for
1034            compatibility with widespread usage, but they may
1035            compromise the structural integrity of a document.
1036            This feature test entity enables a more prescriptive
1037            document type definition that eliminates
1038            those features.
1039         -->
1040
1041    <![ %HTML.Recommended [
1042            <!ENTITY % HTML.Deprecated "IGNORE">
1043    ]]>
1044
1045    <!ENTITY % HTML.Deprecated "INCLUDE"
1046         -- Certain features of the language are necessary for
1047            compatibility with earlier versions of the specification,
1048            but they tend to be used and implemented inconsistently,
1049            and their use is deprecated. This feature test entity
1050            enables a document type definition that eliminates
1051            these features.
1052         -->
1053
1054    <!ENTITY % HTML.Highlighting "INCLUDE"
1055         -- Use this feature test entity to validate that a
1056            document uses no highlighting tags, which may be
1057            ignored on minimal implementations.
1058         -->
1059
1068    <!ENTITY % HTML.Forms "INCLUDE"
1069            -- Use this feature test entity to validate that a document
1070               contains no forms, which may not be supported in minimal
1071               implementations
1072            -->
1073
1074    <!--============== Imported Names ==============================-->
1075
1076    <!ENTITY % Content-Type "CDATA"
1077            -- meaning an internet media type
1078               (aka MIME content type, as per RFC1521)
1079            -->
1080
1081    <!ENTITY % HTTP-Method "GET | POST"
1082            -- as per HTTP specification, RFC1945
1083            -->
1084
1085    <!--========= DTD "Macros" =====================-->
1086
1087    <!ENTITY % heading "H1|H2|H3|H4|H5|H6">
1088
1089    <!ENTITY % list " UL | OL | DIR | MENU " >
1090
1091    <!ENTITY % attrs -- common attributes for elements --
1092             "LANG  NAME      #IMPLIED  -- RFC 1766 language tag --
1093              DIR  (ltr|rtl)  #IMPLIED  -- text directionality --
1094              ID      ID      #IMPLIED  -- element identifier (from RFC1942) --
1095              CLASS   NAMES   #IMPLIED  -- for subclassing elements (from RFC1942) --">
1096
1097    <!ENTITY % just -- an attribute for text justification --
1098             "ALIGN  (left|right|center|justify)  #IMPLIED"
1099             -- default is left for ltr paragraphs, right for rtl -- >
1100
1101    <!--======= Character mnemonic entities =================-->
1102
1103    <!ENTITY % ISOlat1 PUBLIC
1104      "ISO 8879-1986//ENTITIES Added Latin 1//EN//HTML">
1105    %ISOlat1;
1106
1107    <!ENTITY amp CDATA "&#38;"     -- ampersand          -->
1108    <!ENTITY gt CDATA "&#62;"      -- greater than       -->
1109    <!ENTITY lt CDATA "&#60;"      -- less than          -->
1110    <!ENTITY quot CDATA "&#34;"    -- double quote       -->
1111
1112    <!--Entities for language-dependent presentation (BIDI and contextual analysis) -->
1113    <!ENTITY zwnj CDATA "&#8204;"-- zero width non-joiner-->
1114    <!ENTITY zwj  CDATA "&#8205;"-- zero width joiner-->
1115    <!ENTITY lrm  CDATA "&#8206;"-- left-to-right mark-->
1124    <!ENTITY rlm  CDATA "&#8207;"-- right-to-left mark-->
1125
1126
1127    <!--========= SGML Document Access (SDA) Parameter Entities =====-->
1128
1129    <!-- HTML contains SGML Document Access (SDA) fixed attributes
1130    in support of easy transformation to the International Committee
1131    for Accessible Document Design (ICADD) DTD
1132          "-//EC-USA-CDA/ICADD//DTD ICADD22//EN".
1133    ICADD applications are designed to support usable access to
1134    structured information by print-impaired individuals through
1135    Braille, large print and voice synthesis.  For more information on
1136    SDA & ICADD:
1137            - ISO 12083:1993, Annex A.8, Facilities for Braille,
1138           large print and computer voice
1139            - ICADD ListServ
1140           <ICADD%ASUACAD.BITNET@ARIZVM1.ccit.arizona.edu>
1141            - Usenet news group bit.listserv.easi
1142            - Recording for the Blind, +1 800 221 4792
1143    -->
1144
1145    <!ENTITY % SDAFORM  "SDAFORM  CDATA  #FIXED"
1146           -- one to one mapping        -->
1147    <!ENTITY % SDARULE  "SDARULE  CDATA  #FIXED"
1148           -- context-sensitive mapping -->
1149    <!ENTITY % SDAPREF  "SDAPREF  CDATA  #FIXED"
1150           -- generated text prefix     -->
1151    <!ENTITY % SDASUFF  "SDASUFF  CDATA  #FIXED"
1152           -- generated text suffix     -->
1153    <!ENTITY % SDASUSP  "SDASUSP  NAME   #FIXED"
1154           -- suspend transform process -->
1155
1156
1157    <!--========== Text Markup =====================-->
1158
1159    <![ %HTML.Highlighting [
1160
1161    <!ENTITY % font " TT | B | I ">
1162
1163    <!ENTITY % phrase "EM | STRONG | CODE | SAMP | KBD | VAR | CITE ">
1164
1165    <!ENTITY % text "#PCDATA|A|IMG|BR|%phrase|%font|SPAN|Q|BDO|SUP|SUB">
1166
1167    <!ELEMENT (%font;|%phrase) - - (%text)*>
1168    <!ATTLIST ( TT | CODE | SAMP | KBD | VAR )
1169            %attrs;
1170            %SDAFORM; "Lit"
1171            >
1180    <!ATTLIST ( B | STRONG )
1181            %attrs;
1182            %SDAFORM; "B"
1183            >
1184    <!ATTLIST ( I | EM | CITE )
1185            %attrs;
1186            %SDAFORM; "It"
1187            >
1188
1189    <!-- <TT>       Typewriter text                         -->
1190    <!-- <B>        Bold text                               -->
1191    <!-- <I>        Italic text                             -->
1192
1193    <!-- <EM>       Emphasized phrase                       -->
1194    <!-- <STRONG>   Strong emphasis                         -->
1195    <!-- <CODE>     Source code phrase                      -->
1196    <!-- <SAMP>     Sample text or characters               -->
1197    <!-- <KBD>      Keyboard phrase, e.g. user input        -->
1198    <!-- <VAR>      Variable phrase or substitutable        -->
1199    <!-- <CITE>     Name or title of cited work             -->
1200
1201    <!ENTITY % pre.content "#PCDATA|A|HR|BR|%font|%phrase|SPAN|BDO">
1202
1203    ]]>
1204
1205    <!ENTITY % text "#PCDATA|A|IMG|BR|SPAN|Q|BDO|SUP|SUB">
1206
1207    <!ELEMENT BR    - O EMPTY>
1208    <!ATTLIST BR
1209            %SDAPREF; "&#RE;"
1210            >
1211
1212    <!-- <BR>       Line break      -->
1213
1214    <!ELEMENT SPAN - - (%text)*>
1215    <!ATTLIST SPAN
1216            %attrs;
1217            %SDAFORM; "other #Attlist"
1218    >
1219
1220    <!-- <SPAN>             Generic inline container  -->
1221    <!-- <SPAN DIR=...>     New counterflow embedding -->
1222    <!-- <SPAN LANG="...">  Language of contents      -->
1223
1224    <!ELEMENT Q - - (%text)*>
1225    <!ATTLIST Q
1226            %attrs;
1227            %SDAPREF; '"'
1236            %SDASUFF; '"'
1237            >
1238
1239    <!-- <Q>         Short quotation              -->
1240    <!-- <Q LANG=xx> Language of quotation is xx  -->
1241    <!-- <Q DIR=...> New counterflow embedding    -->
1242
1243    <!ELEMENT BDO - - (%text)+>
1244    <!ATTLIST BDO
1245            LANG   NAME      #IMPLIED
1246            DIR    (ltr|rtl) #REQUIRED
1247            %SDAPREF; "Bidi Override #Attval(DIR): "
1248            %SDASUFF; "End Bidi"
1249            >
1250
1251    <!-- <BDO DIR=...>   Override directionality of text to value of DIR -->
1252    <!-- <BDO LANG=...>  Language of contents                            -->
1253
1254    <!ELEMENT (SUP|SUB) - - (#PCDATA)>
1255    <!ATTLIST (SUP)
1256            %attrs;
1257            %SDAPREF; "Superscript(#content)"
1258            >
1259    <!ATTLIST (SUB)
1260            %attrs;
1261            %SDAPREF; "Subscript(#content)"
1262            >
1263
1264    <!-- <SUP>      Superscript              -->
1265    <!-- <SUB>      Subscript                -->
1266
1267    <!--========= Link Markup ======================-->
1268
1269    <!ENTITY % linkType "NAMES">
1270
1271    <!ENTITY % linkExtraAttributes
1272            "REL %linkType #IMPLIED
1273            REV %linkType #IMPLIED
1274            URN CDATA #IMPLIED
1275            TITLE CDATA #IMPLIED
1276            METHODS NAMES #IMPLIED
1277            CHARSET NAME #IMPLIED
1278            ">
1279
1280    <![ %HTML.Recommended [
1281            <!ENTITY % A.content   "(%text)*"
1282            -- <H1><a name="xxx">Heading</a></H1>
1283                    is preferred to
1292               <a name="xxx"><H1>Heading</H1></a>
1293            -->
1294    ]]>
1295
1296    <!ENTITY % A.content   "(%heading|%text)*">
1297
1298    <!ELEMENT A     - - %A.content -(A)>
1299    <!ATTLIST A
1300            %attrs;
1301            HREF CDATA #IMPLIED
1302            NAME CDATA #IMPLIED
1303            %linkExtraAttributes;
1304            %SDAPREF; "<Anchor: #AttList>"
1305            >
1306    <!-- <A>       Anchor; source/destination of link -->
1307    <!-- <A NAME="..."> Name of this anchor           -->
1308    <!-- <A HREF="..."> Address of link destination        -->
1309    <!-- <A URN="...">  Permanent address of destination   -->
1310    <!-- <A REL=...>    Relationship to destination        -->
1311    <!-- <A REV=...>    Relationship of destination to this     -->
1312    <!-- <A TITLE="...">     Title of destination (advisory)         -->
1313    <!-- <A METHODS="...">   Operations on destination (advisory)    -->
1314    <!-- <A CHARSET="...">   Charset of destination (advisory)  -->
1315    <!-- <A LANG="...">     Language of contents btw <A> and </A>   -->
1316    <!-- <A DIR=...>        Contents is a new counterflow embedding -->
1317
1318    <!--========== Images ==========================-->
1319
1320    <!ELEMENT IMG    - O EMPTY>
1321    <!ATTLIST IMG
1322            %attrs;
1323            SRC CDATA  #REQUIRED
1324            ALT CDATA #IMPLIED
1325            ALIGN (top|middle|bottom) #IMPLIED
1326            ISMAP (ISMAP) #IMPLIED
1327            %SDAPREF; "<Fig><?SDATrans Img: #AttList>#AttVal(Alt)</Fig>"
1328            >
1329
1330    <!-- <IMG>              Image; icon, glyph or illustration      -->
1331    <!-- <IMG SRC="...">    Address of image object                 -->
1332    <!-- <IMG ALT="...">    Textual alternative                     -->
1333    <!-- <IMG ALIGN=...>    Position relative to text               -->
1334    <!-- <IMG LANG=...>     Image contains "text" in that language  -->
1335    <!-- <IMG DIR=rtl>      Inline image acts as a right-to-left
1336                            embedding w/r to BIDI algorithm         -->
1337    <!-- <IMG ISMAP>        Each pixel can be a link                -->
1338
1339    <!--========== Paragraphs=======================-->
1340
1348    <!ELEMENT P     - O (%text)*>
1349    <!ATTLIST P
1350            %attrs;
1351            %just;
1352            %SDAFORM; "Para"
1353            >
1354
1355    <!-- <P>             Paragraph                           -->
1356    <!-- <P LANG="...">  Language of paragraph text          -->
1357    <!-- <P DIR=...>     Base directionality of paragraph    -->
1358    <!-- <P ALIGN=...>   Paragraph alignment (justification) -->
1359
1360    <!--========== Headings, Titles, Sections ===============-->
1361
1362    <!ELEMENT HR    - O EMPTY>
1363    <!ATTLIST HR
1364            %just;
1365            %SDAPREF; "&#RE;&#RE;"
1366            >
1367
1368    <!-- <HR>       Horizontal rule -->
1369
1370    <!ELEMENT ( %heading )  - -  (%text;)*>
1371    <!ATTLIST H1
1372            %attrs;
1373            %just;
1374            %SDAFORM; "H1"
1375            >
1376    <!ATTLIST H2
1377            %attrs;
1378            %just;
1379            %SDAFORM; "H2"
1380            >
1381    <!ATTLIST H3
1382            %attrs;
1383            %just;
1384            %SDAFORM; "H3"
1385            >
1386    <!ATTLIST H4
1387            %attrs;
1388            %just;
1389            %SDAFORM; "H4"
1390            >
1391    <!ATTLIST H5
1392            %attrs;
1393            %just;
1394            %SDAFORM; "H5"
1395            >
1404    <!ATTLIST H6
1405            %attrs;
1406            %just;
1407            %SDAFORM; "H6"
1408            >
1409
1410    <!-- <H1>       Heading, level 1 -->
1411    <!-- <H2>       Heading, level 2 -->
1412    <!-- <H3>       Heading, level 3 -->
1413    <!-- <H4>       Heading, level 4 -->
1414    <!-- <H5>       Heading, level 5 -->
1415    <!-- <H6>       Heading, level 6 -->
1416
1417
1418    <!--========== Text Flows ======================-->
1419
1420    <![ %HTML.Forms [
1421            <!ENTITY % block.forms "BLOCKQUOTE | FORM | ISINDEX">
1422    ]]>
1423
1424    <!ENTITY % block.forms "BLOCKQUOTE">
1425
1426    <![ %HTML.Deprecated [
1427            <!ENTITY % preformatted "PRE | XMP | LISTING">
1428    ]]>
1429
1430    <!ENTITY % preformatted "PRE">
1431
1432    <!ENTITY % block "P | %list | DL
1433            | %preformatted
1434            | %block.forms">
1435
1436    <!ENTITY % flow "(%text|%block)*">
1437
1438    <!ENTITY % pre.content "#PCDATA | A | HR | BR | SPAN | BDO">
1439    <!ELEMENT PRE - - (%pre.content)*>
1440    <!ATTLIST PRE
1441            %attrs;
1442            WIDTH NUMBER #implied
1443            %SDAFORM; "Lit"
1444            >
1445
1446    <!-- <PRE>              Preformatted text                    -->
1447    <!-- <PRE WIDTH=...>    Maximum characters per line          -->
1448    <!-- <PRE DIR=...>      Base direction of preformatted block -->
1449    <!-- <PRE LANG=...>     Language of contents                 -->
1450
1451    <![ %HTML.Deprecated [
1452
1460    <!ENTITY % literal "CDATA"
1461            -- historical, non-conforming parsing mode where
1462               the only markup signal is the end tag
1463               in full
1464            -->
1465
1466    <!ELEMENT (XMP|LISTING) - -  %literal>
1467    <!ATTLIST XMP
1468            %attrs;
1469            %SDAFORM; "Lit"
1470            %SDAPREF; "Example:&#RE;"
1471            >
1472    <!ATTLIST LISTING
1473            %attrs;
1474            %SDAFORM; "Lit"
1475            %SDAPREF; "Listing:&#RE;"
1476            >
1477
1478    <!-- <XMP>              Example section         -->
1479    <!-- <LISTING>          Computer listing        -->
1480
1481    <!ELEMENT PLAINTEXT - O %literal>
1482    <!-- <PLAINTEXT>        Plain text passage      -->
1483
1484    <!ATTLIST PLAINTEXT
1485            %attrs;
1486            %SDAFORM; "Lit"
1487            >
1488    ]]>
1489
1490
1491    <!--========== Lists ==================-->
1492
1493    <!ELEMENT DL    - -  (DT | DD)+>
1494    <!ATTLIST DL
1495            %attrs;
1496            COMPACT (COMPACT) #IMPLIED
1497            %SDAFORM; "List"
1498            %SDAPREF; "Definition List:"
1499            >
1500
1501    <!ELEMENT DT    - O (%text)*>
1502    <!ATTLIST DT
1503            %attrs;
1504            %SDAFORM; "Term"
1505            >
1506
1507    <!ELEMENT DD    - O %flow>
1516    <!ATTLIST DD
1517            %attrs;
1518            %SDAFORM; "LItem"
1519            >
1520
1521    <!-- <DL>               Definition list, or glossary    -->
1522    <!-- <DL COMPACT>       Compact style list              -->
1523    <!-- <DT>               Term in definition list         -->
1524    <!-- <DD>               Definition of term              -->
1525
1526    <!ELEMENT (OL|UL) - -  (LI)+>
1527    <!ATTLIST OL
1528            %attrs;
1529            %just;
1530            COMPACT (COMPACT) #IMPLIED
1531            %SDAFORM; "List"
1532            >
1533    <!ATTLIST UL
1534            %attrs;
1535            %just;
1536            COMPACT (COMPACT) #IMPLIED
1537            %SDAFORM; "List"
1538            >
1539    <!-- <UL>               Unordered list                  -->
1540    <!-- <UL COMPACT>       Compact list style              -->
1541    <!-- <OL>               Ordered, or numbered list       -->
1542    <!-- <OL COMPACT>       Compact list style              -->
1543
1544
1545    <!ELEMENT (DIR|MENU) - -  (LI)+ -(%block)>
1546    <!ATTLIST DIR
1547            %attrs;
1548            %just;
1549            COMPACT (COMPACT) #IMPLIED
1550            %SDAFORM; "List"
1551            %SDAPREF; "<LHead>Directory</LHead>"
1552            >
1553    <!ATTLIST MENU
1554            %attrs;
1555            %just;
1556            COMPACT (COMPACT) #IMPLIED
1557            %SDAFORM; "List"
1558            %SDAPREF; "<LHead>Menu</LHead>"
1559            >
1560
1561    <!-- <DIR>              Directory list                  -->
1562    <!-- <DIR COMPACT>      Compact list style              -->
1563    <!-- <MENU>             Menu list                       -->
1572    <!-- <MENU COMPACT>     Compact list style              -->
1573
1574    <!ELEMENT LI    - O %flow>
1575    <!ATTLIST LI
1576            %attrs;
1577            %just;
1578            %SDAFORM; "LItem"
1579            >
1580
1581    <!-- <LI>               List item                       -->
1582
1583    <!--========== Document Body ===================-->
1584
1585    <![ %HTML.Recommended [
1586         <!ENTITY % body.content "(%heading|%block|HR|ADDRESS|IMG)*"
1587         -- <h1>Heading</h1>
1588            <p>Text ...
1589              is preferred to
1590            <h1>Heading</h1>
1591            Text ...
1592         -->
1593    ]]>
1594
1595    <!ENTITY % body.content "(%heading | %text | %block |
1596                         HR | ADDRESS)*">
1597
1598    <!ELEMENT BODY O O  %body.content>
1599    <!ATTLIST BODY
1600            %attrs;
1601            >
1602
1603    <!-- <BODY>          Document body                -->
1604    <!-- <BODY DIR=...>  Base direction of whole body -->
1605    <!-- <BODY LANG=...> Language of contents         -->
1606
1607    <!ELEMENT BLOCKQUOTE - - %body.content>
1608    <!ATTLIST BLOCKQUOTE
1609            %attrs;
1610            %just;
1611            %SDAFORM; "BQ"
1612            >
1613
1614    <!-- <BLOCKQUOTE>       Quoted passage  -->
1615
1616    <!ELEMENT ADDRESS - - (%text|P)*>
1617    <!ATTLIST  ADDRESS
1618            %attrs;
1619            %just;
1628            %SDAFORM; "Lit"
1629            %SDAPREF; "Address:&#RE;"
1630            >
1631
1632    <!-- <ADDRESS> Address, signature, or byline -->
1633
1634
1635    <!--======= Forms ====================-->
1636
1637    <![ %HTML.Forms [
1638
1639    <!ELEMENT FORM - - %body.content -(FORM) +(INPUT|SELECT|TEXTAREA)>
1640    <!ATTLIST FORM
1641            %attrs;
1642            ACTION CDATA #IMPLIED
1643            METHOD (%HTTP-Method) GET
1644            ENCTYPE %Content-Type; "application/x-www-form-urlencoded"
1645            %SDAPREF; "<Para>Form:</Para>"
1646            %SDASUFF; "<Para>Form End.</Para>"
1647            >
1648
1649    <!-- <FORM>                     Fill-out or data-entry form     -->
1650    <!-- <FORM ACTION="...">        Address for completed form      -->
1651    <!-- <FORM METHOD=...>          Method of submitting form       -->
1652    <!-- <FORM ENCTYPE="...">       Representation of form data     -->
1653    <!-- <FORM DIR=...>             Base direction of form          -->
1654    <!-- <FORM LANG=...>            Language of contents            -->
1655
1656    <!ENTITY % InputType "(TEXT | PASSWORD | CHECKBOX |
1657                            RADIO | SUBMIT | RESET |
1658                            IMAGE | HIDDEN | FILE )">
1659    <!ELEMENT INPUT - O EMPTY>
1660    <!ATTLIST INPUT
1661            %attrs;
1662            TYPE %InputType TEXT
1663            NAME CDATA #IMPLIED
1664            VALUE CDATA #IMPLIED
1665            SRC CDATA #IMPLIED
1666            CHECKED (CHECKED) #IMPLIED
1667            SIZE CDATA #IMPLIED
1668            MAXLENGTH NUMBER #IMPLIED
1669            ALIGN (top|middle|bottom) #IMPLIED
1670            ACCEPT CDATA #IMPLIED -- list of content types --
1671            ACCEPT-CHARSET CDATA #IMPLIED -- list of charsets accepted by server --
1672            %SDAPREF; "Input: "
1673            >
1674
1675    <!-- <INPUT>               Form input datum        -->
1684    <!-- <INPUT TYPE=...>           Type of input interaction    -->
1685    <!-- <INPUT NAME=...>           Name of form datum           -->
1686    <!-- <INPUT VALUE="...">   Default/initial/selected value -->
1687    <!-- <INPUT SRC="...">          Address of image        -->
1688    <!-- <INPUT CHECKED>            Initial state is "on"        -->
1689    <!-- <INPUT SIZE=...>           Field size hint         -->
1690    <!-- <INPUT MAXLENGTH=...>      Data length maximum          -->
1691    <!-- <INPUT ALIGN=...>          Image alignment         -->
1692    <!-- <INPUT ACCEPT="...">         List of desired media types    -->
1693    <!-- <INPUT ACCEPT-CHARSET="..."> List of acceptable charsets    -->
1694
1695    <!ELEMENT SELECT - - (OPTION+) -(INPUT|SELECT|TEXTAREA)>
1696    <!ATTLIST SELECT
1697            %attrs;
1698            NAME CDATA #REQUIRED
1699            SIZE NUMBER #IMPLIED
1700            MULTIPLE (MULTIPLE) #IMPLIED
1701            %SDAFORM; "List"
1702            %SDAPREF;
1703            "<LHead>Select #AttVal(Multiple)</LHead>"
1704         >
1705
1706    <!-- <SELECT>            Selection of option(s)        -->
1707    <!-- <SELECT NAME=...>        Name of form datum       -->
1708    <!-- <SELECT SIZE=...>        Options displayed at a time   -->
1709    <!-- <SELECT MULTIPLE>        Multiple selections allowed   -->
1710
1711    <!ELEMENT OPTION - O (#PCDATA)*>
1712    <!ATTLIST OPTION
1713            %attrs;
1714            SELECTED (SELECTED) #IMPLIED
1715            VALUE CDATA #IMPLIED
1716            %SDAFORM; "LItem"
1717            %SDAPREF;
1718            "Option: #AttVal(Value) #AttVal(Selected)"
1719         >
1720
1721    <!-- <OPTION>            A selection option       -->
1722    <!-- <OPTION SELECTED>        Initial state            -->
1723    <!-- <OPTION VALUE="...">     Form datum value for this option-->
1724
1725    <!ELEMENT TEXTAREA - - (#PCDATA)* -(INPUT|SELECT|TEXTAREA)>
1726    <!ATTLIST TEXTAREA
1727            %attrs;
1728            NAME CDATA #REQUIRED
1729            ROWS NUMBER #REQUIRED
1730            COLS NUMBER #REQUIRED
1731            ACCEPT-CHARSET CDATA #IMPLIED -- list of charsets accepted by server --
1740            %SDAFORM; "Para"
1741            %SDAPREF; "Input Text -- #AttVal(Name): "
1742            >
1743
1744    <!-- <TEXTAREA>               An area for text input        -->
1745    <!-- <TEXTAREA NAME=...> Name of form datum       -->
1746    <!-- <TEXTAREA ROWS=...> Height of area           -->
1747    <!-- <TEXTAREA COLS=...> Width of area            -->
1748
1749    ]]>
1750
1751
1752    <!--======= Document Head ======================-->
1753
1754    <![ %HTML.Recommended [
1755         <!ENTITY % head.extra "">
1756    ]]>
1757    <!ENTITY % head.extra "& NEXTID?">
1758
1759    <!ENTITY % head.content "TITLE & ISINDEX? & BASE? %head.extra">
1760
1761    <!ELEMENT HEAD O O  (%head.content) +(META|LINK)>
1762    <!ATTLIST HEAD
1763            %attrs;           >
1764
1765    <!-- <HEAD>     Document head   -->
1766
1767    <!ELEMENT TITLE - -  (#PCDATA)*  -(META|LINK)>
1768    <!ATTLIST TITLE
1769            %attrs;
1770            %SDAFORM; "Ti"    >
1771
1772    <!-- <TITLE>    Title of document -->
1773
1774    <!ELEMENT LINK - O EMPTY>
1775    <!ATTLIST LINK
1776            %attrs;
1777            HREF CDATA #REQUIRED
1778            %linkExtraAttributes;
1779            %SDAPREF; "Linked to : #AttVal (TITLE) (URN) (HREF)>"    >
1780
1781    <!-- <LINK>         Link from this document            -->
1782    <!-- <LINK HREF="...">   Address of link destination        -->
1783    <!-- <LINK URN="...">    Lasting name of destination        -->
1784    <!-- <LINK REL=...> Relationship to destination        -->
1785    <!-- <LINK REV=...> Relationship of destination to this     -->
1786    <!-- <LINK TITLE="...">  Title of destination (advisory)         -->
1787    <!-- <LINK CHARSET="..."> Charset of destination (advisory)      -->
1796    <!-- <LINK METHODS="..."> Operations allowed (advisory)          -->
1797
1798    <!ELEMENT ISINDEX - O EMPTY>
1799    <!ATTLIST ISINDEX
1800            %attrs;
1801            %SDAPREF;
1802       "<Para>[Document is indexed/searchable.]</Para>">
1803
1804    <!-- <ISINDEX>          Document is a searchable index          -->
1805
1806    <!ELEMENT BASE - O EMPTY>
1807    <!ATTLIST BASE
1808            HREF CDATA #REQUIRED     >
1809
1810    <!-- <BASE>             Base context document                   -->
1811    <!-- <BASE HREF="...">  Address for this document               -->
1812
1813    <!ELEMENT NEXTID - O EMPTY>
1814    <!ATTLIST NEXTID
1815            N CDATA #REQUIRED     >
1816
1817    <!-- <NEXTID>       Next ID to use for link name       -->
1818    <!-- <NEXTID N=...> Next ID to use for link name       -->
1819
1820    <!ELEMENT META - O EMPTY>
1821    <!ATTLIST META
1822            HTTP-EQUIV  NAME    #IMPLIED
1823            NAME        NAME    #IMPLIED
1824            CONTENT     CDATA   #REQUIRED    >
1825
1826    <!-- <META>                     Generic Meta-information        -->
1827    <!-- <META HTTP-EQUIV=...>      HTTP response header name       -->
1828    <!-- <META NAME=...>          Meta-information name           -->
1829    <!-- <META CONTENT="...">       Associated information          -->
1830
1831    <!--======= Document Structure =================-->
1832
1833    <![ %HTML.Deprecated [
1834            <!ENTITY % html.content "HEAD, BODY, PLAINTEXT?">
1835    ]]>
1836    <!ENTITY % html.content "HEAD, BODY">
1837
1838    <!ELEMENT HTML O O  (%html.content)>
1839    <!ENTITY % version.attr "VERSION CDATA #FIXED '%HTML.Version;'">
1840
1841    <!ATTLIST HTML
1842            %attrs;
1843            %version.attr;
1852            %SDAFORM; "Book"
1853            >
1854
1855    <!-- <HTML>              HTML Document  -->
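
   As an illustration of how a consumer might read the LANG and DIR
   attributes that the DTD above attaches to elements such as BODY, P
   and PRE, the following Python sketch collects them from a small
   sample.  It is a minimal sketch and not part of the DTD: the class
   name I18nAttrCollector and the sample markup are hypothetical, and
   Python's html.parser is a lenient tag scanner, not a validating SGML
   parser, so it does not enforce the content models declared above.

```python
from html.parser import HTMLParser


class I18nAttrCollector(HTMLParser):
    """Record (element, LANG, DIR) for each start tag that carries either."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases element and attribute names.
        found = dict(attrs)
        if "lang" in found or "dir" in found:
            self.seen.append((tag, found.get("lang"), found.get("dir")))


# Hypothetical sample: an English body holding a right-to-left paragraph.
SAMPLE = '<BODY LANG="en"><P LANG="he" DIR=rtl>text</P><PRE DIR=ltr>x</PRE></BODY>'

collector = I18nAttrCollector()
collector.feed(SAMPLE)
collector.close()
for element, lang, direction in collector.seen:
    print(element, lang, direction)
```

   An element without its own LANG or DIR is reported with None in that
   position; resolving the inherited value from an enclosing element is
   left out of this sketch.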
1856
1857
1858 7.2. SGML Declaration for HTML
1859
1860    <!SGML  "ISO 8879:1986"
1861    --
1862         SGML Declaration for HyperText Markup Language version 2.x
1863            (HTML 2.x = HTML 2.0 + i18n).
1864
1865    --
1866
1867    CHARSET
1868             BASESET  "ISO Registration Number 177//CHARSET
1869                       ISO/IEC 10646-1:1993 UCS-4 with
1870                       implementation level 3//ESC 2/5 2/15 4/6"
1871             DESCSET  0   9     UNUSED
1872                      9   2     9
1873                      11  2     UNUSED
1874                      13  1     13
1875                      14  18    UNUSED
1876                      32  95    32
1877                      127 1     UNUSED
1878                      128 32    UNUSED
1879                      160 2147483486 160
1880
1881    CAPACITY        SGMLREF
1882                    TOTALCAP        150000
1883                    GRPCAP          150000
1884                    ENTCAP          150000
1885
1886    SCOPE    DOCUMENT
1887    SYNTAX
1888             SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1889               17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127
1890             BASESET  "ISO 646:1983//CHARSET
1891                       International Reference Version
1892                       (IRV)//ESC 2/5 4/0"
1893             DESCSET  0 128 0
1894
1895             FUNCTION
1896                      RE            13
1897                      RS            10
1898                      SPACE         32
1899                      TAB SEPCHAR    9
1900
1908             NAMING   LCNMSTRT ""
1909                      UCNMSTRT ""
1910                      LCNMCHAR ".-"
1911                      UCNMCHAR ".-"
1912                      NAMECASE GENERAL YES
1913                               ENTITY  NO
1914             DELIM    GENERAL  SGMLREF
1915                      SHORTREF SGMLREF
1916             NAMES    SGMLREF
1917             QUANTITY SGMLREF
1918                      ATTSPLEN 2100
1919                      LITLEN   1024
1920                      NAMELEN  72    -- somewhat arbitrary; taken from
1921                                    internet line length conventions --
1922                      PILEN    1024
1923                      TAGLVL   100
1924                      TAGLEN   2100
1925                      GRPGTCNT 150
1926                      GRPCNT   64
1927
1928    FEATURES
1929      MINIMIZE
1930        DATATAG  NO
1931        OMITTAG  YES
1932        RANK     NO
1933        SHORTTAG YES
1934      LINK
1935        SIMPLE   NO
1936        IMPLICIT NO
1937        EXPLICIT NO
1938      OTHER
1939        CONCUR   NO
1940        SUBDOC   NO
1941        FORMAL   YES
1942      APPINFO    "SDA"  -- conforming SGML Document Access application
1943                  --
1944    >
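
   The DESCSET rows in the CHARSET section above can be read as
   (first code position, count, base position or UNUSED) triples.  The
   Python sketch below, offered only as an illustration, transcribes
   those rows and reports whether a given code position is described by
   the document character set.  The names DESCSET and is_described are
   hypothetical, and the sketch deliberately ignores every other part of
   the SGML declaration.

```python
# Rows transcribed from the CHARSET DESCSET above; None stands for UNUSED.
DESCSET = [
    (0, 9, None),
    (9, 2, 9),
    (11, 2, None),
    (13, 1, 13),
    (14, 18, None),
    (32, 95, 32),
    (127, 1, None),
    (128, 32, None),
    (160, 2147483486, 160),
]


def is_described(code_position):
    """Return True if the position falls in a row that is not UNUSED."""
    for first, count, base in DESCSET:
        if first <= code_position < first + count:
            return base is not None
    return False


# Check a few positions: TAB, vertical tab, "A", a C1 control, NBSP, U+3042.
for position in (9, 11, 65, 150, 160, 0x3042):
    print(position, is_described(position))
```

   Under this reading, a numeric character reference into the range 128
   to 159 does not name a character of the document character set, while
   positions from 160 upward do.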
1945
1946
1947 7.3. ISO Latin 1 entity set
1948
1949    The following public text lists each character in the Added Latin 1
1950    entity set, along with its name, syntax for use, and description.
1951    This list is derived from ISO Standard 8879:1986//ENTITIES Added
1952    Latin 1//EN. HTML includes the entire entity set and adds entities
1953    for all characters in the right-hand part of ISO-8859-1 (code
1954    positions 160 to 255) that the ISO set lacks.
1955
1964     <!-- (C) International Organization for Standardization 1986
1965          Permission to copy in any form is granted for use with
1966          conforming SGML systems and applications as defined in
1967          ISO 8879, provided this notice is included in all copies.
1968       -->
1969     <!-- Character entity set. Typical invocation:
1970          <!ENTITY % ISOlat1 PUBLIC
1971            "ISO 8879-1986//ENTITIES Added Latin 1//EN//HTML">
1972          %ISOlat1;
1973       -->
1974     <!ENTITY nbsp   CDATA "&#160;" -- no-break space -->
1975     <!ENTITY iexcl  CDATA "&#161;" -- inverted exclamation mark -->
1976     <!ENTITY cent   CDATA "&#162;" -- cent sign -->
1977     <!ENTITY pound  CDATA "&#163;" -- pound sterling sign -->
1978     <!ENTITY curren CDATA "&#164;" -- general currency sign -->
1979     <!ENTITY yen    CDATA "&#165;" -- yen sign -->
1980     <!ENTITY brvbar CDATA "&#166;" -- broken (vertical) bar -->
1981     <!ENTITY sect   CDATA "&#167;" -- section sign -->
1982     <!ENTITY uml    CDATA "&#168;" -- umlaut (dieresis) -->
1983     <!ENTITY copy   CDATA "&#169;" -- copyright sign -->
1984     <!ENTITY ordf   CDATA "&#170;" -- ordinal indicator, feminine -->
1985     <!ENTITY laquo  CDATA "&#171;" -- angle quotation mark, left -->
1986     <!ENTITY not    CDATA "&#172;" -- not sign -->
1987     <!ENTITY shy    CDATA "&#173;" -- soft hyphen -->
1988     <!ENTITY reg    CDATA "&#174;" -- registered sign -->
1989     <!ENTITY macr   CDATA "&#175;" -- macron -->
1990     <!ENTITY deg    CDATA "&#176;" -- degree sign -->
1991     <!ENTITY plusmn CDATA "&#177;" -- plus-or-minus sign -->
1992     <!ENTITY sup2   CDATA "&#178;" -- superscript two -->
1993     <!ENTITY sup3   CDATA "&#179;" -- superscript three -->
1994     <!ENTITY acute  CDATA "&#180;" -- acute accent -->
1995     <!ENTITY micro  CDATA "&#181;" -- micro sign -->
1996     <!ENTITY para   CDATA "&#182;" -- pilcrow (paragraph sign) -->
1997     <!ENTITY middot CDATA "&#183;" -- middle dot -->
1998     <!ENTITY cedil  CDATA "&#184;" -- cedilla -->
1999     <!ENTITY sup1   CDATA "&#185;" -- superscript one -->
2000     <!ENTITY ordm   CDATA "&#186;" -- ordinal indicator, masculine -->
2001     <!ENTITY raquo  CDATA "&#187;" -- angle quotation mark, right -->
2002     <!ENTITY frac14 CDATA "&#188;" -- fraction one-quarter -->
2003     <!ENTITY frac12 CDATA "&#189;" -- fraction one-half -->
2004     <!ENTITY frac34 CDATA "&#190;" -- fraction three-quarters -->
2005     <!ENTITY iquest CDATA "&#191;" -- inverted question mark -->
2006     <!ENTITY Agrave CDATA "&#192;" -- capital A, grave accent -->
2007     <!ENTITY Aacute CDATA "&#193;" -- capital A, acute accent -->
2008     <!ENTITY Acirc  CDATA "&#194;" -- capital A, circumflex accent -->
2009     <!ENTITY Atilde CDATA "&#195;" -- capital A, tilde -->
2010     <!ENTITY Auml   CDATA "&#196;" -- capital A, dieresis or umlaut mark -->
2011     <!ENTITY Aring  CDATA "&#197;" -- capital A, ring -->
2020     <!ENTITY AElig  CDATA "&#198;" -- capital AE diphthong (ligature) -->
2021     <!ENTITY Ccedil CDATA "&#199;" -- capital C, cedilla -->
2022     <!ENTITY Egrave CDATA "&#200;" -- capital E, grave accent -->
2023     <!ENTITY Eacute CDATA "&#201;" -- capital E, acute accent -->
2024     <!ENTITY Ecirc  CDATA "&#202;" -- capital E, circumflex accent -->
2025     <!ENTITY Euml   CDATA "&#203;" -- capital E, dieresis or umlaut mark -->
2026     <!ENTITY Igrave CDATA "&#204;" -- capital I, grave accent -->
2027     <!ENTITY Iacute CDATA "&#205;" -- capital I, acute accent -->
2028     <!ENTITY Icirc  CDATA "&#206;" -- capital I, circumflex accent -->
2029     <!ENTITY Iuml   CDATA "&#207;" -- capital I, dieresis or umlaut mark -->
2030     <!ENTITY ETH    CDATA "&#208;" -- capital Eth, Icelandic -->
2031     <!ENTITY Ntilde CDATA "&#209;" -- capital N, tilde -->
2032     <!ENTITY Ograve CDATA "&#210;" -- capital O, grave accent -->
2033     <!ENTITY Oacute CDATA "&#211;" -- capital O, acute accent -->
2034     <!ENTITY Ocirc  CDATA "&#212;" -- capital O, circumflex accent -->
2035     <!ENTITY Otilde CDATA "&#213;" -- capital O, tilde -->
2036     <!ENTITY Ouml   CDATA "&#214;" -- capital O, dieresis or umlaut mark -->
2037     <!ENTITY times  CDATA "&#215;" -- multiply sign -->
2038     <!ENTITY Oslash CDATA "&#216;" -- capital O, slash -->
2039     <!ENTITY Ugrave CDATA "&#217;" -- capital U, grave accent -->
2040     <!ENTITY Uacute CDATA "&#218;" -- capital U, acute accent -->
2041     <!ENTITY Ucirc  CDATA "&#219;" -- capital U, circumflex accent -->
2042     <!ENTITY Uuml   CDATA "&#220;" -- capital U, dieresis or umlaut mark -->
2043     <!ENTITY Yacute CDATA "&#221;" -- capital Y, acute accent -->
2044     <!ENTITY THORN  CDATA "&#222;" -- capital Thorn, Icelandic -->
2045     <!ENTITY szlig  CDATA "&#223;" -- small sharp s, German (sz ligature) -->
2046     <!ENTITY agrave CDATA "&#224;" -- small a, grave accent -->
2047     <!ENTITY aacute CDATA "&#225;" -- small a, acute accent -->
2048     <!ENTITY acirc  CDATA "&#226;" -- small a, circumflex accent -->
2049     <!ENTITY atilde CDATA "&#227;" -- small a, tilde -->
2050     <!ENTITY auml   CDATA "&#228;" -- small a, dieresis or umlaut mark -->
2051     <!ENTITY aring  CDATA "&#229;" -- small a, ring -->
2052     <!ENTITY aelig  CDATA "&#230;" -- small ae diphthong (ligature) -->
2053     <!ENTITY ccedil CDATA "&#231;" -- small c, cedilla -->
2054     <!ENTITY egrave CDATA "&#232;" -- small e, grave accent -->
2055     <!ENTITY eacute CDATA "&#233;" -- small e, acute accent -->
2056     <!ENTITY ecirc  CDATA "&#234;" -- small e, circumflex accent -->
2057     <!ENTITY euml   CDATA "&#235;" -- small e, dieresis or umlaut mark -->
2058     <!ENTITY igrave CDATA "&#236;" -- small i, grave accent -->
2059     <!ENTITY iacute CDATA "&#237;" -- small i, acute accent -->
2060     <!ENTITY icirc  CDATA "&#238;" -- small i, circumflex accent -->
2061     <!ENTITY iuml   CDATA "&#239;" -- small i, dieresis or umlaut mark -->
2062     <!ENTITY eth    CDATA "&#240;" -- small eth, Icelandic -->
2063     <!ENTITY ntilde CDATA "&#241;" -- small n, tilde -->
2064     <!ENTITY ograve CDATA "&#242;" -- small o, grave accent -->
2065     <!ENTITY oacute CDATA "&#243;" -- small o, acute accent -->
2066     <!ENTITY ocirc  CDATA "&#244;" -- small o, circumflex accent -->
2067     <!ENTITY otilde CDATA "&#245;" -- small o, tilde -->
2076     <!ENTITY ouml   CDATA "&#246;" -- small o, dieresis or umlaut mark -->
2077     <!ENTITY divide CDATA "&#247;" -- divide sign -->
2078     <!ENTITY oslash CDATA "&#248;" -- small o, slash -->
2079     <!ENTITY ugrave CDATA "&#249;" -- small u, grave accent -->
2080     <!ENTITY uacute CDATA "&#250;" -- small u, acute accent -->
2081     <!ENTITY ucirc  CDATA "&#251;" -- small u, circumflex accent -->
2082     <!ENTITY uuml   CDATA "&#252;" -- small u, dieresis or umlaut mark -->
2083     <!ENTITY yacute CDATA "&#253;" -- small y, acute accent -->
2084     <!ENTITY thorn  CDATA "&#254;" -- small thorn, Icelandic -->
2085     <!ENTITY yuml   CDATA "&#255;" -- small y, dieresis or umlaut mark -->
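
   Each declaration above binds an entity name to a numeric character
   reference.  As a rough illustration, the Python sketch below parses a
   few such declarations into a lookup table and uses it to expand named
   references in text.  The function names parse_entities and
   expand_entities are hypothetical, the regular expressions cover only
   the declaration shape used in this entity set, and real SGML entity
   handling has more cases than are treated here.

```python
import re

# Matches declarations of the form: <!ENTITY name CDATA "&#NNN;" ...
ENTITY_DECL = re.compile(r'<!ENTITY\s+(\w+)\s+CDATA\s+"&#(\d+);"')
# Matches named references such as &eacute; (not numeric ones like &#233;).
NAMED_REF = re.compile(r"&(\w+);")


def parse_entities(declarations):
    """Map each declared entity name to the character it stands for."""
    return {name: chr(int(code)) for name, code in ENTITY_DECL.findall(declarations)}


def expand_entities(text, table):
    """Replace known named references; leave unknown ones untouched."""
    return NAMED_REF.sub(lambda m: table.get(m.group(1), m.group(0)), text)


# Three declarations copied from the entity set above.
SAMPLE = """
<!ENTITY nbsp   CDATA "&#160;" -- no-break space -->
<!ENTITY eacute CDATA "&#233;" -- small e, acute accent -->
<!ENTITY yuml   CDATA "&#255;" -- small y, dieresis or umlaut mark -->
"""

table = parse_entities(SAMPLE)
print(expand_entities("caf&eacute;", table))
```

   Because every replacement value lies between 160 and 255, the same
   table could also drive a conversion to ISO-8859-1 octets.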
2086
2087
2088 Bibliography
2089
2090    [BRYAN88]      M. Bryan, "SGML -- An Author's Guide to the Standard
2091                   Generalized Markup Language", Addison-Wesley, Reading,
2092                   1988.
2093
2094    [ERCS]         Extended Reference Concrete Syntax for SGML.
2095                   <http://www.sgmlopen.org/sgml/docs/ercs/ercs-
2096                   home.html>
2097
2098    [GOLD90]       C. F. Goldfarb, "The SGML Handbook", Y. Rubinsky, Ed.,
2099                   Oxford University Press, 1990.
2100
2101    [HTTP-1.1]     R.T. Fielding, H. Frystyk Nielsen, and T. Berners-Lee,
2102                   "Hypertext Transfer Protocol -- HTTP/1.1", Work in
2103                   progress (draft-ietf-http-v11-spec-03.txt), MIT/LCS,
2104                   May 1996.
2105
2106    [ISO-639]      ISO 639:1988. Codes pour la représentation des noms de
2107                   langue.  Technical content in
2108                   <http://www.sil.org/sgml/iso639a.html>
2109
2110    [ISO-3166]     ISO 3166:1993. Codes pour la représentation des noms
2111                   de pays.
2112
2113    [ISO-8601]     ISO 8601:1988.  Éléments de données et formats
2114                   d'échange -- Échange d'information -- Représentation
2115                   de la date et de l'heure.
2116
2117    [ISO-8859-1]   ISO 8859-1:1987.  International Standard -- Informa-
2118                   tion Processing -- 8-bit Single-Byte Coded Graphic
2119                   Character Sets -- Part 1: Latin Alphabet No. 1.
2120
2121    [ISO-8879]     ISO 8879:1986. International Standard -- Information
2122                   Processing -- Text and Office Systems -- Standard Gen-
2123                   eralized Markup Language (SGML).
2124
2132    [ISO-10646]    ISO/IEC 10646-1:1993. International Standard -- Infor-
2133                   mation technology -- Universal Multiple-Octet Coded
2134                   Character Set (UCS) -- Part 1: Architecture and Basic
2135                   Multilingual Plane.
2136
2137    [NICOL]        G.T. Nicol, "The Multilingual World Wide Web", Elec-
2138                   tronic Book Technologies, 1995,
2139                   <http://www.ebt.com/docs/multling.html>
2140
2141    [NICOL2]       G.T. Nicol, "MIME Header Supplemented File Type", Work
2142                   in progress, <draft-nicol-mime-header-type-00.txt>,
2143                   EBT, October 1995.
2144
2145    [RFC1345]      K. Simonsen, "Character Mnemonics & Character Sets",
2146                   RFC 1345, Rationel Almen Planlaegning, June 1992.
2147
2148    [RFC1468]      J. Murai, M. Crispin and E. van der Poel, "Japanese
2149                   Character Encoding for Internet Messages", RFC 1468,
2150                   Keio University, Panda Programming, June 1993.
2151
2152    [RFC1521]      N. Borenstein and N. Freed, "MIME (Multipurpose Inter-
2153                   net Mail Extensions) Part One: Mechanisms for Specify-
2154                   ing and Describing the Format of Internet Message Bod-
2155                   ies", RFC 1521, Bellcore, Innosoft, September 1993.
2156
2157    [RFC1641]      D. Goldsmith, M. Davis, "Using Unicode with MIME", RFC
2158                   1641, Taligent inc., July 1994.
2159
2160    [RFC1642]      D. Goldsmith, M. Davis, "UTF-7: A Mail-safe Transfor-
2161                   mation Format of Unicode", RFC 1642, Taligent inc.,
2162                   July 1994.
2163
2164    [RFC1738]      T. Berners-Lee, L. Masinter, and M. McCahill, "Uniform
2165                   Resource Locators (URL)", RFC 1738, CERN, Xerox PARC,
2166                   University of Minnesota, October 1994.
2167
2168    [RFC1766]      H. Alvestrand, "Tags for the Identification of
2169                   Languages", RFC 1766, UNINETT, March 1995.
2170
2171    [RFC1866]      T. Berners-Lee and D. Connolly, "Hypertext Markup Lan-
2172                   guage - 2.0", RFC 1866, MIT/W3C, November 1995.
2173
2174    [RFC1867]      E. Nebel and L. Masinter, "Form-based File Upload in
2175                   HTML", RFC 1867, Xerox Corporation, November 1995.
2176
2177    [RFC1942]      D. Raggett, "HTML Tables", RFC 1942, W3C, May 1996.
2178
2188    [RFC1945]      T. Berners-Lee, R.T. Fielding, and H. Frystyk Nielsen,
2189                   "Hypertext Transfer Protocol -- HTTP/1.0", RFC 1945,
2190                   MIT/LCS, UC Irvine, May 1996.
2191
2192    [SQ91]         SoftQuad, "The SGML Primer", 3rd ed., SoftQuad Inc.,
2193                   1991.
2194
2195    [TAKADA]       Toshihiro Takada, "Multilingual Information Exchange
2196                   through the World-Wide Web", Computer Networks and
2197                   ISDN Systems, Vol. 27, No. 2, Nov. 1994, p. 235-241.
2198
2199    [TEI]          TEI Guidelines for Electronic Text Encoding and
2200                   Interchange.  <http://etext.virginia.edu/TEI.html>
2201
2202    [UNICODE]      The Unicode Consortium, "The Unicode Standard --
2203                   Worldwide Character Encoding -- Version 1.0", Addison-
2204                   Wesley, Volume 1, 1991, Volume 2, 1992, and Technical
2205                   Report #4, 1993.  The BIDI algorithm is in appendix A
2206                   of volume 1, with corrections in appendix D of volume
2207                   2.
2208
2209    [UTF-8]        ISO/IEC 10646-1:1993 AMENDMENT 2 (1996). UCS Transfor-
2210                   mation Format 8 (UTF-8).
2211
2212    [VANH90]       E. van Herwijnen, "Practical SGML", Kluwer Academic
2213                   Publishers Group, Norwell and Dordrecht, 1990.
2214
2215 Authors' Addresses
2216
2217    François Yergeau
2218    Alis Technologies
2219    100, boul. Alexis-Nihon, bureau 600
2220    Montréal  QC  H4M 2P2
2221    Canada
2222
2223    Tel: +1 (514) 747-2547
2224    Fax: +1 (514) 747-2561
2225    EMail: fyergeau@alis.com
2226
2227
2228    Gavin Thomas Nicol
2229    Electronic Book Technologies, Japan
2230    1-29-9 Tsurumaki,
2231    Setagaya-ku,
2232    Tokyo
2233    Japan
2234
2235    Tel: +81-3-3230-8161
2244    Fax: +81-3-3230-8163
2245    EMail: gtn@ebt.com, gtn@twics.co.jp
2246
2247
2248    Glenn Adams
2249    Spyglass
2250    118 Magazine Street
2251    Cambridge, MA 02139
2252    U.S.A.
2253
2254    Tel: +1 (617) 864-5524
2255    Fax: +1 (617) 864-4965
2256    EMail: glenn@spyglass.com
2257
2258
2259    Martin J. Duerst
2260    Multimedia-Laboratory
2261    Department of Computer Science
2262    University of Zurich
2263    Winterthurerstrasse 190
2264    CH-8057 Zurich
2265    Switzerland
2266
2267    Tel: +41 1 257 43 16
2268    Fax: +41 1 363 00 35
2269    EMail: mduerst@ifi.unizh.ch
2270
2271