Formatted text

Not to be confused with Text formatting.

Formatted text, styled text, or rich text, as opposed to plain text, has styling information beyond the minimum of semantic elements: colours, styles (boldface, italic), sizes, and special features (such as hyperlinks).

Terminology

Formatted text should not be equated with binary files, nor contrasted with ASCII text. Formatted text is not necessarily binary: it may be text-only, as in HTML, RTF or enriched text files, and it may even be ASCII-only. Conversely, a plain text file may be non-ASCII (for example, Unicode text encoded as UTF-8). Text-only formatted text is achieved by markup that is itself textual, whereas some editors of formatted text, such as Microsoft Word, save in a binary format.
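The distinction can be illustrated with a minimal Python sketch. The two sample strings and the helper name `is_ascii` are assumptions chosen for illustration only; the point is that "formatted" and "ASCII" are independent properties.

```python
# Formatted text that happens to be pure ASCII (HTML markup).
formatted = "<p>I am <b>not</b> making this up.</p>"

# Plain text with no formatting at all, but with non-ASCII characters.
plain = "Café crème, naïve résumé"


def is_ascii(text: str) -> bool:
    """Return True if every character is in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in text)


formatted_is_ascii = is_ascii(formatted)  # markup, yet ASCII-only
plain_is_ascii = is_ascii(plain)          # no markup, yet not ASCII

# Encoded as UTF-8, each accented letter here occupies two bytes.
plain_utf8 = plain.encode("utf-8")
```

Here the formatted sample passes the ASCII check while the unformatted one does not, which is why neither property can stand in for the other.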

Beginnings of formatted text

Formatted text has its genesis in the pre-computer use of underscoring to emphasise passages in typewritten manuscripts. On the first interactive computer systems, underscoring was not possible, and users made up for this lack (and for the lack of formatting in ASCII) by using certain symbols as substitutes. Emphasis, for example, could be achieved in ASCII in a number of ways:

  • Capitalization: I am NOT making this up.
  • Surrounding with underscores: I am _not_ making this up.
  • Surrounding with asterisks: I am *not* making this up.
  • Spacing: I am n o t making this up.

Surrounding by underscores was also used for book titles: Look it up in _The_C_Programming_Language_.
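Conventions like these are regular enough to be converted mechanically into markup. The following Python sketch is one possible approach, not a standard: the function name `ascii_emphasis_to_html` is hypothetical, and mapping asterisks to bold and underscores to italics is an assumption made for the example.

```python
import re


def ascii_emphasis_to_html(text: str) -> str:
    """Convert *word* to <b>word</b> and _some_words_ to <i>some words</i>."""
    # Asterisk-delimited runs become bold (an assumed mapping).
    text = re.sub(r"\*(\w[^*]*?)\*", r"<b>\1</b>", text)

    # Underscore-delimited runs become italic. The lookarounds require the
    # delimiters to sit at word boundaries, so identifiers such as
    # snake_case are left alone. Inner underscores are turned into spaces,
    # as in the book-title convention.
    def italic(match: re.Match) -> str:
        return "<i>" + match.group(1).replace("_", " ") + "</i>"

    return re.sub(r"(?<!\w)_(\w.*?)_(?!\w)", italic, text)


emphasis = ascii_emphasis_to_html("I am _not_ making this up.")
strong = ascii_emphasis_to_html("I am *not* making this up.")
title = ascii_emphasis_to_html("Look it up in _The_C_Programming_Language_.")
```

Capitalization and letter spacing are deliberately left untouched, since they cannot be told apart reliably from ordinary text such as acronyms.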

Markup languages

Formatting can be marked by tags distinguished from the body text by special characters, such as angle brackets in HTML. For example, this text:

:The dog is classified as Canis lupus familiaris in taxonomy.

is marked up in HTML thus:

<source lang="html">
<p>The dog is classified as <i>Canis lupus familiaris</i> in taxonomy.</p>
</source>

The italicised text is enclosed by an opening and a closing italics tag. In LaTeX, the text would be marked up like this:

<source lang="latex">
The dog is classified as \textit{Canis lupus familiaris} in taxonomy.
</source>

Marked-up text can be written with any text editor; no special software is needed.
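Because the tags are ordinary characters, a program can separate the formatting from the body text with little effort. The sketch below uses `html.parser.HTMLParser` from the Python standard library; the class name `ItalicCollector` and its attributes are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class ItalicCollector(HTMLParser):
    """Collect all body text, and separately the text inside <i> tags."""

    def __init__(self):
        super().__init__()
        self.plain_parts = []
        self.italic_parts = []
        self._italic_depth = 0  # greater than 0 while inside <i>...</i>

    def handle_starttag(self, tag, attrs):
        if tag == "i":
            self._italic_depth += 1

    def handle_endtag(self, tag):
        if tag == "i" and self._italic_depth > 0:
            self._italic_depth -= 1

    def handle_data(self, data):
        self.plain_parts.append(data)
        if self._italic_depth > 0:
            self.italic_parts.append(data)


markup = "<p>The dog is classified as <i>Canis lupus familiaris</i> in taxonomy.</p>"
collector = ItalicCollector()
collector.feed(markup)
collector.close()

plain_text = "".join(collector.plain_parts)
italic_text = "".join(collector.italic_parts)
```

After the run, `plain_text` holds the sentence without tags and `italic_text` holds only the species name, showing that the extent of the formatting is recoverable from the text alone.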

Formatted document files

Since the arrival of MacWrite, one of the first widely used WYSIWYG word processors, in which the typist applies the formatting visually rather than by inserting textual markup, word processors have tended to save to binary files. Opening such files with a text editor reveals the text interspersed with various non-text bytes, either around the formatted areas (e.g. in WordPerfect) or separately, at the beginning or end of the file (e.g. in Microsoft Word).

Formatted text documents in binary files have, however, two disadvantages: formatting scope and secrecy. Whereas the extent of formatting is marked explicitly in markup languages, WYSIWYG formatting is stateful: a style that has been switched on, for example by pressing the boldface button, stays in effect until it is cancelled. This can lead to formatting mistakes and maintenance troubles. As for secrecy, formatted text document file formats tend to be proprietary and undocumented, which makes it difficult for third parties to write compatible software and also forces unnecessary upgrades when the format changes between versions.
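The scope problem can be made concrete with a toy model. The Python sketch below is an illustration under stated assumptions, not a description of how any real word processor stores its data: `TOGGLE_BOLD` and `render_toggles` are hypothetical names, and the event stream stands in for a typist's keystrokes and button presses.

```python
TOGGLE_BOLD = object()  # hypothetical marker for "boldface button pressed"


def render_toggles(events):
    """Render a stream of strings and bold toggles as HTML-like markup."""
    out = []
    bold = False  # the remembered state that outlives each keystroke
    for event in events:
        if event is TOGGLE_BOLD:
            bold = not bold
        elif bold:
            out.append(f"<b>{event}</b>")
        else:
            out.append(event)
    return "".join(out)


# The typist switches bold on, types one word, and switches it off again.
intended = render_toggles(
    ["I am ", TOGGLE_BOLD, "not", TOGGLE_BOLD, " making this up."]
)

# The second press is forgotten, so the bold state leaks into the rest.
forgotten = render_toggles(
    ["I am ", TOGGLE_BOLD, "not", " making this up."]
)
```

In the second stream nothing marks where the emphasis was meant to end, whereas a markup tag pair records both ends of the span in the text itself.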

WordStar was a popular word processor that did not use binary files with hidden characters.

OpenOffice.org Writer saves files in an XML format. However, the resultant file is binary, since the XML is packed into a compressed ZIP archive.
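How textual XML ends up inside a binary file can be sketched with Python's standard `zipfile` module, assuming the container is a ZIP archive as in OpenDocument packages. The sample XML string is invented for the example, and the archive built here is a toy: a real Writer document contains further members and would not be produced this way.

```python
import io
import zipfile

# Hypothetical stand-in for a document body; real files are far richer.
xml_text = '<?xml version="1.0" encoding="UTF-8"?><text>I am not making this up.</text>'

# Pack the XML into an in-memory, deflate-compressed ZIP archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("content.xml", xml_text)

container_bytes = buffer.getvalue()  # binary: begins with the ZIP signature

# Reading the archive back yields the original, fully textual XML.
with zipfile.ZipFile(io.BytesIO(container_bytes)) as archive:
    recovered = archive.read("content.xml").decode("utf-8")
```

The container is opaque to a plain text editor, yet the markup inside it is unchanged once the archive is unpacked.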

PDF is another formatted text file format that is usually binary (using compression for the text, and storing graphics and fonts in binary). It is generally an end-user format, written from an application such as Microsoft Word or OpenOffice.org Writer, and not intended to be edited by the user once produced.


formatted_text.txt · Last modified: 2016/10/26 20:47 by Mike J. Kreuzer PhD MCSE MCT Microsoft Cloud Ecosystem