Date: Tue, 22 Jul 1997 11:22:21 -0500
From: Mark Mandel <Mark@DRAGONSYS.COM>
Subject: Re: HTML diacritics and such
In further answer to the question,
How can one tell something is not ASCII when it looks fine on one's own=
email system before it's sent?
[and note that weird line-ending equal sign! Inserted by MIME, perhaps?]:
Here's an exhaustive answer. ASCII proper (7-bit), which should be
universally carried as is on the Internet (1), consists of the following,
arranged more or less by familiarity rather than by ASCII code or keyboard
position:
The ultra-basics:
26 uppercase unaccented letters A to Z
26 lowercase unaccented letters a to z
10 digits 0 to 9
The invisible:
space, generated by pressing the space bar
newline, generated by pressing ENTER or <--' (or maybe RETURN?) (2)
tab, generated by pressing the TAB key (4)
Familiar punctuation:
, comma
. period, full stop
? question mark
; semicolon
: colon
' apostrophe / open single quote / close single quote (3)
" quotation mark (3)
( ) parentheses
! exclamation mark
- hyphen or minus sign
Familiar miscellaneous and mathematical symbols:
@ at sign
# number sign, pound sign, sharp, octothorp (5)
$ dollar sign
% percent sign
& ampersand
* asterisk
+ plus sign
= equal sign
/ slash, solidus
< less-than sign
> greater-than sign
[ ] brackets (1)
{ } braces ("curly brackets") (1)
Other characters:
~ tilde
` grave accent
^ caret or circumflex
_ underscore
| vertical bar, "pipe symbol" to Unix users (1)
\ backslash, reverse solidus (1)
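
The inventory above can also be checked mechanically. Here is a minimal Python sketch, assuming the message is available as a Python string. The function name non_ascii_chars is made up for illustration. Note that it flags only codes above 127; control characters below 32 are ASCII too, though not all of them travel well.

```python
import string

def non_ascii_chars(text):
    """Return (position, character) pairs for everything outside 7-bit ASCII."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]

# The visible inventory listed above: 52 letters, 10 digits,
# 32 punctuation marks and symbols, plus the space.
PRINTABLE = string.ascii_letters + string.digits + string.punctuation + " "
```

An empty result means the text is safe by this measure; otherwise each pair tells you where the offending character sits.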
NOTES:
1. At one time the Scandinavian languages replaced [ ] { } \ | with special
letters. To the best of my knowledge, those letters now are commonly
represented with codes above 127.
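
To illustrate the replacement described in note 1: in the Danish/Norwegian national variant of the 7-bit code, six code positions displayed as letters instead of brackets, braces, backslash, and bar. The sketch below is only an illustration. The table is typed in by hand rather than taken from any library, and the names DK_NO_VARIANT and show_as_dk_no are made up.

```python
# Code positions reassigned by the Danish/Norwegian 7-bit variant.
# Written with escapes so that this block itself stays pure ASCII.
DK_NO_VARIANT = {
    "[": "\u00c6",   # LATIN CAPITAL LETTER AE
    "\\": "\u00d8",  # LATIN CAPITAL LETTER O WITH STROKE
    "]": "\u00c5",   # LATIN CAPITAL LETTER A WITH RING ABOVE
    "{": "\u00e6",   # LATIN SMALL LETTER AE
    "|": "\u00f8",   # LATIN SMALL LETTER O WITH STROKE
    "}": "\u00e5",   # LATIN SMALL LETTER A WITH RING ABOVE
}

def show_as_dk_no(ascii_text):
    """Show how US-ASCII text would have looked on a DK/NO-variant display."""
    return "".join(DK_NO_VARIANT.get(ch, ch) for ch in ascii_text)
```

Feeding it a line such as "a[i] = {x|y}" shows why those six characters were the risky ones: every bracket, brace, and bar comes out as a letter.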
2. Different operating systems and word processors notoriously have
different ways of representing a new line. Automatic word wrap is
especially deceptive, producing line breaks that depend on the OS, the WP,
the margin setting, the font name, size, and style, and the width of the
screen. Regardless, though, when you actually, physically press ENTER in
your mail editor you ought to produce a line break that all recipients will
see there... unless THEIR mail readers mess it up, as mine often does (but
that's not your problem)!
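
For the curious, the "different ways" in note 2 mostly come down to three byte sequences: CR LF (DOS and Windows), bare CR (classic Mac OS), and bare LF (Unix). A minimal Python sketch that folds all three into LF, assuming the text is in hand as raw bytes; the function name is made up:

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CR LF and bare CR line ends to a single LF."""
    # Replace the two-byte form first, so that a CR LF pair
    # does not turn into two line breaks.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

This handles only line breaks that were really typed. It cannot recover the breaks that word wrap merely displayed.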
3. ASCII does not have distinct left and right quotation marks, just
' a single vertical tick
and
" a double vertical tick
which do double duty, working both ends of the quotation; and the single
tick is also the apostrophe. Some people and word processors use the ASCII
grave accent `, singly or doubled, for left quotes. This transmits safely,
as far as I know.
Many word processors generate distinct left and right characters by
analyzing the context in which you press the keys labeled with these
symbols, and insert special codes for them into the text. These codes will
display unpredictably when transmitted through email and shown on other
people's screens. If you are using a word processor to write text that you
intend to email, and you see distinct left and right quotes on the screen,
or a curly apostrophe, look for an option menu that lets you turn off this
feature (often called "smart quotes").
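
If the curly quotes are already in the text, they can be straightened after the fact. A minimal Python sketch, assuming the text has been read in as a Unicode string (a word processor's own file format may store these characters under other codes). The names SMART_TO_ASCII and straighten_quotes are made up for illustration.

```python
# Map the four common "smart" quote characters to their ASCII stand-ins.
SMART_TO_ASCII = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark, also the curly apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def straighten_quotes(text):
    """Replace curly quotes and apostrophes with the plain ASCII ticks."""
    return text.translate(SMART_TO_ASCII)
```

Only these four characters are touched; anything else outside ASCII passes through unchanged and still needs attention.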
4. The tab key (short for TABULATOR) is wonderful for making tables. How
far does it move the cursor? Your guess is as good as mine. Like word wrap,
it depends on the OS, the WP, the margin setting, and the font name, size,
and style. Even in monospace fonts -- even in DOS -- there is no uniform
standard. The table that looks so great on your screen will come to me
staggered, jagged, and word-wrapped to hell. When making tables for email
transmission or posting, use a monospace font (such as Courier or Monaco)
and the space bar (autorepeat can be very helpful) rather than the tab key.
After all, you're trying to share information, not disguise it.
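
The space-bar advice in note 4 can be automated. A minimal Python sketch that pads each column with spaces only; the function name format_table is made up, and it assumes every row has the same number of cells:

```python
def format_table(rows, gap=2):
    """Lay out rows of strings in columns padded with spaces, never tabs."""
    # Width of each column = length of its longest cell.
    widths = [max(len(cell) for cell in col) for col in zip(*rows)]
    lines = []
    for row in rows:
        padded = (cell.ljust(w) for cell, w in zip(row, widths))
        # Join columns with a fixed gap and drop trailing padding.
        lines.append((" " * gap).join(padded).rstrip())
    return "\n".join(lines)

print(format_table([("Char", "Name"), ("~", "tilde"), ("_", "underscore")]))
```

The result lines up only in a monospace font, as the note says, but at least it lines up the same way for everyone who uses one.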
5. The cross-hatched symbol like a tic-tac-toe board or musical sharp sign
usually means "number" in the US, as in "Post Office Box #217", when
preceding a number, and occasionally "pounds [avoirdupois]" when following,
as in "Idaho potatoes, 100# bag"; whence the name "pound sign", which in
the UK refers to the pound-sterling symbol, a crossed curly capital L. Some
wiseacre decided to pun on this by making the SAME byte appear as a
cross-hatch in the US and a sterling symbol in the UK, at least sometimes.
If you "honor and remember" these limitations, your fellow researchers will bless you!
Oh, and one more thing. "HTML diacritics"? HTML doesn't use diacritics: it
goes to great lengths to use ONLY 7-bit ASCII to express all its formats
and characters, including diacritical marks: e.g.,
&acirc; Small a, circumflex accent
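
Going the other way, from accented text to the 7-bit form HTML uses, can be sketched in a few lines of Python. This is an illustration only, with a made-up function name. It relies on the standard library's html.entities.codepoint2name table and falls back to a numeric reference where that table has no name. It leaves the ASCII characters &, < and > alone, so it is not a full HTML escaper.

```python
from html.entities import codepoint2name

def to_ascii_html(text):
    """Replace every non-ASCII character with an HTML character reference."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                               # already 7-bit ASCII
        elif cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")   # named, like &acirc;
        else:
            out.append("&#%d;" % cp)                     # numeric reference
    return "".join(out)
```

For example, to_ascii_html("p\u00e2te") returns "p&acirc;te", which is pure ASCII and survives any mail system that carries ASCII.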
Mark A. Mandel : Senior Linguist : mark@dragonsys.com
Dragon Systems, Inc. : speech recognition : +1 617 965-5200
320 Nevada St., Newton, MA 02160, USA : http://www.dragonsys.com/
Personal home page: http://world.std.com/~mam/