Date: Tue, 22 Jul 1997 11:22:21 -0500

From: Mark Mandel Mark[AT SYMBOL GOES HERE]DRAGONSYS.COM

Subject: Re: HTML diacritics and such



In further answer to the question,



How can one tell something is not ASCII when it looks fine on one's own=



email system before it's sent?



[and note that weird line-ending equal sign! Inserted by MIME, perhaps?]:



Here's an exhaustive answer. ASCII proper (7-bit), which should be universally carried as is on the
Internet(1), consists of the following, arranged sort of by familiarity instead of by ASCII code or
keyboard:



The ultra-basics:

26 uppercase unaccented letters A to Z

26 lowercase unaccented letters a to z

10 digits 0 to 9



The invisible:

space, generated by pressing the space bar

newline, generated by pressing ENTER or <--' (or maybe RETURN?) (2)

tab, generated by pressing the TAB key (4)



Familiar punctuation:

, comma

. period, full stop

? question mark

; semicolon

: colon

' apostrophe / open single quote / close single quote (3)

" quotation mark (3)

( ) parentheses

! exclamation mark

- hyphen or minus sign



Familiar miscellaneous and mathematical symbols:

@ at sign

# number sign, pound sign, sharp, octothorp (5)

$ dollar sign

% percent sign

& ampersand

* asterisk

+ plus sign

= equal sign

/ slash, solidus

< less-than sign

> greater-than sign

[ ] brackets (1)

{ } braces ("curly brackets") (1)



Other characters:

~ tilde

` grave accent

^ caret or circumflex

_ underscore

| vertical bar, "pipe symbol" to Unix users (1)

\ backslash, reverse solidus (1)
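
Anything outside the list above has a code above 127 (or is a control character). A minimal sketch of how one might check a draft before sending it, assuming the text is available as a Python string; the function name find_non_ascii and the sample sentence are made up for illustration:

```python
import unicodedata

def find_non_ascii(text):
    """Return (line, column, character, Unicode name) for each character above code 127."""
    hits = []
    # Split on "\n" only, so that no non-ASCII line separator is silently consumed.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col_no, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits

# Hypothetical sample: a curly apostrophe and an accented letter hide in this sentence.
sample = "It\u2019s a na\u00efve question."
for line_no, col_no, ch, name in find_non_ascii(sample):
    print(f"line {line_no}, column {col_no}: {ch!r} ({name})")
```

An empty result means every character is 7-bit; it does not flag ASCII control characters other than the ones listed above.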





NOTES:

1. At one time the Scandinavian languages replaced [ ] { } \ | with special letters. To the best of
my knowledge, those letters now are commonly represented with codes above 127.



2. Different operating systems and word processors notoriously have different ways of representing
a new line. Automatic word wrap is especially deceptive, producing line breaks that depend on the
OS, the WP, the margin setting, the font name, size, and style, and the width of the screen.
Regardless, though, when you actually, physically press ENTER in your mail editor you ought to
produce a line break that all recipients will see there... unless THEIR mail readers mess it up, as
mine often does (but that's not your problem)!
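
To make the "different ways" concrete: the three common conventions are CR LF (DOS and Windows), a lone CR (classic Mac OS), and a lone LF (Unix). Those specifics are my addition, not part of the note above. A small sketch, with a made-up function name, that folds all three into one:

```python
def normalize_newlines(text):
    """Convert CR LF and lone CR line endings to a single LF."""
    # Replace the two-character sequence first so it is not turned into two line breaks.
    return text.replace("\r\n", "\n").replace("\r", "\n")

mixed = "first line\r\nsecond line\rthird line\nfourth line"
print(normalize_newlines(mixed).split("\n"))
```

This only deals with real line breaks that are stored in the text; it cannot recover breaks that word wrap merely showed on the screen.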



3. ASCII does not have distinct left and right quotation marks, just

' a single vertical tick

and

" a double vertical tick

which do double duty, working both ends of the quotation; and the single tick is also the
apostrophe. Some people and word processors use the ASCII grave accent `, singly or doubled, for
left quotes. This transmits safely, as far as I know.



Many word processors generate distinct left and right characters by analyzing the context in which
you press the keys labeled with these symbols, and insert special codes for them into the text.
These will appear unpredictably when transmitted through email and displayed on other people's
screens. If you are using a word processor to write text that you intend to email, and you see
distinct left and right quotes on the screen, or a curly apostrophe, look for an option menu that
lets you turn off this feature (often called "smart quotes").
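
If the curly quotes are already in the text, they can be straightened after the fact. A minimal sketch, assuming the word processor produced the four standard Unicode curly-quote characters (other programs may use other codes); the names SMART_TO_ASCII and straighten_quotes are illustrative:

```python
# Map each curly quote to the ASCII tick that does the same job.
SMART_TO_ASCII = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark, also the curly apostrophe
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
}

def straighten_quotes(text):
    """Replace curly single and double quotes with their ASCII equivalents."""
    return text.translate(str.maketrans(SMART_TO_ASCII))

print(straighten_quotes("\u201cDon\u2019t panic,\u201d she said."))
```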



4. The tab key (short for TABULATOR) is wonderful for making tables. How far does it move the
cursor? Your guess is as good as mine. Like word wrap, it depends on the OS, the WP, the margin
setting, and the font name, size, and style. Even in monospace fonts -- even in DOS -- there is no
uniform standard. The table that looks so great on your screen will come to me staggered, jagged,
and word-wrapped to hell. When making tables for email transmission or posting, use a monospace
font (such as Courier or Monaco) and the space bar (autorepeat can be very helpful) rather than the
tab key. After all, you're trying to share information, not disguise it.
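
For anyone who would rather not lean on the space bar, the padding can be computed. A rough sketch, assuming the table is a list of equal-length rows; format_table and the sample rows are invented for the example:

```python
def format_table(rows, gap=2):
    """Align columns using spaces only, so the layout survives in any monospace font."""
    # The width of each column is the length of its longest cell.
    widths = [max(len(str(row[i])) for row in rows) for i in range(len(rows[0]))]
    lines = []
    for row in rows:
        cells = [str(cell).ljust(width) for cell, width in zip(row, widths)]
        # Trailing padding on the last column is pointless, so strip it.
        lines.append((" " * gap).join(cells).rstrip())
    return "\n".join(lines)

rows = [("Char", "Code", "Name"), ("A", 65, "uppercase A"), ("~", 126, "tilde")]
print(format_table(rows))
```

The result contains no tab characters at all, so it lines up wherever it is shown in a monospace font, provided the lines are short enough not to be word-wrapped.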



5. The cross-hatched symbol like a tic-tac-toe board or musical sharp sign usually means "number"
in the US, as in "Post Office Box #217", when preceding a number, and occasionally "pounds
[avoirdupois]" when following, as in "Idaho potatoes, 100# bag"; whence the name "pound sign",
which in the UK refers to the pound-sterling symbol, a crossed curly capital L. Some wiseacre
decided to pun on this by making the SAME byte appear as a cross-hatch in the US and a sterling
symbol in the UK, at least sometimes.



If you "honor and remember" these limitations, your fellow researchers will bless you!



Oh, and one more thing. "HTML diacritics"? HTML doesn't use diacritics: it goes to great lengths to
use ONLY 7-bit ASCII to express all its formats and characters, including diacritical marks: e.g.,

&acirc; Small a, circumflex accent
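
The same trick can be applied mechanically. A minimal sketch using the HTML 4 entity table in Python's standard library (html.entities.codepoint2name); the function name to_ascii_html is made up, and falling back to a numeric reference for characters without a named entity is my choice for the example:

```python
from html.entities import codepoint2name

def to_ascii_html(text):
    """Rewrite every character above code 127 as a 7-bit HTML character reference."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)                            # already plain ASCII
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")   # named entity, e.g. &acirc;
        else:
            out.append(f"&#{code};")                  # numeric fallback
    return "".join(out)

print(to_ascii_html("ch\u00e2teau"))
```

Note that this leaves the ASCII characters & < > alone, so it is meant for running text rather than for text that must itself be escaped as HTML.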



Mark A. Mandel : Senior Linguist : mark[AT SYMBOL GOES HERE]dragonsys.com

Dragon Systems, Inc. : speech recognition : +1 617 965-5200

320 Nevada St., Newton, MA 02160, USA : http://www.dragonsys.com/

Personal home page: http://world.std.com/~mam/