Date:    Tue, 22 Jul 1997 11:22:21 -0500
From:    Mark Mandel
Subject: Re: HTML diacritics and such

In further answer to the question,

>>How can one tell something is not ASCII when it looks fine on one's own=
email system before it's sent?

[and note that weird line-ending equal sign! Inserted by MIME, perhaps?]:

Here's an exhaustive answer.

ASCII proper (7-bit), which should be universally carried as is on the
Internet(1), consists of the following, arranged sort of by familiarity
instead of by ASCII code or keyboard:

The ultra-basics:
    26 uppercase unaccented letters A to Z
    26 lowercase unaccented letters a to z
    10 digits 0 to 9

The invisible:
    space, generated by pressing the space bar
    newline, generated by pressing ENTER or <--' (or maybe RETURN?) (2)
    tab, generated by pressing the TAB key (4)

Familiar punctuation:
    ,     comma
    .     period, full stop
    ?     question mark
    ;     semicolon
    :     colon
    '     apostrophe / open single quote / close single quote (3)
    "     quotation mark (3)
    ( )   parentheses
    !     exclamation mark
    -     hyphen or minus sign

Familiar miscellaneous and mathematical symbols:
    @     at sign
    #     number sign, pound sign, sharp, octothorp (5)
    $     dollar sign
    %     percent sign
    &     ampersand
    *     asterisk
    +     plus sign
    =     equal sign
    /     slash, solidus
    <     less-than sign
    >     greater-than sign
    [ ]   brackets (1)
    { }   braces ("curly brackets") (1)

Other characters:
    ~     tilde
    `     grave accent
    ^     caret or circumflex
    _     underscore
    |     vertical bar, "pipe symbol" to Unix users (1)
    \     backslash, reverse solidus (1)

NOTES:

1. At one time the Scandinavian languages replaced [ ] { } \ | with
special letters. To the best of my knowledge, those letters now are
commonly represented with codes above 127.

2. Different operating systems and word processors notoriously have
different ways of representing a new line. Automatic word wrap is
especially deceptive, producing line breaks that depend on the OS, the
WP, the margin setting, the font name, size, and style, and the width of
the screen. Regardless, though, when you actually, physically press ENTER
in your mail editor you ought to produce a line break that all recipients
will see there... unless THEIR mail readers mess it up, as mine often
does (but that's not your problem)!

3. ASCII does not have distinct left and right quotation marks, just
    '  a single vertical tick and
    "  a double vertical tick
which do double duty, working both ends of the quotation; and the single
tick is also the apostrophe. Some people and word processors use the
ASCII grave accent `, singly or doubled, for left quotes. This transmits
safely, as far as I know.

Many word processors generate distinct left and right characters by
analyzing the context in which you press the keys labeled with these
symbols, and insert special codes for them into the text. These will
appear unpredictably when transmitted through email and displayed on
other people's screens. If you are using a word processor to write text
that you intend to email, and you see distinct left and right quotes on
the screen, or a curly apostrophe, look for an option menu that lets you
turn off this feature (often called "smart quotes").

4. The tab key (short for TABULATOR) is wonderful for making tables. How
far does it move the cursor? Your guess is as good as mine. Like word
wrap, it depends on the OS, the WP, the margin setting, and the font
name, size, and style. Even in monospace fonts -- even in DOS -- there is
no uniform standard. The table that looks so great on your screen will
come to me staggered, jagged, and word-wrapped to hell. When making
tables for email transmission or posting, use a monospace font (such as
Courier or Monaco) and the space bar (autorepeat can be very helpful)
rather than the tab key. After all, you're trying to share information,
not disguise it.

5. The cross-hatched symbol like a tic-tac-toe board or musical sharp
sign usually means "number" in the US, as in "Post Office Box #217", when
preceding a number, and occasionally "pounds [avoirdupois]" when
following, as in "Idaho potatoes, 100# bag"; whence the name "pound
sign", which in the UK refers to the pound-sterling symbol, a crossed
curly capital L. Some wiseacre decided to pun on this by making the SAME
byte appear as a cross-hatch in the US and a sterling symbol in the UK,
at least sometimes.

If you "honor and remember" these limitations, your fellow researchers
will bless you!

Oh, and one more thing. "HTML diacritics"? HTML doesn't use diacritics:
it goes to great lengths to use ONLY 7-bit ASCII to express all its
formats and characters, including diacritical marks: e.g.,

    &acirc;     Small a, circumflex accent

   Mark A. Mandel : Senior Linguist : mark@dragonsys.com
   Dragon Systems, Inc. : speech recognition : +1 617 965-5200
   320 Nevada St., Newton, MA 02160, USA : http://www.dragonsys.com/
   Personal home page: http://world.std.com/~mam/
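
The checks described in the notes above can also be left to a short program. What follows is only a minimal sketch in Python, assuming the draft message is available as a text string. The names find_non_ascii, make_mail_safe, and SMART_PUNCTUATION, and the tab width of 8, are hypothetical choices made for this illustration; they do not come from any particular mail program or word processor.

```python
import unicodedata

# Hypothetical table for this sketch: curly quotes mapped to ASCII ticks.
SMART_PUNCTUATION = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / curly apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}


def find_non_ascii(text):
    """Return (line, column, character, name) for each character above code 127."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col_no, ch, name))
    return hits


def make_mail_safe(text, tab_width=8):
    """Replace curly quotes with ASCII ticks and tabs with spaces.

    The tab width is an assumption; as note 4 says, there is no
    uniform standard for how far a tab moves the cursor.
    """
    for smart, plain in SMART_PUNCTUATION.items():
        text = text.replace(smart, plain)
    return text.expandtabs(tab_width)


if __name__ == "__main__":
    draft = "He said \u201cna\u00efve\u201d.\nName\tAge"
    for hit in find_non_ascii(draft):
        print(hit)
    print(make_mail_safe(draft))
```

Run on the sample draft, the first function reports the two curly quotes and the accented letter with their positions, and the second swaps the quotes for plain ticks and the tab for spaces. Accented letters are reported but deliberately left alone, since the right substitute depends on what the recipient's system can display.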