33 Years of the Digest ... founded August 21, 1981
Copyright © 2014 E. William Horne. All Rights Reserved.
The Telecom Digest for Sep 22, 2014
Messages in this Issue:

  Re: Is it time for a new charset in the Digest?      (Neal McLain)
  Re: Is it time for a new charset in the Digest?      (tlvp)
  Re: Is it time for a new charset in the Digest?      (Gordon Burditt)
  Re: Is it time for a new charset in the Digest?      (Garrett Wollman)
  Opponents of Internet Regulation Carry the Day       (Neal McLain)
  Re: Is it time for a new charset in the Digest?      (John Levine)
  Re: Is it time for a new charset in the Digest?      (John Levine)
  Vermont reminds drivers of Oct. 1 cellphone ban      (Monty Solomon)
  Android L will have device encryption on by default  (Monty Solomon)
If Tyranny and Oppression come to this land, it will be in the guise of fighting a foreign enemy. - James Madison
See the bottom of this issue for subscription and archive details.
Date: Sat, 20 Sep 2014 21:54:21 -0700 (PDT)
From: Neal McLain <email@example.com>
To: firstname.lastname@example.org
Subject: Re: Is it time for a new charset in the Digest?
Message-ID: <email@example.com>

On Saturday, September 20, 2014 11:14:29 AM UTC-5, Telecom Digest
Moderator wrote:

[snip]
> >> be able to add "Internationalization" to my résumé.
[snip]
> Code point  char  Hex values  Description
> U+00E9      é     c3 a9       "LATIN SMALL LETTER E WITH ACUTE"
[snip]
> went out as "résumé" comes back as "r©Âsumr©Â" or similar gibberish.

The above-quoted lines of text indicate how Google treats diacritic
characters in the Google Groups archive:

https://groups.google.com/forum/?hl=en#!topic/comp.dcom.telecom/yqHnudP1Ia4

Judging from previous comments in this thread, I may be the only T-D
reader who reads messages in, and posts from, the Google Groups
archive. While I don't have any opinion about what character set T-D
should use in the future, I suggest that the Google Groups archive
shouldn't be left out of the discussion.

Neal McLain

***** Moderator's Note *****

I'd like to use a character set which everyone can understand, but
that's not always possible. I don't like it, but windoze is the
default standard, and redmond gets to dictate what works whether I
like it or not, so if I try to please Google and mickeysoft, I'm done
before I start. I'd much rather choose a standards-based solution,
which everyone can agree on, and which will, at least, allow those
with evil-empire software (of ANY kind) to adapt and find workarounds.

Bill Horne
Moderator
Date: Sat, 20 Sep 2014 21:59:23 -0400
From: tlvp <mPiOsUcB.EtLlLvEp@att.net>
To: firstname.lastname@example.org
Subject: Re: Is it time for a new charset in the Digest?
Message-ID: <email@example.com>

On Sat, 20 Sep 2014 12:14:29 -0400, Telecom Digest Moderator wrote:

>> Another thing that uses the term "transitional" is HTML, which
>> is not related to character sets.
>
> I had not known that either. What part of HTML is "transitional"?

It's not that the "t" in "html" means "transitional" (it doesn't) --
it's that HTML has various acceptable standards, some of them
transitional, as shown in the corresponding DOCTYPE declarations
(with which a valid HTML document must begin) -- for example, in this
one:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
  "http://www.w3.org/TR/html4/loose.dtd">

There you plainly see 'the term "transitional"', n'est-ce pas?

Cheers, -- tlvp
--
Avant de repondre, jeter la poubelle, SVP.
Date: Sun, 21 Sep 2014 03:37:58 -0500
From: firstname.lastname@example.org (Gordon Burditt)
To: email@example.com
Subject: Re: Is it time for a new charset in the Digest?
Message-ID: <X8GdnexBqe37E4PJnZ2dnUVZ_vadnZ2d@posted.internetamerica>

>>> I'm wondering if it's time for another change, either to one of the
>>> "transitional" Unicode formats, such as UTF-8, or perhaps to a
>>> permanent solution such as UCS-16.
>>
>> There is no UCS-16. There are UCS-2 or UTF-16. The "TF" in "UTF"
>> stands for "Transformation Format", not "Transitional Format".
>
> Thanks for the correction: I had not known that. May I infer that
> UTF-8 is a "permanent" format that is here to stay?

UTF-8 is as permanent as any other format -- hopefully more so. If
someone seriously proposed UTF-666 to support all intergalactic
languages when we're accepted into the Intergalactic Federation of
Planets, I'd expect it would die quickly because of the massive waste
of bits per character. Also, a character would not be a whole number
of bytes.

> (BTW, what is being "transformed"?)

Character sets, especially Unicode, are defined in terms of "code
points". For 8-bit character sets and smaller, the code point and the
representation are the same. A possible exception is Baudot, with its
shift characters, where a character is sent as 5 bits with the shift
state implicitly included as a 6th bit. There is, in this case, a
6-bit code point for each Baudot character, whether it was defined
that way or not, and whether or not the term "code point" had even
been invented when Baudot was in wide use.

In Unicode, there are at least 5 different representations
(transformations): UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE.
These are obtained by re-arranging bits and maybe adding more. You
may argue about UTF-7, and about whether UTF-16 is equivalent to one
of UTF-16BE or UTF-16LE or is a representation by itself (the rules
for BOMs are different).
UTF-32LE and UTF-32BE: you take the code point, turn it into a 32-bit
number, break it up into 4 bytes in the appropriate order, and ship
it. There are 22 possible byte orders that Unicode doesn't support.
That was easy.

UTF-16BE and UTF-16LE: you take the code point, and if it fits in 16
bits, you break it into 2 bytes in the appropriate order, and ship
it. Otherwise, the code point is turned into a High Surrogate
(D800-DBFF) followed by a Low Surrogate (DC00-DFFF). Take the code
point value and subtract 0x10000. (If you get a negative number here,
the code point fits in 16 bits and you weren't supposed to get here.)
Treat the result as a 20-bit number (if it doesn't fit in 20 bits --
codes >= 0x110000 -- it is invalid), and split it into a 10-bit high
half and a 10-bit low half. Add 0xD800 to the high half: that gives
the high surrogate. Add 0xDC00 to the low half: that gives the low
surrogate. Break these into 4 bytes in the appropriate order, and
ship them. The code points D800-DFFF are reserved (not only in
UTF-16, but in all of Unicode) to avoid ambiguities between real
characters and surrogates.

UTF-8: you take the code point, and if it fits in 7 bits, you ship it
as-is (with a 0 high bit). Otherwise, a character consists of one
leader byte (binary pattern 11XXXXXX) followed by one or more
following bytes (binary pattern 10XXXXXX):

00XXXXXX                              Single-byte ASCII character
01XXXXXX                              Single-byte ASCII character
110XXXXX 10XXXXXX                     2-byte sequence (0x0080 - 0x07FF)
1110XXXX 10XXXXXX 10XXXXXX            3-byte sequence (0x0800 - 0xFFFF)
11110XXX 10XXXXXX 10XXXXXX 10XXXXXX   4-byte sequence (0x10000 - 0x10FFFF)

Unicode only goes up to 0x10FFFF (21 bits), mostly because UTF-16
won't go any higher. The scheme could be extended to use up to a
7-byte sequence, giving a 36-bit code point; if you're willing to use
0xFF as a leader byte, you can go up to 41-bit code points. Following
bytes carry 6 data bits each. Leading bytes carry 3-5, or, if you
extend it, 0-5.
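The UTF-8 and UTF-16 recipes above can be condensed into a short
sketch. This is an illustrative Python rendering of the arithmetic
described, not production code -- Python's built-in codecs do this
for real:

```python
def to_utf8(cp):
    """Encode a Unicode code point into UTF-8 bytes, per the table above.

    Sketch only: does not reject surrogates (D800-DFFF) or overlong input.
    """
    if cp < 0x80:                        # fits in 7 bits: ship as-is
        return bytes([cp])
    if cp < 0x800:                       # 110XXXXX 10XXXXXX
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                     # 1110XXXX 10XXXXXX 10XXXXXX
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def to_utf16be(cp):
    """Encode a code point into UTF-16BE bytes, using surrogates if needed."""
    if cp < 0x10000:                     # fits in 16 bits: two bytes, done
        return bytes([cp >> 8, cp & 0xFF])
    v = cp - 0x10000                     # 20-bit value, split into halves
    hi = 0xD800 + (v >> 10)              # high surrogate
    lo = 0xDC00 + (v & 0x3FF)            # low surrogate
    return bytes([hi >> 8, hi & 0xFF, lo >> 8, lo & 0xFF])
```

For example, to_utf8(0xE9) yields the two bytes C3 A9 discussed
elsewhere in this thread, and to_utf16be(0x1F600) yields the
surrogate pair D8 3D DE 00.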
UTF-7: this was an attempt to combine Unicode and 7-bit-safe email in
a more space-efficient way than base64. I think it's pretty much been
given up on.

A Byte Order Mark is the code point U+FEFF translated into whatever
format is being used. The byte-reversed value, U+FFFE, is not used as
a character, which makes backwards byte order detectable. There is a
UTF-8 byte order mark, EF BB BF, but it really serves no purpose
except to screw up PHP pages. (You're supposed to put certain code in
PHP before outputting any text, and that invisible BOM before the
<?php counts as text to be output, which causes errors.)

>> Another thing that uses the term "transitional" is HTML, which
>> is not related to character sets.
>
> I had not known that either. What part of HTML is "transitional"?

HTML defines a number of DOCTYPE lines (and corresponding DTDs) to
put at the front of your HTML to indicate how it is to be parsed. Two
of these are called "HTML 4.01 Strict" and "HTML 4.01 Transitional".
The transitional one allows some older constructs that are in the
process of being phased out; the strict version doesn't. This may
affect how the browser renders the page.

> Good point, and that explains the "blank space" in UTF-8 which other
> character sets use for "High ASCII" characters.

There aren't any high-bit bytes that are part of a single-byte
character sequence, if that's what you mean.

> This is sometimes non-intuitive, however, as for example with the
> acute-accent "E":
>
> In ISO-8859-1, the acute-accent "e" is a single byte with value
> 0xE9. In UTF-8, it is a two-byte sequence listed at
>
> http://www.utf8-chartable.de/ as
>
> Code point  char  Hex values  Description
> U+00E9      é     c3 a9       "LATIN SMALL LETTER E WITH ACUTE"
>
> ... which confuses me somewhat, since the "code point" is the same
> value as ISO-8859-1, but the actual byte sequence is very
> different. IIRC, the "C3" is an "escape" value that says "go to the
> two-byte table", but I may need instruction.
UTF-8 and ISO-8859-1 have the same code points, but different
representations, for all of the characters in the "upper half" of
ISO-8859-1. The "lower half" corresponds to ASCII, where the code
point and the representation are the same. The rest of UTF-8 has no
equivalents in ISO-8859-1 (and uses multiple bytes per character).

There isn't one 2-byte table; there are 32 of them, corresponding to
the 32 leader bytes for 2-byte sequences, C0 through DF. (And you
really aren't supposed to use C0 and C1.) You may consider the leader
an "escape", which is one valid way of looking at it, but it contains
the 5 high-order bits of the value, and the following byte contains
the 6 low-order bits, so the two-byte pattern 110XXXXX 10XXXXXX can
cover values up to 0x7FF (11 bits, corresponding to the X's).

Point of further confusion: EVERY code point within the Unicode range
(0 - 0x10FFFF) has a 4-byte-long UTF-8 encoding. Every code point in
the range 0 - 0xFFFF has a 3-byte-long UTF-8 encoding. Every code
point in the range 0 - 0x7FF has a 2-byte-long UTF-8 encoding. That's
4 different ways of encoding an 'X'! These "overlong" encodings are
supposed to be treated as errors. Some people have used the byte
sequence C0 80 as a way of sneaking an ASCII NUL into a C
NUL-terminated string without its acting as a terminator.

> OK, you've convinced me: I didn't know that there was such a thing
> as a "Byte Order Mark", and having to add it to incoming posts which
> are not in UCS-2 would be a PITA. So, I'll stay away from UCS-2 and
> UTF-16.

You have to do more than that: if it's pure ASCII you are inserting
into an article to be sent as UTF-16, you need to add an ASCII NUL
between each pair of characters in the text. Running it through
"iconv" makes this easier than it sounds.

>> One problem that often arises from using multiple charsets in a
>> newsgroup or mailing list is that quoted text with charset A included
>> in a post with charset B often results in a mess on the screens of
>> readers.
>> Using UTF-8 won't solve this, but it will reduce it. It's even
>> worse when characters in charset A used in the quoted post have no
>> equivalent in charset B (possible with, for example, ISO-8859-1
>> vs. ISO-8859-5). At least if charset B includes all the characters,
>> translation is possible. Unless you try putting your foot down and
>> claiming that all submissions must be in UTF-8, you'll probably
>> still have to translate parts of some submissions.
>
> AHA! The crux of the issue!
>
> I am compelled to translate "mystery meat" characters several times
> each week, and they always come in emails which have NO "charset"
> specified.

I have a program that tries to identify the charset of a file,
largely by rejecting impossible sequences, assuming it's a text file.
It still usually comes up with several ranked possibilities. (The
UNIX "file" command might be more practical for this purpose. I think
a variant of it has been ported to Mac and Windows.) No, it can't
deal with hodgepodge mixtures.

It's easy to identify pure 7-bit ASCII. It's easy to identify
something that is NOT UTF-8, because the sequence of
leading/following bytes has to be right: one high-bit character from
ISO-8859-* or Windows-* with surrounding 7-bit ASCII characters is
enough to reject something as UTF-8. Something the program calls
UTF-8 with more than a dozen or two non-7-bit characters almost
certainly is UTF-8.

It's easy to identify Windows-* character sets if they use characters
in the range 0x80 - 0x9F. Which variant? Not so easy. Telling the
difference between ISO-8859-* variants is not easy either:
ISO-8859-16 has no unassigned characters in the range used by
ISO-8859-*, and the others have only a few. The same applies to the
Windows-12XX variants.

> Some email clients send all characters out as whatever-charset-the-
> user-is-using, which in most cases is "windows-12xx", but without any
> clue for other operating systems or email clients as to what kind of
> mystery meat is in the can.
Well, if they send the tag, you can translate it to UTF-8, but if
they don't, you may have no choice but to label it all "rat meat".

> Moreover, quoted material which the sender received as ISO-8859-1 is
> usually returned unmarked and unconverted, and is lumped in with the
> "default" character set of the email client being used, so that what
> went out as "résumé" comes back as "r©Âsumr©Â" or similar gibberish.
>
>> You should check out browser and mail reader support for various
>> charsets. I believe the only required charsets for browsers are:
>> ASCII, ISO-8859-1 ("Latin1"), Windows-1252 (a superset of
>> ISO-8859-1), and UTF-8.
>
> I will, and thanks again.

It sounds like there is room for a lot of improvement in mail
clients.

Gordon L. Burditt
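The "rejecting impossible sequences" test described above can be
sketched as a small validator. This is a hypothetical helper, not the
poster's actual program; it checks the leader/continuation structure
and flags overlong encodings such as C0 80:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is structurally valid UTF-8.

    Rejects stray continuation bytes, truncated sequences, overlong
    encodings (e.g. C0 80), surrogates, and values past 0x10FFFF.
    """
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                           # single-byte ASCII
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:                  # 2-byte leader
            n, lo, hi = 1, 0x80, 0x7FF
        elif 0xE0 <= b <= 0xEF:                # 3-byte leader
            n, lo, hi = 2, 0x800, 0xFFFF
        elif 0xF0 <= b <= 0xF4:                # 4-byte leader
            n, lo, hi = 3, 0x10000, 0x10FFFF
        else:                                  # 0x80-0xC1, 0xF5-0xFF: never legal
            return False
        if i + n >= len(data):                 # truncated sequence
            return False
        cp = b & (0x3F >> n)                   # data bits from the leader byte
        for j in range(1, n + 1):
            if data[i + j] & 0xC0 != 0x80:     # must be 10XXXXXX
                return False
            cp = (cp << 6) | (data[i + j] & 0x3F)
        if not (lo <= cp <= hi) or 0xD800 <= cp <= 0xDFFF:
            return False                       # overlong, out of range, or surrogate
        i += n + 1
    return True
```

A single ISO-8859-1 high-bit byte amid ASCII fails the continuation
check immediately, which is exactly the quick rejection described
above: looks_like_utf8("résumé".encode("latin-1")) is False, while
the UTF-8 encoding of the same word passes.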
Date: Sat, 20 Sep 2014 22:36:11 +0000 (UTC)
From: firstname.lastname@example.org (Garrett Wollman)
Subject: Re: Is it time for a new charset in the Digest?
In article <20140920161429.GA25811@telecom.csail.mit.edu>,
Telecom Digest Moderator <email@example.com> wrote:
>In ISO-8859-1, the acute-accent "e" is a single byte with value
>0xE9. In UTF-8, it is a two-byte sequence listed at
>Code point char Hex values Description
>U+00E9 é c3 a9 "LATIN SMALL LETTER E WITH ACUTE"
>... which confuses me somewhat, since the "code point" is the same
>value as ISO-8859-1, but the actual byte sequence is very
>different. IIRC, the "C3" is an "escape" value that says "go to the
>two-byte table", but I may need instruction.
This is by design. Quoting from the utf8(5) manual page:
The UTF-8 encoding represents UCS-4 characters as a sequence of octets,
using between 1 and 6 for each character. It is backwards compatible
with ASCII, so 0x00-0x7f refer to the ASCII character set. The multibyte
encoding of non-ASCII characters consist entirely of bytes whose high
order bit is set. The actual encoding is represented by the following
table:
[0x00000000 - 0x0000007f] [00000000.0bbbbbbb] -> 0bbbbbbb
[0x00000080 - 0x000007ff] [00000bbb.bbbbbbbb] -> 110bbbbb, 10bbbbbb
[0x00000800 - 0x0000ffff] [bbbbbbbb.bbbbbbbb] ->
1110bbbb, 10bbbbbb, 10bbbbbb
[0x00010000 - 0x001fffff] [00000000.000bbbbb.bbbbbbbb.bbbbbbbb] ->
11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
[0x00200000 - 0x03ffffff] [000000bb.bbbbbbbb.bbbbbbbb.bbbbbbbb] ->
111110bb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
[0x04000000 - 0x7fffffff] [0bbbbbbb.bbbbbbbb.bbbbbbbb.bbbbbbbb] ->
1111110b, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
If more than a single representation of a value exists (for example,
0x00; 0xC0 0x80; 0xE0 0x80 0x80) the shortest representation is always
used. Longer ones are detected as an error as they pose a potential
security risk, and destroy the 1:1 character:octet sequence mapping.
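As a quick check of the second row of that table, the "résumé"
example from earlier in this thread fits the two-byte pattern
exactly (a sketch; Python's built-in codec implements this mapping):

```python
# U+00E9 falls in [0x00000080 - 0x000007ff] -> 110bbbbb, 10bbbbbb:
#   110 00011 = 0xC3  (top five bits of 0xE9)
#   10 101001 = 0xA9  (low six bits of 0xE9)
cp = 0x00E9
first = 0xC0 | (cp >> 6)        # 110bbbbb
second = 0x80 | (cp & 0x3F)     # 10bbbbbb
assert bytes([first, second]) == "é".encode("utf-8") == b"\xc3\xa9"
```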
>However, AFAIK, there is no "default" for Usenet.
Actually, there is, and it's UTF-8. This was standardized in the most
recent update of the article format RFC (the number of which I've
forgotten). But many clients have not yet been updated to meet the
new RFC (because the most functional clients are often old
abandonware that can't do character set conversion at all).
Date: Sat, 20 Sep 2014 05:37:44 -0700 (PDT)
From: Neal McLain <firstname.lastname@example.org>
To: email@example.com
Subject: Opponents of Internet Regulation Carry the Day
Message-ID: <firstname.lastname@example.org>

By Phil Kerpen, Townhall Daily, Sep 19, 2014

An incredible thing happened in the recent reply-comment period
regarding the Federal Communications Commission (FCC) proposal to
regulate the Internet like old-fashioned monopoly telephone service:
the side telling the agency not to regulate carried the day.

The radical left, demanding federal regulatory control of the
building blocks of the Internet, brought all the usual hype and
hoopla and had free-spending corporate backers in Google and Netflix,
who want regulators to force you to pay the costs of their downstream
bandwidth, so they won't have to.

This campaign by liberal special interests like MoveOn and the Sierra
Club converted forty thousand websites into campaign advertisements
urging visitors to support Internet regulation. The websites
participated in a stunt called "Internet Slowdown Day." These sites
lied to visitors, claiming that without unprecedented new government
regulation, broadband providers would start slowing down and
degrading service.

Of course, such a thing has never happened, even without politicians
in charge of the Internet. If a broadband provider ever tried such a
stupid move, they'd lose customers in droves, and the board of
directors would fire the CEO. The very fact these sites had to fake a
slowdown should serve as proof that liberals are engaging in pure
fantasy.

Continued:
http://townhall.com/columnists/philkerpen/2014/09/19/opponents-of-internet-regulation-carry-the-day-n1894125/page/full
-or-
http://tinyurl.com/n1894125

Neal McLain
Date: 21 Sep 2014 14:31:42 -0000
From: "John Levine" <email@example.com>
To: firstname.lastname@example.org
Subject: Re: Is it time for a new charset in the Digest?
Message-ID: <email@example.com>

>> There is no UCS-16. There are UCS-2 or UTF-16. The "TF" in "UTF"
>> stands for "Transformation Format", not "Transitional Format".
>
> Thanks for the correction: I had not known that. May I infer that
> UTF-8 is a "permanent" format that is here to stay?
>
> (BTW, what is being "transformed"?)

Yes, you can be confident that UTF-8 is not going away. Everything in
the IETF that needs characters beyond ASCII uses UTF-8.

Unicode is a 32-bit character set. UTF-8 is a very clever way of
encoding Unicode characters into variable-length groups of 8-bit
bytes. The 0-127 ASCII-compatible range is represented as itself, and
the longer encodings are self-synchronizing.

>> ... It has the advantage that no byte sequence for any
>> character is a subset of the byte sequence for any other character,
>> so a pattern-search designed for ASCII still works. Actually, a lot
>> of things "just work" with UTF-8 for programs expecting ASCII. That
>> won't happen for UTF-16.

Quite right.

> Code point  char  Hex values  Description
> U+00E9      é     c3 a9       "LATIN SMALL LETTER E WITH ACUTE"
>
> ... which confuses me somewhat, since the "code point" is the same
> value as ISO-8859-1, but the actual byte sequence is very
> different. IIRC, the "C3" is an "escape" value that says "go to the
> two-byte table", but I may need instruction.

Not quite. For Unicode values between 0x080 and 0x7FF, if the bits in
the character are ABCDEFGHIJK, the UTF-8 bytes are 110ABCDE 10FGHIJK.
The value in the high bits of the first byte tells you how many more
bytes follow. The Wikipedia article explains this well.

> OK, you've convinced me: I didn't know that there was such a thing
> as a "Byte Order Mark", and having to add it to incoming posts which
> are not in UCS-2 would be a PITA. So, I'll stay away from UCS-2 and
> UTF-16.
For this application, BOMs aren't important.

R's,
John
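The 110ABCDE 10FGHIJK description above runs just as easily in
reverse. A minimal sketch (a hypothetical helper, handling only the
two-byte case described):

```python
def decode_two_byte(b1, b2):
    """Decode a UTF-8 two-byte sequence 110ABCDE 10FGHIJK to a code point."""
    assert b1 & 0xE0 == 0xC0 and b2 & 0xC0 == 0x80, "not a 2-byte sequence"
    return ((b1 & 0x1F) << 6) | (b2 & 0x3F)   # ABCDE ++ FGHIJK

# 0xC3 0xA9 -> U+00E9, LATIN SMALL LETTER E WITH ACUTE
assert decode_two_byte(0xC3, 0xA9) == 0xE9
```

So C3 is not an "escape" into a separate table: its low five bits are
the high-order data bits of the code point itself.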
Date: 21 Sep 2014 14:33:51 -0000
From: "John Levine" <firstname.lastname@example.org>
To: email@example.com
Subject: Re: Is it time for a new charset in the Digest?
Message-ID: <firstname.lastname@example.org>

> 1. If I change to UTF-8, then Digest contributors lose some "native
>    mode" flexibility in using common phrases which have accented
>    characters. Since ISO-8859-1 is "more or less" the Windows
>    character set, I think that makes it easier to copy-and-paste
>    quotes with accented characters from online sites.

Only if they're using really out-of-date software. I use the antique
trn newsreader on FreeBSD, and am typing this in UTF-8. Your résumé
is safe.

> And, if you're thinking that I know all the answers, I'll just shrug
> and write "Après moi, le déluge".

Mais oui.

Jean
Date: Sun, 21 Sep 2014 01:53:18 -0400
From: Monty Solomon <email@example.com>
To: firstname.lastname@example.org
Subject: Vermont reminds drivers of Oct. 1 cellphone ban
Message-ID: <321130FD-FAF5-493C-BE9B-A3AFBF4385B3@roscom.com>

MONTPELIER, Vt. - Drivers who use hand-held cellphones and other
electronic devices while driving after a new law takes effect Oct. 1
should be prepared to be stopped and fined, according to the head of
the Vermont State Police Traffic Safety Division.

There will be no grace period after the law takes effect, Lt. Garry
Scott said, but efforts are being made ahead of the new law to remind
motorists of the ban so they will put away their phones and other
devices.

http://www.bostonglobe.com/metro/2014/09/20/vermont-reminds-drivers-oct-cellphone-ban/LePqljKgMxiIsHQ4vf9oaP/story.html
-or-
http://goo.gl/0aiO6I
Date: Sun, 21 Sep 2014 00:58:21 -0400
From: Monty Solomon <email@example.com>
To: firstname.lastname@example.org
Subject: Android L will have device encryption on by default
Message-ID: <7FA15125-9864-4F3B-9D57-DB586A29BB57@roscom.com>

Android L will have device encryption on by default
And Google says it doesn't have the keys to give to law enforcement.

by Ron Amadeo - Sept 18, 2014, 6:20pm EDT

The Washington Post is reporting that Google will finally step up
security efforts on Android and enable device encryption by default.
The Post quoted company spokeswoman Niki Christoff as saying, "As
part of our next Android release, encryption will be enabled by
default out of the box, so you won't even have to think about turning
it on." That "next Android release" should be Android L, which is
currently out as a developer preview and is expected to be released
before the end of the year.

http://arstechnica.com/gadgets/2014/09/android-l-will-have-device-encryption-on-by-default/
-or-
http://goo.gl/UnpdJO
TELECOM Digest is an electronic journal devoted mostly to
telecommunications topics. It is circulated anywhere there is email,
in addition to Usenet, where it appears as the moderated newsgroup
'comp.dcom.telecom'.
TELECOM Digest is a not-for-profit educational service offered to the Internet by Bill Horne.
The Telecom Digest is moderated by Bill Horne.
43 Deerfield Road
Sharon MA 02067-2301
bill at horne dot net
This Digest is the oldest continuing e-journal about
telecommunications on the Internet, having been founded in August,
1981 and published continuously since then. Our archives are
available for your review/research. We believe we are the oldest
e-zine/mailing list on the internet in any category!

URL information: http://telecom-digest.org

Copyright © 2014 E. William Horne. All rights reserved.
Finally, the Digest is funded by gifts from generous readers such as yourself. Thank you!
All opinions expressed herein are deemed to be those of the author. Any organizations listed are for identification purposes only and messages should not be considered any official expression by the organization.