33 Years of the Digest ... founded August 21, 1981
Copyright © 2014 E. William Horne. All Rights Reserved.

The Telecom Digest for Sep 21, 2014
Volume 33 : Issue 164 : "text" Format

Messages in this Issue:
Opponents of Internet Regulation Carry the Day (Neal McLain)
Re: Is it time for a new charset in the Digest? (Telecom Digest Moderator)

If your actions inspire others to dream more, learn more, do more and become more, you are a leader.  - John Quincy Adams

See the bottom of this issue for subscription and archive details.

Date: Sat, 20 Sep 2014 05:37:44 -0700 (PDT)
From: Neal McLain <nmclain@annsgarden.com>
To: telecomdigestsubmissions.remove-this@and-this-too.telecom-digest.org.
Subject: Opponents of Internet Regulation Carry the Day
Message-ID: <74118b66-0f2e-427a-a60c-5457935decd7@googlegroups.com>

By Phil Kerpen, Townhall Daily, Sep 19, 2014

An incredible thing happened in the recent reply-comment period
regarding the Federal Communications Commission (FCC) proposal to
regulate the Internet like old-fashioned monopoly telephone service:
the side telling the agency not to regulate carried the day.

The radical left, demanding federal regulatory control of the building
blocks of Internet, brought all the usual hype and hoopla and had
free-spending corporate backers in Google and Netflix, who want
regulators to force you to pay the costs of their downstream
bandwidth, so they won't have to.

This campaign by liberal special interests like MoveOn and the Sierra
Club converted forty thousand websites into campaign advertisements
urging visitors to support Internet regulation. The websites
participated in a stunt called "Internet Slowdown Day."

These sites lied to visitors, claiming that without unprecedented new
government regulation, broadband providers would start slowing down
and degrading service. Of course, such a thing has never happened,
even without politicians in charge of the Internet. If a broadband
provider ever tried such a stupid move, they'd lose customers in
droves, and the board of directors would fire the CEO. The very fact
these sites had to fake a slowdown should serve as proof that liberals
are engaging in pure fantasy.

Continued:
http://townhall.com/columnists/philkerpen/2014/09/19/opponents-of-internet-regulation-carry-the-day-n1894125/page/full
-or-
http://tinyurl.com/n1894125

Neal McLain

Date: Sat, 20 Sep 2014 12:14:29 -0400
From: Telecom Digest Moderator <telecomdigestsubmissions@remove-this.telecom-digest.org>
To: telecomdigestsubmissions.remove-this@and-this-too.telecom-digest.org.
Subject: Re: Is it time for a new charset in the Digest?
Message-ID: <20140920161429.GA25811@telecom.csail.mit.edu>

On Fri, Sep 19, 2014 at 12:52:37AM -0500, Gordon Burditt wrote:
>Telecom Digest Moderator wrote:
>> I've been using the ISO-8859-1 "Latin1" character set in the Digest
>> for a few years now: we adopted it as the standard after a reader made
>> me aware that there are no accented characters in ASCII, so I figured
>> that I'd implement a way for him to spell his name properly, and also
>> be able to add "Internationalization" to my résumé.
>>
>> I'm wondering if it's time for another change, either to one of the
>> "transitional" Unicode formats, such as UTF-8, or perhaps to a
>> permanent solution such as UCS-16.
>
> There is no UCS-16. There are UCS-2 or UTF-16. The "TF" in "UTF"
> stands for "Transformation Format", not "Transitional Format".

Thanks for the correction: I had not known that. May I infer that
UTF-8 is a "permanent" format that is here to stay? (BTW, what is
being "transformed"?)

> Another thing that uses the term "transitional" is HTML, which
> is not related to character sets.

I had not known that either. What part of HTML is "transitional"?

> I recommend that you go to UTF-8, or stick with ISO-8859-1 (or
> Windows-1252, which is a superset of ISO-8859-1). I don't think the
> other choices are reasonable. Trying to go with ISO-8859-*, where
> 15 different charsets with lots of overlap are distinguished by
> charset tags is going to cause problems when someone using
> ISO-8859-X quotes someone using ISO-8859-Y, where X != Y, and
> characters outside the common subset are used.
>
> I like UTF-8. I hope it becomes permanent for things like the web
> and email.
> It has the advantage that no byte sequence for any
> character is a subset of the byte sequence for any other character,
> so a pattern-search designed for ASCII still works. Actually, a lot
> of things "just work" with UTF-8 for programs expecting ASCII. That
> won't happen for UTF-16.

Good point, and that explains the "blank space" in UTF-8 which other
character sets use for "High ASCII" characters. This is sometimes
non-intuitive, however, as for example with the acute-accent "e":

In ISO-8859-1, the acute-accent "e" is a single byte with value 0xE9.

In UTF-8, it is a two-byte sequence listed at
http://www.utf8-chartable.de/ as

Code point  char  Hex values  Description
U+00E9      é     c3 a9       "LATIN SMALL LETTER E WITH ACUTE"

... which confuses me somewhat, since the "code point" is the same
value as ISO-8859-1, but the actual byte sequence is very different.
IIRC, the "C3" is an "escape" value that says "go to the two-byte
table", but I may need instruction.

> I hope UTF-16 and UCS-2 die out. They encourage a halfway solution
> in which characters with codes that won't fit in 16 bits aren't
> supported. They also have the byte-order abomination. They do
> NOT solve the issue of variable-width characters. Even UCS-4 or
> UTF-32 does not do that, due to the existence of "combining
> characters". The byte order mark of UTF-16 is a problem for mail
> and news articles. Where do you put it? If it's before the headers,
> then most every mail and news server currently running will interpret
> it as part of the headers, mangling one of them, or worse, interpret
> it as a division between (no) headers and the body of the message,
> and ending up with a lot of rejected mail due to "missing" headers
> like From:, Subject: or Newsgroups: . If you put it at the start
> of the body, well, I can imagine the mess you end up with replies
> to articles with quoting, even if everyone is using UTF-16. No
> BOM. Multiple conflicting BOMs.
> BOMs in the middle of text where
> they aren't looked at.

OK, you've convinced me: I didn't know that there was such a thing as
a "Byte Order Mark", and having to add it to incoming posts which are
not in UCS-2 would be a PITA. So, I'll stay away from UCS-2 and
UTF-16.

> How often have you needed to translate something to be posted from
> whatever character set it was in to ISO-8859-1, and ended up with
> untranslatable characters? If the answer is "never", there's
> probably no pressing need to change. If your only concern is
> people's names, there may be no need to change, unless you get a
> lot of contributors with Japanese, Chinese, Korean, or Vietnamese
> names who still write in English. But if you are going to change,
> please choose UTF-8, not UTF-16.
>
> One problem that often arises from using multiple charsets in a
> newsgroup or mailing list is that quoted text with charset A included
> in a post with charset B often results in a mess on the screens of
> readers. Using UTF-8 won't solve this, but it will reduce it. It's
> even worse when characters in charset A used in the quoted post
> have no equivalent in charset B (possible with, for example,
> ISO-8859-1 vs. ISO-8859-5). At least if charset B includes all the
> characters, translation is possible. Unless you try putting your
> foot down and claiming that all submissions must be in UTF-8,
> you'll probably still have to translate parts of some submissions.

AHA! The crux of the issue! I am compelled to translate "mystery meat"
characters several times each week, and they always come in emails
which have NO "charset" specified. Some email clients send all
characters out as whatever-charset-the-user-is-using, which in most
cases is "windows-12xx", but without any clue for other operating
systems or email clients as to what kind of mystery meat is in the can.
Moreover, quoted material which the sender received as ISO-8859-1 is
usually returned unmarked and unconverted, and is lumped in with the
"default" character set of the email client being used, so that what
went out as "résumé" comes back as "r@?sum@&" or similar gibberish.

> You should check out browser and mail reader support for various
> charsets. I believe the only required charsets for browsers are:
> ASCII, ISO-8859-1 ("Latin1"), Windows-1252 (a superset of ISO-8859-1),
> and UTF-8.

I will, and thanks again.

> In a survey of character sets used on the web in August, 2014
> (http://w3techs.com/technologies/overview/character_encoding/all),
> these are some of the results (a web site may use more than one
> character set, so results may add to more than 100%, but not by much):
>
> #1  UTF-8         81.4%
> #2  ISO-8859-1     9.9%
> #3  Windows-1251   2.3%
> #4  GB2312         1.4%
> #5  Shift JIS      1.3%
> #6  Windows-1252   1.2%
> #7  GBK            0.4%
> ...
> #18 US-ASCII       0.1%
> ...
>     UTF-16         less than 0.1%

OK, that's pretty powerful evidence that UTF-8 has become the default,
at least on the web. However, AFAIK, there is no "default" for Usenet.

The biggest problem I have when trying to come up with a
one-size-fits-all solution to the charset dilemma is that so few email
clients bother to mark outgoing messages (either Usenet posts or
emails) with the character set that was used to create them, and that
means a lot of guesswork here at Digest Central whenever accented
characters are used.

Take a look at these stats, which are drawn from Digest posts received
between Aug 1 and Sep 20:

There were 271 posts, not including "service" messages from other
sites, status reports from the Majordomo robot, etc.

Of those 271, only 214 contained the "Content-Type: text/plain" header.
We also received 27 "multipart" MIME messages (1), which aren't
counted here, and "text/html" messages, which were discarded (2).

US-ASCII         48.13%
ISO-8859-1       30.37%
UTF-8             9.81%
ANSI_X3.4-1968    6.07%
Windows-1252      5.61%

There are two things to note:

* "Multipart" messages are converted to plain text before I see them,
  but they aren't counted in the figures above.

* I'm unable to verify if the Charset info is correct, i.e., if the
  email client which created each post actually used the character
  set the client reported.

So, I think I can use these percentages as a first approximation to
draw these conclusions:

1. The majority of posts are marked as "ASCII" when they are created.
2. "ISO-8859-1" is a clear second choice.
3. All others are distant third-place finishers.

However, I can't tell just from these numbers if the readers want to
use ASCII, ISO-8859-1, UTF-8, or something else. In other words, the
large percentage of messages which used "ISO-8859-1" might be a result
of readers setting their newsreaders or email clients to use that
standard. Still, the low percentage of "UTF-8" submissions gives me
pause.

Thanks again for your insight. I'm going to do more research.

Bill

1.) Multipart messages which have a "text/plain" component are
    stripped of other sections and sent to me as "text/plain" posts.
    They're not counted here because I didn't have time to go through
    the incoming emails and add up what character set each was using
    in the "text/plain" section. There were 27 multipart messages,
    less than 10% of the total.

2.) BTW, sorry if you sent a post with a "text/html" Content-type, but
    we don't have any way to convert HTML into plain text, and the
    Digest doesn't publish HTML posts. Every one I've ever looked at
    was spam.

-- 
Bill Horne
Moderator
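
The question raised in the message above, about why U+00E9 is the
single byte 0xE9 in ISO-8859-1 but the two bytes c3 a9 in UTF-8, can
be illustrated with a short Python sketch. It is only an illustration
under stated assumptions: the helper name utf8_two_byte is
hypothetical, and it handles only the two-byte range (U+0080 through
U+07FF). The lead byte is not an index into a separate table: it
carries the top five bits of the code point behind the marker bits
110, and the trailing byte carries the low six bits behind the marker
bits 10. The same sketch shows one way "résumé" can turn into
gibberish when UTF-8 bytes are read as ISO-8859-1, and that an ASCII
pattern search still works on UTF-8 bytes.

```python
import codecs


def utf8_two_byte(code_point: int) -> bytes:
    """Hand-encode a code point in U+0080..U+07FF as UTF-8 (hypothetical helper)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("this sketch only handles the two-byte range")
    lead = 0b11000000 | (code_point >> 6)            # 110xxxxx: top five bits
    trail = 0b10000000 | (code_point & 0b00111111)   # 10xxxxxx: low six bits
    return bytes([lead, trail])


e_acute = "\u00e9"                                   # LATIN SMALL LETTER E WITH ACUTE

latin1_bytes = e_acute.encode("iso-8859-1")          # one byte: the code point itself
utf8_bytes = e_acute.encode("utf-8")                 # two bytes: lead + trail
hand_encoded = utf8_two_byte(0xE9)                   # same two bytes, built by hand

# UTF-8 bytes misread as ISO-8859-1: each accented letter becomes two characters.
garbled = "r\u00e9sum\u00e9".encode("utf-8").decode("iso-8859-1")

# An ASCII byte pattern can only match real ASCII text in UTF-8 data,
# because every byte of a multi-byte sequence has its high bit set.
found_at = "r\u00e9sum\u00e9 attached".encode("utf-8").find(b"attached")

# The UTF-16 byte order mark discussed above, little-endian form.
bom_le = codecs.BOM_UTF16_LE

print(latin1_bytes.hex(), utf8_bytes.hex(), hand_encoded.hex())
print(garbled, found_at, bom_le.hex())
```

The hand-built bytes match what the standard codec produces, which is
why the chart lists the code point and the byte sequence separately:
the code point is the character's number, and the byte sequence is
how a particular encoding writes that number down.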

TELECOM Digest is an electronic journal devoted mostly to telecommunications topics. It is circulated anywhere there is email, in addition to Usenet, where it appears as the moderated newsgroup 'comp.dcom.telecom'.

TELECOM Digest is a not-for-profit educational service offered to the Internet by Bill Horne.

The Telecom Digest is moderated by Bill Horne.
Contact information: Bill Horne
Telecom Digest
43 Deerfield Road
Sharon MA 02067-2301
bill at horne dot net
Subscribe: telecom-request@telecom-digest.org?body=subscribe telecom
Unsubscribe: telecom-request@telecom-digest.org?body=unsubscribe telecom

This Digest is the oldest continuing e-journal about telecommunications on the Internet, having been founded in August, 1981 and published continuously since then. Our archives are available for your review/research. We believe we are the oldest e-zine/mailing list on the internet in any category!

URL information: http://telecom-digest.org

Copyright © 2014 E. William Horne. All rights reserved.

Finally, the Digest is funded by gifts from generous readers such as yourself. Thank you!

All opinions expressed herein are deemed to be those of the author. Any organizations listed are for identification purposes only and messages should not be considered any official expression by the organization.

End of The Telecom Digest (2 messages)
