Discussion:
A certain committee?
Jill Ramonsky
2003-10-20 12:07:33 UTC
Now THIS intrigues me.

Who were this "certain committee"? And why did they have so much control
over the Unicode Consortium that they could force the introduction of a
new character block that nobody had ever previously used? What was this
"abuse of UTF-8" of which you speak? Indeed, what /is/ an "abuse" of
UTF-8? What does the phrase even mean?

How can you possibly add a block of characters to Unicode and then say
"the UTC sincerely hopes that they never get used at all"? (Particularly
when there are still people around whose actual real characters are
still not being added).

If this "certain committee" had intended to (falsely) declare something
as UTF-8 and then embed something like:

<XXX>lang=en-uk<YYY>

where <XXX> and <YYY> are invalid UTF-8 byte-sequences, then so what?
That would simply mean that "a certain committee"'s code wouldn't then
interoperate with the rest of the world. Why is that any business of the
UC's?

Hell, if only the KLI had thought to implement the Klingon alphabet in
invalid UTF-8 sequences - then maybe the UC would have added Klingon
characters just to shut them up, saying things like "it's not really a
script", and "the UTC sincerely hopes that they never get used at all".
Could have saved an awful lot of time!

Jill
-----Original Message-----
Sent: Monday, October 20, 2003 12:43 PM
To: Jill Ramonsky
Subject: Re: Klingons and their allies - Beyond 17 planes

So, if I have understood this correctly (which is by no means certain),
these tag characters were added to Unicode in the vague hope that some
people might one day start using them, or on the off-chance that someone
might one day need them.

Not.

They were added in order to ward off an abuse of UTF-8 by a certain
committee that insisted it needed lightweight language tagging in
a certain computer protocol. The tags were never a "script". Everyone
on the UTC sincerely hopes, I believe, that they never get used at all.

For 99.9% of all use cases, ordinary markup is the Right Thing for
language tagging.

Doug Ewell
2003-10-21 06:29:27 UTC
Post by Jill Ramonsky
Who were this "certain committee"? And why did they have so much
control over the Unicode Consortium that they could force the
introduction of a new character block that nobody had ever previously
used? What was this "abuse of UTF-8" of which you speak? Indeed, what
is an "abuse" of UTF-8? What does the phrase even mean?

The so-called "Multi-Lingual String Format" was described in an
Internet-Draft, draft-ietf-acap-mlsf-01.txt, written by Chris Newman of
Innosoft in June 1997. It was an attempt to define a lightweight,
inline language tagging protocol for ACAP (Application Configuration
Access Protocol) using invalid UTF-8 sequences, such as <E0 E5 EE> for
"en".

The protocol was described as "another layer of encoding on top of
UTF-8," but since there was no signature mechanism or other way for
UTF-8 processors to tell this MLSF from normal (corrupted) UTF-8 text,
it was effectively a non-standard extension of UTF-8.
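
To make the point concrete, here is a minimal Python sketch (not taken from the draft) showing that the byte sequence quoted above for "en" is rejected by a strict UTF-8 decoder, and that a lenient decoder treats it exactly like line noise. The helper name `is_valid_utf8` and the sample text are made up for illustration; only the three bytes come from the description above.

```python
# The byte sequence quoted above as the MLSF encoding of "en".
MLSF_EN = bytes([0xE0, 0xE5, 0xEE])


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# An MLSF-style tag followed by ordinary ASCII text.
sample = MLSF_EN + "hello".encode("utf-8")

# A strict decoder refuses the whole string.
print(is_valid_utf8(sample))

# A lenient decoder substitutes U+FFFD, just as it would for corrupted input.
print(sample.decode("utf-8", errors="replace"))
```

0xE0 announces a three-byte sequence, but 0xE5 and 0xEE are themselves lead bytes rather than continuation bytes (0x80 to 0xBF), so a conforming decoder has no way to know the bytes were meant as a tag and not as damage.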

At the time this was proposed, UTF-8 was still new and not very widely
adopted, and there was apparently great concern within the UTC that this
non-standard extension would undermine the stability of the UTF-8 format
(just as the tacit approval of non-shortest UTF-8 sequences was
criticized as a security hole years later). Plane 14 tags were
introduced as an equally lightweight countermeasure to persuade the ACAP
people to abandon MLSF in favor of an official tagging mechanism that
used real (but out-of-the-way) Unicode characters and did not break the
rules of UTF-8.
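
For comparison, a rough sketch of the Plane 14 approach in Python: U+E0001 LANGUAGE TAG followed by tag characters, each of which is the corresponding ASCII character offset by 0xE0000. The function name `plane14_language_tag` and the sample text are hypothetical, and this is only an illustration of the character layout, not a recommendation to use the tags.

```python
LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG
TAG_OFFSET = 0xE0000         # tag characters mirror ASCII 0x20..0x7E at this offset


def plane14_language_tag(lang: str) -> str:
    """Build a Plane 14 language tag for an ASCII tag string (hypothetical helper)."""
    if not all(0x20 <= ord(c) <= 0x7E for c in lang):
        raise ValueError("language tag must be printable ASCII")
    return LANGUAGE_TAG + "".join(chr(TAG_OFFSET + ord(c)) for c in lang)


tagged = plane14_language_tag("en") + "hello"
encoded = tagged.encode("utf-8")

# Unlike the MLSF bytes, this round-trips through a strict UTF-8 decoder.
assert encoded.decode("utf-8") == tagged

# Each tag character costs four bytes in UTF-8.
print(encoded.hex(" "))
```

The tag is heavier on the wire than three raw bytes, but every UTF-8 processor can at least parse it, and one that does not understand tags can skip the characters without mistaking them for corruption.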
Post by Jill Ramonsky
How can you possibly add a block of characters to Unicode and then say
"the UTC sincerely hopes that they never get used at all"?
(Particularly when there are still people around whose actual real
characters are still not being added).

First, the comparison between adding this special-purpose tagging
mechanism and adding "actual real characters" that are part of some
writing system is disingenuous. Nobody ever made a choice between
encoding Tai Lue, Rejang, or Plane 14 tags.

Second, there are those of us (outside the UTC) who do feel that Plane
14 language tags have a valid use, since not all text that may benefit
from language tagging is necessarily in a marked-up format. But the
writing is on the wall, and "those of us" have given up our battle.
Post by Jill Ramonsky
If this "certain committee" had intended to (falsely) declare something
as UTF-8 and then embed something like:

<XXX>lang=en-uk<YYY>

where <XXX> and <YYY> are invalid UTF-8 byte-sequences, then so what?
That would simply mean that "a certain committee"'s code wouldn't then
interoperate with the rest of the world. Why is that any business of
the UC's?

Because they were publishing their mechanism as an Internet-Draft, which
would soon have graduated to being an RFC, and then other groups might
have picked it up. Again, if you think back to 1997, the most commonly
referenced definition of UTF-8 itself was an RFC.
Post by Jill Ramonsky
Hell, if only the KLI had thought to implement the Klingon alphabet in
invalid UTF-8 sequences - then maybe the UC would have added Klingon
characters just to shut them up, saying things like "it's not really a
script", and "the UTC sincerely hopes that they never get used at
all". Could have saved an awful lot of time!

With all respect, this completely misrepresents the intent and working
process of the UTC.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/