Discussion:
UTF-16 Beyond U+10FFFF (was: Java char and Unicode 3.0+)
Jill Ramonsky
2003-10-16 14:35:02 UTC
Here's an alternative idea.

In UTF-16, as it's currently defined, codepoints in the range U+010000
to U+10FFFF are represented as some High Surrogate (HS) followed by some
Low Surrogate (LS). Also, as currently defined, any HS not followed by
an LS, or an LS not preceded by an HS, is illegal.

So, to create even higher codepoints still, all you have to do is use
some currently illegal sequences. For example:

HS + LS => 10 bits from HS plus 10 bits from LS (as now)
[This gives a range of 0x00000 to 0xFFFFF, to which we add 0x10000
giving an actual range of U+10000 to U+10FFFF]

HS + HS + LS => 10 bits from first HS plus 10 bits from second HS plus
10 bits from LS
[This gives a range of 0x00000000 to 0x3FFFFFFF, to which we can add
0x110000 giving an actual range of U+110000 to U+4010FFFF]

HS + HS + HS + LS => 10 bits from first HS plus 10 bits from second HS
plus 10 bits from third HS plus 10 bits from LS
[This gives a range of 0x0000000000 to 0xFFFFFFFFFF, to which we can add
0x40110000 giving an actual range of U+40110000 to U+1004010FFFF]
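
As a rough illustration of the scheme above, here is a minimal Python sketch of an encoder and decoder. It is not part of any standard. The function names `encode` and `decode` are made up for this example, and it assumes the 10-bit groups are stored most significant first, as in ordinary UTF-16 pairs. Each added HS raises the offset by 2^(10 * groups), which reproduces the 0x10000, 0x110000 and 0x40110000 offsets listed above.

```python
HS_BASE, LS_BASE = 0xD800, 0xDC00  # first high / first low surrogate code unit


def encode(cp):
    """Encode one code point as a list of 16-bit code units (hypothetical scheme)."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x10000:
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("surrogate code points cannot be encoded")
        return [cp]
    # groups = number of 10-bit groups (2 for an ordinary surrogate pair).
    offset, groups = 0x10000, 2
    while cp >= offset + (1 << (10 * groups)):
        offset += 1 << (10 * groups)
        groups += 1
    value = cp - offset
    # Assumption: most significant group first; all but the last go into HS units.
    units = [HS_BASE + ((value >> (10 * i)) & 0x3FF) for i in range(groups - 1, 0, -1)]
    units.append(LS_BASE + (value & 0x3FF))
    return units


def decode(units):
    """Decode a list of code units into code points; raise ValueError if malformed."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if 0xDC00 <= u <= 0xDFFF:
            raise ValueError(f"low surrogate without a high surrogate at index {i}")
        if not (0xD800 <= u <= 0xDBFF):
            out.append(u)
            i += 1
            continue
        value, groups = 0, 0
        # Collect the run of high surrogates, 10 bits each.
        while i < len(units) and 0xD800 <= units[i] <= 0xDBFF:
            value = (value << 10) | (units[i] - HS_BASE)
            groups += 1
            i += 1
        if i == len(units) or not (0xDC00 <= units[i] <= 0xDFFF):
            raise ValueError(f"high surrogate run not closed by a low surrogate at index {i}")
        value = (value << 10) | (units[i] - LS_BASE)
        groups += 1
        i += 1
        # Offset = 0x10000 plus the sizes of all shorter forms.
        offset = 0x10000 + sum(1 << (10 * g) for g in range(2, groups))
        out.append(offset + value)
    return out
```

For code points up to U+10FFFF this produces the same code units as ordinary UTF-16; anything longer than two units is, of course, illegal in UTF-16 as currently defined.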

This system can be extended indefinitely, and conflicts with current
UTF-16 only in that it gives meaning to currently illegal sequences.
Observe, however, that it is still always possible to distinguish an
"end" surrogate from a "start-or-middle" surrogate, and that if you
start parsing a sequence in the middle, it will always be possible to
step either backwards or forwards to determine the start or end of a
codepoint sequence.
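
To make the resynchronisation claim concrete, here is a small Python sketch. The name `sequence_bounds` is hypothetical, and the sketch assumes the input is well-formed under the proposed scheme, that is, every run of HS units is closed by exactly one LS. From any index it steps forwards to the terminating LS and backwards over the run of HS units.

```python
def is_high(u):
    return 0xD800 <= u <= 0xDBFF


def is_low(u):
    return 0xDC00 <= u <= 0xDFFF


def sequence_bounds(units, i):
    """Return (start, end), inclusive, of the sequence containing units[i].

    Assumes well-formed input: each HS run is followed by one LS.
    """
    if not (is_high(units[i]) or is_low(units[i])):
        return i, i                      # ordinary BMP code unit
    end = i
    while is_high(units[end]):           # step forwards to the "end" surrogate
        end += 1
    start = end
    while start > 0 and is_high(units[start - 1]):
        start -= 1                       # step backwards over "start-or-middle" units
    return start, end
```

For example, with `units = [0x41, 0xD800, 0xD801, 0xDC00, 0x42]`, any of the indices 1, 2 or 3 yields `(1, 3)`.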

Jill
-----Original Message-----
Sent: Thursday, October 16, 2003 2:33 PM
Subject: Re: Java char and Unicode 3.0+ (was:Canonical equivalence in
rendering: mandatory or recommended?)
I am also doubting, but I would not bet on it. After all, when Unicode
started, a single plane was considered waaaaaay more than sufficient too.

I not only would bet on it, I actually have a bet on it. Henry Thompson
of the W3C's Schema WG bet me that we'd outrun the existing planes within
five years; four left to go and no sign of it, even if Michael Everson
were to achieve pluripresence and actually get everything accepted into
the standard that he knows needs to be done.

Just in case it is needed, are you keeping an unassigned range in the BMP
(for example, in the range between the Hangul syllables and the existing
surrogates) so that an extension remains possible while preserving upward
compatibility with, or support for, UTF-16, which is currently the main
reason why only 17 planes are defined for now?

It's OK not to document it officially for now, but a prudent and
conservative policy of keeping such a range available in the BMP for the
future seems to be needed. Of course, if there's an evolution, this would
require a later update to the current UTF-8 and UTF-16 conformance rules.

I'm not asking to document it now, but to keep it in mind and not fill
the BMP so completely that UTF-16 would become impossible to upgrade to
a possible future scheme (such provisions already exist natively in UTF-8
and UTF-32, since its origin by X/Open and their initial documentation in
an RFC).

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html


Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
Philippe Verdy
2003-10-16 17:51:45 UTC
----- Original Message -----
From: "Jill Ramonsky" <***@Aculab.com>
To: <***@unicode.org>
Sent: Thursday, October 16, 2003 4:35 PM
Subject: UTF-16 Beyond U+10FFFF (was: Java char and Unicode 3.0+)
Post by Jill Ramonsky
Here's an alternative idea.
In UTF-16, as it's currently defined, codepoints in the range U+010000
to U+10FFFF are represented as some High Surrogate (HS) followed by some
Low Surrogate (LS). Also, as currently defined, any HS not followed by
an LS, or an LS not preceded by an HS, is illegal.
So, to create even higher codepoints still, all you have to do is use
some currently illegal sequences. For example:
HS + LS => 10 bits from HS plus 10 bits from LS (as now)
[This gives a range of 0x00000 to 0xFFFFF, to which we add 0x10000
giving an actual range of U+10000 to U+10FFFF]
HS + HS + LS => 10 bits from first HS plus 10 bits from second HS plus
10 bits from LS
[This gives a range of 0x00000000 to 0x3FFFFFFF, to which we can add
0x110000 giving an actual range of U+110000 to U+4010FFFF]
HS + HS + HS + LS => 10 bits from first HS plus 10 bits from second HS
plus 10 bits from third HS plus 10 bits from LS
[This gives a range of 0x0000000000 to 0xFFFFFFFFFF, to which we can add
0x40110000 giving an actual range of U+40110000 to U+1004010FFFF]
I don't like this idea: there's a performance penalty when parsing from
a random position that points to an HS code unit: one has to scan backward
to find the start of the sequence (this is effectively the case with UTF-8,
but not with UTF-16, where a single read indicates the position of the
first code unit in the encoding sequence).
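
The cost difference can be sketched in Python by counting how many code units must be examined to find the start of a sequence. Both function names below are invented for this illustration, and both assume well-formed input. In current UTF-16 the unit at the given index already settles the answer, whereas in the proposed scheme the backward scan grows with the length of the HS run.

```python
def start_standard(units, i):
    """Current UTF-16: reading units[i] alone tells where the sequence starts."""
    return i - 1 if 0xDC00 <= units[i] <= 0xDFFF else i


def start_extended(units, i):
    """Proposed scheme: return (start, reads), reads = code units examined."""
    reads = 1                                  # units[i] itself
    if not (0xD800 <= units[i] <= 0xDFFF):
        return i, reads                        # not a surrogate at all
    start = i
    while start > 0:
        reads += 1                             # look at the preceding unit
        if not (0xD800 <= units[start - 1] <= 0xDBFF):
            break                              # previous unit is not an HS
        start -= 1
    return start, reads
```

With `units = [0x41, 0xD800, 0xD800, 0xD800, 0xDC00]`, `start_extended(units, 4)` examines all five code units before it can report that the sequence starts at index 1.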

Frankly, I would prefer a solution based on "hyper-surrogates" allocated
outside the BMP (reserved, for example, in the special plane 14), with a
pair of existing UTF-16 surrogates encoding each hyper-surrogate.


