Discussion:
Canonical equivalence in rendering: mandatory or recommended?
Peter Kirk
2003-10-15 11:19:06 UTC
Permalink
I note the following text from section 5.13, p.127, of the Unicode
standard v.4:

> Canonical equivalence must be taken into account in rendering multiple
> accents, so that any two canonically equivalent sequences display as
> the same. This is particularly important when the canonical order is
> not the customary keyboarding order, which happens in Arabic with
> vowel signs, or in Hebrew with points. In those cases, a rendering
> system may be presented with either the typical typing order or the
> canonical order resulting from normalization, ...

> Rendering systems should handle any of the canonically equivalent
> orders of combining
> marks. This is not a performance issue: The amount of time necessary
> to reorder combining
> marks is insignificant compared to the time necessary to carry out
> other work required
> for rendering.


The word "must" is used here. But this is part of the "Implementation
Guidelines" chapter, which is generally not normative. Should this
sentence with "must" be considered mandatory, or just a recommendation,
albeit in certain cases a "particularly important" one?

The conformance chapter does state the following, p.82, which can be
understood as implying the same thing, and refers to section 5.13 in a
way which suggests that the "information" there is relevant to conformance:

> If combining characters have different combining classes... then no
> distinction of graphic form or semantic will result. This principle
> can be crucial for the correct appearance of combining characters. For
> more information, see “Canonical Equivalence” in /Section 5.13,
> Rendering Nonspacing Marks./


Does everyone agree that "This is not a performance issue"?

--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html


Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
Jill Ramonsky
2003-10-15 12:08:54 UTC
Permalink
> -----Original Message-----
> From: Peter Kirk [mailto:***@qaya.org]
> Sent: Wednesday, October 15, 2003 12:19 PM
> To: Unicode List
> Subject: Canonical equivalence in rendering: mandatory or recommended?
>
>
> Does everyone agree that "This is not a performance issue"?

In my experience, there /is/ a performance hit.

I had to write an API for my employer last year to handle some aspects
of Unicode. We normalised everything to NFD, not NFC (but that's easier,
not harder). Nonetheless, none of the string handling routines was
allowed to /assume/ that its input was in NFD, though each had to
guarantee that its output was. These routines therefore had to do a
"convert to NFD" on every input, even if the input was already in NFD.
This did incur a significant performance hit, since we were handling
(Unicode) strings throughout the app.

I think that next time I write a similar API, I will deal with
(string+bool) pairs instead of plain strings, with the bool meaning
"already normalised". This would definitely speed things up. Of course,
for any strings coming in from "outside", I'd still have to assume they
were not normalised, just in case.
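That (string+bool) idea can be sketched in Java as a small wrapper class; this is only an illustration, assuming the java.text.Normalizer API of later JDKs, and the class name NfdString is invented here:

```java
import java.text.Normalizer;

// Illustrative wrapper: a string guaranteed to be in NFD. The
// "already normalised" bool travels with the string implicitly,
// so the conversion runs only when the input actually needs it.
final class NfdString {
    private final String value;

    NfdString(String raw) {
        this.value = Normalizer.isNormalized(raw, Normalizer.Form.NFD)
                ? raw // already NFD: reuse as-is, no copy
                : Normalizer.normalize(raw, Normalizer.Form.NFD);
    }

    String value() { return value; }
}
```

Strings from "outside" pass through the constructor once; everything downstream can then trust the invariant instead of re-normalising on every call.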

Jill
Peter Kirk
2003-10-15 13:24:39 UTC
Permalink
On 15/10/2003 05:08, Jill Ramonsky wrote:

>
> > -----Original Message-----
> > From: Peter Kirk [mailto:***@qaya.org]
> > Sent: Wednesday, October 15, 2003 12:19 PM
> > To: Unicode List
> > Subject: Canonical equivalence in rendering: mandatory or recommended?
> >
> >
> > Does everyone agree that "This is not a performance issue"?
>
> In my experience, there /is/ a performance hit.
>
> ...
>
Thank you, Jill. Clearly there is a performance hit in this rather
general case or in an application in which string handling is dominant
and speed critical. My question was more specific to rendering processes
for complex scripts, where string handling is not already a major part
of the processing but matters of glyph selection and positioning are. My
instinct would be that in such circumstances the extra processing
required for normalisation is almost trivial, especially with
appropriate caching etc. But I have heard other opinions.

--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




Philippe Verdy
2003-10-15 13:44:44 UTC
Permalink
From: Jill Ramonsky

> I think that next time I write a similar API, I will deal with
> (string+bool) pairs, instead of plain strings, with the bool
> meaning "already normalised". This would definitely speed
> things up. Of course, for any strings coming in from
> "outside", I'd still have to assume they were not
> normalised, just in case.

I had the same experience, and I solved it by using an
additional byte in string objects that records the
current normalization form(s) as a bitfield, with a bit set
if the string is known to be in NFC, NFD, NFKC or NFKD form.

This bitfield is all zeroes for unknown (still unparsed)
strings, and a bit is set once the string has been
parsed: I don't always require a string to be in
any NF* form, so I perform normalization only when needed,
testing this bit first to see whether parsing is required.
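That byte of flags might look like the following in Java (a sketch only; the class and constant names are illustrative, not the actual code described above):

```java
// A per-string normalization-form bitfield. All bits clear means
// "not yet checked"; a set bit records a form the string is known
// to be in. A string can satisfy several forms at once, which is
// why a bitfield fits better than a single enum value.
class TaggedString {
    static final int NFC  = 1;
    static final int NFD  = 1 << 1;
    static final int NFKC = 1 << 2;
    static final int NFKD = 1 << 3;

    final String value;
    private int knownForms; // zero until the string has been parsed

    TaggedString(String value) { this.value = value; }

    boolean knownToBe(int form) { return (knownForms & form) != 0; }
    void markAs(int form)       { knownForms |= form; }
}
```

A normalization routine would test knownToBe(NFD) first and return the string unchanged when the bit is set, avoiding the allocations and copies mentioned above.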

This saves lots of unnecessary string allocations and copies
and greatly reduces the process's VM footprint. The same
optimization can be done in Java by subclassing the String
class to add a "form" field and related form-conversion (getter)
and test methods. To further optimize and reduce the
memory footprint of Java strings, I chose to store
the String in an array of bytes with UTF-8, instead of an
array of chars with UTF-16. The internal representation is
chosen dynamically, depending on usage of that string: if
the string is not accessed often by char index (which in
Java does not give actual Unicode codepoint indices, as
there may be surrogates), the UTF-8 representation uses less
memory in most cases.

It is possible, with a custom class loader, to override the default
String class used in the Java core libraries (note that compiled
Java .class files use UTF-8 for internally stored String constants,
as this allows independence from the architecture, and it is the
class loader that transforms the byte storage of String constants
into actual char storage, i.e. currently UTF-16 at runtime.)

Looking at the Java VM specification, there does not
seem to be anything implying that a Java "char" is necessarily a
16-bit entity. So I think that someday there will be a conforming
Java VM that returns UTF-32 codepoints in a single char, or
some derived representation using 24-bit storage units.

So there already are some changes of representation for Strings in
Java, and similar techniques could be used as well in C#, ECMAScript,
and so on... Handling UTF-16 surrogates would then be a thing of
the past, except when one uses the legacy String APIs that would
continue to emulate UTF-16 code unit indices. Depending on runtime
tuning parameters, the internal representation of String objects may
(should) become transparent to applications. One future goal
would be a full Unicode String API that returns real characters
as grapheme clusters of varying length, in a way that can be
compared, ordered, etc., to better match what users
consider a string's "length" (i.e. a number of grapheme clusters,
if not simply a combining sequence, if we exclude the more complex
cases of Hangul jamos and Brahmic clusters).
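The "length in grapheme clusters" idea can be sketched with the standard java.text.BreakIterator, whose character instance breaks at user-perceived character boundaries (modern JDKs align this with UAX #29):

```java
import java.text.BreakIterator;

// Count user-perceived characters (grapheme clusters) rather than
// UTF-16 code units: a base letter plus its combining marks counts
// as one.
class Graphemes {
    static int count(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int n = 0;
        while (it.next() != BreakIterator.DONE) n++;
        return n;
    }
}
```

For example, "e" followed by a combining acute accent has a String.length() of 2 but counts as a single grapheme cluster here.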

This leads to many discussions about what a "character" is... This
may be context specific (depending on the application's needs, the
system locale, or user preferences)... For XML, which recommends
(but does not mandate) the NFC form, it seems that the definition
of a character is mostly the combining sequence. It is very strange,
however, that this is a SHOULD and not a MUST, as it may produce
unpredictable results in XML applications depending on whether the
SHOULD is implemented in the XML parser or transformation
engine.



Markus Scherer
2003-10-15 17:32:34 UTC
Permalink
Philippe Verdy wrote:
> ... In fact, to further optimize and reduce the
> memory footprint of Java strings, I chose to store
> the String in a array of bytes with UTF-8, instead of an
> array of chars with UTF-16. The internal representation is

This does or does not save space and time depending on the average string contents and on what kind
of processing you do.

> chosen dynamically, depending on usage of that string: if
> the string is not accessed often with char indices (which in
> Java does not return actual Unicode codepoint indices as
> there may be surrogates) the UTF-8 representation uses less
> memory in most cases.
>
> It is possible, with a custom class loader to overide the default
> String class used in the Java core libraries (note that compiled
> Java .class files use UTF-8 for internally stored String constants,

No. It's close to UTF-8, but .class files use a proprietary encoding instead of UTF-8. See the
.class file documentation from Sun.

> as this allows independance with the architecture, and this is the
> class loader that transforms the bytes storage of String constants
> into actual chars storage, i.e. currently UTF-16 at runtime.)
>
> Looking at the Java VM machine specification, there does not
> seem to be something implying that a Java "char" is necessarily a
> 16-bit entity. So I think that there will be sometime a conforming
> Java VM that will return UTF-32 codepoints in a single char, or
> some derived representation using 24-bit storage units.

I don't know about the VM spec, but the language and its APIs have 16-bit chars wired deeply into
them. It would be possible to _add_ a new char32 type, but that is not planned, as far as I know.
_Changing_ char would break all sorts of code. However, as far as I have heard, a future Java
release may provide access to Unicode code points and use ints for them.

(And please do not confuse using a single integer for a code point with UTF-32 - UTF-32 is an
encoding form for _strings_ requiring a certain bit pattern. Integers containing code points are
just that, integers containing code points, not any UTF.)
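What Markus anticipates is in fact what later Java releases did: J2SE 5.0 added int-based code point accessors alongside the 16-bit char API. A small illustration of the distinction between code units and code points:

```java
// One supplementary code point (U+1D50A) occupies two UTF-16 code
// units in a Java String, but the int-based accessors report it as
// a single code point - an int holding a code point, not "UTF-32".
class CodePoints {
    public static void main(String[] args) {
        String s = "\uD835\uDD0A"; // U+1D50A as a surrogate pair
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(s.codePointAt(0) == 0x1D50A);     // true
    }
}
```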

> So there already are some changes of representation for Strings in
> Java, and similar technics could be used as well in C#, ECMAScript,
> and so on...

I am quite confident that existing languages like these will keep using 16-bit Unicode strings, for
the same reasons as for Java: Changing the string units would break all kinds of code.

Besides, most software with good Unicode support and non-trivial string handling uses 16-bit Unicode
strings, which avoids transformations where software components meet.

> ... Depending of runtime
> tuning parameters, the internal representation of String objects may
> (should) become transparent to applications. One future goal

The internal representation is already transparent in languages like Java. The API behavior has to
match the documentation, though, and cannot be changed on a whim.

> would be that a full Unicode String API will return real characters
> as grapheme clusters of varying length, in a way that can be
> comparable, orderable, etc... to better match what the users
> consider as a string "length" (i.e. a number of grapheme clusters,
> if not simply a combining sequence if we exclude the more complex
> case of Hangul Jamos and Brahmic clusters).

This is overkill for low-level string handling, and is available via library functions. Such library
functions might be part of a language's standard libraries, but won't replace low-level access
functions.

Best regards,
markus

--
Opinions expressed here may not reflect my company's positions unless otherwise noted.



Harald Hammarstrom
2003-10-17 12:06:14 UTC
Permalink
Hello
Sorry if this is a stupid question, but I tried all the regular ways of
finding out.
1. Where are the presentation forms for the special Pashto characters
of the Arabic alphabet? The independent forms are in Arabic 0600-06ff
but I couldn't find them in Arabic Presentation Forms A or B?
2. I can't find a presentation or independent form for the Kurdish
yaa with two dots below and a horizontal line above (but all other
Kurdish chars seem to be present!)

thanks in advance,

Harald





John Cowan
2003-10-17 13:39:32 UTC
Permalink
Harald Hammarstrom scripsit:

> 1. Where are the presentation forms for the special Pashto characters
> of the Arabic alphabet? The independent forms are in Arabic 0600-06ff
> but I couldn't find them in Arabic Presentation Forms A or B?

The presentation forms for the basic Arabic letters are encoded for
the sake of backward compatibility only. Other letters will not receive
such redundant encodings. The 0600 block does not encode independent
forms; rather, these codings (which are preferred) are for the abstract
letters independent of positional considerations.

--
You are a child of the universe no less John Cowan
than the trees and all other acyclic http://www.reutershealth.com
graphs; you have a right to be here. http://www.ccil.org/~cowan
--DeXiderata by Sean McGrath ***@reutershealth.com


John Cowan
2003-10-15 17:00:17 UTC
Permalink
Jill Ramonsky scripsit:

> I had to write an API for my employer last year to handle some aspects
> of Unicode. We normalised everything to NFD, not NFC (but that's easier,
> not harder). Nonetheless, all the string handling routines were not
> allowed to /assume/ that the input was in NFD, but they had to guarantee
> that the output was. These routines, therefore, had to do a "convert to
> NFD" on every input, even if the input were already in NFD. This did
> have a significant performance hit, since we were handling (Unicode)
> strings throughout the app.

Indeed it would. However, checking for normalization is cheaper than
normalizing, and Unicode makes properties available that allow a streamlined
but incomplete check that returns "not normalized" or "maybe normalized".
So input can be handled as follows:

if maybeNormalized(input)
then
    if normalized(input)
    then doTheWork(input)
    else doTheWork(normalize(input))
    fi
else doTheWork(normalize(input))
fi
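In Java terms the same check-then-normalize pattern can be sketched with java.text.Normalizer (a later-JDK API, used here purely as an illustration; internally its isNormalized performs the cheap check before doing any real work):

```java
import java.text.Normalizer;

// Normalize only the inputs that need it: the cheap check runs on
// every string, the expensive conversion only on the few that fail.
class EnsureNormalized {
    static String toNfc(String input) {
        return Normalizer.isNormalized(input, Normalizer.Form.NFC)
                ? input
                : Normalizer.normalize(input, Normalizer.Form.NFC);
    }
}
```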

The W3C recommends, however, that non-normalized input be rejected rather
than forcibly normalized, on the ground that the supplier of the input
is not meeting his contract.

> I think that next time I write a similar API, I will deal with
> (string+bool) pairs, instead of plain strings, with the bool meaning
> "already normalised". This would definitely speed things up. Of course,
> for any strings coming in from "outside", I'd still have to assume they
> were not normalised, just in case.

W3C refers to this concept as "certified text". It's a good idea.

> Jill
>

--
Verbogeny is one of the pleasurettes John Cowan <***@reutershealth.com>
of a creatific thinkerizer. http://www.reutershealth.com
-- Peter da Silva http://www.ccil.org/~cowan


Peter Kirk
2003-10-15 20:51:25 UTC
Permalink
On 15/10/2003 10:00, John Cowan wrote:

>...
>
>The W3C recommends, however, that non-normalized input be rejected rather
>than forcibly normalized, on the ground that the supplier of the input
>is not meeting his contract.
>
>
This has nothing at all to do with my canonical equivalence question, but
does touch on my other question today about normalisation in XML and
HTML. You told me then that normalisation is not mandatory and so
effectively only recommended. But if a reader is recommended to reject
non-normalised input, the effect is that normalisation is mandated
except for private communication between a group of cooperating
processes. So, while for example I may put a non-normalised text on my
website, it would be rather pointless because any browser following
recommendations would reject my page. Is that correct? Am I in fact
forced to work on the basis that normalisation is mandatory?

--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




John Cowan
2003-10-15 21:05:02 UTC
Permalink
Peter Kirk scripsit:

> You told me then that normalisation is not mandatory and so
> effectively only recommended. But if a reader is recommended to reject
> non-normalised input, the effect is that normalisation is mandated
> except for private communication between a group of cooperating
> processes.

The intention is that it's cheaper overall for the creator of the document
to normalize it once than for every receiver of the document to normalize
it, potentially many times over.

> So, while for example I may put a non-normalised text on my
> website, it would be rather pointless because any browser following
> recommendations would reject my page. Is that correct?

Yes, but I think it *very* unlikely that any general-use browser would
ever enforce that recommendation. In general, browsers are written to
accept as much as possible, at least in the HTML environment.
Even in a purely XML 1.1 world (which is unlikely to arrive for a number
of years!), I think that browsers would be built to perhaps warn the
user about non-normalized content, but by no means to reject it out of
hand.

The importance of normalization arises in machine-to-machine communication,
where the danger of being spoofed by non-normalized content that passes
unsubtly written filters is great. XML does not consider documents
equivalent merely because they are canonically equivalent; an element or
attribute name must be identical at the codepoint level to be correctly
recognized.

> Am I in fact
> forced to work on the basis that normalisation is mandatory?

No.

--
Not to perambulate John Cowan <***@reutershealth.com>
the corridors http://www.reutershealth.com
during the hours of repose http://www.ccil.org/~cowan
in the boots of ascension. --Sign in Austrian ski-resort hotel


Markus Scherer
2003-10-15 17:15:23 UTC
Permalink
Jill Ramonsky wrote:
> I had to write an API for my employer last year to handle some aspects
> of Unicode. We normalised everything to NFD, not NFC (but that's easier,
> not harder). Nonetheless, all the string handling routines were not
> allowed to assume that the input was in NFD, but they had to guarantee
> that the output was. These routines, therefore, had to do a "convert to
> NFD" on every input, even if the input were already in NFD. This did
> have a significant performance hit, since we were handling (Unicode)
> strings throughout the app.

Note that, in addition to "is normalized" flags, it is much faster to check whether a string is
normalized, and to normalize it only if it's not. This holds at least if there is a good chance that the
string is already normalized - as appears to be true in your application, and is usually true where most
other applications check for NFC on input. See UAX #15 for details. ICU has quick check and
normalization functions.

markus



Jill Ramonsky
2003-10-15 14:23:00 UTC
Permalink
> -----Original Message-----
> From: Philippe Verdy [mailto:***@wanadoo.fr]
>
> The same
> optimization can be done in Java by subclassing the String
> class to add a "form" field and related form conversion (getters)
> and tests methods.

Only slightly confused about this. The Java String class is declared
*final* in the API, and therefore cannot be subclassed. One would have
to write an alternative String class (not rocket science of course, but
still a tad more involved than subclassing).

> In fact, to further optimize and reduce the
> memory footprint of Java strings, in fact I choosed to store
> the String in a array of bytes

Okay. That explains that then.


> It is possible, with a custom class loader to overide the default
> String class used in the Java core libraries

Ouch. Never taken Java that far myself. I like the idea though. Is it
difficult?


> Looking at the Java VM machine specification, there does not
> seem to be something implying that a Java "char" is necessarily a
> 16-bit entity. So I think that there will be sometime a conforming
> Java VM that will return UTF-32 codepoints in a single char, or
> some derived representation using 24-bit storage units.

I've wondered about that ever since Unicode went to 21 bits. Actually of
course, it's C (and C++), not Java, which has the real problem. C is
(supposed to be) portable, but fast on all architectures, so all of the
built-in types have platform-dependent widths. (So far so good). The
annoying thing is that, BY DEFINITION, the *sizeof()* operator returns
the size of an object /measured in chars/. Therefore, it is a violation
of the rules of C to have an addressable object smaller than a char. One
/can/ have 32-bit chars, but /only/ if you disallow bytes and 16-bit
words. *sizeof()* is not allowed to return a fraction. Sigh! If only C
had seen fit to measure addressable locations in /bits/, or even
architecture-specific-/atoms/ (which would have been 8-bits wide on most
systems), then we could have had sizeof(char) returning 4 or something.
Ah well.



> This leads to many discussions about what is a "character"

I think we just had that discussion. If it happens again I'm probably
not going to join in (though it was quite amusing).

Jill
Marco Cimarosti
2003-10-15 14:43:37 UTC
Permalink
Jill Ramonsky wrote:
> In my experience, there is a performance hit.
>
> I had to write an API for my employer last year to handle
> some aspects of Unicode. We normalised everything to NFD,
> not NFC (but that's easier, not harder). Nonetheless, all
> the string handling routines were not allowed to assume
> that the input was in NFD, but they had to guarantee that
> the output was. These routines, therefore, had to do a
> "convert to NFD" on every input, even if the input were
> already in NFD. This did have a significant performance
> hit, since we were handling (Unicode) strings throughout
> the app.
>
> I think that next time I write a similar API, I will deal
> with (string+bool) pairs, instead of plain strings, with
> the bool meaning "already normalised". This would
> definitely speed things up. Of course, for any strings
> coming in from "outside", I'd still have to assume they
> were not normalised, just in case.

You could have split the NFD process into two separate steps:

1) Decomposition per se;

2) Reordering of combining classes.

You could have performed step 1 (which is presumably much heavier than 2)
only on strings coming from "outside", and step 2 at every passage.

In a further enhancement, step 2 could be called only upon operations which
could produce non-canonical order: e.g. when concatenating strings but not
when trimming them.

To gain even more speed, you could implement an ad-hoc version of step 2
which only operates on out-of-order characters adjacent to a specified
location in the string (e.g., the joining point of a concatenation
operation).
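Step 2 on its own can be sketched as a bubble sort keyed on the canonical combining class. The ccc table below is a tiny illustrative subset; a real implementation would use the full Unicode data (e.g. ICU's UCharacter.getCombiningClass):

```java
import java.util.Map;

// Canonical reordering only: swap adjacent combining marks whose
// combining classes are out of order. Starters (ccc == 0) are
// barriers and never move, exactly as the UAX #15 algorithm requires.
class CanonicalReorder {
    // Illustrative subset of the combining-class data.
    static final Map<Character, Integer> CCC = Map.of(
            '\u0301', 230,  // combining acute accent
            '\u0323', 220); // combining dot below

    static int ccc(char c) { return CCC.getOrDefault(c, 0); }

    static String reorder(String s) {
        char[] a = s.toCharArray();
        boolean swapped = true;
        while (swapped) {
            swapped = false;
            for (int i = 1; i < a.length; i++) {
                if (ccc(a[i]) != 0 && ccc(a[i - 1]) > ccc(a[i])) {
                    char t = a[i]; a[i] = a[i - 1]; a[i - 1] = t;
                    swapped = true;
                }
            }
        }
        return new String(a);
    }
}
```

The ad-hoc variant suggested above would run this loop only over the few marks around a concatenation point instead of the whole string.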

Just my 0.02 euros.

_ Marco


Philippe Verdy
2003-10-15 17:36:15 UTC
Permalink
From: "Nelson H. F. Beebe" <***@math.utah.edu>
> >> * char, whose values are 16-bit unsigned integers representing Unicode
> >> characters (section 2.1)

Maybe I have missed this line, but the Java bytecode instruction set does
make a native char not equivalent to a short, as there is an explicit
conversion instruction between chars and other integer types.

Those Java programs that assume that a char is 16 bits wide would have a
big problem because of the way String constants are serialized within
class files.

Of course it will be hard to include new primitive types in the JVM (because
it would require adding new instructions). And for the same reason, the Java
language "char" type is bound to the JVM native char type and cannot
be changed, meaning that String objects are bound to contain only "char"
elements (though this is exposed only through the String and StringBuffer
methods, as the backing array is normally not accessible).

I don't see where the issue is if char is changed from 16-bit to 32-bit
wide, given the existing bytecode instruction set and the String
API (for compatibility, the existing methods should use indices as if the
backing array were storing 16-bit entities, even if in fact it would now
store 32-bit chars.)

But an augmented set of methods could use directly the new indices in
the native 32-bit char. Additionally, there may be an option bit set in
compiled classes to specify the behavior of the String API: with this bit
set, the class loader would bind the String API methods to the new 32-bit
version of the new core library, and without this bit set (legacy compiled
classes), they would use the compatibility 16-bit API.

The javac compiler already sets version information for the target JVM, and
thus can be used to compile a class to use the new 32-bit API instead of the
legacy one: in this mode, for example, the String.length() method in the
source would be compiled to call the String.length32() method of the new
JVM, or remapped to String32.length(), with a replacement class name
(I use String32 here, but in fact it could be the UString class of ICU).

I am not convinced that Java must be bound to a fixed size for its "char"
type, as it already includes "byte", "short", "int", "long" for integer
types
with known sizes (respectively 8, 16, 32, 64 bits), and the JVM bytecode
instruction set clearly separates the native char type from the native
integer type, and does not allow arithmetic operations between chars and
integer types without an explicit conversion.

Note also that JNI is not altered by this change: when a JNI program uses
the .getStringUTF() method, it expects a UTF-8 string. When it uses
.getString(), it expects a UTF-16 string (not necessarily the native char
encoding seen in Java), and an augmented JNI interface could be defined
to use .getStringUTF32() if one wants maximum performance with no
conversion from the internal backing store used in the JVM.


For me the "caload" instruction, for example, just expects to return the
"char" (whatever its size) at a fixed and known index. There's no way to
break this "char" into its bit components without an explicit conversion
to a native integer type: the char is unbreakable, and the same is true for
String constants (that's why they can be reencoded internally into
UTF-8 in compiled classes, and why String constants can't be used to store
reliably any array of 16-bit integers: because of surrogates, a String
constant cannot include invalid surrogate pairs).

Apart from String constants, all other strings in Java can be built from
serialized data only through an encoding converter working on arrays of
native integers. It's up to the converter (not the JVM itself) to parse the
native integers in the array to build the internal String. The same is true
for StringBuffers, which could just as well use an internal backing store
with 32-bit chars.

So the real question when making this change from 16-bit to 32-bit is whether
and how it will affect performance of existing applications: for Strings,
this may be an issue if conversions are needed to get a UTF-16 view of a
String internally using a UTF-32 backing store. But in fact, an internal
(private) attribute "form" could store this indicator, so that construction
of Strings would not suffer from performance issues. In fact, if the JVM
could internally manage several alternate encoding forms for Strings to
reduce the footprint of a large application, and just provide a "char" view
through an emulation API to applications, this could benefit performance
(notably in server/networking applications using lots of strings, such as SQL
engines like Oracle that include a Java VM).

What would I see as a programmer in such an environment: the main
change would be in the character properties, where new Unicode block
identifiers would become accessible outside the BMP, and no "char" would
be identified as a low or high surrogate.

It would be quite simple to know whether the 32-bit-char environment is in
use:
final String test = "\uD800\uDC00"; // U+10000
This compiles a String constant into its UTF-8 serialization in the .class
file, i.e. these bytes in the init data section of that .class file:
const byte[] test$ = {0xF0,0x90,0x80,0x80};

Whether the environment is 16-bit-char or 32-bit-char, the class loader
parses the .class file as a UTF-8 sequence, returning the single codepoint
U+10000. It is then an internal decision of the JVM whether to store it with
a 16-bit or 32-bit backing store, i.e. as a single {(char)'\U00010000'}
or as two {(char)'\uD800', (char)'\uDC00'}.
Neither the existing String API nor already compiled classes targeted at
previous versions of the JVM would need to change.
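One caveat: .class files actually use a "modified UTF-8" in which supplementary characters are stored as two separately encoded surrogates (six bytes), so the four-byte sequence F0 90 80 80 is what standard UTF-8 encoders such as String.getBytes produce, not what javac emits. The pair's behavior on today's 16-bit-char JVM can be checked directly (the class name below is the editor's, for illustration):

```java
import java.nio.charset.StandardCharsets;

// On a 16-bit-char JVM, "\uD800\uDC00" is a surrogate pair: two char
// code units, but a single code point, U+10000.
public class SupplementaryDemo {
    // Standard (not modified) UTF-8 encoding of the string.
    public static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "\uD800\uDC00";
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        byte[] bytes = utf8(s);
        System.out.println(bytes.length);                    // 4: F0 90 80 80
        System.out.println(Integer.toHexString(bytes[0] & 0xFF)); // f0
    }
}
```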

For source code however, a compile-time flag may target a new VM and
automatically substitute java.lang.String by java.lang.UString, offering
an alias if needed with java.lang.String16 if one wants to force the
emulated (derived) 16-bit API with the new 32-bit String backing store.
In the compiled class for the new VM, there would only remain references
to java.lang.String and java.lang.UString, and the code would still run on
older VMs, provided that an installable package for those VMs provides
the UString class. If needed, class names referenced in .class files would
be resolved and accessible through reflection according to the compatibility
indicator detected by the class loader, so that even class names would
not need to be altered.

Known caveat with this change:
if the application compares the successor of '\uFFFF' with '\u0000',
the comparison will be equal in a 16-bit-char environment and unequal in the
newer 32-bit environment. That's why the target VM indicator inserted
by the compiler in the .class file needs to be used: it defines (alters)
the semantics of the char native type, so that a forced 16-bit
truncation is used for classes in the compatibility mode.

However, I still don't see any reason why char could not be 32-bit. This
is quite similar to the memory page mode of x86 processors, which allows
preserving the semantics of 16-bit programs in 32-bit environments (notably
for pointer sizes stored in the processor stack).



Asmus Freytag
2003-10-15 17:48:47 UTC
Permalink
I'm going to answer some of Peter's points, leaving aside the interesting
digressions into Java subclassing etc. that have developed later in the
discussion.

At 04:19 AM 10/15/03 -0700, Peter Kirk wrote:
>I note the following text from section 5.13, p.127, of the Unicode
>standard v.4:
>
>>Canonical equivalence must be taken into account in rendering multiple
>>accents, so that any two canonically equivalent sequences display as the same.

This statement goes to the core of Unicode. If it is followed, it
guarantees that normalizing a string does not change its appearance (and
therefore it remains the 'same' string as far as the user is concerned.)
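That guarantee is easy to check once a normalizer is available (here the `java.text.Normalizer` API that later shipped in Java 6; the helper name is the editor's):

```java
import java.text.Normalizer;

// Precomposed U+00E9 and the sequence e + U+0301 are canonically
// equivalent: after normalization they are the identical string, which is
// what lets a renderer display them as the 'same' text.
public class CanonEquiv {
    // Canonical equivalence test: compare the canonical decompositions.
    public static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    public static void main(String[] args) {
        System.out.println(canonicallyEquivalent("\u00E9", "e\u0301")); // true
        System.out.println("\u00E9".equals("e\u0301")); // false: raw code units differ
    }
}
```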

>The word "must" is used here. But this is part of the "Implementation
>Guidelines" chapter which is generally not normative. Should this sentence
>with "must" be considered mandatory, or just a recommendation although in
>certain cases a "particularly important" one?

If you read the conformance requirements you deduce that any normalized or
unnormalized form of a string must represent the same 'content' on
interchange. However, the designers of the standard wanted to make even
specialized uses, such as 'reveal character codes' explicitly conformant.
Therefore you are free to show to a user whether a string is precomposed or
composed of combining characters, e.g. by using a different font color for
each character code.

The guidelines are concerned with the average case: displaying the
characters as *text*.

[The use of the word 'must' in a guideline is always awkward, since that
word has such a strong meaning in the normative part of the standard.]

>>Rendering systems should handle any of the canonically equivalent orders
>>of combining
>>marks. This is not a performance issue: The amount of time necessary to
>>reorder combining
>>marks is insignificant compared to the time necessary to carry out other
>>work required
>>for rendering.

The interesting digressions on string libraries aside, the statement made
here is in the context of the tasks needed for rendering. If you take a
rendering library and add a normalization pass on the front of it, you'll
be hard-pressed to notice a difference in performance, especially for any
complex scripts.

So we conclude: "rendering any string as if it was normalized" is *not* a
performance issue.

However, from the other messages on this thread we conclude: normalizing
*every* string, *every time* it gets touched, *is* a performance issue.

A few things: Unicode provides data that allow performing a 'Normalization
Quick Check', which simply determines whether there is anything that might
be affected by normalization. (For example, nothing in this e-mail message
is affected by normalization, no matter which form, since it's all in
ASCII.)

With a quick check like that you should be able to reduce the cost of
normalization dramatically -- unless your data needs normalization
throughout. Even then, if there is a chance that the data is already
normalized, verifying that is faster than normalizing (since verification
doesn't re-order).
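The check-before-normalize pattern looks like this with the `java.text.Normalizer` API later added in Java 6 (the wrapper name is the editor's):

```java
import java.text.Normalizer;

// Checking whether a string is already normalized is much cheaper than
// unconditionally normalizing it; for pure-ASCII data the check is trivial.
public class QuickCheckDemo {
    // Skip the expensive normalization pass when the input is already NFC.
    public static String toNFC(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFC)
                ? s
                : Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(Normalizer.isNormalized("plain ASCII text",
                Normalizer.Form.NFC));        // true: nothing to do
        System.out.println(toNFC("e\u0301")); // precomposed U+00E9
    }
}
```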

Then, after that, as others have pointed out, if you can keep track of a
normalized state, either by recordkeeping or by having interfaces inside
which the data is guaranteed to be normalized, then you cut your costs further.
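One minimal way to do that recordkeeping (the class name and lazy-caching policy below are the editor's sketch, not a prescribed design):

```java
import java.text.Normalizer;

// A wrapper that normalizes lazily and remembers the result, so repeated
// consumers inside an interface boundary pay the cost at most once.
public class NormalizedText {
    private final String raw;
    private String nfc; // cached NFC form, computed on first use

    public NormalizedText(String raw) {
        this.raw = raw;
    }

    public String nfc() {
        if (nfc == null) {
            nfc = Normalizer.normalize(raw, Normalizer.Form.NFC);
        }
        return nfc;
    }

    public static void main(String[] args) {
        NormalizedText t = new NormalizedText("e\u0301");
        System.out.println(t.nfc().equals("\u00E9")); // true
        System.out.println(t.nfc() == t.nfc());       // true: same cached instance
    }
}
```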

A./




Peter Kirk
2003-10-15 20:44:20 UTC
Permalink
On 15/10/2003 10:48, Asmus Freytag wrote:

> I'm going to answer some of Peter's points, leaving aside the
> interesting digressions into Java subclassing etc. that have developed
> later in the discussion.

Thank you, Asmus. If people want to discuss normalisation and string
handling in Java, they are welcome to do so, but they should use a
different subject heading and not my (copyrighted :-) ) text.

>
> At 04:19 AM 10/15/03 -0700, Peter Kirk wrote:
>
>> I note the following text from section 5.13, p.127, of the Unicode
>> standard v.4:
>>
>>> Canonical equivalence must be taken into account in rendering
>>> multiple accents, so that any two canonically equivalent sequences
>>> display as the same.
>>
>
> This statement goes to the core of Unicode. If it is followed, it
> guarantees that normalizing a string does not change its appearance
> (and therefore it remains the 'same' string as far as the user is
> concerned.)
>
> ...
>
> The guidelines are concerned with the average case: displaying the
> characters as *text*.
>
> [The use of the word 'must' in a guideline is always awkward, since
> that word has such a strong meaning in the normative part of the
> standard.]

So, are you saying that for normal display of characters as text these
guidelines must be followed?

>
>>> Rendering systems should handle any of the canonically equivalent
>>> orders of combining
>>> marks. This is not a performance issue: The amount of time necessary
>>> to reorder combining
>>> marks is insignificant compared to the time necessary to carry out
>>> other work required
>>> for rendering.
>>
>
> The interesting digressions on string libraries aside, the statement
> made here is in the context of the tasks needed for rendering. If you
> take a rendering library and add a normalization pass on the front of
> it, you'll be hard-pressed to notice a difference in performance,
> especially for any complex scripts.
>
> So we conclude: "rendering any string as if it was normalized" is
> *not* a performance issue.

Thank you. This is the clarification I was looking for, and confirms my
own suspicions. But are there any other views on this? I have heard other
views from implementers of rendering systems, but I have wondered whether
this is because of their reluctance to do the extra work required to conform
to this requirement.

--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




Asmus Freytag
2003-10-15 21:27:34 UTC
Permalink
At 01:44 PM 10/15/03 -0700, Peter Kirk wrote:
>>The guidelines are concerned with the average case: displaying the
>>characters as *text*.
>>
>>[The use of the word 'must' in a guideline is always awkward, since that
>>word has such a strong meaning in the normative part of the standard.]
>
>So, are you saying that for normal display of characters as text these
>guidelines must be followed?


No guidelines 'must' ever be followed. But if you want to follow the letter
of the guidelines, you 'must' do this. ;-)

A better phrasing would have been: "These guidelines strongly recommend
that you always do this, unless you have very good reasons not to, but in
that case we hope that you are thinking of something else than ordinary
display of characters as text, as you will sorely violate the expectations
of your users and endanger the interoperability of Unicode encoded text
from different sources."

Would that have been better?

A./

PS: perhaps we should have a little icon for that.




Philippe Verdy
2003-10-15 20:50:25 UTC
Permalink
From: "Nelson H. F. Beebe" <***@math.utah.edu>
To: "Philippe Verdy" <***@wanadoo.fr>
Cc: <***@math.utah.edu>; "Jill Ramonsky" <***@Aculab.com>
Sent: Wednesday, October 15, 2003 5:34 PM
Subject: Re: Canonical equivalence in rendering: mandatory or recommended?

> [This is off the unicode list.]
>
> Philippe Verdy wrote on the unicode list on Wed, 15 Oct 2003 15:44:44
> +0200:
>
> >> ...
> >> Looking at the Java VM machine specification, there does not
> >> seem to be something implying that a Java "char" is necessarily a
> >> 16-bit entity. So I think that there will be sometime a conforming
> >> Java VM that will return UTF-32 codepoints in a single char, or
> >> some derived representation using 24-bit storage units.
> >> ...
>
> I disagree: look at p. 62 of
>
> @String{pub-AW = "Ad{\-d}i{\-s}on-Wes{\-l}ey"}
> @String{pub-AW:adr = "Reading, MA, USA"}
>
> @Book{Lindholm:1999:JVM,
> author = "Tim Lindholm and Frank Yellin",
> title = "The {Java} Virtual Machine Specification",
> publisher = pub-AW,
> address = pub-AW:adr,
> edition = "Second",
> pages = "xv + 473",
> year = "1999",
> ISBN = "0-201-43294-3",
> LCCN = "QA76.73.J38L56 1999",
> bibdate = "Tue May 11 07:30:11 1999",
> price = "US\$42.95",
> acknowledgement = ack-nhfb,
> }
>
> where it states:
>
> >> * char, whose values are 16-bit unsigned integers representing Unicode
> >> characters (section 2.1)

Personally, I read this reference from the second edition of the Java VM spec:
http://java.sun.com/docs/books/vmspec/2nd-edition/html/VMSpecTOC.doc.html

Notably:
http://java.sun.com/docs/books/vmspec/2nd-edition/html/Concepts.doc.htm
which states clearly that "char" is not an integer type, but a numeric type
without a constraint on the number of bits it contains (yes, there have
existed JVM implementations with 9-bit chars!)

It states:
[quote]
2.4.1 Primitive Types and Values

A primitive type is a type that is predefined by the Java programming
language and named by a reserved keyword. Primitive values do not share
state with other primitive values. A variable whose type is a primitive type
always holds a primitive value of that type.(*2)
[...]
The integral types are byte, short, int, and long, whose values are
8-bit, 16-bit, 32-bit, and 64-bit signed two's-complement integers,
respectively, and char, whose values are 16-bit unsigned integers
representing Unicode characters (section 2.1).
*2: Note that a local variable is not initialized on its creation and is
considered to hold a value only once it is assigned (section 2.5.1).
[/quote]

Then it defines the important rules for char conversions or promotions,
for arithmetic operations, assignments or method invocation:

[quote]
2.6.2 Widening Primitive Conversions [...]
* char to int, long, float, or double [...]
Widening conversions do not lose information about the sign or order of
magnitude of a numeric value. Conversions widening from an integral type to
another integral type do not lose any information at all; the numeric value
is preserved exactly. [...]
According to this rule, a widening conversion of a signed integer value to
an integral type simply sign-extends the two's-complement representation of
the integer value to fill the wider format. A widening conversion of a value
of type char to an integral type zero-extends the representation of the
character value to fill the wider format.
Despite the fact that loss of precision may occur, widening conversions
among primitive types never result in a runtime exception (section 2.16).
[/quote]

So a 'char' MUST have AT MOST the same number of bits as an 'int', i.e. it
cannot have more than 32 bits. If char were defined to have 32 bits, no
zero-extension would occur, but the above rule would still hold.

and:

[quote]
2.6.3 Narrowing Primitive Conversions [...]
* char to byte or short [...]
Narrowing conversions may lose information about the sign or order of
magnitude, or both, of a numeric value (for example, narrowing an int value
32763 to type byte produces the value -5). Narrowing conversions may also
lose precision.[...]
Despite the fact that overflow, underflow, or loss of precision may occur,
narrowing conversions among primitive types never result in a runtime
exception.
[/quote]

Yes, this paragraph says that char to short may lose information. This
currently does not occur with unsigned 16-bit chars, but it could happen
safely with 32-bit chars without violating the rule.

However, as the sign must be kept when converting char to int, this means
that the new 32-bit char would have to be signed. Yes, this means that there
would now exist negative values for chars, but none of them are currently
used with 16-bit chars, which are in the range [\u0000..\uFFFF] and promoted
as if they were integers in the range [0..65535]. That's why a narrowing
conversion occurs when converting to short (the sign may change, even if no
bits are lost)...
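These rules are easy to check on a current (16-bit-char) JVM. The hypothetical 32-bit char obviously cannot be demonstrated, but the same widening/narrowing behavior shows up with int values (class and method names are the editor's, for illustration):

```java
// Widening a char zero-extends; narrowing to short reinterprets the top
// bit as a sign bit.
public class ConversionDemo {
    public static int widen(char c) {
        return c; // implicit widening conversion, zero-extended
    }

    public static short narrow(char c) {
        return (short) c; // explicit narrowing, low 16 bits kept as signed
    }

    public static void main(String[] args) {
        System.out.println(widen('\uFFFF'));  // 65535: never negative
        System.out.println(narrow('\uFFFF')); // -1: sign bit reinterpreted
        // Narrowing int 0x10000 (U+10000) to short discards the high bits:
        System.out.println((short) 0x10000);  // 0
    }
}
```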

The VM also allows assignment conversion to use narrowing conversions to
chars without causing a compile-time error or an exception at run-time
(section 2.6.7).

Note that the 16-bit indication in section 2.1 relates to the fact that it
was referring to the support of Unicode 2.1, which did not allocate any
character outside the BMP and just reserved a range of codepoints for
surrogates. Until the UTF-16 specification was clarified by Unicode in a
later specification, this spec was valid, and it is still so now...

The Java language still lacks a way to specify a literal for a character
outside the BMP. Of course one can use the syntax '\uD800\uDC00', but this
would not compile with the current _compilers_, which expect only one char
in the literal. In a String literal, "\uD800\uDC00" becomes the 4-byte UTF-8
sequence for _one_ Unicode codepoint in the compiled class.
It is the class loader that decodes the UTF-8 sequence and builds the
String object containing an unspecified number of chars!

So on a 16-bit-char VM the string "\uD800\uDC00" would be of length 2,
containing separate surrogates;
on a 32-bit-char VM it would be of length 1, containing only one char
'\U10000;'. If you extract the first char to store it in a short in your
application, you'll be assigning 0 by a narrowing conversion (and this is in
accordance with the VM spec and the language, which stipulates explicitly
that you may lose information).

The good programmer practice is to extract 'char' values into 'int'
variables if one needs to perform arithmetic, i.e. using widening
conversions, which is the default for all arithmetic operations.
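With hindsight, this int-based practice is exactly the direction J2SE 5.0 later took: char stayed 16-bit, and int-valued code-point accessors were added to String and Character. A sketch of that supplementary-safe iteration style (the helper name is the editor's):

```java
// Iterating by code point (as ints) rather than by char handles
// supplementary characters correctly on a 16-bit-char JVM.
public class CodePointDemo {
    public static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);    // full code point, as an int
            i += Character.charCount(cp); // 2 for supplementary, else 1
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "\uD800\uDC00"; // U+10000 as a surrogate pair
        System.out.println(s.codePointAt(0) == 0x10000); // true
        System.out.println(countCodePoints(s));          // 1
    }
}
```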

I have reread the VM spec many times in the past and again today, looking
for whether there would be a violation if char were implemented as signed
and 32 bits wide instead of unsigned and 16 bits wide, and I see nothing
against it.

Also, the VM spec (as well as more recent specs such as the Java Debugging
Interface) does not specify how String instances encode their backing store
internally: there's no requirement that this private field even uses arrays
of chars, and a valid VM could as well recode them with SCSU, or with UTF-8
as in compiled class files. The private fields of String or StringBuffer
instances are not documented and have already changed in the past, so this
is not even a problem for debuggers: existing applications that use
reflection to change the visibility of private fields in undocumented
classes are not guaranteed to be portable across existing VMs.

In the language itself, portability is ensured by using a compile flag to
specify the behavior of classes using char arithmetic or conversions where
widening truncations or a positive sign were expected. This compile flag
could set a currently unused flag in the .class file format, so that classes
could be compiled to run with either the compatibility-mode old semantics
or the new one.

(I bet that most correctly written code does not assume 16-bit truncation of
native chars nor their positive sign, but if not sure, there could be a
compiler flag to avoid using the extended semantics on a new VM that now
has 32-bit chars.)

Notes:

1. Maybe there should be a standard to specify codepoints of larger sizes
in the language (not in the JVM). I used '\Uxxx;' with an explicit ending
semicolon instead of forcing all users to type in the full 8 hex digits for
each occurrence. But some other languages or conventions use '\U010000',
where the uppercase \U requires 6 hex digits to enter all Unicode characters
in the standard range [\U000000..\U10FFFF].

2. The initial spec of UTF-32 and UTF-8 by ISO allowed many more planes with
31-bit codepoints, and maybe there will be an agreement sometime in the
future between ISO and Unicode to define new codepoints outside the current
standard first 17 planes that can be safely converted with UTF-16, or a
mechanism will be specified to allow mapping more planes to UTF-16. But this
is currently not a priority, as long as there remains unallocated space in
the BMP to define new types and ranges of surrogates for "hyperplanes"
(something that is still possible near the Hangul block, just before the
existing low and high surrogates).

If someone from Sun can answer me on this list, I'd like to know their
opinion. Maybe I forgot to read a spec, but I think that Java should
support Unicode better than it does now with its Unicode 2.1 level. Full
support for Unicode 3.0+ in Java is a high priority for all those who
want to support Unicode (for now this support exists in... MS .NET, C#,
"JavaScript", VB...).

Will there be a third edition of the JVM spec that clarifies the allowed
size and semantics of char native types in the VM?



John Cowan
2003-10-15 21:18:05 UTC
Permalink
Philippe Verdy scripsit:

> [...] char, whose values are 16-bit unsigned integers
> representing Unicode characters (section 2.1).

Despite your ingenious special pleading, I don't see how this can mean
anything except that chars must be 16-bit unsigned integers.

> The Java language still lacks a way to specify a literal for a character out
> of the BMP. Of course one can use the syntax '\uD800\uDC00' but this would
> not compile with the current _compilers_, that expect only one char in the
> literal. In a String literal "\uD800\uDC00" becomes the 4-bytes UTF-8
> sequence for _one_ Unicode codepoint in the compiled class.

Character literals are crocky anyhow. IMHO modern programming languages
should not have a Character type, but deal only in Strings.

> 2. The initial spec of UTF-32 and UTF-8 by ISO allowed much more planes with
> 31-bit codepoints, and may be there will be an agreement sometime in the
> future between ISO and Unicode to define new codepoints out of the current
> standard 17 first planes that can be safely converted with UTF-16,

I doubt it very much. 17 planes is waaaay more than sufficient.

--
John Cowan ***@reutershealth.com www.reutershealth.com www.ccil.org/~cowan
Assent may be registered by a signature, a handshake, or a click of a computer
mouse transmitted across the invisible ether of the Internet. Formality
is not a requisite; any sign, symbol or action, or even willful inaction,
as long as it is unequivocally referable to the promise, may create a contract.
--_Specht v. Netscape_


John Cowan
2003-10-16 12:25:00 UTC
Permalink
Philippe Verdy scripsit:

> I am also doubting, but I would not bet on it. After all, when Unicode
> started, a single plane was considered waaaaaay more than sufficient too.

I not only would bet on it, I actually have a bet on it. Henry Thompson
of the W3C's Schema WG bet me that we'd outrun the existing planes within
five years; four left to go and no sign of it, even if Michael Everson
were to achieve pluripresence and actually get everything accepted into
the standard that he knows needs to be done.

> But the objectives of Unicode have changed, and now Unicode must cooperate
> with ISO 10646 which has its own objectives too.

There are only so many character standardizers on the planet, and the UTC
and WG2 are joined at the hip for the simple reason that they mostly
consist of the same people.

> Of course, Unicode's objectives are focused on text, but there is enough
> work in other media-related technologies to justify in the future
> encoding explicitly attributed characters, distinct from a sequence of
> format controls and characters, or just offering compatibility with
> standards other than ISO 10646.

Dream on.

> There may exist at some time a need to define new classes of encoded
> "characters" that would require mapping them in a complex way using a lot of
> new codepoints, just because the ISO committee will want to include support
> for the work of some of the many other ISO committees.

WG2 has indeed been merging the existing character standards of other ISO
committees, and a good many of the Unicode 4.0 characters actually
come from there. It's a trickle and will remain so.

> But just look at the rapid growth of encoded characters in Unicode: in just
> 4 years, 2 new planes have been nearly allocated or reserved for extension.

Not new, not unforeseen. Multiple planes has been formally part of
Unicode since 1996, and the need was established several years before that.

> Also, there tends to exist a lot of pressure everywhere to define new
> vendor-specific characters for specific usage, and the PUA ranges may
> finally accept a reservation mechanism by third parties so that they don't
> collide with each other, in a way similar to Internet IPv4 addressing space
> reservation.

Not going to happen.

> What would happen if ISO 10646 decided to stop its work, leaving it to
> IANA to contract with external registrars, just to comply with a rapid
> industry need to publish more media and still interoperate? There may exist
> some regulated areas in the new scheme on which Unicode would continue to
> work (the 17 planes), but other parts documented and implemented
> elsewhere over which Unicode would have no control, and where there may
> exist a compatibility scheme.

There's no shortage of integers.

--
There is / One art John Cowan <***@reutershealth.com>
No more / No less http://www.reutershealth.com
To do / All things http://www.ccil.org/~cowan
With art- / Lessness -- Piet Hein


Philippe Verdy
2003-10-16 13:33:24 UTC
Permalink
From: "John Cowan" <***@mercury.ccil.org>

> Philippe Verdy scripsit:
>
> > I am also doubting, but I would not bet on it. After all, when Unicode
> > started, a single plane was considered waaaaaay more than sufficient too.
>
> I not only would bet on it, I actually have a bet on it. Henry Thompson
> of the W3C's Schema WG bet me that we'd outrun the existing planes within
> five years; four left to go and no sign of it, even if Michael Everson
> were to achieve pluripresence and actually get everything accepted into
> the standard that he knows needs to be done.

Just in case it is needed, are you keeping an unassigned range in the BMP
so that extension remains possible, preserving upward compatibility and
support for UTF-16, which is currently the main reason why there are for
now 17 planes defined?
(For example, in the range between the Hangul syllables and the existing
surrogates.)

It's OK not to document it officially for now, but a prudent and
conservative policy of keeping such a range available in the BMP for the
future seems needed. Of course, if there is such an evolution, it would
require a later update to the current UTF-8 and UTF-16 conformance rules.

I'm not asking to document it now, but to keep it in mind and not
completely fill the BMP, which would make UTF-16 impossible to upgrade to
a possible future scheme (such provisions have existed natively in UTF-8
and UTF-32 since their origin with X/Open and their initial documentation
in an RFC).



John Cowan
2003-10-16 15:04:56 UTC
Permalink
Philippe Verdy scripsit:

> Just for the case it would be needed, are you keeping an unassigned range
> in the BMP so that extension will remain possible to preserve an ascending
> compatibility or support for UTF-16 which currently is the main reason why
> there are for now 17 planes defined ?

No.

--
John Cowan ***@reutershealth.com www.reutershealth.com www.ccil.org/~cowan
[R]eversing the apostolic precept to be all things to all men, I usually [before
Darwin] defended the tenability of the received doctrines, when I had to do
with the [evolution]ists; and stood up for the possibility of [evolution] among
the orthodox--thereby, no doubt, increasing an already current, but quite
undeserved, reputation for needless combativeness. --T. H. Huxley


Peter Kirk
2003-10-16 15:03:34 UTC
Permalink
On 16/10/2003 06:33, Philippe Verdy wrote:

>From: "John Cowan" <***@mercury.ccil.org>
>
>>Philippe Verdy scripsit:
>>
>>>I am also doubting, but I would not bet on it. After all, when Unicode
>>>started, a single plane was considered waaaaaay more than sufficient too.
>>
>>I not only would bet on it, I actually have a bet on it. Henry Thompson
>>of the W3C's Schema WG bet me that we'd outrun the existing planes within
>>five years; four left to go and no sign of it, even if Michael Everson
>>were to achieve pluripresence and actually get everything accepted into
>>the standard that he knows needs to be done.
>
>Just for the case it would be needed, are you keeping an unassigned range
>in the BMP so that extension will remain possible to preserve an ascending
>compatibility or support for UTF-16 which currently is the main reason why
>there are for now 17 planes defined ?
>(for example in the range between Hangul syllables and existing surrogates)
...

I would guess not. I can think of much more useful things to do with any
remaining space in the BMP. Anyway, the space you mention, if used for
additional high-half or low-half surrogates, is only 80 characters wide
and so would give just slightly more than one more plane, in fact 80 x
1024 characters. And it is the largest space on the BMP which is not
already roadmapped.

I suppose that, in the unlikely event that in the foreseeable future it
looks as if more than 17 planes might become necessary, and anyone is
still trying to use UTF-16 (although by that time memory and bandwidth
will probably be so cheap that no one bothers any more with encodings
that save them), it will be possible to reserve part of the 17 planes
for surrogate pairs representing the remaining planes. So the UTF-16
encoding would be two existing 16-bit surrogate pairs forming a higher
level surrogate pair. UTF-32 would of course be more efficient (32 bits
rather than 64), but I doubt if anyone will care.
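For reference, the standard 20-bit surrogate math that such a "higher level surrogate pair" would be layered on can be sketched as follows (a Python sketch; the extension itself is purely hypothetical):

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Standard UTF-16: a supplementary code point is split into two
    10-bit halves after subtracting 0x10000, giving 16 extra planes
    on top of the BMP (hence today's 17-plane limit)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)

def from_surrogate_pair(hi: int, lo: int) -> int:
    """Inverse: recombine the two 10-bit payloads."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
```

For example, U+10437 encodes as the pair (0xD801, 0xDC37); the scheme described in the message would treat two such pairs as one "higher level" unit.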

If two whole planes were reserved for such surrogates, this mechanism
could cover the whole 32-bit hyperspace. Meanwhile UTF-8 can be extended
to 6 bytes (lead byte 1111110x) to cover the same space. Plenty of
room there to encode not just all the scripts of the Galactic Federation
but even to squeeze in those of the Klingons and their allies!
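That legacy long-form UTF-8 (the X/Open scheme later written up in RFC 2279, with 5- and 6-byte sequences; anything above U+10FFFF is not valid UTF-8 today) can be sketched in Python:

```python
def utf8_legacy_encode(cp: int) -> bytes:
    """Encode a code point with the original (X/Open / RFC 2279) UTF-8
    scheme, which allowed up to 6-byte sequences for values up to
    2**31 - 1.  Anything above U+10FFFF is NOT valid UTF-8 today."""
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit space")
    if cp < 0x80:
        return bytes([cp])
    # (lead-byte prefix, continuation-byte count, largest encodable value)
    for prefix, cont, limit in (
        (0xC0, 1, 0x7FF),        # 110xxxxx
        (0xE0, 2, 0xFFFF),       # 1110xxxx
        (0xF0, 3, 0x1FFFFF),     # 11110xxx
        (0xF8, 4, 0x3FFFFFF),    # 111110xx (5 bytes)
        (0xFC, 5, 0x7FFFFFFF),   # 1111110x (6 bytes)
    ):
        if cp <= limit:
            lead = prefix | (cp >> (6 * cont))
            body = [0x80 | ((cp >> (6 * i)) & 0x3F)
                    for i in range(cont - 1, -1, -1)]
            return bytes([lead] + body)
```

Within U+0000..U+10FFFF this agrees with ordinary UTF-8; above that it produces the 5- and 6-byte sequences of the original scheme.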

Or perhaps a way can be found to graciously retire UTF-16 in some
distant future version of Unicode. That is likely to become viable long
before the extra planes are needed.

--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




Philippe Verdy
2003-10-16 17:45:07 UTC
Permalink
From: "Peter Kirk" <***@qaya.org>

> On 16/10/2003 06:33, Philippe Verdy wrote:
>
> >From: "John Cowan" <***@mercury.ccil.org>
> >
> >>Philippe Verdy scripsit:
> >>>
> >>>I am also doubting, but I would not bet on it. After all, when Unicode
> >>>started, a single plane was considered waaaaaay more than sufficient too.
> >
> >>I not only would bet on it, I actually have a bet on it. Henry Thompson
> >>of the W3C's Schema WG bet me that we'd outrun the existing planes within
> >>five years; four left to go and no sign of it, even if Michael Everson
> >>were to achieve pluripresence and actually get everything accepted into
> >>the standard that he knows needs to be done.
> >
> >Just for the case it would be needed, are you keeping an unassigned range
> >in the BMP so that extension will remain possible to preserve an ascending
> >compatibility or support for UTF-16 which currently is the main reason why
> >there are for now 17 planes defined ?
> >(for example in the range between Hangul syllables and existing surrogates)
> ...
>
> I would guess not. I can think of much more useful things to do with any
> remaining space in the BMP. Anyway, the space you mention, if used for
> additional high-half or low-half surrogates, is only 80 characters wide
> and so would give just slightly more than one more plane, in fact 80 x
> 1024 characters. And it is the largest space on the BMP which is not
> already roadmapped.
>
> I suppose that, in the unlikely event that in the foreseeable future it
> looks as if more than 17 planes might become necessary, and anyone is
> still trying to use UTF-16 (although by that time memory and bandwidth
> will probably be so cheap that no one bothers any more with encodings
> that save them), it will be possible to reserve part of the 17 planes
> for surrogate pairs representing the remaining planes. So the UTF-16
> encoding would be two existing 16-bit surrogate pairs forming a higher
> level surrogate pair. UTF-32 would of course be more efficient (32 bits
> rather than 64), but I doubt if anyone will care.

This is another solution; however, it should be predictable just by
testing the value of the first high surrogate, which would indicate
the length of the encoding sequence for the extended codepoints.
Given that each surrogate encodes 10 bits, a third surrogate of
the same size would encode 30 bits, i.e. half the size of the whole
UTF-32 space _as defined at its origin_ (31 bits).

> If two whole planes were reserved for such surrogates, this mechanism
> could cover the whole 32-bit hyperspace. Meanwhile UTF-8 can be extended
> to 6 bytes (lead byte 1111110x) to cover the same space. Plenty of
> room there to encode not just all the scripts of the Galactic Federation
> but even to squeeze in those of the Klingons and their allies!

Thanks for pointing out something that is already evident in the original
RFC describing the unrestricted UTF-8 encoding as defined by X/Open and
first adopted in the early releases of ISO10646.

> Or perhaps a way can be found to graciously retire UTF-16 in some
> distant future version of Unicode. That is likely to become viable long
> before the extra planes are needed.

I really doubt this: UTF-16 was the preferred encoding scheme for Unicode,
and it is still (and will long remain) the preferred representation in
C/C++ environments that define wchar_t, and in OSes like Windows that
define and use it in the Win32 API...

I would really not like to imagine a situation where UTF-16 became
deprecated by Unicode: this would be a big issue for the many systems that
rely on being able to encode Unicode characters with it, and that will
not want to shift to UTF-32 for a long time, as that would require
defining new APIs at the OS kernel level!

It's true that there is no plan in Unicode to encode anything other than
plain text for existing or future actual scripts. But ISO 10646's
objectives are also to offer support for, and integrate, almost all other
related ISO specifications that may need a unified codepoint space for
encoding either plain text or their own objects.

Yes, we have some clear indications that we won't need more than 17
planes for the scripts considered by Unicode. But keeping the space open
for non-Unicode applications (it would be up to ISO 10646 to accept
and reference them, as Unicode.org will not attempt to define their
properties as actual text characters for general scripts) is still
good insurance for the long-term future of Unicode in a more open
architecture into which it could be fitted.

So I don't see anything wrong if Unicode just says now that only 17
planes will be allocated to encoding plain text in accordance with
ISO 10646, leaving other applications to use and allocate codepoint
ranges that could be kept compatible with UTF-16 via new kinds of
surrogates (something like hyperplane selectors, used as a prefix
before high and low surrogates).

The other solution, based on assigning new hyper-surrogates outside
the BMP, would require, for parsing predictability, that these
"hyper-surrogates" each be encoded with a pair of UTF-16 surrogates.
This would create sequences of 4 code units, which may be quite
wasteful of memory space.

A solution with 3 UTF-16 surrogates, however, could allow
extending the encoding space to a little more than 30 bits,
adding 2^15 planes to the existing 17.
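The arithmetic behind such a unit (a hypothetical scheme, not anything defined by Unicode or ISO 10646) is easy to check:

```python
# Hypothetical hyper-surrogate arithmetic -- nothing here is defined
# by Unicode or ISO 10646.
PAYLOAD_BITS = 10          # bits carried by each surrogate code unit
UNITS = 3                  # three surrogates per extended code point

code_points = 2 ** (PAYLOAD_BITS * UNITS)  # 2**30 = 1,073,741,824
planes = code_points // 0x10000            # 16,384 = 2**14 planes

# "A little more than 30 bits": one extra payload bit doubles the
# space to 2**31 code points, i.e. 2**15 planes of 65,536.
planes_31bit = 2 ** 31 // 0x10000          # 32,768 = 2**15 planes
```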

Suppose there were a 10-bit range reserved in the BMP for these
hyper-surrogates (1,024 codepoints); this would of course conflict
with the current roadmap, which does not leave such space available.
Space remains for now only in these rows:

A8xx (Syloti Nagri) ¿Pahawh Hmong? ??? (Varang Kshiti) (Sorang Sng.) ???
A9xx ¿Chakma? ??? ??? ¿Javanese? ??? ???
AAxx ¿Newari? ??? ??? ¿Siddham? ??? ???
ABxx ¿Saurashtra? ??? ??? (hPhags-pa) ??? ???

As we are discussing here the roadmap for possible future integration
of rare scripts not yet standardized, this is an important issue on
which a decision must be made: should we really fill the whole BMP,
so that it won't fit with future efficient UTF-16-compatible
representations of a larger encoding space in which Unicode would be
only a small part?

If we still want to keep these scripts in the BMP in the roadmap, then
the only solution would be to deprecate some ranges of the BMP PUA
area, giving authors who currently use PUAs in the BMP an early
opportunity to relocate them to one of the new PUA planes 15 and 16.
(Why not the space EBxx..EFxx, or even E8xx..EFxx if we want to cover
the whole 31-bit space of the original X/Open spec?)

The other solution would be to reserve these hyper-surrogates in the
"special" plane 14, as the allocation roadmap leaves this plane
nearly empty, with very few usages. There would still be issues with
applications using the old parsing rules for combining sequences,
which would expect them to be independent characters.



Asmus Freytag
2003-10-16 18:31:03 UTC
Permalink
At 08:03 AM 10/16/03 -0700, Peter Kirk wrote:
>Or perhaps a way can be found to graciously retire UTF-16 in some distant
>future version of Unicode. That is likely to become viable long before the
>extra planes are needed.

This discussion is a pure numbers game. Since no-one can define a hard
number for a cut-off that's guaranteed to be good 'forever', all we have is
probability. (That's all we have anyway, whether in life or science). So
the question becomes an estimate of probability.

128 characters (ASCII) cover 80% of the characters needed by 5% of the
world's population
256 characters (Latin-1) cover 80% of the characters needed by 15% of the
world's population
40,000 characters (Unicode 1.0) cover 95% of the characters needed by 85%
of the world's population
90,000 characters (Unicode 4.0) cover 98% of the characters needed by 95%
of the world's population

Exercise for the reader:

Warmup:
Where do the other 910,000 characters come from, and who's using them?

Easy:
If the UTC and WG2 add 1,000 characters per amendment, how many amendments
will it take to fill the remaining space?

[Note: the number of characters accepted so far by the UTC for the next
amendment is 684]

Medium:
Estimate the effect of some number of larger amendments (CJK)?

[Note: account for the possible use of variation selectors to code Han
variants.]

Hard:
Given your answers to the previous question, estimate when the BMP will be
completely filled.

[Hint: each WG meeting issues at most one amendment, meetings are at least
six months apart]

Extra credit:
Give a believable estimate for the other 16 planes.

A./

PS: private answer to Jill: make sure that your characters are always
represented internally by infinite precision integers. That way you are not
arbitrarily limited by 32-bit integral data types. ;-)




Philippe Verdy
2003-10-16 19:59:00 UTC
Permalink
From: "Asmus Freytag" <***@ix.netcom.com>

> At 08:03 AM 10/16/03 -0700, Peter Kirk wrote:
> >Or perhaps a way can be found to graciously retire UTF-16 in some distant
> >future version of Unicode. That is likely to become viable long before the
> >extra planes are needed.
>
> This discussion is a pure numbers game. Since no-one can define a hard
> number for a cut-off that's guaranteed to be good 'forever', all we have is
> probability. (That's all we have anyway, whether in life or science). So
> the question becomes an estimate of probability.
>
> 128 characters (ASCII) cover 80% of the characters needed by 5% of the
> world's population
> 256 characters (Latin-1) covers 80% of the characters needed by 15% of the
> world's population
> 40,000 characters (Unicode 1.0) covers 95% of the characters needed by 85%
> of the world's population
> 90,000 characters (Unicode 4.0) covers 98% of the characters needed by 95%
> of the world's population
>
> Exercise for the reader:
>
> Warmup:
> Where do the other 910,000 characters come from, and who's using them?

We're not discussing the addition of characters standardized by the
joint efforts of Unicode's UTC and ISO's WG2, and I'm not expecting a
lot of changes in that area, but rather a more general scheme in which
Unicode/ISO 10646 would become part of a larger set of standards for
encoding something other than just pure text. There are already
attempts to encode attributed text, and to mix or interleave text and
object data within a unified encoding scheme.

For now, the inclusion of codepoints like the Object Replacement
Character demonstrates that mixing text and other data in a single
unified, serialized stream is already an issue. Of course, there's now
XML to add structure to this content, but unstructured data also has
its applications, wherever a predefined schema cannot be designed.

Also, there is some need to allow designers of glyph libraries to
encode and exchange them, using privately allocated codepoints,
without risking collision between PUA assignments. As PUA characters
are not designed to be interchanged, another solution could be based
on private reservation in a global registry, similar to reservations
in the IPv4 address space. The codepoint usages could then be privately
agreed upon between collaborating companies that wish to unify their
own codesets and reduce their assignments (a process similar to IP
space aggregation and renumbering, something that has technical issues
but is solvable in the medium term).

In fact, this interchangeability of PUA codepoints is still an
unsolved issue, which could be solved in a way similar to IPv4
assignments under the IANA authority. Nothing needs to change for the
current 17 planes managed and assigned by Unicode/ISO 10646, as long
as the UTC and WG2 accept that they need not centrally manage all
character assignments for every limited group.

Because of this, there is a big risk that PUAs start being permanently
assigned as part of an OS core charset, and that data created on
distinct systems become mutually incompatible because they use
colliding subsets of PUAs (this is already the case in the core fonts
and script processors used in MS Windows, and in a few private
characters/logographs used by Apple in MacOS).

There is a huge number of candidate corporate logographs that could be
reserved simply for usage within a unified scheme including Unicode,
and that could be negotiated within an IANA registry, with a
reservation system similar to domain names. In addition, such a system
could generate some revenue to help finance Unicode and ISO 10646
activities: these private assignments would remain interchangeable as
long as their registration in the registry is active.

We could even imagine implementing this system within a special domain
and using reverse DNS requests to resolve a domain name corresponding
to an assigned codepoint: this domain could then contain information
on how to get glyphs, fonts, or other information supporting that
private codepoint. These glyphs could be protected with digital rights
or privacy and could even include registered logos, graphics,
designs... and even colorful photographs and artworks.

I could imagine a lot of other similar applications... This does not
contradict the Unicode/ISO 10646 goal of keeping the 17 planes open to
everybody's use and publicly accessible for global interchange of
information, under a strict policy describing the correct usage of
codepoints assigned and unified by ISO's WG2 and Unicode.org's UTC.



Asmus Freytag
2003-10-16 21:21:48 UTC
Permalink
At 09:59 PM 10/16/03 +0200, Philippe Verdy wrote:

>We're not discussing about addition of characters standardized by joint
>efforts of Unicode's UTC and ISO's WG2, and I'm not expecting a lot of
>changes in this area. But about a more general scheme in which the
>Unicode/ISO10646 would become a part of a larger set of standards for
>encoding something else than just pure text.


I wasn't aware that you were designing an unrelated standard. What you
describe in your message, and which I am not repeating here, is not Unicode.

When you design your own standard, you don't need to worry about features of
Unicode, such as UTF-16.

A./



Mark E. Shoulson
2003-10-16 22:26:54 UTC
Permalink
Philippe Verdy wrote:

>Due to that, there's a big risk that PUAs start being permanently assigned
>as part of a OS core charset, and that data created on distinct systems
>become mutually incompatible as they are using colliding subsets of PUAs
>(this is already the case in core fonts and script processors used in
>MS Windows, and a few private characters/logographs used by Apple in
>MacOS).
>
>
Yes. When I've made websites using the PUA section for Klingon, even
though my Unicode font contains the right glyphs there, it's unreadable
because some of those codepoints overlap some Adobe or Apple characters,
and my browser in its wisdom decides that Adobe wins. So some of the
characters show right, and some show (r) signs and suchlike.

(We're now *trying* to do more data interchange in Klingon characters,
the lack of which is what held Klingon back from encoding in the first
place... but we *can't* because we're not encoded! What a catch-22!)

~mark



Michael Everson
2003-10-16 21:48:57 UTC
Permalink
At 18:26 -0400 2003-10-16, Mark E. Shoulson wrote:

>(We're now *trying* to do more data interchange in Klingon
>characters, the lack of which is what held Klingon back from
>encoding in the first place... but we *can't* because we're not
>encoded! What a catch-22!)

But we determined that the users of the Klingon language normally
write it with Latin letters. Were Okrand's dictionary to be reprinted
in Klingon characters and were people to actually use them, Klingon
might be considered to be an actual script used for Klingon. But
currently, the evidence is that it is just a cypher for Latin.
--
Michael Everson * * Everson Typography * * http://www.evertype.com


John Cowan
2003-10-16 20:12:27 UTC
Permalink
Asmus Freytag scripsit:

> PS: private answer to Jill: make sure that your characters are always
> represented internally by infinite precision integers.

Actually, the intractability of the transfinite ordinals shows us that
there can be no such thing. Ordinary "infinite precision" integers
begin with a fixed-size length word saying how many words (for whatever
definition of "word") follow. But eventually the fixed-size length
word will overflow, and must be replaced by a variable-sized length
word, itself prefixed by a length^2 word. But eventually the length^2
word will overflow, and must be replaced .... But eventually the number
of length words will become too large, and the length-length word will
overflow, requiring ... But then ...

You can't win.

--
The man that wanders far ***@reutershealth.com
from the walking tree http://www.reutershealth.com
--first line of a non-existent poem by: John Cowan


Peter Kirk
2003-10-16 21:31:55 UTC
Permalink
On 16/10/2003 13:12, John Cowan wrote:

>Asmus Freytag scripsit:
>
>>PS: private answer to Jill: make sure that your characters are always
>>represented internally by infinite precision integers.
>
>Actually, the intractability of the transfinite ordinals shows us that
>there can be no such thing. Ordinary "infinite precision" integers
>begin with a fixed-size length word saying how many words (for whatever
>definition of "word") follow. But eventually the fixed-size length
>word will overflow, and must be replaced by a variable-sized length
>word, itself prefixed by a length^2 word. But eventually the length^2
>word will overflow, and must be replaced .... But eventually the number
>of length words will become too large, and the length-length word will
>overflow, requiring ... But then ...
>
>You can't win.
>
But what if you use a null-terminated string of decimal digits (what's
a bit of inefficiency when you are talking about infinity?), or any
encoding in which null is not a valid part of the number? Then the
precision is truly infinite, surely (at least up to the first
transfinite number), except that if the universe is finite there is an
ultimate limit on storage capacity.

It all reminds me of a book I read, not intended as science fiction but
as a real contribution to science and philosophy, which predicted that
the universe would collapse but as it collapses be converted into a
giant computer whose speed would increase exponentially (or something)
as it collapsed in such a way that in the finite time before the
collapse it could perform an infinite number of calculations. Which had
some rather interesting consequences, but they go well off topic!

--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




Michael Everson
2003-10-16 19:18:24 UTC
Permalink
Someone calculated that at the present rate of character encoding
(1000 a year) it would take something like 700 years to fill the
whole range of characters....
--
Michael Everson * * Everson Typography * * http://www.evertype.com


Rick McGowan
2003-10-16 19:31:13 UTC
Permalink
Michael wrote...

> Someone calculated that at the present rate of character encoding
> (1000 a year) it would take something like 700 years to fill the
> whole range of characters....

I think Ken and I have both done similar calculations, which are a matter
of record in the mail list archives, if anyone cares to dig them up...

Rick


Michael Everson
2003-10-16 19:18:23 UTC
Permalink
Philippe Verdy scripsit:

> What would happen if ISO10646 decided to stop its work, giving up to let
> IANA contract with external registrars, just to comply with a rapid
> industry need to publish more media and still interoperate?

There would be utter chaos, and some character tsar would have to be
appointed as has been done for RFC 3066, and it would be a right
bloody mess.
--
Michael Everson * * Everson Typography * * http://www.evertype.com


Michael Everson
2003-10-16 21:40:59 UTC
Permalink
At 23:29 +0200 2003-10-16, Philippe Verdy wrote:

>I would definitely prefer to have a system in which any leakage of private
>uses could be controlled under a well-known policy requiring a reservation
>in a publicly accessible registry, like domain names.

Well, you can't. Private Use is Private Use. You cannot restrict it.
You cannot control it. You can, as a private person, guide it, as
John and I do for some scripts in the CSUR. But that isn't standard,
and it isn't going to be. Ever. Guaranteed.

>If one designs an open reservation system in a global registry

Like the CSUR? No, that's closed, because John and I decide what we
let in and what we don't.

>(possibly with small annual fees to maintain this registration in a
>global registry),

Which no one would ever pay.....

>the reservation could be made much more safe.

Not at all.

>In addition, this would not prohibit rapid innovation or usage of
>new characters, and further standardization if needed in the
>Unicode/ISO10646 space, where these characters, now of public
>interest, could be assigned more permanently and without renewal
>fees, provided that their usage is clearly documented by its author
>and interested groups of users...

Even *I* (who encourage all of you to contribute generously to the
Script Encoding Initiative, which actually *does* manage to get
characters encoded) would not dream of trying to charge money in such
a loony scheme.

>The main reason why those semi-private characters could be standardized
>later is for conservation of documents which could then be transcoded to
>the now safe Unicode/ISO10646 space.

There is no such thing as a semi-private character. There are
standardized characters (which have particular meanings), and there
are private use characters (which are guaranteed to have no meanings
at all).
--
Michael Everson * * Everson Typography * * http://www.evertype.com


John Cowan
2003-10-17 02:29:50 UTC
Permalink
Michael Everson scripsit:

> There is no such thing as a semi-private character. There are
> standardized characters (which have particular meanings), and there
> are private use characters (which are guaranteed to have no meanings
> at all).

There's glory (by which I mean: a nice knockdown argument) for you!

--
LEAR: Dost thou call me fool, boy? John Cowan
FOOL: All thy other titles http://www.ccil.org/~cowan
thou hast given away: ***@reutershealth.com
That thou wast born with. http://www.reutershealth.com


Doug Ewell
2003-10-17 05:45:35 UTC
Permalink
Michael Everson <everson at evertype dot com> wrote:

> There is no such thing as a semi-private character. There are
> standardized characters (which have particular meanings), and there
> are private use characters (which are guaranteed to have no meanings
> at all).

I thought PUA characters had very specific meanings, defined by the
private parties who assign them, but utterly ignored by the standard.

What Philippe may have meant by "semi-private characters" is "characters
whose privately assigned meanings have been publicized." This would
include CSUR scripts, Apple's and Microsoft's assorted ligatures and
dingbats, and William Overington's stuff.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/



Philippe Verdy
2003-10-17 07:28:33 UTC
Permalink
From: "Doug Ewell" <***@adelphia.net>

> Michael Everson <everson at evertype dot com> wrote:
>
> > There is no such thing as a semi-private character. There are
> > standardized characters (which have particular meanings), and there
> > are private use characters (which are guaranteed to have no meanings
> > at all).
>
> I thought PUA characters had very specific meanings, defined by the
> private parties who assign them, but utterly ignored by the standard.
>
> What Philippe may have meant by "semi-private characters" is "characters
> whose privately assigned meanings have been publicized." This would
> include CSUR scripts, Apple's and Microsoft's assorted ligatures and
> dingbats, and William Overington's stuff.

As well as the many logographic characters that can be interchanged in
some limited ways, under conditions, and that will not fit the strictly
open model needed by Unicode, as they may be covered by copyrights or
digital rights. The existence of these restrictions, which limit but do
not in fact forbid the interchange of information, is a grey area
currently covered neither by the PUA (not interchangeable) nor by the
regulated parts of Unicode, due to its policy.

This is a place where limited agreements managed by a registration
authority could help avoid conflicts of assignments, simply not possible
with PUA.



Michael Everson
2003-10-17 13:26:05 UTC
Permalink
At 09:28 +0200 2003-10-17, Philippe Verdy wrote:

>This is a place where limited agreements managed by a registration
>authority could help avoid conflicts of assignments, simply not possible
>with PUA.

Philippe, if you'd like to pay me forty thousand a year I would be
delighted to manage such a private registry. And I have such
excellent experience with the CSUR, too.
--
Michael Everson * * Everson Typography * * http://www.evertype.com


Doug Ewell
2003-10-16 07:18:47 UTC
Permalink
Philippe Verdy <verdy_p at wanadoo dot fr> wrote:

> 2. The initial spec of UTF-32 and UTF-8 by ISO allowed many more
> planes with 31-bit codepoints, and maybe there will be an agreement
> sometime in the future between ISO and Unicode to define new
> codepoints outside the current standard first 17 planes that can be
> safely converted with UTF-16, or a mechanism will be specified to
> allow mapping more planes to UTF-16. But this is currently not a
> priority as long as there remains unallocated space in the BMP to
> define new types and ranges of surrogates for "hyperplanes", something
> that is still possible near the Hangul block, just before the existing
> low and high surrogates.

Don't even begin to count on this. U+10FFFF will most assuredly be the
upper limit as long as you and I are here to talk about it.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/



John Cowan
2003-10-16 10:40:51 UTC
Permalink
Doug Ewell scripsit:

> Don't even begin to count on this. U+10FFFF will most assuredly be the
> upper limit as long as you and I are here to talk about it.

Unless Earth joins the Galactic Federation, in which case we will have
to rethink the Ultra-Astral Planes.

--
Long-short-short, long-short-short / Dactyls in dimeter,
Verse form with choriambs / (Masculine rhyme): ***@reutershealth.com
One sentence (two stanzas) / Hexasyllabically http://www.reutershealth.com
Challenges poets who / Don't have the time. --robison who's at texas dot net


Doug Ewell
2003-10-16 07:26:39 UTC
Permalink
Peter Kirk <peterkirk at qaya dot org> wrote:

> Does everyone agree that "This is not a performance issue"?

You can never tell whether something is going to be a "performance
issue" -- not just "measurably slower," but actually affecting
usability -- until you do some profiling. Guessing does no good.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/



Peter Kirk
2003-10-16 09:26:52 UTC
Permalink
On 16/10/2003 00:26, Doug Ewell wrote:

>Peter Kirk <peterkirk at qaya dot org> wrote:
>
>
>
>>Does everyone agree that "This is not a performance issue"?
>>
>>
>
>You can never tell whether something is going to be a "performance
>issue" -- not just "measurably slower," but actually affecting
>usability -- until you do some profiling. Guessing does no good.
>
>-Doug Ewell
> Fullerton, California
> http://users.adelphia.net/~dewell/
>
>
>
Well, did the people who wrote this in the standard do some profiling,
or did they just guess? There should be no place in a standard for
statements which are just guesses.

--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




Doug Ewell
2003-10-16 15:56:48 UTC
Permalink
Peter Kirk <peterkirk at qaya dot org> wrote:

> On 16/10/2003 00:26, Doug Ewell wrote:
>
>> You can never tell whether something is going to be a "performance
>> issue" -- not just "measurably slower," but actually affecting
>> usability -- until you do some profiling. Guessing does no good.
>
> Well, did the people who wrote this in the standard do some profiling,
> or did they just guess? There should be no place in a standard for
> statements which are just guesses.

I don't know. I probably wouldn't have written it that way, though, nor
used the word "insignificant." I might have left it at "this is a
relatively inexpensive operation, compared to the time necessary to
carry out other work required for rendering."

It always helps to have some knowledge of the application. Just the
other day at work, I asserted -- without profiling -- that a certain
operation would *not* be a performance issue. It involved massaging an
integer value for backward compatibility before using it to populate a
dialog box:

switch (nOption)
{
    case 0: nOption = 2; break;
    case 2: nOption = 0; break;
    default: break;
}

The process of populating this dialog box also involves several
underlying calls to the Windows API, and a call to a SQL database, which
are far more expensive than this simple integer operation.

In this case, I was familiar with the application, and knew that the
extra processing would not be performed in a loop or something.
*Unless* you have that type of knowledge, profiling is really the only
way to tell whether you have a performance issue.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/



Asmus Freytag
2003-10-16 17:56:32 UTC
Permalink
At 02:26 AM 10/16/03 -0700, Peter Kirk wrote:
>>You can never tell whether something is going to be a "performance
>>issue" -- not just "measurably slower," but actually affecting
>>usability -- until you do some profiling. Guessing does no good.
>>
>Well, did the people who wrote this in the standard do some profiling, or
>did they just guess? There should be no place in a standard for statements
>which are just guesses.

Oh don't we just love making categorical statements today.

Scripts where the issue is expected to actually matter include Arabic and
Hebrew. Both those scripts require the Bidi algorithm to be run (in
addition to all the other rendering related tasks). There are two phases to
that algorithm: level assignment and reversal. Assigning levels is a linear
process, but reversal depends on both the input size and the number of
levels. So it's essentially O(N×m), where m is a not-quite-constant
but small number.

Arabic would need positional shaping in addition to the bidi algorithm.

Normalization has mapping and reordering phases. The reordering is O(n
log(n)) where n is the length of a combining sequence. Realistically that's
a small number. The rest of the algorithm is O(N) with N the length of the
input. For NFC there are decomposition and composition phases, so the
number of steps per character is not as trivial as a strcpy, but then
again, neither is bidi.
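
The reordering phase described here is the standard's canonical ordering
algorithm: a stable exchange sort on canonical combining classes. A rough
sketch, assuming a hypothetical `uchar_t` pairing a code point with its
combining class (a real implementation would look the class up in the
Unicode Character Database):

```c
#include <stddef.h>

/* Illustrative only: pairs a code point with its canonical combining
 * class (ccc); a real implementation would obtain ccc from the UCD. */
typedef struct { unsigned cp; unsigned char ccc; } uchar_t;

/* Canonical ordering: a stable exchange sort that bubbles each
 * non-starter left past neighbours with a strictly greater ccc.
 * Starters (ccc == 0) never move, and equal classes keep their
 * relative order, exactly as canonical equivalence requires. */
void canonical_reorder(uchar_t *s, size_t len)
{
    for (size_t i = 1; i < len; i++) {
        size_t j = i;
        while (j > 0 && s[j].ccc != 0 && s[j - 1].ccc > s[j].ccc) {
            uchar_t tmp = s[j - 1];
            s[j - 1] = s[j];
            s[j] = tmp;
            j--;
        }
    }
}
```

On a Hebrew consonant followed by an accent (ccc 230) typed before a
vowel point (ccc 14), this moves the vowel before the accent, matching
normalized order; since combining sequences are short, the quadratic
worst case never bites.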

The rest of rendering has to map characters to glyphs, add glyph extents,
calculate line breaks, determine glyph positions, and finally rasterize
outlines and copy bits. (When rendering to PDF, the last two steps would be
slightly different). That's a whole lot of passes over the data as well,
many of them with a non-trivial number of steps per input character.

Given this context, it's more than an educated guess that normalization at
rendering time will not dominate the performance, particularly not when
optimized.

Even for pure ASCII data (which never need normalization), the rendering &
display tasks will take more steps per character than a normalization quick
check (especially one optimized for ASCII input ;-).

Therefore, I regard the statement in the text of the standard as quite
defensible (if understood in context) and to be better supported than a
mere 'guess'. It's a well-educated guess, probably even a Ph.D. guess.

However, if someone has measurements from a well-tuned system, it would be
nice to know some realistic values for the relative cost of normalization
and display.

A./


John Cowan
2003-10-16 19:35:17 UTC
Permalink
Asmus Freytag scripsit:

> Normalization has mapping and reordering phases. The reordering is O(n
> log(n)) where n is the length of a combining sequence. Realistically that's
> a small number.

I tried sorting 100,000,000 runs of random bytes where each run has a
randomly chosen length from 1 to 5 with a naive quicksort and a naive
bubblesort. Although quicksort is O(N log N) and bubblesort is O(N^2),
the bubblesort was just a hair faster, 220 seconds on my machine vs. 225
seconds, and the code is a lot shorter and easier to get right. A few
experiments suggest that these results are linear in the number of runs,
which is not surprising. Not invoking the sort algorithm when N = 1,
the most trivial possible optimization hack, improves the advantages of
bubblesort substantially, making the running time 206 seconds vs. 252
seconds. A 20% improvement, easily achieved, is not to be sneezed at.

YMMV.
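
The experiment above is straightforward to reproduce in miniature; this
sketch (run count, seed, and helper names are illustrative, and timings
will of course differ by machine) times a naive bubblesort against the C
library's qsort on short random runs:

```c
#include <stdlib.h>
#include <time.h>

/* Naive bubble sort over a short run of bytes. */
void bubblesort(unsigned char *a, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++)
        for (size_t j = 0; j + 1 < n - i; j++)
            if (a[j] > a[j + 1]) {
                unsigned char t = a[j]; a[j] = a[j + 1]; a[j + 1] = t;
            }
}

/* Byte comparator for qsort. */
int cmp_byte(const void *p, const void *q)
{
    return *(const unsigned char *)p - *(const unsigned char *)q;
}

/* Time `runs` sorts of random runs of length 1..5; returns seconds. */
double bench(int use_qsort, long runs)
{
    srand(42);                 /* fixed seed: same data for both sorts */
    clock_t t0 = clock();
    for (long r = 0; r < runs; r++) {
        unsigned char buf[5];
        size_t n = (size_t)(rand() % 5) + 1;
        for (size_t i = 0; i < n; i++)
            buf[i] = (unsigned char)rand();
        if (use_qsort)
            qsort(buf, n, 1, cmp_byte);
        else
            bubblesort(buf, n);
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling bench(0, ...) and bench(1, ...) with a large run count reproduces
the comparison; at these sizes the constants, not the asymptotics, dominate.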

--
I don't know half of you half as well John Cowan
as I should like, and I like less than half ***@reutershealth.com
of you half as well as you deserve. http://www.ccil.org/~cowan
--Bilbo http://www.reutershealth.com


t***@eatoni.com
2003-10-16 21:04:52 UTC
Permalink
>>>>> "John" == John Cowan <***@reutershealth.com> writes:
John> I tried sorting 100,000,000 runs of random bytes where each run
John> has a randomly chosen length from 1 to 5 with a naive quicksort
John> and a naive bubblesort. Although quicksort is O(N log N) and
John> bubblesort is O(N^2), the bubblesort was just a hair faster, 220
John> seconds on my machine vs. 225 seconds, and the code is a lot
John> shorter and easier to get right. A few experiments suggest that
John> these results are linear in the number of runs, which is not
John> surprising.

A few quick comments on this:

- Any discussion of big O notation (not that you brought it up) in
the context of sorting 5 elements is likely to be pretty irrelevant
and/or misleading. Constants matter (a lot) in small cases, and
various worst-case O(n lg n) algorithms are likely to be slower than
bubble sort for a small number of elements (30 or so).

- The model of computation behind this big O analysis is comparisons
and with such a small number of elements the comparisons may be
just a minor part of what's needed (i.e., the model of computation
typically used in the traditional analysis of comparison-based
algorithms may well be entirely inappropriate in this context).

- Those big O numbers are for worst case behavior. In the wild I
expect (this could be entirely wrong) that in many cases one would
encounter sequences of combining chars that were already sorted -
and this is the best case for bubblesort (but not for various other
algorithms).

- If you really wanted to make this sorting quick, and you had a
concrete upper bound on the number of things you wanted to sort
you'd use specialized code (like your treatment of just one
element) for at least the first 5 cases. The optimal number of
comparisons is known for small values. Again though, that's talking
about comparisons, which may not be the most important factor in
doing this processing. For example, moving data around, argument
marshalling and function call/return, or other algorithm setup
costs may completely swamp the cost of doing half a dozen integer
comparisons.

- You mention it looks linear, and it certainly is. You're repeatedly
sorting 1 to 5 things. Each of your algorithms has some average
case time to do this and that's going to wind up as something that
looks extraordinarily like a constant over 100 million runs. Even
if you compared the sets of runs of length 1, 2 etc separately,
they would still all look linear / parallel (if that makes sense -
you need to imagine the graph I'm imagining :-)). That's because
the worst case big O understanding of the algorithms involved just
isn't going to be relevant for such small n.
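
The suggestion of specialized code for small fixed sizes can be
illustrated with a three-element compare-exchange network, which sorts
with the provably optimal three comparisons and no loop overhead (a
sketch, not from the thread):

```c
/* Conditionally exchange two bytes so *a <= *b afterwards. */
void cswap(unsigned char *a, unsigned char *b)
{
    if (*a > *b) { unsigned char t = *a; *a = *b; *b = t; }
}

/* Three-element sorting network: the fixed comparator sequence
 * (0,1)(1,2)(0,1) sorts any input in exactly three comparisons,
 * the known optimum for n = 3. */
void sort3(unsigned char *a)
{
    cswap(&a[0], &a[1]);
    cswap(&a[1], &a[2]);
    cswap(&a[0], &a[1]);
}
```

Analogous fixed networks exist for n up to 5, covering essentially every
combining sequence a renderer will meet.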

Regards,
Terry.

John Hudson
2003-10-16 18:20:43 UTC
Permalink
At 12:26 AM 10/16/2003, Doug Ewell wrote:

>Peter Kirk <peterkirk at qaya dot org> wrote:
>
> > Does everyone agree that "This is not a performance issue"?
>
>You can never tell whether something is going to be a "performance
>issue" -- not just "measurably slower," but actually affecting
>usability -- until you do some profiling. Guessing does no good.

And does everyone agree on that definition of 'performance issue'? I've
spoken with text processing engineers who certainly consider 'measurably
slower' to be a 'performance issue', especially if that decreased speed is
noticeable to users who do not benefit from changes to existing software.
For example -- knowing that it is on Peter's mind -- if an existing Hebrew
text engine is modified to be able to correctly render normalised Biblical
Hebrew -- e.g. by buffered re-ordering of characters from normalised order
to an order that can be processed by fonts -- is the measurably slower
performance an acceptable performance hit *if* your priority is modern
Hebrew text processing that does not require such re-ordering? While some
software developers have, happily, devoted considerable resources to
supporting minority languages and user communities, I can understand their
concern if supporting a minority use of a script has an impact on the
performance of that script for the majority of users. Of course, I say this as
one of the minority users of the Hebrew script, who would really like to
see better support for Biblical Hebrew text processing, but I think it is
worthwhile trying to understand why there might be reluctance in some
quarters to pursue particular solutions.

John Hudson

Tiro Typeworks www.tiro.com
Vancouver, BC ***@tiro.com

I sometimes think that good readers are as singular,
and as awesome, as great authors themselves.
- JL Borges



Peter Kirk
2003-10-16 21:48:47 UTC
Permalink
On 16/10/2003 11:20, John Hudson wrote:

> At 12:26 AM 10/16/2003, Doug Ewell wrote:
>
>> Peter Kirk <peterkirk at qaya dot org> wrote:
>>
>> > Does everyone agree that "This is not a performance issue"?
>>
>> You can never tell whether something is going to be a "performance
>> issue" -- not just "measurably slower," but actually affecting
>> usability -- until you do some profiling. Guessing does no good.
>
>
> And does everyone agree on that definition of 'performance issue'?
> I've spoken with text processing engineers who certainly consider
> 'measurably slower' to be a 'performance issue', especially if that
> decreased speed is noticeable to users who do not benefit from changes
> to existing software. For example -- knowing that it is on Peter's
> mind -- if an existing Hebrew text engine is modified to be able to
> correctly render normalised Biblical Hebrew -- e.g. by buffered
> re-ordering of characters from normalised order to an order that can
> be processed by fonts -- is the measurably slower performance an
> acceptable performance hit *if* your priority is modern Hebrew text
> processing that does not require such re-ordering? ...

Why should it be a performance hit for modern Hebrew? Most modern Hebrew
is unpointed, which means that it has no combining characters, and so
any reordering routines would never be triggered. In rare cases there
may be single combining characters, but as John Cowan realised there is
no need to call a sort routine to sort a single character. The sort
routines need only be called when they are needed.

Asmus mentioned a composition phase in producing NFC. This is not
actually relevant to Hebrew either, as there are no precomposed
characters, apart from the alphabetic presentation forms, which are
composition exclusions. A rendering system might actually choose to use
those APFs, but that is a separate issue.
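This is easy to check directly. The sketch below uses Python's unicodedata module (an illustration added here, not part of the original exchange): because U+FB31 HEBREW LETTER BET WITH DAGESH is a composition exclusion, NFC leaves the decomposed pair alone, while NFD still decomposes the presentation form.

```python
import unicodedata

# Bet (U+05D1) followed by dagesh (U+05BC). NFC does NOT compose this
# pair into U+FB31 HEBREW LETTER BET WITH DAGESH, because the alphabetic
# presentation forms are composition exclusions.
seq = "\u05D1\u05BC"
assert unicodedata.normalize("NFC", seq) == seq   # stays two code points

# The precomposed presentation form still decomposes canonically
# to the same pair under NFD.
assert unicodedata.normalize("NFD", "\uFB31") == seq
```

So a Hebrew rendering pipeline never receives NFC-introduced precomposed characters unless it deliberately substitutes the APFs itself.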

--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




John Hudson
2003-10-17 00:23:44 UTC
Permalink
At 02:48 PM 10/16/2003, Peter Kirk wrote:

>Why should it be a performance hit for modern Hebrew? Most modern Hebrew
>is unpointed, which means that it has no combining characters, and so any
>reordering routines would never be triggered. In rare cases there may be
>single combining characters, but as John Cowan realised there is no need
>to call a sort routine to sort a single character. The sort routines need
>only be called when they are needed.

My understanding is that at least the rendering systems I'm most familiar
with seem to be pretty much all or nothing: you either pass a string to a
script engine or you don't. If the string is passed to the script engine, that
engine goes through its steps. Having the engine decide whether or not to
go through a particular step would seem to be an extra step, i.e. an even
bigger hit on performance. In Uniscribe, for example, there is a single
engine for Hebrew: if you make changes to the engine to facilitate Biblical
Hebrew, what is the impact on processing speed, and is it acceptable if
your primary user group do not benefit from the change because they're
using modern Hebrew? Note that I don't claim to know what the impact is,
I'm just trying to understand why engineers I've spoken to about this are
unhappy about the idea of adding extra processing steps to existing engines
that work well for the majority user community.

John Hudson

Tiro Typeworks www.tiro.com
Vancouver, BC ***@tiro.com

I sometimes think that good readers are as singular,
and as awesome, as great authors themselves.
- JL Borges



Peter Kirk
2003-10-17 09:56:33 UTC
Permalink
On 16/10/2003 17:23, John Hudson wrote:

> ...
> My understanding is that at least the rendering systems I'm most
> familiar with seem to be pretty much all or nothing: you either pass a
> string to a script engine or you don't. If the string is passed to the
> script engine, that engine goes through its steps. Having the engine
> decide whether or not to go through a particular step would seem to be
> an extra step, i.e. an even bigger hit on performance. In Uniscribe,
> for example, there is a single engine for Hebrew: if you make changes
> to the engine to facilitate Biblical Hebrew, what is the impact on
> processing speed, and is it acceptable if your primary user group do
> not benefit from the change because they're using modern Hebrew? Note
> that I don't claim to know what the impact is, I'm just trying to
> understand why engineers I've spoken to about this are unhappy about
> the idea of adding extra processing steps to existing engines that
> work well for the majority user community.

Well, I don't claim to be able to rewrite anyone's script engine. But
this just doesn't add up. All Hebrew script has to be passed to a script
engine anyway for bidi processing etc. Within this script engine there
is already special processing for combining marks, different from what
is needed for base characters. In order to invoke this processing, there
must be some kind of test of whether each character is a combining
character. With unpointed Hebrew text, this test always fails and so the
special processing code for combining marks is never run. So, if the
extra complication of a reordering pass is added to this special
processing, it cannot affect the performance when working with unpointed
text.
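The argument can be sketched in code (a minimal illustration added here, not any shipping engine's implementation): the canonical reordering step only sorts runs of combining marks, so text with no combining marks never reaches the sort at all.

```python
import unicodedata

def canonical_reorder(s):
    """Sketch of the Canonical Ordering Algorithm: stable-sort each run
    of combining marks (ccc > 0) by combining class. Base characters
    (ccc == 0) fall straight through, so for unpointed text the sort
    branch is never reached."""
    chars = list(s)
    i = 0
    while i < len(chars):
        if unicodedata.combining(chars[i]) == 0:
            i += 1                      # base character: no extra work
            continue
        j = i
        while j < len(chars) and unicodedata.combining(chars[j]) != 0:
            j += 1                      # find the end of the mark run
        # sorted() is stable, as the algorithm requires
        chars[i:j] = sorted(chars[i:j], key=unicodedata.combining)
        i = j
    return "".join(chars)

# Typing order bet + dagesh (ccc 21) + hiriq (ccc 14) reorders to the
# canonical order bet + hiriq + dagesh; unpointed text is untouched.
```

For unpointed input the function is a single pass of ccc lookups, which is exactly the test the engine already performs to detect combining marks.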


--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




B***@sil.org
2003-10-16 09:04:36 UTC
Permalink
On 15/10/2003 21:44:20 Peter Kirk wrote:

>On 15/10/2003 10:48, Asmus Freytag wrote:
>>
>> So we conclude: "rendering any string as if it was normalized" is
>> *not* a performance issue.
>
>Thank you. This is the clarification I was looking for, and confirms my
>own suspicions. But are there any other views on this? I have heard
>them from implementers of rendering systems. But I have wondered if this
>is because of their reluctance to do the extra work required to conform
>to this requirement.

It may not be just the extra work that gives rise to such reluctance:
There may be pieces out of the implementer's control (e.g., fonts) that
would also have to change.

Bob
Peter Kirk
2003-10-16 10:16:08 UTC
Permalink
On 16/10/2003 02:04, ***@sil.org wrote:

>
> On 15/10/2003 21:44:20 Peter Kirk wrote:
>
> >On 15/10/2003 10:48, Asmus Freytag wrote:
> >>
> >> So we conclude: "rendering any string as if it was normalized" is
> >> *not* a performance issue.
> >
> >Thank you. This is the clarification I was looking for, and confirms my
> >own suspicions. But are there any other views on this? I have heard
> >them from implementers of rendering systems. But I have wondered if this
> >is because of their reluctance to do the extra work required to conform
> >to this requirement.
>
> It may not be just the extra work that gives rise to such reluctance:
> There may be pieces out of the implementer's control (e.g., fonts)
> that would also have to change.
>
> Bob

Surely not, in principle. If a font currently correctly renders the
canonical order (and perhaps other non-canonical orders) and the
rendering engine is changed to always present the text to the font in
canonical order, the rendering remains correct. It could be less
efficient, but is more likely to be more efficient because any attempted
reordering in the font is likely to be inefficient but will be bypassed
by this mechanism. And if a font correctly renders not the canonical
order but a tailored order, with permuted combining class weights as in
Table 5-6, and this tailored ordering is agreed and implemented by the
rendering engine provider and the font provider, rendering remains OK.

The problem only comes with fonts which have been written under the
assumption that they will be presented with a particular non-canonical
order, and designed to work with a rendering engine which does not
guarantee to render in that order. But such fonts are immediately
problematical in a number of ways e.g. they are unable to render
correctly HTML and XML text normalised as recommended. Unfortunately
such fonts are being produced - although this is not the font producers'
fault, for in some cases the Unicode canonical order is so far from the
logical order that it is not reasonably possible for the font to do the
reordering required to render the canonical order. The fix needs to be
made either in the canonical order (ruled out by short-sighted stability
guarantees) or in the rendering engine, not separately in each font.

By the way, as this permutation of combining class weights is for
rendering only and so does not need to round-trip, I don't see any
good reason for prohibiting the combination of combining classes. As for
the prohibition on splitting classes, this is obviously necessary where
there is real typographical interference, to avoid unwanted reordering.
But in cases where the combining classes have been incorrectly allocated,
so that there is no actual typographical interference and the rendering
must be independent of the actual order of the characters, splitting the
class -- or transferring the misallocated character to the class it should
have been in -- may be the best way to render correctly.
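A tailored ordering of this kind can be modelled as a sort over permuted weights rather than raw combining classes. The sketch below is illustrative only: the weight table is invented for the example, and any table that keeps genuinely interacting classes in their relative order will render all canonically equivalent inputs identically.

```python
import unicodedata

# Hypothetical tailored weights (NOT from the standard): here we permute
# hiriq (U+05B4, ccc 14) and dagesh (U+05BC, ccc 21) so that dagesh is
# processed first, as a font pipeline might prefer.
TAILORED = {0x05BC: 1, 0x05B4: 2}

def tailored_reorder(s, weights):
    """Reorder runs of combining marks by a tailored weight, falling
    back to the canonical combining class for unlisted characters."""
    def w(c):
        return weights.get(ord(c), unicodedata.combining(c))
    chars = list(s)
    i = 0
    while i < len(chars):
        if unicodedata.combining(chars[i]) == 0:
            i += 1
            continue
        j = i
        while j < len(chars) and unicodedata.combining(chars[j]) != 0:
            j += 1
        chars[i:j] = sorted(chars[i:j], key=w)  # stable sort by weight
        i = j
    return "".join(chars)
```

Because the sort is total over each mark run, both canonically equivalent input orders collapse to the same tailored order before the font sees them, which is the property that matters for rendering.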

--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




Jill Ramonsky
2003-10-16 11:14:25 UTC
Permalink
> -----Original Message-----
> From: Doug Ewell [mailto:***@adelphia.net]
> Sent: Thursday, October 16, 2003 8:19 AM
> To: Philippe Verdy; Nelson H. F. Beebe
> Cc: ***@unicode.org
> Subject: Re: Java char and Unicode 3.0+ (was:Canonical equivalence in
> rendering: mandatory or recommended?)
>
>
> > planes with 31-bit codepoints, and maybe there will be an agreement
> > sometime in the future between ISO and Unicode to define new
> > codepoints out of the current standard 17 first planes
>
> Don't even begin to count on this. U+10FFFF will most
> assuredly be the
> upper limit as long as you and I are here to talk about it.

As a scientist, I don't believe in clairvoyance. I do, however, think
that "maybe ... sometime in the future ..." is a reasonable enough
statement to make, and that "...will most assuredly ...as long as you
and I are here" is a very dangerous prediction to make (unless I'm
wrong about clairvoyance).

Don't count on anything. Even if Unicode stops at 10FFFF, there may be
other, future standards, of which Unicode is but a subset. I'm sure the
designers of ASCII thought it was amply large enough at the time.

It's a simple enough rule - never hard-code limitations into your design
if you don't have to. You may one day live to regret it. (Or you may not
... but no-one will ever criticise you for erring on the side of safety).

Jill Ramonsky





John Cowan
2003-10-16 12:07:42 UTC
Permalink
Jill Ramonsky scripsit:

> As a scientist, I don't believe in clairvoyance. I do, however, think
> that "maybe ... sometime in the future ..." is a reasonable enough
> statement to make, and that "...will most assuredly ...as long as you
> and I are here" is a very dangerous prediction to make (unless I'm
> wrong about clairvoyance).

The Sun will continue to rise (or appear to do so), most assuredly, as
long as you and I are here and a good deal longer too. What is more,
the claim that Julius Caesar was assassinated on the ides of March in
the year 44 BCE is unlikely to be challenged. These predictions are
entirely consistent with the scientific worldview as understood by
reasonable persons.

Similarly, the number of characters used by the peoples of the Earth
for writing their various languages is not going to be expanded by
the discovery of 10,000 characters used for writing the lost script
of Atlantis. The earth is finite and small, and there's no place for
large writing systems to hide from the eagle eyes of the Roadmappers.

It's true that Unicode fairly recently expanded to incorporate planes 1
and 2, but the *need* to do so has been foreseen for more than a decade.
Plane 3 may also be pressed into service as more Han characters are
(quite literally) dug up.

> Don't count on anything.

Tell me (as the philosopher Carnap said to his younger colleague
Smullyan), have you bought yourself an extra glove just in case you wake
up one morning with a third arm?

> Even if Unicode stops at 10FFFF, there may be
> other, future standards, of which Unicode is but a subset.

Then they will not be character encoding standards. There are plenty
of integers to go around.

> I'm sure the
> designers of ASCII thought it was amply large enough at the time.

They knew perfectly well that it was a compromise between expressiveness
and concision in a world of 110-baud transmission lines and computers
more than a thousand times slower than today's desktop machines.

> It's a simple enough rule - never hard-code limitations into your design
> if you don't have to. You may one day live to regret it. (Or you may not
> ... but no-one will ever criticise you for erring on the side of safety).

The U.S. budget is measured in trillions of dollars, but we can fairly
well exclude from our systems the possibilities that it will be measured
in trillion trillions some day. The Y2K bug was a serious concern
(motivated by storage costs something like 10^5 times the cost of
today's); the Y10K bug is not. The exhaustion of the 32-bit IPv4 space
is a reasonable concern, and was known to be so the day it was introduced;
the 128-bit IPv6 space is not subject to that concern.

--
"But I am the real Strider, fortunately," John Cowan
he said, looking down at them with his face ***@reutershealth.com
softened by a sudden smile. "I am Aragorn son http://www.ccil.org/~cowan
of Arathorn, and if by life or death I can http://www.reutershealth.com
save you, I will." --LotR Book I Chapter 10


Rick McGowan
2003-10-16 18:07:16 UTC
Permalink
John Cowan suggested:

> The earth is finite and small, and there's no place for
> large writing systems to hide from the eagle eyes of the Roadmappers.

Central Asia.

;-)
Rick



Michael Everson
2003-10-16 19:18:26 UTC
Permalink
At 11:07 -0700 2003-10-16, Rick McGowan wrote:
>John Cowan suggested:
>
>> The earth is finite and small, and there's no place for
>> large writing systems to hide from the eagle eyes of the Roadmappers.
>
>Central Asia.

Our eyes are everywhere.
--
Michael Everson * * Everson Typography * * http://www.evertype.com


Peter Kirk
2003-10-16 21:15:20 UTC
Permalink
On 16/10/2003 11:07, Rick McGowan wrote:

>John Cowan suggested:
>
>
>
>>The earth is finite and small, and there's no place for
>>large writing systems to hide from the eagle eyes of the Roadmappers.
>>
>>
>
>Central Asia.
>
>;-)
> Rick
No, Michael has been there!

--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




John Cowan
2003-10-16 20:05:00 UTC
Permalink
Asmus Freytag scripsit:


> I disagree.

I knew some German or Hungarian was going to slug me over this one.
The Hungarian inflation of 1944-46 was something like 10^29, even worse
than the German one. The last six months, January-July 1946, can
be characterized by a *double exponential* growth rate of about e^e^2.7
per month!

> I'm sure no German living in 1919 ever expected such rates. Since we have
> had the benefit of knowing that this was possible at least once, our
> threshold for the truly 'unexpected' should be set even higher.

As a matter of sober fact rather than national hubris, hyperinflation of
the U.S. dollar would be an event that would stagger, if not overthrow,
civilization. Computer systems would be the least of the casualties.

--
And through this revolting graveyard of the universe the muffled, maddening
beating of drums, and thin, monotonous whine of blasphemous flutes from
inconceivable, unlighted chambers beyond Time; the detestable pounding
and piping whereunto dance slowly, awkwardly, and absurdly the gigantic
tenebrous ultimate gods -- the blind, voiceless, mindless gargoyles whose soul
is Nyarlathotep. (Lovecraft) John Cowan|***@reutershealth.com|ccil.org/~cowan


Rick McGowan
2003-10-16 18:50:22 UTC
Permalink
Before everyone goes jumping off the deep end with wanting to reserve more
space on the BMP for hyper extended surrogates or whatever, can someone
please come up with more than 1 million things that need to be encoded?

Our best estimate, for all of human history, comes in around 250,000. Even
if we included, as characters, lots of stuff that is easily unified with
existing characters, or undeciphered, or just more dingbatty blorts, it
comes up nowhere near a million.

What you see on the roadmap is what we, in over 12 years of searching,
have been able to find. I challenge anyone to come up with enough
legitimate characters (approximately a million of them) that aren't on the
roadmap to fill the 17 planes.

Thanks.
Rick


Addison Phillips [wM]
2003-10-16 20:12:26 UTC
Permalink
Here's a proposed solution then. I hereby submit it for use on that
incredibly distant day on which our oracle fails and a new million-code-point
script is added to Unicode (i.e. never).

When all of the planes less than 16 are full and the possibility of
exhausting code points becomes actually apparent (but not before), the UTC
should reserve a range of code points in plane 16 to serve as "astral low
surrogates" and another to serve as "astral high surrogates". UTF-16 can then
use a pair of surrogate pairs to address the higher planes thereby exposed.
And we won't all have to muck with our implementations to support this
stuff.
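The arithmetic of such a scheme can be sketched. The ranges below are entirely hypothetical (no such assignments exist in Unicode); they are chosen only to show that two 1,024-code-point ranges at the top of plane 16 would address a further 1,024 * 1,024 = 1,048,576 code points, mirroring how ordinary surrogate pairs address the supplementary planes today.

```python
# Hypothetical "astral surrogate" ranges -- NOT assigned in Unicode,
# purely to illustrate the proposal's arithmetic.
ASTRAL_HI_BASE = 0x10F800   # hypothetical astral high surrogates (1,024)
ASTRAL_LO_BASE = 0x10FC00   # hypothetical astral low surrogates (1,024)
EXT_BASE = 0x110000         # first code point beyond the 17 planes

def encode_astral(cp):
    """Split an extended code point into an (astral high, astral low) pair."""
    offset = cp - EXT_BASE
    return (ASTRAL_HI_BASE + (offset >> 10),
            ASTRAL_LO_BASE + (offset & 0x3FF))

def decode_astral(hi, lo):
    """Recombine an astral surrogate pair into an extended code point."""
    return EXT_BASE + ((hi - ASTRAL_HI_BASE) << 10) + (lo - ASTRAL_LO_BASE)
```

In UTF-16 each astral surrogate would itself be written as an ordinary surrogate pair, so an extended code point would occupy four 16-bit units without changing the encoding rules for existing text.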

Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility

432 Lakeside Drive, Sunnyvale, CA, USA
+1 408.962.5487 (office) +1 408.210.3569 (mobile)
mailto:***@webmethods.com

Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International/ws

Internationalization is an architecture.
It is not a feature.

> -----Original Message-----
> From: unicode-***@unicode.org [mailto:unicode-***@unicode.org]On
> Behalf Of Rick McGowan
> Sent: Thursday, October 16, 2003 12:50 PM
> To: ***@unicode.org
> Subject: Re: Beyond 17 planes, was: Java char and Unicode 3.0+
>
>
> Before everyone goes jumping off the deep end with wanting to
> reserve more
> space on the BMP for hyper extended surrogates or whatever, can someone
> please come up with more than 1 million things that need to be encoded?
>
> Our best estimate, for all of human history, comes in around
> 250,000. Even
> if we included, as characters, lots of stuff that is easily unified with
> existing characters, or undeciphered, or just more dingbatty blorts, it
> comes up nowhere near a million.
>
> What you see on the roadmap is what we, in over 12 years of searching,
> have been able to find. I challenge anyone to come up with enough
> legitimate characters (approximately a million of them) that
> aren't on the
> roadmap to fill the 17 planes.
>
> Thanks.
> Rick



Philippe Verdy
2003-10-16 22:20:31 UTC
Permalink
From: "Addison Phillips [wM]" <***@webmethods.com>

> Here's a proposed solution then. I hereby submit it for use on that
> incredibly distant day on which our oracle fails and a new million-code-point
> script is added to Unicode (i.e. never).
>
> When all of the planes less than 16 are full and the possibility of
> exhausting code points becomes actually apparent (but not before), the UTC
> should reserve a range of code points in plane 16 to serve as "astral low
> surrogates" and another to serve as "astral high surrogates". UTF-16 can then
> use a pair of surrogate pairs to address the higher planes thereby exposed.
> And we won't all have to muck with our implementations to support this
> stuff.

Too late for plane 16: it's currently assigned to PUAs...
Same thing for plane 15.

But such extension space is certainly available in the special
(spacial? astral? ;-)) plane 14 ...

Which could then be reserved for "hyper-surrogates", referencing
codepoints outside the first 17 planes, and assigned in an open
registry for interchangeable semi-private uses, such as corporate
logographs and other visual trademarks (including the famous
Apple logo character in the MacRoman encoding, or the extra
PUAs needed by Microsoft in its OpenType fonts for Office...)



Michael Everson
2003-10-16 22:35:40 UTC
Permalink
At 00:20 +0200 2003-10-17, Philippe Verdy wrote:

>Which could then be reserved for "hyper-surrogates",

We don't need them.

>referencing codepoints out of the 17 first planes, and assigned in a open
>registry for interchangeable semi-private uses,

There is no such thing.

>such as corporate logographs and other visual trademarks (including
>the famous Apple logo character in the MacRoman encoding, or the
>extra PUAs needed by Microsoft in its OpenType fonts for Office...)

No.
--
Michael Everson * * Everson Typography * * http://www.evertype.com


Peter Kirk
2003-10-16 23:50:38 UTC
Permalink
On 16/10/2003 15:20, Philippe Verdy wrote:

> ...
>
> the famous
>Apple logo character in the MacRoman encoding, ...
>
Another issue which goes to the core of Unicode? :-)

--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




Addison Phillips [wM]
2003-10-17 00:11:44 UTC
Permalink
Maybe I should have said it the other way first: yuck. I don't want any
characters beyond the planes ever. Nor a change in encoding rules.

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility

432 Lakeside Drive, Sunnyvale, CA, USA
+1 408.962.5487 (office) +1 408.210.3569 (mobile)
mailto:***@webmethods.com

Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International/ws

Internationalization is an architecture.
It is not a feature.

> -----Original Message-----
> From: Philippe Verdy [mailto:***@wanadoo.fr]
> Sent: Thursday, October 16, 2003 4:21 PM
> To: ***@webmethods.com
> Cc: ***@unicode.org
> Subject: Re: Beyond 17 planes, was: Java char and Unicode 3.0+
>
>
> From: "Addison Phillips [wM]" <***@webmethods.com>
>
> > Here's a proposed solution then. I hereby submit it for use on that
> > incredibly distant day in which our oracle fails and a new 1 million
> > code point script is added to Unicode (e.g. never).
> >
> > When all of the planes less than 16 are full and the possibility of
> > exhausting code points becomes actually apparent (but not before), the
> > UTC should reserve a range of code points in plane 16 to serve as
> > "astral low surrogates" and another to serve as "astral high
> > surrogates". UTF-16 can then use a pair of surrogate pairs to address
> > the higher planes thereby exposed. And we won't all have to muck with
> > our implementations to support this stuff.
>
> Too late for plane 16: it's currently assigned to PUAs...
> Same thing for plane 15.
>
> But such extension space is certainly available in the special
> (spacial? astral? ;-)) plane 14 ...
>
> Which could then be reserved for "hyper-surrogates", referencing
> codepoints out of the 17 first planes, and assigned in a open
> registry for interchangeable semi-private uses, such as corporate
> logographs and other visual trademarks (including the famous
> Apple logo character in the MacRoman encoding, or the extra
> PUAs needed by Microsoft in its OpenType fonts for Office...)



Michael Everson
2003-10-16 20:31:41 UTC
Permalink
At 11:50 -0700 2003-10-16, Rick McGowan wrote:

>What you see on the roadmap is what we, in over 12 years of
>searching, have been able to find. I challenge anyone to come up
>with enough legitimate characters (approximately a million of them)
>that aren't on the roadmap to fill the 17 planes.

Thanks, Rick, for not telling them all about all those secret
characters that we have withheld from the Roadmaps.... ;-)
--
Michael Everson * * Everson Typography * * http://www.evertype.com


Rick McGowan
2003-10-16 18:52:24 UTC
Permalink
Philippe Verdy wrote:

> It's true that there is no plan in Unicode to encode something
> else than plain text for existing or future actual scripts. But
> ISO10646 objectives are to also to offer support and integrate
> almost all other related ISO specifications that may need a
> unified codepoint space for encoding either plain text or their
> own objects.

Interesting. I've never heard this. Please point to a document that states
such objectives for WG2.

Rick


Philippe Verdy
2003-10-16 20:16:01 UTC
Permalink
----- Original Message -----
From: "Rick McGowan" <***@unicode.org>
To: <***@unicode.org>
Sent: Thursday, October 16, 2003 8:52 PM
Subject: Re: Beyond 17 planes, was: Java char and Unicode 3.0+


> Philippe Verdy wrote:
>
> > It's true that there is no plan in Unicode to encode something
> > else than plain text for existing or future actual scripts. But
> > ISO10646 objectives are to also to offer support and integrate
> > almost all other related ISO specifications that may need a
> > unified codepoint space for encoding either plain text or their
> > own objects.
>
> Interesting. I've never heard this. Please point to a document that states
> such objectives for WG2.

I'm not quoting exactly their sentences. This is just the general idea
behind ISO, which is to produce a coherent set of standards that should
be accepted and followed by governments, industries and people
developing interchangeable products or services.

Standards should always be designed with the idea of integrating well
with other standards, without introducing contradictory objectives.

It's true that none of the ISO standards are mandatory, but they
are generally accepted and implemented as well as possible within
some economic limits. It's also true that there are contradictions
among the standards supported by ISO, and they are amended,
obsoleted or replaced when needed. But new standards are constantly
added and developed, and neither Unicode's UTC nor ISO's WG2 will
be able to avoid that.

More recently, the compatibility issues between the Unicode
standard and W3C's XML have received some focus. They are still
challenging, not all resolved, and much discussed: it remains to be
documented somewhere which conflicting rule must apply first and how
the other standard is affected or limited in its applications.

By saying "one standard fits all", you seem to claim that all
interchanges of information are designed and assumed to be
global, forgetting that interchanges are much more complex and
occur in a mesh of partially related groups. To get a standard
usable and unified worldwide is a long task, and most interchanges
can't wait for that delay, as they don't need to be global to be
economically viable. What is really important is the mutual agreement
between the involved parties, which is much easier to reach rapidly.
Even in that case, such agreements do not need to be permanent
(infinite), as they can be amended by the interested people
themselves. Happily, this allows innovation and creation and a rapid
growth of mutual exchanges.



Michael Everson
2003-10-16 20:47:00 UTC
Permalink
At 22:16 +0200 2003-10-16, Philippe Verdy wrote:

>I'm not quoting exactly their sentences. This is just the general idea
>behind ISO which is to produce a coherent set of standards that should
>be accepted and followed by governments, industries and people
>developping interchangeable products or services.

*That* is a very different thing from what you said:

> >> ISO10646 objectives are to also to offer support and integrate
> >> almost all other related ISO specifications that may need a
> >> unified codepoint space for encoding either plain text or their
> >> own objects.

There is no truth whatsoever in your supposition that ISO/IEC 10646's
objectives have anything to do with "integrating" other
specifications. ISO/IEC 10646's objectives are to provide an
architecture and character set.
--
Michael Everson * * Everson Typography * * http://www.evertype.com


Asmus Freytag
2003-10-16 21:24:38 UTC
Permalink
At 10:16 PM 10/16/03 +0200, Philippe Verdy wrote:
>Standards should always be designed with the idea of integrating well
>with other standards, without introducing contradictory objectives.

This is what Americans call "motherhood and apple pie" - feel-good
statements that are lofty but do nothing to resolve the real world's real
conflicts, including those of standards with conflicting objectives and
targets.

A./


Peter Constable
2003-10-16 19:38:58 UTC
Permalink
> -----Original Message-----
> From: unicode-***@unicode.org [mailto:unicode-***@unicode.org] On
> Behalf Of Asmus Freytag


> >>Canonical equivalence must be taken into account in rendering multiple
> >>accents, so that any two canonically equivalent sequences display as
> >>the same.
>
> This statement goes to the core of Unicode. If it is followed, it
> guarantees that normalizing a string does not change its appearance (and
> therefore it remains the 'same' string as far as the user is concerned.)

I agree in principle. There are two ways in which the philosophy behind
this breaks down in real life, though:

1. There are cases of combining marks given a class of 0, meaning that
combinations of marks in different positions relative to the base will
be visually indistinguishable, but the encoded representations are not
the same, and not canonically equivalent. E.g. (taken from someone else
on the Indic list) Devanagari ka + i + u vs. ka + u + i.
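[Point 1 is easy to check with Python's unicodedata, as a sketch independent of any rendering stack: both Devanagari vowel signs carry combining class 0, so the Canonical Ordering Algorithm never reorders them and the two typing orders stay distinct after normalization.]

```python
import unicodedata

# DEVANAGARI LETTER KA, VOWEL SIGN I, VOWEL SIGN U
ka, sign_i, sign_u = "\u0915", "\u093F", "\u0941"

# Both marks have canonical combining class 0...
print(unicodedata.combining(sign_i), unicodedata.combining(sign_u))  # 0 0

# ...so the two orders remain distinct even after normalization.
s1 = unicodedata.normalize("NFD", ka + sign_i + sign_u)
s2 = unicodedata.normalize("NFD", ka + sign_u + sign_i)
print(s1 == s2)  # False
```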

2. Relying on normalization, and specifically canonical ordering, to
happen in a rendering engine IS liable to be a noticeable performance
issue. I suggest that whoever wrote

> Rendering systems should handle any of the canonically equivalent
> orders of combining marks. This is not a performance issue: The amount
> of time necessary to reorder combining marks is insignificant compared
> to the time necessary to carry out other work required for rendering.

was not speaking from experience.
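[For reference, the reordering step under discussion is small in code terms, whatever its runtime cost inside a real engine. A minimal sketch of the Canonical Ordering Algorithm -- a stable sort of each run of nonzero-class marks -- checked here against Python's unicodedata; this is an illustration, not any engine's actual implementation.]

```python
import unicodedata

def canonical_reorder(s: str) -> str:
    """Stable-sort each run of combining marks (ccc > 0) by combining
    class -- the Canonical Ordering Algorithm used by normalization."""
    out, run = [], []
    for ch in s:
        if unicodedata.combining(ch) > 0:
            run.append(ch)
        else:
            run.sort(key=unicodedata.combining)  # Python's sort is stable
            out.extend(run)
            out.append(ch)
            run = []
    run.sort(key=unicodedata.combining)
    out.extend(run)
    return "".join(out)

# Hebrew bet + dagesh (ccc 21) + patah (ccc 17), a typical typing order:
typed = "\u05D1\u05BC\u05B7"
print(canonical_reorder(typed) == unicodedata.normalize("NFD", typed))  # True
```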



> The interesting digressions on string libraries aside, the statement made
> here is in the context of the tasks needed for rendering. If you take a
> rendering library and add a normalization pass on the front of it, you'll
> be hard-pressed to notice a difference in performance, especially for any
> complex scripts.

If what is normalized is the backing store. If what is normalized is a
string at an intermediate stage in the rendering process, then this is
not the case. The reason is the number of times text-rendering APIs get
called. As you mention,

> However, from the other messages on this thread we conclude: normalizing
> *every* string, *every time* it gets touched, *is* a performance issue.



Peter

Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division




Peter Kirk
2003-10-16 22:29:44 UTC
Permalink
On 16/10/2003 12:38, Peter Constable wrote:

>>-----Original Message-----
>>From: unicode-***@unicode.org [mailto:unicode-***@unicode.org]
>>On Behalf Of Asmus Freytag
>>
>>>>Canonical equivalence must be taken into account in rendering multiple
>>>>accents, so that any two canonically equivalent sequences display as
>>>>the same.
>>
>>This statement goes to the core of Unicode. If it is followed, it
>>guarantees that normalizing a string does not change its appearance (and
>>therefore it remains the 'same' string as far as the user is concerned.)
>
>I agree in principle. There are two ways in which the philosophy behind
>this breaks down in real life, though:
>
>1. There are cases of combining marks given a class of 0, meaning that
>combinations of marks in different positions relative to the base will
>be visually indistinguishable, but the encoded representations are not
>the same, and not canonically equivalent. E.g. (taken from someone else
>on the Indic list) Devanagari ka + i + u vs. ka + u + i.
>
>
As we are talking about rendering rather than operations on the backing
store, this is actually irrelevant. If two sequences are visually
indistinguishable (with the particular font in use), a rendering engine
can safely map them together even if they are not canonically
equivalent, as long as the backing store is unchanged.

>2. Relying on normalization, and specifically canonical ordering, to
>happen in a rendering engine IS liable to be a noticeable performance
>issue. I suggest that whoever wrote
>
>>Rendering systems should handle any of the canonically equivalent
>>orders of combining marks. This is not a performance issue: The amount
>>of time necessary to reorder combining marks is insignificant compared
>>to the time necessary to carry out other work required for rendering.
>
>was not speaking from experience.
>
>
>
I wonder if anyone involved in this is speaking from real experience.
Peter, I don't think your old company actually tried to implement such
reordering; Sharon tells me that the idea was suggested, but rejected
for reasons unrelated to performance. I have heard that your new company
has tried it and has claimed that for Hebrew the performance hit is
unacceptable. I am still sceptical of this claim. Presumably this was
done by adding a reordering step to an existing rendering engine. But
was this reordering properly optimised in binary code, or was it just
bolted on to an unsuitable architecture using a high level language
designed for the different purpose of glyph level reordering?

Also, as I just pointed out in a separate posting, there should be no
performance hit for unpointed modern Hebrew as there are no combining
marks to be reordered. The relatively few users of pointed Hebrew would
prefer to see their text rendered correctly if a little slowly rather
than quickly but incorrectly.
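[That observation suggests an obvious cheap guard (a sketch; the function name is mine, not any shipping API): scan once for an out-of-order pair of nonzero combining classes, and skip the reorder pass entirely when, as in unpointed Hebrew, there is none.]

```python
import unicodedata

def needs_reorder(s: str) -> bool:
    """True only if some mark (ccc > 0) follows a mark with a higher
    combining class, i.e. the string is not in canonical order.
    Unpointed text never triggers this: it contains no marks at all."""
    prev = 0
    for ch in s:
        ccc = unicodedata.combining(ch)
        if 0 < ccc < prev:
            return True
        prev = ccc
    return False

print(needs_reorder("\u05E9\u05DC\u05D5\u05DD"))  # unpointed shalom: False
print(needs_reorder("\u05D1\u05BC\u05B7"))        # dagesh before patah: True
```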

If, as you agree in principle, this is an issue which goes to the core
of Unicode, should you not be prepared to take some small performance
hit in order to conform properly to the architecture?

> ...
>
>If what is normalized is the backing store. If what is normalized is a
>string at an intermediate stage in the rendering process, then this is
>not the case. The reason is the number of times text-rendering APIs get
>called. ...
>
If it is unavoidable to call the same routine (for sorting or any other
purpose) multiple times with the same data, the results can be cached so
that they do not have to be recalculated each time.
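[As a sketch of that idea (the wrapper name is hypothetical, not an actual engine API), a memoizing layer in the calling code can make repeated normalization of the same runs nearly free:]

```python
from functools import lru_cache
import unicodedata

@lru_cache(maxsize=4096)
def normalized_run(run: str) -> str:
    # The expensive step -- in a real engine this might be the whole
    # normalize-and-reorder pass for a run of text.
    return unicodedata.normalize("NFC", run)

normalized_run("\u05D1\u05BC\u05B7")     # computed once
normalized_run("\u05D1\u05BC\u05B7")     # second call served from the cache
print(normalized_run.cache_info().hits)  # 1
```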


--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




Peter Constable
2003-10-16 19:42:19 UTC
Permalink
> -----Original Message-----
> From: unicode-***@unicode.org [mailto:unicode-***@unicode.org] On
> Behalf Of Peter Kirk

> Thank you. This is the clarification I was looking for, and confirms my
> own suspicions. But are there any other views on this? I have heard
> them from implementers of rendering systems. But I have wondered if this
> is because of their reluctance to do the extra work required to conform
> to this requirement.

This isn't something that can be fixed in rendering systems. It wouldn't
be hard to do; it's just too much of a performance issue. It has to be
addressed by the software calling the rendering system.


Peter

Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division



Peter Kirk
2003-10-16 22:57:33 UTC
Permalink
On 16/10/2003 12:42, Peter Constable wrote:

>>-----Original Message-----
>>From: unicode-***@unicode.org [mailto:unicode-***@unicode.org]
>>On Behalf Of Peter Kirk
>>
>>Thank you. This is the clarification I was looking for, and confirms my
>>own suspicions. But are there any other views on this? I have heard
>>them from implementers of rendering systems. But I have wondered if this
>>is because of their reluctance to do the extra work required to conform
>>to this requirement.
>
>This isn't something that can be fixed in rendering systems. It wouldn't
>be hard to do; it's just too much of a performance issue. It has to be
>addressed by the software calling the rendering system.
>
>
>Peter
>
>Peter Constable
>Globalization Infrastructure and Font Technologies
>Microsoft Windows Division
>
So, you seem to be suggesting that all applications, and the system
libraries which they generally use for character handling, should be
rewritten so that data is transformed into and stored in a particular
well-defined order. Presumably that order would be one of the Unicode
normalisation forms; most likely NFC as that matches the XML
recommendation. Conceivably it could be a different privately defined
form, e.g. based on a different set of combining classes (cf. the
permuted combining class weights of TUS Table 5-6) chosen to get round
some of the well known problems with the standardised combining classes.
Fonts would be required to display correctly only this one clearly
defined order.
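[A minimal sketch of that division of labour (all names here are illustrative only): normalize once at the point where text enters the backing store, so everything downstream, fonts included, can assume one well-defined order.]

```python
import unicodedata

def store_text(buffer, incoming: str) -> None:
    """Hypothetical input path: normalize once at the edit boundary so
    the backing store only ever holds NFC."""
    buffer.append(unicodedata.normalize("NFC", incoming))

doc = []
store_text(doc, "\u05D1\u05BC\u05B7")  # typed order: bet, dagesh, patah
print(doc[0] == "\u05D1\u05B7\u05BC")  # stored in canonical order: True
```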

It seems to me that this is a viable alternative approach to the
canonical equivalence issue, either globally, or within a particular
company's system architecture provided that there is proper support
within the system libraries and from the system fonts. It might lead to
increased overall efficiency, although with some danger of chaos in the
interim period. It is not the approach which Unicode has recommended in
its implementation guidelines, although I suppose that recommendation
could be changed. I wonder if it is the approach which has been agreed
and will be implemented across the board by any one company. It is
certainly not viable for a rendering group to unilaterally pass to an
applications group, or to a system libraries group, responsibility for
such an important matter ("goes to the core of Unicode") which the
Unicode standard currently clearly states to be a rendering issue.

--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




John Hudson
2003-10-16 20:14:42 UTC
Permalink
At 12:42 PM 10/16/2003, Peter Constable wrote:

>This isn't something that can be fixed in rendering systems. It wouldn't
>be hard to do; it's just too much of a performance issue. It has to be
>addressed by the software calling the rendering system.

This interests me, because it addresses the concern about the impact of
performance issues on users who do not benefit from what is being
performed. If, for example, re-ordering of normalised text into
font-renderable text takes place in the generic rendering engine, everyone
is hit by the same performance impact, regardless of whether the software
or the text they are working with requires the additional processing. If
the re-ordering of normalised text happens before calling the rendering
engine, within specific software applications, then it only needs to be
done when required.

Is that a reasonable interpretation of your comment, Peter?

John Hudson

Tiro Typeworks www.tiro.com
Vancouver, BC ***@tiro.com

I sometimes think that good readers are as singular,
and as awesome, as great authors themselves.
- JL Borges



Peter Constable
2003-10-16 23:13:25 UTC
Permalink
> -----Original Message-----
> From: Peter Kirk [mailto:***@qaya.org]

> I wonder if anyone involved in this is speaking from real experience.
> Peter, I don't think your old company actually tried to implement such
> reordering

No, but my new company has.

> I have heard that your new company
> has tried it and has claimed that for Hebrew the performance hit is
> unacceptable. I am still sceptical of this claim.

Well, you're more than welcome to create an implementation that
demonstrates otherwise and share it with us :-).



> Presumably this was
> done by adding a reordering step to an existing rendering engine. But
> was this reordering properly optimised in binary code, or was it just
> bolted on to an unsuitable architecture using a high level language
> designed for the different purpose of glyph level reordering?

I don't know the details; I suspect it was done within a finite-state
machine.


> If, as you agree in principle, this is an issue which goes to the core
> of Unicode, should you not be prepared to take some small performance
> hit in order to conform properly to the architecture?

There's more to life in the real world than conformance to a theoretical
principle, and unfortunately most of us live with constraints as a
result.


> If it is unavoidable to call the same routine (for sorting or any other
> purpose) multiple times with the same data, the results can be cached so
> that they do not have to be recalculated each time.

That's the kind of thing that would be up to a calling application, not
the rendering engine.


Peter

Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division



Peter Kirk
2003-10-17 00:00:40 UTC
Permalink
On 16/10/2003 16:13, Peter Constable wrote:

> ...
>
>>I have heard that your new company
>>has tried it and has claimed that for Hebrew the performance hit is
>>unacceptable. I am still sceptical of this claim.
>>
>>
>
>Well, you're more than welcome to create an implementation that
>demonstrates otherwise and share it with us :-).
>
>
>
I am already in discussions about getting this added to Graphite (see
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=RenderingGraphite),
an open source rendering engine which you know well. You are free to
incorporate it into your company's products.

>
>
>
>>Presumably this was
>>done by adding a reordering step to an existing rendering engine. But
>>was this reordering properly optimised in binary code, or was it just
>>bolted on to an unsuitable architecture using a high level language
>>designed for the different purpose of glyph level reordering?
>>
>>
>
>I don't know the details; I suspect it was done within a finite-state
>machine.
>
>
Doesn't sound to me like an efficient way to do a sort.

--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




Peter Constable
2003-10-17 09:01:53 UTC
Permalink
> -----Original Message-----
> From: unicode-***@unicode.org [mailto:unicode-***@unicode.org] On
> Behalf Of Addison Phillips [wM]

> When all of the planes less than 16 are full and the possibility of
> exhausting code points becomes actually apparent (but not before), the
> UTC should reserve a range of code points in plane 16 to serve as
> "astral low surrogates" and another to serve as "astral high
> surrogates". UTF-16 can then use a pair of surrogate pairs to address
> the higher planes thereby exposed. And we won't all have to muck with
> our implementations to support this stuff.

I believe I suggested pretty much the same thing when the very same
topic was discussed (to an equally fruitless end) about 5 years ago.
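[The capacity arithmetic behind the proposal is easy to check; the block sizes below are assumptions that simply mirror the existing surrogate ranges, not anything the UTC has reserved.]

```python
# Ordinary UTF-16: 1,024 high surrogates x 1,024 low surrogates address
# exactly the 16 supplementary planes. Two reserved "astral" blocks of
# the same size, used as a pair of pairs, would add the same amount again.
BLOCK = 0x400
pairs = BLOCK * BLOCK
print(pairs)             # 1048576 code points
print(pairs // 0x10000)  # 16 planes of 65,536
```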


Peter

Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division



Peter Constable
2003-10-17 09:13:27 UTC
Permalink
> -----Original Message-----
> From: unicode-***@unicode.org [mailto:unicode-***@unicode.org]
> On Behalf Of Philippe Verdy

> or the extra
> PUAs needed by Microsoft in its OpenType fonts for Office...)

(sigh)

OpenType fonts are not only for Office. PUAs are neither needed nor used
by Microsoft for OpenType implementations -- for Office or anything else.
(They have been used by MS in *non-OT* solutions to various problems.)



Peter

Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division



Jill Ramonsky
2003-10-17 10:36:40 UTC
Permalink
> -----Original Message-----
> From: Rick McGowan [mailto:***@unicode.org]
> Sent: Thursday, October 16, 2003 7:50 PM
> To: ***@unicode.org
> Subject: Re: Beyond 17 planes, was: Java char and Unicode 3.0+
>
>
> Before everyone goes jumping off the deep end with wanting to reserve
> more space on the BMP for hyper extended surrogates or whatever, can
> someone please come up with more than 1 million things that need to be
> encoded?


Every script that ever got rejected by the Unicode Consortium.

If we had an infinite codespace, we wouldn't /need/ a private use area.
If a private citizen said "I want some codepoints for some symbols I
just invented for my pet fish" then there would be no problem saying
"Sure thing, dude, you can have all the codepoints between
U+486766107743000 and U+486766107743FFF."

In such a system no application need ever be rejected, for any reason.
Inclusion would be automatic for every submission.

The curious thing is, the codespace wouldn't even need to be THAT BIG.
Even if we assigned 10000 unique symbols to every person currently alive
on the planet, you still wouldn't need more than 48 bits. We may then
consider the codepoints U+000000000000 to U+00000011FFFF as "already
assigned" (to the Unicode Consortium). The rest of the codespace would
be akin to a very large "private use area" - except that it could be
managed without a single conflict.
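Jill's 48-bit figure checks out. Taking a rough 2003 world population of about 6.3 billion (an assumed round number, not from the original message), the arithmetic is:

```python
import math

# Rough 2003 world population -- an assumption for illustration.
world_population = 6_300_000_000
symbols_per_person = 10_000  # figure from the message above

total_code_points = world_population * symbols_per_person
bits_needed = math.ceil(math.log2(total_code_points))
print(bits_needed)  # 46 -- comfortably inside a 48-bit codespace
```

That is, 6.3e13 code points fit in 46 bits, leaving headroom even before reclaiming the vast majority that would never actually be assigned.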

Jill
Doug Ewell
2003-10-17 15:38:14 UTC
Permalink
Jill Ramonsky wrote:

> If we had an infinite codespace, we wouldn't need a private use area.
> If a private citizen said "I want some codepoints for some symbols I
> just invented for my pet fish" then there would be no problem saying
> "Sure thing, dude, you can have all the codepoints between
> U+486766107743000 and U+486766107743FFF."

Please take a look at the following:
http://users.adelphia.net/~dewell/ewellic.html
http://www.evertype.com/standards/csur/ewellic.html

These pages describe a script I invented 23 years ago, before there was
a Unicode. The script turns out to have some features that fit neatly
into the Unicode model, such as combining marks and ligation, so it
seemed like a neat thing to propose for the CSUR.

But it does NOT, under any circumstances, belong in Unicode proper. It
has ONE USER that I am aware of, in all of history -- the one writing
this message -- plus two other one-time users who have sent me e-mails
in it. That does NOT justify adding it, along with all the other "pet
fish" symbols and TAFKAP glyphs and things other high school kids have
dreamed up, to a character encoding standard that will be implemented
and used worldwide.

For example, if you want to render my script correctly, you need to
perform mandatory ligation and honor certain requests for optional
ligation. You also need to position the combining mark correctly, over
the vowel portion of a VC or CV ligature, not over the center of the
ligature.

Should these details be added to the corpus of a universal standard, so
that users and implementers worldwide have to read it (or skip over it),
and commercial rendering systems be asked to implement this rendering
behavior, when there are more fish in my tank than worldwide users of
the script? Absolutely not. I mean, sure, it'd be NICE, but there are
5 billion of us. Let's get our priorities straight.

The keepers of CSUR, John Cowan and Michael Everson, who by that very
fact have shown above-average interest in the encoding of Klingon and
Zírí:nka and Ewellic, have already explained why these scripts don't
belong in Unicode. 21-bit space or 31-bit space is not the issue. My
script, if encoded, would occupy less than 0.006% of the space currently
available in Unicode.

> In such a system no application need ever be rejected, for any reason.
> Inclusion would be automatic for every submission.

Nobody would ever use such a standard.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/


