Unicode Normalisaton Optimisation Experiments

Discussion:

Jon Hanna

2003-09-24 12:58:15 UTC

Hi,
I'm currently experimenting with various trade-offs for Unicode normalisation code. Any comments on these (particularly of the "that's insane, here's why, stop now!" variety) would be welcome.

The first is an optimisation of speed over size. Rather than perform the decomposition as a recursive operation the necessary data is stored to do so in a single pass. For example rather than compute <U+212B> -> <U+00C5> -> <U+0041, U+030A> recursively one can store the data to compute <U+212B> -> <U+0041, U+030A>. This reduces the amount of work to decompose each character, and further benefits from the fact that if there is no trailing combining characters (that is if the next character is a starter) then no re-ordering is required.

The second is an optimisation of both speed and size, with the disadvantage that data cannot be shared between NFC and NFD operations (which is perhaps a reasonable trade in the case of web code which might only need NFC code to be linked). In this version decompositions of stable codepoints are ommitted from the decompositon data. For example since following the decomposition <U+0104> -> <U+0041, U+0328> there can be no character that is unblocked from the U+0041 that will combine with it, hence there is no circumstance in which they will not be recombined to U+0104 and hence dropping that decomposition from the data will not affect NFC (the relevant data would still have to be in the composition table, as the sequence <U+0041, U+0328> might occur in the source code).

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Markus Scherer

2003-09-24 22:55:18 UTC

Permalink

Post by Jon Hanna
Hi,
I'm currently experimenting with various trade-offs for Unicode normalisation code. Any comments on these (particularly of the "that's insane, here's why, stop now!" variety) would be welcome.

You might want to look at, if not even use, the ICU open-source implementation:

http://oss.software.ibm.com/icu/
http://oss.software.ibm.com/cvs/icu/~checkout~/icu/source/common/unorm.cpp

Post by Jon Hanna
The first is an optimisation of speed over size. Rather than perform the decomposition as a recursive operation the necessary data is stored to do so in a single pass. ...

I believe that this is a very common technique. Used in ICU.

Post by Jon Hanna
The second is an optimisation of both speed and size, with the disadvantage that data cannot be shared between NFC and NFD operations (which is perhaps a reasonable trade in the case of web code which might only need NFC code to be linked). In this version decompositions of stable codepoints are ommitted from the decompositon data. For example since following the decomposition <U+0104> -> <U+0041, U+0328> there can be no character that is unblocked from the U+0041 that will combine with it, hence there is no circumstance in which they will not be recombined to U+0104 and hence dropping that decomposition from the data will not affect NFC (the relevant data would still have to be in the composition table, as the sequence <U+0041, U+0328> might occur in the source code).

Sounds possible and clever. As far as I remember, ICU uses the normalization quick check flags
(Unicode properties) to determine much of this, and should achieve the same in most cases.

markus

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

j***@spin.ie

2003-09-25 08:59:52 UTC

Permalink

Post by Jon Hanna

Post by Jon Hanna
Hi,
I'm currently experimenting with various trade-offs for Unicode

normalisation code. Any comments on these (particularly of the "that's
insane, here's why, stop now!" variety) would be welcome.
You might want to look at, if not even use, the ICU open-source
http://oss.software.ibm.com/icu/
http://oss.software.ibm.com/cvs/icu/~checkout~/icu/source/common/unorm.cpp

I did, but when I started this I was more interested in simply comparing various optimisations as a study into the related techniques. However I recently hit a practical need for such code for another task, and while it's nice that I've a bunch of "work" code already done as "fun" code maybe I should just use ICU...

Post by Jon Hanna

Post by Jon Hanna
The second is an optimisation of both speed and size, with the

disadvantage that data cannot be shared between NFC and NFD operations (which
is perhaps a reasonable trade in the case of web code which might only need NFC
code to be linked). In this version decompositions of stable codepoints are
ommitted from the decompositon data. For example since following the
decomposition <U+0104> -> <U+0041, U+0328> there can be no
character that is unblocked from the U+0041 that will combine with it, hence
there is no circumstance in which they will not be recombined to U+0104 and
hence dropping that decomposition from the data will not affect NFC (the
relevant data would still have to be in the composition table, as the sequence
<U+0041, U+0328> might occur in the source code).
Sounds possible and clever. As far as I remember, ICU uses the normalization
quick check flags
(Unicode properties) to determine much of this, and should achieve the same in
most cases.

The above would supplement use of quick check - indeed it would be a way of implementing the concept of "stable codepoints" that the UTR suggests using with quick check.

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Peter Kirk

2003-09-25 10:08:47 UTC

Permalink

... For example since following the decomposition <U+0104> -> <U+0041, U+0328> there can be no character that is unblocked from the U+0041 that will combine with it, hence there is no circumstance in which they will not be recombined to U+0104 and hence dropping that decomposition from the data will not affect NFC (the relevant data would still have to be in the composition table, as the sequence <U+0041, U+0328> might occur in the source code).

Is this actually correct? For example, if I have in my data the string
<U+0104, U+05B0> (which I know is garbage, but that is irrelevant), that
will decompose and reorder to <U+0041, U+05B0, U+0328>, as U+05B0 has a
higher combining class (202) than U+05B0 (10). What does this become in
NFC? Is the reordering reversed and the combination reapplied?

This is not only a theoretical issue as the same applies to some real
combinations. There was discussion only last week on the bidi list of a
form which might be encoded <U+064A, U+0652, U+0654> but which would be
messed up if composed into <U+0626, U+0652>.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

j***@spin.ie

2003-09-25 10:27:06 UTC

Permalink

Post by Peter Kirk
Is this actually correct? For example, if I have in my data the string
<U+0104, U+05B0> (which I know is garbage, but that is irrelevant), that
will decompose and reorder to <U+0041, U+05B0, U+0328>, as U+05B0 has a
higher combining class (202) than U+05B0 (10). What does this become in
NFC? Is the reordering reversed and the combination reapplied?

First an attempt is made to compose U+0041 and U+05B0. There is no character allowing for this, so that attempt will fail. Then an attempt is made to compose U+0041 and U+0328 which will produce U+0104. U+0041 is replaced with U+0104 and U+0328 is removed resulting in <U+0104, U+05B0>.

It's not a reordering per se, as the first combining character is given the first "opportunity" to combine.

Post by Peter Kirk
This is not only a theoretical issue as the same applies to some real
combinations. There was discussion only last week on the bidi list of a
form which might be encoded <U+064A, U+0652, U+0654> but which would be
messed up if composed into <U+0626, U+0652>.

Yes, NFC would perform that composition. Are you sure it would be an issue? Applying bidi rules doesn't seem to make this an issue.
<U+064A, U+0652, U+0654>
bidi: Al, NSM, NSM
applying rule W1 from USA9:
Al, NSM, NSM -> Al, Al, NSM -> Al, Al, Al.

<U+0626, U+0652>
bidi: Al, NSM
applying rule W1:
Al, NSM -> Al, Al

Or is the issue with something else, but it came up on the bidi list?

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Peter Kirk

2003-09-25 12:53:06 UTC

Permalink

Post by j***@spin.ie

First an attempt is made to compose U+0041 and U+05B0. There is no character allowing for this, so that attempt will fail. Then an attempt is made to compose U+0041 and U+0328 which will produce U+0104. U+0041 is replaced with U+0104 and U+0328 is removed resulting in <U+0104, U+05B0>.
It's not a reordering per se, as the first combining character is given the first "opportunity" to combine.

Thanks for the clarification.

Post by j***@spin.ie

Yes, NFC would perform that composition. Are you sure it would be an issue? Applying bidi rules doesn't seem to make this an issue.
<U+064A, U+0652, U+0654>
bidi: Al, NSM, NSM
Al, NSM, NSM -> Al, Al, NSM -> Al, Al, Al.
<U+0626, U+0652>
bidi: Al, NSM
Al, NSM -> Al, Al
Or is the issue with something else, but it came up on the bidi list?

The problem isn't with the bidi rules but with more general Arabic
shaping etc. There are two issues, one the position of the hamza (in
this case it should be to the left of the sukun) and the other that the
medial form of U+064A has dots below, which are required in this
combination, but the medial form of U+0626 does not. But I think we
concluded that U+0654 alone is not suitable for encoding this particular
hamza.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Markus Scherer

2003-09-25 21:25:21 UTC

Permalink

Post by Peter Kirk

Post by j***@spin.ie
It's not a reordering per se, as the first combining character is
given the first "opportunity" to combine.

Thanks for the clarification.

In other words, yes, Unicode's NFC does perform "discontiguous composition". Some things might be
easier if only contiguous composition were used, but the current definition does give you the
shortest strings.

See also http://www.unicode.org/notes/tn5/#FCC (not a normative Unicode document).

markus

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Peter Kirk

2003-09-25 22:25:13 UTC

Permalink

Post by Markus Scherer

Post by Peter Kirk

Post by j***@spin.ie
It's not a reordering per se, as the first combining character is
given the first "opportunity" to combine.

Thanks for the clarification.

And this current definition cannot be changed because of the stability
policy, right?

Post by Markus Scherer
See also http://www.unicode.org/notes/tn5/#FCC (not a normative Unicode document).
markus

Thanks.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Markus Scherer

2003-09-26 00:40:02 UTC

Permalink

Post by Peter Kirk

Post by Markus Scherer
In other words, yes, Unicode's NFC does perform "discontiguous
composition". Some things might be easier if only contiguous
composition were used, but the current definition does give you the
shortest strings.

And this current definition cannot be changed because of the stability
policy, right?

Right.

markus

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

j***@spin.ie

2003-09-26 08:52:35 UTC

Permalink

Post by Peter Kirk

Post by Markus Scherer

Post by Peter Kirk

Post by j***@spin.ie
It's not a reordering per se, as the first combining character is
given the first "opportunity" to combine.

Thanks for the clarification.

A composition system that could produce shorter strings is possible (calculate every possible combination with the same decomposition, use the shortest) the system used by NFC is a compromise between conciseness of the output and the computational effort needed to generate it.

Post by Peter Kirk
And this current definition cannot be changed because of the stability
policy, right?

If there is a problem with this then it goes deeper than just NFC, but to the rules of how combining characters can or cannot be reordered, and the meaning that the resulting strings have. If there is a problem with that then the problem lies with those rules, rather than NFC which uses them.

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Peter Kirk

2003-09-26 11:37:47 UTC

Permalink

Post by j***@spin.ie
...
If there is a problem with this then it goes deeper than just NFC, but to the rules of how combining characters can or cannot be reordered, and the meaning that the resulting strings have. If there is a problem with that then the problem lies with those rules, rather than NFC which uses them.

Actually, in my opinion based on experience with problem combinations in
Hebrew and Arabic, the problem is not so much with the reordering rules
as with the way that some canonical decompositions and combining classes
have been inappropriately defined, and with the stability policy which
decrees that even the most obvious mistakes cannot be corrected.

The problem is that the definitions in the Unicode Standard conflict

Post by j***@spin.ie
D46 Combining class: A numeric value given to each combining Unicode
character that
determines with which other combining characters it typographically
interacts.
• See Section 4.3, Combining Classes—Normative, for information about
the combining
classes for Unicode characters.
Characters have the same class if they interact typographically, and
different classes if they
do not.

This is simply untrue and so needs to be changed. I have well documented
examples from Hebrew and from Arabic of combining characters which do
not have the same combining class but do interact typographically.
(Unless D46 is read as a counter-intuitive definition of
"typographically interacts".) The obvious way of correcting this error,
to adjust the combining classes, is ruled out by the stability policy.
So the text of the standard, which can be changed in a new version,
needs to be changed to read something like:

Characters have the same class if according to the best information
available in 2001 (?) they were thought to interact typographically, and
different classes if they were thought not to.

Or else simply state that combining classes are assigned arbitrarily -
as also needs to happen with Unicode character names which similarly
contain uncorrectable errors.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

John Cowan

2003-09-26 12:28:52 UTC

Permalink

Post by Peter Kirk
Or else simply state that combining classes are assigned arbitrarily -
as also needs to happen with Unicode character names which similarly
contain uncorrectable errors.

Let's not overdo it. Names, or combining classes, that are not in
hindsight 100% appropriate does not justify saying that the assignments
are arbitrary. If we called "A" LATIN SMALL LETTER B, or OZMA VEEBLEFESTER
SCRITCHIFCHISTED, then there might be some justification for calling it
arbitrary.
--
"You know, you haven't stopped talking John Cowan
since I came here. You must have been http://www.reutershealth.com
vaccinated with a phonograph needle." ***@reutershealth.com
--Rufus T. Firefly http://www.ccil.org/~cowan

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Peter Kirk

2003-09-26 12:59:37 UTC

Permalink

Post by John Cowan

Post by Peter Kirk
Or else simply state that combining classes are assigned arbitrarily -
as also needs to happen with Unicode character names which similarly
contain uncorrectable errors.

Well, "arbitrary" perhaps goes too far. But if they are incorrect it
would be better if they were totally incorrect rather than a little bit
so - on the same basis that you are more likely to miss the bus because
of a clock that is five minutes slow than because of one that is
obviously telling totally the wrong time.

Anyway, some of the Unicode names for Hebrew accents are just as
inaccurate as calling "a" LATIN SMALL LETTER B, and some of the names of
extended Cyrillic characters are just as confusing as that.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/