Jon Hanna
2003-09-24 12:58:15 UTC
Hi,
I'm currently experimenting with various trade-offs for Unicode normalisation code. Any comments on these (particularly of the "that's insane, here's why, stop now!" variety) would be welcome.
The first is an optimisation of speed over size. Rather than perform the decomposition as a recursive operation, the data needed to do so in a single pass is stored. For example, rather than compute <U+212B> -> <U+00C5> -> <U+0041, U+030A> recursively, one can store the data to compute <U+212B> -> <U+0041, U+030A> directly. This reduces the amount of work needed to decompose each character, and further benefits from the fact that if there are no trailing combining characters (that is, if the next character is a starter) then no re-ordering is required.
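As a rough illustration of the first idea, here is a minimal Python sketch that flattens the recursive canonical mappings into a single lookup table ahead of time. It leans on the standard unicodedata module for the raw mappings; the function names and the 0x3000 demo limit are my own choices for illustration, Hangul syllables (which decompose algorithmically) are not covered, and canonical re-ordering of combining marks is deliberately left out.

```python
import unicodedata

def canonical_mapping(cp):
    """Return the one-step canonical decomposition of cp as a tuple, or None."""
    raw = unicodedata.decomposition(chr(cp))
    if not raw or raw.startswith("<"):  # no mapping, or a compatibility mapping
        return None
    return tuple(int(h, 16) for h in raw.split())

def full_decomposition(cp):
    """Expand cp recursively; this cost is paid once, at table-build time."""
    step = canonical_mapping(cp)
    if step is None:
        return (cp,)
    out = []
    for c in step:
        out.extend(full_decomposition(c))
    return tuple(out)

def build_flat_table(limit=0x110000):
    """Map every decomposable code point below limit to its full decomposition."""
    return {cp: full_decomposition(cp)
            for cp in range(limit)
            if canonical_mapping(cp) is not None}

# Hypothetical demo table, limited to a small range to keep the build quick.
FLAT = build_flat_table(0x3000)

def decompose_single_pass(text, table=FLAT):
    """One table lookup per character, no recursion (re-ordering not included)."""
    out = []
    for ch in text:
        out.extend(table.get(ord(ch), (ord(ch),)))
    return "".join(map(chr, out))
```

With this table, FLAT[0x212B] holds (0x0041, 0x030A) directly, so U+212B is decomposed with one lookup instead of two.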
The second is an optimisation of both speed and size, with the disadvantage that data cannot be shared between NFC and NFD operations (which is perhaps a reasonable trade in the case of web code, which might only need NFC code to be linked). In this version decompositions of stable codepoints are omitted from the decomposition data. For example, following the decomposition <U+0104> -> <U+0041, U+0328> there can be no character that is unblocked from the U+0041 and that will combine with it. Hence there is no circumstance in which the pair will not be recombined to U+0104, and dropping that decomposition from the data will not affect NFC. (The relevant data would still have to be in the composition table, as the sequence <U+0041, U+0328> might occur in the source code.)
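One way to sanity-check a candidate for the second idea is to ask a reference NFC implementation whether the precomposed character always survives. The Python sketch below does that with the standard unicodedata module. It only tests a single trailing combining mark at a time, so it is a necessary check rather than a proof of stability; the helper names and the 0x3100 limit are assumptions made for the example.

```python
import sys
import unicodedata

def combining_marks(limit=sys.maxunicode + 1):
    """Yield every character below limit with a non-zero combining class."""
    for cp in range(limit):
        ch = chr(cp)
        if unicodedata.combining(ch) != 0:
            yield ch

def survives_single_mark(cp, marks):
    """True if NFC(cp + mark) still starts with cp for every mark given.

    If this holds, decomposing cp and recomposing it never changes the
    leading character, at least for one trailing mark.
    """
    base = chr(cp)
    return all(unicodedata.normalize("NFC", base + m).startswith(base)
               for m in marks)

# Hypothetical demo set: combining marks below U+3100.
MARKS = list(combining_marks(0x3100))

print(survives_single_mark(0x0104, MARKS))  # U+0104: candidate for omission
print(survives_single_mark(0x00C5, MARKS))  # U+00C5: not stable in this sense
```

U+00C5 fails the check because a following U+0323 (combining class 220) sorts ahead of U+030A (class 230) and combines with the U+0041 first, giving <U+1EA0, U+030A>. Its decomposition therefore has to stay in the data.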
To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com
This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html
Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/