Discussion:
Unicode Collation Algorithm: 4.0 Update (beta)
Rick McGowan
2003-08-16 02:27:37 UTC
Permalink
The Unicode Technical Committee would like to announce availability of the
beta Default Unicode Collation Element Table for UCA 4.0. Feedback is
invited.

The primary goal of this release is to synchronize the repertoire of
strings for collation (sorting) with the repertoire of Unicode 4.0. For
future versions of the Unicode Standard that add characters, there will
also be versions of the UCA tables with synchronized repertoire.

A small number of additional changes have been made for consistency in
treatment of new and old characters; however, other changes await working
with SC22/WG2 so that future versions of ISO 14651 and UCA can be
synchronized.

The relevant data file is found here:

http://www.unicode.org/reports/tr10/allkeys-4.0.0d1.txt

Please also look at the corresponding proposed update version of Unicode
Technical Standard #10, The Unicode Collation Algorithm:

http://www.unicode.org/reports/tr10/tr10-10.html

Due to production difficulties, the beta period for this is quite short;
comments for this version must be submitted by end of day, August 26, 2003.
However, comments directed to the next version can be submitted after this
date. Please submit feedback with the reporting form at:

http://www.unicode.org/reporting.html

Regards,
Rick McGowan




To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html


Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
Mark Davis
2003-08-18 00:37:57 UTC
Permalink
There are also beta collation charts in:

http://www.unicode.org/charts/collation/beta/

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

Michael (michka) Kaplan
2003-08-18 01:32:14 UTC
Permalink
These collation tables have one and only one of the following two problems:

A) If these are intended to be language-specific tailorings then a strong
warning about the linguistic inapplicability needs to be added, since the
data is actually incorrect for the use of most of these scripts.

B) If these tables are just intended to be another view of the default
table, outlining the data by script, then this information should be more
prominently explained.

I suspect that the issue is the one outlined in (B) meaning it is just
better explaining what the data is meant to be, but I have had many
customers claim to me that "Unicode does not understand their
language/script" because they assumed that the issue was as outlined in (A).

Assuming that it is (B), this problem is unfortunately exacerbated by the
many times that the language name == the script name (e.g. in Tamil,
Bengali, and many others). Having the items on the left called out at the
top as SCRIPTS rather than LANGUAGES would probably help with that issue
(though some people do not distinguish even when the difference is explained
clearly).

Of course there is the issue that the UTS does not really seem to reference
this page, but maybe there is a reference somewhere else that is a bit more
of a challenge to find. :-)

MichKa

Mark Davis
2003-08-18 03:35:01 UTC
Permalink
comments below.

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Michael (michka) Kaplan" <***@trigeminal.com>
To: "Mark Davis" <***@jtcsv.com>; <***@unicode.org>
Cc: <***@unicode.org>; <***@unicode.org>
Sent: Sunday, August 17, 2003 18:32
Subject: Re: [indic] Re: Unicode Collation Algorithm: 4.0 Update (beta)
Post by Michael (michka) Kaplan
A) If these are intended to be language-specific tailorings then a strong
warning about the linguistic inapplicability needs to be added, since the
data is actually incorrect for the use of most of these scripts.
B) If these tables are just intended to be another view of the default
table, outlining the data by script, then this information should be more
prominently explained.
I suspect that the issue is the one outlined in (B) meaning it is just
better explaining what the data is meant to be, but I have had many
customers claim to me that "Unicode does not understand their
language/script" because they assumed that the issue was as outlined in (A).
B is the case; this is simply a different view of the UCA.
Post by Michael (michka) Kaplan
Assuming that it is (B), this problem is unfortunately exacerbated by the
many times that the language name == the script name (e.g. in Tamil,
Bengali, and many others). Having the items on the left called out at the
top as SCRIPTS rather than LANGUAGES would probably help with that issue
(though some people do not distinguish even when the difference is explained
clearly).
Note that this is not a new page, just an update of an existing one. However,
your comments are good.
Post by Michael (michka) Kaplan
Of course there is the issue that the UTS does not really seem to reference
this page, but maybe there is a reference somewhere else that is a bit more
of a challenge to find. :-)
They are linked on http://www.unicode.org/charts/. We can add a link in the UTS.
Maybe in
http://www.unicode.org/reports/tr10/tr10-10.html#Common_Misperceptions -- what
do you think?
Mark Davis
2003-08-18 20:36:08 UTC
Permalink
I'm sorry that you haven't gotten responses before. I have searched through my
email archive, and can't find anything like the message, and I don't think it
was brought up to the UTC formally.

The first one seems odd, and as you say, it would seem to only affect a
vanishingly small number of characters; since these are final characters, one
presumes there would be subsequent characters that would form a larger
difference anyway.

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Matitiahu Allouche" <***@il.ibm.com>
To: "Mark Davis" <***@jtcsv.com>
Cc: <***@unicode.org>; <***@unicode.org>; <***@unicode.org>
Sent: Monday, August 18, 2003 10:08
Subject: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)
I have submitted the following text on the Unicode Reporting form.
This report relates to the collation tables for Hebrew as displayed in
http://www.unicode.org/charts/collation/beta/chart_Hebrew.html
I have already formulated the following remarks in the past, but no action
has been taken, so I repeat them here.
1) Precedence of Dagesh over Final/non-Final: in the chart, the presence
or absence of Dagesh is a Secondary difference, while Final/non-Final is a
Tertiary difference. This is relevant only for letters Kaf and Pe. My
gut feeling says that Final/non-Final should have precedence over
Dagesh/no-Dagesh.
Note that the number of actual cases where this would make a difference is
probably *very* small.
2) There is something strange in the combinations of Shin with Dagesh and
dots: for all other letters, the form without Dagesh sorts before the form
with Dagesh. But Shin with Sin/Shin dot sort after their corresponding
combinations with Dagesh. I cannot imagine a justification for that.
I have submitted those reservations to the Technical Committee 2109 of the
SII (Standards Institution of Israel, the Israeli NB), which deals with
Hebrew-related standards in IT, and the committee endorsed my point of
view. I can ask the committee to send you a confirmation letter if
required.
I would like to see some action taken on my remarks this time, or at least
some justified refutation.
Shalom (Regards), Mati
Bidi Architect
Globalization Center Of Competency - Bidirectional Scripts
IBM Israel
Phone: +972 2 5888802 Fax: +972 2 5870333 Mobile: +972 52
554160
Peter Kirk
2003-08-18 21:07:20 UTC
Permalink
Mati, I am interested to see that you and SII have been giving attention to collation in Hebrew. I have also been doing so, on the recently set up Unicode Hebrew list (***@unicode.org), as I was also concerned about ordering of shin dots, dagesh etc. I have not yet had any reply to my posting on this subject which I made two days ago. I will forward my posting to you in case you have not seen it.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




Mark Davis
2003-08-19 14:24:12 UTC
Permalink
Ah, that explains it. You had filed this against ICU, not UCA; that explains why
I couldn't find it in the Unicode reports.

A. Final.
Post by Matitiahu Allouche
1) Precedence of Dagesh over Final/non-Final: in the chart, the presence
or absence of Dagesh is a Secondary difference, while Final/non-Final is a
Tertiary difference. This is relevant only for letters Kaf and Pe. My
gut feeling says that Final/non-Final should have precedence over
Dagesh/no-Dagesh.
Note that the number of actual cases where this would make a difference is
probably *very* small.
So there are two issues for final vs non-final: strength and ordering.

A1. Ordering is easy to change; in ICU or UCA we could put the final values
before the independent letters. In ICU they are just rules, while in UCA they
follow
http://www.unicode.org/reports/tr10/tr10-10.html#Tertiary_Weight_Table. The
easiest in UCA would be to give the 5 independent forms that have finals the
value <isolated>.

Note: there is one minor fallout in ICU: we optimize the sortkey compression of
tertiary values of NONE; if we change the ordering then each instance of the
<isolated> letters will mean about a 2-3 byte increase in sort-key sizes.
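
As an illustration of the ordering change in A1, the following Python sketch builds simplified three-level sort keys for Zayin Yod Pe and Zayin Yod Final Pe. The tertiary weights 0002 (non-final) and 0019 (final) are the ones quoted in this thread. The primary weights, the shared secondary weight 0020, and the value 001A used for <isolated> are assumptions made for the example and are not taken from the table; real UCA sort keys also involve details that are left out here.

```python
# Simplified three-level sort keys: all primaries, a 0 separator,
# all secondaries, a 0 separator, then all tertiaries.
# Primary weights below are hypothetical; only their relative order matters.
ZAYIN, YOD, PE = 0x1000, 0x1001, 0x1002
SECONDARY = 0x0020    # assumed shared secondary weight
T_NONE = 0x0002       # tertiary weight of non-final letters (from the thread)
T_FINAL = 0x0019      # tertiary weight of final letters (from the thread)
T_ISOLATED = 0x001A   # assumed value for <isolated>


def sort_key(elements):
    """Build a level-by-level key from (primary, secondary, tertiary) tuples."""
    key = []
    for level in range(3):
        key.extend(e[level] for e in elements)
        key.append(0)  # level separator, lower than any real weight
    return tuple(key)


def word(last_letter_tertiary):
    """Zayin, Yod, then a Pe whose tertiary weight is supplied by the caller."""
    return [
        (ZAYIN, SECONDARY, T_NONE),
        (YOD, SECONDARY, T_NONE),
        (PE, SECONDARY, last_letter_tertiary),
    ]


# Current weights: non-final Pe (0002) sorts before final Pe (0019).
current = sort_key(word(T_NONE)) < sort_key(word(T_FINAL))

# Ordering change: non-final Pe gets <isolated>, so final Pe sorts first.
proposed = sort_key(word(T_FINAL)) < sort_key(word(T_ISOLATED))

print(current, proposed)  # True True
```

Both spellings share the same primary and secondary weights in this sketch, so the comparison is decided only at the tertiary level, which is why changing the tertiary value of the non-final form is enough to flip the order.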

A2. For Strength, it is not as clear cut. If Final vs non-Final is more
important than dagesh, etc, the easiest thing is to make it a primary
difference; but that would make

Zayin Yod PeFinal

sort before all words

Zayin Yod Pe XXX

But I'm guessing that is probably not desired for Hebrew.

In ICU we could make Final vs non-Final be a secondary difference, and have
Dagesh, etc. be tertiary differences. The disadvantage is that people tend to
expect the 2nd level to be 'accent-like', and there might be more
inconsistencies in practice than you would gain by having the current situation.
In Unicode, the UCA has more production restrictions as per
http://www.unicode.org/reports/tr10/tr10-10.html#Tertiary_Weight_Table, so it
would be a bit harder to make that change.

So if SII would like this change, I'd recommend that we make the ordering change
in UCA (which will then affect ICU), but not make a strength change (it would
have to be extremely exotic for that to make a difference).

Cf. http://www.unicode.org/charts/collation/chart_Hebrew.html

B. Dagesh
Post by Matitiahu Allouche
2) There is something strange in the combinations of Shin with Dagesh and
dots: for all other letters, the form without Dagesh sorts before the form
with Dagesh. But Shin with Sin/Shin dot sort after their corresponding
combinations with Dagesh. I cannot imagine a justification for that.
We currently have the following in the UCA (from UCA 4.0.0d1 (beta)):
05B0 ; [.0000.00B2.0002.05B0] # HEBREW POINT SHEVA
05B1 ; [.0000.00B3.0002.05B1] # HEBREW POINT HATAF SEGOL
05B2 ; [.0000.00B4.0002.05B2] # HEBREW POINT HATAF PATAH
05B3 ; [.0000.00B5.0002.05B3] # HEBREW POINT HATAF QAMATS
05B4 ; [.0000.00B6.0002.05B4] # HEBREW POINT HIRIQ
05B5 ; [.0000.00B7.0002.05B5] # HEBREW POINT TSERE
05B6 ; [.0000.00B8.0002.05B6] # HEBREW POINT SEGOL
05B7 ; [.0000.00B9.0002.05B7] # HEBREW POINT PATAH
05B8 ; [.0000.00BA.0002.05B8] # HEBREW POINT QAMATS
05B9 ; [.0000.00BB.0002.05B9] # HEBREW POINT HOLAM
05BB ; [.0000.00BC.0002.05BB] # HEBREW POINT QUBUTS
05BC ; [.0000.00BD.0002.05BC] # HEBREW POINT DAGESH OR MAPIQ
05BF ; [.0000.00C0.0002.05BF] # HEBREW POINT RAFE
05C1 ; [.0000.00C1.0002.05C1] # HEBREW POINT SHIN DOT
05C2 ; [.0000.00C2.0002.05C2] # HEBREW POINT SIN DOT
FB1E ; [.0000.00C3.0002.FB1E] # HEBREW POINT JUDEO-SPANISH VARIKA

To make this change, we would move Dagesh to after SIN DOT. Question: should it
also go after VARIKA or not?
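
To see why the quoted weights produce the behaviour reported for Shin, here is a minimal Python sketch that parses a few of the table lines above and compares the resulting secondary-weight sequences. The parser only handles simple single-character entries like the ones quoted, and the base letter's secondary weight (0020) is an assumption for the example. Marks are listed in canonical order, so Dagesh comes before Shin Dot.

```python
import re

# Lines copied from the beta table quoted above (UCA 4.0.0d1).
SAMPLE = """\
05BC ; [.0000.00BD.0002.05BC] # HEBREW POINT DAGESH OR MAPIQ
05C1 ; [.0000.00C1.0002.05C1] # HEBREW POINT SHIN DOT
05C2 ; [.0000.00C2.0002.05C2] # HEBREW POINT SIN DOT
"""

# Matches only simple entries: one code point, one collation element.
ENTRY_RE = re.compile(
    r"^([0-9A-F]{4,6})\s*;\s*\[[.*]([0-9A-F]{4})\.([0-9A-F]{4})"
    r"\.([0-9A-F]{4})\.[0-9A-F]{4,6}\]"
)


def parse_table(text):
    """Return {code point: (primary, secondary, tertiary)}."""
    table = {}
    for line in text.splitlines():
        m = ENTRY_RE.match(line.strip())
        if m:
            cp, p, s, t = (int(g, 16) for g in m.groups())
            table[cp] = (p, s, t)
    return table


BASE_SECONDARY = 0x0020  # assumed secondary weight of the base letter
DAGESH, SHIN_DOT = 0x05BC, 0x05C1


def secondary_key(marks, table):
    """Secondary weights of a base letter followed by the given marks."""
    return [BASE_SECONDARY] + [table[m][1] for m in marks]


table = parse_table(SAMPLE)

# Any letter: the form without Dagesh is a prefix, so it sorts first.
print(secondary_key([], table) < secondary_key([DAGESH], table))  # True

# Shin: Dagesh (00BD) is compared against Shin Dot (00C1) and wins,
# so the form with Dagesh sorts before the form without it.
print(secondary_key([DAGESH, SHIN_DOT], table)
      < secondary_key([SHIN_DOT], table))  # True
```

Giving Dagesh a secondary weight higher than Shin Dot and Sin Dot, as proposed above, would reverse the second comparison in this simplified model.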

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Matitiahu Allouche" <***@il.ibm.com>
To: "Mark Davis" <***@jtcsv.com>
Cc: <***@unicode.org>; <***@unicode.org>; <***@unicode.org>
Sent: Tuesday, August 19, 2003 01:21
Subject: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)
Hello, Mark!
There must be some hole in your email archive :-), since you yourself
expressed your personal take on the issues. On 04/05/03 (probably 4th of
<QUOTE>
Subject: Bug on Hebrew Collation
Importance: Urgent
http://www.jtcsv.com/cgibin/icu-bugs/collation?id=1489;user=guest
Mati, your comments look reasonable. I am, however, a little nervous since
as far as I know, the Israeli government committee had input into the
basic table for ISO 14651, which is reflected in the UCA. (We don't modify
it for Hebrew). Can you confirm with them that these tailorings should be
made?
Mark
</QUOTE>
I did not formally submit anything to the UTC, though, so I may be
responsible for my own misfortune. At that time, I had 4 remarks. It
seems that 2 of them have been implemented, and the 2 others have not.
I have second thoughts about the tertiary weight allocated to final
letters (0019) as compared to that allocated to non-final letters (0002).
That means that final letters are collated *after* the corresponding
non-final letters. This goes against accepted Hebrew usage. In normal
cases, the non-final letter will be followed by some more letters, so that
there will be a primary difference, but exotic cases will be sorted
improperly. An example that comes to mind is transliteration of
non-Hebrew words. For instance a "zip" file will be transliterated as
"Zayin Yod Pe" (Google gives 2840 hits for this spelling). There is a
Hebrew word pronounced "zif" (meaning "bristle") which is written
identically except that the last letter is a Final Pe. I expect the "zip"
file to be collated *after* the "bristle", but this will not happen with
the current collation table.
a) Final letters had a smaller weight than the corresponding non-final
letters (for some level >1).
b) The level associated with final/non-final was more significant than the
level associated with diacritics (Dagesh and/or other Hebrew points).
It is not that I have so many really convincing examples that would be
broken with the current collation definition, but I think that having
weights which reflect the linguistic guidelines is more likely to
successfully handle the cases that we have not considered.
Shalom (Regards), Mati
Bidi Architect
Globalization Center Of Competency - Bidirectional Scripts
IBM Israel
Phone: +972 2 5888802 Fax: +972 2 5870333 Mobile: +972 52
554160
Mark Davis
2003-08-19 21:23:11 UTC
Permalink
Three points.

First, while we try to make the UCA collation table (DUCET) as reasonable as
possible for the main languages of a given script, it is not guaranteed to
produce the correct sorting for any particular language. The UCA *is* designed
so that it provides a default base ordering for all of Unicode, and individual
languages can be given tailorings of the DUCET that handle the specifics of
their string comparison requirements.

Thus if there are changes that improve the handling of the UCA for the major
languages using a given script, and do not destabilize others, those are
candidates for change in a version. For example, if it turned out that a
particular Tamil character (or sequence of characters!) was not sorted correctly
according to the DUCET (e.g. on http://www.unicode.org/charts/collation/beta/),
then it would be a candidate, and should be submitted on the form.

Second, we do and should favor modern language communities when making
incompatible tradeoffs. So if we have the choice between making French sort
correctly without tailoring, or have Latin sort correctly without tailoring, we
should choose the modern community. The Latin community can always use a
tailored UCA, in any event.

Third, there is often a serious confusion between sorting weight and canonical
ordering. The fact that a grave accent precedes a cedilla in canonical order is
*completely independent of* whatever collation weights each of them has, either
in a tailoring or in the DUCET. The only substantive issue is how each of these
sorts separately or in combination. And making the combination (sequence) of
grave and cedilla sort before grave, after grave, before cedilla, or after
cedilla are all possible; all of those can be handled by the UCA as
contractions. See http://www.unicode.org/reports/tr10/tr10-10.html for more
information.

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Peter Kirk" <***@ntlworld.com>
To: "Mark Davis" <***@jtcsv.com>
Cc: "Matitiahu Allouche" <***@il.ibm.com>; <***@unicode.org>;
<***@unicode.org>; <***@unicode.org>; "Joan Wardell" <***@sil.org>
Sent: Tuesday, August 19, 2003 13:59
Subject: Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)
Post by Mark Davis
To make this change, we would move Dagesh to after SIN DOT. Question: should it
also go after VARIKA or not?
Please, don't rush any changes to the UCA here. We need a proper review
of what is required for biblical as well as modern Hebrew (hopefully the
same but possibly not), not just a quick conclusion that we fix things
by reordering dagesh.
A lot of the problem with dagesh etc comes from the highly inappropriate
canonical combining classes for U+05B0 to U+05C4. I was told not long
ago that the ordering of these didn't matter, only the distinctions do,
but the ordering sure does matter when it comes to collation. Shin with
dagesh and patah is logically <shin, shin dot, dagesh, patah> and
should probably be collated on the basis of that ordering, i.e. sort
first by the sin/shin dot, then by whether there is dagesh or not, then
by the vowel. But the canonically ordered NFD which is the input to
collation is <shin, patah, dagesh, shin dot>. So somehow the collation
algorithm has to be asked to undo the damage which normalisation did and
collate these things in the right order.
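
The reordering described above can be reproduced with Python's standard unicodedata module. This is only a sketch to make the point concrete: it normalises the logical sequence <shin, shin dot, dagesh, patah> to NFD and prints the canonical combining class of each character in the result.

```python
import unicodedata

SHIN = "\u05E9"      # HEBREW LETTER SHIN
SHIN_DOT = "\u05C1"  # HEBREW POINT SHIN DOT
DAGESH = "\u05BC"    # HEBREW POINT DAGESH OR MAPIQ
PATAH = "\u05B7"     # HEBREW POINT PATAH

# Logical order as described in the message above.
logical = SHIN + SHIN_DOT + DAGESH + PATAH

# Canonical ordering sorts combining marks by combining class.
nfd = unicodedata.normalize("NFD", logical)

for ch in nfd:
    print(f"U+{ord(ch):04X}", unicodedata.combining(ch), unicodedata.name(ch))

classes = [unicodedata.combining(ch) for ch in nfd]
print(classes)  # [0, 17, 21, 24]
```

The marks come out as patah (17), dagesh (21), shin dot (24), which is the <shin, patah, dagesh, shin dot> order that collation receives as input.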
And please don't discuss Hebrew here in isolation from the discussion of
the same subject on the Hebrew list - at least the discussion which I
was raising there on the understanding that matters of Hebrew were
supposed to be discussed there.
--
Peter Kirk
http://www.qaya.org/
Peter Kirk
2003-08-19 22:11:23 UTC
Permalink
Resending with the correct address...
Post by Mark Davis
Three points.
First, while we try to make the UCA collation table (DUCET) as reasonable as
possible for the main languages of a given script, it is not guaranteed to
produce the correct sorting for any particular language. The UCA *is* designed
so that it provides a default base ordering for all of Unicode, and individual
languages can be given tailorings of the DUCET that handle the specifics of
their string comparison requirements.
Thus if there are changes that improve the handling of the UCA for the major
languages using a given script, and do not destabilize others, those are
candidates for change in a version. For example, if it turned out that a
particular Tamil character (or sequence of characters!) was not sorted correctly
according to the DUCET (e.g. on http://www.unicode.org/charts/collation/beta/),
then it would be a candidate, and should be submitted on the form.
Understood. On this basis, the DUCET sorting for the Hebrew block should
be based on the requirements for modern Hebrew, with Yiddish, Ladino etc
also being taken into account.
Post by Mark Davis
Second, we do and should favor modern language communities when making
incompatible tradeoffs. So if we have the choice between making French sort
correctly without tailoring, or have Latin sort correctly without tailoring, we
should choose the modern community. The Latin community can always use a
tailored UCA, in any event.
Understood. I accept the primacy of the modern language in this case.
There may be some issues on which the modern language has no
preference, especially for characters only used in older Hebrew, and in
such cases it would make sense to follow the preferences of ancient
Hebrew scholars. If it becomes necessary to use a tailored UCA for
biblical work, so be it, but I would prefer not to. We have come close
to having to use a separate set of vowels for biblical Hebrew simply
because decisions were rushed and then frozen on the basis of modern
Hebrew requirements. I don't want any danger of falling into the same
kind of trap with collation.
Post by Mark Davis
Third, there is often a serious confusion between sorting weight and canonical
ordering. The fact that a grave accent precedes a cedilla in canonical order is
*completely independent of* whatever collation weights each of them has, either
in a tailoring or in the DUCET. The only substantive issue is how each of these
sorts separately or in combination. And making the combination (sequence) of
grave and cedilla sort before grave, after grave, before cedilla, or after
cedilla are all possible; all of those can be handled by the UCA as
contractions. See http://www.unicode.org/reports/tr10/tr10-10.html for more
information.
Yes, I understand that the collation weights are quite independent of
the canonical combining classes. But collation does become trickier
when the canonical ordering is not the expected one, because of the
assumption that collation is based on the order of the string i.e. based
on the first character, then the second etc.
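
A minimal sketch of the contraction idea, assuming a toy two-level table: the weights, `make_table`, `sort_key` and `ranked` are all invented for illustration and are not taken from the DUCET. It shows that the grave + cedilla pair can be placed anywhere relative to the single marks simply by giving the pair its own element, whatever order normalisation leaves the marks in.

```python
import unicodedata

GRAVE, CEDILLA = "\u0300", "\u0327"
# The two marks in whatever order canonical reordering (NFD) leaves them.
PAIR = unicodedata.normalize("NFD", GRAVE + CEDILLA)


def make_table(pair_secondary):
    """Toy table mapping strings to (primary, secondary) collation elements."""
    return {
        "e": [(10, 1)],
        GRAVE: [(0, 20)],
        CEDILLA: [(0, 30)],
        PAIR: [(0, pair_secondary)],  # contraction with a weight of its own
    }


def sort_key(text, table):
    text = unicodedata.normalize("NFD", text)
    elements, i = [], 0
    while i < len(text):
        # Longest match first, so the two-mark contraction beats single marks.
        for length in (2, 1):
            chunk = text[i:i + length]
            if chunk in table:
                elements.extend(table[chunk])
                i += len(chunk)
                break
        else:
            raise KeyError(f"no collation element for U+{ord(text[i]):04X}")
    # Zero weights are ignorable at their level, as in a multi-level sort key.
    primary = tuple(p for p, _ in elements if p)
    secondary = tuple(s for _, s in elements if s)
    return primary, secondary


WORDS = ["e" + GRAVE, "e" + CEDILLA, "e" + GRAVE + CEDILLA]


def ranked(pair_secondary):
    table = make_table(pair_secondary)
    return sorted(WORDS, key=lambda w: sort_key(w, table))
```

Calling `ranked(10)`, `ranked(25)` and `ranked(35)` puts the pair before grave, between grave and cedilla, and after cedilla respectively, without touching the canonical order of the input.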

Well, I am glad that contractions provide a way around that problem. So
perhaps we ought to be looking at using them for Hebrew in DUCET. I
guess we should consider defining contractions for each case of
<consonant, dagesh> which differ from the consonant at the second level
only, perhaps also the same for rafe, and similarly for each combination
of shin, shin/sin dot and dagesh. The problem comes that the vowels
intrude between the consonant and the dagesh, and meteg comes before
shin/sin dot, so there is a potential need for a rather large number of
contractions, especially if we consider a shin with a right meteg which
might come out as:

<shin, dagesh, meteg, CGJ, {any one of 11 vowels}, {optional shin dot |
sin dot}, masora circle>

with the CGJ inhibiting complete canonical reordering, and the shin/sin
dot must be contracted with the shin.

Perhaps we need to specify that dagesh and shin/sin dot must always come
BEFORE any CGJ in such combinations so that they don't get separated too
far from the base character. In fact I think I will change my document
to specify that.
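
A small sketch of the CGJ effect described above, using Python's standard `unicodedata` module. The sequences are simplified (one vowel, no masora circle) and the variable names are only illustrative. Because CGJ has combining class 0, canonical reordering treats the marks before it and the marks after it as separate runs.

```python
import unicodedata

SHIN, DAGESH, METEG = "\u05E9", "\u05BC", "\u05BD"
PATAH, SHIN_DOT = "\u05B7", "\u05C1"
CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0

# Right meteg typed before the vowel, with and without a CGJ after it.
without_cgj = SHIN + DAGESH + METEG + PATAH + SHIN_DOT
with_cgj = SHIN + DAGESH + METEG + CGJ + PATAH + SHIN_DOT
# Variant with the shin dot kept before the CGJ, next to the base letter.
dot_first = SHIN + SHIN_DOT + DAGESH + METEG + CGJ + PATAH

nfd_without = unicodedata.normalize("NFD", without_cgj)
nfd_with = unicodedata.normalize("NFD", with_cgj)
nfd_dot_first = unicodedata.normalize("NFD", dot_first)


def show(label, text):
    names = ", ".join(unicodedata.name(ch) for ch in text)
    print(f"{label}: {names}")


show("no CGJ   ", nfd_without)
show("with CGJ ", nfd_with)
show("dot first", nfd_dot_first)
```

Without the CGJ the vowel moves in front of the dagesh and meteg; with it, the meteg stays ahead of the vowel. In the third sequence the shin dot is still reordered after dagesh and meteg, but it stays in the run attached to the shin.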

PS Is there a problem with the Unicode Hebrew list? Nothing seems to
have appeared on it today, including my previous posting on this thread
and Mark's reply to it.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/



Mark Davis
2003-08-19 22:22:01 UTC
Permalink
I forgot the most important point of all:

The goal for UCA 4.0 is to top it up to the Unicode 4.0 repertoire. The
timeframe for that is quite short -- it was to have been done some time ago --
and we don't want to make any changes that we would want to pull out later when
we work with SC22/WG20. So we will only make "safe and obvious" changes in this
version.

Of course, you should still continue to work on any more extensive comments for
a later version, so that they are prepared well in advance; after all, all of
these issues are on collation features that have been in since 3.1 and before!

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

Peter Kirk
2003-08-19 22:10:18 UTC
Permalink
Resending with the correct address...
Post by Mark Davis
B. Dagesh
2) There is something strange in the combinations of Shin with Dagesh and
dots: for all other letters, the form without Dagesh sorts before the form
with Dagesh. But Shin with Sin/Shin dot sort after their corresponding
combinations with Dagesh. I cannot imagine a justification for that.
We have currently in UCA the following (from UCA 4.0.0d1 (beta))
05B0 ; [.0000.00B2.0002.05B0] # HEBREW POINT SHEVA
05B1 ; [.0000.00B3.0002.05B1] # HEBREW POINT HATAF SEGOL
05B2 ; [.0000.00B4.0002.05B2] # HEBREW POINT HATAF PATAH
05B3 ; [.0000.00B5.0002.05B3] # HEBREW POINT HATAF QAMATS
05B4 ; [.0000.00B6.0002.05B4] # HEBREW POINT HIRIQ
05B5 ; [.0000.00B7.0002.05B5] # HEBREW POINT TSERE
05B6 ; [.0000.00B8.0002.05B6] # HEBREW POINT SEGOL
05B7 ; [.0000.00B9.0002.05B7] # HEBREW POINT PATAH
05B8 ; [.0000.00BA.0002.05B8] # HEBREW POINT QAMATS
05B9 ; [.0000.00BB.0002.05B9] # HEBREW POINT HOLAM
05BB ; [.0000.00BC.0002.05BB] # HEBREW POINT QUBUTS
05BC ; [.0000.00BD.0002.05BC] # HEBREW POINT DAGESH OR MAPIQ
05BF ; [.0000.00C0.0002.05BF] # HEBREW POINT RAFE
05C1 ; [.0000.00C1.0002.05C1] # HEBREW POINT SHIN DOT
05C2 ; [.0000.00C2.0002.05C2] # HEBREW POINT SIN DOT
FB1E ; [.0000.00C3.0002.FB1E] # HEBREW POINT JUDEO-SPANISH VARIKA
To make this change, we would move Dagesh to after SIN DOT. Question: should it
also go after VARIKA or not?
Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄
Please, don't rush any changes to the UCA here. We need a proper review
of what is required for biblical as well as modern Hebrew (hopefully the
same but possibly not), not just a quick conclusion that we fix things
by reordering dagesh.

A lot of the problem with dagesh etc comes from the highly inappropriate
canonical combining classes for U+05B0 to U+05C4. I was told not long
ago that the ordering of these didn't matter, only the distinctions do,
but the ordering sure does matter when it comes to collation. Shin with
dagesh and patah is logically <shin, shin dot, dagesh, patah> and
should probably be collated on the basis of that ordering, i.e. sort
first by the sin/shin dot, then by whether there is dagesh or not, then
by the vowel. But the canonically ordered NFD which is the input to
collation is <shin, patah, dagesh, shin dot>. So somehow the collation
algorithm has to be asked to undo the damage which normalisation did and
collate these things in the right order.
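
The reordering described above can be checked with Python's standard `unicodedata` module. This is only a sketch; the variable names are illustrative.

```python
import unicodedata

SHIN, SHIN_DOT, DAGESH, PATAH = "\u05E9", "\u05C1", "\u05BC", "\u05B7"

# The "logical" order: base letter, shin dot, dagesh, vowel.
logical = SHIN + SHIN_DOT + DAGESH + PATAH
nfd = unicodedata.normalize("NFD", logical)


def describe(text):
    """Return (character name, canonical combining class) for each character."""
    return [(unicodedata.name(ch), unicodedata.combining(ch)) for ch in text]


for name, ccc in describe(nfd):
    print(f"{ccc:3d}  {name}")
```

The marks come out sorted by combining class (patah 17, dagesh 21, shin dot 24), which is the <shin, patah, dagesh, shin dot> order that collation receives as input.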

And please don't discuss Hebrew here in isolation from the discussion of
the same subject on the Hebrew list - at least the discussion which I
was raising there on the understanding that matters of Hebrew were
supposed to be discussed there.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/



Mark Davis
2003-08-21 15:21:00 UTC
Permalink
OK, it sounds like we have clarity on two items:

- change the non-final characters to <isolated>
- moving dagesh to after RAFE, SHIN DOT, SIN DOT, VARIKA

I'll talk to Ken about whether we have time to get them into UCA 4.0.0, and in
any event we can get them into ICU 2.8 for Hebrew.
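
As a rough illustration of the second item, the sketch below parses the five relevant lines quoted from the beta table and computes the new relative order of the secondary weights. The helper names `parse` and `move_after` are hypothetical, and no new weight values are assigned, since the thread does not give any.

```python
import re

BETA_LINES = """\
05BC ; [.0000.00BD.0002.05BC] # HEBREW POINT DAGESH OR MAPIQ
05BF ; [.0000.00C0.0002.05BF] # HEBREW POINT RAFE
05C1 ; [.0000.00C1.0002.05C1] # HEBREW POINT SHIN DOT
05C2 ; [.0000.00C2.0002.05C2] # HEBREW POINT SIN DOT
FB1E ; [.0000.00C3.0002.FB1E] # HEBREW POINT JUDEO-SPANISH VARIKA
"""

# code point ; [.primary.secondary.tertiary.fourth] # name
LINE = re.compile(
    r"^([0-9A-F]{4,6}) ; "
    r"\[\.([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4,6})\]"
    r" # (.+)$"
)


def parse(lines):
    """Parse single-element table lines into dicts with integer weights."""
    entries = []
    for line in lines.splitlines():
        match = LINE.match(line)
        if match:
            cp, primary, secondary, tertiary, _fourth, name = match.groups()
            entries.append({
                "cp": cp,
                "primary": int(primary, 16),
                "secondary": int(secondary, 16),
                "tertiary": int(tertiary, 16),
                "name": name,
            })
    return entries


def move_after(entries, cp, after_cps):
    """Return code points in secondary-weight order, with cp moved after after_cps."""
    order = [e["cp"] for e in sorted(entries, key=lambda e: e["secondary"])]
    order.remove(cp)
    last = max(order.index(c) for c in after_cps)
    order.insert(last + 1, cp)
    return order


entries = parse(BETA_LINES)
new_order = move_after(entries, "05BC", ["05BF", "05C1", "05C2", "FB1E"])
print(new_order)
```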

As far as the strength issue of final vs dagesh, I don't think we should take
any immediate action. The collation strength also affects matching. If a user
sets the sorting or matching level to "ignore accents", for example, they
probably expect the dots to be ignored then, as well as graves, acutes, etc. If
this showed up in a lot of words, then it would still be worth doing, I suspect.
But because the number of cases is so very small where you would have a
combination of dageshes and finals that would make a difference, I would
recommend that SII approach this very carefully. If we are going to do anything,
it should be in the next version of UCA so that we have time to consider all of
the ramifications. I would not recommend it for ICU 2.8 either, even though we
have more time (and flexibility) there.
I am no expert in Judeo-Spanish, but since FB1E Varika is a glyph variant
of 05BF Rafe, it makes sense that Dagesh be in the same relation to both,
so Dagesh should go after Varika.
From that, it would also appear that VARIKA should either have the same weight
as RAFE or at least be adjacent to it. This would only be an issue for users
of that character, so it is probably difficult to establish the right behavior, and
thus one we would not even try to get into UCA this round.

We should probably take this discussion off of ***@unicode.org, and just
have it on ***@unicode.org and ***@unicode.org. Any people interested in
this topic should be on those groups anyway.

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Matitiahu Allouche" <***@il.ibm.com>
To: "Mark Davis" <***@jtcsv.com>
Cc: <***@unicode.org>; <***@unicode.org>
Sent: Thursday, August 21, 2003 00:55
Subject: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)
Hello, Mark!
In order to address your points in order, I will put excerpts of your note
within <MARK> . . . </MARK> tags, and my comments as untagged text.
<MARK>
A. Final.
1) Precedence of Dagesh over Final/non-Final: in the chart, the presence
or absence of Dagesh is a Secondary difference, while Final/non-Final is a
Tertiary difference. This is relevant only for letters Kaf and Pe. My
gut feeling says that Final/non-Final should have precedence over
Dagesh/no-Dagesh.
Note that the number of actual cases where this would make a difference is
probably *very* small.
So there are two issues for final vs non-final: strength and ordering.
A1. Ordering is easy to change; in ICU or UCA we could put the final values
before the independent letters. In ICU they are just rules, while in UCA they
follow http://www.unicode.org/reports/tr10/tr10-10.html#Tertiary_Weight_Table.
The easiest in UCA would be to give the 5 independent forms that have finals
the value <isolated>.
Note: there is one minor fallout in ICU: we optimize the sortkey compression of
tertiary values of NONE; if we change the ordering then each instance of the
<isolated> letters will mean about a 2-3 byte increase in sort-key sizes.
</MARK>
I like giving the value <isolated> to the 5 independent forms that have
finals. As for the increase in sort-key sizes, this is what cheap memory
is made for :-)
<MARK>
A2. For Strength, it is not as clear cut. If Final vs non-Final is more
important than dagesh, etc, the easiest thing is to make it a primary
difference; but that would make
Zayin Yod PeFinal
sort before all words
Zayin Yod Pe XXX
But I'm guessing that is probably not desired for Hebrew.
</MARK>
Why? This is exactly what I desire for Hebrew. But I am afraid that
making primary differences for Final vs non-Final will make searches using
in most cases, the difference between Final vs non-Final must be ignored
for searches.
<MARK>
In ICU we could make Final vs non-Final be a secondary difference, and have
Dagesh, etc. be tertiary differences. The disadvantage is that people tend to
expect the 2nd level to be 'accent-like', and there might be more
inconsistencies in practice than you would gain by having the current
situation.
</MARK>
I don't think that enough experience has accumulated to create people's
expectations. If this is the right thing (and I think it is), it is still
early enough to do it now.
<MARK>
In Unicode, the UCA has more production restrictions as per
http://www.unicode.org/reports/tr10/tr10-10.html#Tertiary_Weight_Table, so it
would be a bit harder to make that change.
So if SII would like this change, I'd recommend that we make the ordering change
in UCA (which will then affect ICU), but not make a strength change (it would
have to be extremely exotic for that to make a difference).
</MARK>
Personally, I would go for the strength change, but I understand the
adverse considerations. I will have to take the matter to SII.
<MARK>
B. Dagesh
2) There is something strange in the combinations of Shin with Dagesh and
dots: for all other letters, the form without Dagesh sorts before the form
with Dagesh. But Shin with Sin/Shin dot sort after their corresponding
combinations with Dagesh. I cannot imagine a justification for that.
We have currently in UCA the following (from UCA 4.0.0d1 (beta))
05B0 ; [.0000.00B2.0002.05B0] # HEBREW POINT SHEVA
05B1 ; [.0000.00B3.0002.05B1] # HEBREW POINT HATAF SEGOL
05B2 ; [.0000.00B4.0002.05B2] # HEBREW POINT HATAF PATAH
05B3 ; [.0000.00B5.0002.05B3] # HEBREW POINT HATAF QAMATS
05B4 ; [.0000.00B6.0002.05B4] # HEBREW POINT HIRIQ
05B5 ; [.0000.00B7.0002.05B5] # HEBREW POINT TSERE
05B6 ; [.0000.00B8.0002.05B6] # HEBREW POINT SEGOL
05B7 ; [.0000.00B9.0002.05B7] # HEBREW POINT PATAH
05B8 ; [.0000.00BA.0002.05B8] # HEBREW POINT QAMATS
05B9 ; [.0000.00BB.0002.05B9] # HEBREW POINT HOLAM
05BB ; [.0000.00BC.0002.05BB] # HEBREW POINT QUBUTS
05BC ; [.0000.00BD.0002.05BC] # HEBREW POINT DAGESH OR MAPIQ
05BF ; [.0000.00C0.0002.05BF] # HEBREW POINT RAFE
05C1 ; [.0000.00C1.0002.05C1] # HEBREW POINT SHIN DOT
05C2 ; [.0000.00C2.0002.05C2] # HEBREW POINT SIN DOT
FB1E ; [.0000.00C3.0002.FB1E] # HEBREW POINT JUDEO-SPANISH VARIKA
To make this change, we would move Dagesh to after SIN DOT. Question: should it
also go after VARIKA or not?
</MARK>
I am no expert in Judeo-Spanish, but since FB1E Varika is a glyph variant
of 05BF Rafe, it makes sense that Dagesh be in the same relation to both,
so Dagesh should go after Varika.
Shalom (Regards), Mati
Bidi Architect
Globalization Center Of Competency - Bidirectional Scripts
IBM Israel
Phone: +972 2 5888802 Fax: +972 2 5870333 Mobile: +972 52 554160
Peter Kirk
2003-08-21 17:05:23 UTC
Permalink
Post by Mark Davis
- change the non-final characters to <isolated>
- moving dagesh to after RAFE, SHIN DOT, SIN DOT, VARIKA
I'll talk to Ken about whether we have time to get them into UCA 4.0.0, and in
any event we can get them into ICU 2.8 for Hebrew.
These changes are certainly a move in the right direction, but only part
of the way. If we can get these in quickly, that would be good. But we
mustn't let things rest there.
Post by Mark Davis
As far as the strength issue of final vs dagesh, I don't think we should take
any immediate action. The collation strength also affects matching. If a user
sets the sorting or matching level to "ignore accents", for example, they
probably expect the dots to be ignored then, as well as graves, acutes, etc. If
this showed up in a lot of words, then it would still be worth doing, I suspect.
But because the number of cases is so very small where you would have a
combination of dageshes and finals that would make a difference, ...
Yes, the number of cases where the relative ordering of dagesh and final
forms is important is vanishingly small, because final forms are nearly
always predictable anyway.

Nevertheless, this is an important issue. It is important, certainly in
the biblical context, that the difference between regular and final
forms is ignored in a basic "ignore accents" type of search. And Mati
seems to agree: he wrote: "in most cases, the difference between Final
vs non-Final must be ignored for searches". Compare for example ignoring
upper and lower case differences in English. I would propose putting
the final/non-final difference at the same level as that one.
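
To make the levels concrete, here is a toy sketch of matching at different strengths under the arrangement described in the thread (dagesh as a secondary difference, final form as a tertiary one, like case in English). All weights and the helpers `sort_key` and `matches` are invented for illustration and are not DUCET values.

```python
KAF, FINAL_KAF, DAGESH = "\u05DB", "\u05DA", "\u05BC"

# (primary, secondary, tertiary) per character; all weights are made up.
ELEMENTS = {
    KAF: (100, 1, 1),
    FINAL_KAF: (100, 1, 2),  # differs from KAF only at the tertiary level
    DAGESH: (0, 50, 1),      # ignorable at the primary level, like an accent
}


def sort_key(text, strength=3):
    """Build a key from the first `strength` levels, skipping zero weights."""
    elems = [ELEMENTS[ch] for ch in text]
    return tuple(
        tuple(e[level] for e in elems if e[level])
        for level in range(strength)
    )


def matches(a, b, strength):
    """True if a and b are equal when compared at the given strength."""
    return sort_key(a, strength) == sort_key(b, strength)


for strength in (1, 2, 3):
    print(
        strength,
        matches(KAF, FINAL_KAF, strength),
        matches(KAF, KAF + DAGESH, strength),
    )

full_order = sorted([KAF + DAGESH, FINAL_KAF, KAF], key=sort_key)
```

With these weights an "ignore accents" (secondary-off) search treats all three forms as equal, a secondary-strength search still equates kaf with final kaf, and at full strength the order is kaf, final kaf, kaf with dagesh, so dagesh outweighs the final form.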
Post by Mark Davis
... I would
recommend that SII approach this very carefully. If we are going to do anything,
it should be in the next version of UCA so that we have time to consider all of
the ramifications. I would not recommend it for ICU 2.8 either, even though we
have more time (and flexibility) there.
Indeed. The issue is a lot more complex than it seems here.
Post by Mark Davis
I am no expert in Judeo-Spanish, but since FB1E Varika is a glyph variant
of 05BF Rafe, it makes sense that Dagesh be in the same relation to both,
so Dagesh should go after Varika.
From that, it would also appear that VARIKA should either have the same weight
as RAFE or at least be adjacent to it. This would only be an issue for users
of that character, so it is probably difficult to establish the right behavior, and
thus one we would not even try to get into UCA this round.
We should probably take this discussion off of ***@unicode.org, and just
have it on ***@unicode.org and ***@unicode.org. Any people interested in
this topic should be on those groups anyway.
Agreed. But it seems, Mark, that you are not on the Hebrew list, as your
posting has not reached there. So I am copying your whole posting, plus
my additions, to the Hebrew list.

By the way, I am not on the bidi group because I am interested mainly
in the kinds of Hebrew issues which are independent of specific bidi
matters. Am I in fact missing out on important discussion of Hebrew?
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




Mark Davis
2003-08-21 17:20:11 UTC
There will be future versions of the UCA, so the door is not closed. The key is
to start discussing the trickier issues long in advance.

The bidi list has traditionally been a place for Arabic or Hebrew issues to be discussed.
Fine details of Hebrew biblical usage are probably better on the Hebrew list,
then raised to the bidi list once there is more consensus.

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Peter Kirk" <***@qaya.org>
To: "Mark Davis" <***@jtcsv.com>
Cc: "Matitiahu Allouche" <***@il.ibm.com>; <***@unicode.org>;
<***@unicode.org>; "Vladimir Weinstein" <***@us.ibm.com>
Sent: Thursday, August 21, 2003 10:05
Subject: Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)
Post by Peter Kirk
Post by Mark Davis
- change the non-final characters to <isolated>
- moving dagesh to after RAFE, SHIN DOT, SIN DOT, VARIKA
I'll talk to Ken about whether we have time to get them into UCA 4.0.0, and in
any event we can get them into ICU 2.8 for Hebrew.
These changes are certainly a move in the right direction, but only part
of the way. If we can get these in quickly, that would be good. But we
mustn't let things rest there.
Post by Mark Davis
As far as the strength issue of final vs dagesh, I don't think we should take
any immediate action. The collation strength also affects matching. If a user
sets the sorting or matching level to "ignore accents", for example, they
probably expect the dots to be ignored then, as well as graves, acutes, etc. If
this showed up in a lot of words, then it would still be worth doing, I suspect.
But because the number of cases is so very small where you would have a
combination of dageshes and finals that would make a difference, ...
Yes, the number of cases where the relative ordering of dagesh and final
forms is important is vanishingly small, because final forms are nearly
always predictable anyway.
Nevertheless, this is an important issue. It is important, certainly in
the biblical context, that the difference between regular and final
forms is ignored in a basic "ignore accents" type of search. And Mati
seems to agree: he wrote: "in most cases, the difference between Final
vs non-Final must be ignored for searches". Compare for example ignoring
upper and lower case differences in English. I would propose putting
the final/non-final difference at the same level as that one.
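To make the strength question concrete, here is a minimal sketch of multi-level key comparison. It assumes a toy table in which final forms differ from regular forms only at the tertiary level (the level that carries case in English) and dagesh is a secondary difference. The weights, the table `TOY_ELEMENTS`, and the helpers `sort_key` and `matches` are all hypothetical and are not taken from the DUCET or from ICU.

```python
# Toy collation elements: (primary, secondary, tertiary).
# The weights are illustrative only; they are NOT copied from the DUCET.
TOY_ELEMENTS = {
    "\u05DB": (0x10, 0x20, 0x02),  # KAF
    "\u05DA": (0x10, 0x20, 0x19),  # FINAL KAF: differs from KAF only at level 3
    "\u05E4": (0x11, 0x20, 0x02),  # PE
    "\u05E3": (0x11, 0x20, 0x19),  # FINAL PE: differs from PE only at level 3
    "\u05BC": (0x00, 0xBD, 0x02),  # DAGESH: primary-ignorable, secondary weight
}


def sort_key(text, strength=3):
    """Build a multi-level key, keeping only the first `strength` levels."""
    elements = [TOY_ELEMENTS[ch] for ch in text]
    levels = []
    for level in range(strength):
        # A zero weight is ignorable at that level and is skipped.
        levels.append(tuple(ce[level] for ce in elements if ce[level] != 0))
    return tuple(levels)


def matches(a, b, strength):
    """True if the two strings are equal when compared at `strength`."""
    return sort_key(a, strength) == sort_key(b, strength)


# An "ignore accents" search compares at strength 1.
print(matches("\u05E4", "\u05E3", 1))        # PE vs FINAL PE
print(matches("\u05E4", "\u05E4\u05BC", 1))  # PE vs PE + DAGESH
# At strength 2 the dagesh counts but the final form still does not.
print(matches("\u05E4", "\u05E3", 2))
print(matches("\u05E4", "\u05E4\u05BC", 2))
```

In this toy arrangement a search at strength 1 or 2 treats regular and final forms as equal, much as a case-insensitive search treats "a" and "A" as equal, while dagesh starts to matter at strength 2. Moving the final/non-final distinction to a stronger level would change which searches ignore it.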
Post by Mark Davis
... I would
recommend that SII approach this very carefully. If we are going to do anything,
it should be in the next version of UCA so that we have time to consider all of
the ramifications. I would not recommend it for ICU 2.8 either, even though we
have more time (and flexibility) there.
Indeed. The issue is a lot more complex than it seems here.
Post by Mark Davis
I am no expert in Judeo-Spanish, but since FB1E Varika is a glyph variant
of 05BF Rafe, it makes sense that Dagesh be in the same relation to both,
so Dagesh should go after Varika.
From that, it would also appear that VARIKA should either have the same weight
as RAFE or at least be adjacent to it. This would only be an issue for users
of that character, so probably difficult to establish the right behavior, and
thus one we would not even try to get into UCA this round.
this topic should be on those groups anyway.
Agreed. But it seems, Mark, that you are not on the Hebrew list, as your
posting has not reached there. So I am copying your whole posting, plus
my additions, to the Hebrew list.
By the way, I am not on the bidi group because I am interested mainly
in the kinds of Hebrew issues which are independent of specific bidi
matters. Am I in fact missing out on important discussion of Hebrew?
Post by Mark Davis
Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄
----- Original Message -----
Sent: Thursday, August 21, 2003 00:55
Subject: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)
Hello, Mark!
In order to address your points in order, I will put excerpts of your note
within <MARK> . . . </MARK> tags, and my comments as untagged text.
<MARK>
A. Final.
1) Precedence of Dagesh over Final/non-Final: in the chart, the presence
or absence of Dagesh is a Secondary difference, while Final/non-Final is a
Tertiary difference. This is relevant only for letters Kaf and Pe. My
gut feeling says that Final/non-Final should have precedence over
Dagesh/no-Dagesh.
Note that the number of actual cases where this would make a difference is
probably *very* small.
So there are two issues for final vs non-final: strength and ordering.
A1. Ordering is easy to change; in ICU or UCA we could put the final values
before the independent letters. In ICU they are just rules, while in UCA they
follow http://www.unicode.org/reports/tr10/tr10-10.html#Tertiary_Weight_Table.
The easiest in UCA would be to give the 5 independent forms that have finals
the value <isolated>.
Note: there is one minor fallout in ICU: we optimize the sortkey compression of
tertiary values of NONE; if we change the ordering then each instance of the
<isolated> letters will mean about a 2-3 byte increase in sort-key sizes.
</MARK>
I like giving the value <isolated> to the 5 independent forms that have
finals. As for the increase in sort-key sizes, this is what cheap memory
is made for :-)
<MARK>
A2. For Strength, it is not as clear cut. If Final vs non-Final is more
important than dagesh, etc, the easiest thing is to make it a primary
difference; but that would make
Zayin Yod PeFinal
sort before all words
Zayin Yod Pe XXX
But I'm guessing that is probably not desired for Hebrew.
</MARK>
Why? This is exactly what I desire for Hebrew. But I am afraid that
making primary differences for Final vs non-Final will make searches using
in most cases, the difference between Final vs non-Final must be ignored
for searches.
<MARK>
In ICU we could make Final vs non-Final be a secondary difference, and have
Dagesh, etc. be tertiary differences. The disadvantage is that people tend to
expect the 2nd level to be 'accent-like', and there might be more
inconsistencies in practice than you would gain by having the current
situation.
</MARK>
I don't think that enough experience has accumulated yet to create
expectations among users. If this is the right thing (and I think it is), it
is still early enough to do it now.
<MARK>
In Unicode, the UCA has more production restrictions as per
http://www.unicode.org/reports/tr10/tr10-10.html#Tertiary_Weight_Table, so it
would be a bit harder to make that change.
So if SII would like this change, I'd recommend that we make the ordering change
in UCA (which will then affect ICU), but not make a strength change (it would
have to be extremely exotic for that to make a difference).
</MARK>
Personally, I would go for the strength change, but I understand the
adverse considerations. I will have to take the matter to SII.
<MARK>
B. Dagesh
2) There is something strange in the combinations of Shin with Dagesh and
dots: for all other letters, the form without Dagesh sorts before the form
with Dagesh. But Shin with Sin/Shin dot sort after their corresponding
combinations with Dagesh. I cannot imagine a justification for that.
We have currently in UCA the following (from UCA 4.0.0d1 (beta))
05B0 ; [.0000.00B2.0002.05B0] # HEBREW POINT SHEVA
05B1 ; [.0000.00B3.0002.05B1] # HEBREW POINT HATAF SEGOL
05B2 ; [.0000.00B4.0002.05B2] # HEBREW POINT HATAF PATAH
05B3 ; [.0000.00B5.0002.05B3] # HEBREW POINT HATAF QAMATS
05B4 ; [.0000.00B6.0002.05B4] # HEBREW POINT HIRIQ
05B5 ; [.0000.00B7.0002.05B5] # HEBREW POINT TSERE
05B6 ; [.0000.00B8.0002.05B6] # HEBREW POINT SEGOL
05B7 ; [.0000.00B9.0002.05B7] # HEBREW POINT PATAH
05B8 ; [.0000.00BA.0002.05B8] # HEBREW POINT QAMATS
05B9 ; [.0000.00BB.0002.05B9] # HEBREW POINT HOLAM
05BB ; [.0000.00BC.0002.05BB] # HEBREW POINT QUBUTS
05BC ; [.0000.00BD.0002.05BC] # HEBREW POINT DAGESH OR MAPIQ
05BF ; [.0000.00C0.0002.05BF] # HEBREW POINT RAFE
05C1 ; [.0000.00C1.0002.05C1] # HEBREW POINT SHIN DOT
05C2 ; [.0000.00C2.0002.05C2] # HEBREW POINT SIN DOT
FB1E ; [.0000.00C3.0002.FB1E] # HEBREW POINT JUDEO-SPANISH VARIKA
should it
also go after VARIKA or not?
</MARK>
I am no expert in Judeo-Spanish, but since FB1E Varika is a glyph variant
of 05BF Rafe, it makes sense that Dagesh be in the same relation to both,
so Dagesh should go after Varika.
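As a rough illustration of the proposed reordering, the sketch below parses a few of the table lines quoted above and moves DAGESH to just after VARIKA by reassigning the same set of secondary weights in the new order. The helper names `parse` and `move_after` are hypothetical, and reusing the existing weights is only an assumption made for the example; the real table is produced by the UCA maintainers' own tooling and may assign weights differently.

```python
import re

# A subset of the lines quoted above from the UCA 4.0.0d1 beta table.
SAMPLE = """\
05BB ; [.0000.00BC.0002.05BB] # HEBREW POINT QUBUTS
05BC ; [.0000.00BD.0002.05BC] # HEBREW POINT DAGESH OR MAPIQ
05BF ; [.0000.00C0.0002.05BF] # HEBREW POINT RAFE
05C1 ; [.0000.00C1.0002.05C1] # HEBREW POINT SHIN DOT
05C2 ; [.0000.00C2.0002.05C2] # HEBREW POINT SIN DOT
FB1E ; [.0000.00C3.0002.FB1E] # HEBREW POINT JUDEO-SPANISH VARIKA
"""

# code point ; [.primary.secondary.tertiary.fourth] # name
LINE_RE = re.compile(
    r"^([0-9A-F]{4,6})\s*;\s*"
    r"\[\.([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4,6})\]"
    r"\s*#\s*(.+)$"
)


def parse(text):
    """Parse single-element table lines into dictionaries."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            entries.append({
                "cp": int(m.group(1), 16),
                "secondary": int(m.group(3), 16),
                "name": m.group(6),
            })
    return entries


def move_after(entries, cp, anchor_cp):
    """Move `cp` to just after `anchor_cp`, reusing the existing weights."""
    ordered = sorted(entries, key=lambda e: e["secondary"])
    weights = [e["secondary"] for e in ordered]
    moved = next(e for e in ordered if e["cp"] == cp)
    rest = [e for e in ordered if e is not moved]
    anchor_index = next(i for i, e in enumerate(rest) if e["cp"] == anchor_cp)
    rest.insert(anchor_index + 1, moved)
    # Hand the same weights out again in the new order.
    return [dict(e, secondary=w) for e, w in zip(rest, weights)]


for entry in move_after(parse(SAMPLE), 0x05BC, 0xFB1E):
    print(f"{entry['cp']:04X} {entry['secondary']:04X} {entry['name']}")
```

With this arrangement RAFE, SHIN DOT, SIN DOT and VARIKA all receive lower secondary weights than DAGESH, so a letter carrying only a Shin or Sin dot would sort before the same letter with Dagesh.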
Shalom (Regards), Mati
Bidi Architect
Globalization Center Of Competency - Bidirectional Scripts
IBM Israel
Phone: +972 2 5888802 Fax: +972 2 5870333 Mobile: +972 52 554160
--
Peter Kirk
http://www.qaya.org/