UTS #10 : comment on Hangul Jamo(Letter) collation

Discussion:

Jungshik Shin

2003-08-25 16:38:17 UTC

Hello,

I've just submitted via the web feedback form at Unicode.org the following
comment on Hangul Letter(Jamos) collation in UTS #10. I believe most, if
not all, issues were resolved at least between Mark and me back in May,
but nonetheless I guess it has to be formally submitted to be considered
by the UTC. I'm also sending it to the Unicode list and WG20 list because
I'm afraid in the web form, lines were wrapped rather badly, which makes
it a bit hard to read my submission.

Jungshik

P.S. My email forwarding service provider has had trouble keeping the
machine up under the flood of emails (infected with W32/Sobig.F) I've been
getting (at the peak, it was 50/minute). I was taken off the unicode
list last weekend and had to resubscribe. Please use
jshin aet i18nl10 daht com if you want to reply to me off-line.

P.P.S. My comment is geared toward the collation as widely used in South
Korea. North Korea uses a different sorting order, which requires
a separate tailoring as outlined in Kent's work.

Enc. my comment on UTS #10.

Re: Public issue #14 Unicode Collation Algorithm 4.0.0 Beta

Sorting Hangul letters (Jamos) according to the current version
of allkeys.txt is rather like sorting Latin letters according to
the Unicode 4.0 code points. Because this is well known, UTS #10
goes to some length to explain how to properly sort Hangul letters (Jamos).
However, as it stands, there are a few issues to be clarified.

In mid May this year after a proposed update of UTS #10 had been posted,
there was a thread of discussion about treatment of Hangul letters (Jamos)
in UCA. In the thread, I raised the following issues (the interleaving issue
and the different treatment of cluster jamos depending on whether they're
given separate code points of their own in the U+1100 block or have to
be represented as sequences of encoded Jamos).

After a thread of emails exchanged, Mark Davis and I found that both of us
are more or less on the same page as to how Hangul letters should be collated.
In summary,

1. Weights for T, V, and L should be assigned in such a way that
T < V < L for all T, V, and L's

2. Expand precomposed (cluster) Jamos into sequences of component
basic Jamos

3. Terminate every syllable with 'TERM' that has a lower weight than
all T's (there's an alternative to this, but both of us favor this
more than the alternative)
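
The three rules above can be sketched in Python as follows. This is only a minimal illustration, not the tailoring itself: the numeric weights, the partial EXPAND table, and the names jamo_class and sort_key are all assumptions made for this example, and code point order inside each class merely stands in for real primary weights. Only the relation TERM < T < V < L is taken from the summary.

```python
# Hypothetical weights: only the relation TERM < T < V < L comes from the text.
TERM = 1
BASE = {"T": 0x1000, "V": 0x2000, "L": 0x3000}
FIRST = {"L": 0x1100, "V": 0x1160, "T": 0x11A8}

# Partial, illustrative expansion table for cluster Jamos (rule 2).
EXPAND = {
    "\u1101": "\u1100\u1100",  # SSANGKIYEOK  -> KIYEOK KIYEOK
    "\u1113": "\u1102\u1100",  # NIEUN-KIYEOK -> NIEUN KIYEOK
    "\u111A": "\u1105\u1112",  # RIEUL-HIEUH  -> RIEUL HIEUH
    "\u116A": "\u1169\u1161",  # WA           -> O A
}


def jamo_class(ch):
    """Classify a conjoining Jamo as leading (L), vowel (V) or trailing (T)."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return "L"
    if 0x1160 <= cp <= 0x11A7:
        return "V"
    if 0x11A8 <= cp <= 0x11FF:
        return "T"
    raise ValueError(f"not a conjoining Jamo: U+{cp:04X}")


def sort_key(text):
    """Primary key: expand clusters, weight each Jamo, terminate syllables."""
    expanded = "".join(EXPAND.get(ch, ch) for ch in text)
    key, prev = [], None
    for ch in expanded:
        cls = jamo_class(ch)
        # An L that follows a V or T starts a new syllable (rule 3).
        if cls == "L" and prev in ("V", "T"):
            key.append(TERM)
        key.append(BASE[cls] + ord(ch) - FIRST[cls])
        prev = cls
    if prev is not None:
        key.append(TERM)  # terminate the last syllable as well
    return key


words = ["\u1103\u1161", "\u1113\u1161", "\u1100\u1161\u11A8",
         "\u1102\u1161", "\u1100\u1161\u1102\u1161", "\u1100\u1161"]
for w in sorted(words, key=sort_key):
    print(" ".join(f"U+{ord(c):04X}" for c in w))
```

With these assumed weights, a syllable starting with U+1113 falls between those starting with U+1102 and U+1103, and a syllable without a trailing consonant sorts before the same syllable with one, even when another syllable follows.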

While Hangul collation issue is being worked out with ISO/IEC
JTC1/SC22/WG20, I'd like the above tailoring (which is rather straightforward
in my opinion) to be laid out clearly in UTS #10 along with alternatives
(if the authors wish to). I'm also wondering if allkeys.txt with the
above tailoring can be released.

Thank you for your consideration.

P.S. The following is a recap of emails exchanged about the issues.

JS> Specifically, U+1102 (ᄂ) HANGUL CHOSEONG NIEUN, U+1103 (ᄃ) HANGUL
JS> CHOSEONG TIKEUT and U+1113 (ᄓ) HANGUL CHOSEONG NIEUN-KIYEOK are given
JS> the primary weight of 1832, 1833 and 1844, respectively. With these,
JS> U+1113 (ᄓ) HANGUL CHOSEONG NIEUN-KIYEOK will be sorted after U+1103
JS> (ᄃ) HANGUL CHOSEONG TIKEUT, right? Or am I missing something (I
JS> haven't read UTS #10 through, yet)?
JS>
JS> The order is different from the way (South) Koreans (at least, most
JS> Korean dictionary editors) expect them to be sorted. We expect U+1113
JS> (ᄓ) HANGUL CHOSEONG NIEUN-KIYEOK (and other cluster consonants whose
JS> first component is U+1102 (ᄂ) HANGUL CHOSEONG NIEUN. They're U+1114
JS> (ᄔ) HANGUL CHOSEONG SSANGNIEUN, U+1115 (ᄕ) HANGUL CHOSEONG
JS> NIEUN-TIKEUT, U+1116 (ᄖ) HANGUL CHOSEONG NIEUN-PIEUP) to be put after
JS> U+1102 (ᄂ) HANGUL CHOSEONG NIEUN but before U+1103 (ᄃ) HANGUL CHOSEONG
JS> TIKEUT. The same is true of any cluster Jamos.
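
The ordering problem described in the quote above can be checked with a few lines of Python. The sketch uses only the three primary weights quoted in the message (they are copied from the text, not re-checked against allkeys.txt); the names PRIMARY, by_weight and expected are made up for this illustration.

```python
# Primary weights as quoted in the message above.
PRIMARY = {
    "\u1102": 1832,  # HANGUL CHOSEONG NIEUN
    "\u1103": 1833,  # HANGUL CHOSEONG TIKEUT
    "\u1113": 1844,  # HANGUL CHOSEONG NIEUN-KIYEOK
}

# Order obtained by comparing the quoted primary weights.
by_weight = sorted(PRIMARY, key=PRIMARY.get)

# Order described as expected: NIEUN-KIYEOK right after NIEUN, before TIKEUT.
expected = ["\u1102", "\u1113", "\u1103"]

print([f"U+{ord(c):04X}" for c in by_weight])
print("matches expected order:", by_weight == expected)
```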

JS> In the first approach, the treatment of cluster Jamos depends on
JS> whether they're assigned separate code points or not. For instance,
JS> U+1113(ᄓ : HANGUL CHOSEONG NIEUN-KIYEOK) is treated in a different
JS> way from a cluster Jamo (HANGUL CHOSEONG NIEUN-SIOS) of which the only
JS> possible representation is the sequence of U+1102(ᄂ : HANGUL CHOSEONG
JS> NIEUN) and U+1109(ᄉ : HANGUL CHOSEONG SIOS) [1]. Moreover, depending on
JS> implementations, U+1113(ᄓ : HANGUL CHOSEONG NIEUN-KIYEOK) and the
JS> sequence of U+1102(ᄂ : HANGUL CHOSEONG NIEUN) and U+1100 (ᄀ : HANGUL
JS> CHOSEONG KIYEOK) can be treated differently. This is in contrast
JS> to the treatment of Latin/Greek/Cyrillic letters with diacritic marks.
JS> For them, whether precomposed letters (base + diacritic marks) are
JS> separately encoded or not and whether they're represented by precomposed
JS> characters or base + diacritics don't affect their collation.

Mark Davis responded to that as follows:

MD> 1. If you reorder all T < V < L, then when you get a sequence:
MD>
MD> L V
MD> L L
MD>
MD> and the L's are equal, then the second is always greater.
MD>
MD> 2. The same goes for:
MD>
MD> L V T
MD> L V V
MD>
MD> With all V's greater than all T's, then any sequences that are equal
MD> up to the T/V comparison will take the right ordering.
MD>
MD> 3. The problem is then only with sequences like:
MD>
MD> L V X
MD> L V T
MD>
MD> If X is not a Jamo, or starts a new syllable, then you have to make
MD> sure that X is always less than T. There are two ways to do this:
MD>
MD> 3a. terminate every syllable.
MD> 3b. make V & T higher than all X (including L).
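
Mark's three cases can be tried out with a small Python sketch. It works on Jamo classes only, with one assumed weight per class (so all L's are equal, as in case 1), and the function name key and the terminate flag are inventions for this example. It illustrates option 3a; option 3b is not modelled.

```python
# Hypothetical weights: only TERM < T < V < L is taken from the discussion.
W = {"TERM": 0, "T": 1, "V": 2, "L": 3}


def key(classes, terminate):
    """Primary key for a string of Jamo classes such as "LVLV"."""
    out, prev = [], None
    for c in classes:
        # With 3a, close the previous syllable before a new L begins.
        if terminate and c == "L" and prev in ("V", "T"):
            out.append(W["TERM"])
        out.append(W[c])
        prev = c
    if terminate and prev is not None:
        out.append(W["TERM"])
    return out


# Cases 1 and 2 already work without any terminator.
print(key("LV", False) < key("LL", False))     # case 1
print(key("LVT", False) < key("LVV", False))   # case 2

# Case 3 with X = an L that starts a new syllable: "LV" + "LV" versus "LVT".
print(key("LVLV", False) < key("LVT", False))  # without TERM
print(key("LVLV", True) < key("LVT", True))    # with TERM (3a)
```

Without the terminator the third comparison goes the wrong way, because the following L outweighs the T; with TERM inserted at the syllable boundary the open syllable sorts first.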

JS> Another missing part in my eyes is as to how to deal with U+111A(ᄚ :
JS> HANGUL CHOSEONG RIEUL-HIEUH) and the sequence of U+1105(ᄅ : HANGUL
JS> CHOSEONG RIEUL) and U+1112(ᄒ: HANGUL CHOSEONG HIEUH). IMO, they
JS> should be treated identically, but UTS 10(draft) is rather silent on
JS> that perhaps deferring to tailorings.

Further along, he also wrote, in response to my question (as shown right
above), that [1]

MD> 1. For the "precomposed" jamos, there are two solutions.
MD>
MD> Suppose we have:
MD>
MD> U+1105(ᄅ: HANGUL CHOSEONG RIEUL) => X
MD> U+1112(ᄒ: HANGUL CHOSEONG HIEUH) => Y
MD>
MD> a. decompose them.
MD>
MD> U+111A(ᄚ: HANGUL CHOSEONG RIEUL-HIEUH) => X Y

MD> b. interleave them and treat their constituent sequences as
MD> contractions.
MD>
MD> U+111A(ᄚ: HANGUL CHOSEONG RIEUL-HIEUH) => X'
MD> U+1105(ᄅ: HANGUL CHOSEONG RIEUL), U+1112(ᄒ: HANGUL CHOSEONG HIEUH)
MD> => X'
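
The two solutions can be put side by side in a short Python sketch. The weights X, Y and X_PRIME are arbitrary placeholders chosen for this example, and key_a and key_b are hypothetical helper names; the only point is that under either solution the precomposed Jamo and the two-Jamo sequence end up with the same primary key.

```python
RIEUL, HIEUH, RIEUL_HIEUH = "\u1105", "\u1112", "\u111A"

# Placeholder primary weights (assumed values, not from any table).
X, Y = 0x3050, 0x30D0
X_PRIME = 0x3058  # interleaved weight for the cluster, placed after X

# (a) Decompose: the cluster gets the weights of its components.
DECOMPOSE = {RIEUL: [X], HIEUH: [Y], RIEUL_HIEUH: [X, Y]}


def key_a(text):
    return [w for ch in text for w in DECOMPOSE[ch]]


# (b) Interleave: the cluster has its own weight, and the component
# sequence is treated as a contraction mapping to that same weight.
CONTRACT = {RIEUL: [X], HIEUH: [Y],
            RIEUL_HIEUH: [X_PRIME], RIEUL + HIEUH: [X_PRIME]}


def key_b(text):
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in CONTRACT:  # longest match first
            out += CONTRACT[pair]
            i += 2
        else:
            out += CONTRACT[text[i]]
            i += 1
    return out


print(key_a(RIEUL_HIEUH) == key_a(RIEUL + HIEUH))
print(key_b(RIEUL_HIEUH) == key_b(RIEUL + HIEUH))
```

Solution (b) needs one table entry per known cluster and per spelling of it, which is the inflexibility raised in the reply that follows.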

In addition, he wrote that he's more in favor of (a) than (b). I also wrote
that I prefer (a) to (b) because of the following problem with (b).

JS> What I don't like is the inflexibility of having to collect all the
JS> known occurrence of cluster Jamos and giving each of them the
JS> primary weight in such a way (interleaving) that they can get
JS> collated the way expected by (South) Koreans

Mark also wrote the following, which I missed at first. As a result,
I wrote some more articles [2] until it was finally clarified in
the last article in the thread [3].

MD> I agree that longer sequences should expand in weights to be
MD> equivalent, and that this should be done in the UCA. As I said, it is
MD> just taking a while working with WG20*, and in the meantime people
MD> need to tailor it.

[1] http://www.unicode.org/mail-arch/unicode-ml/y2003-m05/0362.html
[2] http://www.unicode.org/mail-arch/unicode-ml/y2003-m05/0364.html

JS> To take the same example as I took in my previous email, I don't see
JS> how S1,S2 and S3 could be sorted S1 < S2 < S3 (instead of S1 < S3 < S2)
JS> without contracting the sequence of 'U+1169 (ㅗ:HANGUL JUNGSEONG O)
JS> U+1163 (ㅑ:HANGUL JUNGSEONG YA)'?
JS>
JS> S1: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+1169 (ㅗ:HANGUL JUNGSEONG O)
JS> U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
JS> S2: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+116A (ㅘ:HANGUL JUNGSEONG WA)
JS> U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
JS> S3: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+1169 (ㅗ:HANGUL JUNGSEONG O)
JS> U+1163 (ㅑ:HANGUL JUNGSEONG YA) U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
JS>
JS> where the primary weights of each Jamo are given as following,
JS>
JS> U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) : 301
JS> U+1161 (ㅏ:HANGUL JUNGSEONG A) : 201
JS> U+1163 (ㅑ:HANGUL JUNGSEONG YA) : 231
JS> U+1169 (ㅗ:HANGUL JUNGSEONG O) : 251
JS> U+116A (ㅘ:HANGUL JUNGSEONG WA) : 255
JS> U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK) : 101
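
The example above can be reproduced in Python using exactly the primary weights listed in the message. The contraction weight 256 for the sequence O + YA is an assumption added for this sketch (any value above the 255 given to WA would do), as are the names key and key_contracted.

```python
# Primary weights as listed in the message above.
W = {
    "\u1100": 301,  # CHOSEONG KIYEOK
    "\u1161": 201,  # JUNGSEONG A
    "\u1163": 231,  # JUNGSEONG YA
    "\u1169": 251,  # JUNGSEONG O
    "\u116A": 255,  # JUNGSEONG WA
    "\u11A8": 101,  # JONGSEONG KIYEOK
}

S = {
    "S1": "\u1100\u1169\u11A8",
    "S2": "\u1100\u116A\u11A8",
    "S3": "\u1100\u1169\u1163\u11A8",
}

O_YA = "\u1169\u1163"
O_YA_WEIGHT = 256  # hypothetical contraction weight, chosen above WA


def key(text):
    """One primary weight per Jamo, no contraction."""
    return [W[ch] for ch in text]


def key_contracted(text):
    """Same as key(), but the sequence O + YA gets a single weight."""
    out, i = [], 0
    while i < len(text):
        if text[i:i + 2] == O_YA:
            out.append(O_YA_WEIGHT)
            i += 2
        else:
            out.append(W[text[i]])
            i += 1
    return out


print(sorted(S, key=lambda n: key(S[n])))
print(sorted(S, key=lambda n: key_contracted(S[n])))
```

Per-Jamo weights alone give S1 < S3 < S2, since O (251) is compared with WA (255) before YA is ever looked at; with the assumed contraction the order becomes S1 < S2 < S3.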

[3]
http://www.unicode.org/mail-arch/unicode-ml/y2003-m05/0426.html


To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Kent Karlsson

2003-08-26 09:01:22 UTC


Post by Jungshik Shin
Sorting Hangul letters (Jamos) according to the current version
of allkeys.txt is rather like sorting Latin letters according to
the Unicode 4.0 code points. Because this is well known, UTS #10
goes to some length to explain how to properly sort Hangul letters (Jamos).
However, as it stands, there are a few issues to be clarified.
In mid May this year after a proposed update of UTS #10 had been posted,
there was a thread of discussion about treatment of Hangul letters (Jamos)
in UCA. In the thread, I raised the following issues (the interleaving issue
and the different treatment of cluster jamos depending on whether they're
given separate code points of their own in the U+1100 block or have to
be represented as sequences of encoded Jamos).

You may wish to look at
http://std.dkuug.dk/JTC1/SC22/WG20/docs/n1051-hangulsort.pdf
which contains a much updated version of my paper on the subject.
The table entries are also found in plain text form at
http://std.dkuug.dk/JTC1/SC22/WG20/docs/n1051t-table-hangulctt6.txt
(the "28" at the end is spurious...)

Post by Jungshik Shin
After a thread of emails exchanged, Mark Davis and I found that both of us
are more or less on the same page as to how Hangul letters should be collated.
In summary,
1. Weights for T, V, and L should be assigned in such a way that
T < V < L for all T, V, and L's

That would be L < T < V; but that is complicated by the actual need for
(the superficially contradictory) V < L < T < V, with the latter T and V
after all scripts. The Vs at two radically different positions in the table
are for different positions of the V in a syllable; V < L is for the first V in
a syllable, T < V is for non-first Vs in a syllable.

Post by Jungshik Shin
2. Expand precomposed (cluster) Jamos into sequences of component
basic Jamos

Needed for covering all combinations of Jamos. If limited to (a superset
of) modern Jamo, this expansion can be avoided. For details, see my paper
referenced above, which lists the weightings and contractions needed for
avoiding this expansion in many (but not all) cases.

Post by Jungshik Shin
3. Terminate every syllable with 'TERM' that has a lower weight than
all T's (there's an alternative to this, but both of us favor this
more than the alternative)

This can be avoided if the weighting is done in a particular way.
See my paper for details.

/kent k


Jungshik Shin

2003-08-30 10:17:35 UTC


On Tue, 26 Aug 2003, Kent Karlsson wrote:

Kent,

Thank you for your work on Korean sorting and sorry for my late reply.
I'll be very brief because I have something urgent to take care of.

Post by Kent Karlsson
You may wish to look at
http://std.dkuug.dk/JTC1/SC22/WG20/docs/n1051-hangulsort.pdf
which contains a much updated version of my paper on the subject.
The table entries are also found in plain text form at
http://std.dkuug.dk/JTC1/SC22/WG20/docs/n1051t-table-hangulctt6.txt

Wow, you've created all these entries. Thanks.

Post by Kent Karlsson

That would be L < T < V; but that is complicated by the actual need for
(the superficially contradictory) V < L < T < V, with the latter T and V
after all scripts.

I'm not following you here. 'T < V < L' works well in Mark's
and my scheme for the most generic form of Korean syllables, 'L+V+T*'
as far as South Korean collation rules are concerned.

Post by Kent Karlsson
The Vs at two radically different positions in the table
are for different positions of the V in a syllable; V < L is for the first V in
a syllable, T < V is for non-first Vs in a syllable.

Aha, you're talking about your scheme.

Post by Kent Karlsson

Post by Jungshik Shin
2. Expand precomposed (cluster) Jamos into sequences of component
basic Jamos

Needed for covering all combinations of Jamos. If limited to (a superset
of) modern Jamo, this expansion can be avoided.

Absolutely.

Post by Kent Karlsson
referenced above, which lists the weightings and contractions needed for
avoiding this expansion in many (but not all) cases.

Post by Jungshik Shin
3. Terminate every syllable with 'TERM' that has a lower weight than
all T's (there's an alternative to this, but both of us favor this
more than the alternative)

This can be avoided if the weighting is done in a particular way.
See my paper for details.

Indeed. However, I'm wondering whether avoiding TERM is a good
trade-off, given that your scheme seems more complex than Mark's and mine
and also requires pre-handling. Could you give me some
rationale behind your preferring yours to the other? Is it because it's
better suited to tailoring for North Korean? I haven't given much thought
to North Korean collation rules recently (at the moment, I would have to look
them up again to refresh my memory).

Jungshik
