kMandarin and kCantonese in Unihan

Discussion:

Anthony Fok

2003-10-07 13:42:09 UTC

Re: Errors Chinese pronunciations in Unihan

In Unihan-4.0.1d1b.txt:

U+4C5B kMandarin XU4M

The trailing "M" is extraneous. I do not know about the actual
pronunciation of the U+4C5B character, however. :-)
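
A stray character like the one above can be caught mechanically. The following is only a minimal Python sketch, assuming each kMandarin reading in this version of the file is a run of uppercase letters (possibly including "Ü") followed by a single tone digit from 1 to 5; that shape is an assumption on my part rather than something taken from the file's documentation, and the function name is hypothetical.

```python
import re

# Assumed shape of one reading: uppercase letters (optionally "Ü"),
# then exactly one tone digit 1-5. This is a guess at the convention.
READING = re.compile(r"[A-ZÜ]+[1-5]")


def malformed_readings(value):
    """Return the space-separated readings that do not match the assumed shape."""
    return [token for token in value.split() if not READING.fullmatch(token)]


if __name__ == "__main__":
    # The value reported above for U+4C5B.
    print(malformed_readings("XU4M"))
```

A check like this only flags the shape of a reading; it cannot say what the correct pronunciation is.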

The Cantonese pronunciations of characters in CJK Extension A seem
problematic. There seems to be a _consistent_ (?) mix-up of "AA" and
"A" (long "a" and short "a"). There also seems to be an _occasional_
(?) mix-up of "J" and "Y" (probably due to the confusion between Yale
and Jyutping romanization?).

For example, if U+3400's kDefinition claims that it is same as U+4E18,
then it should be pronounced as "YAU1", not "JAAU1". (I have no idea
about the "KAAU1" reading.)

U+3558 shows another error. It is listed as "CHAM1 SAM1". Here, only
CHAM1 is incorrect; it should be listed as "CHAAM1 SAM1" instead. SAM1
here means Ginseng. Hmm... speaking of which, its more conventional
forms (U+53C2, U+53C3, U+53C4) are missing the "SAM1" pronunciation as
well as the corresponding "Ginseng" definition!

On the other hand, some "J"s are correct, e.g. "JUNG3" for U+343A.

Some kCantonese pronunciations are joined together. For instance, the
following grep command yields:

$ grep 'kCantonese.*[0-9][A-Z]' Unihan-4.0.1d1b.txt
U+36D3 kCantonese CHI1HEI1 DOU1
U+36DB kCantonese SAAN1DZAAN3
U+3851 kCantonese HAU1DZIU2
U+3997 kCantonese GAAM1GAAM3 KAAM4 NAAP1
U+3BA7 kCantonese WU1WAAT1
U+3C04 kCantonese JIN1DZIN3
U+3C7E kCantonese GOI1HOI1
U+3C80 kCantonese DAAI2 JAAN1DZEUN1 SAAN4
U+3C8E kCantonese DAAU1 LAAU4 SYU1JYU4
U+3CD9 kCantonese GYUN1JYUN5
U+3DD1 kCantonese JAAN1 JIN1 SEUNG1NIM6
U+3E62 kCantonese GA1GO1
U+3F39 kCantonese HONG1HONG1
U+4003 kCantonese DEUI1SEUI1 TEUI4
U+4050 kCantonese JING1JING3
U+4053 kCantonese JUNG1GAI3
U+4167 kCantonese JAAM1JAAM3 JIM3
U+4185 kCantonese CHI4 JI1DAIK1
U+423E kCantonese SAU1SOK3 SE3
U+441F kCantonese HONG6 NGAAU1GONG2
U+4492 kCantonese JAAU4 JIU5 SEUI1WAAI2 TIU4
U+44D6 kCantonese KEA1WU4 KUNG4
U+4543 kCantonese JAAM1JAAM3
U+4CC9 kCantonese DUNG1DAM1 DUNG6
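
The same search can be turned into a rough repair aid. The sketch below, in Python 3.7 or later, assumes that every joined reading is simply two or more readings with the separating space lost, so it splits wherever a tone digit is immediately followed by a letter. That assumption may well be wrong for some entries (a joined pair such as HONG1HONG1 could equally be an accidental duplicate), so the output is a list of candidates for human review, not a fix. The function names are hypothetical.

```python
import re

# Zero-width boundary between a tone digit and the next reading's first letter.
JOINED = re.compile(r"(?<=[0-9])(?=[A-Z])")


def split_joined_readings(value):
    """Split a kCantonese value into readings, breaking apart fused ones."""
    readings = []
    for token in value.split():
        readings.extend(JOINED.split(token))
    return readings


def find_joined(lines):
    """Yield (code point, original value, candidate readings) for suspect lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        # Accept any whitespace between the three columns.
        parts = line.rstrip("\n").split(None, 2)
        if len(parts) != 3 or parts[1] != "kCantonese":
            continue
        code_point, _, value = parts
        candidates = split_joined_readings(value)
        if len(candidates) != len(value.split()):
            yield code_point, value, candidates


if __name__ == "__main__":
    sample = ["U+3C80 kCantonese DAAI2 JAAN1DZEUN1 SAAN4"]
    for code_point, value, candidates in find_joined(sample):
        print(code_point, value, "->", " ".join(candidates))
```

For the U+3C80 line above, split_joined_readings returns the four readings DAAI2, JAAN1, DZEUN1 and SAAN4.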

I also caught the following error by chance:

U+4C8E kCantonese NEOYU5

What is a good place for discussions on these issues? And which
personnel and which sources are involved with esp. the CJK-Ext-A
kCantonese data? It would be nice to talk with the original people to
find out how these errors crept in, e.g. errors of the original source?
Systematic errors due to mistakes in conversion from e.g. Jyutping to
Yale? Inappropriate use of "Fanqie"? Other human errors? etc. so
that we can find good ways to correct these mistakes.

Furthermore, is there something like CVS web or changelogs to see the
history of modifications of Unihan? (when, by whom, and why, from what
source, etc.) What other fixes have been done to Unihan.txt since
19 June 2003?

Many thanks!

Anthony Fok
--
Anthony Fok Tung-Ling
ThizLinux Laboratory <***@thizlinux.com> http://www.thizlinux.com/
Debian Chinese Project <***@debian.org> http://www.debian.org/intl/zh/
Come visit Our Lady of Victory Camp! http://www.olvc.ab.ca/

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Andrew C. West

2003-10-07 16:04:28 UTC

Post by Anthony Fok
What is a good place for discussions on these issues? And which
personnel and which sources are involved with esp. the CJK-Ext-A
kCantonese data? It would be nice to talk with the original people to
find out how these errors crept in, e.g. errors of the original source?
Systematic errors due to mistakes in conversion from e.g. Jyutping to
Yale? Inappropriate use of "Fanqie"? Other human errors? etc. so
that we can find good ways to correct these mistakes.

The latest draft version of the Unihan database (Unihan-4.0.1d1.txt) is
currently subject to public review (see
http://www.unicode.org/versions/beta.html).

This forum is a suitable place for discussing the Unihan database, but in order
to ensure that your errata are taken note of you should report them using the
Unicode reporting form (http://www.unicode.org/unicode/reporting.html) by
October 27.

The failings of the Unihan database have been the subject of much discussion in
the past, especially the kMandarin field which got rather mangled in Unicode
3.1. Happily the 4.0.1d1 version of Unihan fixes most of the kMandarin problems,
although the quality of many of the provided Mandarin readings still leaves much
to be desired. (The Mandarin readings really need to be completely overhauled,
based on a single authoritative source such as _Hanyu Da Zidian_ ... but that's
just my personal opinion).

Post by Anthony Fok
Furthermore, is there something like CVS web or changelogs to see the
history of modifications of Unihan? (when, by whom, and why, from what
source, etc.) What other fixes have been done to Unihan.txt since
19 June 2003?

There is no public CVS repository, but the various incarnations of the Unihan
database may be downloaded from the "Official Unicode Online Data" site at
http://www.unicode.org/Public
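
In the absence of a changelog, two downloaded versions can be compared field by field. The following Python sketch is one way to do that, assuming each data line has three whitespace-separated columns (code point, field name, value) and that comment lines start with "#". The function names are hypothetical, and the sketch says nothing about who made a change or why.

```python
def load_field(lines, field):
    """Map code point -> value for one field (e.g. "kMandarin")."""
    values = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        parts = line.rstrip("\n").split(None, 2)
        if len(parts) == 3 and parts[1] == field:
            values[parts[0]] = parts[2]
    return values


def diff_field(old_lines, new_lines, field):
    """Return {code point: (old value, new value)} for every changed entry.

    A value of None means the entry is absent from that version.
    """
    old = load_field(old_lines, field)
    new = load_field(new_lines, field)
    changes = {}
    # Sort numerically by code point so the report is in code-point order.
    for code_point in sorted(old.keys() | new.keys(), key=lambda cp: int(cp[2:], 16)):
        before, after = old.get(code_point), new.get(code_point)
        if before != after:
            changes[code_point] = (before, after)
    return changes
```

Typical use would be to open two downloaded files, pass the file objects as old_lines and new_lines, and print the resulting dictionary for the field of interest.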

I suppose there won't be another release of Unihan until after the public review
period ends at the end of this month.

Andrew

John Jenkins

2003-10-07 16:45:31 UTC

Post by Andrew C. West
The failings of the Unihan database have been the subject of much discussion in
the past, especially the kMandarin field which got rather mangled in Unicode
3.1. Happily the 4.0.1d1 version of Unihan fixes most of the kMandarin problems,
although the quality of many of the provided Mandarin readings still leaves much
to be desired. (The Mandarin readings really need to be completely overhauled,
based on a single authoritative source such as _Hanyu Da Zidian_ ... but that's
just my personal opinion).

I think it's a reasonable suggestion, but with the usual question when
issues about Unihan.txt come up: who's going to do the work?

With Cantonese, of course, we've got a whole other mess to deal with,
since there is no single, reasonably authoritative source, and while
we're trying to base the Cantonese readings on solid authorities, it
isn't hard to come up with instances where they disagree, particularly
on the tone. And occasionally we have to resort to the "man in the
street" (or the disembodied voice on the Hong Kong subway), since the
characters just haven't made it into any dictionary. (E.g., does
anyone know how to pronounce U+40DF?)

And the Japanese and Korean readings need to be overhauled as well.

Not to mention the kDefinition field. If nothing else, it needs to be
able to distinguish general use, general Chinese, Mandarin, classical
Chinese, Cantonese, Japanese, Korean, and Vietnamese usages, plus, of
course, other Chinese dialects or non-standard forms.

========
John H. Jenkins
***@apple.com
***@mac.com
http://homepage.mac.com/jhjenkins/
