Discussion:
Web Form: Other Question: CJK
Magda Danish (Unicode)
2003-10-10 21:48:58 UTC
Roberto,

I am forwarding your question to the Unicode mailing list for possible
answers from the list's subscribers.

Regards,

Magda Danish
Administrative Director
The Unicode Consortium
650-693-3921


> -----Original Message-----
> Date/Time: Thu Oct 9 10:20:19 EDT 2003
> Contact: ***@ampersoftware.it
> Report Type: Other Question, Problem, or Feedback
>
> Hi all,
> I have a little question:
> Characters in the Unicode range U+4E00 to U+9FFF are Unified
> Ideographs for CJK languages. Does this mean that the characters
> for the Chinese, Japanese and Korean languages are all together?
> If I take a character, for example U+4E01, is it a valid
> character for all three languages?
> My problem is to recognize from the 32-bit value of a Unicode
> character whether it is a Chinese, Korean or Japanese character.
> How can I do this?
>
> I develop international applications under Win98 and Win2000 with
> Visual Studio 6.0.
>
> thanks a lot.
>
> Roberto (ITALY)
>
> -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
> (End of Report)
>
>
>



To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html


Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
Chris Jacobs
2003-10-11 02:38:26 UTC
If you have a scalar value then you can look it up in the Unihan database.

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=4e01

I would not rely on the mappings to major standards to determine the
language. I can imagine that the Chinese standards include some
non-Chinese kanji because they come up in foreign affairs.

I would go for the Phonetic Data.

If there is only an entry for the Cantonese or Mandarin pronunciation,
then surely it is Chinese.
If there is only an entry for the Japanese Kun pronunciation, then surely
it is Japanese.

Do not try to guess the language of a text from just one value. If the
text contains kana, assume the kanji are Japanese too; if the text
contains hangul, assume the kanji are Korean too.

If you don't want to consult the Unihan database over the WWW, the data
files for it are available at ftp://ftp.unicode.org/
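
For anyone working from the downloaded data files rather than the web lookup, here is a minimal Python sketch of reading them. It assumes the layout used by Unihan.txt: one record per line with three tab-separated fields (code point, field name, value), and `#` starting a comment line. The function name `parse_unihan` and the inline `SAMPLE` list are illustrative only; the single sample record is the U+4E01 kAlternateKangXi entry that appears in this thread, and a real run would iterate over the opened file instead.

```python
from collections import defaultdict

def parse_unihan(lines):
    """Collect Unihan records into {code_point: {field: value}}.

    Assumes tab-separated records of the form
    'U+XXXX<TAB>kFieldName<TAB>value'; blank lines and '#' comments are skipped.
    """
    table = defaultdict(dict)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        code, field, value = line.split("\t", 2)
        # 'U+4E01' -> 0x4E01
        table[int(code[2:], 16)][field] = value
    return dict(table)

# Illustrative excerpt, not the full file.
SAMPLE = [
    "# comment lines and blank lines are ignored",
    "",
    "U+4E01\tkAlternateKangXi\t0075.003",
]

entries = parse_unihan(SAMPLE)
print(entries[0x4E01])  # {'kAlternateKangXi': '0075.003'}
```

Note that a field missing from the resulting dictionary only shows that no value is recorded for it in the file, so it is weak evidence at best.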




Edward H. Trager
2003-10-11 15:31:05 UTC
On Friday 2003.10.10 14:48:58 -0700, Magda Danish (Unicode) wrote:
> > Characters in the Unicode range U+4E00 to U+9FFF are Unified
> > Ideographs for CJK languages. Does this mean that the characters
> > for the Chinese, Japanese and Korean languages are all together?

Yes, that's why they are called "unified".

> > If I take a character, for example U+4E01, is it a valid
> > character for all three languages?

Most likely. There are some characters that only occur in
modern simplified Chinese, some that for the most part only occur in modern
traditional Chinese (such as used in Taiwan or Hong Kong), some that only
occur in Japanese.

> > My problem is to recognize from the 32-bit value of a Unicode
> > character whether it is a Chinese, Korean or Japanese character.
> > How can I do this?

You can't, so don't try to do it on a character-by-character basis. It
is useless. As a human looking at a string of text, you can tell what
language it is from the context. Of course you will expect to see
Hiragana or Katakana in Japanese text and Hangul syllables in Korean text.
But there is every possibility that a Korean text might contain embedded
Chinese quotations, or a Japanese text embedded Korean, or ... you
get the idea ...




John Delacour
2003-10-11 16:49:09 UTC
> > My problem is to recognize from the 32-bit value of a Unicode
> > character whether it is a Chinese, Korean or Japanese character.
> > How can I do this?

You can tell that it is NOT in a legacy character set such as
Shift_JIS or Big5 if converting it to that character set fails. Or
you can look it up in Unihan.txt
<http://www.unicode.org/Public/UNIDATA/Unihan.txt> (25 megabytes,
also at the ftp site). There are also Perl routines for getting at
the information.

U+4E01 kAlternateKangXi 0075.003
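
The conversion test described above can be sketched in Python with the standard codecs. This is only an illustration of the idea: the helper name `legacy_sets_containing` and the choice of these four codecs are assumptions, and a successful conversion only shows that a character exists in a legacy set, not which language a text is written in.

```python
# Python's built-in codec names for four common legacy CJK character sets.
LEGACY_CODECS = {
    "shift_jis": "Japanese (JIS X 0208)",
    "big5": "Traditional Chinese",
    "gb2312": "Simplified Chinese",
    "euc_kr": "Korean (KS X 1001)",
}

def legacy_sets_containing(ch):
    """Return the codec names that can encode the single character ch."""
    found = []
    for codec in LEGACY_CODECS:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            continue  # not representable in this legacy set
        found.append(codec)
    return found

# U+4E01 is the character from the question; U+40DF lies in
# CJK Extension A, which none of these legacy sets cover.
for ch in ("\u4e01", "\u40df"):
    print(f"U+{ord(ch):04X}", legacy_sets_containing(ch))
```

A character that converts to all four sets, as U+4E01 does, tells you nothing about language, which is the usual case for common ideographs.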

JD




John Jenkins
2003-10-12 00:19:04 UTC
On 2003-10-10, at 2:48 PM, Magda Danish (Unicode) wrote:

>> My problem is to recognize from the 32-bit value of a Unicode
>> character whether it is a Chinese, Korean or Japanese character.
>> How can I do this?
>>

It's basically impossible and largely meaningless. It's the equivalent
of asking if "a" is an English letter or a French one. There are
*some* characters where one can guess based on the source information
in Unihan.txt that it's traditional Chinese, simplified Chinese,
Japanese, Korean, or Vietnamese, but there are too many exceptions to
make this really reliable. (For example, one particularly nasty
obscenity in Cantonese would probably have never been encoded for
Cantonese, but has made it in for the sake of Korean, where one hopes
it isn't nearly as obscene.)

The phonetic data in Unihan.txt should not be used for this purpose. A
blank in the phonetic data means that nobody's supplied a reading, not
that a reading doesn't exist. Because updating the Unihan database is
an ongoing process, these fields will be increasingly filled out as
time goes on, but they should never be taken as absolutely complete.
In particular, there are obscure characters where it is known that
there *is* a reading, but since the character does not occur in
standard dictionaries, we are unable to supply it (e.g., U+40DF in
Cantonese).

A better solution is to look at the text as a whole: if there's a fair
amount of kana, it's probably Japanese, and if there's a fair amount of
hangul, it's probably Korean.
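
A rough Python sketch of that whole-text heuristic follows. The function names, the `threshold` value standing in for "a fair amount", and the returned labels are all hypothetical; the code only counts characters in the Hiragana/Katakana blocks (U+3040 to U+30FF), the precomposed Hangul syllables (U+AC00 to U+D7A3) and the CJK Unified Ideographs block (U+4E00 to U+9FFF), and it makes no attempt to tell Chinese from anything else.

```python
def count_scripts(text):
    """Count kana, hangul and unified-ideograph characters in text."""
    counts = {"kana": 0, "hangul": 0, "han": 0}
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:      # Hiragana and Katakana blocks
            counts["kana"] += 1
        elif 0xAC00 <= cp <= 0xD7A3:    # precomposed Hangul syllables
            counts["hangul"] += 1
        elif 0x4E00 <= cp <= 0x9FFF:    # CJK Unified Ideographs block
            counts["han"] += 1
    return counts

def guess_language(text, threshold=0.1):
    """Very rough guess: 'ja', 'ko', 'han-only' or 'unknown'.

    threshold is a hypothetical cut-off for "a fair amount".
    """
    c = count_scripts(text)
    total = sum(c.values())
    if total == 0:
        return "unknown"
    if c["kana"] / total >= threshold and c["kana"] >= c["hangul"]:
        return "ja"
    if c["hangul"] / total >= threshold:
        return "ko"
    # Ideographs without kana or hangul: the script alone cannot decide.
    return "han-only"

for sample in ("これは日本語の文章です", "한국어 文章", "中文文章", "hello"):
    print(repr(sample), guess_language(sample))
```

Mixed texts (a Korean article quoting Chinese, say) will still fool it, so treat the result as a hint and prefer explicit language tagging where it is available.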

The only proper mechanism, as for determining whether "chat" is
spelled correctly in English or French, is to use a higher-level
protocol.

========
John H. Jenkins
***@apple.com
***@mac.com
http://homepage.mac.com/jhjenkins/


