unicode on Linux

Shao, Yiying

2003-10-20 19:21:35 UTC

Hi,

I am new to the Linux world. Just wondering if anybody knows how Unicode is on Linux? Is Unicode ready for all languages, including double-byte languages, on Red Hat and SuSE?

Here is something that I read:

Before you start experimenting with UTF-8 under Linux, update your installation to a recent distribution with up-to-date UTF-8 support, such as SuSE 8.1 or Red Hat 8.0. Some earlier distributions provided already at least UTF-8 locales and some ISO10646-1 X11 fonts, but they lacked many of the UTF-8 extensions that have recently been made to numerous application programs. Red Hat Linux 8.0 has already made UTF-8 the default encoding for all locales other than Chinese/Japanese/Korean.

On Red Hat Linux, if UTF-8 is not the default encoding for Chinese/Japanese/Korean, what is it using for those double-byte languages? Do later Red Hat Linux releases make UTF-8 the default encoding for them? Any details regarding UTF-8 on SuSE?

Thanks,
Yiying

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Stefan Persson

2003-10-20 20:14:22 UTC

Post by Shao, Yiying
Just wondering if anybody knows how Unicode is on Linux?

Very good support. Default charset for recent versions of some popular
distributions.

Post by Shao, Yiying
Is Unicode ready for all languages, including double-byte languages, on Red Hat and SuSE?

Yes.

Post by Shao, Yiying
On Red Hat Linux, if UTF-8 is not the default encoding for Chinese/Japanese/Korean, what is it using for those double-byte languages?

The old multi-byte character sets.

Post by Shao, Yiying
Do later Red Hat Linux releases make UTF-8 the default encoding for them?

AFAIK only if you manually set it to a UTF-8 locale, e.g.
LANG=zh_CN.UTF-8. Notice, though, that some older software will not be
aware of this change, so many characters will not be displayed properly.
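
As a rough illustration of what a locale name such as zh_CN.UTF-8 carries, here is a minimal Python sketch that pulls the codeset out of a locale string. It assumes the common POSIX-style layout language[_territory][.codeset][@modifier]; the helper names codeset_of and is_utf8_locale are made up for this example and are not part of any library.

```python
from typing import Optional


def codeset_of(locale_name: str) -> Optional[str]:
    """Return the codeset part of a POSIX-style locale name, or None.

    Assumed layout: language[_territory][.codeset][@modifier].
    """
    # Drop an optional "@modifier" suffix such as "@euro".
    base = locale_name.split("@", 1)[0]
    if "." not in base:
        return None
    return base.split(".", 1)[1]


def is_utf8_locale(locale_name: str) -> bool:
    """True if the locale name asks for UTF-8 (spelled UTF-8 or utf8)."""
    codeset = codeset_of(locale_name)
    if codeset is None:
        return False
    return codeset.replace("-", "").lower() == "utf8"


for name in ("zh_CN.UTF-8", "ja_JP.eucJP", "de_DE@euro", "C"):
    print(name, codeset_of(name), is_utf8_locale(name))
```

Note that this only inspects the name. Whether a given locale is actually installed on a particular system is a separate question that the sketch does not answer.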

Stefan

Stephane Bortzmeyer

2003-10-21 12:43:43 UTC

On Mon, Oct 20, 2003 at 10:14:22PM +0200,

Post by Stefan Persson

Post by Shao, Yiying
Just wondering if anybody knows how Unicode is on Linux?

Very good support.

Very optimistic.

Kernel
*****

1) File names in Unicode: no (well, the Linux kernel is 8-bit clean,
so you can always encode in UTF-8, but the kernel does not do any
normalization and the applications do not expect UTF-8; for instance,
ls sorts alphabetically but does not know Unicode sorting).

2) User names: worse, since the utilities that create an account refuse
UTF-8.

Applications
************

3) grep: no Unicode regexp

4) xterm (or similar virtual terminals): No BiDi support at all

5) shells: I'm not aware of any line-editing shell (zsh, tcsh)
that has Unicode character semantics (back-character should move one
character, not one byte)

6) databases: I'm not aware of a free DBMS which has support for
Unicode sorting (SQL's ORDER BY) or regexps (SQL's LIKE).

7) Serious word processing: LaTeX has only very minimal Unicode support
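
To make point 5 concrete (back-character should move one character, not one byte), here is a minimal Python sketch of the difference on a UTF-8 byte buffer. The function names are hypothetical, and this is not how any particular shell implements line editing; it only shows the idea of stepping back over UTF-8 continuation bytes.

```python
def delete_last_byte(line: bytes) -> bytes:
    """Naive backspace: drops one byte, which may split a UTF-8 character."""
    return line[:-1]


def delete_last_char(line: bytes) -> bytes:
    """Backspace with character semantics on UTF-8 input.

    UTF-8 continuation bytes have the bit pattern 10xxxxxx, so step back
    over them until the lead byte of the last character is reached.
    """
    end = len(line)
    if end == 0:
        return line
    end -= 1
    while end > 0 and (line[end] & 0xC0) == 0x80:
        end -= 1
    return line[:end]


text = "na\u00efve \u65e5\u672c".encode("utf-8")

# Character-wise backspace leaves valid UTF-8 behind.
print(delete_last_char(text).decode("utf-8"))

# Byte-wise backspace cuts the last character in the middle.
try:
    delete_last_byte(text).decode("utf-8")
except UnicodeDecodeError as exc:
    print("byte-wise backspace broke the text:", exc.reason)
```

This works on code points only. A base letter followed by a combining accent is still two code points, so a real line editor would need more than this to behave the way users expect.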

Also, many applications (exmh, emacs) are ten times slower when
running in UTF-8 mode.

At the present time, using Unicode on Unix is an act of faith.

Post by Stefan Persson
Default charset for recent versions of some popular distributions.

Yes, RedHat changed the default charset to Unicode without thinking
that text files were no longer readable.

See:

http://www.cl.cam.ac.uk/~mgk25/unicode.html
ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html
http://melkor.dnp.fmph.uniba.sk/~garabik/debian-utf8/howto.html

Edward H. Trager

2003-10-21 15:32:28 UTC

Post by Stephane Bortzmeyer
On Mon, Oct 20, 2003 at 10:14:22PM +0200,

Post by Stefan Persson

Post by Shao, Yiying
Just wondering if anybody knows how Unicode is on Linux?

Very good support.

I think there can be big debates about whether Linux (or any *nix kernel,
for that matter) has any business normalizing file names. Personally, I think
Unicode normalization is not the kernel's business. This is better left to
userland applications.
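
For readers who have not run into the normalization issue, a small Python sketch may help. It shows that the same visible name can be two different UTF-8 byte sequences, which an 8-bit-clean kernel treats as two different file names, and how a userland application could compare names after normalizing them. The helper same_name is hypothetical, and NFC is just one possible choice of normalization form.

```python
import unicodedata

composed = "caf\u00e9"      # "é" as a single code point, U+00E9
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# The two strings look alike on screen but differ as byte sequences.
print(composed.encode("utf-8"))
print(decomposed.encode("utf-8"))
print(composed.encode("utf-8") == decomposed.encode("utf-8"))


def same_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC (an assumed policy)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_name(composed, decomposed))
```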

Are you sure about ls? ls should sort UTF-8-encoded file names in raw Unicode order,
n'est-ce pas? Of course, that may not be what one wants! Take Chinese for example:
there are many different methods for sorting Chinese used in Chinese dictionaries
(phonetic, radical+stroke count, four corner method, ... ). The order of the unified
Hanzi/Kanji in Unicode used the Kangxi (radical and stroke-count based) dictionary as a primary
basis, and the Dai Kanwa Ziten as a secondary basis. So the result is a hybrid Chinese
plus Japanese ordering. Plus, the CJK Joint Research Group had to deal with the placement
of all of the simplified Chinese characters that were not listed in the historical Kangxi
dictionary (originally compiled between 1710 and 1716). It is nice that Unicode in some
sense preserves the great tradition first established by the KangXi ZiDian, but that sort order
may not be what any one modern native Chinese, Japanese, or other user needs or wants for
his particular purpose. Similar stories exist for other scripts and languages.
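
A short Python sketch of what "raw Unicode order" means in practice: sorting strings by code point gives the same order as sorting their UTF-8 bytes, and that order is often not what a reader expects. The rough_key function below is only a crude stand-in for real language-aware collation (it strips accents and case) and is an assumption made for this example, not a proper implementation of any collation standard.

```python
import unicodedata

# "\u00c4pfel" is "Äpfel" and "Z\u00fcrich" is "Zürich", written with escapes
# so the example does not depend on how this text itself is encoded.
names = ["zebra", "\u00c4pfel", "apple", "Z\u00fcrich"]

# Raw code point order: uppercase ASCII first, accented letters last.
by_code_point = sorted(names)

# Sorting the UTF-8 bytes gives the same order, because UTF-8 preserves
# code point order.
by_utf8_bytes = sorted(names, key=lambda s: s.encode("utf-8"))


def rough_key(s: str) -> str:
    """Crude sort key: decompose, drop combining marks, fold case."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


by_rough_key = sorted(names, key=rough_key)

print(by_code_point)
print(by_utf8_bytes)
print(by_rough_key)
```

None of this helps with the Chinese dictionary orders mentioned above; those need per-language collation data that a simple key function like this cannot provide.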

Post by Stephane Bortzmeyer
2) User names: worse, since the utilities that create an account refuse
UTF-8.
Applications
************
3) grep: no Unicode regexp

What about ICU's regexp package?
(http://oss.software.ibm.com/icu/userguide/regexp.html)
You should be able to use ICU on *any* platform.
Linux does not yet have a Unicode grep
and to my knowledge Windows does not yet have grep at all ...

Most of my pattern searching and string manipulation needs
-- which includes searching through documents and data encoded in UTF-8 --
are fully met using egrep and Perl (I happen to use Linux, but of course
Perl is available on every platform). So it is clear that everything
depends on evaluating one's needs, and then figuring out which software
will meet those needs. There is now enough Unicode-aware software on Linux
to meet many people's needs. See http://eyegene.ophthy.med.umich.edu/unicode/.

Post by Stephane Bortzmeyer
4) xterm (or similar virtual terminals): No BiDi support at all

Use mlterm instead. It has BiDi support and support for complex text
layout as required for Arabic, Indic, and Indic-derived scripts. See
http://eyegene.ophthy.med.umich.edu/unicode/#termemulator .

Post by Stephane Bortzmeyer
5) shells: I'm not aware of any line-editing shell (zsh, tcsh)
that has Unicode character semantics (back-character should move one
character, not one byte)
6) databases: I'm not aware of a free DBMS which has support for
Unicode sorting (SQL's ORDER BY) or regexps (SQL's LIKE).

I thought both Postgres and MySQL already have, or are working on this
issue?

Post by Stephane Bortzmeyer
7) Serious word processing: LaTeX has only very minimal Unicode support

Many would argue that Open Office 1.1 needs to be included in the
category of "serious word processing" and it has good
Unicode support.

Post by Stephane Bortzmeyer
Also, many applications (exmh, emacs) are ten times slower when
running in UTF-8 mode.

exmh is written in Tcl/Tk: isn't everything written in Tcl/Tk sssllowww?
When was the last time that it really mattered how fast your
editor worked? If emacs is slow, use vi ;-) ... oops, I forgot
this might provoke some people (it's a joke)!

Post by Stephane Bortzmeyer
At the present time, using Unicode on Unix is an act of faith.

That is not an accurate statement.

Are you talking about proprietary Unixes or Linux? I thought the
questions were about support on Linux. With regard to Unicode support on
Linux, I completely disagree with you. I use Unicode for serious
work on Linux every day.

Clearly it really depends on what you want to do. And that is the case
on other OSes as well.

Post by Stephane Bortzmeyer

Post by Stefan Persson
Default charset for recent versions of some popular distributions.

Yes, RedHat changed the default charset to Unicode without thinking
that text files were no longer readable.
http://www.cl.cam.ac.uk/~mgk25/unicode.html
ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html
http://melkor.dnp.fmph.uniba.sk/~garabik/debian-utf8/howto.html

Peter Kirk

2003-10-21 17:08:07 UTC

Post by Edward H. Trager
...
and to my knowledge Windows does not yet have grep at all ...

There seem to be various Windows ports of grep available. I have been
using GNU grep on Windows for many years. Well, technically in a DOS box
and a Windows 2000 pseudo-DOS box, but then is there a GUI-based grep on
any platform?
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

Andy Heninger

2003-10-21 17:35:51 UTC

Post by Edward H. Trager

Post by Stephane Bortzmeyer
3) grep: no Unicode regexp

What about ICU's regexp package?
(http://oss.software.ibm.com/icu/userguide/regexp.html)
You should be able to use ICU on *any* platform.

ICU does have unicode regular expressions, but it's a library with an
API, not a grep tool. ICU does have a simple grep-like sample, but
it's intended only as an illustration of how to use the regexp API, and
lacks nearly all the command line options one would expect in a real
grep replacement.
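
To show the gap between a regexp library and a grep tool, here is a minimal grep-like sketch in Python. It uses Python's standard re module rather than ICU (an assumption made only so the example is self-contained), and the function name grep_lines is made up. Like the ICU sample described above, it has none of the options a real grep offers.

```python
import re
from typing import Iterable, Iterator


def grep_lines(pattern: str, lines: Iterable[bytes],
               encoding: str = "utf-8") -> Iterator[str]:
    """Yield decoded lines that match pattern, using character semantics."""
    regex = re.compile(pattern)
    for raw in lines:
        # Decode first so that "." and "\w" match whole characters, not bytes.
        text = raw.decode(encoding, errors="replace").rstrip("\n")
        if regex.search(text):
            yield text


sample = [
    "Gr\u00fc\u00dfe aus K\u00f6ln\n".encode("utf-8"),
    b"plain ASCII line\n",
    "\u65e5\u672c\u8a9e\u306e\u884c\n".encode("utf-8"),
]

# "K.ln" matches "Köln" because "." consumes one character, not one byte.
for hit in grep_lines(r"K.ln", sample):
    print(hit)
```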

Post by Edward H. Trager
and to my knowledge Windows does not yet have grep at all ...

Cygwin! http://www.cygwin.com/

-- Andy Heninger
***@us.ibm.com

Benjamin Peterson

2003-10-21 20:19:10 UTC

Post by Edward H. Trager
and to my knowledge Windows does not yet have grep at all ...

Oh, a curse on Bill Gates and his newfangled Micro$loth systems :) To
_my_ knowledge, however...

There's cygwin.

Or better yet, no cygwin! http://unxutils.sourceforge.net/

Or, GNU grep! http://gnuwin32.sourceforge.net/
There's a different build of GNU grep here:
http://members.ozemail.com.au/~crn/grep.html

Or, you could use the MS equivalent, findstr, which works on multibyte
characters provided it can guess the encoding from the current code page
(i.e. you have to set the code page to 932 to make it work on a Shift-JIS
file, and so on). You'd think you could use it on UTF-8 by setting the
code page to 65001, but it doesn't work for me. On the other hand, it
does recurse into directories.
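
The dependence on the code page can be sketched in Python. This is not how findstr works internally (that is not something I can vouch for); it only illustrates why a text search has to be told the right encoding. The helper contains_text is hypothetical.

```python
def contains_text(data: bytes, needle: str, encoding: str) -> bool:
    """Decode data with the stated encoding, then search for needle."""
    return needle in data.decode(encoding)


word = "\u65e5\u672c\u8a9e"               # "日本語"
sjis_data = word.encode("shift_jis")      # the bytes a Shift-JIS file would hold
utf8_data = word.encode("utf-8")          # the same text as UTF-8

# With the matching encoding the search succeeds in both cases.
print(contains_text(sjis_data, "\u672c", "shift_jis"))
print(contains_text(utf8_data, "\u672c", "utf-8"))

# With the wrong encoding the bytes are not even valid input.
try:
    contains_text(sjis_data, "\u672c", "utf-8")
except UnicodeDecodeError:
    print("Shift-JIS bytes are not valid UTF-8")
```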

Or, there's the DJGPP version of grep: http://www.delorie.com/djgpp/

And related to it, there's the version that uses the PW32 project:
http://pw32.sourceforge.net/

Or, there's cgrep and jgrep; but I don't know what particular encodings
they work with and I don't have the URLs to hand.

Or, there's a modified GNU grep here:
http://www.interlog.com/~tcharron/grep.html

...and so on. I usually use unxutils.
--
Benjamin Peterson
***@imap.cc

Peter Kirk

2003-10-21 16:56:16 UTC

Post by Stephane Bortzmeyer
...
http://www.cl.cam.ac.uk/~mgk25/unicode.html

In this page, Markus Kuhn is damaging his credibility by continuing to
refer in several places to Unicode 3.0, although the page was updated
some time after the release of Unicode 4.0. Is the rest of this material
similarly out of date?
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

Peter Kirk

2003-10-21 18:31:44 UTC

Post by Peter Kirk

Post by Stephane Bortzmeyer
...
http://www.cl.cam.ac.uk/~mgk25/unicode.html

In this page, Markus Kuhn is damaging his credibility by continuing to
refer in several places to Unicode 3.0, although the page was updated
some time after the release of Unicode 4.0. ...

This has been fixed already. Well done, Markus.

Post by Peter Kirk
... Is the rest of this material similarly out of date?

I presume not. If anything is, then please tell Markus.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

Shao, Yiying

2003-10-20 20:31:49 UTC

Thanks for your info.

Post by Stefan Persson

Post by Shao, Yiying
Just wondering if anybody knows how Unicode is on Linux?

Very good support. Default charset for recent versions of some popular
distributions.

What are those popular distributions and which versions?

Post by Stefan Persson

Post by Shao, Yiying
On Red Hat Linux, if UTF-8 is not the default encoding for Chinese/Japanese/Korean, what is it using for those double-byte languages?

The old multi-byte character sets.

So, how should I implement my code? Do I have to check whether the language is Japanese (for example) and then convert the Unicode (UTF-8) text to a multi-byte encoding? That seems very painful.

Post by Stefan Persson

Post by Shao, Yiying
Do later Red Hat Linux releases make UTF-8 the default encoding for them?

AFAIK only if you manually set it to a UTF-8 locale, e.g.
LANG=zh_CN.UTF-8. Notice, though, that some older software will not be
aware of this change, so many characters will not be displayed properly.

So, is this setting available from Red Hat 8.0 or later? Also, do you mean some old versions of Linux may not be aware of this setting?

Besides, do you happen to know ICU from IBM? Does it take care of the Unicode problems with double-byte languages on Linux?

Thanks,
Yiying

Edward H. Trager

2003-10-21 13:35:43 UTC

Post by Shao, Yiying
Thanks for your info.

Post by Stefan Persson

Post by Shao, Yiying
Just wondering if anybody knows how Unicode is on Linux?

Very good support. Default charset for recent versions of some popular
distributions.
What are those popular distributions and which versions?

Post by Stefan Persson

Post by Shao, Yiying
On Red Hat Linux, if UTF-8 is not the default encoding for Chinese/Japanese/Korean, what is it using for those double-byte languages?

The old multi-byte character sets.

So, how should I implement my code? Do I have to check whether the language is Japanese (for example) and then convert the Unicode (UTF-8) text to a multi-byte encoding? That seems very painful.

No. Forget about old multi-byte encodings. Just set your locale to a UTF-8 locale and use UTF-8
for all languages. In my experience (on SuSE 7.3, 8.1, 8.2, and the 9.0 betas) all of the "important"
applications handle CJK languages perfectly well under a UTF-8 locale. The "important" applications
for me are things like Open Office 1.1, Konsole, vim, MySQL, and Mozilla. For CJK input, use SCIM
(http://ns.turbolinux.com.cn/~suzhe/scim/index.html). For many other details about Unicode
on Linux, see my page at http://eyegene.ophthy.med.umich.edu/unicode/index.html.
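
One way to read "use UTF-8 for all languages" in code: keep everything as UTF-8 internally and convert legacy data once, at the point where it enters the program. A minimal Python sketch under that assumption follows. The helper to_utf8 is hypothetical, and the codec names are the ones Python's standard library uses.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Transcode bytes from a legacy encoding to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


# Sample text in three legacy CJK encodings, built from escapes so the
# example does not depend on how this message itself is encoded.
legacy_inputs = {
    "euc_jp": "\u65e5\u672c\u8a9e".encode("euc_jp"),        # Japanese
    "gb2312": "\u4e2d\u6587".encode("gb2312"),              # Simplified Chinese
    "euc_kr": "\ud55c\uad6d\uc5b4".encode("euc_kr"),        # Korean
}

# After conversion, one code path (UTF-8) handles all three languages.
for encoding, raw in legacy_inputs.items():
    print(encoding, to_utf8(raw, encoding).decode("utf-8"))
```

The only per-language knowledge left is the name of the source encoding, and that is needed only for data that really arrives in a legacy encoding.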

Post by Shao, Yiying

Post by Stefan Persson

Post by Shao, Yiying
Do later Red Hat Linux releases make UTF-8 the default encoding for them?

AFAIK only if you manually set it to a UTF-8 locale, e.g.
LANG=zh_CN.UTF-8. Notice, though, that some older software will not be
aware of this change, so many characters will not be displayed properly.
So, is this setting available from Red Hat 8.0 or later? Also, do you mean some old versions of Linux may not be aware of this setting?
Besides, do you happen to know ICU from IBM? Does it take care of the Unicode problems with double-byte languages on Linux?

Most likely. But I think your life will be easier if you just use UTF-8 for all languages and forget about legacy
encodings. I'm sure ICU must have very robust UTF-8 support.

Post by Shao, Yiying
Thanks,
Yiying