Discussion:
Non-ascii string processing?
Theodore H. Smith
2003-10-04 18:31:20 UTC
Permalink
Hi lists,

I'm wondering how people tend to do their non-ascii string processing.

I'm wondering if anyone really needs anything other than byte-oriented
code. I'm using UTF8 as my character format, and UTF8 is variable
width, of course. I offer the option of processing UTF8 with byte
functions, however.

EG:

Start = MyString.InStr( "<" )
End = MyString.InStr( Start + 1, ">" )

For things like this, it really doesn't matter if your data is UTF8; you
can still process it like bytes, leading to faster speed and simpler
code.

So, I'm wondering, in fact: is there ANY code that needs explicit UTF8
processing? Here are a few cases I've thought of.

1) Spell checking - needs UTF8 character based iteration
2) lexical processing - needs UTF8 mode to be able to match "å" to "a".

Can anyone tell me any more? Please feel free to go into great detail
in your answers. The more detail the better.

Thanks a lot!

I'm just wondering if I can simplify my string processing library: does
anyone really need anything except byte-level processing for most
functions, apart from maybe a few for the two tasks I mentioned above?



To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html


Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
Doug Ewell
2003-10-04 21:07:17 UTC
Permalink
Post by Theodore H. Smith
I'm wondering how people tend to do their non-ascii string processing.
I'm wondering, if anyone really needs anything other than byte
oriented code? I'm using UTF8 as my character format, and UTF8 is
variable width, of course. I offer the option of processing UTF8, with
byte functions, however.
Start = MyString.InStr( "<" )
End = MyString.InStr( Start + 1, ">" )
things like this, it really doesn't matter if your data is UTF8, you
can still process it like bytes! Leading to faster speed, and simpler
code.
If you really aren't processing anything but the ASCII characters within
your strings, like "<" and ">" in your example, you can probably get
away with keeping your existing byte-oriented code. At least you won't
get false matches on the ASCII characters (this was a primary design
goal of UTF-8).

However, if your goal is to simplify processing of arbitrary UTF-8 text,
including non-ASCII characters, I haven't found a better way than to
read in the UTF-8, convert it on the fly to UTF-32, and THEN do your
processing on the fixed-width UTF-32. That way you don't have to do one
thing for Basic Latin characters and something else for the rest.

You will probably hear from some very prominent Unicode people that
converting to UTF-16 is better, because "most" characters are in the
BMP, for which UTF-16 uses half as much memory. But this approach
doesn't really solve the variable-width problem -- it merely moves it,
from "ASCII vs. non-ASCII" to "BMP vs. non-BMP." Unless you are keeping
large amounts of text in memory, or are working with a small device such
as a handheld, the extra size of UTF-32 compared to UTF-16 is unlikely
to be a big problem, and you have the advantage of dealing with a
fixed-width representation for the entire Unicode code space.

All of this assumes that you don't have multi-character processing
issues, like combining characters and normalization, or culturally
appropriate sorting, in which case your character processing WILL be
more complex than ASCII no matter which CES you use.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/



Theodore H. Smith
2003-10-05 22:19:46 UTC
Permalink
Hi Doug,

here are some thoughts.
Post by Doug Ewell
If you really aren't processing anything but the ASCII characters within
your strings, like "<" and ">" in your example, you can probably get
away with keeping your existing byte-oriented code. At least you won't
get false matches on the ASCII characters (this was a primary design
goal of UTF-8).
Yes, and in fact, UTF8 doesn't generate any false matches when
searching for a valid UTF8 string, within another valid UTF8 string.

In fact, if there is UTF8 between the < and >, the processing works
just fine.
Post by Doug Ewell
However, if your goal is to simplify processing of arbitrary UTF-8 text,
including non-ASCII characters, I haven't found a better way than to
read in the UTF-8, convert it on the fly to UTF-32, and THEN do your
processing on the fixed-width UTF-32. That way you don't have to do one
thing for Basic Latin characters and something else for the rest.
Well, I can do most processing just fine, as I said. I only have a
problem with lexical string processing (A = å), or spell checking. And
in fact, lexical string processing is already so complex that it
probably won't make much difference whether I use UTF32 or UTF8,
because of combining characters and the like.
Post by Doug Ewell
You will probably hear from some very prominent Unicode people that
converting to UTF-16 is better, because "most" characters are in the
BMP, for which UTF-16 uses half as much memory. But this approach
doesn't really solve the variable-width problem -- it merely moves it,
from "ASCII vs. non-ASCII" to "BMP vs. non-BMP." Unless you are keeping
large amounts of text in memory, or are working with a small device such
as a handheld, the extra size of UTF-32 compared to UTF-16 is unlikely
to be a big problem, and you have the advantage of dealing with a
fixed-width representation for the entire Unicode code space.
Unfortunately, I'm more concerned about the speed of converting the
UTF8 to UTF32, and back. This is because usually, I can process my UTF8
with byte functions.
Post by Doug Ewell
All of this assumes that you don't have multi-character processing
issues, like combining characters and normalization, or culturally
appropriate sorting, in which case your character processing WILL be
more complex than ASCII no matter which CES you use.
Yes. Actually, I haven't yet seen any reason not to use
byte-oriented-only functions for UTF8. Thanks for trying though!

Maybe someone whose native language isn't English and who spends a lot
of time writing string processing code could help me with suggestions
for tasks that need character modes? (like lexical processing a=å, and
spell checking).




Doug Ewell
2003-10-05 23:10:48 UTC
Permalink
Post by Theodore H. Smith
Post by Doug Ewell
If you really aren't processing anything but the ASCII characters
within your strings, like "<" and ">" in your example, you can
probably get away with keeping your existing byte-oriented code.
At least you won't get false matches on the ASCII characters (this
was a primary design goal of UTF-8).
Yes, and in fact, UTF8 doesn't generate any false matches when
searching for a valid UTF8 string, within another valid UTF8 string.
In fact, if there is UTF8 between the < and >, the processing works
just fine.
Depends on what "processing" you are talking about. Just to cite the
most obvious case, passing a non-ASCII, UTF-8 string to byte-oriented
strlen() will fail dramatically.
Post by Theodore H. Smith
Post by Doug Ewell
However, if your goal is to simplify processing of arbitrary UTF-8
text, including non-ASCII characters, I haven't found a better way
than to read in the UTF-8, convert it on the fly to UTF-32, and THEN
do your processing on the fixed-width UTF-32. That way you don't
have to do one thing for Basic Latin characters and something else
for the rest.
Well, I can do most processing just fine, as I said. I only have a
problem with lexical string processing (A = å), or spell checking. And
in fact, lexical string processing is already so complex that it
probably won't make much difference whether I use UTF32 or UTF8,
because of combining characters and the like.
You mean, it's so complex to keep track of canonical equivalences, we
might as well just treat it all as a sequence of isolated bytes?
Doesn't sound like Unicode text processing to me.
Post by Theodore H. Smith
Unfortunately, I'm more concerned about the speed of converting the
UTF8 to UTF32, and back. This is because usually, I can process my
UTF8 with byte functions.
Check your assumptions about speed again. Converting between UTF-8 and
Unicode scalar values really isn't a computationally expensive
operation. It's best to do some profiling before assuming UTF-8
conversion will slow you down much.
Post by Theodore H. Smith
Maybe someone whose native language isn't English and who spends a lot
of time writing string processing code could help me with suggestions
for tasks that need character modes? (like lexical processing a=å, and
spell checking).
You are using the rather loose term "lexical processing" to refer to
setting up equivalence classes between characters (e.g. between U+0061
and U+00E5). This is language-dependent, and complex enough on its own,
but trying to do it while you continue to treat U+00E5 as the sequence
<0xC3, 0xA5> is much harder and much slower than if you had just
converted the UTF-8 in the first place.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/



Marco Cimarosti
2003-10-06 10:09:34 UTC
Permalink
Post by Doug Ewell
Depends on what "processing" you are talking about. Just to cite the
most obvious case, passing a non-ASCII, UTF-8 string to byte-oriented
strlen() will fail dramatically.
Why? The purpose of strlen() is counting the number of *bytes* needed to
store a certain string, and this works just as well for UTF-8 as it does
for SBCS's or DBCS's.

What strlen() cannot do is count the number of *characters* in a string.
But who cares? I can imagine very few situations where such information
would be useful to someone.

_ Marco


Stephane Bortzmeyer
2003-10-06 11:16:26 UTC
Permalink
On Mon, Oct 06, 2003 at 12:09:34PM +0200,
Post by Marco Cimarosti
What strlen() cannot do is count the number of *characters* in a string.
But who cares? I can imagine very few situations where such information
would be useful to someone.
It is one thing to explain that strlen() has byte semantics and not
character semantics. It is another to assume that character semantics
are useless. Most text-processing software allow you to count the
number of characters in a document, for instance.

Any decent Unicode programming environment should give you two
functions, one for byte semantics and one for character
semantics. Both are useful.


Peter Kirk
2003-10-06 12:16:05 UTC
Permalink
Post by Marco Cimarosti
Post by Doug Ewell
Depends on what "processing" you are talking about. Just to cite the
most obvious case, passing a non-ASCII, UTF-8 string to byte-oriented
strlen() will fail dramatically.
Why? The purpose of strlen() is counting the number of *bytes* needed to
store a certain string, and this works just as well for UTF-8 as it does for
SBCS's or DBCS's.
What strlen() cannot do is count the number of *characters* in a string.
But who cares? I can imagine very few situations where such information
would be useful to someone.
_ Marco
This depends on what kind of operations you are wanting to do with the
text. Of course if you are concerned only with storage and transmission
of the text, you don't need to count characters rather than bytes,
except that, as you mention in another posting, you may need to avoid
splitting strings in the middle of characters (and there is actually a
very simple algorithm to avoid that, never split before a byte
10xxxxxx). But if you want to render the text, the rendering system
needs to split the text into characters at some point. And if you want
to do to the text the kinds of processing which I as a linguist am
interested in, you absolutely need to work with characters rather than
bytes, and it can be very important to know the number of characters in
a string - although this number may get confused by normalisation issues.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




John Delacour
2003-10-06 13:15:02 UTC
Permalink
Post by Marco Cimarosti
What strlen() cannot do is count the number of *characters* in a string.
But who cares? I can imagine very few situations where such information
would be useful to someone.
#!/usr/bin/perl
print "ab, \x{aaaa}\x{aaab}" ;
printf "\n%s, %s", length "ab" , length "\x{aaaa}\x{aaab}" ;

ab, ́Í´
2, 2


Marco Cimarosti
2003-10-06 10:50:09 UTC
Permalink
Post by Theodore H. Smith
Hi lists,
Hi, member.
Post by Theodore H. Smith
I'm wondering how people tend to do their non-ascii string processing.
I think no one has been doing ASCII string processing for decades. :-) But I
guess you meant non-SBCS ("single byte character set") string processing.
Post by Theodore H. Smith
[...]
So, I'm wondering, in fact, is there ANY code that needs
explicit UTF8 processing? Heres a few I've thought of.
In general, you need UTF-32 whenever you need to access the single
*characters* in a string. This is needed for all kinds of lexical or
typographic processing, e.g.:

- case matching or conversion ("â" vs. "Â");

- loose matching ("â" vs. "a");

- displaying the text;
Post by Theodore H. Smith
Can anyone tell me any more? Please feel free to go into great detail
in your answers. The more detail the better.
There is at least one case in which you need UTF-8-aware code even when not
accessing single characters: when you *trim* a string at an arbitrary
byte position. E.g.:

char str1 [9] = "abc";
char * str2 = "αβγ";

strncat(str1, str2, sizeof(str1) - strlen(str1) - 1);

If strncat() is UTF-8 aware, str1 will be "abcαβ" + null terminator (8
bytes). But if strncat() is *not* UTF-8 aware, str1 will contain an invalid
UTF-8 string: "abcαβ" + an *illegal* byte (0xCE) + null terminator.

_ Marco



Marco Cimarosti
2003-10-06 11:52:26 UTC
Permalink
Post by Stephane Bortzmeyer
On Mon, Oct 06, 2003 at 12:09:34PM +0200,
Post by Marco Cimarosti
What strlen() cannot do is count the number of *characters* in a string.
But who cares? I can imagine very few situations where such information
would be useful to someone.
It is one thing to explain that strlen() has byte semantics and not
character semantics. It is another to assume that character semantics
are useless.
I never said that character semantics are useless: I said that it is almost
always useless to count the *number* of Unicode characters in a string.

One of the few cases in which such a count could be useful is to
pre-allocate a buffer for a UTF-8 to UTF-32 conversion. But there is no
need for a general purpose API function for such a special need.
Post by Stephane Bortzmeyer
Most text-processing software allow you to count the
number of characters in a document, for instance.
Yes. And:

1) That is a very special need of a very special kind of application (a word
processor), so it doesn't justify a general purpose API function for that:
people don't normally write word processors every day.

2) That count cannot be done by counting Unicode "characters" (i.e.,
encoding units): you have to count the objects that the user perceives as
"typographical characters". E.g., control or formatting "characters" should
be ignored, sequences of two or more space "characters" should be counted as
one, and a word like "élite" is always counted as five characters,
regardless of the fact that it might be encoded as six Unicode "characters".
In an Indic or Korean text, each syllable counts as a single character,
although it may be encoded as a long sequence of Unicode "characters".

3) That is a very silly count anyway. If you want to have an idea of the
"size" of a document, lines or words are much more useful units.
Post by Stephane Bortzmeyer
Any decent Unicode programming environment should give you two
functions, one for byte semantics and one for character
semantics. Both are useful.
OK. But the length in "characters" of a string is not "character semantics":
it's plain nonsense, IMHO.

_ Marco


'Stephane Bortzmeyer'
2003-10-06 12:37:44 UTC
Permalink
On Mon, Oct 06, 2003 at 01:52:26PM +0200,
a word like "élite" is always counted as five characters, regardless
of the fact that it might be encoded as six Unicode "characters".
I assume that everybody on this list knows that you count characters
only after a proper normalization... (like many operations on Unicode
texts).
3) That is a very silly count anyway. If you want to have an idea of the
"size" of a document, lines or words are much more useful units.
Tell that to the editor (editors of paper publications still talk in
this unit: "3,000 characters, no more, for tomorrow morning").
it's plain nonsense, IMHO.
I disagree.


j***@spin.ie
2003-10-06 11:28:55 UTC
Permalink
Post by Theodore H. Smith
Post by Doug Ewell
If you really aren't processing anything but the ASCII characters
within your strings, like "<" and ">" in your example, you can
probably get away with keeping your existing byte-oriented code. At
least you won't get false matches on the ASCII characters (this was
a primary design goal of UTF-8).
Yes, and in fact, UTF8 doesn't generate any false matches when
searching for a valid UTF8 string, within another valid UTF8 string.
However, it will generate false misses when searching for a valid UTF-8 string within an invalid UTF-8 string. In important cases this can lead to severe security issues: for example, if you were doing the searching to filter disallowed sequences (say "<script" in an HTML filter or "../" in a URI filter) and the UTF-8 is later converted by a tolerant converter, then the disallowed sequences can be sneaked past the filter by sending invalid UTF-8. There have certainly been cases of this in the past (IIS, for example, could be fooled into accessing files outside of the webroot).
Hence your search function must either include a check for invalid UTF-8 or be used only in a situation where you know that this won't cause problems (either because invalid UTF-8 will raise an error elsewhere, or because there are no possible security problems from such data). In particular, if it is part of a library that might be used elsewhere, there could be problems: the user of the library might assume you are doing more checking than you are, and neglect to check him- or herself.
Post by Theodore H. Smith
Unfortunately, I'm more concerned about the speed of converting the
UTF8 to UTF32, and back. This is because usually, I can process my UTF8
with byte functions.
This is a "swings and roundabouts" situation. Granted, dealing with a large array or transmitting a stream of 8-bit units will generally be faster than dealing with a similarly sized stream of 32-bit units (they will be similarly sized if they contain mainly ASCII data, and even the worst-case scenario for UTF-8 won't be larger than the equivalent UTF-32 for valid Unicode characters). At the same time, though, dealing with a single 32-bit unit is generally faster than dealing with a single 8-bit unit on most modern machines; the 8-bit unit will generally be converted to and from 32-bit or larger units anyway. So if you average 1.2 octets per character in UTF-8 (say it's mainly ASCII), you are really dealing with 1.2 times as many units as if you used UTF-32. If you are closer to an average of 4 octets per character in UTF-8, then you are quadrupling the number of units to process, on top of any conversion overhead.

The effects of this on processing efficiency are going to depend on just what you are doing with the characters, and what optimisations can be applied (whether by the programmer or the compiler). For some operations UTF-8 can be considerably less efficient than UTF-32.

It also depends on how much the properties you are dealing with are "hidden" by UTF-8. On the one hand the character-based strlen mentioned in this thread is easy to write for UTF-8:

size_t charlen(const char* str) { /* assumes valid UTF-8 */
    size_t ret = 0;
    while (*str)
        if ((*str++ & 0xC0) != 0x80) /* count every byte except
                                        continuation bytes (10xxxxxx) */
            ++ret;
    return ret;
}

How this compares with the UTF-32 equivalent will vary. Note that it still has validity issues. Generally, though, UTF-8 doesn't have many problems with this. On the other hand, while it is certainly possible to use UTF-8 to do the property lookup needed for most functionality that treats Unicode as more than just a bunch of 21-bit numbers encoded in various ways, it is easier and more efficient (often including the memory size of the program) to do much of it with UTF-16 or UTF-32.






j***@spin.ie
2003-10-06 13:07:38 UTC
Permalink
Post by 'Stephane Bortzmeyer'
Post by Marco Cimarosti
a word like "élite" is always counted as five characters, regardless
of the fact that it might be encoded as six Unicode "characters".
I assume that everybody on this list knows that you count characters
only after a proper normalization... (like many operations on Unicode
texts).
A word like "élite" will be counted as either five or six things depending on just what the things are in a given context. Whether you call those things "characters" or not is another matter.

Normalisation might result in that string being five or six Unicode characters in length, depending on the normalisation form used. Even though NFC would make characters and grapheme clusters coincide in this case, that does not apply to all uses of combining characters, so counting the Unicode characters of NFC text is not a reliable way to count user-perceived characters.

However, a byte count is probably of even less use to an end user anyway (except insofar as disk space and download times go, and then a rough estimate would serve their purposes). Both byte counts and Unicode-character counts have uses within the implementation of higher-level functionality, and as such both are required.
Post by Marco Cimarosti
3) That is a very silly count anyway. If you want to have an idea of the
"size" of a document, lines or words are much more useful units.
To estimate the column-inches that will be used, characters are much more useful than words, and far more useful than lines (which will vary according to column-width, font, justification algorithm, etc.).






Marco Cimarosti
2003-10-06 15:15:25 UTC
Permalink
Post by 'Stephane Bortzmeyer'
Post by Marco Cimarosti
OK. But the length in "characters" of a string is not "character
semantics": it's plain nonsense, IMHO.
I disagree.
Feel free.

But I still don't see any use in knowing how many characters are in a
UTF-8 string, apart from the use that I already mentioned: allocating a
buffer for a UTF-8 to UTF-32 conversion.

_ Marco



Edward H. Trager
2003-10-06 17:11:07 UTC
Permalink
Post by Marco Cimarosti
Post by 'Stephane Bortzmeyer'
Post by Marco Cimarosti
OK. But the length in "characters" of a string is not "character
semantics": it's plain nonsense, IMHO.
I disagree.
Feel free.
But I still don't see any use in knowing how many characters are in a
UTF-8 string, apart from the use that I already mentioned: allocating a
buffer for a UTF-8 to UTF-32 conversion.
_ Marco
Well, I know a good use for it: a console or terminal-based application which
displays information using fixed-width fonts in a tabular form, such as a subset
of records from a database table. To calculate how wide to display each column, knowing the
maximum number of characters in the strings for each column is a useful starting
place.

Of course, that might not be enough by itself if, for example, (1) one has
to display Hanzi or Kanji which are twice the width of Latin characters when
displayed on a terminal, or (2) one has to display scripts where ligatures
(as in Arabic) or other attributes of the script, such as over-the-letter/
under-the-letter vowels in Indic and Indic-derived scripts, change the display
width of a string from what it would be if just counting characters. But it is
still a good place to start.




j***@spin.ie
2003-10-06 17:06:55 UTC
Permalink
Post by Marco Cimarosti
But I still don't see any use in knowing how many characters are in an UTF-8
string, apart the use that I already mentioned: allocating a buffer for a
UTF-8 to UTF-32 conversion.
I wouldn't use it for that at all. I'd assume a worst case of one 32-bit word in the UTF-32 per octet in the UTF-8, or else stream it out, and hence avoid allocating a buffer for the entire string at all.

You would need to be able to count UTF-8 characters if you were implementing a spec defined in terms of characters rather than bytes. Notably, since XML is defined in terms of characters, any mention of string lengths or indices into strings is a count of characters (e.g. in XSLT, XPointer and elsewhere).
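The streaming alternative to whole-string buffering can be sketched with an incremental decoder, which holds only the partial byte sequence at a chunk boundary rather than the entire string (Python's codecs module here, purely as an illustration of the idea):

```python
import codecs

def stream_codepoints(chunks):
    """Yield code points from an iterable of UTF-8 byte chunks
    without ever buffering the whole string."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        for ch in decoder.decode(chunk):
            yield ord(ch)
    # Flush any incomplete trailing sequence (raises on truncation).
    for ch in decoder.decode(b"", final=True):
        yield ord(ch)

# U+00E5 is split across the two chunks; the decoder reassembles it.
assert list(stream_codepoints([b"a\xc3", b"\xa5b"])) == [0x61, 0xE5, 0x62]
```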






Jill Ramonsky
2003-10-06 18:31:26 UTC
Permalink
Could you try that again with codepoints > U+FFFF please? I'd be curious
to know what happens.
Jill
-----Original Message-----
Sent: Monday, October 06, 2003 2:15 PM
Subject: RE: Non-ascii string processing?
What strlen() cannot do is count the number of
*characters* in a string.
But who cares? I can imagine very few situations where
such information would be useful to someone.
#!/usr/bin/perl
print "ab, \x{aaaa}\x{aaab}" ;
printf "\n%s, %s", length "ab" , length "\x{aaaa}\x{aaab}" ;
ab, <U+AAAA><U+AAAB>
2, 2
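Jill's question about codepoints above U+FFFF can be checked directly in a language whose strings are code-point based (Python 3 here, standing in for the Perl above; this is an assumption about what that Perl build would report, not a transcript of it):

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies beyond U+FFFF.
clef = "\U0001D11E"
assert len(clef) == 1                      # one code point
assert len(clef.encode("utf-8")) == 4      # four UTF-8 bytes
assert len(clef.encode("utf-16-le")) == 4  # two 16-bit code units
```

So a code-point-based length still answers 1, while a UTF-16-code-unit length would answer 2.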
Jill Ramonsky
2003-10-06 18:31:09 UTC
Permalink
Nor I. "Characters" are perhaps the most useless objects ever invented.

Now - a count of DEFAULT GRAPHEME CLUSTERs might be useful (for example,
for display on a console which uses fixed-width fonts). Indeed, a whole
class of DEFAULT GRAPHEME CLUSTER handling functions might come in very
handy indeed. Bytes are useful. Default grapheme clusters are useful.
But a "character"? What's the point?

But then, a default grapheme cluster might theoretically require up to
16 Unicode characters. (Maybe more, I don't know). Even bit-packed to 21
bits per character, that still gives us 336 bits. So I conclude that our
string processing functions could go a lot faster if only we'd all use
UTF-336. Er....?

Jill
-----Original Message-----
Sent: Monday, October 06, 2003 11:10 AM
To: 'Doug Ewell'; Unicode Mailing List
Cc: Theodore H. Smith
Subject: RE: Non-ascii string processing?
What strlen() cannot do is count the number of
*characters* in a string.
But who cares? I can imagine very few situations where such
information would be useful to someone.
_ Marco
Doug Ewell
2003-10-07 04:34:49 UTC
Permalink
Post by Jill Ramonsky
But then, a default grapheme cluster might theoretically require up to
16 Unicode characters. (Maybe more, I don't know). Even bit-packed to
21 bits per character, that still gives us 336 bits. So I conclude
that our string processing functions could go a lot faster if only
we'd all use UTF-336. Er....?
If only I had a bit more spare time, Jill. You do NOT want to get me
started... >:-)

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/



Marco Cimarosti
2003-10-06 19:36:13 UTC
Permalink
Post by Marco Cimarosti
Post by Marco Cimarosti
But I still don't see any use in knowing how many
characters are in an UTF-8
Post by Marco Cimarosti
string, apart the use that I already mentioned: allocating
a buffer for a
Post by Marco Cimarosti
UTF-8 to UTF-32 conversion.
Well, I know a good use for it: a console or terminal-based
application which displays information using fixed-width
fonts in a tabular form, such as a subset of records from
a database table. To calculate how wide to display each
column, knowing the maximum number of characters in the
strings for each column is a useful starting place.
Well, I am just about to start a time consuming task: fixing an application
which was based on the assumption the number of characters in a string was
good "starting place" to format tabular text in a fixed width font...

You have already explained why this can't work when CJK or other scripts pop
in.

What you really need for such a thing is a function which computes the
"width" of a string in terms of display units, rather than its length in
term of characters.

_ Marco
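A first approximation of such a width-in-display-units function can be built from Unicode data alone: combining marks occupy no cells, East Asian Wide/Fullwidth characters occupy two, everything else one. A sketch using Python's unicodedata (a crude stand-in for POSIX wcswidth(); a real terminal may still disagree, as this thread notes for Arabic and Indic scripts):

```python
import unicodedata

def display_width(s: str) -> int:
    """Approximate fixed-width-terminal cells for a string:
    combining marks count 0, East Asian Wide/Fullwidth count 2,
    everything else counts 1."""
    cells = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue  # combining mark: rides on the previous cell
        cells += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return cells

assert display_width("abc") == 3
assert display_width("\u6f22\u5b57") == 4   # two CJK ideographs
assert display_width("e\u0301") == 1        # e + combining acute
```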




Edward H. Trager
2003-10-06 21:45:23 UTC
Permalink
Post by Marco Cimarosti
Post by Marco Cimarosti
Post by Marco Cimarosti
But I still don't see any use in knowing how many
characters are in an UTF-8
Post by Marco Cimarosti
string, apart the use that I already mentioned: allocating
a buffer for a
Post by Marco Cimarosti
UTF-8 to UTF-32 conversion.
Well, I know a good use for it: a console or terminal-based
application which displays information using fixed-width
fonts in a tabular form, such as a subset of records from
a database table. To calculate how wide to display each
column, knowing the maximum number of characters in the
strings for each column is a useful starting place.
Well, I am just about to start a time consuming task: fixing an application
which was based on the assumption the number of characters in a string was
good "starting place" to format tabular text in a fixed width font...
You have already explained why this can't work when CJK or other scripts pop
in.
What you really need for such a thing is a function which computes the
"width" of a string in terms of display units, rather than its length in
term of characters.
Yes, I agree. I also need such a function. Do you, Marco, or anyone else, know which function(s)
provide this service? (In my case, something Open Source or GPLed would be ideal, but ICU
would be too heavy). My application started out life in a sheltered ASCII-only
childhood, and now needs to move to the bigger UTF-8 world out there. Fortunately,
it is quite capable of succeeding in that world, but I haven't even started working
on the on-screen table formatting issue yet for exactly this reason.

Actually I believe that if I have to write something myself, making it work for the
Latin-with-combining-diacritics and CJK cases would not be too hard. After that however,
it seems that one would have to work on a script-by-script basis to get it to really
work properly. If it was only a case of Arabic, that would be one thing, but when one
looks at the Indic and Indic-derived scripts ... well, there are a lot of Indic and Indic-derived
scripts! Not that it is hard, but it would certainly take time, and I haven't done an ounce
of research yet to find out whether somebody has done it already or not ...



Markus Scherer
2003-10-07 23:29:45 UTC
Permalink
You might want to look at East Asian Width http://unicode.org/reports/tr11/ for an approximation of
the green-screen width of a string.

To be absolutely precise, you need feedback from your green-screen layout engine and its font, of
course, like you do for a graphical display.

markus
Post by Edward H. Trager
Post by Marco Cimarosti
What you really need for such a thing is a function which computes the
"width" of a string in terms of display units, rather than its length in
term of characters.
Yes, I agree. I also need such a function. Do you, Marco, or anyone else, know which function(s)
provide this service? (In my case, something Open Source or GPLed would be ideal, but ICU
would be too heavy). My application started out life in a sheltered ASCII-only
childhood, and now needs to move to the bigger UTF-8 world out there. Fortunately,
it is quite capable of succeeding in that world, but I haven't even started working
on the on-screen table formatting issue yet for exactly this reason.
Actually I believe that if I have to write something myself, making it work for the
Latin-with-combining-diacritics and CJK cases would not be too hard. After that however,
it seems that one would have to work on a script-by-script basis to get it to really
work properly. If it was only a case of Arabic, that would be one thing, but when one
looks at the Indic and Indic-derived scripts ... well, there are a lot of Indic and Indic-derived
scripts! Not that it is hard, but it would certainly take time, and I haven't done an ounce
of research yet to find out whether somebody has done it already or not ...
Jill Ramonsky
2003-10-07 09:48:05 UTC
Permalink
Well, I guess if I've got too many characters in my text for this
particular editor, I could always just present the document in NFC. That
might reduce the number a bit. However, I /strongly/ suspect that what
this editor /really/ wants is "no more than 3000 default grapheme
clusters". In which case, this is /still/ not a good use for CHARACTER
counting.

Jill
-----Original Message-----
Sent: Monday, October 06, 2003 1:38 PM
To: Marco Cimarosti
Cc: 'Stephane Bortzmeyer'; 'Doug Ewell'; Unicode Mailing
List; Theodore
H. Smith
Subject: Re: Non-ascii string processing?
Tell that to the editor (editors of paper publications still talk with
this unit "3 000 characters, no more, for tommorrow morning").
Jill Ramonsky
2003-10-07 09:35:46 UTC
Permalink
Knowing the number of characters won't help you one iota. What you need
to know here is the number of default grapheme clusters.
I still have yet to hear a useful purpose for counting the number of
/characters/.

Jill
-----Original Message-----
Sent: Monday, October 06, 2003 6:11 PM
Cc: Marco Cimarosti
Subject: Re: Non-ascii string processing?
Well, I know a good use for it: a console or terminal-based
application which
displays information using fixed-width fonts in a tabular
form, such as a subset
of records from a database table. To calculate how wide to
display each column, knowing the
maximum number of characters in the strings for each column
is a useful starting
place.
Peter Kirk
2003-10-07 11:20:27 UTC
Permalink
Post by Jill Ramonsky
Knowing the number of characters won't help you one iota. What you
need to know here is the number of default grapheme clusters.
I still have yet to hear a useful purpose for counting the number of
/characters/.
Jill
Suppose I have a UTF-8 string and want to know how many default grapheme
clusters it contains. How do I do so? Well, I step through the string
character by character, combining successive characters into grapheme
clusters. To do this without having to decode the UTF-8 myself, I need
to be able to get at the string character by character, and very likely
use a loop based on the number of characters in the string, e.g. the
following Basic (horrid language but good for making my point here):

For i% = 1 To Len(utf8string$)
    c$ = Mid(utf8string$, i%, 1)
    Process c$
Next i%

Such a loop would be more efficient in UTF-32 of course, but this is
still a real need for working with character counts.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/
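Peter's loop, carried one step further, is enough to count default grapheme clusters, at least in a deliberately simplified form: start a new cluster at every character that is not a combining mark. (The real rules are in UAX #29 and also cover Hangul jamo and more; this sketch, in Python rather than Basic, ignores those cases.)

```python
import unicodedata

def count_clusters(s: str) -> int:
    """Simplified default-grapheme-cluster count: every character
    that is not a combining mark (categories Mn, Mc, Me) starts a
    new cluster.  UAX #29 has the full segmentation rules."""
    return sum(1 for ch in s
               if unicodedata.category(ch) not in ("Mn", "Mc", "Me"))

assert count_clusters("abc") == 3
assert count_clusters("e\u0301") == 1   # e + combining acute = 1 cluster
```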




j***@spin.ie
2003-10-07 08:42:50 UTC
Permalink
Post by Jill Ramonsky
Now - a count of DEFAULT GRAPHEME CLUSTERs might be useful (for example,
for display on a console which uses fixed-width fonts). Indeed, a whole
class of DEFAULT GRAPHEME CLUSTER handling functions might come in very
handy indeed. Bytes are useful. Default grapheme clusters are useful.
But a "character"? What's the point?
Because characters are a useful intermediary between bytes and grapheme clusters. Such an intermediary may be entirely wrapped by code stepping from octets to grapheme clusters, or exposed by an API that higher-level code uses to produce the grapheme clusters, since that is the lowest level an API can expose while remaining encoding neutral (hence that is the level at which XML APIs expose CDATA, element names, etc.). The alternative would be a straight mapping between octets and grapheme clusters...
Post by Jill Ramonsky
But then, a default grapheme cluster might theoretically require up to
16 Unicode characters. (Maybe more, I don't know). Even bit-packed to 21
bits per character, that still gives us 336 bits. So I conclude that our
string processing functions could go a lot faster if only we'd all use
UTF-336. Er....?
Certainly if we allow for linguistic improbabilities (such as C with two graves, an acute and a couple of Hebrew vowel points) there would be no limit. Allowing for linguistic improbabilities has the advantage of making it more likely that we are allowing for linguistic edge-cases.






Jill Ramonsky
2003-10-07 11:35:56 UTC
Permalink
No. What you have demonstrated below is that given an API based on
characters, one can write an API based on default grapheme clusters.
Nonetheless, it is only the /resulting/ default-grapheme-cluster-based
API which would actually be of any use to end-users.

...and anyone who even /thinks/ of writing an API based on default
grapheme clusters is surely competent enough to write that (almost
trivial) character-based middle layer themselves.

I have yet to see an APPLICATION which needs a character-based API.
Jill
-----Original Message-----
Sent: Tuesday, October 07, 2003 12:20 PM
To: Jill Ramonsky
Subject: Re: Non-ascii string processing?
Post by Jill Ramonsky
Knowing the number of characters won't help you one iota. What you
need to know here is the number of default grapheme clusters.
I still have yet to hear a useful purpose for counting the
number of
Post by Jill Ramonsky
/characters/.
Jill
Suppose I have a UTF-8 string and want to know how many
default grapheme
clusters it contains. How do I do so? Well, I step through the string
character by character, combining successive characters into grapheme
clusters. To do this without having to decode the UTF-8
myself, I need
to be able to get at the string character by character, and
very likely
use a loop based on the number of characters in the string, e.g. the
For i% = 1 to Len(utf8string$)
c$ = Mid(utf8string$, i%, 1)
Process c$
Next i%
Such a loop would be more efficient in UTF-32 of course, but this is
still a real need for working with character counts.
--
Peter Kirk
http://www.qaya.org/
Peter Kirk
2003-10-07 12:28:18 UTC
Permalink
Post by Jill Ramonsky
No. What you have demonstrated below is that given an API based on
characters, one can write an API based on default grapheme clusters.
Nonetheless, it is only the /_resulting
_/default-grapheme-cluster-based API which would actually be of any
use to end-users.
...and anyone who even /thinks/ of writing an API based on default
grapheme clusters is surely competent enough to write that that
(almost trivial) character-based middle layer themselves.
I have yet to see an APPLICATION which needs a character-based API.
Jill
Well, application programming with default grapheme clusters will be
fairly trivial when using a computer language which has string etc
processing able to work transparently and efficiently with arbitrary
length characters, I mean, default grapheme clusters. Until such
computer languages are widely available, and given that for very many
widely used natural languages (if NFC is used) characters and DGCs
coincide, I would much prefer to work with a character-based API than
have to always do my own combining of UTF-8 bytes.

Anyway, DGCs are not always what you want to work with. I work a lot
with pointed Hebrew texts. For most purposes (though not for calculating
space taken up on a line) the entities I need to work with correspond to
Unicode characters rather than DGCs, for I work separately with the base
characters (mostly consonants), the vowel points and the accents. In
some cases the match is not precise, but it is a lot more convenient
for my work if I can access a string character by character, rather than
UTF-8 byte by UTF-8 byte or DGC by DGC. And, by the way, I have real
examples of DGCs in Hebrew consisting of six characters.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




Jungshik Shin
2003-10-09 04:55:59 UTC
Permalink
Post by Peter Kirk
Anyway, DGCs are not always what you want to work with.
Besides, DGCs are just for the default and are not the
absolute invariant atomic unit that can never be broken. In some
situations, delete operation and cursor movement should work at a level
different from that of the DGCs. The Unicode DGCs for Korean script
are syllables, but at least during the text input (_before_ a syllable is
'committed'), many Koreans want 'backspace' key to delete what she just
typed in - a Korean letter(jamo) instead of the whole syllable. It'd be
frustrating to have to type the whole sequence again just because one
makes a mistake in the last Unicode character to form a DGC (made of
several Unicode characters).
Post by Peter Kirk
I work a lot
with pointed Hebrew texts. For most purposes (though not for calculating
space taken up on a line) the entities I need to work with correspond to
Unicode characters rather than DGCs, for I work separately with the base
characters (mostly consonants), the vowel points and the accents. In
some cases the match is not precise, but it is a lot more convenient
for my work if I can access a string character by character, rather than
UTF-8 byte by UTF-8 byte or DGC by DGC. And, by the way, I have real
examples of DGCs in Hebrew consisting of six characters.
I've got a question about the cursor movement and
selection in Hebrew text with such a grapheme (made up of 6 Unicode
characters). What would be ordinary users' expectation when delete,
backspace, and arrow keys(for cursor movement) are pressed around/in the
middle of that DGC? Do they expect backspace/delete/arrow keys to operate
_always_ at the DGC level or sometimes do they want them to work at the
Unicode character level (or its equivalent in their perception of Hebrew
'letters')? Exactly the same question can be asked of Indic scripts.
I've asked this before (discussed the issue with Marco a couple of years
ago), but I haven't heard back from native users of Indic scripts.

Jungshik


Peter Kirk
2003-10-09 11:25:15 UTC
Permalink
Post by Jungshik Shin
...
I've got a question about the cursor movement and
selection in Hebrew text with such a grapheme (made up of 6 Unicode
characters). What would be ordinary users' expectation when delete,
backspace, and arrow keys(for cursor movement) are pressed around/in the
middle of that DGC? Do they expect backspace/delete/arrow keys to operate
_always_ at the DGC level or sometimes do they want them to work at the
Unicode character level (or its equivalent in their perception of Hebrew
'letters')? Exactly the same question can be asked of Indic scripts.
I've asked this before (discussed the issue with Marco a couple of years
ago), but I haven't heard back from native users of Indic scripts.
Jungshik
I can't answer for native users of Hebrew. Maybe others can, but then
most modern Hebrew word processing is done with unpointed text where
this is not an issue. But I can speak for what has been done with
Windows fonts for pointed Hebrew for scholarly purposes.

In each of them, as far as I can remember, delete and backspace delete
only a single character, not a default grapheme cluster. This is
probably appropriate for a font used mainly for scholarly purposes,
where representations of complex grapheme clusters may need to be edited
to make them exactly correct. A different approach might be more
suitable for a font commonly used for entering long texts. In such a
case I would tend to expect backspace to cancel one keystroke - but that
may be ambiguous of course when editing text which has not just been
entered.

Cursor movement also works at the character level. In some fonts there
is no visible cursor movement when moving over a non-spacing character,
which is probably the default but can be confusing to users. At least
one font has attempted to place the cursor at different locations within
the base character e.g. in the middle when there are two characters in
the DGC, at the 1/3 and 2/3 points when there are three characters. But
this is likely to get confusing when there are 5 or 6 characters in the
DGC and their order is not entirely predictable.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




Mark E. Shoulson
2003-10-09 12:46:36 UTC
Permalink
Post by Peter Kirk
Post by Jungshik Shin
...
I've got a question about the cursor movement and
selection in Hebrew text with such a grapheme (made up of 6 Unicode
characters). What would be ordinary users' expectation when delete,
backspace, and arrow keys(for cursor movement) are pressed around/in the
middle of that DGC? Do they expect backspace/delete/arrow keys to operate
_always_ at the DGC level or sometimes do they want them to work at the
Unicode character level (or its equivalent in their perception of Hebrew
'letters')? Exactly the same question can be asked of Indic scripts.
I've asked this before (discussed the issue with Marco a couple of years
ago), but I haven't heard back from native users of Indic scripts.
Jungshik
I can't answer for native users of Hebrew. Maybe others can, but then
most modern Hebrew word processing is done with unpointed text where
this is not an issue. But I can speak for what has been done with
Windows fonts for pointed Hebrew for scholarly purposes.
In each of them, as far as I can remember, delete and backspace delete
only a single character, not a default grapheme cluster. This is
probably appropriate for a font used mainly for scholarly purposes,
where representations of complex grapheme clusters may need to be
edited to make them exactly correct. A different approach might be
more suitable for a font commonly used for entering long texts. In
such a case I would tend to expect backspace to cancel one keystroke -
but that may be ambiguous of course when editing text which has not
just been entered.
Cursor movement also works at the character level. In some fonts there
is no visible cursor movement when moving over a non-spacing
character, which is probably the default but can be confusing to
users. At least one font has attempted to place the cursor at
different locations within the base character e.g. in the middle when
there are two characters in the DGC, at the 1/3 and 2/3 points when
there are three characters. But this is likely to get confusing when
there are 5 or 6 characters in the DGC and their order is not entirely
predictable.
I'm not a native speaker either, but I do have some occasion to work in
both pointed and unpointed Hebrew, and I think I would disagree with
Peter here. Certainly in the case of cursor movement, I'd expect the
cursor to move by DGCs, and not take some unclear number of keypresses
to move back a letter. With backspace/delete, I would probably want
that to work by characters within the current DGC, but once past that
(or if I'm not doing it immediately after typing the characters) it
should take out whole DGCs. They're just too messy and potentially
randomly ordered for it to make any sense to try to edit them
internally. So I guess I see Hebrew DGCs as also going through a sort
of "commitment" phase, when you type the next base character or use
cursor-movement keys to move around: at that point, the DGC should go
atomic and get deleted all at once, but so long as you're still typing
combining characters (and occasional backspaces), backspace should go
character by character (since you presumably can remember the last few
you just typed).
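Mark's "commitment" model can be sketched in Python. Treating a base character plus any trailing combining marks as one cluster is only a crude approximation of UAX #29 default grapheme clusters (it ignores Hangul jamo, ZWJ sequences, and so on), and the `clusters`/`backspace` names are purely illustrative:

```python
import unicodedata

def clusters(s):
    """Group a string into approximate grapheme clusters:
    a base character plus any following combining marks.
    (A rough stand-in for full UAX #29 segmentation.)"""
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def backspace(text, committed):
    """Delete one unit: a single character while the final
    cluster is still 'open' (uncommitted), but the whole
    final cluster once it has been committed by a cursor
    move or a following base character."""
    if not text:
        return text
    if not committed:
        return text[:-1]
    last = clusters(text)[-1]
    return text[: len(text) - len(last)]
```

So `backspace("ba\u0301", committed=True)` removes the whole "a + combining acute" cluster, while the uncommitted case removes just the last-typed character.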

Mind, I've not actually used all that many pointed-Hebrew text
processors; this is more my idea of how things *should* work than how
they *do* work. I think Yudit does or did something a bit like this,
though. (must have been "did": at the moment it seems to be consistent
about always doing everything by DGC).

~mark




Ted Hopp
2003-10-09 13:45:21 UTC
Permalink
One issue with deleting a DGC non-atomically is that deleting only the base
character can lead to all sorts of strange and problematic combining
character sequences. At a minimum, deleting a base character should delete
the entire DGC atomically. In Hebrew, I don't see any problem with deleting
combining characters non-atomically (although one might want to limit this
to just off the logical end of the sequence out of user interface
considerations). I suppose that this might be more of an issue in some other
languages, though.

One might be tempted to use some sort of canonical ordering logic to keep
the complexity down, but the combining classes for Hebrew are so problematic
that this would be a lost cause.

I have used software where the cursor moves non-atomically across a DGC in
Hebrew and I find it extremely confusing. The only way to make sense of
what's happening is to remember the exact sequence in which the combining
characters were entered. If someone wants to support such movement anyway, I
think that the cursor shape needs to change dramatically to indicate what's
going on. This is something I've never seen done well (usually not at all).
Subtle changes in cursor position are useless as a visual indication to the
user of what's going on. One might even need to include some sort of glyph
highlighting to make clear the state of the text entry system.

Ted


Ted Hopp, Ph.D.
ZigZag, Inc.
***@newSLATE.com
+1-301-990-7453

newSLATE is your personal learning workspace
...on the web at http://www.newSLATE.com/




Anto'nio Martins-Tuva'lkin
2003-10-09 15:22:29 UTC
Permalink
Post by Ted Hopp
I have used software where the cursor moves non-atomically across a
DGC in Hebrew and I find it extremely confusing. The only way to make
sense of what's happening is to remember the exact sequence in which
the combining characters were entered.
Is it just me, or should this whole thread be about the *insertion point*
instead of the "cursor"...?

-- ____.
António MARTINS-Tuválkin, | ()|
<***@tuvalkin.web.pt> |####|
R. Laureano de Oliveira, 64 r/c esq. |
PT-1885-050 MOSCAVIDE (LRS) Não me invejo de quem tem |
+351 934 821 700 carros, parelhas e montes |
http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe |
http://pagina.de/bandeiras/ a água em todas as fontes |



Elliotte Rusty Harold
2003-10-07 14:07:53 UTC
Permalink
At 12:35 PM +0100 10/7/03, Jill Ramonsky wrote:


I have yet to see an APPLICATION which needs a character-based API.
Jill

A W3C XML Schema Language validator needs a character based API to
correctly implement the minLength and maxLength facets on xsd:string
and types derived from it. Perhaps you would argue that the schema
language should itself be written in terms of grapheme clusters
rather than characters, but it isn't and thus we need to handle
characters to implement a validator in accordance with the spec.
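The distinction matters because one user-perceived letter can be one or two characters depending on normalization. A minimal Python sketch of how a validator must count for the length facets (the `satisfies_length` helper is illustrative, not from any spec):

```python
# One user-perceived letter, two different character counts:
precomposed = "\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # "e" + U+0301 COMBINING ACUTE ACCENT

def satisfies_length(value, min_len, max_len):
    # xsd length facets count characters (code points),
    # not bytes and not grapheme clusters
    return min_len <= len(value) <= max_len
```

Both strings render as a single grapheme cluster, yet only the precomposed form passes a `maxLength` of 1.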
--
Elliotte Rusty Harold
***@metalab.unc.edu
Processing XML with Java (Addison-Wesley, 2002)
http://www.cafeconleche.org/books/xmljava
http://www.amazon.com/exec/obidos/ISBN%3D0201771861/cafeaulaitA


Marco Cimarosti
2003-10-07 12:29:01 UTC
Permalink
Post by Peter Kirk
For i% = 1 to Len(utf8string$)
c$ = Mid(utf8string$, i%, 1)
Process c$
Next i%
Such a loop would be more efficient in UTF-32 of course, but this is
still a real need for working with character counts.
If the string type and functions of this Basic dialect are not Unicode-aware,
then:

- Len(s$) returns the number of *bytes* in the string;

- Mid(s$, i%, 1) returns a single *byte*;

- Your Process() subroutine won't work...

If the string type and functions are Unicode aware (as, e.g., in Visual
Basic or VBScript), then I'd expect that the actual internal representation
is hidden from the programmer, hence it makes no sense to talk about an
"UTF-8 string".

_ Marco



Peter Kirk
2003-10-07 13:22:52 UTC
Permalink
Post by Marco Cimarosti
Post by Peter Kirk
For i% = 1 to Len(utf8string$)
c$ = Mid(utf8string$, i%, 1)
Process c$
Next i%
Such a loop would be more efficient in UTF-32 of course, but this is
still a real need for working with character counts.
If the string type and functions of this Basic dialect are not Unicode-aware,
- Len(s$) returns the number of *bytes* in the string;
- Mid(s$, i%, 1) returns a single *byte*;
- Your Process() subroutine won't work...
If the string type and functions are Unicode aware (as, e.g., in Visual
Basic or VBScript), then I'd expect that the actual internal representation
is hidden from the programmer, hence it makes no sense to talk about an
"UTF-8 string".
_ Marco
You are correct, of course. I was assuming a Unicode-aware dialect of
Basic. But my variable names are no more guaranteed to be meaningful and
appropriate than are Unicode character names ;-) ; they are only
required to be distinct.

I could imagine a dialect of Basic which had separate string handling
functions for UTF-8 bytes and for characters. This is how the
Unicode-aware version of the SIL Consistent Changes stream editor works,
see http://www.sil.org/computing/catalog/show_software.asp?id=4.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




Jungshik Shin
2003-10-09 05:01:21 UTC
Permalink
Post by Peter Kirk
I could imagine a dialect of Basic which had separate string handling
functions for UTF-8 bytes and for characters. This is how the
In Perl 5.8 or later that uses UTF-8, unless otherwise explicitly
specified, character-related functions all operate at the level
of Unicode characters (represented in UTF-8). If you want to
work with a raw byte stream, your intent has to be declared
explicitly.
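Python 3 draws the same line, if a comparison helps: string operations work on characters by default, and the byte-level view must be requested explicitly (an editorial illustration, not part of Perl's API):

```python
s = "caf\u00e9"                     # "café": four characters

# character-level view (the default, as in Perl 5.8):
n_chars = len(s)                    # 4

# the raw byte stream must be asked for explicitly:
n_bytes = len(s.encode("utf-8"))    # 5 - the é occupies two octets
```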

Jungshik


j***@spin.ie
2003-10-07 11:45:40 UTC
Permalink
Post by Jill Ramonsky
No. What you have demonstrated below is that given an API
based on characters, one can write an API based on
default grapheme clusters. Nonetheless, it is only the
resulting default-grapheme-cluster-based API which would
actually be of any use to end-users.
How close to the "end" do users have to be before we start worrying about what they need? Every hacker is also a user.
Post by Jill Ramonsky
I have yet to see an APPLICATION which needs a character-
based API.
XSLT, XPointer, XPath...
Okay, they're all applications of XML, but they operate at a level at which they are required to be neutral to encodings. They also operate either below the stage at which issues about default and tailored grapheme clusters are dealt with (so any processing from characters to grapheme clusters may prove to be incorrect further down the line), or else at a level where such matters are irrelevant (particularly common with 'data' rather than 'document' uses of XML - and these technologies act at a level too low to allow that distinction to be made). Hence they necessarily operate on characters.






John Delacour
2003-10-07 14:57:35 UTC
Permalink
Post by Peter Kirk
Suppose I have a UTF-8 string and want to know
how many default grapheme clusters it contains.
How do I do so? Well, I step through the string
character by character, combining successive
characters into grapheme clusters. To do this
without having to decode the UTF-8 myself, I
need to be able to get at the string character
by character, and very likely use a loop based
on the number of characters in the string, e.g.
the following Basic (horrid language but good
For i% = 1 to Len(utf8string$)
c$ = Mid(utf8string$, i%, 1)
Process c$
Next i%
Such a loop would be more efficient in UTF-32
of course, but this is still a real need for
working with character counts.
Why use a horrid language when there's a nice one? :

#!/usr/bin/perl
use utf8 ; # not needed (and ignored) in Perl 5.8.*
my $string = "alpha \x{03b1}\ntagspace \x{e0020}" ;
my @utf8chars = split //, $string ;
foreach my $char (@utf8chars) {
    my $len = length unpack "a*", $char;
    print "$char\[$len\]";
}

### a[1]l[1]p[1]h[1]a[1] [1]α[2]
### [1]t[1]a[1]g[1]s[1]p[1]a[1]c[1]e[1] [1]󠀠[4]


j***@spin.ie
2003-10-07 14:59:08 UTC
Permalink
Post by Elliotte Rusty Harold
A W3C XML Schema Language validator needs a character based API to
correctly implement the minLength and maxLength facets on xsd:string
and types derived from it. Perhaps you would argue that the schema
language should itself be written in terms of grapheme clusters
rather than characters,
If it were so specified there would be great difficulties involved in tailoring grapheme clusters.






Marco Cimarosti
2003-10-07 16:15:23 UTC
Permalink
Post by Elliotte Rusty Harold
A W3C XML Schema Language validator needs a character based API to
correctly implement the minLength and maxLength facets on xsd:string
As far as I understand, xsd:string is a list of "Character"-s, and a
"Character" is an integer which can hold any valid Unicode code point.

In other terms, xsd:string is necessarily in UTF-32 (or something close to
it): it cannot be in UTF-8 or UTF-16.

The numbers returned by length, minLength and maxLength are the actual,
minimum and maximum number of *list elements*, contained in the list. I.e.,
in the case of xsd:string, the *size* of the string in *encoding units*.

The fact that, in UTF-32, the *size* of the string in encoding units
corresponds to the number of "characters" is coincidental.

In any case, the useful information is always the *size* of the string in
encoding units (octets for UTF-8, 16-bit units for UTF-16, etc.), not the
number of "characters" it contains.
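To make the counts concrete, a Python sketch using the encoded sizes as a stand-in for encoding units (the string chosen is arbitrary, one character from each width class):

```python
s = "A\u00e9\u4e2d\U0001d11e"   # ASCII, accented Latin, CJK, astral

code_points = len(s)                            # 4
utf8_octets = len(s.encode("utf-8"))            # 1 + 2 + 3 + 4 = 10
utf16_units = len(s.encode("utf-16-le")) // 2   # 5: U+1D11E is a surrogate pair
utf32_units = len(s.encode("utf-32-le")) // 4   # 4: one unit per code point
```

Only in UTF-32 does the count of encoding units coincide with the count of characters.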

_ Marco






Francois Yergeau
2003-10-07 18:23:09 UTC
Permalink
Post by Marco Cimarosti
As far as I understand, xsd:string is a list of "Character"-s, and a
"Character" is an integer which can hold any valid Unicode code point.
Not quite. XML Schema points to XML for its definition of character, and
XML in turn says "A character is an atomic unit of text as specified by
ISO/IEC 10646". It's not a number, it's a piece of text that cannot be
further divided ("atomic").
Post by Marco Cimarosti
In other terms, xsd:string is necessarily in UTF-32 (or
something close to it): it cannot be in UTF-8 or UTF-16.
xsd:string is encoding-form-independent; you can represent it in UTF:-)336
if you want.
Post by Marco Cimarosti
The numbers returned by length, minLength and maxLength are
the actual, minimum and maximum number of *list elements*,
contained in the list.
Yep, the number of characters in the "finite-length sequence of characters"
(XML Schema's definition of xsd:string).
Post by Marco Cimarosti
I.e., in the case of xsd:string, the *size* of the string in
*encoding units*.
Nope. In characters.
--
François


j***@spin.ie
2003-10-08 09:46:59 UTC
Permalink
Post by Marco Cimarosti
Post by Elliotte Rusty Harold
A W3C XML Schema Language validator needs a character based API to
correctly implement the minLength and maxLength facets on xsd:string
As far as I understand, xsd:string is a list of "Character"-s, and
a
"Character" is an integer which can hold any valid Unicode code
point.
No. First "list" in the context of XML Schema means a series of zero or more values from another datatype represented as whitespace-separated strings, where whitespace is defined according to production S from the XML spec:

S ::= (#x20 | #x9 | #xD | #xA)+

As such it's a good idea to avoid using "list" in a more general sense when dealing with XML Schema.

Secondly while string is defined as a sequence of characters, these characters are abstract UCS characters - the things defined by Unicode and ISO 10646 - also they must match the Char production from the XML spec:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

(The XML1.1 spec removes a few of those characters, I would have removed more, but that's another issue).

So "Character" is not an integer; it's a character, the thing that *has* a code point, rather than the code point itself. Also, some valid Unicode code points are excluded, and some are kind of allowed (the xxFFFE and xxFFFF codes from the astral planes are allowed by the Char production - does ISO 10646 allow those characters even though Unicode has them undefined?).
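The Char production is easy to check directly against a code point; a small Python sketch (the function name is illustrative):

```python
def is_xml_char(cp):
    """Test a code point against the XML 1.0 Char production:
    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
           | [#xE000-#xFFFD] | [#x10000-#x10FFFF]"""
    return (cp in (0x09, 0x0A, 0x0D)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)
```

Note that the production excludes surrogates and U+FFFE/U+FFFF, yet admits the xxFFFE/xxFFFF code points of the astral planes, which is exactly the oddity raised above.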
Post by Marco Cimarosti
In other terms, xsd:string is necessarily in UTF-32 (or something close to
it): it cannot be in UTF-8 or UTF-16.
It's characters that are included; xsd:string is necessarily not in any encoding form - it's an abstract concept that can be represented by whatever means a programmer sees fit (though some will serve better than others). XML Schemata can be used with DOM, and DOM mandates the use of UTF-16 at the interface.
Post by Marco Cimarosti
The fact that, in UTF-32, the *size* of the sting in encoding units
corresponds to the number of &quot;characters&quot; is coincidental.
Yes, but the coincidence is the other way around. :)
The coincidence is no coincidence at all, of course: UTF-32 is designed to have a one-to-one mapping between Unicode characters and its encoding units.
Post by Marco Cimarosti
In any case, the useful information is always the *size* of the string in
encoding units (octets for UTF-8, 16-bit units for UTF-16, etc.), not the
number of &quot;characters&quot; it contains.
Bah! "always" is a very strong word. It's already been shown that the useful information is often the number of grapheme clusters, so this is clearly wrong whether character-counts are useful or not.

In the case of XML Schema not only is the number of encoding units not useful it's practically a Zen Koan - there is no such thing as encoding units at the level of abstraction it operates at.






John Cowan
2003-10-08 12:08:53 UTC
Permalink
Post by j***@spin.ie
(The XML1.1 spec removes a few of those characters, I would have
removed more, but that's another issue).
You have no idea what fearful drubbings I had to administer to get
even the few removed that I did.
Post by j***@spin.ie
[D]oes ISO 10646 allow those characters even though Unicode has them
undefined?
No, it doesn't. There was a strong feeling in the W3C Core WG that
it be possible to handle the Astral Planes uniformly; every character
off the BMP, therefore, is a valid Char as well as a valid NameStartChar.
--
"There is no real going back. Though I John Cowan
may come to the Shire, it will not seem ***@reutershealth.com
the same; for I shall not be the same. http://www.reutershealth.com
I am wounded with knife, sting, and tooth, http://www.ccil.org/~cowan
and a long burden. Where shall I find rest?" --Frodo


Elliotte Rusty Harold
2003-10-08 13:24:46 UTC
Permalink
Post by John Cowan
No, it doesn't. There was a strong feeling in the W3C Core WG that
it be possible to handle the Astral Planes uniformly; every character
off the BMP, therefore, is a valid Char as well as a valid NameStartChar.
Of course it would have been possible to handle the "Astral Planes"
uniformly by making every character in them a legal Char, but not a
valid name character or name start character. This would have avoided
silliness like elements named after the musical symbol for a six
string fretboard or the damage of using undefined characters in XML
documents. It also would have been much more compatible with existing
parsers and tools. :-(
--
Elliotte Rusty Harold
***@metalab.unc.edu
Processing XML with Java (Addison-Wesley, 2002)
http://www.cafeconleche.org/books/xmljava
http://www.amazon.com/exec/obidos/ISBN%3D0201771861/cafeaulaitA


Doug Ewell
2003-10-08 15:03:16 UTC
Permalink
Post by Elliotte Rusty Harold
Of course it would have been possible to handle the "Astral Planes"
uniformly by making every character in them a legal Char, but not a
valid name character or name start character. This would have avoided
silliness like elements named after the musical symbol for a six
string fretboard or the damage of using undefined characters in XML
documents. It also would have been much more compatible with existing
parsers and tools. :-(
You can never completely avoid silliness -- just look at yesterday's
election.

But the "undefined characters" issue is a greater problem. Limiting the
pool of valid name characters to those already assigned in Unicode X.X
would mean either:

(a) the XML spec would have to be updated promptly, 1 to 2 times per
year, to keep up with each new minor release of Unicode, or

(b) the characters accepted after Unicode X.X would be excluded,
creating one of those "digital divide" issues when someone wants to
create a Buginese or Tai Lue identifier and can't.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/



John Cowan
2003-10-08 17:03:11 UTC
Permalink
Post by Doug Ewell
But the "undefined characters" issue is a greater problem. Limiting the
pool of valid name characters to those already assigned in Unicode X.X
(a) the XML spec would have to be updated promptly, 1 to 2 times per
year, to keep up with each new minor release of Unicode, or
(b) the characters accepted after Unicode X.X would be excluded,
creating one of those "digital divide" issues when someone wants to
create a Buginese or Tai Lue identifier and can't.
Exactly the reasoning of the XML Core WG.
--
John Cowan ***@reutershealth.com www.reutershealth.com ccil.org/~cowan
Dievas dave dantis; Dievas duos duonos --Lithuanian proverb
Deus dedit dentes; deus dabit panem --Latin version thereof
Deity donated dentition;
deity'll donate doughnuts --English version by Muke Tever
God gave gums; God'll give granary --Version by Mat McVeagh


j***@spin.ie
2003-10-08 11:21:05 UTC
Permalink
Post by John Cowan
Post by j***@spin.ie
(The XML1.1 spec removes a few of those characters, I would have
removed more, but that's another issue).
You have no idea what fearful drubbings I had to administer to get
even the few removed that I did.
Well, I have a general tendency towards being liberal in these matters (as I've said before, allowing nonsense is *sometimes* a good way to ensure you allow edge cases), so I can see where the objectors were coming from.
Post by John Cowan
Post by j***@spin.ie
[D]oes ISO 10646 allow those characters even though Unicode has them
undefined?
No, it doesn't. There was a strong feeling in the W3C Core WG that
it be possible to handle the Astral Planes uniformly; every character
off the BMP, therefore, is a valid Char as well as a valid NameStartChar.
Hmm. To my mind that isn't uniform at all: someone familiar with Unicode would already have disallowed, say, U+4FFFE as a non-character before they got as far as the production (making it effectively excluded), whereas someone relying on the XML spec for information about character properties would allow it.

Maybe CharMod will save us all...
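For reference, the XML 1.1 NameStartChar production can be transcribed directly into code, and the non-uniformity complained about above falls out immediately: a noncharacter like U+4FFFE passes, because the entire supplementary range #x10000-#xEFFFF is admitted wholesale. A sketch only; the ranges are copied from the production, the function name is mine:

```python
# Ranges transcribed from the XML 1.1 NameStartChar production:
# ":" | [A-Z] | "_" | [a-z] | #xC0-#xD6 | #xD8-#xF6 | #xF8-#x2FF |
# #x370-#x37D | #x37F-#x1FFF | #x200C-#x200D | #x2070-#x218F |
# #x2C00-#x2FEF | #x3001-#xD7FF | #xF900-#xFDCF | #xFDF0-#xFFFD |
# #x10000-#xEFFFF
_NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

def is_name_start_char(cp: int) -> bool:
    """Membership test against the XML 1.1 NameStartChar ranges."""
    return any(lo <= cp <= hi for lo, hi in _NAME_START_RANGES)

print(is_name_start_char(0x4FFFE))  # True -- the noncharacter is legal
print(is_name_start_char(0x2D))     # False -- '-' may not start a name
```

Note that no Unicode property is consulted at all: the production is pure code-point arithmetic, which is why it cannot distinguish an unassigned character from a noncharacter.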






j***@spin.ie
2003-10-08 15:28:39 UTC
Permalink
Post by Elliotte Rusty Harold
Of course it would have been possible to handle the "Astral Planes"
uniformly by making every character in them a legal Char, but not a
valid name character or name start character. This would have avoided
silliness like elements named after the musical symbol for a six
string fretboard or the damage of using undefined characters in XML
documents. It also would have been much more compatible with existing
parsers and tools. :-(
This would have created the opposite silliness of perfectly sensible name start characters being arbitrarily disallowed. I'd like to see the musical symbol for a six-string fretboard disallowed, because we *know* what it is and we *know* it is category So, and hence not an appropriate character for such use.

Similarly, we know U+4FFFE will never be assigned, so I'd prefer to see it disallowed as such, from Char and from all other productions.

With unassigned characters (but not non-characters), doing anything other than allowing them would cause forwards-compatibility issues. However, they (along with the private-use characters) are probably characters that should not be used for interchange.





