Questions on ZWNBS

Post by Theodore H. Smith
Hi list,
I have some questions on the ZWNBS. While I don't actually need this
myself, someone I know needs this.

As far as I know, the only rules here are:

The character U+FEFF *should* occur at the start of a UTF16 (either
endianness) text to act as the BOM.

The non-character U+FFFE should not occur in any encoding of Unicode;
this means that the *byte sequence* 0xFE 0xFF should not occur in a
UTF-16LE string.

ZWNBS can be a useful character (to suppress a line break), and there
is no reason not to use it.

Regards,
Owen
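
As a rough illustration of the byte-level point above, here is a minimal Python sketch. It assumes Python's built-in "utf-16-be" and "utf-16-le" codecs, which encode without adding a BOM of their own; the variable names are only for illustration.

```python
# Encode U+FEFF explicitly in each byte order (these codecs add no BOM).
feff_be = "\ufeff".encode("utf-16-be")   # bytes FE FF
feff_le = "\ufeff".encode("utf-16-le")   # bytes FF FE

# The noncharacter U+FFFE, written little-endian, is the same two bytes
# as U+FEFF written big-endian. That is why a reader who sees U+FFFE
# where a BOM is expected can conclude the byte order is swapped.
fffe_le = "\ufffe".encode("utf-16-le")   # bytes FE FF

print(feff_be.hex(), feff_le.hex(), fffe_le.hex())
```

Note that this only concerns code units at aligned (even) offsets; the same two bytes can straddle two neighbouring code units without any U+FFFE being present.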

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Jim Allan

2003-08-02 18:38:21 UTC

Post by Theodore H. Smith
I'm thinking that 0xFEFF shouldn't be in a UTF16BE string, except at
the start right?
For other kinds of UTF, I'm not sure if it is allowed or not. I know it
is allowed in UTF16LE, although discouraged.
Instead of "can't use ZWNBS", I think that char is discouraged. Where
is the rule that discourages it?

See http://anubis.dkuug.dk/JTC1/SC2/WG2/docs/n2235.htm for the proposal
to replace the ZWNBS use of U+FEFF with a new character U+2060 WORD JOINER.

See http://www.unicode.org/charts/PDF/UFE70.pdf for current definition
of U+FEFF stating:

• use as an indication of non-breaking is deprecated; see 2060 instead.

See http://www.unicode.org/charts/PDF/U2000.pdf for the definition of
U+2060 WORD JOINER which states:

• a zero width non-breaking space (only)
• intended for disambiguation of functions for byte order mark
→ FEFF zero width no-break space

U+2060 WORD JOINER should be used instead of U+FEFF if one's font and
application support it.

Jim Allan

Chris Jacobs

2003-08-02 19:47:19 UTC

[ cc Theodore Smith ]

So I had it wrong, it _is_ deprecated.

----- Original Message -----
From: "Jim Allan" <***@smrtytrek.com>
To: <***@unicode.org>; <***@redhat.com>
Sent: Saturday, August 02, 2003 8:38 PM
Subject: Re: Questions on ZWNBS

Post by Jim Allan

Chris Jacobs

2003-08-02 19:27:33 UTC

----- Original Message -----
From: "Theodore H. Smith" <***@elfdata.com>
To: <***@unicode.org>
Sent: Saturday, August 02, 2003 12:32 PM
Subject: Questions on ZWNBS

Post by Theodore H. Smith
Hi list,
I have some questions on the ZWNBS. While I don't actually need this
myself, someone I know needs this.

Where? Specifically, where does it say FEFF shouldn't be in a string?

It does not say that.

Post by Theodore H. Smith

Certainly, FEFF shouldn't be considered a BOM anywhere but at the start
of a string, but does it say you just can't use that value? And if so,
how are you supposed to use a ZWNBSP?!

I'm thinking that 0xFEFF shouldn't be in a UTF16BE string, except at
the start right?

Wrong!

U+FEFF has two different uses, ZWNBS and BOM.

In a UTF-16BE string (and also in a UTF-16LE string) it is _always_ a ZERO
WIDTH NO-BREAK SPACE, and _never_ a BOM, regardless of whether it is at the
beginning of the file or not.

Not that there is much use for a ZWNBS at the beginning of a file, but
suppose that you have a routine that removes BOMs at the beginning of
files. Then it should _not_ remove a ZWNBS at the beginning of a UTF-16BE
text, even though a ZWNBS there makes no sense.
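
The distinction can be seen in a short Python sketch. It relies on the behaviour of Python's standard codecs, where "utf-16" reads a leading BOM to pick the byte order and drops it, while "utf-16-be" treats every code unit as text; the sample bytes are made up for the illustration.

```python
# FE FF followed by the letter 'A' in big-endian byte order.
data = b"\xfe\xff\x00A"

# Encoding scheme UTF-16: the leading FE FF is a BOM, not content.
as_utf16 = data.decode("utf-16")

# Encoding scheme UTF-16BE: the byte order is already known, so the
# leading U+FEFF is part of the text (a ZWNBSP) and is kept.
as_utf16_be = data.decode("utf-16-be")

print(repr(as_utf16))     # 'A'
print(repr(as_utf16_be))  # '\ufeffA'
```

A BOM-stripping routine therefore needs to know which encoding scheme was declared before it touches an initial U+FEFF.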

Post by Theodore H. Smith
For other kinds of UTF, I'm not sure if it is allowed or not. I know it
is allowed in UTF16LE, although discouraged.
Instead of "can't use ZWNBS", I think that char is discouraged. Where
is the rule that discourages it?

The use of U+FEFF as ZWNBS is afaik not discouraged.

As for the use of UTF-16 with a BOM, I cannot cite a rule which
discourages it, but it is something I would expect to be discouraged. Using
UTF-16BE or UTF-16LE instead is much simpler.

Kenneth Whistler

2003-08-04 20:39:09 UTC

Post by Chris Jacobs
[ cc Theodore Smith ]
So I had it wrong, it _is_ deprecated.

It isn't exactly "deprecated", since deprecation has a
rather strong sense in the standard, and is correlated with
the formal assignment of a deprecated property to the
character.

Use of the code point U+FEFF is clearly *not* deprecated
in the standard.

The current situation is briefly as follows:

The standard *requires* the use of U+FEFF for some of
the Unicode encoding schemes. Details are spelled out in:

http://www.unicode.org/book/preview/ch03.pdf

Because of those requirements and the nature of the encoding
scheme definitions, the occurrence of U+FEFF in initial
position in *some* of the encoding schemes forces its
interpretation as a zero width no-break space, rather
than as a byte order mark. The difference is roughly
as follows: a BOM is not formally part of the content
of the text, but rather is part of the specification
of the encoding scheme; a ZWNBSP is formally part of the
content of the text.

*Because* this distinction, which is required for backwards
compatibility with existing usage of U+FEFF, is rather
subtle and confusing, and *because*, nonetheless, the
idea of having a character to indicate a no-break position
is a useful one, the UTC standardized (in Unicode 3.2),
U+2060 WORD JOINER as the *preferred* character to use
in the latter situation.

In other words, if what you need is to glue things together,
i.e. a zero width no-break space *function*, then use
U+2060. If what you need is a BOM for the encoding scheme
specifications, then use U+FEFF.

What is *discouraged*, but not prohibited, of course, is
using U+FEFF for a zero width no-break space *function*,
precisely because that interacts so confusingly with
the BOM.

--Ken
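
One way to act on this recommendation when cleaning up existing text is sketched below in Python. The helper name `prefer_word_joiner` is hypothetical, and the policy of leaving a leading U+FEFF untouched (because it may be a BOM) is an assumption of this sketch rather than anything the standard mandates.

```python
ZWNBSP = "\ufeff"       # also serves as the BOM in initial position
WORD_JOINER = "\u2060"  # preferred no-break character since Unicode 3.2

def prefer_word_joiner(text: str) -> str:
    """Replace non-initial U+FEFF with U+2060 (hypothetical helper)."""
    if text.startswith(ZWNBSP):
        # Assumption: an initial U+FEFF may be a BOM, so leave it alone.
        return ZWNBSP + text[1:].replace(ZWNBSP, WORD_JOINER)
    return text.replace(ZWNBSP, WORD_JOINER)

print(ascii(prefer_word_joiner("\ufeffab\ufeffcd")))
```

Whether the initial U+FEFF should be kept depends on the encoding scheme the text came from, so a real tool would need that information as well.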

Peter Kirk

2003-08-04 21:21:30 UTC

Post by Kenneth Whistler
In other words, if what you need is to glue things together,
i.e. a zero width no-break space *function*, then use
U+2060. If what you need is a BOM for the encoding scheme
specifications, then use U+FEFF.
What is *discouraged*, but not prohibited, of course, is
using U+FEFF for a zero width no-break space *function*,
precisely because that interacts so confusingly with
the BOM.
--Ken

And what if you need a ZWNBS function for something other than gluing
things together? For example, as a carrier for a string or line initial
diacritical mark when no spacing is required? This is one of the
suggestions for some of the Hebrew problems, but I have had no response
to my suggestion of using U+2060, which is inappropriately named for the
function I have in mind.
--
Peter Kirk
***@ntlworld.com
http://web.onetel.net.uk/~peterkirk/

Mark Davis

2003-08-04 22:09:05 UTC

The ZWSP and Word Joiner (plus ZWNBSP in its discouraged usage) are
targeted specifically at encouraging or avoiding *line break*. Their
names may be misleading; people intending to use them for any other
function should carefully read the sections of the Unicode Standard
that discuss their usage.

Your particular case has nothing really to do with line break; these
would be inappropriate usages unless the UTC were to decide to extend
the usage model for these characters (which I would not expect).

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Peter Kirk" <***@ntlworld.com>
To: "Kenneth Whistler" <***@sybase.com>
Cc: <***@unicode.org>
Sent: Monday, August 04, 2003 14:21
Subject: Re: Questions on ZWNBS

Post by Peter Kirk
And what if you need a ZWNBS function for something other than gluing
things together? For example, as a carrier for a string or line initial
diacritical mark when no spacing is required? This is one of the
suggestions for some of the Hebrew problems, but I have had no response
to my suggestion of using U+2060, which is inappropriately named for the
function I have in mind.
--
Peter Kirk
http://web.onetel.net.uk/~peterkirk/

Kenneth Whistler

2003-08-04 21:59:03 UTC

And what if you need a ZWNBS function for something other than gluing
things together? For example, as a carrier for a string or line initial
diacritical mark when no spacing is required?

This is not something sanctioned by the standard.

The carrier for a combining mark that is to display in isolation without
a base character is U+0020 SPACE. If you want to also indicate the
absence of a line break opportunity, then the carrier is U+00A0
NO-BREAK SPACE (NBSP).

Despite its name, U+FEFF ZWNBS is *NOT* a space character. It is
formally gc=Cf, not gc=Zs. It also does not have the White_Space
property.
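
These property values can be checked against whatever version of the Unicode Character Database ships with a given Python, using the standard `unicodedata` module. This is only a sketch: `str.isspace()` is Python's own whitespace test and is used here as an approximation of the White_Space property, not as the property itself.

```python
import unicodedata

# Carriers and joiners mentioned in this thread.
code_points = (0x0020, 0x00A0, 0x2060, 0xFEFF)

# Map each code point to (General_Category, Python's isspace() result).
props = {
    cp: (unicodedata.category(chr(cp)), chr(cp).isspace())
    for cp in code_points
}

for cp, (gc, is_space) in props.items():
    name = unicodedata.name(chr(cp))
    print(f"U+{cp:04X} {name}: gc={gc}, isspace={is_space}")
```

SPACE and NO-BREAK SPACE report gc=Zs, while WORD JOINER and U+FEFF report gc=Cf and are not treated as whitespace.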

So "a ZWNBS function for something other than gluing things together"
is a contradiction in terms of the current definition of the standard.
The *meaning* of the "ZWNBS function" is its behavior in the
context of UAX #14, Line Breaking Properties. See the WJ Word joiner
entry (normative) of UAX #14:

http://www.unicode.org/reports/tr14/

Post by Peter Kirk
This is one of the
suggestions for some of the Hebrew problems, but I have had no response
to my suggestion of using U+2060, which is inappropriately named for the
function I have in mind.

The function I think you have in mind is not isolated display of
a combining mark, but rather trying to find a mechanism for
getting around the conformance strictures of the standard, to
get a combining mark to apply to a *following* base
character, rather than to a *preceding* base character.

Trying to use U+FEFF *or* U+2060 to do this would be inappropriate.

--Ken

Peter Kirk

2003-08-04 22:57:07 UTC

And what if you need a ZWNBS function for something other than gluing
things together? For example, as a carrier for a string or line initial
diacritical mark when no spacing is required?

Neither of these is appropriate to the case I have in mind (described in
greater detail below) as they are not zero width and therefore give an
unwanted indent at the start of a line. U+200B ZERO WIDTH SPACE might be
appropriate, but this has the problem that it is a break opportunity,
which is not always appropriate.

Post by Kenneth Whistler
Despite its name, U+FEFF ZWNBS is *NOT* a space character. It is
formally gc=Cf, not gc=Zs. It also does not have the White_Space
property.
So "a ZWNBS function for something other than gluing things together"
is a contradiction in terms of the current definition of the standard.
The *meaning* of the "ZWNBS function" is its behavior in the
context of UAX #14, Line Breaking Properties. See the WJ Word joiner
entry (normative) of UAX #14:
http://www.unicode.org/reports/tr14/

Thank you, Ken, and also Mark. I didn't know where to find these

Post by Mark Davis
Their
names may be misleading; people intending to use them for any other
function should carefully read the sections of the Unicode Standard
that discuss their usage.

But which sections? Where is the index, online? It is unfortunate that
there are no links from the character charts or the database to the
various places where the uses of the characters are explained. All there
is is a character name, and as I have found quite often this character
name is seriously misleading if not actually incorrect. It is highly
unfortunate that it is not permitted to change these misleading names.

As it is, the note at U+FEFF in the character charts reads "use as an
indication of non-breaking is deprecated...", although you wrote that
this was not deprecated. But there is no note that use of ZERO WIDTH
NO-BREAK SPACE as a zero width no-break space is deprecated or "a
contradiction in terms of the current definition of the standard". Are
you surprised that I am confused?

If by "apply" in the above you mean "be positioned adjacent to", there
is already a problem with the standard: the EXISTING Hebrew page of the
standard is in contravention to its conformance strictures. This is
because under the existing standard (irrespective of any changes being
proposed) and in legacy encodings, the combining mark holam, which is
usually graphically positioned above the preceding base character, is in
certain environments, specifically when followed by a silent alef (holam
male is a separate issue), graphically positioned above the following
base character. But the standard has anticipated this kind of difficulty
by recognising that positioning is not always consistent with logical
ordering, see the note on Indic vowel signs in The Unicode Standard 4.0
section 2.10, subsection "Sequence of Base Characters and Diacritics",
http://www.unicode.org/book/preview/ch02.pdf. This is a documented
special case; Hebrew holam followed by silent alef is also a special
case whether you like it or not, it just hasn't been documented. It
could be removed, but that would require changes to every existing
(ancient or modern) pointed Hebrew text.

Post by Kenneth Whistler
Trying to use U+FEFF *or* U+2060 to do this would be inappropriate.

Understood. I await alternative suggestions.

Post by Kenneth Whistler
--Ken

--
Peter Kirk
***@ntlworld.com
http://web.onetel.net.uk/~peterkirk/

Philippe Verdy

2003-08-08 15:21:08 UTC

Post by Kenneth Whistler
The function I think you have in mind is not isolated display of
a combining mark, but rather trying to find a mechanism for
getting around the conformance strictures of the standard, to
get a combining mark to apply to a *following* base
character, rather than to a *preceding* base character.
Trying to use U+FEFF *or* U+2060 to do this would be inappropriate.

I tried this sequence and it seems to have the correct behavior including
for line breaking and word breaking:

<some text>, <ZWSP, Combining Acute Accent, CGJ, A>, <B>

It renders as some text, with a break opportunity before the ZWSP, then
an isolated acute accent combined with the following letter A, but
this last combination depends on fonts (if they support CGJ to
change the encoding order so that previous diacritics can combine
in the same combining sequence as the next base character).

This is quite tricky, I admit, and I just wonder what the correct
usage of CGJ is before a base character like Latin Capital Letter A.
Maybe a distinct combining character could be used, but I
wonder which one (CGJ is supposed to create some ligature
between two normally distinct combining sequences, each one
containing at least 1 base character). There's been a recent
discussion about using it also before a combining character and not
only before a base character.

If the initial break opportunity is undesirable, because the accented
letter A is in the same word as the previous <some text>, then one
can replace ZWSP (which is considered as white-space and thus a
word separator) by a Word-Joiner control (preferably to the ZWNBSP
U+FEFF whose usage in plain text is now deprecated if it is not used
as a BOM, and preferably not a ligating format control which would
have the undesirable effect of instructing the renderer to try using
a ligated glyph for the combined sequence, and thus alter the
semantic or appearance of the rendered text, where it was not
intended that the combining mark should have any implied glyph
relation with the previous base character)
--
Philippe.
Spams non tolérés: tout message non sollicité sera
rapporté à vos fournisseurs de services Internet.

Kenneth Whistler

2003-08-04 23:52:25 UTC

Peter,

Post by Kenneth Whistler
The carrier for a combining mark that is to display in isolation without
a base character is U+0020 SPACE. If you want to also indicate the
absence of a line break opportunity, then the carrier is U+00A0
NO-BREAK SPACE (NBSP).

Neither of these is appropriate to the case I have in mind (described in
greater detail below) as they are not zero width and therefore give an
unwanted indent at the start of a line.

Of course, because the whole point of this convention is to display
a non-spacing mark in isolation, not applied to a base character.

Post by Peter Kirk
U+200B ZERO WIDTH SPACE might be
appropriate, but this has the problem that it is a break opportunity,
which is not always appropriate.

U+200B ZERO WIDTH SPACE is not appropriate, for the same reason
the U+FEFF (or U+2060) is not appropriate: The Standard does
not specify the display of non-spacing marks on it as a means
of showing the marks without base characters. And, as you indicate,
U+200B (but also U+FEFF and U+2060) are implicated in the control
of line break opportunities. They are certainly not defined
as glyph display anchors or some such.

Post by Mark Davis
Their
names may be misleading; people intending to use them for any other
function should carefully read the sections of the Unicode Standard
that discuss their usage.

But which sections? Where is the index, online?

Patience please. The editor is paddling as fast as she can. If
you will refrain from clicking the remote for just a day or two
longer, all will be revealed.

Post by Peter Kirk
It is unfortunate that
there are no links from the character charts or the database to the
various places where the uses of the characters are explained.

Users of the new online edition of Unicode 4.0 will be pleasantly
surprised, I predict. The General Index is much expanded and
improved, and in the pdf the index markers are fully linked,
so you will be able to click through from the index to a location
in the text which is indexed. Other links for section references
and references to external documents will also be "live" in the
pdf. :-)

Post by Peter Kirk
All there
is is a character name, and as I have found quite often this character
name is seriously misleading if not actually incorrect. It is highly
unfortunate that it is not permitted to change these misleading names.

Yes, we all agree, but we live with it. For some of the obnoxious
instances, like ZWNBSP, it is better to just live with the
abbreviations as opaque monikers, like a "BXLZFITZL", rather than
focussing on whether the fact that there is a "SPACE" in its
name actually makes it a space character.

Post by Peter Kirk
As it is, the note at U+FEFF in the character charts reads "use as an
indication of non-breaking is deprecated...", although you wrote that
this was not deprecated.

The Unicode *character* U+FEFF is not deprecated, in the precise
sense of deprecation which is correlated with the character having
the "Deprecated" property in the Unicode Character Database.
(U+206C INHIBIT ARABIC FORM SHAPING, for example, *is* deprecated
in this sense.)

The use of U+FEFF as a non-breaker (= word joiner) is deprecated,
in the more general sense of "depreciated, not recommended", because
use of U+2060 WORD JOINER is less ambiguous and less trouble-prone.

Post by Peter Kirk
But there is no note that use of ZERO WIDTH
NO-BREAK SPACE as a zero width no-break space is deprecated or "a
contradiction in terms of the current definition of the standard".

There is further explication in both UAX #14 and in the relevant
sections of Chapter 15 in Unicode 4.0.

Post by Peter Kirk
Are
you surprised that I am confused?

No. That's why I'm spending time trying to keep making the
clarifications for you and others.

If by "apply" in the above you mean "be positioned adjacent to",

No, I mean logical application, in this context.

There are admitted deficiencies in the standard's text, even
now, regarding just what the "graphic interaction" for a combining
mark means -- that is grist for the Unicode 5.0 mill to grind
very finely, I suggest.

Post by Peter Kirk
there
is already a problem with the standard: the EXISTING Hebrew page of the
standard is in contravention to its conformance strictures. This is
because under the existing standard (irrespective of any changes being
proposed) and in legacy encodings, the combining mark holam, which is
usually graphically positioned above the preceding base character, is in
certain environments, specifically when followed by a silent alef (holam
male is a separate issue), graphically positioned above the following
base character. But the standard has anticipated this kind of difficulty
by recognising that positioning is not always consistent with logical
ordering, see the note on Indic vowel signs in The Unicode Standard 4.0
section 2.10, subsection "Sequence of Base Characters and Diacritics",
http://www.unicode.org/book/preview/ch02.pdf.

Or meditate on Figure 2-3, Unicode Character Code to Rendered Glyphs.
That is the fundamental mandala of the standard. ;-)

Post by Peter Kirk
This is a documented
special case; Hebrew holam followed by silent alef is also a special
case whether you like it or not, it just hasn't been documented. It
could be removed, but that would require changes to every existing
(ancient or modern) pointed Hebrew text.

The discussion of details of how to represent these sequences
should probably migrate back to the ***@unicode.org list.

--Ken

Post by Kenneth Whistler
Trying to use U+FEFF *or* U+2060 to do this would be inappropriate.

Understood. I await alternative suggestions.

Peter Kirk

2003-08-05 00:14:00 UTC

Post by Peter Kirk
U+200B ZERO WIDTH SPACE might be
appropriate, but this has the problem that it is a break opportunity,
which is not always appropriate.

Thank you for the clarification.

Post by Mark Davis
Their
names may be misleading; people intending to use them for any other
function should carefully read the sections of the Unicode Standard
that discuss their usage.

But which sections? Where is the index, online?

Patience please. The editor is paddling as fast as she can. If
you will refrain from clicking the remote for just a day or two
longer, all will be revealed.

I will wait, and try to do so patiently.

Post by Peter Kirk
Are
you surprised that I am confused?

No. That's why I'm spending time trying to keep making the
clarifications for you and others.

Thank you. I appreciate the time you are putting into this.

Post by Kenneth Whistler
The function I think you have in mind is not isolated display of
a combining mark, but rather trying to find a mechanism for
getting around the conformance strictures of the standard, to
get a combining mark to apply to a *following* base
character, rather than to a *preceding* base character.

If by "apply" in the above you mean "be positioned adjacent to",

No, I mean logical application, in this context.
There are admitted deficiencies in the standard's text, even
now, regarding just what the "graphic interaction" for a combining
mark means -- that is grist for the Unicode 5.0 mill to grind
very finely, I suggest.

Or meditate on Figure 2-3, Unicode Character Code to Rendered Glyphs.
That is the fundamental mandala of the standard. ;-)

Thanks for the pointer.

A similar issue which is not Hebrew related would be a (mythical)
requirement to display a diacritic like 0315, 031B or 0322 in isolation.
It would not always be appropriate to use a space or NBSP as a base
character as this would indent the glyph from the beginning of a line in
a way which might not be wanted. What would be the recommended encoding
if one wanted to display one of these characters with no leading white
space?

The discussion of details of how to represent these sequences
should probably migrate back to the ***@unicode.org list.

I have already copied it there.
--
Peter Kirk
***@ntlworld.com
http://web.onetel.net.uk/~peterkirk/

Kent Karlsson

2003-08-05 15:47:00 UTC

Post by Peter Kirk
U+200B ZERO WIDTH SPACE might be
appropriate, but this has the problem that it is a break opportunity,
which is not always appropriate.

I see no particular *technical* problem with using WJ, though. That is
in contrast to the suggestion of using CGJ (re. another problem) anywhere
else but at the end of a combining sequence. CGJ has combining class 0,
despite being invisible and not ("visually") interfering with any other
combining mark. Using CGJ at a non-final position in a combining sequence
puts in doubt the entire idea of combining classes and normal forms.

/kent k
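
The interaction with normalization can be observed with Python's `unicodedata` module. In this sketch the choice of marks (acute, combining class 230, and dot below, combining class 220) is arbitrary; any two marks with different non-zero combining classes would do.

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER, combining class 0
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220

plain = "a" + ACUTE + DOT_BELOW
with_cgj = "a" + ACUTE + CGJ + DOT_BELOW

# Canonical ordering sorts adjacent marks by combining class...
nfd_plain = unicodedata.normalize("NFD", plain)
# ...but a class-0 character between them blocks the reordering.
nfd_cgj = unicodedata.normalize("NFD", with_cgj)

print(ascii(nfd_plain))  # 'a\u0323\u0301'
print(ascii(nfd_cgj))    # 'a\u0301\u034f\u0323'
```

So two strings that differ only by an interior CGJ are no longer canonically equivalent after normalization, which appears to be the kind of effect the remark above is concerned with.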

Philippe Verdy

2003-08-08 15:54:36 UTC

Post by Kenneth Whistler
Peter,

Post by Kenneth Whistler
The carrier for a combining mark that is to display in isolation
without a base character is U+0020 SPACE. If you want to also
indicate the absence of a line break opportunity, then the
carrier is U+00A0 NO-BREAK SPACE (NBSP).

Neither of these is appropriate to the case I have in mind
(described in greater detail below) as they are not zero width and
therefore give an unwanted indent at the start of a line.

Of course, because the whole point of this convention is to display
a non-spacing mark in isolation, not applied to a base character.

Post by Peter Kirk
U+200B ZERO WIDTH SPACE might be
appropriate, but this has the problem that it is a break
opportunity, which is not always appropriate.

Here I disagree: ZWS is a white-space character, not a format control, and thus it
has a glyphic and semantic identity of its own (unlike ZWNBSP or WJ).
So ZWS clearly qualifies as a base character, and is certainly better
(conceptually and per its breaking properties) than the standard ASCII
space, which has an implied minimum width (which may be too large
to be used as a holder for a tiny diacritic like a dot above, or even an
acute accent).

200B;ZERO WIDTH SPACE;Zs;0;BN;;;;;N;;;;;
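The properties under discussion can be inspected directly. The following is a minimal sketch using Python's standard unicodedata module; the CANDIDATES table and the describe() helper are illustrative names, not part of any proposal. The values printed depend on the Unicode version bundled with the Python build, so they may differ from the data line quoted above (later versions of the Unicode Character Database list U+200B with general category Cf rather than Zs).

```python
import unicodedata

# Characters mentioned in the thread as possible carriers or joiners.
CANDIDATES = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00a0",
    "ZERO WIDTH SPACE": "\u200b",
    "WORD JOINER": "\u2060",
    "ZERO WIDTH NO-BREAK SPACE": "\ufeff",
    "COMBINING GRAPHEME JOINER": "\u034f",
}


def describe(ch):
    """Return (general category, canonical combining class, bidi class)."""
    return (
        unicodedata.category(ch),
        unicodedata.combining(ch),
        unicodedata.bidirectional(ch),
    )


for name, ch in CANDIDATES.items():
    gc, ccc, bidi = describe(ch)
    print(f"U+{ord(ch):04X} {name}: gc={gc} ccc={ccc} bidi={bidi}")
```

SPACE and NBSP report gc=Zs, while WJ and ZWNBSP report gc=Cf, which is the distinction between whitespace and format controls that the argument above relies on.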

When we speak about combining sequences, they are already
supposed to expand the width or height of a base character to
which it applies, so ZWS despite being zero-width itself, does
not make this property inherited to the combining sequence which
includes it.

For me, the best two candidates for holders of isolated diacritics
are ZWS (if breakable before and after the combining sequence),
or WJ (if not breakable when the isolated diacritic must be used
within the same word without internal break opportunity).
However WJ is a control and does not fit well for the second
usage. Could there be another codepoint assigned that has
these properties:

20CF;ZERO WIDTH SYMBOL;Sk;0;ON;<compat> 0020;;;;N;;;;;

i.e. being considered symbolic, not a whitespace, with
combining class 0 (not combining), and used as an
explicit base for an isolated spacing diacritic to never show
with a dotted circle? (note U+20CF is just a suggestion, as
it fits at end of the symbolic block used for currency symbols,
just before the "extended" combining characters block, and
because the U+02XX block where other "Sk" spacing
diacritics are defined is full).

The compatibility decomposition to a space is to make it
in sync with other compatibly decomposable spacing
diacritics.

The new character would make it possible to represent diacritics that currently
don't have a spacing counterpart, and to use them as if they were letter-like.
Let's look at a similar diacritic which currently has an existing
"precombined" spacing version:

00B4;ACUTE ACCENT;Sk;0;ON;<compat> 0020 0301;;;;N;SPACING ACUTE;;;;
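To see which spacing diacritics already follow the pattern of this data line, one can scan for Sk characters whose compatibility decomposition is SPACE plus a single combining mark. This is only a sketch with Python's standard unicodedata module; the function name spacing_diacritics and the scan limit are arbitrary choices made for illustration.

```python
import unicodedata


def spacing_diacritics(limit=0x3000):
    """Yield (code point, name, mark) for Sk characters that
    compatibility-decompose to U+0020 SPACE plus one further code point."""
    for cp in range(limit):
        ch = chr(cp)
        if unicodedata.category(ch) != "Sk":
            continue
        fields = unicodedata.decomposition(ch).split()
        if len(fields) == 3 and fields[0] == "<compat>" and fields[1] == "0020":
            yield cp, unicodedata.name(ch), int(fields[2], 16)


# Map each spacing diacritic to the combining mark it decomposes to.
found = {cp: mark for cp, _name, mark in spacing_diacritics()}

for cp, name, mark in spacing_diacritics():
    print(f"U+{cp:04X} {name} -> SPACE + U+{mark:04X}")
```

Any diacritic absent from this listing has no encoded spacing counterpart, which is the gap the proposed carrier character is meant to fill.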


Peter Kirk

2003-08-08 19:54:28 UTC

... Could there be another codepoint assigned that has
20CF;ZERO WIDTH SYMBOL;Sk;0;ON;<compat> 0020;;;;N;;;;;
i.e. being considered symbolic, not a whitespace, with
combining class 0 (not combining), and used as an
explicit base for a isolated spacing diacritic to never show
with a dotted circle? (note U+20CF is just a suggestion, as
it fits at end of the symbolic block used for currency symbols,
just before the "extended" combining characters block, and
because the U+02XX block where other "Sk" spacing
diacritics are defined is full).
The compatibility decomposition to a space is to make it
in sync with other compatibly decomposable spacing
diacritics.
The new character would allow to represent diacritics that currently
don't have a spacing counterpart, and use them as if they were letter
like. Let's look at a similar diacritic which currently has an existing
00B4;ACUTE ACCENT;Sk;0;ON;<compat> 0020 0301;;;;N;SPACING ACUTE;;;;

Philippe, this sounds like an excellent suggestion, at least in general
terms. There is a missing function here, which has been provided (since
Unicode 1.0) by overloading the characters space and NBSP with an
inappropriate second function. Of course we can't make existing practice
illegal, but we can recommend that in future versions of the standard
your new ZERO WIDTH SYMBOL character should be used for display of
isolated diacritics where there is no separate spacing form. We can also
suggest that the width of the combination should be that of the
diacritic only.

But I'm not sure that ZERO WIDTH SYMBOL is the best name, unless you are
suggesting other uses in which it really has zero width. Well, it might
have in a case like line initial holam which shifts on to a following
silent alef, but that is a rather special case.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/


t***@widmann.uklinux.net

2003-08-08 20:56:45 UTC

... Could there be another codepoint assigned that has
20CF;ZERO WIDTH SYMBOL;Sk;0;ON;<compat> 0020;;;;N;;;;;
[...]

But I'm not sure that ZERO WIDTH SYMBOL is the best name, unless you
are suggesting other uses in which it really has zero width. Well, it
might have in a case like line initial holam which shifts on to a
following silent alef, but that is a rather special case.

What would be a better name? ACCENT CARRIER?

/Thomas
--
Thomas Widmann, MA +44 141 419 9872 Glasgow, Scotland, EU
***@widmann.uklinux.net http://www.widmann.uklinux.net


Peter Kirk

2003-08-08 21:00:07 UTC

Post by t***@widmann.uklinux.net

... Could there be another codepoint assigned that has
20CF;ZERO WIDTH SYMBOL;Sk;0;ON;<compat> 0020;;;;N;;;;;
[...]

What would be a better name? ACCENT CARRIER?
/Thomas

Perhaps CARRIER FOR COMBINING CHARACTERS - not COMBINING CHARACTER
CARRIER as that gives the wrong idea that this should itself be a
combining character, it should not.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/


Philippe Verdy

2003-08-08 23:11:15 UTC

Post by Peter Kirk
But I'm not sure that ZERO WIDTH SYMBOL is the best name, unless you
are suggesting other uses in which it really has zero width. Well, it
might have in a case like line initial holam which shifts on to a
following silent alef, but that is a rather special case.

I picked "SYMBOL" just to match the required property shared by
other spacing variants of diacritics. The "ZERO WIDTH" is probably confusing, but it just marks the fact that it has no associated glyph and a null *minimum* width (which expands to the largest diacritic(s) with which it is combined).

Its main role would be to fill the gap for missing spacing versions of existing diacritics.

What about the name "INVISIBLE CARRIER SYMBOL"? Note that I avoid any occurrence of the term "COMBINING" in the name, because there would be no requirement for this character to be followed by any diacritic(s). The character would itself be handled as a symbol, in a way similar to the existing spacing diacritics (which are already of category Sk, and are conceptually a combination of the INVISIBLE CARRIER SYMBOL and diacritics, defined for compatibility purposes as an approximation of the sequence SPACE+diacritic).

It is worth noting that for now it is quite tricky to get an isolated diacritic without getting disappointing results. In some cases, the only way to do it is by using what Unicode describes as "defective" combining sequences, which are not illegal by themselves but whose rendering and interpretation are not guaranteed.

Conversely, Unicode offers a standard way to force the appearance of the dotted circle for an isolated diacritic, a function that may not always be desirable, by using a dotted circle symbol as the base character.
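For illustration, the carrier sequences that exist today can be built and checked as follows. This is a minimal sketch with Python's standard unicodedata module; the CARRIERS table and variable names are made up for the example, and the check only concerns normalization, not how any renderer displays the result.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT
CARRIERS = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00a0",
    "DOTTED CIRCLE": "\u25cc",
}

# Build carrier + mark and check that NFC leaves each sequence untouched.
sequences = {name: base + ACUTE for name, base in CARRIERS.items()}
stable = {
    name: unicodedata.normalize("NFC", seq) == seq
    for name, seq in sequences.items()
}

for name, seq in sequences.items():
    points = " ".join(f"U+{ord(c):04X}" for c in seq)
    print(f"{name}: {points} stable under NFC: {stable[name]}")
```

None of these sequences composes into a single character under NFC, so the choice of carrier survives canonical normalization and remains visible to downstream processes.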

As someone corrected me on this list, SPACE+combining diacritic is admitted in the standard, but only as a compatibility equivalence for spacing diacritics. The isolated spacing diacritic is really a symbol (gc=Sk), unlike the base SPACE character used in the compatibility decomposition (which has gc=Zs). This means that SPACE+combining diacritic does not have the same textual semantics as the spacing diacritics that are already encoded (all of them seem to have gc=Sk, and are not considered Letters with gc=Lo, which is why I thought the name "SYMBOL" was accurate).

Also I tried to justify a possible codepoint assignment at U+20CF, where it would group more logically, given that the U+02XX block is already full and U+20XX is used for both symbols (including currencies) and a set of additional combining diacritics. Of course the U+20CF is just a suggestion, not something approved or documented.
--
Philippe.
Spams non tolérés: tout message non sollicité sera
rapporté à vos fournisseurs de services Internet.


Michael Everson

2003-08-08 23:24:13 UTC

Post by Philippe Verdy
I just picked "SYMBOL" to just match the required property that would match
other spacing variants of diacritics. The "ZERO WIDTH" is probably
confusing, but it just marks the fact that it has no associated
glyph and a null *minimum* width (which expands to the largest
diacritic(s) with which it is combined).

The Name Police reject this utterly. ZERO WIDTH cannot have an
expanding dynamic width.
This pseudo-character will not be encoded. Time to drop the thread.
--
Michael Everson * * Everson Typography * * http://www.evertype.com


Kent Karlsson

2003-08-09 13:11:39 UTC

Post by Michael Everson
The Name Police reject this utterly. ZERO WIDTH cannot have an
expanding dynamic width.

Then what about ZERO WIDTH SPACE, which, according to TUS3, p. 238,
"can grow to have a visible width when justified"? And it has the
NamesList comment:
* nominally zero width, but may expand in justification

(But U+0082, BREAK PERMITTED HERE, which otherwise is very similar
to ZWSP according to 6429, does apparently not allow such stretching...)

/kent k


Philippe Verdy

2003-08-09 21:49:52 UTC

Post by Michael Everson
The Name Police reject this utterly. ZERO WIDTH cannot have an
expanding dynamic width.

Post by Kent Karlsson
Then what about ZERO WIDTH SPACE, which, according to TUS3, p. 238,
"can grow to have a visible width when justified"? And it has the
NamesList comment:
* nominally zero width, but may expand in justification
(But U+0082, BREAK PERMITTED HERE, which otherwise is very similar
to ZWSP according to 6429, does apparently not allow such
stretching...)
/kent k

- ZERO WIDTH SPACE would be good only if it did not have the "Zs" general
category, which qualifies it as whitespace and as a word breaker. In fact
the same problem occurs with the general category of SPACE
and NBSP, which is a good reason why they are open to criticism as
base characters for word-like sequences: even with an NBSP, there
is still a word delimitation, which may be important for orthographic
and grammatical analysis, given that the main difference between SPACE
and NBSP is the line-breaking behavior, not the word-breaking
behavior.

- BREAK PERMITTED HERE is a control and does not qualify as a base
character.

In fact, the gaps to fill depend on the usage:

1) when the isolated diacritic is to be used as a spacing symbol which
should not be forcibly glued to the surrounding characters, the NBSP base
character is a problem, and in fact it also has the wrong character
properties, since the whole combining sequence
normally inherits the properties of its base character.
For this usage, we need something like an "INVISIBLE SYMBOL"
base character (with gc=Sk as for the other existing spacing diacritics,
and probably with neutral directionality). The combining sequence
will have its width adjusted to the largest diacritic(s) applied to that
"INVISIBLE SYMBOL" base character. The nearest existing character
to fit this function is ZWS, but it is whitespace, not symbolic.

2) when the isolated diacritic is to be used as a regular letter within
words (e.g. in Traditional Hebrew), we need something like an "INVISIBLE
LETTER" base character (with gc=Lo and neutral directionality), whose
width is not necessarily supposed to be adjusted but may adjust
depending on the left or right context (in rendering engines). One could then
use an isolated circumflex between the two characters of the pair "oo", with the
diacritic centered on the touching edges of the surrounding spacing
base characters, or the renderer would create a sufficient margin on either side to make
the isolated diacritic fit. The resulting combining sequence with the INVISIBLE
LETTER and its non-spacing diacritics would be mostly non-spacing.
But this rendering may be tricky to implement in many cases, and the
renderer should be allowed to render it as a spacing diacritic, as for the
invisible symbol, except that it would not be a symbol but really a letter that
can fit within a word (and have applications for elided letters in the middle of
an unbreakable word). This function is partially implementable with CGJ, but only
if there's a preceding combining sequence or base letter, or with WJ (Word
Joiner), but that is a format control and not applicable as a base character.

For texts that want to present the isolated diacritic in its normal
function as a diacritic, the current best solution is to use the existing
(spacing) dotted circle symbol as the base character. However this usage
is quite technical and too Unicode-specific, and it is not appropriate
for all usages, since the dotted circle base character may conflict
with other uses of this symbol in a document. Some other documents
prefer to use a gray-coloured Latin small
letter o for such presentation forms, in rich text like HTML or RTF. This still has the problem
that a rich-text format like HTML will break the plain text into separate
sequences, where the non-grayed diacritic must still be rendered on top
of this separate sequence. Which base character can be used in that
case? There's currently none, except trying ZWS (which does not always
work); it should rather be a non-spacing INVISIBLE LETTER than
a spacing INVISIBLE SYMBOL (which by itself has no defined width
but just a minimum width of 0).
--
Philippe.
Spams non tolérés: tout message non sollicité sera
rapporté à vos fournisseurs de services Internet.


Kent Karlsson

2003-08-10 09:53:57 UTC

Post by Michael Everson
The Name Police reject this utterly. ZERO WIDTH cannot have an
expanding dynamic width.

Post by Kent Karlsson
Then what about ZERO WIDTH SPACE, which, according to TUS3, p. 238,
"can grow to have a visible width when justified"? And it has the
NamesList comment:
* nominally zero width, but may expand in justification

<<Philippe-"spam" deleted...>>

Note that *my* comment ("> >" above) only referred to the "name policing"
(given that the policing principle Michael mentioned is already broken).

/kent k

----
Spams de Philippe Verdy non tolérés: tout message non sollicité sera
rapporté à son fournisseur de services Internet.


Philippe Verdy

2003-08-10 12:18:15 UTC

Post by Kent Karlsson
<<Philippe-"spam" deleted...>>
----
Spams de Philippe Verdy non tolérés: tout message non sollicité sera
rapporté à son fournisseur de services Internet.

There was no spam in the message you deleted. It was a single post to the list: no cross-posting, no advertising, no product sold, no money claimed, no required action, no identity forged, and no deceptive subject line. The message was on topic...

Reread the definition of spam: "bulk + unsolicited". Maybe you don't like my message, but reporting it to my ISP will not be successful for you, and in fact you risk more by doing so because my ISP could complain to yours.

If you don't like my message, which was on topic, don't reply to it, delete it, ignore it, but don't make such a false claim...

Thanks.


Kent Karlsson

2003-08-10 15:27:19 UTC

Post by Kent Karlsson
<<Philippe-"spam" deleted...>>
----
Spams de Philippe Verdy non tolérés: tout message non sollicité sera
rapporté à son fournisseur de services Internet.

Some people just don't get sarcasm... ;-( For someone who has
such a sentence as yours at the end of their mails, you do
generate quite a lot of unsolicited comments, misreading the
contents of the messages you reply to. (Yes, that does annoy
me.)

Post by Philippe Verdy
If you think you don't like my message which was on topic,
don't reply to it, delete it, ignore it, but don't do such
false claim...

Please don't go off on a tangent about something that the message
you replied to did not talk about. Please read the message first,
and understand what it's about. Then read it again, and make
sure you have not misread. And keep any replies at a suitable
length. And no, I will not let you mislead readers of this list by
not commenting on what you write. I'm not the only one
chastising you. You may have noticed e.g. Ken do the same,
in no subtle ways.

You are of course welcome to participate in the discussions
on this list. But please,

. be careful about terminology, and about what is what,
. don't misread, and misreply, to messages from others,
. keep your posting at a reasonable length, concentrated
on the issue at hand,
. don't mislead readers by erroneous statements formulated
as unquestionable truths,
. (I'm sure to have missed something...).

That way you can contribute positively to the discussion,
while not constantly annoying or misleading people.

Don't worry Philippe, I of course never intended to report
you anywhere. Just trying to get you to behave a bit more
conscientiously.

/kent k

Post by Philippe Verdy
Thanks.

Michael Everson

2003-08-10 16:53:22 UTC

Post by Michael Everson
The Name Police reject this utterly. ZERO WIDTH cannot have an
expanding dynamic width.

Post by Kent Karlsson
Then what about ZERO WIDTH SPACE, which, according to TUS3, p. 238,
"can grow to have a visible width when justified"? And it has the
NamesList comment:
* nominally zero width, but may expand in justification

(Rolls eyes.)

Fine.
--
Michael Everson * * Everson Typography * * http://www.evertype.com


Kenneth Whistler

2003-08-06 01:06:00 UTC

Post by Kent Karlsson
I see no particular *technical* problem with using WJ, though. In contrast
to the suggestion of using CGJ (re. another problem) anywhere else but
at the end of a combining sequence. CGJ has combining class 0, despite
being invisible and not ("visually") interfering with any other combining
mark. Using CGJ at a non-final position in a combining sequence puts
in doubt the entire idea with combining classes and normal forms.

Why? There are any number of combining characters with combining
class 0, including the vast majority of Indic dependent vowels,
for instance.

A combining character sequence is a base character followed
by any number of combining characters. There is no constraint
in that definition that the combining characters have to
have non-zero combining class.

Canonical reordering is scoped to stop at combining class = 0.
It doesn't say that it applies to combining character sequences
per se. It applies to *decomposed* character sequences
(meaning, effectively, any sequence which has had the recursive
application of the decomposition mappings done).

Take a Myanmar example: /kau/:

character sequence:   <1000, 1031, 102C, 1039, 200C>
combining?:             no    yes   yes   yes   no
combining classes:      0     0     0     9     0
comb char sequence:    ----------------------
canon reorder scope:   ---|  ---|  ---------|  ---|

The combining character sequence here is: <1000, 1031, 102C, 1039>
The *syllable* consists of that plus the trailing ZWNJ.
But the relevant sequences for application of the
canonical reordering algorithm are each sequence starting
with combining class zero and continuing through any
sequence with combining class not zero.

I don't see how introduction of CGJ into such sequences calls
any of the definitions or algorithms into question.
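The scoping in the Myanmar example can be reproduced with a short sketch using Python's standard unicodedata module. The helper name reorder_scopes is invented for this illustration; it simply starts a new scope at every character with combining class 0 and extends the scope through any following characters with non-zero class.

```python
import unicodedata


def reorder_scopes(text):
    """Split text into runs that each start at a character with combining
    class 0 and continue through following characters of non-zero class."""
    scopes = []
    for ch in text:
        if unicodedata.combining(ch) == 0 or not scopes:
            scopes.append([ch])
        else:
            scopes[-1].append(ch)
    return scopes


# /kau/: KA, VOWEL SIGN E, VOWEL SIGN AA, VIRAMA, ZWNJ
kau = "\u1000\u1031\u102c\u1039\u200c"
classes = [unicodedata.combining(c) for c in kau]
scopes = [[f"{ord(c):04X}" for c in scope] for scope in reorder_scopes(kau)]

print(classes)
print(scopes)
```

The four scopes correspond to the four segments marked on the "canon reorder scope" line of the example.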

--Ken


Kent Karlsson

2003-08-06 10:38:03 UTC

Post by Kent Karlsson
I see no particular *technical* problem with using WJ, though. In contrast
to the suggestion of using CGJ (re. another problem) anywhere else but
at the end of a combining sequence. CGJ has combining class 0, despite
being invisible and not ("visually") interfering with any other combining
mark. Using CGJ at a non-final position in a combining sequence puts
in doubt the entire idea with combining classes and normal forms.

Why?

See above (I DID write the motivation!). Combining classes are generally
assigned according to "typographic placement". Combining characters
(except those that are really letters) that have the "same" placement,
and "interfere typographically", are assigned the same combining class,
while those that don't get different classes, and the relative order is
then considered unimportant (canonically equivalent). How, then, is
e.g. <a, ring above, cgj, dot below> supposed to be different from
<a, dot below, cgj, ring above> (supposing all involved characters
are fully supported), when <a, ring above, dot below> is NOT
supposed to be much different from <a, dot below, ring above>
(them being canonically equivalent)? An invisible combining character
does not interfere typographically with anything, it being invisible!
The other invisible (per se!) combining characters with combining
class 0, the variation selectors, are ok, since their *conforming* use
is very highly constrained. Maybe I've been wrong, but I have taken
CGJ as similarly constrained, as it was given a semantics only when
followed by a base character (but now it seems to have no semantics
at all).

Post by Kenneth Whistler
There are any number of combining characters with combining
class 0, including the vast majority of Indic dependent vowels,
for instance.

These are ok. They are not invisible, and the vowels should not
reorder amongst themselves in a single combining sequence (I know,
there is normally only one vowel per syllable, but as the Hebrew
discussion has shown, one should not generalise too much),
regardless of placement (before, above, below, after, before&after, ...).
So at least they should have the same combining class, regardless
of typographic placement. (This should have been the case also
for the Hebrew vowels...) But class 0 (which is specially treated),
I'm not sure if that was ideal.

Post by Kenneth Whistler
A combining character sequence is a base character followed
by any number of combining characters. There is no constraint
in that definition that the combining characters have to
have non-zero combining class.

Well, you cannot *conformantly* place a VS anywhere in a combining
sequence! Only certain combinations of base+vs are allowed in
any given version of Unicode. (Breaking that does not make the
combining sequence ill-formed, or illegal, but would make it
non-conformant, just like using an unassigned code point.)

Post by Kenneth Whistler
Canonical reordering is scoped to stop at combining class = 0.

(I know it is. But I confess I'm not sure why.)

Post by Kenneth Whistler
It doesn't say that it applies to combining character sequences
per se. It applies to *decomposed* character sequences
(meaning, effectively, any sequence which has had the recursive
application of the decomposition mappings done).

Yes, for the definition of normalisation. But not necessary for
canonical equivalence. Your point?

Post by Kenneth Whistler
character sequence:   <1000, 1031, 102C, 1039, 200C>
combining?:             no    yes   yes   yes   no
combining classes:      0     0     0     9     0
comb char sequence:    ----------------------
canon reorder scope:   ---|  ---|  ---------|  ---|
The combining character sequence here is: <1000, 1031, 102C, 1039>
The *syllable* consists of that plus the trailing ZWNJ.
But the relevant sequences for application of the
canonical reordering algorithm are each sequence starting
with combining class zero and continuing through any
sequence with combining class not zero.
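
As an illustration of the scoping described in the quoted table, here is a
minimal Python sketch. The function name reorder_scopes is hypothetical; it
only assumes that unicodedata.combining() from the standard library reports
the canonical combining class. It splits a string into runs that each start
at a class-0 character and continue through the following non-zero classes.

```python
import unicodedata


def reorder_scopes(text):
    """Split text into runs: one class-0 character plus the non-zero-class
    characters that follow it. Reordering can only happen inside a run."""
    scopes = []
    for ch in text:
        if unicodedata.combining(ch) == 0 or not scopes:
            scopes.append([ch])       # a class-0 character opens a new run
        else:
            scopes[-1].append(ch)     # non-zero class: stays in current run
    return scopes


# The Myanmar example from the table: <1000, 1031, 102C, 1039, 200C>
syllable = "\u1000\u1031\u102C\u1039\u200C"
scopes = [[f"{ord(c):04X}" for c in run] for run in reorder_scopes(syllable)]
```

Under these assumptions the only run with more than one character is
<102C, 1039>, which matches the "canon reorder scope" row of the table.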

Formally, a character *pair* based definition is enough:
xy S yx, if 0 < cc(y) < cc(x) (and apply that repeatedly);
no need to define any "canonically reordering scope", though
that may be marginally more efficient in an implementation
of normalisation (but this is getting beside the topic of this
discussion).
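
The pair-based rule can be sketched in a few lines of Python. This is only an
illustration under the assumption that the input is already fully decomposed;
canonical_reorder is a hypothetical name, and unicodedata.combining() supplies
cc(). It swaps adjacent characters x, y whenever 0 < cc(y) < cc(x) and repeats
until no swap applies.

```python
import unicodedata


def canonical_reorder(text):
    """Repeatedly swap adjacent pairs x, y with 0 < cc(y) < cc(x).
    Assumes text is already fully decomposed."""
    chars = list(text)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(chars) - 1):
            x, y = chars[i], chars[i + 1]
            if 0 < unicodedata.combining(y) < unicodedata.combining(x):
                chars[i], chars[i + 1] = y, x
                swapped = True
    return "".join(chars)


RING_ABOVE = "\u030A"   # combining class 230
DOT_BELOW = "\u0323"    # combining class 220
CGJ = "\u034F"          # combining class 0

reordered = canonical_reorder("a" + RING_ABOVE + DOT_BELOW)
blocked = canonical_reorder("a" + RING_ABOVE + CGJ + DOT_BELOW)
```

No scope is ever computed here: a class-0 character such as CGJ simply never
satisfies the swap condition on either side, so nothing moves across it.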

Post by Kent Karlsson
I don't see how introduction of CGJ into such sequences calls
any of the definitions or algorithms into question.

No, not the algorithm, but the basic idea and design. The algorithm
as such has no "idea" how or why the combining class numbers
were assigned. But we humans do, or might have.

Again, why should not <a, ring above, cgj, dot below> be canonically
equivalent to <a, dot below, cgj, ring above>, when <a, ring above,
dot below> is canonically equivalent to <a, dot below, ring above>?
And I want a design answer, not a formal answer! (The latter I already
know, and is uninteresting.)

Since I think <a, ring above, cgj, dot below> should be canonically
equivalent to <a, dot below, cgj, ring above>, but cannot be made
so (now), the only ways out seem to be to either formally deprecate
CGJ, or at least confine it to very specific uses. Other occurrences
would not be ill-formed or illegal, but would then be non-conforming.

/kent k

Post by Kent Karlsson
--Ken

Philippe Verdy

2003-08-06 14:26:42 UTC

Post by Kent Karlsson
Since I think <a, ring above, cgj, dot below> should be canonically
equivalent to <a, dot below, cgj, ring above>, but cannot be made
so (now), the only ways out seem to be to either formally deprecate
CGJ, or at least confine it to very specific uses. Other occurrences
would not be ill-formed or illegal, but would then be non-conforming.

There's a way to specify that <A, RingAbove, CGJ, DotBelow> is
well-formed, but not <A, DotBelow, CGJ, RingAbove>:
a CGJ can be authorized in a combining sequence only if it
precedes a base character, or precedes a combining character
whose combining class is strictly lower than the combining class
of the previous character.

So, with this definition, with the combining classes indicated:

- <A=0, RingAbove=230, CGJ=0, DotBelow=220>
is well-formed because 220 < 230. It is distinct from:
<A=0, RingAbove=230, DotBelow=220>, whose canonical
ordering is
<A=0, DotBelow=220, RingAbove=230>

- <A=0, DotBelow=220, CGJ=0, RingAbove=230>
is ill-formed because 230 > 220. The CGJ is superfluous
and should be removed to create:
<A=0, DotBelow=220, RingAbove=230>

- <A=0, DotBelow=220, CGJ=0, Cedilla=220>
is ill-formed because 220 = 220. The CGJ is superfluous
and should be removed to create:
<A=0, DotBelow=220, Cedilla=220>
which is well-formed and in canonical order.

- <A=0, Cedilla=220, CGJ=0, DotBelow=220>
is ill-formed because 220 = 220. The CGJ is superfluous
and should be removed to create:
<A=0, Cedilla=220, DotBelow=220>
which is well-formed and in canonical order.

This "well-formed" rule would clearly give an exact semantic
for CGJ, used in the middle of a combining sequence as the
only way to bypass the canonical reordering of combining
characters.
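
The proposed rule can be sketched as a checker in Python. This is not part of
any standard; cgj_placement_ok is a hypothetical name, "combining character"
is approximated as General Category M*, and a CGJ at the very end of the text
is accepted as an extra assumption. Note that the Unicode Character Database
gives COMBINING CEDILLA class 202 rather than 220, so the sketch uses U+0324
COMBINING DIAERESIS BELOW (class 220) for the equal-class case.

```python
import unicodedata

CGJ = "\u034F"


def is_combining(ch):
    # Approximation: any mark (Mn, Mc, Me) counts as a combining character.
    return unicodedata.category(ch).startswith("M")


def cgj_placement_ok(text):
    """Check every CGJ against the proposed rule: it must precede a base
    character, or a combining character whose class is strictly lower than
    the class of the character before the CGJ."""
    for i, ch in enumerate(text):
        if ch != CGJ:
            continue
        if i + 1 == len(text):
            continue                      # assumption: trailing CGJ is fine
        nxt = text[i + 1]
        if not is_combining(nxt):
            continue                      # CGJ precedes a base character
        prev_cc = unicodedata.combining(text[i - 1]) if i > 0 else 0
        if not unicodedata.combining(nxt) < prev_cc:
            return False                  # CGJ is superfluous here
    return True


RING_ABOVE, DOT_BELOW, DIAERESIS_BELOW = "\u030A", "\u0323", "\u0324"
ok_case = cgj_placement_ok("A" + RING_ABOVE + CGJ + DOT_BELOW)        # 220 < 230
bad_case = cgj_placement_ok("A" + DOT_BELOW + CGJ + RING_ABOVE)       # 230 > 220
same_case = cgj_placement_ok("A" + DOT_BELOW + CGJ + DIAERESIS_BELOW) # 220 = 220
```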

Peter Kirk

2003-08-06 20:03:34 UTC

Post by Kent Karlsson
I see no particular *technical* problem with using WJ, though. In
contrast to the suggestion of using CGJ (re. another problem)
anywhere else but at the end of a combining sequence. CGJ has
combining class 0, despite ...

Why?

Not true, as we have seen for Hebrew. It's supposed to be true, but
isn't, and the problems can't be fixed.

Post by Kent Karlsson
... and the relative order is
then considered unimportant (canonically equivalent). How is then,
e.g. <a, ring above, cgj, dot below> supposed to be different from
<a, dot below, cgj, ring above> (supposing all involved characters
are fully supported), when <a, ring above, dot below> is NOT
supposed to be much different from <a, dot below, ring above>
(them being canonically equivalent)? ...

There is no difference when the characters really do not interfere
typographically. But when they do, there is a real and, in some
languages, meaningful distinction.

Post by Kent Karlsson
...
... the only ways out seem to be to either formally deprecate
CGJ, or at least confine it to very specific uses. Other occurrences
would not be ill-formed or illegal, but would then be non-conforming.

OK, let's confine it to those specific uses where it is really needed,
e.g. to get round the problem of combining characters with different
combining classes which actually do interact typographically, and
perhaps there was another one being suggested. I have no problem with
that - as long as the list of permitted uses is not set in stone, so
that new uses can be approved when they are discovered. But there is no
good reason to object to its use in those cases where it is needed,
simply because in many other cases it is not needed.
--
Peter Kirk
***@ntlworld.com
http://web.onetel.net.uk/~peterkirk/

Kent Karlsson

2003-08-07 16:29:47 UTC

...

Post by Kent Karlsson
and "interfere typographically" are assigned the same combining
class, while those that don't get different classes, ...

Not true, as we have seen for Hebrew. It's supposed to be true, but
isn't, and the problems can't be fixed.

The combining classes for Hebrew (and Arabic) vowels are bizarre.
I have no idea how they came about. They should (ideally) probably
have been dealt with in the same way as Indic vowels.

/kent k

Kenneth Whistler

2003-08-06 20:19:34 UTC

Post by Kent Karlsson
I see no particular *technical* problem with using WJ, though. In
contrast to the suggestion of using CGJ (re. another problem)
anywhere else but at the end of a combining sequence. CGJ has
combining class 0, despite ...

Why?

See above (I DID write the motivation!).

I guess that I did not (and still do not) see the motivation for
your final statement.

Post by Kent Karlsson
Combining classes are generally
assigned according to "typographic placement". Combining characters
(except those that are really letters) that have the "same" placement,
and "interfere typographically" are assigned the same combining class,
while those that don't get different classes, and the relative order is
then considered unimportant (canonically equivalent). How is then,
e.g. <a, ring above, cgj, dot below> supposed to be different from
<a, dot below, cgj, ring above> (supposing all involved characters
are fully supported), when <a, ring above, dot below> is NOT
supposed to be much different from <a, dot below, ring above>
(them being canonically equivalent)? An invisible combining character
does not interfere typographically with anything, it being invisible!

The same thing can be said about any inserted invisible character,
combining or not.

How is: <a, ring above, null, dot below> supposed to be different from
<a, dot below, null, ring above>

How is: <a, ring above, LRM, dot below> supposed to be different from
<a, dot below, LRM, ring above>

In display, they might not be distinct, unless you were doing some kind of
show-hidden display. Yet these sequences are not canonically
equivalent, and the presence of an embedded control character or an
embedded format control character would block canonical reordering.

Of course, they *might* be distinct in rendering, depending on
what assumptions the renderer makes about default ignorable
characters and their interaction with combining character sequences.
But you cannot depend on them being distinct in display -- the
standard doesn't mandate the particulars here.

Whether you think it is *reasonable* or not that there should be
non-canonically equivalent ways of representing the same
visual display, sequences such as those above, including sequences
with CGJ, are possible and allowed by the standard. They are:

a. well-formed sequences, conformantly interpretable
b. could be displayed by reasonable renderers, making reasonable
assumptions, as visually identical

I have been pointing out use of the CGJ, which *exists* as an encoded
character, and which has a particular set of properties defined,
would result in the kinds of non-canonically equivalent ordering
distinctions required in Hebrew, if inserted into vowel sequences.
Those are facts about the current standard, as currently
defined. And unless you or someone else convinces the UTC to
establish cooccurrence constraints on CGJ or to change its
properties, they will continue to be current facts about the
standard.
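
The blocking behaviour described above can be observed with Python's standard
unicodedata module. This is a small sketch rather than a statement about any
particular renderer; the names BLOCKERS, unblocked_equal and blocked_equal are
made up for the example.

```python
import unicodedata

RING_ABOVE = "\u030A"   # combining class 230
DOT_BELOW = "\u0323"    # combining class 220

# Invisible characters with combining class 0 mentioned in the discussion.
BLOCKERS = {"CGJ": "\u034F", "NULL": "\u0000", "LRM": "\u200E"}


def nfd(s):
    return unicodedata.normalize("NFD", s)


# Without an intervening character the two orders normalize identically.
unblocked_equal = nfd("a" + RING_ABOVE + DOT_BELOW) == nfd("a" + DOT_BELOW + RING_ABOVE)

# With any class-0 character in between, the marks cannot be reordered
# past it, so the two orders stay distinct after normalization.
blocked_equal = {
    name: nfd("a" + RING_ABOVE + ch + DOT_BELOW) == nfd("a" + DOT_BELOW + ch + RING_ABOVE)
    for name, ch in BLOCKERS.items()
}
```

In this respect CGJ behaves no differently from NULL or LRM: each one keeps
the two orderings canonically distinct.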

Post by Kent Karlsson
The other invisible (per se!) combining characters with combining
class 0, the variation selectors, are ok, since their *conforming* use
is very highly constrained. Maybe I've been wrong, but I have taken
CGJ as similarly constrained as it was given a semantics only when
followed by a base character (but now it seems to have no semantics
at all).

There was no such constraint defined for CGJ. The current statement
about CGJ is merely that it should be ignored in language-sensitive
sorting and searching unless "it specifically occurs within
a tailored collation element mapping." There is no constraint
on what particular sequences involving CGJ could be tailored
that way, and hence no constraint on what particular sequences
CGJ might occur in, in Unicode plain text.

Actually, it is not non-conformant like using an unassigned
code point would be. The latter is directly subject to conformance
clause C6:

C6 A process shall not interpret an unassigned code point as an
abstract character.

The case for variation sequences is subtly different. Suppose
I encounter a variation sequence <X, VS1>, where X could be
any Unicode character. X itself is conformantly interpretable.
VS1 itself is conformantly interpretable. The constraints are
on the interpretation of the variation sequence itself. And
they consist of:

"Only the variation sequences specifically defined in the
file StandardizedVariants.txt in the Unicode Character
Database are sanctioned for standard use; in all other
cases the variation selector cannot change the visual
appearance of the preceding base character from what it
would have had in the absence of the variation selector."

In other words, you can drop VS1's to your heart's content into
plain text, but a conformant implementation should ignore all
of them, unless a) it is interpreting variation selectors, and
b) it encounters a particular sequence defined in
StandardizedVariants.txt.
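
A rough Python sketch of that behaviour follows. The function name and the
one-entry sanctioned set are illustrative assumptions only (a real
implementation would load the full StandardizedVariants.txt), and only the
selectors U+FE00..U+FE0F are considered.

```python
def interpreted_variation_sequences(text, sanctioned):
    """Return (index, sequence) for each <base, VS> pair that appears in the
    sanctioned set. Every other variation selector is simply ignored."""
    hits = []
    for i in range(1, len(text)):
        is_vs = "\uFE00" <= text[i] <= "\uFE0F"
        if is_vs and (text[i - 1], text[i]) in sanctioned:
            hits.append((i - 1, text[i - 1:i + 1]))
    return hits


# Illustrative sample entry only: <U+2229 INTERSECTION, VS1>.
SAMPLE_SANCTIONED = {("\u2229", "\uFE00")}

# A sanctioned sequence, then a VS1 dropped after a combining ring above.
sample = "\u2229\uFE00 a\u030A\uFE00"
hits = interpreted_variation_sequences(sample, SAMPLE_SANCTIONED)
```

The VS1 after the combining mark is not rejected; it just never matches, so
it has no effect on interpretation.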

The cooccurrence constraints on VS1's are constraints on the
*encoding committees* regarding what sequences they will or will
not allow into StandardizedVariants.txt (for various reasons):

"The base character in a variation sequence is never a combining
character or a decomposable character."

That means the UTC will never make such a variation sequence
interpretable by putting it into StandardizedVariants.txt.
*But*, a text user who drops a VS1 into Unicode plain text
after a combining character doesn't "commit a foul" thereby --
he has just put a character into a position that no conformant
implementation will do other than ignore on display.

Post by Kent Karlsson
Canonical reordering is scoped to stop at combining class = 0.

(I know it is. But I confess I'm not sure why.)

Because God, er...., um... Mark Davis created it that way. ;-)

Yes, for the definition of normalisation. But not necessary for
canonical equivalence. Your point?

Of course it is necessary for canonical equivalence:

D24 Canonical equivalent: Two character sequences are said to be
canonical equivalents if their full canonical decompositions
are identical.

D23 Canonical decomposition: The decomposition of a character that
results from recursively applying the canonical mappings found
in the names list of Section 16.1, Character Names List, and those
described in Section 3.12, Conjoining Jamo Behavior, until no
characters can be further decomposed, and then reordering
^^^^^^^^^^^^^^^^^^^
nonspacing marks according to Section 3.11, Canonical Ordering
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Behavior.
^^^^^^^^
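
Read literally, D24 amounts to comparing full canonical decompositions. A
minimal Python sketch, assuming that unicodedata.normalize("NFD", ...)
implements the decomposition and reordering of D23 (canonically_equivalent is
a hypothetical helper name):

```python
import unicodedata


def canonically_equivalent(a, b):
    """D24 as a comparison of full canonical decompositions (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Precomposed a-ring plus dot below versus a fully decomposed sequence in
# canonical order: the decompositions match only after reordering.
precomposed_vs_decomposed = canonically_equivalent("\u00E5\u0323", "a\u0323\u030A")

# Hangul syllable GA versus its conjoining jamo (Section 3.12 behaviour).
hangul_vs_jamo = canonically_equivalent("\uAC00", "\u1100\u1161")
```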

xy S yx, if 0 < cc(y) < cc(x) (and apply that repeatedly);
no need to define any "canonically reordering scope", though
that may be marginally more efficient in an implementation
of normalisation (but this is getting beside the topic of this
discussion).

I'm talking about "scope" here generically. I realize that
the algorithm is based on pair-based swapping, and there is
no necessity to have a formally-defined scope. The point,
however, as you recognize, is that any character with
cc=0 will limit the scope that any sequence of pair-swappings
can impact.

Post by Kent Karlsson
I don't see how introduction of CGJ into such sequences calls
any of the definitions or algorithms into question.

No, not the algorithm, but the basic idea and design. The algorithm
as such has no "idea" how or why the combining class numbers
were assigned. But we humans do, or might have.

True.

Post by Kent Karlsson
Again, why should not <a, ring above, cgj, dot below> be canonically
equivalent to <a, dot below, cgj, ring above>, when <a, ring above,
dot below> is canonically equivalent to <a, dot below, ring above>?
And I want a design answer, not a formal answer! (The latter I already
know, and is uninteresting.)

The formal answer is the true and interesting answer!

It shouldn't be canonically equivalent because it *isn't*
canonically equivalent.

But instead of obsessing about the particular case of the CGJ,
admit that the same shenanigans can apply to any number of
default ignorable characters which will not result in visually
distinct renderings under normal assumptions about rendering.

I'm detecting a deeper concern here -- that such a situation
should not be allowed in the standard at all, as a matter
of design and architecture. But as a matter of practicality,
given the complexity of text representation needs in the
Unicode Standard, I don't think you can legislate these kinds
of edge cases away entirely.

And I disagree with you, obviously. It should neither be
deprecated nor constrained from use where it may helpfully
solve a problem of text representation (in Biblical Hebrew).

--Ken

Philippe Verdy

2003-08-06 22:22:23 UTC

Post by Kent Karlsson
I see no particular *technical* problem with using WJ, though.
In contrast to the suggestion of using CGJ (re. another problem)
anywhere else but at the end of a combining sequence. CGJ has
combining class 0, despite being invisible and not ("visually")
interfering with any other combining mark. Using CGJ at a
non-final position in a combining sequence puts in doubt the
entire idea with combining classes and normal forms.

Why?

See above (I DID write the motivation!).

I guess that I did not (and still do not) see the motivation for
your final statement.

Post by Kent Karlsson
Combining classes are generally
assigned according to "typographic placement". Combining characters
(except those that are really letters) that have the "same"
placement, and "interfere typographically" are assigned the same
combining class, while those that don't get different classes, and
the relative order is then considered unimportant (canonically
equivalent). How is then,
e.g. <a, ring above, cgj, dot below> supposed to be different from
<a, dot below, cgj, ring above> (supposing all involved characters
are fully supported), when <a, ring above, dot below> is NOT
supposed to be much different from <a, dot below, ring above>
(them being canonically equivalent)? An invisible combining
character does not interfere typographically with anything, it
being invisible!

I disagree with you, using a LRM mark in the middle of a combining
sequence is conforming to canonicalization rules but is clearly
ill-formed, as well as using a NULL control in the middle, which
breaks the combining sequence.

So in your two examples above, inserting the LRM or NULL splits
a combining sequence and creates 3 ones, each with their own
properties, and the last one is ill-formed as it contains a combining
character after a control and not a base or combining character.

The proposal to use CGJ however is legal: it does not break the
combining sequences and grapheme clusters, and thus the whole
encoded sequence encoded with CGJ will be considered by
rendering engines, where CGJ is a no-op for rendering but not for
the canonical ordering where I see its only well-formed use as a
canonical ordering fix for NF* normalized forms, or before a
base character to extend the combining sequences used by
renderers or character parsers and breakers.

So your example with:
<a, dot below, LRM, ring above>
would in fact be rendered and parsed as three combining sequences:
<a, dot below>, <LRM>, <ring above>
i.e. a well-formed <a with dot below>, a control (normally invisible,
but may be edited with a visible glyph with a dotted square like in
the Unicode charts), and an ill-formed isolated <ring above> (most
probably rendered with a dotted circle).

So it cannot be thought as equivalent and not even rendered
equivalently as:
<a, dot below, ring above>
or its canonical equivalents (not in normalized order but still
conforming and well-formed, and handled equivalently):
<a, ring above, dot below>
<a with ring above, dot below>
--
Philippe.
Spams non tolérés: tout message non sollicité sera
rapporté à vos fournisseurs de services Internet.

Kent Karlsson

2003-08-07 16:38:12 UTC

...

Post by Kent Karlsson
(them being canonically equivalent)? An invisible combining
character does not interfere typographically with anything, it
being invisible!

Post by Kenneth Whistler
The same thing can be said about any inserted invisible character,
combining or not.
How is: <a, ring above, null, dot below> supposed to be different from
<a, dot below, null, ring above>

The first would be an å followed by separate dot below (under a space,
according to p. 131 of TUS 3.0). The second one is an <a, dot below>
with a separate ring above (over a space according to TUS 3.0 p. 131).

Post by Kenneth Whistler
How is: <a, ring above, LRM, dot below> supposed to be different from
<a, dot below, LRM, ring above>

As above (yea, <a, ring above, null, dot below> would look the same
as <a, ring above, LRM, dot below>; but neither of these is a single
combining sequence).

Post by Kenneth Whistler
In display, they might not be distinct, unless you were doing some
kind of show-hidden display. Yet these sequences are not canonically
equivalent, and the presence of an embedded control character or an
embedded format control character would block canonical reordering.
Of course, they *might* be distinct in rendering, depending on
what assumptions the renderer makes about default ignorable
characters and their interaction with combining character sequences.
But you cannot depend on them being distinct in display -- the
standard doesn't mandate the particulars here.

Well, it does (did?) say "should"...

Post by Kenneth Whistler
Whether you think it is *reasonable* or not that there should be
non-canonically equivalent ways of representing the same
visual display, sequences such as those above, including sequences
with CGJ, are possible and allowed by the standard. They are:
a. well-formed sequences, conformantly interpretable
b. could be displayed by reasonable renderers, making reasonable
assumptions, as visually identical
I have been pointing out use of the CGJ, which *exists* as an encoded

Regrettable!

Post by Kenneth Whistler
character, and which has a particular set of properties defined,
would result in the kinds of non-canonically equivalent ordering
distinctions required in Hebrew, if inserted into vowel sequences.

As I've mentioned, if restricted (similar to the VS restrictions) to
particular cases (like just before (or between) Hebrew (and Arabic)
vowel marks), then ok. But only because the combining classes
of the Arabic and Hebrew vowel marks are bizarre (read: wrong).

...

There was no such constraint defined for CGJ.

While perhaps not explicitly stated as a restriction, the only
*intended* use (after some suggestions had been dropped) was to be at
the *end* of a combining character sequence.

Post by Kenneth Whistler
The current statement
about CGJ is merely that it should be ignored in language-sensitive
sorting and searching unless "it specifically occurs within
a tailored collation element mapping." There is no constraint
on what particular sequences involving CGJ could be tailored
that way, and hence no constraint on what particular sequences
CGJ might occur in, in Unicode plain text.

Post by Kenneth Whistler
A combining character sequence is a base character followed
by any number of combining characters. There is no constraint
in that definition that the combining characters have to
have non-zero combining class.

Actually, it is not non-conformant like using an unassigned
code point would be. The latter is directly subject to conformance
clause C6:
C6 A process shall not interpret an unassigned code point as an
abstract character.
The case for variation sequences is subtly different. Suppose
I encounter a variation sequence <X, VS1>, where X could be
any Unicode character. X itself is conformantly interpretable.
VS1 itself is conformantly interpretable. The constraints are
on the interpretation of the variation sequence itself. And
they consist of:
"Only the variation sequences specifically defined in the
file StandardizedVariants.txt in the Unicode Character
Database are sanctioned for standard use; in all other
cases the variation selector cannot change the visual
appearance of the preceding base character from what it
would have had in the absence of the variation selector."
In other words, you can drop VS1's to your heart's content into
plain text, but a conformant implementation should ignore all
of them, unless a) it is interpreting variation selectors, and
b) it encounters a particular sequence defined in
StandardizedVariants.txt.

But since they too have combining class 0, inserting them
*between* combining characters (of non-zero combining class)
will cause a normalisation issue (not a technical problem,
but a principles problem).

Post by Kenneth Whistler
The cooccurrence constraints on VS1's are constraints on the
*encoding committees* regarding what sequences they will or will
not allow into StandardizedVariants.txt (for various reasons):
"The base character in a variation sequence is never a combining
character or a decomposable character."
That means the UTC will never make such a variation sequence
interpretable by putting it into StandardizedVariants.txt.

Ideally the VSes should have gotten a low non-zero combining class...
(e.g. 1)

Post by Kenneth Whistler
*But*, a text user who drops a VS1 into Unicode plain text
after a combining character doesn't "commit a foul" thereby --
he has just put a character into a position that no conformant
implementation will do other than ignore on display.

But it does mess up (hinder) the canonical reordering that
maybe *should* have taken place! They should be constrained
to occur just after a base character (to make up for the design
flaw of them getting combining class 0).

Post by Kenneth Whistler
Canonical reordering is scoped to stop at combining class = 0.

(I know it is. But I confess I'm not sure why.)

Because God, er...., um... Mark Davis created it that way. ;-)

Eeh, not really the answer I expected. This particular behaviour makes
(marginal!) sense for *enclosing* (and that means something visually...)
combining characters. I'm not so sure it makes sense to have this
particular behaviour for any other combining character (like combining
vowels, or recent flurry of invisible combining characters).

Post by Kent Karlsson
Yes, for the definition of normalisation. But not necessary for
canonical equivalence. Your point?

D24 Canonical equivalent: Two character sequences are said to be

...

That's one way of defining canonical equivalence. There are
equivalent(!) ways, not going via NFD normal forms. However, I wasn't
really going so far. I was just saying that you determine if XxyY is
canonically equivalent to XyxY or not by just looking at the combining
classes of the characters x and y. You need not compute the NFD forms
of XxyY and XyxY before making that determination. (This is a rather
immediate consequence of
an alternate, but equivalent, definition of canonical equivalence.)
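
The local test can be sketched in Python and cross-checked against the NFD
comparison. This is a sketch that assumes x and y are single characters with
no canonical decomposition of their own; swap_preserves_equivalence is a
hypothetical name, and the context "a" ... "b" stands in for X and Y.

```python
import unicodedata


def swap_preserves_equivalence(x, y):
    """Decide whether XxyY and XyxY are canonically equivalent from the
    combining classes alone (x, y assumed not to decompose further)."""
    cx, cy = unicodedata.combining(x), unicodedata.combining(y)
    return x == y or (cx != 0 and cy != 0 and cx != cy)


def nfd_equal(a, b):
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Ring above (230), dot below (220), acute (230), CGJ (0).
marks = ["\u030A", "\u0323", "\u0301", "\u034F"]

# Compare the local test with the NFD-based definition for every pair.
agrees = all(
    swap_preserves_equivalence(x, y) == nfd_equal("a" + x + y + "b", "a" + y + x + "b")
    for x in marks
    for y in marks
)
```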

...

Post by Kenneth Whistler
I don't see how introduction of CGJ into such sequences calls
any of the definitions or algorithms into question.

No, not the algorithm, but the basic idea and design. The algorithm
as such has no "idea" how or why the combining class numbers
were assigned. But we humans do, or might have.

True.

Which is one of my points!

The formal answer is the true and interesting answer!
It shouldn't be canonically equivalent because it *isn't*
canonically equivalent.

That's just a stability answer. It does not say why CGJ was given
(mistakenly, I'd say) combining class 0 in the first place.

Post by Kenneth Whistler
But instead of obsessing about the particular case of the CGJ,
admit that the same shenanigans can apply to any number of
default ignorable characters which will not result in visually
distinct renderings under normal assumptions about rendering.

No, this particular problem applies only to combining characters
of class 0 that are invisible, since they betray the very idea of
canonical reordering.

Post by Kenneth Whistler
I'm detecting a deeper concern here -- that such a situation
should not be allowed in the standard at all, as a matter
of design and architecture. But as a matter of practicality,
given the complexity of text representation needs in the
Unicode Standard, I don't think you can legislate these kinds
of edge cases away entirely.

Again, this particular problem applies only to combining
characters of class 0 that are invisible. Yes, there are other
cases which are, and should be, non-equivalent, but should
look the same (except when doing "show invisibles").

non-conforming.

Post by Kenneth Whistler
And I disagree with you, obviously. It should neither be
deprecated nor constrained from use where it may helpfully
solve a problem of text representation (in Biblical Hebrew).

Emphasis: "where it may helpfully solve a problem of text
representation (in Biblical Hebrew)". There we can agree,
even though I don't find that particular hack to be the
best solution. But if constrained *to* just before Hebrew
(and Arabic?) vowels (or at the end of a combining sequence),
ok. (Which I have said before.)

/kent k

Post by Kenneth Whistler
--Ken

P***@sil.org

2003-08-07 20:53:26 UTC

What I think is different here, Ken, is that a suggestion has been made
that CGJ be recommended for use within a combining sequence in order to
maintain a distinction for Biblical Hebrew, which it does by virtue of its
property of blocking canonical reordering. No other default ignorable has
ever been specifically given this function. In introducing this function
for a particular character (CGJ, in this case), the issue really arises for
the first time. And I don't think it's insignificant: surely there will be
implementers out there wondering what the implications are with a
canonical-reordering blocker that can be inserted into sequences creating a
distinction where none previously existed -- and where none was ever
desired. (I think I mentioned this issue shortly after the CGJ suggestion
was first raised.)

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485

Kenneth Whistler

2003-08-06 23:13:21 UTC

Post by Kenneth Whistler
The same thing can be said about any inserted invisible character,
combining or not.
How is: <a, ring above, null, dot below> supposed to be different from
<a, dot below, null, ring above>
How is: <a, ring above, LRM, dot below> supposed to be different from
<a, dot below, LRM, ring above>
In display, they might not be distinct, unless you were doing some
kind of show-hidden display. Yet these sequences are not canonically
equivalent, and the presence of an embedded control character or an
embedded format control character would block canonical reordering.

I disagree with you: using an LRM in the middle of a combining
sequence conforms to the canonicalization rules but is clearly
ill-formed,

It is not. TUS 4.0, p. 71:

D17a Defective combining character sequence: A combining character
sequence that does not start with a base character.

* Defective combining character sequences occur when a sequence
of combining characters appears at the start of a string or
follows a control or format character. Such sequences are
defective from the point of view of handling of combining
marks, but are not ill-formed.
^^^^^^^^^^^^^^^^^^^^^^

Post by Philippe Verdy
as well as using a NULL control in the middle, which
breaks the combining sequence.

I'm not claiming it doesn't break the combining sequence. Of
course it does. It creates a defective combining character
sequence, and that poses a challenge for rendering, since it
departs from the usual expectations for normal combining
character sequences. The renderer has to split hairs between
the fact that it is dealing with a defective combining
character sequence and the fact that it is dealing with a
default ignorable character which is supposed to be ignored
for text processes it is not immediately applicable to.

But I challenge you to find anything in the standard that
*prohibits* such sequences from occurring.

And *if* they occur, they are not canonically equivalent, which
was the point I was making to Kent.
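
The non-equivalence can be checked with a short Python sketch. It assumes the standard library's unicodedata module and treats "same NFD form" as the test for canonical equivalence; the helper name canonically_equivalent is made up for illustration.

```python
import unicodedata

RING_ABOVE = "\u030A"  # COMBINING RING ABOVE, combining class 230
DOT_BELOW = "\u0323"   # COMBINING DOT BELOW, combining class 220
LRM = "\u200E"         # LEFT-TO-RIGHT MARK, combining class 0
NUL = "\u0000"         # NULL, combining class 0


def canonically_equivalent(x: str, y: str) -> bool:
    """Hypothetical helper: compare the NFD forms of two strings."""
    return unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)


# Adjacent marks with different non-zero classes are reordered,
# so the two spellings come out equivalent.
plain = canonically_equivalent("a" + RING_ABOVE + DOT_BELOW,
                               "a" + DOT_BELOW + RING_ABOVE)

# A class-0 character between the marks stops the reordering,
# so the two spellings stay distinct.
with_lrm = canonically_equivalent("a" + RING_ABOVE + LRM + DOT_BELOW,
                                  "a" + DOT_BELOW + LRM + RING_ABOVE)
with_nul = canonically_equivalent("a" + RING_ABOVE + NUL + DOT_BELOW,
                                  "a" + DOT_BELOW + NUL + RING_ABOVE)

print(plain, with_lrm, with_nul)
```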

Post by Philippe Verdy
The proposal to use CGJ however is legal: it does not break the
combining sequences and grapheme clusters, and thus the whole
encoded sequence encoded with CGJ will be considered by
rendering engines, where CGJ is a no-op for rendering but not for
the canonical ordering ...

Well, yes, which is why I have been advocating it as the
solution to the Biblical Hebrew text representation problem.
I agree with you about that. But it need not be characterized
as "legal" in opposition to the other examples I cited above.
All of these sequences are "legal" and allowed by the
standard.
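
As a rough illustration of why CGJ helps here, the following Python sketch (again assuming the standard unicodedata module) shows normalization reordering two Hebrew vowel points unless a CGJ sits between them. The letter and the two points are chosen only for illustration; they are not taken from the proposal itself.

```python
import unicodedata

LAMED = "\u05DC"  # HEBREW LETTER LAMED, combining class 0
PATAH = "\u05B7"  # HEBREW POINT PATAH, combining class 17
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ, combining class 14
CGJ = "\u034F"    # COMBINING GRAPHEME JOINER, combining class 0


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


without_cgj = LAMED + PATAH + HIRIQ
with_cgj = LAMED + PATAH + CGJ + HIRIQ

# Without CGJ the points are sorted by combining class: hiriq moves first.
order_lost = nfd(without_cgj) == LAMED + HIRIQ + PATAH

# With CGJ between them, the original order survives normalization.
order_kept = nfd(with_cgj) == with_cgj

print(order_lost, order_kept)
```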

--Ken

Doug Ewell

2003-08-07 00:40:41 UTC

Post by Kenneth Whistler
But I challenge you to find anything in the standard that
*prohibits* such sequences from occurring.

I've learned that this question of "illegal" or "invalid" character
sequences is one of the main distinguishing factors between those who
truly understand Unicode and those who are still on the Road to
Enlightenment.

Very, very few sequences of Unicode characters are truly "invalid" or
"illegal." Unpaired surrogates are a rare exception.

In almost all cases, a given sequence might give unexpected results
(e.g. putting a combining diacritic before the base character) or might
be ineffectual (e.g. putting a variation selector before an arbitrary
character), but it is still perfectly legal to encode and exchange such
a sequence.
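
A small Python sketch of the distinction, assuming that Python's strict UTF-8 encoder is an acceptable stand-in for "can be encoded and exchanged"; the function name is hypothetical.

```python
def encodes_as_utf8(s: str) -> bool:
    """Hypothetical helper: True if the strict UTF-8 encoder accepts s."""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


mark_first = "\u0301" + "a"      # combining acute before its base: odd but legal
selector_first = "\uFE00" + "a"  # variation selector before a letter: ineffectual but legal
lone_surrogate = "\uD800"        # unpaired surrogate: rejected by the encoder

for sample in (mark_first, selector_first, lone_surrogate):
    print(ascii(sample), encodes_as_utf8(sample))
```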

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

Philippe Verdy

2003-08-07 14:27:35 UTC

Post by Kenneth Whistler
But I challenge you to find anything in the standard that
*prohibits* such sequences from occurring.

For Unicode itself this is true, but what users want is interoperability
of the encoded text with accurate rendering rules.
In practice, this means that any undefined or unpredictable behavior
will mean lack of interoperability and should not be used.

The standard should then highly promote what is a /valid/ encoding
for text with regard to interoperability for all text processing algorithms
including parsing combining sequences, collation, and computing
character properties from those /valid/ encoded sequences.

We don't have to care much if some encoded text considered valid
under Unicode/ISO-IEC10646 is rendered or processed differently
or unpredictably, provided that this does not affect common text for
actual languages.

In fact the standard specifies that ALL sequences made of code
points in U+0000 to U+10FFFF (excluding U+xFFFE, U+xFFFF
and surrogates in U+D800 to U+DFFF) are valid under ISO/IEC
10646, but it does not attempt to assign properties or behavior to
ALL of these characters or encoded sequences, as this is the job
of Unicode to specify this behavior.

If there's something to enhance in the Unicode standard (not in the
ISO/IEC 10646), it's exactly the specification of interoperable encoded
sequences. This certainly means that concrete examples for actual
languages must be documented. Just assigning properties to individual
ISO/IEC 10646 characters is not enough, and Unicode should
concentrate more efforts in the actual encoding of text and not only on
individual characters.

So for me, the "validity" of text is an ISO/IEC 10646 concept (shared
now with Unicode versions for the assignment of characters in the
repertoire), related only to the legally usable code points, and Unicode
speaks about "well-formed" or "ill-formed" sequences, or about
"normalized" sequences and transformations that preserve the actual
text semantics.

There is no ambiguity in ISO/IEC 10646 for the character assignments.
But composed sequences are the real problem, for which Unicode
must seek agreements: the W3C character model is only based on
the simplified combining sequences, but Unicode should go further
with much more precise rules for the encoding of actual text, even
before any attempt to describe other transformation algorithms (only
the NF* transformations have for now a stability policy, but actual
text writers need also stability for the text composition rules for
actual languages).

We certainly don't need more assigned code points for existing
scripts. But more rules for the actual representation of text using
these scripts, and how distinct scripts can interact and be mixed.
There are some rules already specified for Combining jamos, or
combining Latin/Cyrillic/Greek alphabets, or for Hiragana/Katakana,
but we are still far from an agreement for Hebrew, and even for some
Han composed sequences, which still lack a specification needed
for interoperability.

The current wording of "Unicode validity" is for me very weak, and
probably defective. What it designates is only an ISO/IEC 10646 validity
for used code points, and the validity of their UTF* transformations,
based on individual code points. The kind of validity rules users
want with Unicode is a conformance of the actually encoded scripts
for actual languages, for interoperability and data exchange.

The fact that Unicode was born from trying to maximize the roundtrip
convertibility with legacy codepages or encoded character sets has
introduced many difficulties: first the base+combining characters
model was introduced as fundamental for alphabetized scripts with
separate letters for vowels. Then there's the case of Brahmic scripts
which complicates things, as Unicode has chosen to support both
the ISCII standard model with nuktas and viramas in logical encoding
order, and the TIS620 model for Thai and Lao with a physical model.
By contrast, the combining jamos model is remarkably simple,
and it still follows the logical model shared by alphabetized scripts.

Looking now at the difficulties of encoding Tengwar reveals most of
the difficulties that already exist for Thai, and now Hebrew, and subtle
artefacts needed in existing scripts used to transliterate
foreign languages. Some of these difficulties are also affecting now
the general alphabetized scripts (Latin notably), showing that the
immutable model used to encode base letters and diacritics is not
universal. So Unicode will need to extend and specify much more its
own character model to support more scripts and languages, including
in the case of transliterations.

Maybe in the future, this will lead to defining a new level of conformance
by defining something that is more precise than just some basic
canonical equivalence rules (for NF* transforms and XML), with more
precise definitions of "ill-formed" or "defective" sequences (I confess
that I do not understand the need to differentiate both concepts, and
this current separation is really more confusing than helpful to
understand the Unicode standard). What this means, is that we need
something saying "Unicode valid text" and not just "Unicode encoded
text" which just relates to the shared assignment of code points to
individual characters. The current "valid" term should be left to the
ISO/IEC 10646 standard, and to the very few Unicode algorithms
that handle only individual code points (such as UTF* encoding
forms and schemes), but its current definition is not helping
implementers and writers to produce interoperable textual data.

If the term "valid" cannot be changed, then I suggest defining
"conforming" for encoded text independently of its validity (a
"conforming text" would still need to use a "valid encoding").
--
Philippe.
Spam not tolerated: any unsolicited message will be
reported to your Internet service providers.

Jony Rosenne

2003-08-07 17:29:09 UTC

We need an official Unicode Lint.

Jony

Peter Kirk

2003-08-07 19:34:09 UTC

Post by Kenneth Whistler
But I challenge you to find anything in the standard that
*prohibits* such sequences from occurring.

If the term "valid" cannot be changed, then I suggest defining
"conforming" for encoded text independently of its validity (a
"conforming text" would still need to use a "valid encoding").

As a very quick thought, maybe what we need is not restrictions to the
Unicode standard but a set of rules for each language or group of
languages, defining exactly how Unicode characters should be used to
write the words etc of that language. Such definitions might be
independent of the actual Unicode standard.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

John Cowan

2003-08-07 01:22:18 UTC

Post by Kenneth Whistler
D17a Defective combining character sequence: A combining character
sequence that does not start with a base character.
* Defective combining character sequences occur when a sequence
of combining characters appears at the start of a string or
follows a control or format character. Such sequences are
defective from the point of view of handling of combining
marks, but are not ill-formed.
^^^^^^^^^^^^^^^^^^^^^^

What, if anything, does the term "ill-formed" mean when attached to
a sequence of characters? I understood that every sequence of
characters whatsoever is permitted.
--
John Cowan  ***@reutershealth.com
http://www.ccil.org/~cowan  http://reutershealth.com
"But the next day there came no dawn, and the Grey Company passed on into
the darkness of the Storm of Mordor and were lost to mortal sight; but the
Dead followed them." --"The Passing of the Grey Company"

Philippe Verdy

2003-08-06 23:41:41 UTC

On Thursday, August 07, 2003 1:13 AM, Kenneth Whistler

Post by Kenneth Whistler
Well, yes, which is why I have been advocating it as the
solution to the Biblical Hebrew text representation problem.
I agree with you about that. But it need not be characterized
as "legal" in opposition to the other examples I cited above.
All of these sequences are "legal" and allowed by the
standard.

Once again sorry if I used the terms "ill-formed" or "well-formed"
instead of "defective" or "non defective" (normal?). Such distinction
in the standard does not help its understanding when discussing
about interoperability of text processing where neither ill-formed
nor defective sequences should be used if interoperability is the
main focus (and also normally the design focus for Unicode).

The canonical equivalences (NFC, NFD, canonical ordering) are
needed now for XML processing and in fact it greatly reduces
the number of ill-formed, invalid, or defective sequences or
whatever bad encoding of actual text, to simplify its processing.
Still these equivalences don't solve all the issues and create their
own (and this is now a good reason to use CGJ to override the
canonical ordering of combining diacritics).

Of course there may be a lot of strings created with Unicode
which are not "ill-formed" and not canonically equivalent (per
NFC, NFD, canonical ordering), but I won't enter in that zone.
For XML what is relevant is that it processes strings in NFC
form and thus implies only canonical equivalences, but XML
will still process "defective" sequences by correctly
processing characters per its canonical combining sequences.

I'd like to see a more formal rule for defective uses of CGJ used
to fix canonical ordering. What I suggested was to specify that
only some sequences with CGJ would be "non defective", if
the CGJ appears before a base character or between two
combining characters. The character model needs then to be
refined to be more precise to document which uses are
considered non defective, and which ones are not.

So a sequence <..., ring above, CGJ, cedilla, ...> would
not be defective as it fixes the canonical ordering, even if
in this case it does not interact graphically (note that this
statement supposes that the cedilla effectively appears
below, something which is wrong with some languages,
where the cedilla appears in fact like an acute accent
above right...).

The example of the effective rendering of diacritics at the
presupposed placement indicated by their combining class
is significant: it shows that combining classes just handle
some common placement rules, but not every case, and
a particular language or renderer may need to place
diacritics on other positions, in which case the canonical
ordering would have an impact on the renderer. That's a
good enough reason to justify and document the use of
CGJ as a combining class override for diacritics, whose
usage should be restricted for interoperability.

This has a consequence for input methods and editors:
users can type base characters and diacritics, and the
editor will, by default, use a canonical ordering, that the user
may fix if needed for a particular language with a control
command that would "swap" two misplaced diacritics by
automatically inserting a CGJ only if needed because both
diacritics have distinct combining classes: this editor control
command would have no other effect if executed after two
diacritics with identical combining classes, or after a single diacritic,
and the editor should make its best effort not to let the user
enter ill-formed or defective sequences.
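
The editor command described above might be sketched in Python as follows. This is only one possible reading of the suggestion, not an existing editor feature: the function name is hypothetical, it looks only at the last two characters of the text, and it ignores marks whose combining class is 0.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0


def swap_last_two_marks(text: str) -> str:
    """Hypothetical 'swap' command for the two diacritics just typed."""
    if len(text) < 2:
        return text
    first, second = text[-2], text[-1]
    c1 = unicodedata.combining(first)
    c2 = unicodedata.combining(second)
    if c1 == 0 or c2 == 0:
        # Fewer than two reorderable diacritics: the command does nothing.
        return text
    head = text[:-2]
    if c2 > c1:
        # The swapped order would be undone by canonical reordering,
        # so a CGJ is needed to hold it in place.
        return head + second + CGJ + first
    # Same class, or the swap lands in canonical order: no CGJ needed.
    return head + second + first


swapped = swap_last_two_marks("\u05DC\u05B4\u05B7")  # lamed, hiriq, patah
print([f"U+{ord(c):04X}" for c in swapped])
```

With this reading, the CGJ appears only when normalization would otherwise reverse the user's correction.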
--
Philippe.
Spam not tolerated: any unsolicited message will be
reported to your Internet service providers.

Kenneth Whistler

2003-08-07 02:41:48 UTC

What, if anything, does the term "ill-formed" mean when attached to
a sequence of characters?

Nothing, really. The bullet goes on to point to the definition
(D30) of "ill-formed", which applies to code *unit* sequences in
the context of the encoding forms.

The rewrite of Chapter 3 of the Unicode Standard dispensed with
the ill-advised ;-) and confusing distinction between "illegal",
"irregular", and "ill-formed" "code value sequences" in the
context of the discussion of "transformations", in favor of
a much starker and simpler distinction:

a code unit sequence is either well-formed or it is not
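
To make the code unit level concrete, here is a minimal Python sketch. It assumes Python's strict UTF-8 decoder as the judge of well-formedness, and the function name is made up for illustration.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A combining mark with no base before it: a defective combining
# character sequence, yet a well-formed code unit sequence.
defective_but_well_formed = ("\u030A" + "a").encode("utf-8")

overlong_nul = b"\xC0\x80"           # overlong encoding of U+0000
encoded_surrogate = b"\xED\xA0\x80"  # UTF-8-style bytes for U+D800
truncated = b"\xE2\x82"              # three-byte sequence cut short

for sample in (defective_but_well_formed, overlong_nul,
               encoded_surrogate, truncated):
    print(sample, is_well_formed_utf8(sample))
```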

Post by John Cowan
I understood that every sequence of
characters whatsoever is permitted.

As regards code *point* sequences, these sequences can either
be conformant to the standard or not conformant to the standard.
They are conformant if they meet the conformance requirements
(the "C" clauses of Chapter 3). And as regards sequences of
characters that basically comes down to not trying to
interchange reserved or noncharacter code points. So if you
include a reserved (unassigned) code point (for a particular version
of the Unicode Standard) in an interchanged data stream,
a recipient could claim that data stream is not conformant
to (that version of) the standard. Shorthand: the data contains
"illegal" characters. But even that is relative to the version
of the standard, since a recipient of reserved code points is
obliged to preserve their values -- they may, after all, be
"legal" assigned code points in a future version of the
standard that that particular implementation is not supporting.

So, yeah, basically every sequence of code points "assigned to
abstract characters" is "legal" for interchange. What you cannot
interchange are code points with gc=Cs (U+D800..U+DFFF) or
code points with gc=Cn (noncharacters and reserved).
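
A rough Python check along these lines, assuming the standard unicodedata module. Which code points count as unassigned depends on the Unicode version bundled with the interpreter, which mirrors the point that the answer is relative to a version of the standard. The function name is hypothetical.

```python
import unicodedata


def not_for_interchange(s: str) -> list[str]:
    """Hypothetical helper: list code points with gc=Cs or gc=Cn."""
    return [f"U+{ord(ch):04X}" for ch in s
            if unicodedata.category(ch) in ("Cs", "Cn")]


print(unicodedata.unidata_version)
print(not_for_interchange("a\u0301"))        # ordinary assigned characters
print(not_for_interchange("a\uD800\uFFFE"))  # a surrogate and a noncharacter
```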

What D17a is trying to tell people is that while certain sequences
of Unicode characters may be "defective" from the point of
view of certain kinds of processing -- in this case rendering
of combining character sequences -- that does not make them
ill-formed (for which see the specification of encoding forms),
nor does it make them nonconformant to the standard.
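
A simplified Python sketch of spotting defective combining character sequences in the sense of D17a. It assumes that "base character" can be approximated as any letter, number, punctuation, symbol or space separator, and that any character with General Category M counts as combining; the function names are invented for illustration.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    return unicodedata.category(ch).startswith("M")


def is_base(ch: str) -> bool:
    # Approximation: graphic characters that are not combining marks.
    cat = unicodedata.category(ch)
    return cat[0] in "LNPS" or cat == "Zs"


def defective_positions(s: str) -> list[int]:
    """Index of the first mark in each run of marks that lacks a base."""
    positions = []
    in_sequence = False  # True while the previous character was a base or a mark
    for i, ch in enumerate(s):
        if is_mark(ch):
            if not in_sequence:
                positions.append(i)
            in_sequence = True
        else:
            in_sequence = is_base(ch)
    return positions


print(defective_positions("a\u030A\u0323"))        # normal sequence
print(defective_positions("\u0301a"))              # mark at start of string
print(defective_positions("a\u030A\u200E\u0323"))  # mark after LRM (gc=Cf)
print(defective_positions("a\u030A\u034F\u0323"))  # CGJ is itself gc=Mn
```

None of the flagged strings is ill-formed or nonconformant; the check only reports where a renderer would meet a mark without a base.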

There are many sequences of Unicode characters that we could
dream up which would be abominable, distasteful, problematical,
defective, implementation-busting, or just plain screwy,
but the standard itself isn't prohibiting people from
conformantly creating such sequences and then challenging
Microsoft or anybody else to display them without
blowing a gasket.

One of the reasons why we have to be so incredibly careful now
before introducing conceptually new *types* of characters,
like the COMBINING GRAPHEME JOINER or such things as
INVISIBLE BASE CHARACTER or COMBINING CLASS CHANGER or whatnot,
is precisely that it gets harder and harder to program
defensively against all the possible combinations and interactions
that such beasties might have when mixed with everything else
that is available.

--Ken

Kent Karlsson

2003-08-07 20:53:04 UTC

Post by Kenneth Whistler
Canonical reordering is scoped to stop at combining class = 0.

(I know it is. But I confess I'm not sure why.)

Because God, er...., um... Mark Davis created it that way. ;-)

Eeh, not really the answer I expected. This particular behaviour makes
(marginal!) sense for *enclosing* (and that means something

Clarification: marginal since:
. There seems to be little point in putting
some kind of diacritic outside of an enclosing mark.
. Trying to put a diacritic outside of a double diacritic
is promptly considered to be inside of the double
diacritic (see fig. 3-2 in TUS 3.0 & 4.0). I don't think there
would have been any loss in doing the same for
diacritic marks with enclosing marks.

It may also be doubtful to use combining class 0 for Indic
vowels. E.g. is there any point in distinguishing <consonant,
nukta, dep. vowel> and <consonant, dep. vowel, nukta>?
Or in distinguishing <consonant, anusvara, dep. vowel> and
<consonant, dep. vowel, anusvara>? (Recall that Indic syllables
often go through quite a lot of shaping, including far reaching
glyph moves, unspecified by Unicode itself.) This can be
mitigated by defining a "syllable syntax", for the benefit of
those of us who don't find that obvious. Ok, the position of
nuktas is nearly given, since there are precomposed characters
with nukta. Not obvious how it interacts with conjunct formation
though. Maybe there is sense in distinguishing the examples
above (if so, why?).

As I've mentioned, assigning class 0 to invisible combining
characters, I find to be a major mistake. Can't be changed now,
but should be mitigated somehow.
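
The scoping rule under discussion can be sketched in Python as a stable sort of each run of non-zero combining classes, with every class-0 character acting as a barrier. This is a simplified illustration that assumes the input is already fully decomposed; the function name is hypothetical, and real code would call unicodedata.normalize instead.

```python
import unicodedata


def canonical_reorder(s: str) -> str:
    """Sketch of canonical ordering for an already decomposed string."""
    out: list[str] = []
    run: list[str] = []  # pending marks with non-zero combining class
    for ch in s:
        if unicodedata.combining(ch) == 0:
            # Class 0 ends the run: sort what was collected, then emit ch.
            out.extend(sorted(run, key=unicodedata.combining))  # stable sort
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


samples = [
    "a\u030A\u0323",        # ring above (230), dot below (220): reordered
    "a\u0301\u20DD\u0323",  # enclosing circle (0) separates the two marks
    "\u0915\u093C\u093F",   # ka, nukta (7), vowel sign i (0)
    "\u0915\u093F\u093C",   # ka, vowel sign i (0), nukta (7)
]
for sample in samples:
    print(ascii(sample), "->", ascii(canonical_reorder(sample)))
```

Because the Devanagari vowel sign has class 0, the last two samples stay distinct, which is the situation questioned above.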

/kent k

Kenneth Whistler

2003-08-07 23:13:09 UTC

Post by Kenneth Whistler
But I challenge you to find anything in the standard that
*prohibits* such sequences from occurring.

If the term "valid" cannot be changed, then I suggest defining
"conforming" for encoded text independently of its validity (a
"conforming text" would still need to use a "valid encoding").

I emphatically agree with Peter on this.

The impulse to get the Unicode Standard to head down the road
to becoming the "spelling standard" for all languages of the
world has to be constrained, simply because there is not the
expertise or the bandwidth in the UTC to accomplish this and
because it isn't the business of the UTC in the first place.

This is the kind of task which *must* be distributed to the
relevant stakeholders around the world, wherever they may
be and however their relevant jurisdictions are defined and
constituted.

The establishment of orthographic rules for a particular language in
the context of the Unicode Standard means transferring the notion
of what the printed conventions for that language are -- whatever
they may be -- into a determination of exactly which Unicode
characters are to be used to represent those conventions,
including any constraints on cooccurrence with particular
format control characters, and so on.

The scope of the task of defining rendering rules in the
Unicode Standard is generic to script behavior -- establishing
the general rules of the road, as it were, for how the
scripts behave in the encoding, so that people and implementations
have a determinate sense of what order characters should be
in, what it means for combining characters to "combine" with
base characters, how format control characters may impact
script rendering generically, and so on. But beyond that, one
is getting into the realm of orthographic rules for particular
languages or jurisdictions and the realm of typographic
conventions for particular styles and regions. Making those
determinations belongs to the stakeholders themselves: ministries,
academies, associations, type designers, whoever.

It is precisely because the developers of the Unicode Standard
cannot foresee all possible orthographic conventions and
uses to which the standard may be put in representing text
that it is deliberately permissive: essentially any sequence
of characters is "legal", and it is up to the users of
the standard to determine, for them, what is a *sensible*
sequence of characters for their multitudinous purposes.

--Ken


Youtie Effaight

2003-08-07 23:39:52 UTC

Post by Jony Rosenne
We need an official Unicode Lint.
Jony

Lint? Oh, do you mean a program that polices mail list replies and
eliminates those snappy one-line replies that quote over 150 lines of the
original message for no particular purpose?

Yer ol' pal,
Youtie Effaight


Kenneth Whistler

2003-08-07 23:49:00 UTC

Post by Kenneth Whistler
How is: <a, ring above, null, dot below> supposed to be different from
<a, dot below, null, ring above>

Dunno what you are talking about. TUS 3.0 p. 131 is part of the
table of line boundary rules, and has nothing to do with this
display issue.

I suspect this was a typo for p. 1*2*1, where you find the sentence:

"Defective combining character sequences should be rendered as if they
had a space as a base character."

But if that is your intent, then you have to take that in the
context of the beginning of the paragraph as well, which starts:

"In a degenerate case, a nonspacing mark occurs as the first character
in the text or is separated from its base character by a line
separator, paragraph separator, or other formatting character
that causes a positional separation. ..."
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the case I have cited above, insertion of a null is not an
instance of an insertion of a formatting character that causes
a positional separation, and you are right back to the kind of
rendering conundrum as I originally stated it: does the renderer
focus on the defectiveness of the combining sequence and separate
off the second combining mark (as you suggest) or does it
focus on the ignorability of the inserted character and
render the entire sequence, ignoring any display effect of the
inserted ignorable character.

I don't believe that your assertion about how that sequence
"would" be rendered can be taken as a mandate by the standard,
at all.
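
The normalization side of this example can be checked with a small sketch (Python's standard unicodedata module; it says nothing about what a renderer does, which is the disputed part). Without the null, the two orders of marks are canonically equivalent; with a null in between, reordering stops at it and the strings stay distinct:

```python
import unicodedata

RING_ABOVE = "\u030A"  # COMBINING RING ABOVE (combining class 230)
DOT_BELOW = "\u0323"   # COMBINING DOT BELOW (combining class 220)
NULL = "\u0000"        # combining class 0, so it blocks reordering


def nfd(s: str) -> str:
    """Canonical decomposition followed by canonical reordering."""
    return unicodedata.normalize("NFD", s)


plain_1 = "a" + RING_ABOVE + DOT_BELOW
plain_2 = "a" + DOT_BELOW + RING_ABOVE
with_null_1 = "a" + RING_ABOVE + NULL + DOT_BELOW
with_null_2 = "a" + DOT_BELOW + NULL + RING_ABOVE

print(nfd(plain_1) == nfd(plain_2))          # marks are reordered by class
print(nfd(with_null_1) == nfd(with_null_2))  # the null keeps them apart
```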

--Ken


Kent Karlsson

2003-08-08 11:42:46 UTC

Yes, sorry.

Post by Kenneth Whistler
"Defective combining character sequences should be rendered as if they
had a space as a base character."
But if that is your intent, then you have to take that in the
"In a degenerate case, a nonspacing mark occurs as the first character
in the text or is separated from its base character by a line
separator, paragraph separator, or other formatting character
that causes a positional separation. ..."
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In the case I have cited above, insertion of a null is not an
instance of an insertion of a formatting character that causes
a positional separation,

???

The insertion of ANY formatting (or other control) character
causes a "positional separation" (in that they break the combining
sequence, if inserted into one)!

Post by Kenneth Whistler
and you are right back to the kind of
rendering conundrum as I originally stated it: does the renderer
focus on the defectiveness of the combining sequence and separate
off the second combining mark (as you suggest) or does it
focus on the ignorability of the inserted character and
render the entire sequence, ignoring any display effect of the
inserted ignorable character.

Now are many formatting (or other control) characters suddenly
*effectively* allowed INSIDE combining sequences via some hitherto
well hidden back door?! I do hope not! But if they are, they should be
*explicitly* allowed (I think allowing that would be a very bad idea,
however). It's not the case now, see TUS3, D17a, which (correctly)
says nothing about "causing positional separation" (whatever that
would be).

/kent k

Post by Kenneth Whistler
I don't believe that your assertion about how that sequence
"would" be rendered can be taken as a mandate by the standard,
at all.
--Ken

Kenneth Whistler

2003-08-08 00:21:25 UTC

An anonymous wag who picks the nits even finer than I did
wishes the following clarification to be posted regarding
an assertion I made about what Unicode code points are
interchangeable. ;-)

------------- Begin Forwarded Message -------------

Post by Kenneth Whistler
So, yeah, basically every sequence of code points "assigned to
abstract characters" is "legal" for interchange. What you cannot
interchange are code points with gc=Cs (U+D800..U+DFFF) or
code points with gc=Cn (noncharacters and reserved).

You *can* interchange reserved characters. You *should* not originate
them, but if you are passed a string with them, you should preserve
them, and pass them on. And in most circumstances you can depend on
them being preserved. For noncharacters you can interchange, but
should not depend on them being preserved.

You *can* also interchange Cs characters; just not within conformant
UTF encoding scheme/forms. But it is perfectly legal for me to have a
record with a field containing an *arbitrary Unicode code point*,
serialize that record, and send it off.

---------------End Forwarded Message ------------------

I concur with the general intent of this clarification, but
this is definitely in the gray area as regards exactly what
the conformance claims for the standard mean.

It is certainly good practice and the most robust approach
to an implementation for it to behave the way suggested here,
but note also the following letter of the law from 10646,
to which the Unicode Standard itself claims conformance:

<quote>
2.2 Conformance of information interchange
A code-character-data-element (CC-data-element) within coded
information for interchange is in conformance with ISO/IEC
10646 if

a) all the coded representations of graphic characters
within that CC-data-element conform to clauses 6 and
7, ...
b) all the graphic characters represented within that
CC-data-element are taken from those within an identified
subset (clause 12)

...

7. General requirements for the UCS
...
b. Code positions to which a character is not allocated,
except for the positions reserved for private use characters
or for transformation formats, are reserved for future
standardization and shall not be used for any other
purpose. ...
</quote>

2.2.a and 7.b imply that it is not conformant to interchange
reserved code points, and 2.2.b implies that what you can
interchange are only the assigned characters from a subset
(in the Unicode case, of course, the subset of the whole).

So the way I would summarize this is:

I. Reserved code points

A conformant implementation should not originate them, but
because conformant implementations may be designed to work
with multiple versions of the standard and may encounter
uplevel data, good implementation practice is to follow the
Unicode recommendations about not munging uninterpreted
code points and about passing them along unharmed.

II. Noncharacters

These cannot be used in open interchange, although they can,
of course be used in "internal" interchange, which is
essentially a private agreement (perhaps with oneself) regarding
what noncharacter usage those code points have. No external
recipient can interpret them, nor is an external recipient
obliged to preserve them if received.

III. Surrogate code points

I would claim, contra the above, that these *cannot* be
interchanged in conformance with the standard -- at all.
If one is attempting to interchange arbitrary Unicode code
points, including Cs code points (U-0000D800..U-0000DFFF),
this cannot be done with a well-formed encoding form, and
thus cannot be done in conformance with the standard.
If one claims to be *interchanging* such code points in
the context of a Unicode string (which does not, of course,
have to be well-formed to constitute a "Unicode string" by
the definition in the standard), then such interchange
is effectively a protocol built on top of the standard,
rather than something in conformance with the standard
itself.
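
The three categories can be told apart mechanically. The following is only a sketch: the function names are made up for illustration, and which code points count as "reserved" depends on the Unicode version behind the Python build in use (U+0378 is assumed here to be unassigned). It also shows that a standard UTF-8 encoder refuses a lone surrogate, in line with point III:

```python
import unicodedata


def classify(cp: int) -> str:
    """Sort a code point into the categories discussed above."""
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"              # gc=Cs
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"           # gc=Cn, permanently not a character
    if unicodedata.category(chr(cp)) == "Cn":
        return "reserved"               # gc=Cn, unassigned in this version
    return "assigned"                   # includes private use, for brevity


def can_encode_utf8(cp: int) -> bool:
    """True if Python's UTF-8 codec accepts this code point on its own."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


for cp in (0x0041, 0x0378, 0xFFFE, 0xD800):
    print(f"U+{cp:04X}", classify(cp), can_encode_utf8(cp))
```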

At any rate, that is how *I* would pick the nits.

--Ken


Kenneth Whistler

2003-08-08 21:22:00 UTC

Post by t***@widmann.uklinux.net

... Could there be another codepoint assigned that has
20CF;ZERO WIDTH SYMBOL;Sk;0;ON;<compat> 0020;;;;N;;;;;
[...]

But I'm not sure that ZERO WIDTH SYMBOL is the best name, ...

What would be a better name? ACCENT CARRIER?

How about: U+10FFFD UNNECESSARY CHARACTER ?

Philippe, you are tilting at windmills, here. There is no
chance that the UTC is going to consider such a character,
in my assessment, let alone give it the properties you
suggest.
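
For readers unfamiliar with the notation, the proposed entry quoted above follows the semicolon-separated, 15-field layout of UnicodeData.txt. A minimal sketch for reading such a line (the field labels are informal names chosen here, and the entry itself is only the hypothetical proposal, not a real character):

```python
# Informal labels for the 15 fields of a UnicodeData.txt record.
FIELDS = (
    "code", "name", "general_category", "canonical_combining_class",
    "bidi_class", "decomposition", "decimal", "digit", "numeric",
    "bidi_mirrored", "unicode_1_name", "iso_comment",
    "simple_uppercase", "simple_lowercase", "simple_titlecase",
)


def parse_line(line: str) -> dict:
    """Split one UnicodeData-style record into a field-name -> value dict."""
    values = line.rstrip("\n").split(";")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


# The hypothetical entry from the quoted proposal.
proposed = parse_line("20CF;ZERO WIDTH SYMBOL;Sk;0;ON;<compat> 0020;;;;N;;;;;")
print(proposed["general_category"], proposed["decomposition"])
```

Read this way, the proposal asks for a gc=Sk symbol with combining class 0 and a compatibility decomposition to SPACE.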

--Ken


Michael Everson

2003-08-08 22:49:21 UTC

Philippe, you are tilting at windmills, here. There is no chance
that the UTC is going to consider such a character, in my
assessment, let alone give it the properties you suggest.

Nor WG2 either.
--
Michael Everson * * Everson Typography * * http://www.evertype.com


Philippe Verdy

2003-08-08 23:37:25 UTC

Post by Michael Everson

Philippe, you are tilting at windmills, here. There is no chance
that the UTC is going to consider such a character, in my
assessment, let alone give it the properties you suggest.

Nor WG2 either.

Why is that? Because I suggest something that some others may find
useful to fill a large gap in Unicode for spacing diacritics, but I'm
not trusted enough, due to my errors or confusions here, for this
suggestion to be endorsed by more "serious" UTC or WG2
members?

I admit that the properties of such a character can be discussed, and
that it is not necessarily a "Sk" symbol but possibly a "Lo" letter, in
which case the name "INVISIBLE LETTER" may be appropriate (where
it could also fill the gap for Hebrew "Yerushala(y)im", but this is a
possibly distinct function for a missing letter in phonology).

Why do you think it is stupid to have a single carrier character that
would avoid adding new spacing diacritics, when the standard
combining diacritics could be used with fewer "quirks" like
"defective" sequences just to produce the desired effect?

If you think that spacing diacritics are stupid, why then are they
given these properties and not deprecated (no longer recommended)
in the standard, in favor of the SPACE+diacritics sequences, which
are really not equivalent to spacing diacritics used as symbols
(sometimes described also as "MODIFIER LETTER" which is
very misleading according to their gc=Sk property) and as base
characters (to which other diacritics can be applied) ?
--
Philippe.
Spams non tolérés: tout message non sollicité sera
rapporté à vos fournisseurs de services Internet.


Kenneth Whistler

2003-08-09 00:27:20 UTC

Post by Michael Everson

Philippe, you are tilting at windmills, here. There is no chance
that the UTC is going to consider such a character, in my
assessment, let alone give it the properties you suggest.

Nor WG2 either.

Mostly because there is no "large gap" here in the first place.

Post by Philippe Verdy
Why do you think it is stupid to have a single carrier character that
would avoid adding new spacing diacritics, when the standard
combining diacritics could be used with fewer "quirks" like
"defective" sequences just to produce the desired effect?

Because the mechanism for doing so -- application to SPACE or
to NBSP -- has been specified by the standard for a decade now.

Post by Philippe Verdy
If you think that spacing diacritics are stupid,

We do not. Some of them are necessary compatibility characters.
Others have distinct usage as spacing forms that warrant
their separate encoding.

Post by Philippe Verdy
why then are they
given these properties and not deprecated (no longer recommended)
in the standard,

Because the ones in the standard, and particularly the ASCII
and Latin-1 spacing diacritics, were required for a number
of legacy and implementation reasons...

Post by Philippe Verdy
in favor of the SPACE+diacritics sequences,

...and because these are not, and never have been, canonically
equivalent.
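
The distinction can be seen in the character data itself. As a small sketch with Python's standard unicodedata module (U+00B4 ACUTE ACCENT is picked here as one representative Latin-1 spacing diacritic), the decomposition to <SPACE, COMBINING ACUTE ACCENT> is tagged as a compatibility mapping, so only the K normalization forms apply it:

```python
import unicodedata

ACUTE = "\u00B4"              # ACUTE ACCENT, a Latin-1 spacing diacritic
SPACE_PLUS_ACUTE = " \u0301"  # SPACE followed by COMBINING ACUTE ACCENT

# The mapping carries the <compat> tag: compatibility, not canonical.
print(unicodedata.decomposition(ACUTE))

# Canonical normalization leaves the spacing diacritic untouched ...
print(unicodedata.normalize("NFD", ACUTE) == ACUTE)

# ... and only compatibility normalization yields SPACE + combining mark.
print(unicodedata.normalize("NFKD", ACUTE) == SPACE_PLUS_ACUTE)
```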

--Ken

"Well then, if he be mad, as he is, and with a madness that mostly
takes one thing for another, and white for black, and black for
white, as was seen when he said the windmills were giants, and the
monk's mules dromedaries, flocks of sheep armies of enemies, and
much more to the same tune, it will not be very hard to make him
believe that some country girl, the first I come across here, is
the lady Dulcinea; and if he does not believe it, I'll swear it;
and if he should swear, I'll swear again; and if he persists I'll
persist still more, so as, come what may, to have my quoit always
over the peg. Maybe, by holding out in this way, I may put a stop
to his sending me on messages of this kind another time..."


John Hudson

2003-08-09 01:01:53 UTC

Post by Kenneth Whistler
Because the mechanism for doing so -- application to SPACE or
to NBSP -- has been specified by the standard for a decade now.

True enough, but I'm also a bit concerned about this mechanism because
white space characters are another pesky thing that not all applications
paint. TeX, perhaps most famously, uses its own 'glue' instead of the space
glyph in the font. And what happens when word spacing is expanded or
contracted in text? The diacritic mark ends up being shoved to the left or
right of where it should be. Of course, if the space glyph is not painted
you have to rely on blind offsets for mark positioning, because unpainted
glyphs can't be found for smart positioning lookups. As someone who cares
about typography, I don't like blind offsets because they don't offer
precise enough control: I would much rather have a mechanism that I can
reliably and precisely use with glyph positioning lookups. I'm not
suggesting that the use of space/nbspace for this purpose should be
deprecated, only that an alternate mechanism would be useful for those who
want more control of how combining marks are rendered on a blank base.

A similar but not identical issue was raised by Peter Constable when we
were talking about Qere vs Ketiv readings in Biblical Hebrew. There are
cases in which vowels are applied to elided consonants, which in some
texts results in marks applied to a blank base in mid-word. In this case,
my concern about using space or nbspace is that these imply a word break
where there is not, in fact, any break in the word: the blank base is part
of the word.
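
The word-break concern is easy to reproduce with any whitespace-based tokenizer. In the sketch below (plain Python; the Hebrew string is only an illustrative approximation of such a word, built from code point escapes, and is not quoted from any text), a vowel carried on SPACE or NBSP in mid-word makes the naive tokenizer see two words, the second one starting with a combining mark:

```python
import unicodedata

HIRIQ = "\u05B4"                          # HEBREW POINT HIRIQ (combining)
START = "\u05D9\u05E8\u05D5\u05E9\u05DC\u05B7"  # yod resh vav shin lamed+patah
FINAL_MEM = "\u05DD"                      # HEBREW LETTER FINAL MEM

# The vowel for the elided consonant is carried on a blank base in mid-word.
word_with_space = START + " " + HIRIQ + FINAL_MEM
word_with_nbsp = START + "\u00A0" + HIRIQ + FINAL_MEM

# str.split() with no argument splits on any Unicode whitespace,
# and it treats NO-BREAK SPACE as whitespace too.
tokens_space = word_with_space.split()
tokens_nbsp = word_with_nbsp.split()

print(len(tokens_space), len(tokens_nbsp))
# The second token now begins with a combining mark that has lost its base.
print(unicodedata.combining(tokens_space[1][0]) != 0)
```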

John Hudson

Tiro Typeworks www.tiro.com
Vancouver, BC ***@tiro.com

The sight of James Cox from the BBC's World at One,
interviewing Robin Oakley, CNN's man in Europe,
surrounded by a scrum of furiously scribbling print
journalists will stand for some time as the apogee of
media cannibalism.
- Emma Brockes, at the EU summit


Peter Kirk

2003-08-09 19:27:50 UTC

Post by Michael Everson

Philippe, you are tilting at windmills, here. There is no chance
that the UTC is going to consider such a character, in my
assessment, let alone give it the properties you suggest.

Nor WG2 either.

Mostly because there is no "large gap" here in the first place.

The gap may not be large, but Philippe, John H and I have identified a
real gap. Why this antagonism against filling it? Is it just because you
don't like the name Philippe suggested? I accept that there may be
rational arguments to be made that the gap is not significant enough for
Unicode to fill, but I have not seen any such rational arguments, just
"over my dead body" type irrational responses.

Because the mechanism for doing so -- application to SPACE or
to NBSP -- has been specified by the standard for a decade now.

Understood. But John H has clearly spelled out several of the weaknesses
in this mechanism. And this is not something set in stone, there is I
think no mention of it in the stability document. So there is no a
priori reason not to define a new and improved mechanism, with the old
mechanism still supported but now discouraged.

Post by Philippe Verdy
If you think that spacing diacritics are stupid,

We do not. Some of them are necessary compatibility characters.
Others have distinct usage as spacing forms that warrant
their separate encoding.

And what if it is decided that others have "distinct usage as spacing
forms" which cannot be adequately represented by space or NBSP plus
diacritic? Of course we could propose more spacing diacritics, but
surely rather than define a potentially large number of new spacing
forms it would make sense to define one new character which can combine
with any diacritic to produce a spacing form.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/


John Cowan

2003-08-09 20:41:35 UTC

Post by Peter Kirk
The gap may not be large, but Philippe, John H and I have identified a
real gap. Why this antagonism against filling it?

What you have identified is a set of implementation defects, not problems
with the Unicode Standard. The standard way to do what you want is to
precede the combining mark with SP or NBSP. If that "doesn't work", then
the implementation that makes it not work needs to be fixed.
--
John Cowan ***@reutershealth.com http://www.ccil.org/~cowan
Does anybody want any flotsam? / I've gotsam.
Does anybody want any jetsam? / I can getsam.
--Ogden Nash, _No Doctors Today, Thank You_


Peter Kirk

2003-08-09 21:14:47 UTC

Post by Peter Kirk
The gap may not be large, but Philippe, John H and I have identified a
real gap. Why this antagonism against filling it?

Tell Microsoft! (See Noah Levitt's posting.)

If this is indeed "The standard way to do what you want", then the
standard needs to make it clear that the sequence of <space, combining
mark> or <NBSP, combining mark> has the properties which I want, i.e. it
has the width of the combining mark alone, and not the full width of a
space, and does not expand for justification, is not a line breaking
opportunity, does not in fact have any of the properties of a space. I
expect to see such a clarification in the next edition of the Unicode
Standard.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/


Philippe Verdy

2003-08-09 22:31:19 UTC

Post by Peter Kirk
The gap may not be large, but Philippe, John H and I have
identified a real gap. Why this antagonism against filling it?

Tell Microsoft! (See Noah Levitt's posting.)

And the W3C or SGML committees with the *ML character model!

Post by Peter Kirk
If this is indeed "The standard way to do what you want", then the
standard needs to make it clear that the sequence of <space, combining
mark> or <NBSP, combining mark> has the properties which I want, i.e.
it has the width of the combining mark alone, and not the full width
of a space, and does not expand for justification, is not a line
breaking opportunity, does not in fact have any of the properties of
a space. I expect to see such a clarification in the next edition of
the Unicode Standard.

Don't forget the issues created by the fact that in many cases, there's
no other way than using "defective" sequences, hoping that the
implementation will render the diacritic alone and not the dotted circle,
and will correctly space the diacritic. For now the tricky solution using
any (unspecified) control character before the diacritic is really
a trick, and not interoperable, and it complicates plain-text search
applications where there is no predictable or stable base character to
match this diacritic (in addition, many input methods or keyboard drivers
will not allow you to enter such a "defective" sequence, meaning that for
example the "Yerushala(y)im" word cannot be entered and searched
exactly within a large text, as the implied invisible letter has no stable
representation).

Note that the CGJ solution will not work when the isolated diacritic must
be the initial of a word or breakable token: for this case, the solution with
SPACE is really tricky due to the special treatment of SPACE notably
in HTML, SGML, XML and often SQL which "normalize" whitespaces.
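
The kind of problem meant here can be sketched as follows. This is not the actual whitespace rule of HTML, XML or SQL; collapse_whitespace is a made-up helper that merely imitates the common "collapse runs and trim" behaviour. A word-initial diacritic carried on SPACE loses its base once the leading space is trimmed, leaving a defective sequence, while the same pair in mid-text survives:

```python
import re

COMBINING_ACUTE = "\u0301"


def collapse_whitespace(text: str) -> str:
    """Collapse runs of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


# SPACE + diacritic at the very start of the text: the carrier is trimmed.
initial = " " + COMBINING_ACUTE + "abc"
print(repr(collapse_whitespace(initial)))

# The same pair in mid-text keeps its single carrier space.
medial = "x  " + COMBINING_ACUTE + "abc"
print(repr(collapse_whitespace(medial)))
```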

Thankfully, the existing spacing diacritics do not have these problems as
they are not canonically equivalent to the suggested SPACE+diacritic
"compatibility equivalent", however this is only part of a solution for
some diacritics (not ALL), and it only fills the use as symbols, but not
as regular letters within the same word with surrounding letters.

So there are really two gaps: a small gap for missing spacing diacritics
used as symbols, and a large gap for all isolated diacritics used within
a word (that the CGJ solution only solves in the middle or at end of a
word, but not at its initial).
--
Philippe.
Spams non tolérés: tout message non sollicité sera
rapporté à vos fournisseurs de services Internet.

Jon Hanna

2003-08-11 13:59:36 UTC

Post by Philippe Verdy
the solution with SPACE is really tricky due to the special treatment
of SPACE notably in HTML, SGML, XML

I disagree. There are a few different things that happen with whitespace in
such technologies. Some of these only apply to elements that do not allow
any character data apart from whitespace to appear directly within them, and
hence are not an issue here. Some happen at relatively high level of
processing, e.g. rendering (not parsing) of HTML, and as such should
correctly process spaces combined with combining characters.

There are only two theoretical problems that I can see here, the first is
that a whitespace character other than space gets converted to space by
attribute value normalisation, and that this changes the meaning of the text
in some way. This could only occur if the combining character were the first
character in a line of text, which is quite a nonsensical construct to begin
with.
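
As a rough illustration of the attribute value normalisation case just
described, here is a small Python sketch using the standard library's
xml.etree.ElementTree. The element name "note" and attribute name "label"
are made up for the example. It only aims to show that a literal line break
before a combining character is reported as a space inside an attribute
value, while the same characters in element content come through unchanged.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: a literal newline followed by U+0301 COMBINING
# ACUTE ACCENT appears both in an attribute value and in element content.
SOURCE = '<note label="\n\u0301">\n\u0301</note>'

root = ET.fromstring(SOURCE)

attr_value = root.get("label")  # attribute value as reported by the parser
text_value = root.text          # character data of the element

# Show the code points so the invisible characters can be compared.
print("attribute:", [f"U+{ord(c):04X}" for c in attr_value])
print("content:  ", [f"U+{ord(c):04X}" for c in text_value])
```

In the attribute the newline should come back as U+0020, so the accent ends
up applied to a SPACE; whether that "changes the meaning of the text" is the
question under discussion.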

The other would be with names, qnames, nmtokens and such. These are not
normal textual content; they are human-readable constructs that are based on
normal text because that makes it easier for some developers to work at a
plain-text level (if they speak the natural language that the human-readable
constructs were based on). Support for the linguistic oddity of a diacritic
divorced from the context in which it would normally exist would have little
justification in this place except for fulfilling the general goal of
"completeness". Completeness is a laudable aim of course, but extreme
edge-cases need only be brought in if they are both safe and cheap. Anyone
designing an XML application who frequently considers isolated diacritics as
the most natural choice in part of such tokens probably needs to take a
couple of weeks holidays before continuing the design. Of course some of the
characters that could be considered to be precomposed isolated diacritics
are banned from use in nmtokens anyway.

Peter Kirk

2003-08-11 17:58:49 UTC

Post by Jon Hanna
There are only two theoretical problems that I can see here, the first is
that a whitespace character other than space gets converted to space by
attribute value normalisation, and that this changes the meaning of the text
in some way. This could only occur if the combining character were the first
character in a line of text, which is quite a nonsensical construct to begin
with.

Not at all! Imagine a tutorial on a language, which might well list the
accents used, in a format like this:

` (grave accent) is used with a, e and o, and indicates more open
pronunciation
^ (circumflex accent) is used with any vowel, and indicates lengthening

So far so good, but when I get to an accent with no predefined spacing
variant, I have a problem!
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

John Cowan

2003-08-11 18:36:36 UTC

Post by Peter Kirk
So far so good, but when I get to an accent with no predefined spacing
variant, I have a problem!

No you don't. If you want to say <Seagull> is the diacritic used to
represent linguolabial sounds in the IPA, then you just encode U+0020 U+033C
at the beginning of the next line. If the seagull doesn't line up properly,
you complain to the foundry or the implementor.
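
For readers who want to look at the suggested sequence concretely, the
following Python sketch builds <U+0020, U+033C> and prints what the standard
unicodedata module reports for each code point. The variable name is
arbitrary, and nothing here says anything about how a given font or renderer
will actually display the result.

```python
import unicodedata

# <U+0020 SPACE, U+033C COMBINING SEAGULL BELOW>: the sequence suggested above.
isolated_seagull = "\u0020\u033C"

for ch in isolated_seagull:
    # Code point, character name, and general category (Zs = space, Mn = mark).
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# There is no precomposed character for this pair, so NFC leaves it as is.
nfc_form = unicodedata.normalize("NFC", isolated_seagull)
print(nfc_form == isolated_seagull)
```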
--
John Cowan ***@reutershealth.com http://www.ccil.org/~cowan
Is it not written, "That which is written, is written"?

Philippe Verdy

2003-08-11 23:04:01 UTC

Post by Peter Kirk
So far so good, but when I get to an accent with no predefined spacing
variant, I have a problem!

It's true that you can complain to a foundry about inappropriate glyph
positioning, but not to an implementor of other components dealing with
text boundaries. The inaccuracies we are speaking about are not in the
glyph representation but in text handling algorithms, the latter being
clearly part of the Unicode standard, unlike font problems.

Michael Everson

2003-08-11 18:46:12 UTC

Post by Peter Kirk
Not at all! Imagine a tutorial on a language, which might well list the
accents used, in a format like this:
` (grave accent) is used with a, e and o, and indicates more open
pronunciation
^ (circumflex accent) is used with any vowel, and indicates lengthening
So far so good, but when I get to an accent with no predefined
spacing variant, I have a problem!

The mechanism for doing this has been explained, and it has been
explained that if it is not implemented correctly you should yell at
the implementors.

In Mac OS X, for instance, the horizontal spacing seems to work all
right for many accents, but they seem to prefer to rest just above
the baseline. I'll report this as a rendering bug.
--
Michael Everson * * Everson Typography * * http://www.evertype.com

John Hudson

2003-08-11 19:04:17 UTC

Post by Peter Kirk
So far so good, but when I get to an accent with no predefined spacing
variant, I have a problem!

Again, you are working on the assumption that U+0020 is represented by an
actual painted glyph and not e.g. by a horizontal offset. In my experience,
the more sophisticated the application -- e.g. a professional page layout
application rather than a word processor -- the more likely it is that
white space characters will not be consistently treated as painted glyphs.
I've heard convincing arguments from the engineers of such applications
that the space character shouldn't be a glyph in the font at all, but
should simply be a numeric value telling applications how large an offset
to apply. Since most fonts do not contain glyphs for variant white space
characters such as thin and hair spaces, applications typically treat these
as offset values. Painting a glyph is only one way to represent a character.

Regards, John

Tiro Typeworks www.tiro.com
Vancouver, BC ***@tiro.com

The sight of James Cox from the BBC's World at One,
interviewing Robin Oakley, CNN's man in Europe,
surrounded by a scrum of furiously scribbling print
journalists will stand for some time as the apogee of
media cannibalism.
- Emma Brockes, at the EU summit

John Cowan

2003-08-11 19:15:18 UTC

Post by John Hudson
Again, you are working on the assumption that U+0020 is represented by an
actual painted glyph and not e.g. by a horizontal offset. In my experience,
the more sophisticated the application -- e.g. a professional page layout
application rather than a word processor -- the more likely it is that
white space characters will not be consistently treated as painted glyphs.

I'm working on the assumption that applications that claim to conform to
Unicode actually do conform to it. If they don't, and it's not the font
foundry's fault, then complain, complain, complain! It's not Unicode
that's broken, it's the implementation.

Post by John Hudson
I've heard convincing arguments from the engineers of such applications
that the space character shouldn't be a glyph in the font at all, but
should simply be a numeric value telling applications how large an offset
to apply. Since most fonts do not contain glyphs for variant white space
characters such as thin and hair spaces, applications typically treat these
as offset values. Painting a glyph is only one way to represent a character.

Nothing in the Unicode Standard says those oddball spaces have to work
"correctly" with combining diacritics.
--
A mosquito cried out in his pain, John Cowan
"A chemist has poisoned my brain!" http://www.ccil.org/~cowan
The cause of his sorrow http://www.reutershealth.com
Was para-dichloro- ***@reutershealth.com
Diphenyltrichloroethane. (aka DDT)

Kenneth Whistler

2003-08-10 22:27:26 UTC

Post by Peter Kirk
Tell Microsoft! (See Noah Levitt's posting.)

Indeed.

This is up to the implementation and the font, and is not something
that the Unicode Standard should mandate, IMO. This steps over the
bound of the plain text content.

Post by Peter Kirk
and does not expand for justification,

This is likewise an issue for the implementation. The Unicode Standard
does not mandate how a typographic implementation must implement
interword, intercharacter, or any other kind of justification.

Post by Peter Kirk
is not a line breaking
opportunity,

This, however, *is* specified. See UAX #14, in the section discussing
CM (the line break class associated with combining marks):

"If U+0020 SPACE is used as a base character, it is treated as
AL instead of SP."

What that means is that rather than sifting down through the line
break rule determinations according to a lb=SP category, it is
then handled as lb=AL, which puts it in the same class with
ordinary letters for the purposes of determining a line break
opportunity.

Of course, a conformant Unicode implementation is not *required*
to implement line-breaking as specified in UAX #14. But if it
claims it is doing so, and does not handle SP+combining_mark
combinations this way, then it is a nonconformant implementation
of line-breaking.
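
A minimal sketch of the rule quoted above, in Python. It is deliberately
reduced to three classes (SP, CM, AL) and is not an implementation of
UAX #14: the real algorithm has many more classes and pair rules, and the
wording of this rule may differ between revisions of the annex. The function
name is hypothetical. It only shows the reclassification of a SPACE that
serves as the base of a combining mark.

```python
import unicodedata

MARK_CATEGORIES = ("Mn", "Mc", "Me")


def simple_line_break_classes(text):
    """Assign a toy line break class (SP, CM or AL) to each character.

    Assumption for illustration: anything that is neither a SPACE nor a
    combining mark is lumped into AL.
    """
    classes = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) in MARK_CATEGORIES:
            classes.append("CM")
        elif ch == "\u0020":
            following = text[i + 1] if i + 1 < len(text) else ""
            if following and unicodedata.category(following) in MARK_CATEGORIES:
                # SPACE used as a base character: treated as AL instead of SP.
                classes.append("AL")
            else:
                classes.append("SP")
        else:
            classes.append("AL")
    return classes


# "a", SPACE + COMBINING ACUTE ACCENT, ordinary SPACE, "b"
sample_classes = simple_line_break_classes("a \u0301 b")
print(sample_classes)
```

With this input, only the second SPACE (the one not followed by a mark)
keeps the class SP, so only it would be handled by the rules for spaces.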

Post by Peter Kirk
does not in fact have any of the properties of a space.

It does, in fact, have some of the properties of a space, since
it is U+0020 SPACE, after all. But the important fact is that
implementations are supposed to be implementing the semantics
of the combining character sequence taking the SPACE as the base
and any following *non*-spacing combining mark as applied to
that base. If the implementations then result in inappropriate
rendering or line-breaking for that sequence, that is, as Kent
said, an issue to take up with the implementers.

Post by Peter Kirk
I
expect to see such a clarification in the next edition of the Unicode
Standard.

See above for the reasons why it is unlikely to be any more
constrained by the standard than it already is.

A point I keep trying to make, but which often gets overlooked
by people trying to code Unicode mechanisms for dealing with
edge cases, is that the design goal of the Unicode Standard is,
and always has been, to represent *plain text content*. It
cannot, and should not, IMO, deal with requirements for
representing arbitrarily fine distinctions of typographical
detail in all manuscripts and other documents in all writing
systems of the world.

Continuing to require that the Unicode Standard *must* specify
some inherent mechanism for indicating the display width of
combining character sequences clearly steps over the bounds
of what is required to represent plain text content.

--Ken

Philippe Verdy

2003-08-10 23:36:17 UTC

Post by Kenneth Whistler
A point I keep trying to make, but which often gets overlooked
by people trying to code Unicode mechanisms for dealing with
edge cases, is that the design goal of the Unicode Standard is,
and always has been, to represent *plain text content*. It
cannot, and should not, IMO, deal with requirements for
representing arbitrarily fine distinctions of typographical
detail in all manuscripts and other documents in all writing
systems of the world.

Spacing diacritics are not "on the edge" of the standard, when they
are already given a full block and handled there as symbols (not as
letters, as suggested in some parts of the UAXes), with their own identity
independent of their actual glyphic representation. I am not
discussing the typesetting of these grapheme clusters but
really the textual semantics of such combining sequences
with an invisible base character, which affect all their properties and
are not fully described in the various standard annexes. Due to the
huge legacy use of SPACE+diacritics in existing text, and the
already normative parts of some standard annexes, it will be hard
to correct the behavior or change the text of these annexes.
And that is where a new, better base character than SPACE could
help resolve the ambiguities cleanly.
--
Philippe.
Spams non tolérés: tout message non sollicité sera
rapporté à vos fournisseurs de services Internet.

Peter Kirk

2003-08-11 10:19:07 UTC

Post by Peter Kirk
Tell Microsoft! (See Noah Levitt's posting.)

Indeed.

This is up to the implementation and the font, and is not something
that the Unicode Standard should mandate, IMO. This steps over the
bound of the plain text content.
...
Continuing to require that the Unicode Standard *must* specify
some inherent mechanism for indicating the display width of
combining character sequences clearly steps over the bounds
of what is required to represent plain text content.
--Ken

Thank you, Ken. Well, you make it sound as if the problems are minimal,
and that version I can just about accept. But if Philippe is correct
about what he says about UAX#29 and UAX#14, there are some more serious
problems. It is certainly highly inappropriate for non-spacing
diacritics to be considered word boundaries. Philippe's quotations also
show that Unicode does concern itself with details of character
positioning and not just with plain text. Since Unicode does specify all
kinds of properties to do with spacing, breaking, word and sentence
boundaries, bidi behaviour etc etc, it is within the scope of Unicode
and indeed the responsibility of Unicode to define appropriate values of
all of these properties for spacing diacritics. I accept that some
things I have mentioned may have gone beyond this responsibility, so I
will withdraw those comments and continue to push only for appropriate
values of the properties which Unicode does define. And, if Philippe is
correct, many such properties are currently inappropriately defined, and
so either the text needs to be changed to correct these mistakes or a
new mechanism needs to be introduced.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

Doug Ewell

2003-08-11 15:39:45 UTC

Post by Peter Kirk
Thank you, Ken. Well, you make it sound as if the problems are
minimal, and that version I can just about accept. But if Philippe is
correct about what he says about UAX#29 and UAX#14, there are some
more serious problems. It is certainly highly inappropriate for
non-spacing diacritics to be considered word boundaries.

Non-spacing diacritics had better not be word boundaries, otherwise a
string like Québec (spelled with U+0301, as here) would be considered
two words. I don't have time right now to look up the relevant
properties and UAX's, but I sincerely hope this is just another
"Philippe mistake" and not a general misinterpretation that anyone might
make.
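
To make the "Québec" concern concrete, here is a small Python sketch of a
naive word splitter. It is not UAX #29 and the function names are made up.
It simply treats any combining mark that follows a word character as part
of that word, which is the minimum needed to keep the decomposed spelling in
one piece.

```python
import unicodedata


def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def split_words(text):
    """Naive splitter: letters and digits form words, marks extend them."""
    words, current = [], ""
    for ch in text:
        if ch.isalnum() or (current and is_mark(ch)):
            current += ch
        else:
            if current:
                words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


decomposed = "Que\u0301bec city"   # "Québec" spelled with U+0301
words = split_words(decomposed)
print(words)

# The same word in NFC uses the precomposed U+00E9 instead.
composed = unicodedata.normalize("NFC", "Que\u0301bec")
print([f"U+{ord(c):04X}" for c in composed])
```

Note that this toy splitter silently drops a mark that has no word character
before it, such as the mark in <SPACE, NSM>. That is one simple way the
search and indexing problems mentioned in this thread can show up.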

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

Peter Kirk

2003-08-11 18:00:42 UTC

I think this may be a "Peter mistake". I meant to refer to spacing
diacritics. Sorry.

It is certainly highly inappropriate for spacing diacritics to be considered word boundaries.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

Philippe Verdy

2003-08-11 23:00:49 UTC

----- Original Message -----
From: "Doug Ewell" <***@adelphia.net>
To: "Unicode Mailing List" <***@unicode.org>
Cc: "Peter Kirk" <***@ntlworld.com>; "Kenneth Whistler"
<***@sybase.com>
Sent: Monday, August 11, 2003 5:39 PM
Subject: Re: Questions on ZWNBS - for line initial holam plus alef

Not a mistake from me, sorry. From you, yes: Peter Kirk probably wanted
to speak about *spacing* diacritics (when coded with SPACE+NSM).
There is no such *spacing* character in "Québec".

Don't accuse me of something I did not say. And please be more tolerant
of what is an obvious typo in the message from Peter Kirk. Instead of
just flaming, could you read the message more carefully, and accept errors
and correct them, instead of sending such unconstructive replies?

Thanks.

Kent Karlsson

2003-08-11 11:24:41 UTC

Post by Peter Kirk
If this is indeed "The standard way to do what you want", then the
standard needs to make it clear that the sequence of <space, combining
mark> or <NBSP, combining mark> has the properties which I want, i.e. it
has the width of the combining mark alone, and not the full width of a
space,

This is up to the implementation and the font, and is not something
that the Unicode Standard should mandate, IMO. This steps over the
bound of the plain text content.

I may agree with that, but it does not answer the questions I had
earlier:
How should a freestanding double diacritic be encoded (for purposes of
meta-discussions, and the like): <SPACE, dbl diacritic> or <SPACE, dbl
diacritic, SPACE>? How should combining characters (spacing as well
as non-spacing) that are not vertically centered *roughly* be displayed?
E.g. for <SPACE, right-side combining character>, should that *roughly*
be displayed with or without a typographic void to the left of it? So
if I want a space (though not an overgrown one), should one use
<SPACE, SPACE, right-side combining character>? Or even <SPACE,
ZWSP, SPACE, right-side combining character>, to prevent "space
collapse"?
And similarly for left-side combining characters. Likewise for defective
combining sequences. If I want a visible pseudo-base, a dotted ring, or
an underline, the answers are fairly clear, using a suitable character as a
base. But not for the cases above. I don't think that should be entirely up
to each font (maker), without any recommendation. (A "should" rather
than a "shall" is quite sufficient.)

/kent k

Kenneth Whistler

2003-08-11 00:05:21 UTC

Post by Philippe Verdy
Spacing diacritics are not "on the edge" of the standard,

The "edge" I was speaking of was the requirement for the exact
display width of a nonspacing diacritic on top of a SPACE to
be specifiable in some determinant way.

Post by Philippe Verdy
when they
are already given a full block and handled there as symbols (not as
letters as suggested in some parts of UAX's), with their own identity
independent of their actual glyphic representation. I am not
discussing about the typesetting of these grapheme clusters but
really about the textual semantics of such combining sequences
with an invisible base character, affecting all their properties and
not fully described in the various standard annexes.

In case you didn't notice, I was responding to Peter Kirk's
note -- not to yours.

Post by Philippe Verdy
Due to the
huge legacy use of SPACE+diacritics in legacy text, and the
already normative parts of some standard annexes, it will be hard
to correct the behavior or change the text of these annexes.

Um, yes.

Post by Philippe Verdy
And it's where a new better base character than SPACE could
help solve cleanly the ambiguities.

Um, no. Precisely because it would introduce *another* way
to do what is already specified in the standard. It would, I
predict, lead to nothing but more trouble.

You might, perhaps, find it satisfying, but I can guarantee
that there would then be a future critic complaining about
an unnecessary distinction introduced into the standard. And
then there would be *more* text in different places of the
standard to try to correct and change, in an attempt to
try to make consistent distinctions between the behavior
of <SPACE, NSM> and <ACCENT_ANCHOR, NSM>.

--Ken

Philippe Verdy

2003-08-11 09:54:25 UTC

Post by Kenneth Whistler
Um, no. Precisely because it would introduce *another* way
to do what is already specified in the standard. It would, I
predict, lead to nothing but more trouble.
You might, perhaps, find it satisfying, but I can guarantee
that there would then be a future critic complaining about
an unnecessary distinction introduced into the standard. And
then there would be *more* text in different places of the
standard to try to correct and change, in an attempt to
try to make consistent distinctions between the behavior
of <SPACE, NSM> and <ACCENT_ANCHOR, NSM>.

I don't think so: for texts that are already coded with SPACE+NSM,
no changes will be needed, as long as the applications using
them are satisfied with their existing behavior, even if it's ambiguous
or causes problems in other applications. The rule would be not to
change things, but to offer writers a way to create new texts without
those ambiguities and problems, and to correct existing ones if authors wish.

For me, the "ACCENT ANCHOR", if you call it that, solves the
use of isolated diacritics as plain letters (such as the implied missing
y in Hebrew Yerushala(y)im), and so would behave like an alphabetic
character (whose directionality is still to be defined...)

Existing spacing diacritics are encoded as symbols (Sk), mostly
for accents used in LTR scripts, so the fact that some UAXes treat
these symbols like letters by giving them the AL property (including
one case of SPACE+NSM) is not a problem.

The usage as symbols is mostly correct for the case where a text is
speaking about a diacritic as an isolated symbol and not within words
(this is correct for most languages).

The usage within words (for an implied missing base letter, including
when this missing letter is an initial) leaves a distinct hole, for example
if one was trying to encode a word like "(Y)erushala(y)im", where the
missing base letter is the initial. For Arabic and South Asian
scripts, there's no problem as there already is a base letter to hold
initial combining vowel signs, which also works for the case of multiple
combining vowels which should not stack but be written on this base
letter. In fact in those languages, the missing consonantal base letter
is actually written with a visible glyph.

But for Latin, Cyrillic, Greek, Hebrew, and probably other scripts,
isolated diacritics are missing an explicit coded form. And there is still
the need even for Arabic and Brahmic scripts to be able to speak about
the diacritic itself, without an explicit base letter, and so the SPACE+NSM
combining sequence is for now the only solution, with the problems of
its undocumented properties.

Reread some UAXes to see the problematic impact of SPACE+NSM in
areas which are NOT related to rendering, notably when extracting word
sequences (for search and indexing), managing keyboard selections,
computing line breaks, and handling the directionality. Now consider the
even greater impact of the legacy use of SPACE as a normalizable
padding whitespace (a key feature of SGML, HTML and XML): the
legacy use of SPACE+NSM causes too many problems to satisfy
authors, who in some cases will not be able to use it because it will
not work as expected. Due to these problems, authors then resort to
even worse hacks, like using a control before the NSM, even if it creates
"defective" combining sequences, even if the dotted circle is sometimes
displayed, and even if it is parsed with an invisible but still additional
grapheme cluster for the control itself, whose presence is a pollution.

Instead of forcing authors to use defective combining sequences like
control+NSM, which would be an even worse hack, why not designate
a clean and pure invisible base character with the required properties,
so that it creates a pure combining sequence for the isolated diacritic(s)?

So the question is which invisible base character(s) to define, with
which properties?
- An invisible symbolic base character (Sk), with neutral directionality (I
called it an INVISIBLE SYMBOL);
- An invisible letter base character (Lo) with neutral directionality (you call
it an ACCENT ANCHOR, and I called it an INVISIBLE LETTER), or
- An invisible letter base character (Lo) with LTR directionality and
- An invisible letter base character (Lo) with RTL directionality

Personally, I find the term ACCENT ANCHOR ambiguous: it does
not indicate the usage precisely (it better fits the current ambiguous
usage of SPACE as this anchor for accents), and it seems restrictive as to
the kind of diacritic or other combining mark that may (should?) be
applied to it. In addition, nothing would forbid combining several
diacritics or marks on this base character.

Consider, then, that these new characters are better base characters than
SPACE, and define them with only a compatibility decomposition to
SPACE, to match the previous encoding. If those new base characters
are used without diacritics, they will be shown like the glyph for NBSP,
and not necessarily as zero-width (there's no requirement for these
invisible symbols to be zero-width in all cases, as this is a more precise
substitution for the legacy SPACE, but without the associated whitespace
properties). With these new characters, there is no need to change the
rules in the various UAXes and other Unicode algorithms.
--
Philippe.
Spam is not tolerated: any unsolicited message will be
reported to your Internet service providers.

Kenneth Whistler

2003-08-11 18:45:11 UTC

Not at all! Imagine a tutorial on a language, which might well list the
` (grave accent) is used with a, e and o, and indicates more open
pronunciation
^ (circumflex accent) is used with any vowel, and indicates lengthening

We're going round and round in circles here. Those are not lines
starting with a combining character, but lines starting with
a *spacing diacritic*.

Post by Peter Kirk
So far so good, but when I get to an accent with no predefined spacing
variant, I have a problem!

Either you have the spacing diacritic encoded (as in those instances),
or the standard indicates that you can represent one by applying the
nonspacing, *combining* mark to SPACE. In those instances, the line
still doesn't start with a combining mark -- it starts with a SPACE
character serving as the base character for the combining mark.

--Ken

Peter Kirk

2003-08-11 18:56:41 UTC

We're going round and round in circles here. Those are not lines
starting with a combining character, but lines starting with
a *spacing diacritic*.

Post by Peter Kirk
So far so good, but when I get to an accent with no predefined spacing
variant, I have a problem!

Thanks for the clarification. I probably misunderstood Jon's intention.
But is there a problem if, for example, an application sees the string
<space, space, combining mark> and regularises it (wrongly!) to <space,
combining mark>?
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

Jon Hanna

2003-08-12 11:17:35 UTC

Post by Peter Kirk
Thanks for the clarification. I probably misunderstood Jon's intention.
But is there a problem if, for example, an application sees the string
<space, space, combining mark> and regularises it (wrongly!) to <space,
combining mark>?

Yes, I was not saying that it wouldn't be sensible to begin a line of text
with a spacing diacritic (whether precomposed or created using space or
NBSP). I was saying that it wouldn't be sensible to begin a line with a
combining diacritic, since that combining diacritic would be combining with
a newline character which it's difficult to think of any possible sensible
meaning for. Attribute normalisation would change the sequence U+000A,
<combining> to U+0020, <combining> which would arguably change the meaning,
but changing the meaning of a meaningless construct isn't a problem to my
mind.
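
To make the attribute-normalisation case concrete, here is a minimal Python sketch. It relies on the XML 1.0 rule that a literal line feed inside an attribute value is normalised to U+0020, and on the standard-library xml.etree.ElementTree parser. The element name "note", the attribute name "text" and the choice of U+0301 COMBINING ACUTE ACCENT are illustrative assumptions, not something taken from this thread.

```python
import xml.etree.ElementTree as ET

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT (illustrative choice)

# Attribute value containing a literal newline followed by a combining mark.
source = '<note text="first line\n' + ACUTE + 'second line"/>'

value = ET.fromstring(source).get("text")
position = len("first line")  # index where the newline sat in the source

replaced = value[position]       # what the parser left in place of U+000A
following = value[position + 1]  # the combining mark that now follows it

print(f"U+{ord(replaced):04X} followed by U+{ord(following):04X}")
```

After parsing, the mark sits directly after a SPACE, so a consumer that follows the Unicode rules would read the pair as a spacing diacritic rather than as a line break followed by a stray mark.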

Philippe Verdy

2003-08-12 12:49:54 UTC

Post by Jon Hanna
I was saying that it wouldn't be sensible to begin a line with a
combining diacritic, since that combining diacritic would be combining
with a newline character which it's difficult to think of any possible
sensible meaning for.

A newline is a control with a whitespace property and a line-breaking
behavior. It must not combine with a combining diacritic, according to
the UAX definition of grapheme clusters.

So <newline>+NSM is clearly defective and must be parsed as two distinct
combining sequences, the first one for the newline sequence, the second
one being "defective" as the combining character does not have a base
character to which it applies (the standard suggests using a dotted
circle to render it in editors, but suggests nothing for the rendering
of final documents, which could simply drop the defective sequence or
display it with a replacement base character, or use a dotted circle, or
an invisible glyph). So the result in this case is implementation
dependent, and not interoperable.

For me the term "difficult" is inappropriate. In fact it is invalid for
interoperability (even though it is valid, not forbidden, for
ISO10646/Unicode, as a string fragment for intermediate processing),
and such a sequence should not occur in actual documents, outside of any
external processing context which defines its behavior.
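
The parsing described above can be sketched in Python as follows. This is a simplified illustration, not the UAX #29 algorithm: it treats any character whose general category starts with "M" as a combining mark, and treats categories Cc, Zl and Zp as controls or line separators that cannot serve as a base. The function names are hypothetical.

```python
import unicodedata

def is_mark(ch):
    # Mn, Mc and Me are the combining-mark general categories.
    return unicodedata.category(ch).startswith("M")

def is_control(ch):
    # Simplifying assumption: controls and line/paragraph separators
    # never act as a base character.
    return unicodedata.category(ch) in ("Cc", "Zl", "Zp")

def combining_sequences(text):
    """Split text into (sequence, is_defective) pairs."""
    sequences = []  # each entry: [characters, is_defective]
    for ch in text:
        can_attach = bool(sequences) and not is_control(sequences[-1][0][0])
        if is_mark(ch) and can_attach:
            sequences[-1][0] += ch          # mark joins the preceding sequence
        elif is_mark(ch):
            sequences.append([ch, True])    # no usable base: defective sequence
        else:
            sequences.append([ch, False])   # a new base (or control) starts here
    return [(chars, defective) for chars, defective in sequences]

print(combining_sequences("\n\u0301a"))  # newline, then a defective mark, then "a"
print(combining_sequences(" \u0301"))    # SPACE carries the mark: one sequence
```

With SPACE as the base the mark is absorbed into one well-formed sequence, while after a newline it is left on its own and flagged as defective.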

Jon Hanna

2003-08-12 14:05:51 UTC

Post by Philippe Verdy
For me the term "difficult" is inappropriate. In fact it is invalid for
interoperability (even though it is valid, not forbidden, for
ISO10646/Unicode, as a string fragment for intermediate processing),
and such a sequence should not occur in actual documents, outside of any
external processing context which defines its behavior.

So the fact that you can't stick it into XML won't cause you many tears
then.
Good.

Peter Kirk

2003-08-12 11:57:57 UTC

Post by Jon Hanna

Thanks for the clarification. I think the combining mark would not
combine with the new line mark but would be a defective combining
sequence. I might wish to do this simply because, according to UTR #14
this is the only way to get a combining mark to be treated as AL as I
might wish. Probably not the best way to do this, but not illegal!

So it seems to me that this attribute normalisation is a problem. It is
a problem for the higher level protocol, as it thinks it has created a space
but in fact it has created a combining sequence which it must not treat
as a space. A legal sequence at a lower level, even if meaningful,
should not confuse the higher level. (Indeed I don't think the higher
level ought to be confused even by illegal sequences at the lower level,
it should be transparent as far as possible.) So the higher level
protocol needs to know not only not to split a space, combining mark
sequence but also not to create one where one was not present before.
Perhaps it needs to insert a suitable separator (ZWNJ?) to ensure that
when the space is created it is not combined with the combining mark. So
another example of needless complication created by the long-standing
decision to permit space as a carrier for combining marks.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

Kenneth Whistler

2003-08-11 19:13:59 UTC

Then you have a problem, of course.

What the Unicode Standard says about application of nonspacing
combining marks to SPACE seems clear to me.

What other standards say about space folding is clear in their
own contexts.

If someone is implementing both such standards together, then
one has to be careful how the requirements articulate.

In Unicode terms, a space folding is an example of a "knowing
modification" of the content of the text. It is perfectly o.k.
to modify Unicode text, of course, *as long as you know what
you are doing* -- i.e., you aren't converting valid text to
bit hash because you aren't conforming to the meaning of
the characters or to their encoding forms.

Now if a process is doing a space folding, but is applying
it to Unicode text as a "semi-ignorant modification", i.e.,
without being aware of the fact that nonspacing combining
marks can apply to SPACE characters (and that such sequences
are valid combining character sequences and should be treated
analogously with other grapheme clusters, viz UAX #29), then
it is modifying the text away from its intended content without
*knowing* what it is actually doing. Such mistakes are
programming errors in application of the relevant standards.
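
The difference between the two kinds of modification can be sketched in Python. This is only an illustration under stated assumptions: "folding" here means collapsing runs of U+0020, and a SPACE counts as carrying a mark when the next character has a general category starting with "M". The function names are hypothetical and do not come from any standard.

```python
import re
import unicodedata

def naive_fold(text):
    # "Semi-ignorant" folding: collapse every run of spaces to one space.
    return re.sub(r" {2,}", " ", text)

def mark_aware_fold(text):
    # Fold only bare spaces; a SPACE that is the base of a combining
    # mark is kept as part of its combining character sequence.
    out = []
    prev_bare_space = False
    for i, ch in enumerate(text):
        nxt = text[i + 1:i + 2]
        carries_mark = (ch == " " and nxt != ""
                        and unicodedata.category(nxt).startswith("M"))
        bare_space = ch == " " and not carries_mark
        if bare_space and prev_bare_space:
            continue  # drop a redundant bare space
        out.append(ch)
        prev_bare_space = bare_space
    return "".join(out)

sample = "a   \u0301b"  # three spaces; the last one carries U+0301
print(repr(naive_fold(sample)))
print(repr(mark_aware_fold(sample)))
```

The naive version leaves a single SPACE that now serves as the base for the mark, so the separating whitespace is lost. The mark-aware version keeps one bare space plus the intact SPACE+mark sequence.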

Of course a standard which mandates space folding is also
within its rights to mandate, for example, the non-use of
nonspacing marks applied to SPACE characters. It can simply
rule out such sequences as valid for its context, in which
case the problem goes away.

The important thing here is to know what you are doing when
you modify text, and, as far as possible, to accomplish
such modifications in ways that are the same as other
processes which also know what they are doing. That is the
basis for interoperability of textual data.

--Ken

Kenneth Whistler

2003-08-11 19:26:44 UTC

Post by Peter Kirk
I think this may be a "Peter mistake". I meant to refer to spacing
diacritics. Sorry.
It is certainly highly inappropriate for spacing diacritics to
be considered word boundaries.

Why? It is entirely dependent on the orthography and conventions
involved. There is probably as much (or more) bad ASCII usage
of spacing diacritics like `this', where a grave accent character
is being misapplied to make a directional quotation mark, as
there is actual, linguistically appropriate use of spacing
diacritics.

Also, everyone should consider carefully the status of UAX #29,
Text Boundaries.

<quote>
2 Conformance

This is informative material. There are many different ways to
divide text elements corresponding to grapheme clusters, words
and sentences, and the Unicode Standard and this document do not
restrict the ways in which implementations can do this.

This specification is a <emphasis>default</emphasis> mechanism;
more sophisticated engines can and should tailor it for particular
locales or environments. ...
</quote>

The whole UAX is informative. It is a here's-how-you-can-approach-
the-problem implementation guide with some suggestions for
rules and classes.

*If* you are working with an orthography that uses one or more
spacing diacritics, and
*If* those spacing diacritics need to be represented by
<SPACE, NSM> sequences,

then you are in the situation where your implementation of
text boundaries should take <SPACE, NSM> sequences explicitly
into account, so as to result in expected behavior for that
orthography.

Everyone has had experiences with their platform UI producing
bad results for text boundaries. The Solaris platform I am
writing this on right now, for example, implements a double-click
word selection that treats the string "`this'," above, including
the grave accent, the apostrophe, and the comma, as a "word".
Is that right or wrong? Well, it depends on what you are trying
to do, I expect.

But even the most sophisticated platform implementers can only
do so much with processes like default word selection. It is
bound to be wrong for one purpose or another and for one
orthography or another. Ultimately you need to have tailored
processes that can be orthography-specific if you want to
get best results.

--Ken

Mark Davis

2003-08-11 23:06:55 UTC

Some of this seems to be in reference to an earlier contention that
Text Boundaries (inc. Lines) break between the space and the
non-spacing mark. I think this was attributed to Philippe.

[This may not be true: I don't actually read his email, because the
information content per line falls below my email threshold; not to
say that there may not be information there, but I cannot afford to
take the time to find out -- sadly, one of my character flaws.]

All of the text boundaries preserve grapheme cluster boundaries, which
never separate a base character (including space and NBSP) from a
following NSM. In addition, each of the boundary types above grapheme
clusters makes some statement about the behavior of the grapheme
cluster. For example, with line boundaries a SPACE + NSM has a special
behavior. With the others, the behavior is the same as the base
character.

As Ken points out, in any event these are default boundaries, and can
be tailored. That being said, if the normal behavior of the default
can be improved, and someone has a concrete proposal for doing so,
then it can be considered.

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Kenneth Whistler" <***@sybase.com>
To: <***@ntlworld.com>
Cc: <***@unicode.org>; <***@sybase.com>
Sent: Monday, August 11, 2003 12:26
Subject: Re: Questions on ZWNBS - for line initial holam plus alef

Peter Kirk

2003-08-11 23:41:38 UTC

Post by Mark Davis
Some of this seems to be in reference to an earlier contention that
Text Boundaries (inc. Lines) break between the space and the
non-spacing mark. I think this was attributed to Philippe.
[This may not be true: I don't actually read his email, because the
information content per line falls below my email threshold; not to
say that there may not be information there, but I cannot afford to
take the time to find out -- sadly, one of my character flaws.]
All of the text boundaries preserve grapheme cluster boundaries, which
never separate a base character (including space and NBSP) from a
following NSM. In addition, each of the boundary types above grapheme
clusters makes some statement about the behavior of the grapheme
cluster. For example, with line boundaries a SPACE + NSM has a special
behavior. With the others, the behavior is the same as the base
character.
As Ken points out, in any event these are default boundaries, and can
be tailored. That being said, if the normal behavior of the default
can be improved, and someone has a concrete proposal for doing so,
then it can be considered.
Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

I was aware that there should not be a line break or word break between
the space and the NSM, although I suspect that many implementers will
not be aware of this, or at least will not test for it properly and so
treat any space as a word break and a line break opportunity. As I just
wrote, this requirement to test all spaces for following NSMs is a
significant inefficiency built into the standard.

But there is still a problem if there is considered by default to be a
word break and a line break opportunity AFTER the NSM. I would suggest,
as a candidate for a concrete proposal, that the default behaviour be
adjusted so that there is no word break or line break opportunity here
either.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

Mark Davis

2003-08-12 01:46:18 UTC

There are a number of incorrect statements. My comments below.

----- Original Message -----
From: "Peter Kirk" <***@ntlworld.com>
To: "Kenneth Whistler" <***@sybase.com>
Cc: <***@unicode.org>
Sent: Monday, August 11, 2003 16:28
Subject: Re: Questions on ZWNBS - for line initial holam plus alef

Post by Peter Kirk
I was aware that there should not be a line break or word break between
the space and the NSM, although I suspect that many implementers will
not be aware of this, or at least will not test for it properly and so
treat any space as a word break and a line break opportunity.

Hard to be clearer than what is written in the LineBreak UAX. (see
below).

Post by Peter Kirk
As I just
wrote, this requirement to test all spaces for following NSMs is a
significant inefficiency built into the standard.

This is incorrect. Characters (not just spaces) only need to be
checked for following NSMs in *those processes where that makes a
difference*. And in most of those processes, like line-break, some
lookahead is required anyway. To see, for example, whether there is a
linebreak after a character X, in almost all cases I have to look at
the character after X, and in many cases I have to look at more than
one character. Notice, for example, that in the sequence "a<space>" I
have to look ahead to see if there is a ":", so that French
punctuation works correctly.

In practice, looking at a character past a space does not represent a
significant performance issue. One is typically using a mechanism
(like an augmented state machine) that maintains enough state that
that is not an issue.

Post by Peter Kirk
But there is still a problem if there is considered by default to be a
word break and a line break opportunity AFTER the NSM. I would suggest,
as a candidate for a concrete proposal, that the default behaviour be
adjusted so that there is no word break or line break opportunity here
either.

It helps if "concrete proposals" were actually, well, concrete.

I see no problem with Line Break.
(http://www.unicode.org/reports/tr14/#Algorithm):

Space + NSM is treated as a unit, with behavior that is pretty
consistent with a stand-alone accent like "^". To quote:

LB 7a In all of the following rules, if a space is the base character
for a combining mark, the space is changed to type ID. In other words,
break before SP CM* in the same cases as one would break before an ID.

Treat SP CM* as if it were ID

If you want non-breaking behavior, you use NBSP + NSM; if you want
breaking behavior, you use SP + NSM. The algorithm does that.
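
As a rough illustration of the quoted rule, the following Python sketch resolves line-break classes for a string so that SP followed by combining marks is treated as a single ID unit, while NBSP followed by marks keeps its GL (non-breaking) class. The class assignment is a deliberate simplification: only SP, GL for NBSP, and CM are recognised, and every other character is given AL as a placeholder. The function names are hypothetical and this is not the full UAX #14 algorithm.

```python
import unicodedata

def simple_class(ch):
    # Heavily simplified stand-in for the UAX #14 line-break classes.
    if ch == " ":
        return "SP"
    if ch == "\u00A0":
        return "GL"   # NO-BREAK SPACE
    if unicodedata.category(ch) in ("Mn", "Mc", "Me"):
        return "CM"
    return "AL"       # placeholder for everything else

def resolve_sp_cm(text):
    """Return (unit, class) pairs, treating SP CM* as if it were ID."""
    units = []
    for ch in text:
        cls = simple_class(ch)
        if cls == "CM" and units:
            prev_text, prev_cls = units[-1]
            # A mark joins its base; a SPACE base is changed to type ID.
            new_cls = "ID" if prev_cls == "SP" else prev_cls
            units[-1] = (prev_text + ch, new_cls)
        else:
            units.append((ch, cls))
    return units

print(resolve_sp_cm("a \u0302b"))       # SPACE + circumflex becomes one ID unit
print(resolve_sp_cm("a\u00A0\u0302b"))  # NBSP + circumflex stays GL
```

The pair rules of the line-break algorithm would then run over these resolved classes, which is why choosing SP or NBSP as the base selects breaking or non-breaking behavior.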

I also see no problem with word-break
(http://www.unicode.org/reports/tr29/#Word_Boundaries). Look at the
specific text. To quote:

Treat a grapheme cluster as if it were a single character: the first
character of the cluster.
GC → FC (3)
...
Otherwise, break everywhere (including around ideographs).
Any ÷ Any (14)

None of the other rules are relevant.

So what this does is that SPACE + NSM will break before the space and
after the NSM (assuming there is only one). So it will behave like a
symbol, such as "*", or ")", or "^".
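
A toy Python version of this behavior is sketched below. It keeps only three ideas from the quoted rules: a grapheme cluster is treated as its first character, adjacent letters are not separated, and otherwise there is a break everywhere. All other word-boundary rules are omitted, str.isalpha() stands in for the letter class, and the function names are hypothetical.

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining marks that follow it."""
    result = []
    for ch in text:
        if unicodedata.category(ch).startswith("M") and result:
            result[-1] += ch
        else:
            result.append(ch)
    return result

def default_word_pieces(text):
    """Split text at the (simplified) default word boundaries."""
    pieces = []
    for cluster in clusters(text):
        first = cluster[0]  # the cluster behaves like its first character
        if pieces and first.isalpha() and pieces[-1][0].isalpha():
            pieces[-1] += cluster   # letter after letter: no break
        else:
            pieces.append(cluster)  # otherwise break everywhere
    return pieces

print(default_word_pieces("ab \u0302cd"))  # SPACE + circumflex stands alone
print(default_word_pieces("ab^cd"))        # same shape as a spacing "^"
```

Both inputs split into three pieces, with the SPACE + NSM cluster isolated in the same way as the stand-alone "^".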

The one area I do see that there may be an issue is with one that you
didn't mention,
http://www.unicode.org/reports/tr29/#Sentence_Boundaries. Sp + NSM
should not behave as Sp in the rules (8), (10), and (11). Even there,
it will produce at most a minor oddity.

If we wanted to change it, the *concrete* change would be to replace
(4) by:

Treat a grapheme cluster as if it were a single character: the first
character of the cluster, except if that first character is a space.
In that case, change to Any.
SGC → FC (4a)
GC → FC (4b)


Peter Kirk

2003-08-12 11:01:36 UTC

Post by Mark Davis
There are a number of incorrect statements. My comments below.

Thanks for the clarifications. Sorry about the inaccuracies. On some
maybe Philippe misled me, on others it is just my inadequate understanding.

Post by Mark Davis
...
In practice, looking at a character past a space does not represent a
significant performance issue. One is typically using a mechanism
(like an augmented state machine) that maintains enough state that
that is not an issue.

Understood. I hope Microsoft is listening.

Post by Mark Davis
...
It helps if "concrete proposals" were actually, well, concrete.

Of course! But I need help to get rid of any inaccuracies before the
concrete sets.

Post by Mark Davis
I see no problem with Line Break.
Space + NSM is treated as a unit, with behavior that is pretty
LB 7a In all of the following rules, if a space is the base character
for a combining mark, the space is changed to type ID. In other words,
break before SP CM* in the same cases as one would break before an ID.
Treat SP CM* as if it were ID
If you want non-breaking behavior, you use NBSP + NSM; if you want
breaking behavior, you use SP + NSM. The algorithm does that.

Thank you. I have looked at this. Well, the ideal for me would be a
mechanism whereby base + NSM was AL, rather than ID or GL. The problem
comes, if I understand correctly, with a sequence like SP XX CM* AL,
where I want a break opportunity after SP but not before AL. If I use
NBSP for XX, I get no break opportunity at all. If I use SP, I may
get a break before AL. But I suppose SP SP CM* WJ AL would do what I
want, perhaps also SP ZWSP NBSP CM* AL as the break opportunity after
ZWSP takes precedence over the no break before NBSP.
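
The LB 7a rule quoted above can be pictured with a toy classifier. This is only a sketch under loud assumptions: it knows four line break classes (SP, GL for NBSP, ID, and AL as a catch-all), it does not implement the pair table, and the function name classify is invented here.

```python
import unicodedata

def is_mark(ch):
    # Approximation of the CM class: general categories Mn, Mc, Me.
    return unicodedata.category(ch).startswith("M")

def classify(text):
    """Return (unit, line_break_class) pairs for a toy subset of classes."""
    out = []
    for ch in text:
        if out and is_mark(ch):
            unit, cls = out[-1]
            # LB 7a as quoted: a space that is the base of a combining
            # mark is treated as ID; other bases keep their own class.
            out[-1] = (unit + ch, "ID" if cls == "SP" else cls)
        elif ch == "\u0020":
            out.append((ch, "SP"))
        elif ch == "\u00a0":
            out.append((ch, "GL"))
        else:
            out.append((ch, "AL"))  # toy default, not real property data
    return out

sp_case = classify("a\u0020\u0301b")    # SP + mark becomes one ID unit
nbsp_case = classify("a\u00a0\u0301b")  # NBSP + mark stays GL
```

Under these assumptions SP CM* comes out as ID (break opportunities on both sides) and NBSP CM* as GL (none), which is the choice being weighed here.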

Post by Mark Davis
I also see no problem with word-break
(http://www.unicode.org/reports/tr29/#Word_Boundaries). Look at the
Treat a grapheme cluster as if it were a single character: the first
character of the cluster.
GC → FC (3)
...
Otherwise, break everywhere (including around ideographs).
Any ÷ Any (14)
None of the other rules are relevant.
So what this does is that SPACE + NSM will break before the space and
after the NSM (assuming there is only one). So it will behave like a
symbol, such as "*", or ")", or "^".

OK, no real problem then. In some circumstances it might be more
appropriate for space + NSM to behave like a letter rather than a
symbol, but I recognise that tailoring may be required for fine details.

Post by Mark Davis
The one area I do see that there may be an issue is with one that you
didn't mention,
http://www.unicode.org/reports/tr29/#Sentence_Boundaries. Sp + NSM
should not behave as Sp in the rules (8), (10), and (11). Even there,
it will produce at most a minor oddity.
If we wanted to change it, the *concrete* change would be to replace
(4) by:
Treat a grapheme cluster as if it were a single character: the first
character of the cluster, except if that first character is a space.
In that case, change to Any.
SGC → FC (4a)
GC → FC (4b)

Do you mean: "SGC → Any (4a)"?

How should I go about making a concrete proposal for this?

Anyway, many thanks for your help. I think I am beginning to realise
that this is a small problem which has been blown out of proportion by
others. I still see the space + NSM choice as a rather poor initial
design, but one which can be lived with.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

Peter Kirk

2003-08-11 23:28:19 UTC

Why? It is entirely dependent on the orthography and conventions
involved. ...

Well, agreed, there may be orthographic conventions in which a spacing
diacritic is considered a word boundary or a break opportunity e.g. if
used like a hyphen. But there are other mechanisms for forcing a word
boundary where otherwise there would not be one. Are there any to suppress a
word boundary? Perhaps I need to encode <WJ, space, diacritic, WJ> to
avoid the word boundary implication? Would this work?

Post by Kenneth Whistler
... There is probably as much (or more) bad ASCII usage
of spacing diacritics like `this', where a grave accent character
is being misapplied to make a directional quotation mark, as
there is actual, linguistically appropriate use of spacing
diacritics.

But this is an abuse of the spacing diacritic as punctuation. Proper,
linguistically appropriate use of spacing diacritics should not be
broken in order to support abuse. Or, if the standard wants to support
such abuse, we can reserve <space, diacritic> for the abuse and define
a new character XXX such that <XXX, diacritic> has the properties for
the linguistically appropriate use.

Post by Kenneth Whistler
Also, everyone should consider carefully the status of UAX #29,
Text Boundaries.
<quote>
2 Conformance
This is informative material. There are many different ways to
divide text elements corresponding to grapheme clusters, words
and sentences, and the Unicode Standard and this document do not
restrict the ways in which implementations can do this.
This specification is a <emphasis>default</emphasis> mechanism;
more sophisticated engines can and should tailor it for particular
locales or environments. ...
</quote>
The whole UAX is informative. ...

Then let it be correctly informative and not full of misinformation. And
let its default mechanism and recommendations be appropriate for the
majority of uses, including such cases as lists of diacritics which may
occur in any orthography.

Ken, it seems to me all the more clearly from looking at the latest
batch of postings on this list that the <space, diacritic> mechanism
defined by Unicode is fundamentally flawed. It works, but it creates a
serious and needless complication for all kinds of other processes,
including rendering and higher level processes. These processes cannot
simply take a space as a space and process it as such. Every time they
come across a space (which is very often!) they have to test whether it
is followed by a combining character, and if it is they have to treat
that space specially. This has created a serious problem for
implementers, which is why they have produced non-conforming
implementations - and we are not talking about small companies which
have rushed into the market recently, we are talking about Microsoft,
among others, which has been sponsoring Unicode from the start, I
understand. Surely the UTC should not create difficulties for
implementers and then just shout at them for getting things wrong. The
UTC should try to produce a standard which is workable without
unnecessary complications.
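
For concreteness, the test in question is a one-character lookahead. A minimal Python sketch, assuming that "combining character" can be approximated by the general categories Mn, Mc and Me; the function name spaces_with_marks is hypothetical:

```python
import unicodedata

def spaces_with_marks(text):
    """Yield indices of SPACE or NBSP characters that are directly
    followed by a combining mark."""
    for i in range(len(text) - 1):
        follows = unicodedata.category(text[i + 1])
        if text[i] in "\u0020\u00a0" and follows.startswith("M"):
            yield i

sample = "a b \u0301c\u00a0\u0308"
hits = list(spaces_with_marks(sample))
```

Here the plain space at index 1 is an ordinary space, while the ones at indices 3 and 6 carry a mark and would need the special treatment discussed above.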

I agree that it works better to use NBSP here. There are fewer such
problems, but they have not gone away entirely. And NBSP is more likely
to be treated by implementers (in the absence of other guidelines from
Unicode) as fixed width, not trimmed to the width needed for the diacritic.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

John Cowan

2003-08-12 01:03:05 UTC

Post by Peter Kirk
These processes cannot
simply take a space as a space and process it as such. Every time they
come across a space (which is very often!) they have to test whether it
is followed by a combining character, and if it is they have to treat
that space specially.

This must be done for all other base characters as well.

Post by Peter Kirk
This has created a serious problem for
implementers, which is why they have produced non-conforming
implementations - and we are not talking about small companies which
have rushed into the market recently, we are talking about Microsoft,
among others, which has been sponsoring Unicode from the start, I
understand.

You don't have (nor do I) the vaguest idea why Microsoft produced
this particular nonconforming implementation, or whether they
consider it a bug or not.

Post by Peter Kirk
Surely the UTC should not create difficulties for
implementers and then just shout at them for getting things wrong. The
UTC should try to produce a standard which is workable without
unnecessary complications.

This is sheer conjecture.
--
John Cowan www.ccil.org/~cowan ***@reutershealth.com www.reutershealth.com
[P]olice in many lands are now complaining that local arrestees are insisting
on having their Miranda rights read to them, just like perps in American TV
cop shows. When it's explained to them that they are in a different country,
where those rights do not exist, they become outraged. --Neal Stephenson

Peter Kirk

2003-08-12 10:24:46 UTC

Post by John Cowan
You don't have (nor do I) the vaguest idea why Microsoft produced
this particular nonconforming implementation, or whether they
consider it a bug or not.

Don't make assumptions about things you don't know anything about. I
have been working closely and personally with Microsoft's head of
typography on support for Hebrew and other scripts in Uniscribe. While I
don't happen to have detailed information on this particular point, I am
aware of some of the constraints that Microsoft has been under e.g. to
avoid the inefficiency of calling Uniscribe for rendering of plain text
in western languages. This is why they have been slow to support use of
arbitrary diacritics with Latin text. I think this issue may have been
fixed with the soon to be released new version of Uniscribe, and perhaps
the problem with spaces and diacritics has also been fixed. We'll see.

Post by John Cowan
This is sheer conjecture.

No, it is not. For one thing I have not said that the UTC has done
anything bad, and certainly not that it has done so deliberately, only
that it should not do so. But it is not just me who has pointed to the
difficulty for implementers of the space + diacritic convention which
the UTC defined (with inadequate forethought rather than malicious
intention), see also John Hudson's independent opinions and the failure
of Microsoft to implement it. I was wrong to suggest that the UTC is
shouting at implementers for getting things wrong, though I think it
should do so if they do. But UTC members have told me to complain to
implementers for getting things wrong. As for my last statement, that is
simply my opinion. If you wish to disagree with it, do you prefer that
the UTC should deliberately produce an unworkable standard, or that it
should introduce unnecessary complications?
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

Kenneth Whistler

2003-08-11 19:57:11 UTC

Post by Kent Karlsson
How should a freestanding double diacritic be encoded (for purposes of
meta-discussions, and the like): <SPACE, dbl diacritic> or <SPACE, dbl
diacritic, SPACE>?

It *could* be represented as <SPACE, dbl_diacritic, SPACE>, of course,
or for that matter <SPACE, dbl_diacritic, TAB>, or other possibilities.
The combining character sequence, in either case, is the
<SPACE, dbl_diacritic> sequence.

But it *should* be represented by something visually more
meaningful, such as <U+25CC, dbl_diacritic, U+25CC>, which is
how the standard itself tends to represent it when needing
to engage in a meta-discussion. The whole point of a double
diacritic is its graphic application to two base characters,
which point is lost in the discussion if you don't show a
graphic base when displaying the character in isolation.
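
The two spellings can be written out as follows. U+0361 COMBINING DOUBLE INVERTED BREVE is picked here purely as an example of a double diacritic; the variable names are illustrative only.

```python
import unicodedata

DOTTED_CIRCLE = "\u25cc"
DOUBLE_INVERTED_BREVE = "\u0361"  # example double diacritic, chosen for illustration

# <SPACE, dbl_diacritic, SPACE>: legal, but nothing visible sits under the mark
on_space = "\u0020" + DOUBLE_INVERTED_BREVE + "\u0020"

# <U+25CC, dbl_diacritic, U+25CC>: the mark is shown spanning two visible bases
on_dotted_circle = DOTTED_CIRCLE + DOUBLE_INVERTED_BREVE + DOTTED_CIRCLE

names = [unicodedata.name(ch) for ch in on_dotted_circle]
print(names)
```

In both spellings the combining character sequence proper is just the first base plus the mark; the second base follows it as a separate character.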

Post by Kent Karlsson
How should combining characters (spacing as well
as non-spacing) that are not vertically centered *roughly* be displayed,
e.g. <SPACE, right-side combining character>, should that *roughly*
be displayed with or without a typographic void to the left of it?

It's up to the application. And again, I would say that if this
level of detail is a concern to the person originating the text,
then the better convention is to represent the combining character
on a *visible* generic base.

Post by Kent Karlsson
So
if I want a space (though not an overgrown one), should one use
<SPACE, SPACE, right-side combining character>? Or even <SPACE,
ZWSP, SPACE, right-side combining character>, to prevent "space
collapse".
And similarly for left-side combining characters. Likewise for defective
combining sequences. If I want a visible pseudo-base, a dotted ring, or an
underline, the answers are fairly clear, using a suitable character as a
base.

Exactly. Which is why you should use such conventions if you
care about the placement in this detail.

Otherwise, you up-level and make use of whatever mechanisms a
typesetting application makes available for individual adjustment
of the placement of glyphs.

--Ken

Post by Kent Karlsson
But not for the cases above. I don't think that should be entirely up
to each font (maker), without any recommendation. (A "should" rather
than a "shall" is quite sufficient.)
/kent k

Jim Allan

2003-08-11 20:35:46 UTC

Post by Kenneth Whistler
Of course a standard which mandates space folding is also
within its rights to mandate, for example, the non-use of
nonspacing marks applied to SPACE characters. It can simply
rule out such sequences as valid for its context, in which
case the problem goes away.

And for such standards or applications one can usually use U+00A0
NO-BREAK SPACE to force multiple spacings.

One can also use this followed by a non-spacing combining character to
call for rendering of that combining character in isolation.
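
For comparison, the compatibility decompositions of the legacy spacing accents in the Unicode character data use SPACE as the base, and NBSP itself has a compatibility mapping to SPACE. A short Python check using the standard unicodedata module (the variable names are illustrative only):

```python
import unicodedata

# Legacy spacing accents: ACUTE ACCENT, DIAERESIS, MACRON
for ch in "\u00b4\u00a8\u00af":
    decomposed = unicodedata.normalize("NFKD", ch)
    print(unicodedata.name(ch), [f"U+{ord(c):04X}" for c in decomposed])

# The NBSP-based spelling for a free-standing acute accent
nbsp_acute = "\u00a0\u0301"
folded = unicodedata.normalize("NFKD", nbsp_acute)
```

One consequence is that a process which applies NFKC or NFKD will turn NBSP + mark into SPACE + mark, so the NBSP spelling survives only where compatibility normalization is not applied.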

My feeling is that, because of the special qualities of regular SPACE,
using NBSP (U+00A0) should be the more robust way to go.

Essentially, since the Unicode specifications say that a non-spacing
diacritic can be applied to any base character, including the spaces, it
is up to fonts and other presentation software to support this and to
try to make the results look good according to orthographic and cultural
expectations, just as it is with any text coded in Unicode.

Sometimes fonts don't do this. I would not at all be surprised to find
for example that _g_ followed by U+0325 COMBINING RING BELOW would come
out with the combining ring overlapping the tail of the _g_ unless I
were using a font especially designed for linguistic use.

I would not be at all surprised that some fonts and display devices
wouldn't justify NBSP + COMBINING DOT BELOW at the beginning of a line.
But good typographical fonts should justify such combinations and should
presumably change the width of NBSP when appropriate.

Such changes of width and shapes are what one finds with ligatures in
fonts that support ligatures.

Jim Allan

Philippe Verdy

2003-08-11 22:14:48 UTC

Some of these only apply to elements that do not allow any
character data apart from whitespace to appear directly within them, and
hence are not an issue here. Some happen at relatively high level of
processing, e.g. rendering (not parsing) of HTML, and as such should
correctly process spaces combined with combining characters.

Here I have to disagree: in XML, the normalization of whitespaces occurs
during parsing before the DOM tree is built, and so the initial whitespaces
are made inaccessible; rendering occurs only later based on the parsed
DOM tree. This is to ensure the equivalence of the encoding under very
strict conditions defined in the XML standard (and now retrofitted in the
HTML standard to mimic the standard practices of HTML 4.01 in
XHTML 1.0, and now 1.1 with the XHTML modularization).

Strict conformance for the behavior of these whitespaces is mandatory and
cannot be bypassed or negotiated, notably when XML data needs to be
certified against alteration, i.e. cryptographically signed (XML Signature
is now standardized), or when the DOM tree is used and altered in a
predictable way with technologies like XPath which needs to refer to
exact encoding position in the encoded Unicode NFC form of text elements,
attribute values, or CDATA sections.

Jon Hanna

2003-08-12 11:38:31 UTC

Lots of different things happen that affect the whitespace of an XML
document (whether a DOM tree is constructed or not, since it isn't the only
legal way to process an XML document).

Of course rendering can do something further to parsing with whitespace.
Rendering can do whatever the rendering engine wants to do, it isn't defined
by XML. When an application receives U+0020, U+0020, U+0302, U+0020 then it
should probably (unless there are good application-specific reasons why not)
treat that more or less the same as if it had received U+0020, U+005E,
U+0020 (if there are minor glyph differences fair enough). This isn't a
matter of XML's whitespace rules, but it is a matter of how what we are
discussing affects XML-based technology as a whole.
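
What a parser hands to the application can be checked directly. A small sketch using Python's standard xml.etree.ElementTree, with no DTD, so the attribute is treated as CDATA; the element and attribute names are made up for the test:

```python
import xml.etree.ElementTree as ET

# U+0020, U+0020, U+0302, U+0020 between two letters
payload = "x" + "\u0020\u0020\u0302\u0020" + "y"
doc = '<p note="{0}">{0}</p>'.format(payload)

elem = ET.fromstring(doc)
text_preserved = elem.text == payload          # element content
attr_preserved = elem.get("note") == payload   # CDATA-type attribute value
```

In this setup both the element content and the attribute value arrive with every space and the combining circumflex intact; what is then done with them is up to the application.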

Further it is completely true that some of the rules only affect elements
that only allow element content.

Post by Philippe Verdy
Strict conformance for the behavior of these whitespaces is mandatory and
cannot be bypassed or negotiated,

Well if a non-validating parser hasn't seen a declaration for an attribute
of type NMTOKENS it would treat it as being of type CDATA which would alter
how whitespace was treated. However that is mostly correct, it just isn't a
problem except if someone attempts to use the sequence {space, combining
char} in a name or nmtoken, which as I said would be a pretty bizarre design
decision anyway.
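
The case that would bite is the extra normalization step that XML 1.0 (section 3.3.3) applies to attributes of a declared type other than CDATA: leading and trailing U+0020 are discarded and runs of U+0020 are collapsed to one. A sketch of just that step in Python; normalize_tokenized is a hypothetical helper, not a parser API:

```python
import re

def normalize_tokenized(value):
    """Sketch of the extra step for non-CDATA attributes: collapse runs
    of U+0020 to a single space and drop leading/trailing U+0020."""
    return re.sub("\u0020+", "\u0020", value).strip("\u0020")

before = "\u0020\u0302\u0020abc"
after = normalize_tokenized(before)
print([f"U+{ord(c):04X}" for c in after])
```

The leading space is discarded, so the combining circumflex loses its base; this only happens for tokenized attribute types, not for element content or CDATA attributes.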

Post by Philippe Verdy
notably when XML data needs to be
certified against alteration, i.e. cryptographically signed (XML Signature
is now standardized), or when the DOM tree is used and altered in a
predictable way with technologies like XPath which needs to refer to
exact encoding position in the encoded Unicode NFC form of text elements,
attribute values, or CDATA sections.

Yep, Yep, Yep. Still doesn't mean there is any problem.

Philippe Verdy

2003-08-12 12:40:02 UTC

Post by Jon Hanna
Lots of different things happen that affect the whitespace of an XML
document (whether a DOM tree is constructed or not, since it isn't the only
legal way to process an XML document).

Of course one is not required to build an actual DOM tree; however XML, HTML
and the like are now defined in terms of the DOM, where the text/xml syntax is
just a serialization, which is the only place where whitespace
normalization is defined (such normalization does not occur at the DOM
level, and an XML document may be serialized with another concrete syntax
than the one assigned to the "text/xml" MIME type, registered and documented
by the W3C).

When processing XML documents, the DOM part is the most important feature
and it is logically separated from the concrete syntax used by text XML
parsers. The W3C defines very strict rules to ensure that the DOM-equivalent
data will be preserved, and whitespace normalization in XML documents
serialized as "text/xml" is mandatory, or it is not a valid "text/xml"
serialization.

Processing a "text/xml" document in a way that would be incompatible with
what a DOM tree builder would create is not conforming. If this is
different, then it is not XML but a derived language (for example HTML or
SGML which are using more "relaxed" syntaxes). In XML, whitespace
normalization can be overridden using very precise rules within the parser
only, but not in the resulting DOM tree, so it is important to understand
each step that goes from the concrete text/xml syntax to the DOM tree or
its equivalents (notably the successive steps required in parsed entities,
named entities, ...). No XML application is required to use the "text/xml"
MIME syntax, and such examples exist (for example the serialization
and compression formats used by WAP, MMS, Nec's i-Mode, and SOAP).

If an application does not build the DOM tree, it is still required to
perform namespace resolution and to resolve named entities according to the
standard "text/xml" MIME rules formulated by the W3C reference, including
all its facets, needed for interoperability of document properties
independently of the character encoding used in the serialized document, or
its syntactic representation. In my opinion, all XML-based languages should
now be defined in terms of their DOM structure, and the XML application should
be defined by a valid DTD, or better, by a now-standard XSD schema, that
can be processed by validating parsers (parsers that absolutely need to
create a DOM-like tree or flow of tokens with strictly defined properties,
value sets and behavior).

Without DOM interoperability, XML would be another imprecise language like
HTML, with very little reusability due to naming conflicts. This is the most
important benefit of XHTML (strictly based on XML) compared to HTML (4.x and
before) and SGML (all versions), notably when a schema is explicitly
specified for the document, and is loaded for validating purposes (some
schemas are normative, like XHTML, and cannot be changed by authors).

Jon Hanna

2003-08-12 13:59:48 UTC

Post by Philippe Verdy
Of course one is not required to build an actual DOM tree; however XML, HTML
and the like are now defined in terms of the DOM, where the text/xml syntax is
just a serialization, which is the only place where whitespace
normalization is defined (such normalization does not occur at the DOM
level, and an XML document may be serialized with another concrete syntax
than the one assigned to the "text/xml" MIME type, registered and documented
by the W3C).

No.

"XML documents are made up of storage units called entities, which contain
either parsed or unparsed data. Parsed data is made up of characters, some
of which form character data, and some of which form markup. Markup encodes
a description of the document's storage layout and logical structure. XML
provides a mechanism to impose constraints on the storage layout and logical
structure." (XML, Introduction. XML1.1 will not change that).

*XML applications* can be defined in terms of the DOM, but they can also be
defined in terms of the XML Information Set, XPath, by extending one of the
above, or through some other model (e.g. in terms of SAX events). Many
applications are defined in terms of the Information Set or XPath.

None of this actually matters here of course, because there is still no
problem with the use of space and NBSP with combining characters unless you
use that in names or nmtokens.
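
As an illustration of a non-DOM model, here is a minimal SAX sketch with Python's standard xml.sax module. The handler class TextCollector and the sample element are invented for the example; no tree is built at any point.

```python
import xml.sax

class TextCollector(xml.sax.ContentHandler):
    """Collects character data from SAX events, without building a tree."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # A parser may deliver text in several pieces, so just accumulate.
        self.chunks.append(content)

payload = "a\u0020\u0302\u0020b"   # a, SPACE + COMBINING CIRCUMFLEX, SPACE, b
handler = TextCollector()
xml.sax.parseString("<p>{}</p>".format(payload).encode("utf-8"), handler)
collected = "".join(handler.chunks)
```

The character data reported through the events is the same space + combining sequence that was in the document.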

Post by Philippe Verdy
and whitespace normalization in XML documents
serialized as "text/xml" is mandatory, or it is not a valid "text/xml"
serialization.

But it doesn't matter.

Post by Philippe Verdy
Processing a "text/xml" document in a way that would be incompatible with
what a DOM tree builder would create is not conforming.

Doesn't matter.

Post by Philippe Verdy
If this is
different, then it is not XML but a derived language (for example HTML or
SGML which are using more "relaxed" syntaxes).

XML is derived from SGML, not the other way around. Still doesn't matter.

Post by Philippe Verdy
If an application does not build the DOM tree, it is still required to
perform namespace resolution

Namespace resolution, do you mean complying with Namespaces in XML? XML
parsers aren't required to do that, and it still doesn't matter.

Post by Philippe Verdy
Without DOM interoperability, XML would be another imprecise language like
HTML,

HTML is pretty precise, most of the imprecision is quite possible in XML as
well. Comparing HTML with XML is a pretty fruitless exercise beyond "oh look
this one has point brackets as well".

Still doesn't matter.

Post by Philippe Verdy
with very little reusability due to naming conflicts.

Naming conflicts are perfectly possible with XML applications that don't use
Namespaces. Which they are perfectly within the spec in doing, and where
combining diacritics still don't matter.

Philippe Verdy

2003-08-12 16:00:43 UTC

Post by Philippe Verdy
If this is
different, then it is not XML but a derived language (for example HTML or
SGML which are using more "relaxed" syntaxes).

Post by Jon Hanna
XML is derived from SGML, not the other way around. Still doesn't matter.

I did not say that, although the sentence may lead you to think so. Of course
XML was born on the foundation of SGML and its HTML application, but
it now contains enough differences that it can no longer be considered an
application of SGML, as it is both a subset and a superset of SGML (XML
allows things forbidden in SGML, and forbids things that are completely
valid in SGML).

Additionally the DTD syntax profile used in XML is very limited compared to
SGML, and even this DTD syntax is not enough to represent in SGML XML
features like namespaces (in XML, namespace prefixes can be freely
substituted without requiring a new DTD, and are resolved as URIs
instead of being part of the element or attribute names). Naming
conventions in XML are based on two orthogonal dimensions, unlike in
HTML and SGML which just use a single namespace.

Finally DTDs are being deprecated in XML, because they cannot correctly
represent the semantics of allowed attributes or even the allowed
content models for schemas (so an XML document could validate against a DTD
yet fail if the schema were defined more precisely with an XSD
schema). Nearly all DTDs I have seen for XML, HTML and SGML contain
important comments that cannot be represented in a parsable way.

OK, I used the term DOM instead of InfoSet, but what I said was "DOM-like"
data representation (meaning InfoSet if this is what is used to
represent the document). I won't discuss the case of element names or
attribute names, which are by nature constrained by XML datatypes and do
not represent arbitrary Unicode text. But CDATA sections, attribute values
(in non-validating parsers), and anonymous text elements are where the
handling of initial/final whitespace, as well as sequences of whitespace,
causes problems. This is clearly NOT markup, but plain text data, which may
or may not be constrained by datatype facets, without even the need to
specify a special xml:space attribute in the markup of the document itself.

As validating documents against their definitions is an optional part of
a valid XML document, normalization of whitespace sequences occurs only
if the schema is known. In the case of standardized schemas, like XHTML,
it becomes mandatory, and there's no way to bypass this rule, as any
client could assume and load the corresponding schema and preprocess the
DOM-like data contained in the parsed document to create data which will
not expose unnormalized whitespaces. So the behavior of spaces must be
assumed by authors, who cannot predict whether the XML parser will validate
the parsed document or not. It is clearly not a rendering issue in fonts
or XSLT processors or stylesheets. I see absolutely no place where an XML
author can create a valid XML schema instance that will work with
parsers if the author wants to use SPACE+diacritics sequences in the
document. The only way to bypass this behavior safely is to use unparsed
entities to represent the leading SPACE, or the whole combining
sequence.

It is really a shame that there is no "XML-safe" base character in
Unicode to represent leading spacing diacritics in actual documents
(either in HTML, XML, SGML, or even in other rich-text formats,
including TeX, RTF, or proprietary text formats like MS-Doc, or PDF,
which already can and do use Unicode as their now preferred encoding).
Ignoring the huge number of applications that assign this role to
spaces is then a critical problem, as such rules cannot be changed
easily.

John Cowan

2003-08-12 14:05:12 UTC

Post by Philippe Verdy
Of course one is not required to build an actual DOM tree; however, XML, HTML
and the like are now defined in terms of the DOM, where the text/xml syntax is
just a serialization,

This is absolutely false. XML is defined by the XML Recommendation, which
is entirely syntactic. As a matter of convenience, many other XML
recommendations use the XML Infoset, which is by no means the same as the
DOM. The DOM is an abstract API for programmatic access to the content
of XML documents.

Post by Philippe Verdy
which is the only place where whitespace
normalization is defined (such normalization does not occur at the DOM
level), and an XML document may be serialized with another concrete syntax
than the one assigned to the "text/xml" MIME type, registered and documented
by the W3C.

"May" be, yes. You can serialize it in ASN.1 if you want to. That doesn't
make ASN.1 an instance of XML.

Post by Philippe Verdy
[W]hitespace normalization in XML documents
serialized as "text/xml" is mandatory, or it is not a valid "text/xml"
serialization.

Very true. But what is this whitespace normalization?

1) Throughout the document, line-end characters and sequences are normalized
to LF. Not relevant here.

2) In attribute values, LF, CR, and TAB characters are normalized to spaces.
Not relevant here.

3) In attribute values that have a declared type other than CDATA, multiple
spaces are compressed to a single space, and leading and trailing spaces
are removed. After this is done, there can be no spaces in attributes
of type ID, IDREF, ENTITY, NMTOKEN, NOTATION, or enumerated types.
In the types IDREFS and ENTITIES, spaces are used to separate
individual tokens, none of which may begin with a combining character.
   In the remaining type, NMTOKENS, individual tokens may begin
with a combining character, so it is possible that such a token, if
not the first in the attribute, will be rendered in a peculiar way,
with the combining character placed over the separating space.
But that is a mere rendering glitch and in no way affects anything.
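
The three rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name normalize_attribute_value and its declared_type parameter are hypothetical, entity and character-reference expansion is left out, and a real parser applies these steps while parsing rather than on an already extracted string.

```python
import re


def normalize_attribute_value(raw: str, declared_type: str = "CDATA") -> str:
    """Simplified sketch of XML 1.0 attribute-value normalization.

    Hypothetical helper for illustration; references are not expanded.
    """
    # Rule 1: line-end sequences (CR LF, lone CR) are normalized to LF.
    value = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rule 2: in attribute values, LF and TAB become SPACE
    # (any CR has already been turned into LF by rule 1).
    value = value.replace("\n", " ").replace("\t", " ")
    # Rule 3: for declared types other than CDATA, runs of SPACE collapse
    # to one SPACE, and leading and trailing SPACE characters are removed.
    if declared_type != "CDATA":
        value = re.sub(" +", " ", value).strip(" ")
    return value
```

For example, with declared type NMTOKENS the value "x", two spaces, U+0301, "y" comes out as "x", one space, U+0301, "y": the combining character still directly follows the separating space, which is the case discussed in rule 3.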

Post by Philippe Verdy
No XML application is required to use the "text/xml"
MIME syntax, and such examples exist (for example the serialization
and compression formats used by WAP, MMS, Nec's i-Mode, and SOAP).

That is not the definition of "XML application" given in the XML Recommendation,
which is the sole authority on the subject. You can invent your own
definitions if you like, but you need not expect to be listened to.

Post by Philippe Verdy
If an application does not build the DOM tree, it is still required to
perform namespace resolution

No XML application is required to perform "namespace resolution", whatever
that may be.

Post by Philippe Verdy
to solve named entities according to the
standard "text/xml" MIME rules formulated by the W3C reference,

Only certain named entities *must* be resolved: specifically, internal
entities that are defined in the internal subset.

Post by Philippe Verdy
In my opinion, all XML-based languages should
now be defined in terms of their DOM structure, and the XML application should
be defined by a valid DTD, or better, by a now-standard XSD schema that
can be processed by validating parsers (parsers that absolutely need to
create a DOM-like tree or flow of tokens with strictly defined properties,
value sets and behavior.)

In your *opinion*.

Post by Philippe Verdy
Without DOM interoperability, XML would be another imprecise language like
HTML, with very little reusability due to naming conflicts.

Nonsense.

*plonk*
--
There is / One art John Cowan <***@reutershealth.com>
No more / No less http://www.reutershealth.com
To do / All things http://www.ccil.org/~cowan
With art- / Lessness -- Piet Hein

Peter Kirk

2003-08-12 16:27:57 UTC

Post by John Cowan
Very true. But what is this whitespace normalization?
1) Throughout the document, line-end characters and sequences are normalized
to LF. Not relevant here.
2) In attribute values, LF, CR, and TAB characters are normalized to spaces.
Not relevant here.

This would be relevant if it is legal for the character after LF, CR,
or TAB to be a combining mark. Is this legal? In this case what was
previously a defective (but legal) combining sequence would turn into a
non-defective one, but the intended whitespace would be lost.
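
One way to probe this question is to feed such an attribute to a parser. The sketch below uses Python's standard xml.etree.ElementTree (backed by expat) as an assumed test bed; the element name w and the attribute name form are made up for illustration, and it shows only what this one parser does.

```python
import xml.etree.ElementTree as ET

COMBINING_ACUTE = "\u0301"  # U+0301 COMBINING ACUTE ACCENT

# The attribute value holds a literal TAB immediately followed by a
# combining mark (hypothetical element and attribute names).
doc = '<w form="a\t' + COMBINING_ACUTE + 'b"/>'

# Parsing succeeds only if the parser accepts this sequence as well-formed.
form = ET.fromstring(doc).get("form")

# True when the TAB was normalized to a SPACE that now precedes the mark.
tab_became_space = form == "a " + COMBINING_ACUTE + "b"
print(tab_became_space)
```

If the flag is true, the document was accepted and the TAB has been replaced by a SPACE that now serves as the base of the combining mark, which is the loss of the intended whitespace described above.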

Post by John Cowan
3) In attribute values that have a declared type other than CDATA, multiple
spaces are compressed to a single space, and leading and trailing spaces
are removed. After this is done, there can be no spaces in attributes
of type ID, IDREF, ENTITY, NMTOKEN, NOTATION, or enumerated types.
In the types IDREFS and ENTITIES, spaces are used to separate
individual tokens, none of which may begin with a combining character.
In the remaining type, NMTOKENS, individual tokens may begin
with a combining character, so it is possible that such a token, if
not the first in the attribute, will be rendered in a peculiar way,
with the combining character placed over the separating space.
But that is a mere rendering glitch and in no way affects anything.

Not just a rendering glitch, I suspect. If the combining character is
combined with the separating space, the space loses many of its
separating functions, and perhaps keeps a confusing subset of them with
all sorts of possibilities of error. At best tokens beginning with
combining characters will be unusable. At worst they will crash the
implementation (and count on someone trying deliberately to do that!).
The only safe thing to do is to specify that space followed by a
combining mark is NEVER considered to be a space and this combination is
NEVER generated.
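
A cautious implementation could at least flag such tokens before relying on them. A minimal sketch in Python, assuming that "combining character" means any character whose Unicode general category starts with M (Mn, Mc, Me); the function name tokens_starting_with_mark is hypothetical.

```python
import unicodedata


def tokens_starting_with_mark(value: str) -> list:
    """Return the space-separated tokens whose first character is a mark."""
    return [
        token
        for token in value.split(" ")
        # Skip empty strings produced by consecutive spaces, then check the
        # general category of the token's first character.
        if token and unicodedata.category(token[0]).startswith("M")
    ]
```

Applied to an already normalized NMTOKENS value, a non-empty result means at least one separating space is followed by a combining mark.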
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

Philippe Verdy

2003-08-12 00:51:40 UTC

It is perfectly reasonable, as I see it, to consider the
a. significant
b. part of the characters in a document that are not markup
(at least in the cases we are talking about, since the
problem is not about defining Nmtokens for markup in
Biblical Hebrew, but rather the representation of the
Biblical Hebrew document content itself)
So I *still* don't see the problem you are on about, and even
if there was one, the xml:space attribute could be used to
require preservation of a particular space.

Maybe you are forgetting that in XML and HTML, attributes
(including special attributes like "xml:space") can have default
values, and in fact they have such values set in DTDs or
schemas by normative XML applications like XHTML.
Authors are not supposed to modify normative schemas or DTDs,
and so use elements with their default attributes. This is the case
for XHTML as an application of XML, and HTML as an
application of SGML (neither HTML nor SGML parsers will
interpret the xml:space attribute, and XML parsers will handle it
only if they are validating documents against their DTD or schema).

Kenneth Whistler

2003-08-12 00:06:32 UTC