Post by Kent Karlsson
>>> I see no particular *technical* problem with using WJ, though. In
>>> contrast to the suggestion of using CGJ (re. another problem)
>>> anywhere else but at the end of a combining sequence. CGJ has
>>> combining class 0, despite being invisible and not ("visually")
>>> interfering with any other combining mark. Using CGJ at a
>>> non-final position in a combining sequence puts in doubt the
>>> entire idea with combining classes and normal forms.
>> Why?
> See above (I DID write the motivation!).

I guess that I did not (and still do not) see the motivation for
your final statement.
Post by Kent Karlsson
> Combining classes are generally assigned according to "typographic
> placement". Combining characters (except those that are really
> letters) that have the "same" placement and "interfere
> typographically" are assigned the same combining class, while those
> that don't get different classes, and the relative order is then
> considered unimportant (canonically equivalent). How is then, e.g.,
> <a, ring above, cgj, dot below> supposed to be different from
> <a, dot below, cgj, ring above> (supposing all involved characters
> are fully supported), when <a, ring above, dot below> is NOT
> supposed to be much different from <a, dot below, ring above>
> (them being canonically equivalent)? An invisible combining
> character does not interfere typographically with anything, it
> being invisible!

The same thing can be said about any inserted invisible character,
combining or not.
How is <a, ring above, NULL, dot below> supposed to be different from
<a, dot below, NULL, ring above>?
How is <a, ring above, LRM, dot below> supposed to be different from
<a, dot below, LRM, ring above>?
In display, they might not be distinct, unless you were doing some
kind of show-hidden display. Yet these sequences are not canonically
equivalent, and the presence of an embedded control character or an
embedded format control character would block canonical reordering.
Of course, they *might* be distinct in rendering, depending on
what assumptions the renderer makes about default ignorable
characters and their interaction with combining character sequences.
But you cannot depend on them being distinct in display -- the
standard doesn't mandate the particulars here.
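If you want to see that concretely, something like the following will
do -- Python's unicodedata is used here purely as a convenient
stand-in for any normalizer, and the particular characters are
arbitrary examples:

import unicodedata

RING = "\u030a"   # COMBINING RING ABOVE, ccc 230
DOT  = "\u0323"   # COMBINING DOT BELOW,  ccc 220
LRM  = "\u200e"   # LEFT-TO-RIGHT MARK,   ccc 0, not a combining character
CGJ  = "\u034f"   # COMBINING GRAPHEME JOINER, ccc 0

# Without an intervening ccc=0 character, the two orders are canonically
# equivalent: NFD reorders dot below (220) before ring above (230).
print(unicodedata.normalize("NFD", "a" + RING + DOT) ==
      unicodedata.normalize("NFD", "a" + DOT + RING))        # True

# With LRM (or CGJ, or any other ccc=0 character) in between, canonical
# reordering is blocked, so the sequences stay distinct under NFD/NFC.
print(unicodedata.normalize("NFD", "a" + RING + LRM + DOT) ==
      unicodedata.normalize("NFD", "a" + DOT + LRM + RING))  # False
print(unicodedata.normalize("NFD", "a" + RING + CGJ + DOT) ==
      unicodedata.normalize("NFD", "a" + DOT + CGJ + RING))  # False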
Whether you think it is *reasonable* or not that there should be
non-canonically equivalent ways of representing the same
visual display, sequences such as those above, including sequences
with CGJ, are possible and allowed by the standard. They are:
a. well-formed sequences, conformantly interpretable, and
b. liable to be displayed as visually identical by reasonable
   renderers making reasonable assumptions.
I have been pointing out that use of the CGJ, which *exists* as an
encoded character and which has a particular set of properties
defined, would result in the kinds of non-canonically equivalent
ordering distinctions required in Hebrew, if inserted into vowel
sequences.
Those are facts about the current standard, as currently
defined. And unless you or someone else convinces the UTC to
establish cooccurrence constraints on CGJ or to change its
properties, they will continue to be current facts about the
standard.
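To make the Hebrew case concrete, here is a minimal sketch; the
lamed-plus-two-vowels sequence is just an illustration, not a citation
of any particular Biblical form:

import unicodedata

LAMED = "\u05dc"   # HEBREW LETTER LAMED
HIRIQ = "\u05b4"   # HEBREW POINT HIRIQ, ccc 14
PATAH = "\u05b7"   # HEBREW POINT PATAH, ccc 17
CGJ   = "\u034f"   # COMBINING GRAPHEME JOINER, ccc 0

# The Hebrew points carry fixed-position combining classes, so the two
# vowel orders collapse to a single normalized form:
print(unicodedata.normalize("NFC", LAMED + PATAH + HIRIQ) ==
      unicodedata.normalize("NFC", LAMED + HIRIQ + PATAH))        # True

# Inserting CGJ (ccc 0) between the points blocks the reordering, so
# the two orders remain distinct -- exactly the kind of ordering
# distinction at issue for Biblical Hebrew.
print(unicodedata.normalize("NFC", LAMED + PATAH + CGJ + HIRIQ) ==
      unicodedata.normalize("NFC", LAMED + HIRIQ + CGJ + PATAH))  # False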
Post by Kent Karlsson
> The other invisible (per se!) combining characters with combining
> class 0, the variation selectors, are OK, since their *conforming*
> use is very highly constrained. Maybe I've been wrong, but I have
> taken CGJ as similarly constrained, as it was given a semantics only
> when followed by a base character (but now it seems to have no
> semantics at all).
There was no such constraint defined for CGJ. The current statement
about CGJ is merely that it should be ignored in language-sensitive
sorting and searching unless "it specifically occurs within
a tailored collation element mapping." There is no constraint
on what particular sequences involving CGJ could be tailored
that way, and hence no constraint on what particular sequences
CGJ might occur in, in Unicode plain text.
Post by Kent Karlsson
>> A combining character sequence is a base character followed
>> by any number of combining characters. There is no constraint
>> in that definition that the combining characters have to
>> have non-zero combining class.
> Well, you cannot *conformantly* place a VS anywhere in a combining
> sequence! Only certain combinations of base+VS are allowed in
> any given version of Unicode. (Breaking that does not make the
> combining sequence ill-formed, or illegal, but would make it
> non-conformant, just like using an unassigned code point.)
Actually, it is not non-conformant like using an unassigned
code point would be. The latter is directly subject to conformance
clause C6:
C6 A process shall not interpret an unassigned code point as an
abstract character.
The case for variation sequences is subtly different. Suppose
I encounter a variation sequence <X, VS1>, where X could be
any Unicode character. X itself is conformantly interpretable.
VS1 itself is conformantly interpretable. The constraints are
on the interpretation of the variation sequence itself. And
they consist of:
"Only the variation sequences specifically defined in the
file StandardizedVariants.txt in the Unicode Character
Database are sanctioned for standard use; in all other
cases the variation selector cannot change the visual
appearance of the preceding base character from what it
would have had in the absence of the variation selector."
In other words, you can drop VS1's to your heart's content into
plain text, but a conformant implementation should ignore all
of them, unless a) it is interpreting variation selectors, and
b) it encounters a particular sequence defined in
StandardizedVariants.txt.
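In pseudo-implementation terms, the logic looks roughly like the sketch
below. This is only an illustration: the helper name, and the idea that
you hand it a set of (base, selector) pairs obtained however you like
from StandardizedVariants.txt, are mine, not anything the standard
prescribes.

VARIATION_SELECTORS = (set(range(0xFE00, 0xFE10)) |      # VS1..VS16
                       set(range(0xE0100, 0xE01F0)))     # VS17..VS256

def interpretable_variation_selectors(text, sanctioned):
    """Yield (index, base, vs, is_interpretable) for each VS in text.

    A VS that does not form a sanctioned variation sequence with its
    preceding character is simply ignored on display: it must not
    change the appearance of the preceding base character.
    """
    for i, ch in enumerate(text):
        if ord(ch) in VARIATION_SELECTORS:
            base = text[i - 1] if i > 0 else None
            yield i, base, ch, (base, ch) in sanctioned

# Dropping VS1 after 'a' commits no foul; with an empty sanctioned set,
# every occurrence simply comes back as "not interpretable".
for hit in interpretable_variation_selectors("a\ufe00b\ufe00", set()):
    print(hit)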
The cooccurrence constraints on VS1's are constraints on the
*encoding committees* regarding what sequences they will or will
not allow into StandardizedVariants.txt (for various reasons):
"The base character in a variation sequence is never a combining
character or a decomposable character."
That means the UTC will never make such a variation sequence
interpretable by putting it into StandardizedVariants.txt.
*But*, a text user who drops a VS1 into Unicode plain text
after a combining character doesn't "commit a foul" thereby --
he has just put a character into a position where no conformant
implementation will do anything other than ignore it on display.
Post by Kent Karlsson
>> Canonical reordering is scoped to stop at combining class = 0.
> (I know it is. But I confess I'm not sure why.)

Because God, er...., um... Mark Davis created it that way. ;-)
Post by Kent Karlsson
>> It doesn't say that it applies to combining character sequences
>> per se. It applies to *decomposed* character sequences
>> (meaning, effectively, any sequence which has had the recursive
>> application of the decomposition mappings done).
> Yes, for the definition of normalisation. But not necessary for
> canonical equivalence. Your point?
Of course it is necessary for canonical equivalence:
D24  Canonical equivalent: Two character sequences are said to be
     canonical equivalents if their full canonical decompositions
     are identical.

D23  Canonical decomposition: The decomposition of a character that
     results from recursively applying the canonical mappings found
     in the names list of Section 16.1, Character Names List, and
     those described in Section 3.12, Conjoining Jamo Behavior, until
     no characters can be further decomposed, *and then reordering
     nonspacing marks according to Section 3.11, Canonical Ordering
     Behavior*.
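Or, reduced to code -- Python's unicodedata serves here only as a
stand-in for "full canonical decomposition", and the sample character
is arbitrary:

import unicodedata

def canonically_equivalent(s, t):
    # D24: canonical equivalents iff the full canonical decompositions
    # (i.e. NFD, which includes the reordering step) are identical.
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)

# D23 in action: U+1EC7 decomposes recursively via U+00EA to
# <e, U+0302 circumflex (ccc 230), U+0323 dot below (ccc 220)>, and the
# reordering step then places dot below before circumflex.
print([hex(ord(c)) for c in unicodedata.normalize("NFD", "\u1ec7")])
# -> ['0x65', '0x323', '0x302']

print(canonically_equivalent("\u1ec7", "e\u0302\u0323"))         # True
print(canonically_equivalent("e\u0302\u0323", "e\u0323\u0302"))  # True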
Post by Kent Karlsson
>> character sequence:   <1000, 1031, 102C, 1039, 200C>
>> combining?:              no   yes   yes   yes    no
>> combining classes:        0     0     0     9     0
>> comb char sequence:    ----------------------
>> canon reorder scope:   ---|  ---|  ---------|  ---|
>>
>> The combining character sequence here is: <1000, 1031, 102C, 1039>
>> The *syllable* consists of that plus the trailing ZWNJ.
>> But the relevant sequences for application of the
>> canonical reordering algorithm are each sequence starting
>> with combining class zero and continuing through any
>> sequence with combining class not zero.
> x y -> y x, if 0 < cc(y) < cc(x) (and apply that repeatedly);
> no need to define any "canonical reordering scope", though
> that may be marginally more efficient in an implementation
> of normalisation (but this is getting beside the topic of this
> discussion).
I'm talking about "scope" here generically. I realize that
the algorithm is based on pair-based swapping, and there is
no necessity to have a formally-defined scope. The point,
however, as you recognize, is that any character with
cc=0 will limit the scope within which any sequence of
pair swaps can operate.
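If it helps, here is that rule as a naive sketch -- pure pair swapping
over an already-decomposed string, nothing more, and certainly not an
optimized normalizer:

import unicodedata

def canonical_reorder(chars_in):
    """Apply x y -> y x whenever 0 < cc(y) < cc(x), until stable.

    A character with cc=0 never moves and is never moved across,
    which is what bounds the "scope" being discussed here.
    """
    chars = list(chars_in)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(chars) - 1):
            ccx = unicodedata.combining(chars[i])
            ccy = unicodedata.combining(chars[i + 1])
            if 0 < ccy < ccx:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                swapped = True
    return "".join(chars)

# The Myanmar sequence from the table above: only U+1039 has cc != 0,
# so nothing can move and the sequence is already in canonical order.
seq = "\u1000\u1031\u102c\u1039\u200c"
print(canonical_reorder(seq) == seq)   # True

# By contrast, <a, ring above (230), dot below (220)> does get reordered.
print([hex(ord(c)) for c in canonical_reorder("a\u030a\u0323")])
# -> ['0x61', '0x323', '0x30a']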
Post by Kent Karlsson
>> I don't see how introduction of CGJ into such sequences calls
>> any of the definitions or algorithms into question.
> No, not the algorithm, but the basic idea and design. The algorithm
> as such has no "idea" how or why the combining class numbers
> were assigned. But we humans do, or might have.

True.
Post by Kent Karlsson
> Again, why should not <a, ring above, cgj, dot below> be canonically
> equivalent to <a, dot below, cgj, ring above>, when <a, ring above,
> dot below> is canonically equivalent to <a, dot below, ring above>?
> And I want a design answer, not a formal answer! (The latter I
> already know, and it is uninteresting.)
The formal answer is the true and interesting answer!
It shouldn't be canonically equivalent because it *isn't*
canonically equivalent.
But instead of obsessing about the particular case of the CGJ,
admit that the same shenanigans can apply to any number of
default ignorable characters which will not result in visually
distinct renderings under normal assumptions about rendering.
I'm detecting a deeper concern here -- that such a situation
should not be allowed in the standard at all, as a matter
of design and architecture. But as a matter of practicality,
given the complexity of text representation needs in the
Unicode Standard, I don't think you can legislate these kinds
of edge cases away entirely.
Post by Kent Karlsson
> Since I think <a, ring above, cgj, dot below> should be canonically
> equivalent to <a, dot below, cgj, ring above>, but cannot be made
> so (now), the only ways out seem to be to either formally deprecate
> CGJ, or at least confine it to very specific uses. Other occurrences
> would not be ill-formed or illegal, but would then be non-conforming.
And I disagree with you, obviously. It should neither be
deprecated nor constrained from use where it may helpfully
solve a problem of text representation (in Biblical Hebrew).
--Ken