Bangla: [ZWJ], [VIRAMA] and CV sequences

Post by Gautam Sengupta
Is there any reason (apart from trying to be
ISCII-conformant) why the Bangla word /ki/ "what"
cannot be encoded as [KA][ZWJ][I]? Do we really need
combining forms of vowels to encode Indian scripts?

I don't know what the original motivations were, but one thing about the
current (ISCII-based) encoding scheme that appeals to me is that on average
it requires fewer characters than other more natural schemes. Bangla has a
high percentage of 'vowel signs', each of which would require two characters
in your scheme as opposed to one in the current one.

Post by Gautam Sengupta
Also, why not use [CONS][ZWJ][CONS] instead of
[CONS][VIRAMA][CONS]? One could then use [VIRAMA] only
where it is explicit/visible.

But this would not reflect the fact that the *glyph* [CONS][ZWJ][CONS] is
actually the same thing as the *sequence of characters* [CONS][VIRAMA][CONS],
i.e., [CONS][VIRAMA][ZWNJ][CONS] is also a perfectly legitimate
representation. This latter decision is one that should be taken (normally)
by the rendering mechanism (loosely speaking, the font), not the author.

Deepayan

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Gautam Sengupta

2003-10-08 02:44:32 UTC

Post by Deepayan Sarkar
I don't know what the original motivations were, but
one thing about the
current (ISCII-based) encoding scheme that appeals
to me is that on average
it requires fewer characters than other more natural
schemes. Bangla has a
high percentage of 'vowel signs', each of which
would require two characters
in your scheme as opposed to one in the current one.

There is a trade-off here between file size and the
number of code points used. File size could be further
reduced, for example, if combining forms of consonants
were introduced. But that would be a step in the wrong
direction for various reasons that I will not discuss
here. I am not sure that the right thing to do is to
economize on file size rather than code points.

Post by Gautam Sengupta
Also, why not use [CONS][ZWJ][CONS] instead of
[CONS][VIRAMA][CONS]? One could then use [VIRAMA]

only

Post by Gautam Sengupta
where it is explicit/visible.

But this would not reflect the fact that the *glyph*
[CONS][ZWJ][CONS] is
actually the same thing as the *sequence of
characters* [CONS][VIRAMA][CONS],

But, it is not, certainly not in writing; and that's
the whole point. [CONS][ZWJ][CONS] and
[CONS][(EXPLICIT)VIRAMA][CONS] are "identical" at a
level of linguistics abstraction that need not be
reflected in text encoding. Consider [C][L] and
[C][L][VIRAMA]. They represent the same words, they
are the "same" at some level of representation, but
that is irrelevant for the task at hand.

Post by Deepayan Sarkar
This latter decision is one that should be taken
(normally) by the rendering mechanism (loosely
speaking, the font), not the author.

I disagree. If an author chooses to write a word with
an explicit virama, you have to respect that and let
it be reflected in the encoding. Leaving such
decisions to the rendering engine would destroy the
character and flavor of certain texts. Furthermore
there are metalinguistic uses of the explicit virama
that need to be kept distinct from forms with
conjoined characters.

Thanks Deepayan for your feedback. -Gautam

__________________________________
Do you Yahoo!?
The New Yahoo! Shopping - with improved product search
http://shopping.yahoo.com

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Deepayan Sarkar

2003-10-08 05:58:48 UTC

That's a matter of opinion, and as I said, I don't know the motivations of the
original designers. In any case, I wouldn't dwell too much on this for 2
reasons. First, it's very unlikely that you are going to be able to influence
people enough to induce changes at such a fundamental level (especially at
this late stage when there are already fully functional rendering
implementations based on the current scheme). Second, why does it matter what
one particular encoding scheme does ? If you think something is better, use
it, along with some mechanism for converting from unicode to your scheme and
vice versa. Of course, this assumes that it is possible to represent all
reasonable features of Bengali in Unicode, which it should be. If you think
there's something that's not possible, I believe there's a formal mechanism
via which you can submit requests/proposals to the Unicode consortium.

Post by Gautam Sengupta
Also, why not use [CONS][ZWJ][CONS] instead of
[CONS][VIRAMA][CONS]? One could then use [VIRAMA]
only where it is explicit/visible.

But this would not reflect the fact that the *glyph*
[CONS][ZWJ][CONS] is
actually the same thing as the *sequence of
characters* [CONS][VIRAMA][CONS],

What exactly are [C] and [L] here ?

Post by Deepayan Sarkar
This latter decision is one that should be taken
(normally) by the rendering mechanism (loosely
speaking, the font), not the author.

I did qualify my statement by saying that this should be the normal behaviour.
An author would usually not bother about whether her 'da + ukaar' or 'sa +
yaphala' is written by a distinct ligated glyph. The explicit virama-s you
mention are definitely a common feature where the author's control is
important, and that's what ZWNJ is for.

As for the 'flavor' of the texts you mention, if you are talking about visual
appearance, then that's the purpose of the font you are using. You will have
a valid point if you can show an example where there's some text that you
cannot reproduce with (1) unicode + (2) a properly implemented renderer + (3)
a properly implemented font. Do you have any such example ?

Deepayan

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Gautam Sengupta

2003-10-08 07:55:39 UTC

On Tuesday 07 October 2003 21:44, Gautam Sengupta

Post by Deepayan Sarkar
I don't know what the original motivations were,

but

Post by Deepayan Sarkar
one thing about the
current (ISCII-based) encoding scheme that

appeals

Post by Deepayan Sarkar
to me is that on average
it requires fewer characters than other more

natural

Post by Deepayan Sarkar
schemes. Bangla has a
high percentage of 'vowel signs', each of which
would require two characters
in your scheme as opposed to one in the current

one.

Post by Gautam Sengupta
There is a trade-off here between file size and

the

Post by Gautam Sengupta
number of code points used. File size could be

further

Post by Gautam Sengupta
reduced, for example, if combining forms of

consonants

Post by Gautam Sengupta
were introduced. But that would be a step in the

wrong

Post by Gautam Sengupta
direction for various reasons that I will not

discuss

Post by Gautam Sengupta
here. I am not sure that the right thing to do is

Post by Gautam Sengupta
economize on file size rather than code points.

That's a matter of opinion, and as I said, I don't know the motivations of the
original designers.

No, it's also a matter of uniformity and elegance. If consonant clusters are [C][ZWJ or VIRAMA][C], there's no reason why CV clusters shouldn't be treated the same way.

In any case, I wouldn't dwell too much on this for 2 reasons. First, it's very unlikely
that you are going to be able to influence people enough to induce changes at such a
fundamental level (especially at this late stage when there are already fully
functional rendering implementations based on the current scheme).

I agree with you on this. But we have an obligation to explore and figure out the alternatives that would have been better for Bangla and other Indian scripts had they been proposed and accepted, if only for the sake of a better understanding of our scripts.

Second, why does it matter what one particular encoding scheme does?
If you think something is better, use it, along with some mechanism for
converting from unicode to your scheme and vice versa.

That sound like a very ad hoc solution. We should be able to do better than that.

Of course, this assumes that it is possible to represent all reasonable features of
Bengali in Unicode, which it should be. If you think there's something that's not
possible, I believe there's a formal mechanism via which you can submit
requests/proposals to the Unicode consortium.

It's not just a matter of whether all the relevant features of a script can be encoded using a particular mechanism. If there are more than one such encoding, we have to choose between them, and our choice will have to be guided by considerations of economy, uniformity and elegance. In the present scheme there is no elegant solution to the problem of encoding [Vowel-A][J-PHOLA][AA-kar]. You have to do something ad hoc. On my scheme you'd exactly what you do elsewhere, namely, [Vowel-A][ZWJ][Y][ZWJ][AA].

Post by Gautam Sengupta
Also, why not use [CONS][ZWJ][CONS] instead of
[CONS][VIRAMA][CONS]? One could then use

[VIRAMA]

Post by Gautam Sengupta
only where it is explicit/visible.

But this would not reflect the fact that the

*glyph*

Post by Deepayan Sarkar
[CONS][ZWJ][CONS] is
actually the same thing as the *sequence of
characters* [CONS][VIRAMA][CONS],

But, it is not, certainly not in writing; and

that's

Post by Gautam Sengupta
the whole point. [CONS][ZWJ][CONS] and
[CONS][(EXPLICIT)VIRAMA][CONS] are "identical" at

Post by Gautam Sengupta
level of linguistics abstraction that need not be
reflected in text encoding. Consider [C][L] and
[C][L][VIRAMA]. They represent the same words,

they

Post by Gautam Sengupta
are the "same" at some level of representation,

but

Post by Gautam Sengupta
that is irrelevant for the task at hand.

What exactly are [C] and [L] here ?

The letters [CA] and [LA], as in Bangla /cOl/ "come!" which can be written both with and without a final [VIRAMA]. The author's choice in this matter has to be respected.

Post by Deepayan Sarkar
This latter decision is one that should be taken
(normally) by the rendering mechanism (loosely
speaking, the font), not the author.

I disagree. If an author chooses to write a word

with

Post by Gautam Sengupta
an explicit virama, you have to respect that and

let

Post by Gautam Sengupta
it be reflected in the encoding. Leaving such
decisions to the rendering engine would destroy

the

Post by Gautam Sengupta
character and flavor of certain texts. Furthermore
there are metalinguistic uses of the explicit

virama

Post by Gautam Sengupta
that need to be kept distinct from forms with
conjoined characters.

But the encoding that uses [ZWNJ] to encode an explicit [VIRAMA] is much less intuitive that the one I am suggesting. The [ZWNJ] in the latter encoding merely acts as a flag to alert us that something very ad hoc is going on here! Our encodings should not only be adequate for the job of representing written texts, they should also *mean* something to us.

As for the 'flavor' of the texts you mention, if you are talking about visual
appearance, then that's the purpose of the font you are using.

No, I am NOT talking about visual appearance. I am talking about writing a word with an explicit virama vs. writing it with a conjoined character. Recall /choToder pattaRi/ in Jugaantar.

You will have a valid point if you can show an example where
there's some text that you cannot reproduce with (1) unicode + (2) a properly
implemented renderer + (3) a properly implemented font. Do you have any such
example ?

No, this is going back to the claim that all is well as long as everything can be given an unambiguous representation, no matter how ad hoc or counterintuitive. This approach has already done a lot of harm to the system: look at the placement of diirgha RI and LI or even the Assamese RA and VA (a revision for the latter has been suggested and appears to be entirely on the right track), or even the proposal to assign a code point to KSH in Bangla, Hindi etc. (The fact that sorting/collation is often language-specific should not be misused to justify random assignment of code points to characters) or encodings to character strings. Compare, for example, [Vowel-A][ZWJ][Y][ZWJ][AA] in my scheme of encoding with the equivalent one in Unicode.

Best, Gautam

---------------------------------
Do you Yahoo!?
The New Yahoo! Shopping - with improved product search

Gautam Sengupta

2003-10-08 07:56:29 UTC

On Tuesday 07 October 2003 21:44, Gautam Sengupta

Post by Deepayan Sarkar
I don't know what the original motivations were,

but

Post by Deepayan Sarkar
one thing about the
current (ISCII-based) encoding scheme that

appeals

Post by Deepayan Sarkar
to me is that on average
it requires fewer characters than other more

natural

Post by Deepayan Sarkar
schemes. Bangla has a
high percentage of 'vowel signs', each of which
would require two characters
in your scheme as opposed to one in the current

one.

Post by Gautam Sengupta
There is a trade-off here between file size and

the

Post by Gautam Sengupta
number of code points used. File size could be

further

Post by Gautam Sengupta
reduced, for example, if combining forms of

consonants

Post by Gautam Sengupta
were introduced. But that would be a step in the

wrong

Post by Gautam Sengupta
direction for various reasons that I will not

discuss

Post by Gautam Sengupta
here. I am not sure that the right thing to do is

Post by Gautam Sengupta
economize on file size rather than code points.

That's a matter of opinion, and as I said, I don't know the motivations of the
original designers.

No, it's also a matter of uniformity and elegance. If consonant clusters are [C][ZWJ or VIRAMA][C], there's no reason why CV clusters shouldn't be treated the same way.

Second, why does it matter what one particular encoding scheme does?
If you think something is better, use it, along with some mechanism for
converting from unicode to your scheme and vice versa.

That sound like a very ad hoc solution. We should be able to do better than that.

Post by Gautam Sengupta
Also, why not use [CONS][ZWJ][CONS] instead of
[CONS][VIRAMA][CONS]? One could then use

[VIRAMA]

Post by Gautam Sengupta
only where it is explicit/visible.

But this would not reflect the fact that the

*glyph*

Post by Deepayan Sarkar
[CONS][ZWJ][CONS] is
actually the same thing as the *sequence of
characters* [CONS][VIRAMA][CONS],

But, it is not, certainly not in writing; and

that's

Post by Gautam Sengupta
the whole point. [CONS][ZWJ][CONS] and
[CONS][(EXPLICIT)VIRAMA][CONS] are "identical" at

Post by Gautam Sengupta
level of linguistics abstraction that need not be
reflected in text encoding. Consider [C][L] and
[C][L][VIRAMA]. They represent the same words,

they

Post by Gautam Sengupta
are the "same" at some level of representation,

but

Post by Gautam Sengupta
that is irrelevant for the task at hand.

What exactly are [C] and [L] here ?

The letters [CA] and [LA], as in Bangla /cOl/ "come!" which can be written both with and without a final [VIRAMA]. The author's choice in this matter has to be respected.

Post by Deepayan Sarkar
This latter decision is one that should be taken
(normally) by the rendering mechanism (loosely
speaking, the font), not the author.

I disagree. If an author chooses to write a word

with

Post by Gautam Sengupta
an explicit virama, you have to respect that and

let

Post by Gautam Sengupta
it be reflected in the encoding. Leaving such
decisions to the rendering engine would destroy

the

Post by Gautam Sengupta
character and flavor of certain texts. Furthermore
there are metalinguistic uses of the explicit

virama

Post by Gautam Sengupta
that need to be kept distinct from forms with
conjoined characters.

As for the 'flavor' of the texts you mention, if you are talking about visual
appearance, then that's the purpose of the font you are using.

No, I am NOT talking about visual appearance. I am talking about writing a word with an explicit virama vs. writing it with a conjoined character. Recall /choToder pattaRi/ in Jugaantar.

No, this is going back to the claim that all is well as long as everything can be given an unambiguous representation, no matter how ad hoc or counterintuitive. This approach has already done a lot of harm to the system: look at the placement of diirgha RI and LI or even the Assamese RA and VA (a revision for the latter has been suggested and appears to be entirely on the right track), or even the proposal to assign a code point to KSH in Bangla, Hindi etc. (The fact that sorting/collation is often language-specific should not be misused to justify random assignment of code points to characters) or encodings to character strings. Compare, for example, [Vowel-A][ZWJ][Y][ZWJ][AA] in my scheme of encoding with the equivalent one in Unicode.

Best, Gautam

Gautam Sengupta
Professor of Applied Linguistics
Director, School of Linguistics & Language Technology
Jadavpur University
Kolkata 700 032, INDIA
Email: ***@icqmail.com

---------------------------------
Do you Yahoo!?
The New Yahoo! Shopping - with improved product search

Gautam Sengupta

2003-10-08 08:05:36 UTC

On Tuesday 07 October 2003 21:44, Gautam Sengupta

Post by Deepayan Sarkar
I don't know what the original motivations were,

but

Post by Deepayan Sarkar
one thing about the
current (ISCII-based) encoding scheme that

appeals

Post by Deepayan Sarkar
to me is that on average
it requires fewer characters than other more

natural

Post by Deepayan Sarkar
schemes. Bangla has a
high percentage of 'vowel signs', each of which
would require two characters
in your scheme as opposed to one in the current

one.

Post by Gautam Sengupta
There is a trade-off here between file size and

the

Post by Gautam Sengupta
number of code points used. File size could be

further

Post by Gautam Sengupta
reduced, for example, if combining forms of

consonants

Post by Gautam Sengupta
were introduced. But that would be a step in the

wrong

Post by Gautam Sengupta
direction for various reasons that I will not

discuss

Post by Gautam Sengupta
here. I am not sure that the right thing to do is

Post by Gautam Sengupta
economize on file size rather than code points.

That's a matter of opinion, and as I said, I don't know the motivations of the
original designers.

No, it's also a matter of uniformity and elegance. If consonant clusters are [C][ZWJ or VIRAMA][C], there's no reason why CV clusters shouldn't be treated the same way.

Second, why does it matter what one particular encoding scheme does?
If you think something is better, use it, along with some mechanism for
converting from unicode to your scheme and vice versa.

That sound like a very ad hoc solution. We should be able to do better than that.

Post by Gautam Sengupta
Also, why not use [CONS][ZWJ][CONS] instead of
[CONS][VIRAMA][CONS]? One could then use

[VIRAMA]

Post by Gautam Sengupta
only where it is explicit/visible.

But this would not reflect the fact that the

*glyph*

Post by Deepayan Sarkar
[CONS][ZWJ][CONS] is
actually the same thing as the *sequence of
characters* [CONS][VIRAMA][CONS],

But, it is not, certainly not in writing; and

that's

Post by Gautam Sengupta
the whole point. [CONS][ZWJ][CONS] and
[CONS][(EXPLICIT)VIRAMA][CONS] are "identical" at

Post by Gautam Sengupta
level of linguistics abstraction that need not be
reflected in text encoding. Consider [C][L] and
[C][L][VIRAMA]. They represent the same words,

they

Post by Gautam Sengupta
are the "same" at some level of representation,

but

Post by Gautam Sengupta
that is irrelevant for the task at hand.

What exactly are [C] and [L] here ?

The letters [CA] and [LA], as in Bangla /cOl/ "come!" which can be written both with and without a final [VIRAMA]. The author's choice in this matter has to be respected.

Post by Deepayan Sarkar
This latter decision is one that should be taken
(normally) by the rendering mechanism (loosely
speaking, the font), not the author.

I disagree. If an author chooses to write a word

with

Post by Gautam Sengupta
an explicit virama, you have to respect that and

let

Post by Gautam Sengupta
it be reflected in the encoding. Leaving such
decisions to the rendering engine would destroy

the

Post by Gautam Sengupta
character and flavor of certain texts. Furthermore
there are metalinguistic uses of the explicit

virama

Post by Gautam Sengupta
that need to be kept distinct from forms with
conjoined characters.

As for the 'flavor' of the texts you mention, if you are talking about visual
appearance, then that's the purpose of the font you are using.

No, I am NOT talking about visual appearance. I am talking about writing a word with an explicit virama vs. writing it with a conjoined character. Recall /choToder pattaRi/ in Jugaantar.

Christopher John Fynn

2003-10-08 13:49:32 UTC

Post by Deepayan Sarkar
But this would not reflect the fact that the *glyph* [CONS][ZWJ][CONS] is
actually the same thing as the *sequence of characters*

[CONS][VIRAMA][CONS],

Post by Deepayan Sarkar
i.e., [CONS][VIRAMA][ZWNJ][CONS] is also a perfectly legitimate
representation.

As I understand it, [CONS][VIRAMA][VIRAMA][CONS] is the correct way of
forcing a virama to be displayed rather than a ligature - not
[CONS][VIRAMA][ZWNJ][CONS]

- Chris

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Gautam Sengupta

2003-10-08 16:39:34 UTC

Post by Christopher John Fynn
As I understand it, [CONS][VIRAMA][VIRAMA][CONS]
is the correct way of
forcing a virama to be displayed rather than a
ligature - not
[CONS][VIRAMA][ZWNJ][CONS]

This is certainly true of ISCII, but I think Unicode
uses [CONS][VIRAMA][ZWNJ][CONS]. -Gautam

__________________________________
Do you Yahoo!?
The New Yahoo! Shopping - with improved product search
http://shopping.yahoo.com

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Christopher John Fynn

2003-10-08 21:24:24 UTC

Post by Christopher John Fynn
As I understand it, [CONS][VIRAMA][VIRAMA][CONS]
is the correct way of
forcing a virama to be displayed rather than a
ligature - not
[CONS][VIRAMA][ZWNJ][CONS]

This is certainly true of ISCII, but I think Unicode
uses [CONS][VIRAMA][ZWNJ][CONS]. -Gautam

My mistake. You're right.

- Chris

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Marco Cimarosti

2003-10-08 09:58:11 UTC

Perhaps you are right that it *would* have been a cleaner design to have
only one set of vowel.

But notice that <KA><+> is one character longer that <KA><+I>. Maybe
storage space is not a big problem these days, but still it makes 2 to 4
extra bytes for each consonant not followed by the inherent vowel /a/.

Perhaps it *would* have been better to have only the combining vowels, and
to form independent vowels with a "mute consonant" (actually, the
independent vowel "a").

Post by Gautam Sengupta
Also, why not use [CONS][ZWJ][CONS] instead of
[CONS][VIRAMA][CONS]? One could then use [VIRAMA] only
where it is explicit/visible.

OK. But what happens when the font does not have a glyph for the ligature
<cons><ZWJ><cons>, nor for the half consonant <cons><ZWJ>, nor for the
subjoined consonant <ZWJ><cons>?

As <ZWJ>, per se, is an invisible character, what happens is that your
string displays as <cons><cons>, which is clearly semantically incorrect. If
you want the explicit virama to be visible, you need to encode it as
<cons><VIRAMA><cons>.

And this means that you (the author of the text) are forced to chose between
<ZWJ> and <VIRAMA> based on the availability of glyphs in the *particular*
font that you are using while typing. And this is a big no no no, because it
would impede you to change the font without re-typing part of the text.

What happens with the current Unicode scheme is that, if the font does not
have a glyph for the ligature <cons><VIRAMA><cons>, nor for the half
consonant <cons><VIRAMA>, nor for the subjoined consonant <VIRAMA><cons>,
the virama is *automatically* displayed visibly, so that the semantics of
the text is always safe, even if rendered with the most stupid of fonts.

Post by Gautam Sengupta
Surely, [A/E][ZWJ][Y][ZWJ][AA] is more "natural" and
intuitively acceptable than any encoding in which a
vowel is followed by a [VIRAMA]?

Maybe. But I see no reason why being natural or intuitive should be seen as
key feature for an encoding system. That might be the case for an encoding
system designed to be used by humans, but Unicode is designed to be used by
computers, so I don't see the problem.

I assume that in a well designed Bengali input method, yaphala would be a
key on its own, so, by the point of view of the user, it is just a
"character": they don't need to know that when they press that key the
sequence of codes <VIRAMA><YA> will actually be inserted, so they won't
notice the apparent nonsense of the sequence <vowel><VIRAMA> and, as we say
in Italy, "If eye doesn't see, heart doesn't hurt".

_ Marco

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Peter Kirk

2003-10-08 10:54:44 UTC

Post by Marco Cimarosti
...
What happens with the current Unicode scheme is that, if the font does not
have a glyph for the ligature <cons><VIRAMA><cons>, nor for the half
consonant <cons><VIRAMA>, nor for the subjoined consonant <VIRAMA><cons>,
the virama is *automatically* displayed visibly, so that the semantics of
the text is always safe, even if rendered with the most stupid of fonts.

I don't understand the specific issues here... But it does seem a rather
strange design principle that we should expect a text to be displayed
meaningfully even when the font lacks the glyphs required for proper
display. I would have thought it better not to attempt to display
properly, perhaps display boxes as an indication of an error or trigger
substitution by a font which does have the glyphs. After all, presumably
those who write Bangla regularly will use a font which does have the
necessary glyphs, and those who write it occasionally should be warned
to find and change to such a font rather than misled into thinking
things are OK.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

John Cowan

2003-10-08 12:04:31 UTC

The key term is "necessary". In the Indic scripts, it is a principle
that any instance of a consonant with VIRAMA (the rough equivalent of
schwa quiescens) followed in the same word by another consonant may
be replaced by a ligature of those consonants. However, no ligature
is actually mandatory, and which ligatures are customary depends on the
particular script, the particular language (it's common for some languages
to be written in more than one Indic script), and the particular time
and place of writing. Using fewer ligatures than custom dictates makes
the text look crude, but using too many may render it utterly illegible,
since unfamiliar ligatures are often not recognizable at sight.

Therefore, the Unicode Standard does not encode any Indic ligature,
though it does specify general methods (involving ZWJ and ZWNJ) for
requesting partial or complete ligatures or for prohibiting ligaturing
(and using the explicit VIRAMA appropriate to the script). It is indeed
very much a matter of the font, therefore, which ligatures are possible
and which are not possible.

Disclaimer: I'm no expert on this. These remarks don't apply in their
full generality to Tibetan, and aren't applicable at all to Thai or Lao.
--
One art / There is John Cowan <***@reutershealth.com>
No less / No more http://www.reutershealth.com
All things / To do http://www.ccil.org/~cowan
With sparks / Galore -- Douglas Hofstadter

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Christopher John Fynn

2003-10-08 14:41:41 UTC

----- Original Message -----
From: "Peter Kirk" <***@qaya.org>
To: "Marco Cimarosti" <***@essetre.it>
Cc: <***@unicode.org>
Sent: Wednesday, October 08, 2003 11:54 AM
Subject: Re: Bangla: [ZWJ], [VIRAMA] and CV sequences

Post by Marco Cimarosti
What happens with the current Unicode scheme is that, if the font does not
have a glyph for the ligature <cons><VIRAMA><cons>, nor for the half
consonant <cons><VIRAMA>, nor for the subjoined consonant

<VIRAMA><cons>,

Post by Marco Cimarosti
the virama is *automatically* displayed visibly, so that the semantics of
the text is always safe, even if rendered with the most stupid of fonts.

Yes

Post by Peter Kirk
I don't understand the specific issues here... But it does seem a rather
strange design principle that we should expect a text to be displayed
meaningfully even when the font lacks the glyphs required for proper
display. I would have thought it better not to attempt to display
properly, perhaps display boxes as an indication of an error or trigger
substitution by a font which does have the glyphs. After all, presumably
those who write Bangla regularly will use a font which does have the
necessary glyphs, and those who write it occasionally should be warned
to find and change to such a font rather than misled into thinking
things are OK.

Simplistically put, every Indic consonant usually has an inherent vowel
"A". When this vowel is not wanted the consonant is usually written as a
ligature joined (often in half form) with the following consonant. Another
way of removing the inherent vowel is to write a virama (halant) under it.
(Both forms are readable but the first is usually the preferable & expected
form.

In old handwritten orthography a large number of ligatures were used. With
metal type some typefaces lacked type (precomposed glyphs) for less
frequent combinations. This was worked around by printing consonant +
virama consonant in place of the ligature.

So a <consonant virama consonant> (where the virama is displayed below the
first consonant) is equivalent to a ligature of the two consonants (-
though writing the virama is usually not good typography.)

In Unicode virama (094D) is used between two consonants to indicate that
they should be displayed as a ligature. If the font does not have a glyph
for the ligature then a virama should be displayed under the first
consonant (to indicate the inherent vowel is killed).

If a ligature glyph for the two consonants is available and displayed then
the virama glyph is not displayed. So in effect the virama character
functions as a kind of ZWJ between two consonants but if, due to font
limitations*, a joined ligature cannot be displayed then the virama should
be displayed under the preceding character.

If a user wants to force a virama to be displayed (and prevent the ligature
form of the two consonants) then she can enter two virama characters after
the first consonant (and a glyph for *one* of these should be displayed
under the first consonant in the pair).

A virama typed after a consonant with no following consonant should always
be displayed.

It should be noted that there are combinations (ligatures) of several
consonants - to form these, a virama character would have to be entered
after each consonant character in the combination except the final one.

===

The model used for encoding Tibetan is different - two sets of consonants
were encoded. The first set (0F40 -0F6A) is used for isolated consonants
and for the first consonant in any combination; and the second set
(0F90-0FBC) explicitly combine with the preceding consonant. So the Tibetan
virama (0F84) is not needed as a joiner character and when it occurs
should always be displayed as a combining glyph. In the Tibetan encoding
isolated forms of vowels are also unnecessary.

- Chris

* e.g. in a pan-Unicode font like Arial Unicode or Code 2000 it would
probably not be practical to support all the ligatures for all the Indic
scripts. In such cases if the Virama is displayed the text is still
readable.

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Peter Kirk

2003-10-08 16:20:30 UTC

Post by Christopher John Fynn
----- Original Message -----
Sent: Wednesday, October 08, 2003 11:54 AM
Subject: Re: Bangla: [ZWJ], [VIRAMA] and CV sequences
...
Simplistically put, every Indic consonant usually has an inherent vowel
"A". When this vowel is not wanted the consonant is usually written as a
ligature joined (often in half form) with the following consonant. Another
way of removing the inherent vowel is to write a virama (halant) under it.
(Both forms are readable but the first is usually the preferable & expected
form.

...

Thank you, also to John and Marco. I understand the basic issue now.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Gautam Sengupta

2003-10-08 14:02:35 UTC

Post by Marco Cimarosti

Post by Gautam Sengupta
Also, why not use [CONS][ZWJ][CONS] instead of
[CONS][VIRAMA][CONS]? One could then use [VIRAMA]
only where it is explicit/visible.

OK. But what happens when the font does not have a
glyph for the ligature <cons><ZWJ><cons>, nor for
the half consonant <cons><ZWJ>, nor for the
subjoined consonant <ZWJ><cons>?
As <ZWJ>, per se, is an invisible character, what
happens is that your
string displays as <cons><cons>, which is clearly
semantically incorrect. If
you want the explicit virama to be visible, you need
to encode it as
<cons><VIRAMA><cons>.
And this means that you (the author of the text) are
forced to chose between
<ZWJ> and <VIRAMA> based on the availability of
glyphs in the *particular*
font that you are using while typing. And this is a
big no no no, because it
would impede you to change the font without
re-typing part of the text.
What happens with the current Unicode scheme is
that, if the font does not
have a glyph for the ligature <cons><VIRAMA><cons>,
nor for the half
consonant <cons><VIRAMA>, nor for the subjoined
consonant <VIRAMA><cons>,
the virama is *automatically* displayed visibly, so
that the semantics of
the text is always safe, even if rendered with the
most stupid of fonts.

I am no programmer, but surely the rendering engine
could be tweaked to display a halant/hashant in the
aforementioned situations? I understand that it won't
happen *automatically* if we were to use <ZWJ> instead
of <VIRAMA>. But if you were to take the trouble to do
the tweaking, you'd then have a completely *intuitive*
encodings for vowel yaphala sequences,
<vowel><ZWJ><Y>, instead of oddities like
<vowel><VIRAMA><Y>.

Post by Marco Cimarosti

Post by Gautam Sengupta
Surely, [A/E][ZWJ][Y][ZWJ][AA] is more "natural"
and intuitively acceptable than any encoding in
which a vowel is followed by a [VIRAMA]?

Perhaps there isn't a *problem* as such, and perhaps
naturalness and intuitive acceptability aren't *key*
features of the system, but surely other factors being
equal they ought be taken into consideration in
choosing one method of encoding over another?

Post by Marco Cimarosti
I assume that in a well designed Bengali input
method, yaphala would be a
key on its own,
so, by the point of view of the user, it is just a
"character": they don't need to know that when they
press that key the
sequence of codes <VIRAMA><YA> will actually be
inserted, so they won't
notice the apparent nonsense of the sequence
<vowel><VIRAMA> and, as we say
in Italy, "If eye doesn't see, heart doesn't hurt".

No, YAPHALA won't be a character on its own, only Y
will be. The -PHALA in YAPHALA indicates that it is a
combining variant of a grapheme. YAPAHALA will be a
combining variant of Y to be inserted by the rendering
engine in the appropriate environment. The user will
*see* and key in the <ZWJ> between a consonant and a
<Y> (or a vowel and <Y>) in order make the latter show
up as a yaphala.

Marco, thank you *very* much for your extremely
helpful comments and feedback. Best, Gautam.

__________________________________
Do you Yahoo!?
The New Yahoo! Shopping - with improved product search
http://shopping.yahoo.com

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Marco Cimarosti

2003-10-08 13:17:45 UTC

The fact is that these glyphs are not necessarily *required*. Each Indic
script has a relatively small set of glyphs that are absolutely required in
any font, but also an unspecified number of ligature that may or may not be
present.

This may depend from the language (e.g., Devanagari for Sanskrit typically
uses more ligatures than Devanagari for Hindi), or even simply be a matter
of typographical style.

_ Marco

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Gautam Sengupta

2003-10-08 16:27:32 UTC

Post by Gautam Sengupta
Is there any reason (apart from trying to be
ISCII-conformant) why the Bangla word /ki/ "what"
cannot be encoded as [KA][ZWJ][I]? Do we really

need

Post by Gautam Sengupta
combining forms of vowels to encode Indian

scripts?
The encoding of most Indic scripts is based on ISCII
- and that's not going
to change. It was adopted since ISCII was the
pre-existing Indian national
character encoding standard for these scripts.

I understand that this is so. But perhaps it is
worthwhile for us to be aware of the flaws in ISCII
that were inherited by Unicode. It is also necessary
to recognize the fact that the bureaucrats in a
government are not necessarily the most competent
people to adjudicate on how a script should be
encoded. I wonder whether the Dept of Electronics,
Govt of India, would have any reasons to offer
justifying the placement of Assammese /r/ and /v/ and
the long syllabic /r/ and /l/ in their current
positions.

Post by Gautam Sengupta
Another model could have been followed. For example
in Tibetan, isolated
0F68
0F68 0F71
0F68 0F72
0F68 0F73 [0F68 0F71 0F72]
0F68 0F74
0F68 0F75 [0F68 0F71 0F74]
0F62 0F80
0F62 0F81 [0F62 0F71 0F80]
0F63 0F80
0F63 0F81 [0F63 0F71 0F80]
0F68 0F7A
0F68 0F7B
0F68 0F7C
0F68 0F7D
0F68 0F7E
0F68 0F7F

This would have been more appropriate, and possibly
more economical.

Post by Gautam Sengupta
Also, why not use [CONS][ZWJ][CONS] instead of
[CONS][VIRAMA][CONS]? One could then use [VIRAMA]

only

Post by Gautam Sengupta
where it is explicit/visible.

There was a third possibility: In the Tibetan
encoding a second set of
explicitly combining consonants was encoded
So you have [CONS] [COMBINING CONS]
Instead of [CONS][VIRAMA][CONS] or
[CONS][ZWJ][CONS]

This would have been difficult for the Indian scripts.
There would be too many combining forms. We would need
many more code points.

Post by Gautam Sengupta
This was done because a) although a Virama character
exists in Tibetan very
few Tibetans know what it means since it is almost
never written and never
occurs in ordinary text. b) In many combinations it
is totally unacceptable

The use of ligatures in Indian scripts is not as much
a matter of choice as it is often assumed to be. For
example the word /strii/ "woman" written as
<S><VIRAMA><T><VIRAMA><R><II> would be totally
unacceptable in Bangla and most other Indian scripts.

Post by Gautam Sengupta
In other words the ISCII model was not suitable for
Tibetan so a different encoding model was adopted.

*In its current implementation/interpretation* it
doesn't seem to be very suitable for Indian scripts
either.
-Gautam

__________________________________
Do you Yahoo!?
The New Yahoo! Shopping - with improved product search
http://shopping.yahoo.com

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Marco Cimarosti

2003-10-08 17:45:11 UTC

Post by Gautam Sengupta
I am no programmer, but surely the rendering engine
could be tweaked to display a halant/hashant in the
aforementioned situations? I understand that it won't
happen *automatically* if we were to use <ZWJ> instead
of <VIRAMA>. But if you were to take the trouble to do
the tweaking, you'd then have a completely *intuitive*
encodings for vowel yaphala sequences,
<vowel><ZWJ><Y>, instead of oddities like
<vowel><VIRAMA><Y>.

OK but, then, your <ZWJ> becomes exactly what Unicode's <VIRAMA> has always
been: a character that is normally invisible, because it merges in a
ligature with adjacent characters, but occasionally becomes visible when a
font does not have a glyph for that combination.

But there is one detail which makes your approach much more complicated:
what we have been calling <VIRAMA> is *not* a single character. Every Indic
script has its own: <DEVANAGARI SIGN VIRAMA>, <BENGALI SIGN VIRAMA>, and so
on.

Each one of these characters, when displayed visibly, has a distinct glyph:
a Bangla hashant is a small "/" under the letter, a Tamil virama is a dot
over the letter, etc.

With your approach, the single character <ZWJ> is overloaded with a dozen
different glyphs depending on which script the adjacent letters belong to.
Plus, it still has to be invisible when used in a non-Indic script, such as
Arabic.

Implementing all this is certainly possible, but would result in bigger
look-up tables, for no advantage at all.

Post by Gautam Sengupta
Perhaps there isn't a *problem* as such, and perhaps
naturalness and intuitive acceptability aren't *key*
features of the system, but surely other factors being
equal they ought be taken into consideration in
choosing one method of encoding over another?

Yes. But the flaws that I see in ISCII/Unicode model are much smaller than
you imply. E.g., I agree that it would have been more logic if:

- independent and dependent vowels were the same characters;

- each script was encoded in its natural alphabetical order;

- there were no precomposed and decomposed alternatives for the same
graphemes.

And others, on which perhaps a linguist won't agree, but which would have
made life much easier to programmers:

- all vowels were encoded in visual order, so that vowel reordering was
necessary;

- "repha ra" were encoded as a separate characters, so that no reordering at
all was necessary.

But, all summed up, leaving with these little flaws is *much* simpler than
trying to change the rules of a standard a dozen years after people started
implementing it.

_ Marco

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Gautam Sengupta

2003-10-08 19:38:26 UTC

Post by Gautam Sengupta
I am no programmer, but surely the rendering

engine

Post by Gautam Sengupta
could be tweaked to display a halant/hashant in

the

Post by Gautam Sengupta
aforementioned situations? I understand that it

won't

Post by Gautam Sengupta
happen *automatically* if we were to use <ZWJ>

instead

Post by Gautam Sengupta
of <VIRAMA>. But if you were to take the trouble

to do

Post by Gautam Sengupta
the tweaking, you'd then have a completely

*intuitive*

Post by Gautam Sengupta
encodings for vowel yaphala sequences,
<vowel><ZWJ><Y>, instead of oddities like
<vowel><VIRAMA><Y>.

You are absolutely right. I am suggesting that the
language-specific viramas be retained as
script-specific *explicit* viramas that never
disappear. In addition, let's have a script-specific
ZWJ which behaves in the way you describe in the
preceding paragraph. The explicit virama (rather the
ONLY virama) will never appear after a vowel, but the
language-specific ZWJ will, as in <A><ZWJ><Y><AA>
encoding A+YOPHOLA+AA. The cost is just one additional
code point for each script. Note that we will no
longer need the combining vowels or an additional code
point for YAPHOLA.

Post by Gautam Sengupta
But there is one detail which makes your approach
what we have been calling <VIRAMA> is *not* a single
character. Every Indic
script has its own: <DEVANAGARI SIGN VIRAMA>,
<BENGALI SIGN VIRAMA>, and so
on.
Each one of these characters, when displayed
a Bangla hashant is a small "/" under the letter, a
Tamil virama is a dot
over the letter, etc.
With your approach, the single character <ZWJ> is
overloaded with a dozen
different glyphs depending on which script the
adjacent letters belong to.
Plus, it still has to be invisible when used in a
non-Indic script, such as
Arabic.
Implementing all this is certainly possible, but
would result in bigger
look-up tables, for no advantage at all.

See my previous paragraph.

Post by Gautam Sengupta
Perhaps there isn't a *problem* as such, and

perhaps

Post by Gautam Sengupta
naturalness and intuitive acceptability aren't

*key*

Post by Gautam Sengupta
features of the system, but surely other factors

being

Post by Gautam Sengupta
equal they ought be taken into consideration in
choosing one method of encoding over another?

Yes. But the flaws that I see in ISCII/Unicode model
are much smaller than you imply. E.g., I agree that
- independent and dependent vowels were the same
characters;
- each script was encoded in its natural
alphabetical order;
- there were no precomposed and decomposed
alternatives for the same
graphemes.
And others, on which perhaps a linguist won't agree,
but which would have
- all vowels were encoded in visual order, so that
vowel reordering was necessary;
- "repha ra" were encoded as a separate characters,
so that no reordering at all was necessary.

I agree with you on all of these issues. You have in
fact summed up my critique of the ISCII/Unicode model.
The only point I'd like to add here is that these
mistakes were avoidable and should have been avoided.
There can be no excuses for placing the Assamese r and
v the way they are currently placed. The same goes for
the long syllabic R and L.

Post by Gautam Sengupta
But, all summed up, leaving with these little flaws
is *much* simpler than
trying to change the rules of a standard a dozen
years after people started
implementing it.

Take a second look. My suggestion amounts to:

1. retaining the script-specific virama as it is. Its
existing behavior remains unchanged. I rename it as
"(script-specific) ZWJ" merely for my convenience and
conceptual clarity.

2. extending the role of this script-specific ZWJ to
encode combining forms of vowels in CV sequences,
entirely in line with the way it is used to encode CC
ligatures.

[1 and 2 may sound somewhat different from what I have
suggested above, but they are in effect the same].

3. introducing a script-specific explicit virama,
which we can very well afford after getting rid of all
the combining forms of vowels.

4. getting rid of *all* precomposed forms including
the recent innovations in Devanagari that are used
only for transliteration. These not only fill up the
code space of Devanagari but also put constraints on
the placement of characters in the code spaces of
other Indian scripts.

How much recoding would these changes involve? Would
the cost be really unacceptable?

Best, Gautam

__________________________________
Do you Yahoo!?
The New Yahoo! Shopping - with improved product search
http://shopping.yahoo.com

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Kenneth Whistler

2003-10-08 21:12:37 UTC

Post by Gautam Sengupta
You are absolutely right. I am suggesting that the
language-specific viramas be retained as
script-specific *explicit* viramas that never
disappear. In addition, let's have a script-specific
ZWJ which behaves in the way you describe in the
preceding paragraph. The explicit virama (rather the
ONLY virama) will never appear after a vowel, but the
language-specific ZWJ will, as in <A><ZWJ><Y><AA>
encoding A+YOPHOLA+AA. The cost is just one additional
code point for each script.

The "cost" is not measured in code points, but in change
of model, change of implementations, normalization of
data, mismatching and failures of searches on data
represented differently, and on and on...

Post by Gautam Sengupta
Note that we will no
longer need the combining vowels or an additional code
point for YAPHOLA.
I agree with you on all of these issues. You have in
fact summed up my critique of the ISCII/Unicode model.
The only point I'd like to add here is that these
mistakes were avoidable and should have been avoided.
There can be no excuses for placing the Assamese r and
v the way they are currently placed. The same goes for
the long syllabic R and L.

Placement in the code charts is, however, irrelevant to the
correct ordering of strings represented using those
code points. That is done by a collation algorithm with
weight tables -- not by the presumed mechanism of binary
ordering implied by ISCII.

Post by Marco Cimarosti
But, all summed up, leaving with these little flaws
is *much* simpler than
trying to change the rules of a standard a dozen
years after people started
implementing it.

Take a second look.

Marco is, however, absolutely correct in his overall assessment
here.

Post by Gautam Sengupta
1. retaining the script-specific virama as it is. Its
existing behavior remains unchanged. I rename it as
"(script-specific) ZWJ" merely for my convenience and
conceptual clarity.
2. extending the role of this script-specific ZWJ to
encode combining forms of vowels in CV sequences,
entirely in line with the way it is used to encode CC
ligatures.
[1 and 2 may sound somewhat different from what I have
suggested above, but they are in effect the same].
3. introducing a script-specific explicit virama,
which we can very well afford after getting rid of all
the combining forms of vowels.

"Affording" this has nothing to do with available code
points. The problem is the reconstruction of the text
model. And "getting rid of all the combining forms of
vowels" would be a radical reconstruction of the text
model -- something which the Unicode Standard simply
cannot accomodate.

Post by Gautam Sengupta
4. getting rid of *all* precomposed forms including
the recent innovations in Devanagari that are used
only for transliteration. These not only fill up the
code space of Devanagari but also put constraints on
the placement of characters in the code spaces of
other Indian scripts.

Again, "filling up the code space" has nothing to do
with the assessment.

Post by Gautam Sengupta
How much recoding would these changes involve?

Extensive. And *any* recoding of Unicode characters is
simply disallowed by the stability guarantees associated
with the standard:

http://www.unicode.org/standard/stability_policy.html

Post by Gautam Sengupta
Would
the cost be really unacceptable?

Yes. Absolutely.

--Ken

Post by Gautam Sengupta
Best, Gautam

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Unicode (public)

2003-10-08 21:26:47 UTC

Gautam--

Post by Gautam Sengupta
1. retaining the script-specific virama as it is. Its
existing behavior remains unchanged. I rename it as
"(script-specific) ZWJ" merely for my convenience and conceptual

clarity.

Post by Gautam Sengupta
2. extending the role of this script-specific ZWJ to
encode combining forms of vowels in CV sequences,
entirely in line with the way it is used to encode CC ligatures.
[1 and 2 may sound somewhat different from what I have suggested above,

but they are in effect the same].

Post by Gautam Sengupta
3. introducing a script-specific explicit virama,
which we can very well afford after getting rid of all
the combining forms of vowels.
4. getting rid of *all* precomposed forms including
the recent innovations in Devanagari that are used
only for transliteration. These not only fill up the
code space of Devanagari but also put constraints on
the placement of characters in the code spaces of
other Indian scripts.
How much recoding would these changes involve? Would
the cost be really unacceptable?

Yes, the cost is really unacceptable.

Two of the most basic Unicode stability policies dictate that character
assignments, once made, are never removed and character names can never
change. Step 4 cannot happen; the best that can happen is that the code
points in question can be deprecated. The renaming you suggest in 1
cannot happen either.

The change in the encoding model for the virama can't happen either;
there are too many implementations based on it, and there are too many
documents out there that use the current encoding model. Your
suggestion wouldn't make them unreadable when opened with software that
did things the way you're suggesting, but it would change their
appearance in ways that are unlikely to be acceptable.

[I preface what follows with the observation that I'm not by any stretch
of the imagination an expert on Indic scripts, but I do fancy myself an
expert on Unicode.]

I'm also pretty sure that using ZWJ as a virama won't work and isn't
intended to work. KA + ZWJ + KA means something totally different from
KA + VIRAMA + KA, and I, for one, wouldn't expect them to be drawn the
same. U+0915 represents the letter KA with its inherent vowel sound;
that is, it represents the whole syllable KA. Two instances of U+0915
in a row would thus represent "KAKA", completely irrespective of how
they're drawn. Introducing a ZWJ in the middle would allow the two
SYLLABLES to ligate, but there's no ligature that represents "KAKA", so
you should get the same appearance as you do without the ZWJ. The
virama, on the other hand, cancels the vowel sound on the KA, turning it
into K: The sequence KA + VIRAMA + KA represents the syllable KKA, again
irrespective of how it is drawn.

In other words, ZWJ is intended to change the APPEARANCE of a piece of
text without changing its MEANING (there are exceptions in the Arabic
script, but this is the general rule). Having KA + ZWJ + KA render as
the syllable KKA would break this rule: the ZWJ would be changing the
MEANING of the text.

Whether the syllable KKA gets drawn with a virama, a half-form, or a
ligature is the proper province of ZWJ and ZWNJ, and this is what
they're documented in TUS to do. But ZWJ can't (and shouldn't) be used
to turn KAKA into KKA.

Maybe it was unfortunate to call U+094D a "virama," since it doesn't
necessarily get drawn as a virama (or, indeed, as anything), but it's
too late to revisit that decision. For that matter, it may have been a
mistake to use the virama model to encode conjunct forms in Bengali, but
it's too late to change that now. Real users generally shouldn't have
to care, though; this is an issue for programmers and font designers.
Their lives may be harder than they should have been, but unless it's
horribly hard for them to produce the right effects for their users, it
isn't worth it to reopen the issue of Unicode encoding of Indic scripts,
especially the ones that have been in Unicode for more than a decade
now.

There are lots of things that suck about Unicode, but on the whole, it's
way better than what came before and solves more problems that it
creates. Backward compatibility is a pain in the butt, and it forces us
to live with a lot of mistakes and suboptimal solutions we wish we
didn't have to live with. But backward compatibility is also good-- it
means the solution was good enough in the first place that people are
using it.

--Rich Gillam
Language Analysis Systems, Inc.
"Unicode Demystified"

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Peter Kirk

2003-10-08 22:25:04 UTC

... But backward compatibility is also good-- it
means the solution was good enough in the first place that people are
using it.

Not sure about this one, in the Unicode context in general. I have been
told of all sorts of things which cannot be done in the name of backward
compatibility even when it is demonstrated that the original solution
was completely broken and it seems that no one had ever used it -
because it cannot be guaranteed that no one has tried to use it, and so
there just might be some broken or kludged texts out there whose
integrity has to be guaranteed. I'm not saying that is a bad policy,
just that the existence of the policy is not grounds for
self-congratulation that none of the old solutions are broken.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Gautam Sengupta

2003-10-09 06:07:23 UTC

Post by Unicode (public)
Two of the most basic Unicode stability policies dictate that character
assignments, once made, are never removed and character names can never
change. Step 4 cannot happen; the best that can happen is that the code
points in question can be deprecated. The renaming you suggest in 1
cannot happen either.

[Gautam]: Well, too bad. I guess we still have an obligation to explore the extent of sub-optimal solutions that are being imposed upon South-Asian scripts for the sake of *backward compatibility* or simply because they are "fait accomplis". (See Peter Kirk's posting on this issue). However, I am by no means suggesting that the fault lies with the Unicode Consortium.

Post by Unicode (public)
The change in the encoding model for the virama can't happen either;
there are too many implementations based on it, and there are too many
documents out there that use the current encoding model. Your
suggestion wouldn't make them unreadable when opened with software that
did things the way you're suggesting, but it would change their
appearance in ways that are unlikely to be acceptable.

[Gautam]: This is again the "fait accompli" argument. We need to *know* whether adopting an alternative model WOULD HAVE BEEN PREFERABLE, even if the option to do so is no longer available to us. The model I am proposing is precisely the one that has been in use for centuries in the Indian grammatical tradition (/ki/ = k+virama+i). I don't think there are too many South-Asian documents out there encoded in Unicode. At any rate converting them would be a rather simple matter of searching for combining forms of vowels and replacing them by the [VIRAMA][VOWEL] sequence. The TDIL corpora are very small by current standards, and they require extensive reworking anyway.

Post by Unicode (public)
[I preface what follows with the observation that I'm not by any stretch
of the imagination an expert on Indic scripts, but I do fancy myself an
expert on Unicode.]
I'm also pretty sure that using ZWJ as a virama won't work and isn't
intended to work. KA + ZWJ + KA means something totally different from
KA + VIRAMA + KA, and I, for one, wouldn't expect them to be drawn the
same. U+0915 represents the letter KA with its inherent vowel sound;
that is, it represents the whole syllable KA. Two instances of U+0915
in a row would thus represent "KAKA", completely irrespective of how
they're drawn. Introducing a ZWJ in the middle would allow the two
SYLLABLES to ligate, but there's no ligature that represents "KAKA", so
you should get the same appearance as you do without the ZWJ. The
virama, on the other hand, cancels the vowel sound on the KA, turning it
into K: The sequence KA + VIRAMA + KA represents the syllable KKA, again
irrespective of how it is drawn.
In other words, ZWJ is intended to change the APPEARANCE of a piece of
text without changing its MEANING (there are exceptions in the Arabic
script, but this is the general rule). Having KA + ZWJ + KA render as
the syllable KKA would break this rule: the ZWJ would be changing the
MEANING of the text.
Whether the syllable KKA gets drawn with a virama, a half-form, or a
ligature is the proper province of ZWJ and ZWNJ, and this is what
they're documented in TUS to do. But ZWJ can't (and shouldn't) be used
to turn KAKA into KKA.

[Gautam]: I think there is a slight misunderstanding here. The ZWJ I am proposing is script-specific (each script would have its own), call it "ZWJ PRIME" or even "JWZ" (in order to avoid confusion with ZWJ). It doesn't exist yet and hence has no semantics. JWZ is a piece of formalism. Its meaning would be precisely what we chose to assign to it. It behaves like the existing (script-specific) VIRAMA's except that it also occurs between a consonant and an independent vowel, forcing the latter to show up in its combining form. In this respect, it is in fact *closer* or *more faithful* to the classical VIRAMA model. Call it VIRAMA if you will. The only reason why I don't wish to call it "VIRAMA" is because I plan to use it after a vowel as well, as in: <A><JWZ><Y><JWZ<AA> encoding A+YOPHOLA+AA. If YOPHOLA is assigned an independent code point then this move would be unnecessary and my JWZ would just be the usual VIRAMA with an extended function that would, in fact, make it more
compliant with the classical VIRAMA model.

Now that we have freed up all those code points occupied by the combining forms of vowels by introducing the VIRAMA with extended function, let us introduce an explicit (always visible) VIRAMA. That's all.

Post by Unicode (public)
Maybe it was unfortunate to call U+094D a "virama," since it doesn't
necessarily get drawn as a virama (or, indeed, as anything), but it's
too late to revisit that decision.

No, the decision is not unfortunate because of that, but rather because U+094D doesn't behave like a virama in all respects, and hence my proposal for extension of its functions.

Post by Unicode (public)
For that matter, it may have been a mistake to use the virama model to encode
conjunct forms in Bengali, ...

Not really. But once adopted, the model should have been implemented in full, eliminating the need for combing forms of vowels. Thanks a lot Rich.

-Gautam

---------------------------------
Do you Yahoo!?
The New Yahoo! Shopping - with improved product search

Kenneth Whistler

2003-10-08 21:31:05 UTC

Post by Gautam Sengupta
The encoding of most Indic scripts is based on ISCII
- and that's not going
to change. It was adopted since ISCII was the
pre-existing Indian national
character encoding standard for these scripts.

Why should they? The positions of these characters in
the Unicode code chart for the Bengali script has nothing
to do with the ISCII chart, in any case. They are
*additions* beyond the ISCII chart. In the case of
the Assamese letters, these additions separate out
the *distinct* forms for Assamese /r/ and /v/ from
the Bangla forms, and *enable* correct sorting, rather
than inhibiting it. The addition of the long syllabic
/r/ and /l/ *enables* the representation of Sanskrit
material in the Bengali script, and the code position in
the charts is immaterial.

By the way, the relevant organization now would be
TDIL, within the Indian Ministry of Communications and
Information Technology -- not the Dept. of Electronics.
But be that as it may, they have nothing to do with
the code point choices in the range U+09E0..U+09FF,
as should be clear from the documentation of the
Unicode Standard. See The Unicode Standard, Version 4.0,
p. 219, available online.

--Ken

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Gautam Sengupta

2003-10-09 02:21:42 UTC

Post by Gautam Sengupta
The encoding of most Indic scripts is based on

ISCII

Post by Gautam Sengupta
- and that's not going
to change. It was adopted since ISCII was the
pre-existing Indian national
character encoding standard for these scripts.

I understand that this is so. But perhaps it is
worthwhile for us to be aware of the flaws in

ISCII

Post by Gautam Sengupta
that were inherited by Unicode. It is also

necessary

Post by Gautam Sengupta
to recognize the fact that the bureaucrats in a
government are not necessarily the most competent
people to adjudicate on how a script should be
encoded. I wonder whether the Dept of Electronics,
Govt of India, would have any reasons to offer
justifying the placement of Assammese /r/ and /v/

and

Post by Gautam Sengupta
the long syllabic /r/ and /l/ in their current
positions.

Why should they? The positions of these characters
in
the Unicode code chart for the Bengali script has
nothing
to do with the ISCII chart, in any case. They are
*additions* beyond the ISCII chart.

[Gautam]: Yes, they do. The arrangement is identical
in a new code space of the same size.

Post by Peter Kirk
In the case of
the Assamese letters, these additions separate out
the *distinct* forms for Assamese /r/ and /v/ from
the Bangla forms, and *enable* correct sorting,
rather
than inhibiting it. The addition of the long
syllabic
/r/ and /l/ *enables* the representation of Sanskrit
material in the Bengali script, and the code
position in
the charts is immaterial.

[Gautam]: Nobody is objecting to the addition of these
forms, only to their placement vis-s-vis the other
forms.

Post by Peter Kirk
By the way, the relevant organization now would be
TDIL, within the Indian Ministry of Communications
and
Information Technology -- not the Dept. of
Electronics.

[Gautam]: Yes, indeed. The ministry has been renamed.
TDIL remains the same.

Post by Peter Kirk
But be that as it may, they have nothing to do with
the code point choices in the range U+09E0..U+09FF,
as should be clear from the documentation of the
Unicode Standard. See The Unicode Standard, Version
4.0, p. 219, available online.

[Gautam]: I did look up the document and this is what
I found:

"The Devanagari block of the Unicode Standard is based
on ISCII 1988. ...
The Unicode Standard encodes Devanagari characters in
the same relative positions as those coded in
positions A0-F4 in the ISCII 1988 standard. The same
character code layout is followed for eight other
Indic scripts in the Unicode Standard ... This
parallel code layout ... follows the stated intention
of the Indian coding standard to enable one-to-one
mappings between analogous coding positions in
different scripts in the family."

Clearly ISCII has a *lot to do* with code point
choices in the range U+09E0..U+09FF.

Best, Gautam.

__________________________________
Do you Yahoo!?
The New Yahoo! Shopping - with improved product search
http://shopping.yahoo.com

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Gautam Sengupta

2003-10-09 10:04:28 UTC

Ken,

I stand corrected. Long syllabic /r l/ as well as
Assamese /r v/ are indeed additions beyond the ISCII
code chart. My objection, however, was not against
their inclusion but against their placement. I
understand why long syllabic /r l/ could not be placed
with the vowels, but why were Assamese /r v/ assigned
U+09F0 and U+09F1 instead of U+09B1 and U+09B5
respectively?

In the case of the Assamese letters, these
additions separate out the *distinct* forms for
Assamese /r/ and /v/ from the Bangla forms, and
*enable* correct sorting, rather than inhibiting it.

I fail to understand why Assamese /r v/ wouldn't be
correctly sorted if placed in U+09F0 and U+09F1. Why
do they need to be separated out from the Bangla forms
in order to enable correct sorting?

The addition of the long syllabic /r/ and /l/
*enables* the representation of Sanskrit
material in the Bengali script, and the code
position in the charts is immaterial.

As stated earlier, my objection is not against their
inclusion, but against their positioning on the code
chart. Why is their relative position in the chart
immaterial for sorting? If it is merely because there
are script-specific sorting mechanisms already in
place, then it's just a bad excuse for a sloppy job. I
sincerely hope there is more to it than just that.

But be that as it may, they (TDIL) have nothing to
do with the code point choices in the range
U+09E0..U+09FF ...

If this is indeed the case, then I must say it's
rather unfortunate. As a full corporate member
representing the Republic of India, the Ministry of
Information Technology should have had a BIG say in
the matter. Were they ever consulted on the issue? Did
they try to intervene suo moto? Will a Unicode
official kindly let us know? Best, -Gautam.

__________________________________
Do you Yahoo!?
The New Yahoo! Shopping - with improved product search
http://shopping.yahoo.com

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Marco Cimarosti

2003-10-09 09:03:43 UTC

Post by Marco Cimarosti
OK but, then, your <ZWJ> becomes exactly what
Unicode's <VIRAMA> has always
been: [...]

Good, good. We are making small steps forward.

What are you really asking for is that each Indic script has *two* viramas:

- a "soft virama", which is normally invisible and only displays visibly in
special cases (no ligatures for that cluster);

- a "hard virama" (or "explicit virama", as you correctly called it), which
always displays as such and never ligates with adjacent characters.

Let's assume that it would be handy to assign these two viramas to different
keys on the keyboard. Or, even better, let's assign the "soft virama" to the
plain key and the "hard virama" to the SHIFT key, OK? To avoid
misunderstandings with the term "virama", let's label this key "JOINER".

Now, this is what you *already* have in Unicode! On our hypothetic Bangla
keyboard:

- the "soft virama" (the plain JOINER key) is Unicode's <BENGALI SIGN
VIRAMA>;

- the "hard virama" (the SHIFT+JOINER key) is Unicode's <BENGALI SIGN
VIRAMA>+<ZWNJ>.

Not only Unicode allows all of the above, but it also has a third kind of
"virama", which may or may not be useful in Bangla but is certainly useful
in Devanagari and Gujarati:

- the "half-consonant virama" (let's assign it to the ALT+JOINER key in out
hypothetical keyboard) which forces the preceding consonant to be displayed
as an half consonant, if possible. This is Unicode's <BENGALI SIGN
VIRAMA>+<ZWJ>.

Notice that, once you have these three "viramas" on your keyboard, you don't
need to have keys for <ZWJ> and <ZWNJ>, as their only use, in Indic, is
after a <xxx SIGN VIRAMA>.

Apart the fact that two of the three viramas are encoded as a *pair* of code
points, how does the *current* Unicode model impede you to implement the
clean theoretical model that you have in mind?

Post by Gautam Sengupta
[...]

Post by Marco Cimarosti
- independent and dependent vowels were the same
characters;

[...]
I agree with you on all of these issues. You have in
fact summed up my critique of the ISCII/Unicode model.

OK. But are you sure that this critique should necessarily be moved to the
*encoding* model, rather than to some other part of the chain. I'll now try
to demonstrate how also the redundancy of dependent/independent vowels may
be solved at the *keyboard* level.

You are certainly aware that some national keyboards have the so-called
"dead keys". A dead key is a key which does not immediately send (a)
character(s) to the application but waits for a second key; in European
keyboards dead keys are used to type accented letters. E.g., let's see how
accented letters are typed on the Spanish keyboard (which, BTW, is by far
the best designed keyboard in Western Europe):

1. If you press the <´> key, nothing is sent to the application, but the
keystroke is memorized by the keyboard driver.

2. If you now press one of <a>, <e>, , <o>, or <y> keys, characters
<á>, <é>, <í>, <ó>, <ú> or <ý> are sent to the application.

3. If you press the space bar, character <´> itself is sent to the
application;

4. If you press any other key, e.g. <m>, the two characters <´> and <m> are
sent to the application in this order.

Now, in the description above substitute:

- the <´> key with <0985 BENGALI LETTER A> (but let's label it "VIRTUAL
CONSONANT");

- the <a> ... <y> keys with <09BE BENGALI VOWEL SIGN AA> ... <09CC BENGALI
VOWEL SIGN AU>;

- the <á> ... <ý> characters with <0986 BENGALI LETTER AA> ... <0994 BEGALI
LETTER AU>.

What you have is a Bangla keyboard where dependent vowels are typed with a
single <vowel> keystroke, and independent vowels are typed with the sequence
<VIRTUAL CONSONANT>+<vowel>.

Do you prefer your <cons>+<VIRAMA>+<vowel> model? Personally, I find it is
suboptimal, as it requires, on average, more keystrokes. However, if that's
what you want, in the Spanish keyboard description above substitute:

- the <´> key with the unshifted <JOINER> (= virama) key that we have
already defined above;

- the <a> ... <y> keys with <0986 BENGALI LETTER AA> ... <0994 BEGALI LETTER
AU>;

- the <á> ... <ý> characters with <09BE BENGALI VOWEL SIGN AA> ... <09CC
BENGALI VOWEL SIGN AU>.

Now you have a Bangla keyboard where independent vowels are typed with a
single keystroke, and dependent vowels are typed with the sequence
<JOINER>+<vowel>.

_ Marco

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Unicode (public)

2003-10-09 15:44:52 UTC

Peter--

... But backward compatibility is also good-- it
means the solution was good enough in the first place that people are
using it.

Not sure about this one, in the Unicode context in general. I have been
told of all sorts of things which cannot be done in the name of

backward

Post by Peter Kirk
compatibility even when it is demonstrated that the original solution
was completely broken and it seems that no one had ever used it -
because it cannot be guaranteed that no one has tried to use it, and so
there just might be some broken or kludged texts out there whose
integrity has to be guaranteed. I'm not saying that is a bad policy,
just that the existence of the policy is not grounds for
self-congratulation that none of the old solutions are broken.

Yeah, you're right.

I presume you're talking here mostly about the combining classes of the
Hebrew vowel points. That was a case where even though the Hebrew
encoding was clearly broken (insofar as Biblical Hebrew was concerned,
anyway), fixes for the problem were constrained because there was a need
to maintain backward compatibility ACROSS THE WHOLE STANDARD for reasons
unrelated to Biblical Hebrew. So yeah, here the need to preserve
backward compatibility tells us Unicode in general was good enough for
people to use it, even though they couldn't use it for Biblical Hebrew.
So yeah, I overstated my case.

--Rich Gillam
Language Analysis Systems

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Peter Kirk

2003-10-09 17:12:20 UTC

Post by Unicode (public)
...
Yeah, you're right.
I presume you're talking here mostly about the combining classes of the
Hebrew vowel points. ...

Mostly. I have come across other similar cases e.g. the Arabic hamza
issue recently discussed on the bidi list, perhaps also the distinction
between Greek tonos and acute. They are all cases where the stability
policy forbids changes of combining class or deletion of a redundant
character.

Post by Unicode (public)
... That was a case where even though the Hebrew
encoding was clearly broken (insofar as Biblical Hebrew was concerned,
anyway), ...

What is broken is the encoding of any sequence of vowels. Because of
this no one had used Unicode for sequences of vowels. Except that
someone may have tried, and although the resulting texts would be mixed
up and invalid, apparently for backward compatibility that mixed-upness
and invalidity has to be preserved.

Post by Unicode (public)
... fixes for the problem were constrained because there was a need
to maintain backward compatibility ACROSS THE WHOLE STANDARD for reasons
unrelated to Biblical Hebrew. So yeah, here the need to preserve
backward compatibility tells us Unicode in general was good enough for
people to use it, ...

Happily, yes! It would still have been good enough to use without those
stability guarantees. It seems to me that some unwise promises were made
which have caused the backward compatibility issue. I'm not convinced
that those promises contributed much to the usability of Unicode; they
may have made life a bit easier for some people, e.g. those who want to
rely on data being normalised without the overhead of checking it, but
made things a lot more difficult for some others.

Post by Unicode (public)
... even though they couldn't use it for Biblical Hebrew.
So yeah, I overstated my case.
--Rich Gillam
Language Analysis Systems

--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Unicode (public)

2003-10-09 16:14:01 UTC

Gautam--

[Gautam]: Well, too bad. I guess we still have an obligation to
explore the extent of sub-optimal solutions that are being imposed upon
South-Asian scripts for the sake of *backward compatibility* or simply
because they are "fait accomplis". (See Peter Kirk's posting on this
issue). However, I am by no means suggesting that the fault lies with
the Unicode Consortium.

I'm a little confused by this statement. What would be the difference
between sticking with a suboptimal solution because it's a fait accompli
and sticking with it out of the need for backward compatibility? The
need for backward compatibility exists because the suboptimal solution
is a fait accompli. Or are you stating that backward compatibility is a
specious argument because the encoding is so broken nobody's actually
using it?

[Gautam]: This is again the "fait accompli" argument. We need to
*know* whether adopting an alternative model WOULD HAVE BEEN PREFERABLE,
even if the option to do so is no longer available to us.

I don't understand. If the option to go to an alternative model is not
available, why is it important to know that the alternative model would
have been preferable?

[Gautam]: I think there is a slight misunderstanding here. The
ZWJ I am proposing is script-specific (each script would have its own),
call it "ZWJ PRIME" or even "JWZ" (in order to avoid confusion with
ZWJ). It doesn't exist yet and hence has no semantics.

Okay. Maybe I'm dense, but this wasn't clear to me from your other
emails. You're not proposing that U+200D be used to join Indic
consonants together; you're basically arguing for virama-like
functionality that goes far enough beyond what the virama does that
you're not comfortable calling it a virama anymore.

JWZ is a piece of formalism. Its meaning would be precisely
what we chose to assign to it. It behaves like the existing
(script-specific) VIRAMA's except that it also occurs between a
consonant and an independent vowel, forcing the latter to show up in its
combining form.

Aha! This is what I wasn't parsing out of your previous emails. It was
there, but I somehow didn't grok it. To summarize:

Tibetan deals with consonant clusters by encoding each of the consonants
twice: One series of codes is to be used for the first consonant in a
cluster, and the other series is to be used for the others. The Indian
scripts don't do this; they use a single series of codes for the
consonants and cause consonants to form clusters by adding a VIRAMA code
between them. But the Indian scripts still have two series of VOWELS
more or less analogous to the two series of consonants in Tibetan. When
you want a non-joining vowel, you use one series, and when you want a
joining vowel, you use the other.

You want to have one series of vowels and extend the virama model to
conbining vowels. Thus, you'd represent KI as KA + VIRAMA + I; KA + I
would represent two syllables: KA-I. Since a real virama never does
this, you're using a different term ("JWZ" in your most recent message)
for the character that causes the joining to happen. You're not
proposing any difference in how consonants are treated, other than
having this new character server the sticking-together function that the
VIRAMA now serves and changing the existing VIRAMA to always display
explicitly.

Now do I understand you? Sorry for my earlier misunderstandings.

Now that we have freed up all those code points occupied by the
combining forms of vowels by introducing the VIRAMA with extended
function, let us introduce an explicit (always visible) VIRAMA. That's
all.

As far as Unicode is concerned, you can't "free up" any code points.
Once a code point is assigned, it's always assigned. You can deprecate
code points, but that doesn't free them up to be reused; it only (with
luck) keeps people from continuing to use them.

It seems to me that a system could support the usage you want and the
old usage at the same time. I could be wrong, but I'm guessing that KA
+ VIRAMA + I isn't a sequence that makes any sense with current
implementations and isn't being used. It would be possible to extend
the meaning of the current VIRAMA to turn the independent vowels into
dependent vowels. Future use of the dependent-vowel code points could
be discouraged in favor of VIRAMA plus the independent-vowel code
points. Old documents would continue to work, but new documents could
use the model you're after. (You get the explicit virama the same way
you do now: VIRAMA + ZWNJ.) This solution would involve encoding no new
characters and no removal of existing characters, but just a change in
the semantics of the VIRAMA.

That said, I'm not sure this is a good idea. If what you're really
concerned about is typing and editing of text, you can have that work
the way you want without changing the underlying encoding model. It
involves somewhat more complicated keyboard handling, but I'm pretty
sure all the major operating systems allow this. The basic idea is that
you have one set of vowel keys that normally generate the
independent-vowel code points, but if one of them is preceded by the
VIRAMA key, the two keystrokes map to a single character: the
dependent-vowel code point. This is a simple solution that can be
implemented today with very little fuss and involves no changes to
Unicode or to the various fonts and rendering engines that would be
required of the VIRAMA code point took on a new meaning. From a user's
point of view, things work the way they're supposed to, and they work
that way sooner than if Unicode is changed. Only programmers have to
worry about the actual encoding details, and unless keeping the existing
model makes THEIR jobs significantly harder, the encoding itself
shouldn't change.

I hope this makes sense...

--Rich Gillam
Language Analysis Systems, Inc.

Gautam Sengupta

2003-10-10 04:22:22 UTC

Post by Unicode (public)
Gautam--
...
I don't understand. If the option to go to an alternative model is not available, why is it
important to know that the alternative model would have been preferable?

[Gautam]: Just for the sake of knowing, I guess. "... ripeness is all".

Post by Unicode (public)
[Gautam]: I think there is a slight misunderstanding here. The ZWJ I am proposing is
script-specific (each script would have its own), call it "ZWJ PRIME" or even "JWZ"
(in order to avoid confusion with ZWJ). It doesn't exist yet and hence has no
semantics.
Okay. Maybe I'm dense, but this wasn't clear to me from your other emails.

[Gautam]: Heavens, no! It must be my non-native English that's creating all these communication gaps.

Post by Unicode (public)
You're not proposing that U+200D be used to join Indic consonants together; you're
basically arguing for virama-like functionality that goes far enough beyond what the
virama does that you're not comfortable calling it a virama anymore.

[Gautam]: Indeed. You got it just right. Let us introduce the term "Ind VIRAMA" to refer to the virama used in Sanskrit and other Indic languages, and Uni VIRAMA" to refer to the virama in Unicode. The two are *not* identical. Uni VIRAMA lacks the full functionality of Ind Virama. I am proposing two extensions to Uni Virama:

1. extension of its functionality to allow cons+combining vowel to be encoded as <Cons><VIRAMA><full Vowel>, and

2. extension of its functionality further to allow vowel+yophola to be encoded as <Vowel><VIRAMA><full Y>

(1) merely confers on Uni VIRAMA the full functionality of Ind VIRAMA, making the two functionally identical.

(2) is a hack, a crude ad hoc solution to the problem of how to encode Bangla vowel+yophola sequences. It is THIS latter extension that would make Uni VIRAMA un-VIRAMA-like, and hence my discomfiture with the name "VIRAMA". But (2) can be avoided if we can find some other solution to the YOPHOLA problem, such as assigning a code point to YOPHOLA in addition to the one already assigned to Y. And this (that is, addition of a distinct YOPHOLA on the code chart), by the way, would also disambiguate <R><Y> sequences in Bangla. (See Paul Nelson, "Bengali Script: Formation of the Reph and use of the ZERO WIDTH JOINER and ZERO WIDTH NON-JOINER"). I now feel that it is better to avoid extension 2 for the sake of keeping the model clean. Let us say we find some other acceptable solution to the problems raised by combinations involving YOPHOLA.

Post by Unicode (public)
Tibetan deals with consonant clusters by encoding each of the consonants twice: One
series of codes is to be used for the first consonant in a cluster, and the other series is
to be used for the others. The Indian scripts don't do this; they use a single series
of codes for the consonants and cause consonants to form clusters by adding a
VIRAMA code between them. But the Indian scripts still have two series of VOWELS
more or less analogous to the two series of consonants in Tibetan. When you want a
non-joining vowel, you use one series, and when you want a joining vowel, you use the
other.

[Gautam]: In Unicode Indic CV and CC sequences are treated differently. It uses the VIRAMA model for CC clusters, but the Tibetan model for CV's. I am suggesting the use of the VIRAMA model for BOTH.

Post by Unicode (public)
You want to have one series of vowels and extend the virama model to combining
vowels. Thus, you'd represent KI as KA + VIRAMA + I; KA + I would represent two
syllables: KA-I.

[Gautam]: Yes.

Post by Unicode (public)
Since a real virama never does this, you're using a different term ("JWZ" in your most
recent message) for the character that causes the joining to happen.

[Gautam]: No, the *real* Ind VIRAMA does exactly this. Hence with this extension only (that is, as long as extension 2 is not implemented) I feel no compulsion to rename VIRAMA.

Post by Unicode (public)
You're not proposing any difference in how consonants are treated, other than having
this new character server the sticking-together function that the VIRAMA now serves
and changing the existing VIRAMA to always display explicitly.
Now do I understand you? Sorry for my earlier misunderstandings.

[Gautam]: Yes, but note the clarifications provided in the preceding paragraphs.

Post by Unicode (public)
Now that we have freed up all those code points occupied by the combining forms of
vowels by introducing the VIRAMA with extended function, let us introduce an explicit
(always visible) VIRAMA. That's all.
As far as Unicode is concerned, you can't "free up" any code points. Once a code
point is assigned, it's always assigned. You can deprecate code points, but that
doesn't free them up to be reused; it only (with luck) keeps people from continuing to
use them.

[Gautam]: This is just too bad.

Post by Unicode (public)
It seems to me that a system could support the usage you want and the old usage at
the same time. I could be wrong, but I'm guessing that KA + VIRAMA + I isn't a
sequence that makes any sense with current implementations and isn't being used. It
would be possible to extend the meaning of the current VIRAMA to turn the
independent vowels into dependent vowels. Future use of the dependent-vowel code
points could be discouraged in favor of VIRAMA plus the independent-vowel code
points. Old documents would continue to work, but new documents could use the
model you're after. (You get the explicit virama the same way you do now: VIRAMA + > ZWNJ.) This solution would involve encoding no new characters and no removal of
existing characters, but just a change in the semantics of the VIRAMA.

[Gautam]: That sounds good. I would prefer an independent code point for the explicit VIRAMA, but on second thought VIRAMA+ZWNJ is not too bad either.

Post by Unicode (public)
That said, I'm not sure this is a good idea.

Here comes the punch line!

Post by Unicode (public)
If what you're really concerned about is typing and editing of text,

[Gautam]: No, that's certainly not my primary concern.

Post by Unicode (public)
you can have that work the way you want without changing the underlying encoding
model. It involves somewhat more complicated keyboard handling, but I'm pretty
sure all the major operating systems allow this. The basic idea is that you have one
set of vowel keys that normally generate the independent-vowel code points, but if one
of them is preceded by the VIRAMA key, the two keystrokes map to a single
character: the dependent-vowel code point. This is a simple solution that can be
implemented today with very little fuss and involves no changes to Unicode or to the
various fonts and rendering engines that would be required of the VIRAMA code point
took on a new meaning. From a user's point of view, things work the way they're
supposed to, and they work that way sooner than if Unicode is changed.

[Gautam]: I have been aware of this solution all along since my corpus and language related work often involves keyboard remapping. This solution was also highlighted by Marco Cimarosti in a recent posting on this list. (Marco, I hope you are reading this). But that is NOT what I am after.

Post by Unicode (public)
Only programmers have to worry about the actual encoding details, and unless
keeping the existing model makes THEIR jobs significantly harder, the encoding itself
shouldn't change.

Yes, but not just programmers who are concerned with how a Unicode text should be encoded, but also those who are going to have to process these texts for various purposes. Let us first introduce a small notational convention and then consider a rather minor example.

Let the lowercase vowels henceforth denote *combining* vowels. In Bangla K+R+i and J+aa+I mean "I do" and "I go" respectively. Given these two forms as input, a morphological analyzer should ideally yield the following analyses: KRi = KR<VIRAMA> + I, JaaI = Jaa + I. (I am assuming orthographic - not phonemic/phonetic - input-output). In other words, the analyzer would have to insert an explicit virama after KR and somehow recognize the final in KRi as .

Now let's consider the same pair of inputs in *my* representation. They would be K+R+VIRAMA+I and J+VIRAMA+AA+I. All that the morphological analyzer would have to do is chop off the rightmost . The leftovers are exactly what we need: K+R+VIRAMA and J+VIRAMA+AA. Isn't it amazing how evidence from diverse fields of inquiry seem to converge on the *correct* solution?

Post by Unicode (public)
I hope this makes sense...

-Gautam

---------------------------------
Do you Yahoo!?
The New Yahoo! Shopping - with improved product search

Peter Kirk

2003-10-10 11:11:23 UTC

Post by Gautam Sengupta
...
Yes, but not just programmers who are concerned with how a Unicode
text should be encoded, but also those who are going to have to
process these texts for various purposes. Let us first introduce a
small notational convention and then consider a rather minor example.
Let the lowercase vowels henceforth denote *combining* vowels. In
Bangla K+R+i and J+aa+I mean "I do" and "I go" respectively. Given
these two forms as input, a morphological analyzer should ideally
yield the following analyses: KRi = KR<VIRAMA> + I, JaaI = Jaa + I. (I
am assuming orthographic - not phonemic/phonetic - input-output). In
other words, the analyzer would have to insert an explicit virama
after KR and somehow recognize the final in KRi as .
Now let's consider the same pair of inputs in *my* representation.
They would be K+R+VIRAMA+I and J+VIRAMA+AA+I. All that the
morphological analyzer would have to do is chop off the rightmost .
The leftovers are exactly what we need: K+R+VIRAMA and J+VIRAMA+AA.
Isn't it amazing how evidence from diverse fields of inquiry seem to
converge on the *correct* solution?

Post by Unicode (public)
I hope this makes sense...

-Gautam

It would surely be trivial for any morphological analyser to understand
i as a ligature or contraction of <VIRAMA, I>, split it into the
sequence, and then analyse the version with the sequence. Any
morphological analyser is going to have to deal with ligatures and
contractions. It could be programmed as a morphophonemic contraction,
even if that is not technically linguistically correct.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Gautam Sengupta

2003-10-11 12:37:39 UTC

Post by Gautam Sengupta
...
Yes, but not just programmers who are concerned

with how a Unicode

Post by Gautam Sengupta
text should be encoded, but also those who are

going to have to

Post by Gautam Sengupta
process these texts for various purposes. Let us

first introduce a

Post by Gautam Sengupta
small notational convention and then consider a

rather minor example.

Post by Gautam Sengupta
Let the lowercase vowels henceforth denote

*combining* vowels. In

Post by Gautam Sengupta
Bangla K+R+i and J+aa+I mean "I do" and "I go"

respectively. Given

Post by Gautam Sengupta
these two forms as input, a morphological analyzer

should ideally

Post by Gautam Sengupta
yield the following analyses: KRi = KR<VIRAMA> +

I, JaaI = Jaa + I. (I

Post by Gautam Sengupta
am assuming orthographic - not phonemic/phonetic -

input-output). In

Post by Gautam Sengupta
other words, the analyzer would have to insert an

explicit virama

Post by Gautam Sengupta
after KR and somehow recognize the final in

KRi as .

Post by Gautam Sengupta
Now let's consider the same pair of inputs in *my*

representation.

Post by Gautam Sengupta
They would be K+R+VIRAMA+I and J+VIRAMA+AA+I. All

that the

Post by Gautam Sengupta
morphological analyzer would have to do is chop

off the rightmost .

Post by Gautam Sengupta
The leftovers are exactly what we need: K+R+VIRAMA

and J+VIRAMA+AA.

Post by Gautam Sengupta
Isn't it amazing how evidence from diverse fields

of inquiry seem to

Post by Gautam Sengupta
converge on the *correct* solution?

Post by Unicode (public)
I hope this makes sense...

-Gautam

[Gautam]: I did hedge my claim by saying that I was
going to cite a rather minor example. But why would I
want to do this extra bit of computing - however
trivial - when I could have avoided it by adopting a
more "appropriate" encoding in the first place? After
all, what I am suggesting is that the VIRAMA model
once adopted ought to have been implemented in full.
Is there any particular reason why it should be
adopted for CC but not for CV sequences?

Encoding /ki/ as <K> (using lowercase vowels to
denote combining forms and letters within slashes to
denote phonemes rather than characters) is also
semantically inappropriate. <K> stands for /ka/ not
/k/, and being a combining form of simply
stands for /i/. So <K> should stand for /kai/
rather than /ki/ unless a VIRAMA is inserted between
the <K> and the to remove the default inherent
vowel /a/ from <K>.

I hope this makes sense. Best, Gautam.

__________________________________
Do you Yahoo!?
The New Yahoo! Shopping - with improved product search
http://shopping.yahoo.com

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Peter Kirk

2003-10-11 13:33:35 UTC

Post by Gautam Sengupta
...
[Gautam]: I did hedge my claim by saying that I was
going to cite a rather minor example. But why would I
want to do this extra bit of computing - however
trivial - when I could have avoided it by adopting a
more "appropriate" encoding in the first place? After
all, what I am suggesting is that the VIRAMA model
once adopted ought to have been implemented in full.
Is there any particular reason why it should be
adopted for CC but not for CV sequences?
Encoding /ki/ as <K> (using lowercase vowels to
denote combining forms and letters within slashes to
denote phonemes rather than characters) is also
semantically inappropriate. <K> stands for /ka/ not
/k/, and being a combining form of simply
stands for /i/. So <K> should stand for /kai/
rather than /ki/ unless a VIRAMA is inserted between
the <K> and the to remove the default inherent
vowel /a/ from <K>.
I hope this makes sense. Best, Gautam.

I see where you are coming from. But you seem to be trying to redefine
one of the basic characteristics of Indic scripts, that an explicit
vowel replaces the implicit vowel. If you start on this path, you may as
well do the further simplification to do away with the confusing virama
and use a simple phonetic encoding i.e <k, a> for ka, <k, i> for ki, and
<k> alone for k with virama mark. Smart font technology can easily
substitute the required ligatures, virama marks etc., in the same sort
of way that it does Arabic shaping, ligatures etc. And if we were
starting from scratch we might have decided that that was the better way
to go. But we are not, we are starting from where we are, so we should
probably make as few changes as we can.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

Kenneth Whistler

2003-10-10 01:02:22 UTC

Post by Gautam Sengupta
I stand corrected. Long syllabic /r l/ as well as
Assamese /r v/ are indeed additions beyond the ISCII
code chart. My objection, however, was not against
their inclusion but against their placement. I
understand why long syllabic /r l/ could not be placed
with the vowels, but why were Assamese /r v/ assigned
U+09F0 and U+09F1 instead of U+09B1 and U+09B5
respectively?

Because the 7th and 8th rows in each of these Indic
scripts was where additions beyond the ISCII repertoire
were added.

In the case of the Assamese letters, these
additions separate out the *distinct* forms for
Assamese /r/ and /v/ from the Bangla forms, and
*enable* correct sorting, rather than inhibiting it.

I fail to understand why Assamese /r v/ wouldn't be
correctly sorted if placed in U+09F0 and U+09F1.

I presume you mean U+09B1 and U+09B5.

The answer is that no Indic script is correctly sorted
simply by using code point order, anyway. You need
a more sophisticated algorithm. And since such an
algorithm will have weight tables, it doesn't *matter*
where a particular character is in the code chart.

See:

http://www.unicode.org/notes/tn1/

for a discussion of these issues.

Post by Gautam Sengupta
Why
do they need to be separated out from the Bangla forms
in order to enable correct sorting?

So that a tailored sorting for Assamese can be based
on Assamese letters, and a tailored sorting for Bangla
can be based on Bangla letters.

The addition of the long syllabic /r/ and /l/
*enables* the representation of Sanskrit
material in the Bengali script, and the code
position in the charts is immaterial.

As stated earlier, my objection is not against their
inclusion, but against their positioning on the code
chart. Why is their relative position in the chart
immaterial for sorting?

See the above technical note. If it will help you visualize
the answer in some way, here is an excerpt from the
Default Unicode Collation Element Table for the
Unicode Collation Algorithm (Version 4.0), showing the
default weight assignments for the relevant portion of the
Bengali script:

09AA ; [.15C4.0020.0002.09AA] # BENGALI LETTER PA
09AB ; [.15C5.0020.0002.09AB] # BENGALI LETTER PHA
09AC ; [.15C6.0020.0002.09AC] # BENGALI LETTER BA
09AD ; [.15C7.0020.0002.09AD] # BENGALI LETTER BHA
09AE ; [.15C8.0020.0002.09AE] # BENGALI LETTER MA
09AF ; [.15C9.0020.0002.09AF] # BENGALI LETTER YA
09DF ; [.15C9.0020.0002.09AF][.0000.00FD.0002.09BC] # BENGALI LETTER YYA; QQCM
09B0 ; [.15CA.0020.0002.09B0] # BENGALI LETTER RA
09F0 ; [.15CB.0020.0002.09F0] # BENGALI LETTER RA WITH MIDDLE DIAGONAL <---
09B2 ; [.15CC.0020.0002.09B2] # BENGALI LETTER LA
09F1 ; [.15CD.0020.0002.09F1] # BENGALI LETTER RA WITH LOWER DIAGONAL <---
09B6 ; [.15CE.0020.0002.09B6] # BENGALI LETTER SHA
09B7 ; [.15CF.0020.0002.09B7] # BENGALI LETTER SSA
09B8 ; [.15D0.0020.0002.09B8] # BENGALI LETTER SA
^^^^
primary weights, in sorted order

As you can see, the two additional letters in question,
in the default table, sort in exactly the order you
are suggesting, and as I said, the position in the
*code chart* doesn't matter.

Post by Gautam Sengupta
If it is merely because there
are script-specific sorting mechanisms already in
place, then it's just a bad excuse for a sloppy job. I
sincerely hope there is more to it than just that.

It truly does not matter. *No* script in the Unicode
Standard is encoded completely in a collation order.
*All* scripts must be handled via weight tables in
order to produce desired sorting behavior. That is
true for Latin, Greek, Cyrillic, ..., as well as Devanagari,
Bengali, Gujarati, ..., so this is nothing particularly
different about the encoding of Bengali.

But be that as it may, they (TDIL) have nothing to
do with the code point choices in the range
U+09E0..U+09FF ...

Of course, once they got involved. And they have been
making suggestions ever since. But you need to recognize
that the particular characters you are concerned about
were standardized and published by ISO in 1993 (based,
it is true, on charts published by Unicode even earlier,
which in turn were based on the ISCII standard),
well before the Government of India became a member of
the Unicode Consortium.

--Ken

Post by Gautam Sengupta
Did
they try to intervene suo moto? Will a Unicode
official kindly let us know? Best, -Gautam.

Gautam Sengupta

2003-10-11 10:48:40 UTC