Discussion:
Internal Representation of Unicode
m***@users.sourceforge.net
2003-09-26 00:53:07 UTC
Permalink
Hi,

In a plain text environment, there is often a need to encode more than
just the plain character. A console, or terminal emulator, is such an
environment. Therefore I propose the following as a technical report
for internal encoding of unicode characters; with one goal in mind:
character equalence is binary equalence.

Since I'm using 64 bits, I call it Excessive Memory Usage Encoding, or
EMUE.

I thought of dividing the 64 bit code space into 32 variably wide
plains, one for control characters, one for latin characters, one for
han characters, and so on; using 5 bits and the next 3 fixed to zero
(for future expansion and alignment to an octet).

I call plain 0 control characters and won't discuss it further.

Plain 1, I had intended for latin characters with the following
encoding method in mind:

bits 63..59 58..56 55..40 39..32 31..24 23..16 15..8 7..0
+-------+------+------+------+------+------+------+------+
| plain | zero | attr | res | uacc | lacc | res | char |
+-------+------+------+------+------+------+------+------+

* Plain Plain (5 bits)
* Zero Zero bits (3 bits)
* Attr Attributes (16 bits)
* Res Reserved (8 bits)
* Uacc Upper Accent (8 bits)
* Lacc Lower Accent (8 bits)
* Res Reserved (8 bits)
* Char Character (8 bits)

All of these fields are actually implementation defined, with just one
rule for char: don't include characters that can be made with
combinations, that's what the accent fields are for. This allows for
255 upper and lower accents which should be enough -- for now.
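
As an illustration only, the layout in the diagram above could be packed and unpacked along these lines. This is a minimal sketch: the names `FIELDS`, `pack`, and `unpack`, the split of the two reserved fields into `res1` and `res2`, and the accent index used in the example are all assumptions made for the illustration, not part of the proposal.

```python
# Hypothetical field layout for a "plain 1" (Latin) cell, read off the
# diagram above: (name, width in bits), most significant field first.
FIELDS = [("plain", 5), ("zero", 3), ("attr", 16), ("res1", 8),
          ("uacc", 8), ("lacc", 8), ("res2", 8), ("char", 8)]


def pack(**values):
    """Pack named field values into one 64-bit integer."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word):
    """Split a 64-bit integer back into its named fields."""
    out = {}
    for name, width in reversed(FIELDS):
        out[name] = word & ((1 << width) - 1)
        word >>= width
    return out


# 'e' carrying a made-up accent index 1 in the upper-accent field.
cell = pack(plain=1, uacc=1, char=ord("e"))
```

Two cells built from the same field values compare equal as plain integers, which is the "binary equalence" goal stated at the top of the message.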

For Han characters I thought of the following encoding method (with no
particular plain in mind):

bits 63..59 58..56 55..40 39..32 31 .. 0
+-------+------+------+-------+--------------------------+
| plain | zero | attr | style | char |
+-------+------+------+-------+--------------------------+

* Plain Plain (5 bits)
* Zero Zero bits (3 bits)
* Attr Attributes (16 bits)
* Style Stylistic Variation (8 bits)
* Char Character (32 bits)

Again, all fields are implementation defined. Telling something like
a terminal emulator what stylistic variation to use is outside the
scope of this email, but for attributes, there are standardized escape
sequences; but I suspect language tags can be used.

I was also thinking of a plain for punctuation and symbolic characters.

I will be pleased if anyone can come up with better encoding methods
than I did, and I call upon other people to come up with encodings for
scripts I know nothing about, such as arabic and others. Then let's
wrap it up in a technical report and be done with it ;)


Any comments?

Johann
--
Sometimes I do not think at all! Does that mean I don't exist
in the mean time?



------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html


Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
John Cowan
2003-09-26 03:22:10 UTC
Permalink
Post by m***@users.sourceforge.net
All of these fields are actually implementation defined, with just one
rule for char: don't include characters that can be made with
combinations, that's what the accent fields are for. This allows for
255 upper and lower accents which should be enough -- for now.
The problem is that multiple accents above are quite common -- Vietnamese
depends on them heavily. There may also be multiple accents below,
for all I know.
--
John Cowan http://www.ccil.org/~cowan ***@reutershealth.com
Be yourself. Especially do not feign a working knowledge of RDF where
no such knowledge exists. Neither be cynical about RELAX NG; for in
the face of all aridity and disenchantment in the world of markup,
James Clark is as perennial as the grass. --DeXiderata, Sean McGrath


m***@users.sourceforge.net
2003-09-26 03:52:23 UTC
Permalink
Hi,
Post by John Cowan
The problem is that multiple accents above are quite common -- Vietnamese
depends on them heavily. There may also be multiple accents below,
for all I know.
That does not have to be a problem, as long as there are no more than
255 accents and combinations of them. As for vietnamese, I just don't
know how many there are, or how many characters they use.


Johann
--
Emacs is not a text editor -- it's a way of life



Doug Ewell
2003-09-26 05:44:40 UTC
Permalink
Post by m***@users.sourceforge.net
That does not have to be a problem, as long as there are no more than
255 accents and combinations of them. As for vietnamese, I just don't
know how many there are, or how many characters they use.
You'll need UTF-8 and a fairly comprehensive font to read the following.

For Vietnamese, you should count on supporting the following vowels:

a à ả ã á ạ ă ằ ẳ ẵ ắ ặ â ầ ẩ ẫ ấ ậ e è ẻ ẽ é ẹ ê ề ể ễ ế ệ i ì ỉ ĩ í ị
o ò ỏ õ ó ọ ô ồ ổ ỗ ố ộ ơ ờ ở ỡ ớ ợ u ù ủ ũ ú ụ ư ừ ử ữ ứ ự y ỳ ỷ ỹ ý ỵ

the following consonant (in addition to most other English consonants):

đ

and this currency sign:

₫

For purposes of your mechanism, you can think of each vowel as having up
to 2 accents: (upper, right-attached, or none) plus (upper, lower, or
none). The way Vietnamese think of it is that the circumflex, breve,
and horn are part of the base letter (making a total of 12 base vowels),
whereas the grave, hook above, tilde, acute, and dot below are
considered diacritics (6 × 12 = 72 total vowels). All combinations are
possible.
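
The 6 × 12 count above can be reproduced with a short script. This is only a sketch: it assumes Python's standard `unicodedata` module, and the names `BASE_VOWELS`, `TONE_MARKS`, `vowels`, and `max_marks` are invented for the illustration.

```python
import unicodedata

# The 12 base vowels (a, a-breve, a-circumflex, e, e-circumflex, i, o,
# o-circumflex, o-horn, u, u-horn, y), written with escapes so the
# source file's own normalization does not matter.
BASE_VOWELS = "a\u0103\u00e2e\u00eaio\u00f4\u01a1u\u01b0y"

# No mark, grave, hook above, tilde, acute, dot below.
TONE_MARKS = ["", "\u0300", "\u0309", "\u0303", "\u0301", "\u0323"]

# Compose every base vowel with every tone mark.
vowels = [unicodedata.normalize("NFC", base + tone)
          for base in BASE_VOWELS for tone in TONE_MARKS]

# Largest number of combining marks any vowel needs when fully decomposed.
max_marks = max(
    sum(1 for ch in unicodedata.normalize("NFD", v) if unicodedata.combining(ch))
    for v in vowels
)
```

Each of the 72 results is a single precomposed character, and none needs more than two combining marks after decomposition, which matches the "up to 2 accents" description.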

Of course, all of the letters (not the dong sign) come in both uppercase
and lowercase.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/



Peter Kirk
2003-09-26 09:30:27 UTC
Permalink
Post by m***@users.sourceforge.net
Hi,
Post by John Cowan
The problem is that multiple accents above are quite common -- Vietnamese
depends on them heavily. There may also be multiple accents below,
for all I know.
That does not have to be a problem, as long as there are no more than
255 accents and combinations of them. As for vietnamese, I just don't
know how many there are, or how many characters they use.
Johann
In Hebrew there are more than 255 accents and combinations of them, if
you count vowel points, dagesh, shin and sin dots as accents. There are
potentially many thousands of combinations. From a quick search,
something like 700-800 are in actual use in the Hebrew Bible.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




Juhani Lehtiranta
2003-09-26 09:16:25 UTC
Permalink
In some scripts there can be even four or five diacritics on one base
letter. E.g. in the Uralic Phonetic Alphabet you can often see combinations
like U+006F + U+0355 + U+032E + U+0307 + U+0304 (base letter + two
diacritics below + two diacritics above).
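
The placement of each mark in that sequence can be checked against the Unicode character database. A minimal sketch, assuming Python's standard `unicodedata` module; the helper name `describe` is made up for the illustration.

```python
import unicodedata

# The sequence quoted above: o + two marks below + two marks above.
sequence = "\u006F\u0355\u032E\u0307\u0304"


def describe(text):
    """Return (code point label, canonical combining class) per character."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in text]


classes = [ccc for _, ccc in describe(sequence)]
below = classes.count(220)  # canonical combining class 220 = below
above = classes.count(230)  # canonical combining class 230 = above
```

With two marks on each side of the base letter, a single "upper accent" slot and a single "lower accent" slot cannot hold the sequence without a lookup table of mark combinations.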

There are also many Latin-based literary languages other than
Vietnamese that can have more than one accent on one base letter
(Livonian, Schwiizertüütsch and so on).

Johann, think twice before assuming that all languages would fit into
your schema.

Juhani Lehtiranta
Post by John Cowan
Post by m***@users.sourceforge.net
All of these fields are actually implementation defined, with just one
rule for char: don't include characters that can be made with
combinations, that's what the accent fields are for. This allows for
255 upper and lower accents which should be enough -- for now.
The problem is that multiple accents above are quite common -- Vietnamese
depends on them heavily. There may also be multiple accents below,
for all I know.
--
John Cowan http://www.ccil.org/~cowan
Be yourself. Especially do not feign a working knowledge of RDF where
no such knowledge exists. Neither be cynical about RELAX NG; for in
the face of all aridity and disenchantment in the world of markup,
James Clark is as perennial as the grass. --DeXiderata, Sean McGrath
--
Juhani Lehtiranta
Kistaantie 1, FIN-01900 Nurmijärvi
(office: Keskustie 14, FIN-01900 Nurmijärvi)
tel. home +358-9-2901575, office +358-9-2767633, fax +358-9-2767644
j***@att.net
2003-09-26 07:03:42 UTC
Permalink
.
Jóhann Gunnar Óskarsson wrote,
Post by m***@users.sourceforge.net
That does not have to be a problem, as long as there are no more than
255 accents and combinations of them. As for vietnamese, I just don't
know how many there are, or how many characters they use.
The Combining Diacritical Marks range of Unicode 4.0 lists 107
combining marks which can be used in any combination. Some
combining marks are supposed to span two base characters.

Peter Constable (IIRC) reported on this list a while ago that there was
a Latin-based writing system used for an indigenous South American
language which stacks up to three marks above.

Best regards,

James Kass
.


P***@sil.org
2003-09-27 05:28:33 UTC
Permalink
Post by j***@att.net
Peter Constable (IIRC) reported on this list a while ago that there was
a Latin-based writing system used for an indigenous South American
language which stacks up to three marks above.
Good memory, James! The language is Ticuna.


Peter


Marco Cimarosti
2003-09-26 10:42:50 UTC
Permalink
Post by m***@users.sourceforge.net
In a plain text environment, there is often a need to encode more than
just the plain character. A console, or terminal emulator, is such an
environment. Therefore I propose the following as a technical report
for internal encoding of unicode characters; with one goal in mind:
character equalence is binary equalence.
I guess you meant "equivalence".

Q1: But what are "character equivalence" and "binary equivalence", and
why did you choose them as your goals?
Post by m***@users.sourceforge.net
I thought of dividing the 64 bit code space into 32 variably wide
plains,
Q2: What are these "plains" for? Why are there 32 of them?
Post by m***@users.sourceforge.net
one for control characters, one for latin characters, one for
han characters,
Q3: Why do you want to treat Latin characters and Han characters
differently?

There is nothing special with Latin or Han characters in Unicode: they are
just 2 of the about 50 scripts currently supported in Unicode. (see
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and
http://www.unicode.org/Public/UNIDATA/Scripts.txt)

Q4: And how do you plan to distinguish them?

Both Latin and Han characters are scattered all over the Unicode space, so
you need to check many ranges to determine which character belongs to which
category.

Q5: And what about all characters which are neither Latin nor Han?
Post by m***@users.sourceforge.net
and so on; using 5 bits and the next 3 fixed to zero
(for future expansion and alignment to an octet).
I call plain 0 control characters and won't discuss it further.
Q6: Why do control characters have a special handling?

Q7: Don't control characters have properties attached like any other
characters?

One example of properties which could be useful to attach to control
character is directionality. E.g., a TAB is always a TAB but, after it
passed through the Bidirectional Algorithm, its directionality can be
resolved to be either LTR or RTL.
Post by m***@users.sourceforge.net
Plain 1, I had intended for latin characters with the following
encoding method in mind:
bits 63..59 58..56 55..40 39..32 31..24 23..16 15..8 7..0
+-------+------+------+------+------+------+------+------+
| plain | zero | attr | res | uacc | lacc | res | char |
+-------+------+------+------+------+------+------+------+
* Plain Plain (5 bits)
* Zero Zero bits (3 bits)
* Attr Attributes (16 bits)
Q8: What kind of information are these three fields for?

Q9: In case your answer to Q8 is "they are application-defined", then
what is the rationale for defining and naming more than one field? I mean:
if they are application-defined, why not leave the task of defining
sub-fields to the application?
Post by m***@users.sourceforge.net
* Res Reserved (8 bits)
* Uacc Upper Accent (8 bits)
* Lacc Lower Accent (8 bits)
Q10: Why do you treat "accents" specially?

They are just characters as any others. In Unicode there is no special
limitation as to how many "accents" can be applied to a base character.
There is also no obligation for accents to have a base character.
Post by m***@users.sourceforge.net
* Res Reserved (8 bits)
* Char Character (8 bits)
Q11: How can you store a Latin character in 8 bits?

Unicode has 938 Latin characters, and their codes range from U+0041 to
U+FF5A.
Post by m***@users.sourceforge.net
All of these fields are actually implementation defined, with just one
rule for char: don't include characters that can be made with
combinations, that's what the accent fields are for.
But characters are not necessarily decomposed into one "Latin character" with
one "upper accent" and one "lower accent". E.g., U+01D5 (LATIN CAPITAL
LETTER U WITH DIAERESIS AND MACRON) decomposes to U+0055 U+0308 U+0304
(LATIN CAPITAL LETTER U, COMBINING DIAERESIS, COMBINING MACRON). Both
COMBINING DIAERESIS and COMBINING MACRON are "upper accents".
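
This decomposition can be verified directly. A minimal sketch, assuming Python's standard `unicodedata` module; the variable names are only illustrative.

```python
import unicodedata

composed = "\u01D5"  # LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON

# Full canonical decomposition (NFD) of the precomposed letter.
decomposed = unicodedata.normalize("NFD", composed)

# Every character with a non-zero combining class is a combining mark.
marks = [ch for ch in decomposed if unicodedata.combining(ch)]
```

Both marks report canonical combining class 230 (above), so a single 8-bit "upper accent" field would have to encode the pair as one table entry.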

Q12: How are you going to deal with a combination of, e.g., a base letter
+ 5 "upper accents" + 3 "lower accents"?
Post by m***@users.sourceforge.net
This allows for 255 upper and lower accents which should be enough -- for
now.

I counted 129 "upper accents". But their codes range from U+0300 to U+1D1AD.

Q13: How are you going to compress these codes into 8 bits? Are you
planning to use a conversion table from the Unicode code to your internal
8-bit code?
Post by m***@users.sourceforge.net
For Han characters I thought of the following encoding method (with no
particular plain in mind):
bits 63..59 58..56 55..40 39..32 31 .. 0
+-------+------+------+-------+--------------------------+
| plain | zero | attr | style | char |
+-------+------+------+-------+--------------------------+
* Plain Plain (5 bits)
* Zero Zero bits (3 bits)
* Attr Attributes (16 bits)
* Style Stylistic Variation (8 bits)
Q14: What kind of information is in field "Style"?

Q15: Why do only Han characters have this?

Letters in many other scripts may have stylistic variations. E.g., "é" is
one and the same character (or combination of characters), but its
typographical shape is different in Italian and Polish.
Post by m***@users.sourceforge.net
* Char Character (32 bits)
Q16: Why 32 bits?

*Any* Unicode code points range from U+0000 to U+10FFFF, so all of them can
fit in 21 bits (or 24, if you want to stick to 8-bit boundaries).
Post by m***@users.sourceforge.net
Again, all fields are implementation defined. Telling something like
a terminal emulator what stylistic variation to use is outside the
scope of this email, but for attributes, there are standardized escape
sequences; but I suspect language tags can be used.
Q17: Why are you mentioning language tags? What do they have to do with
escape sequences?
Post by m***@users.sourceforge.net
I was also thinking of a plain for punctuation and symbolic
characters.
Q18: And what about all other characters, e.g., Arabic letters?
Post by m***@users.sourceforge.net
I will be pleased if anyone can come up with better encoding methods
than I did, and I call upon other people to come up with encodings for
scripts I know nothing about, such as arabic and others. Then let's
wrap it up in a technical report and be done with it ;)
Any comments?
See my 18 questions above.

Some more general comments now. I understand that it can be useful in many
circumstances to internally store character codes together with some kind of
properties. But I fail to understand most points of the architecture you are
proposing. Particularly:

A) I don't see why you want to treat characters of different scripts in
different ways. The purpose of Unicode is exactly to encode any character
from any kind of script in a uniform way. Moreover, determining the script
to which a character belongs is a relatively complex and time-consuming
operation.

B) I don't see why you make all those assumptions about the structure of
the properties attached to characters. If these properties have to be
application defined, let it be application defined... I don't see a reason
for defining all those "Plain", "Zero", "Attr", "Res", "Style" fields: just
put all the available bits together and call them "Properties": it will be
the task of the application programmer to decide how to use these bits.

C) I don't see why you want to store a letter and its "accents" as a single
unit. Besides the fact that this is an impossible task, because a letter can
have an arbitrary number of "accents", I fail to see any need for it. Also
consider that, doing this, a letter and its accent(s) cannot have
*different* properties, and this can be useful in a number of cases. An
example which comes to mind: the entries in the most widespread Italian
dictionary are typed in "bold" type, but the accents on the letters are in
"bold" type if they are mandatory in the orthography and in "normal" type if
they are optional, so you can have a "bold" letter with a "normal" accent.
Another example, best known, is that of Arabic religious text, where the
letters are normally black and the "accents" (representing vowels and other
phonetic data) are red.

If your assumption is that each character plus its attributes will take 64
bits, the logical partition of these bits would be:

Application-defined Properties: 43 bits
Character code: 21 bits

If, for any reason, you want to stick to 8-bit boundaries, you can use this
alternative partition:

Application-defined Properties: 40 bits
Character code: 24 bits

In either case, 40 or 43 bits is a huge space for the properties of a single
character, and there are plenty of possible uses for that space. In the
unlikely case that the needed properties would not fit in 40-43 bits, the
field can be used to store an index to an external array of properties.
However long and complex a text can be, I doubt that 1 or 8 *trillions* of
different character properties will not suffice!

If an application needs "accents" to have the same properties as their base
character, the application can define a special property value which means
"this character inherits the properties of previous character/the character
on its left/the character on its right/the character at position N/etc.".

But if you really want to stick to your (sorry, insane) idea of storing a
letter and its accent in the same unit, I would suggest at least to:
1) limit this to the *first* accent only, without distinguishing
between "upper" and "lower" accents (any subsequent "accent" will take its
own 64-bit entry);
2) encode the "accent" with its regular Unicode code point, rather
than with an ad-hoc 8-bit code.

This would result in this partition:

Application-defined Properties: 22 bits
Base character code: 21 bits
Accent character code: 21 bits

Or, if you need to stick to 8-bit boundaries:

Application-defined Properties: 16 bits
Base character code: 24 bits
Accent character code: 24 bits

A field of 16 or 22 bits is still a fair amount of space, especially if the
application uses it as an index to an external table.

And now comes my last and more fundamental question:


Q0: why do you want to propose all this as a Unicode "technical report"?


Internal data structures and algorithms are, by definition, "internal", so I
see no need of standardizing them, or even of publishing them.

_ Marco


Rick McGowan
2003-09-26 16:05:10 UTC
Permalink
Post by m***@users.sourceforge.net
In a plain text environment, there is often a need to encode more than
just the plain character.
...
Post by m***@users.sourceforge.net
Since I'm using 64 bits, I call it Excessive Memory Usage Encoding, or
EMUE.
...
Post by m***@users.sourceforge.net
I thought of dividing the 64 bit code space into 32 variably wide
plains, one for control characters, one for latin characters, one for
han characters, and so on;
This all seems to me like something of a pointless exercise. Or maybe
you're not making clear who your intended audience of users is and what
problems you're trying to solve.

Decent libraries exist that already do nice things with strings having
attributes. And that, in my opinion, is a better model than bit-hacking in
a 64-bit space with vague implementation-defined attributes that change
depending on the "script" of a character. Such "attributed strings" are
easy to work with and provide a much higher-level model than this.
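
To make the attributed-string model concrete, here is a minimal sketch of the general idea. It is not the Cocoa API: the class `AttributedString` and its methods `set_attributes` and `attributes_at` are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AttributedString:
    """Plain text plus attribute runs stored beside it, not inside it."""
    text: str
    runs: list = field(default_factory=list)  # (start, end, attrs) tuples

    def set_attributes(self, start, end, **attrs):
        """Attach attributes to the half-open index range [start, end)."""
        if not 0 <= start <= end <= len(self.text):
            raise ValueError("range outside the text")
        self.runs.append((start, end, dict(attrs)))

    def attributes_at(self, index):
        """Merge all runs covering index; later runs override earlier ones."""
        merged = {}
        for start, end, attrs in self.runs:
            if start <= index < end:
                merged.update(attrs)
        return merged


# A base letter and its combining accent can carry different attributes.
s = AttributedString("e\u0301")
s.set_attributes(0, 2, color="black")
s.set_attributes(0, 1, bold=True)
```

Because the text stays an ordinary string of code points, string comparison and searching are unaffected by whatever attributes are attached.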

You might want to check out Apple's Cocoa environment, particularly the
definitions of the attributed string classes. For example...
http://developer.apple.com/documentation/Cocoa/Reference/Foundation/Java/Classes/NSAttributedString.html
or even the intro:
http://developer.apple.com/documentation/Cocoa/Conceptual/AttributedStrings/index.html

I'm sure there are libraries with similar capabilities for storing
characters + attributes in Java and other languages, I'm just not familiar
with them. Maybe some of the developers can chime in with their favorite
attributed string libraries. Even if you don't use one, you might find the
attributed string model educational.

(All of the above of course reflects only my personal opinion.)

Rick



Jill Ramonsky
2003-09-30 11:19:00 UTC
Permalink
Ludwig, this Pastoral Symphony of yours all seems to me like something
of a pointless exercise.
And Albert, this "Theory of Relativity" of yours all seems to me like
something of a pointless exercise.

Never discourage someone else's creativity.
Jill
-----Original Message-----
Sent: Friday, September 26, 2003 5:05 PM
Subject: Re: Internal Representation of Unicode
This all seems to me like something of a pointless exercise.
John Cowan
2003-09-30 15:15:07 UTC
Permalink
Post by Jill Ramonsky
Ludwig, this Pastoral Symphony of yours all seems to me like something
of a pointless exercise.
And Albert, this "Theory of Relativity" of yours all seems to me like
something of a pointless exercise.
Never discourage someone else's creativity.
The whole point of standardization is to *redirect* (not discourage) people's
creativity into useful channels. Isaac Newton spent an unconscionable amount
of time, by our standards, messing about with astrology and numerology --
far more than he ever put into physics or calculus. The "standardization"
of science since his day has helped reduce such effects.
--
John Cowan http://www.ccil.org/~cowan ***@reutershealth.com
Please leave your values Check your assumptions. In fact,
at the front desk. check your assumptions at the door.
--sign in Paris hotel --Cordelia Vorkosigan


Asmus Freytag
2003-09-30 18:01:26 UTC
Permalink
Post by John Cowan
Isaac Newton spent an unconscionable amount
of time, by our standards, messing about with astrology and numerology
One of the aspects of character encoding and standardization that seems to
have an unholy fascination for people is its numerical aspect. It starts
with the catalog number for 10646, which was deliberately jiggered to
incorporate the number 646, which is the catalog number for the 7-bit
standards. It continues with the desire to see certain characters at
specific code locations (for example the byte order mark) and continues
with the never-ending stream of (re-)encoding forms.

It's just human nature, I guess.

A./


John Jenkins
2003-09-30 18:40:07 UTC
Permalink
Post by Asmus Freytag
Post by John Cowan
Isaac Newton spent an unconscionable amount
of time, by our standards, messing about with astrology and numerology
One of the aspects of character encoding and standardization that
seems to have an unholy fascination for people is its numerical
aspect. It starts with the catalog number for 10646, which was
deliberately jiggered to incorporate the number 646, which is the
catalog number for the 7-bit standards. It continues with the desire
to see certain characters at specific code locations (for example the
byte order mark) and continues with the never-ending stream of
(re-)encoding forms.
It's just human nature, I guess.
Maybe we should add something to the submission form: "Has this
proposal been approved by a numerologist?"

========
John H. Jenkins
***@apple.com
***@mac.com
http://homepage.mac.com/jhjenkins/



j***@spin.ie
2003-10-01 08:41:41 UTC
Permalink
Post by John Cowan
Post by Asmus Freytag
Post by John Cowan
Isaac Newton spent an unconscionable amount
of time, by our standards, messing about with astrology and numerology
One of the aspects of character encoding and standardization that
seems to have an unholy fascination for people is its numerical
aspect. It starts with the catalog number for 10646, which was
deliberately jiggered to incorporate the number 646, which is the
catalog number for the 7-bit standards. It continues with the desire
to see certain characters at specific code locations (for example the
byte order mark) and continues with the never-ending stream of
(re-)encoding forms.
It's just human nature, I guess.
Maybe we should add something to the submission form: "Has this
proposal been approved by a numerologist?"
First they'd want numeric value properties added to the Hebrew and Greek letters, then when they came to do the same for the Latin letters the ensuing flamewar would bring the whole effort to a standstill.

Still, there are good reasons for the BOM being where it is...






John Cowan
2003-10-01 11:12:12 UTC
Permalink
Post by j***@spin.ie
First they'd want numeric value properties added to the Hebrew and
Greek letters, then when they came to do the same for the Latin letters
the ensuing flamewar would bring the whole effort to a standstill.
Numeric values for Hebrew, Greek, and Cyrillic make a lot of sense, actually.
--
John Cowan  ***@reutershealth.com  http://www.ccil.org/~cowan
I am expressing my opinion. When my honorable and gallant friend is called,
he will express his opinion. This is the process which we call Debate.
        --Winston Churchill


Jill Ramonsky
2003-10-01 10:45:36 UTC
Permalink
Yeah, but dude, wasting time on stupid ideas goes with the territory if
you happen to be a creative genius. Some of your ideas won't work.
Others will be magnificent. I'd put good money on the notion that if
Newton had been prevented from pursuing astrology or numerology, this
restriction would have had serious negative consequences for his genius,
and then maybe we wouldn't have calculus or the laws of motion either.
Creativity is all about playing with your mind, not about playing in a
sandbox. Anyone who doesn't understand that is doomed to constantly
cripple the expression of genius, to the detriment of society as a whole.

I can see why someone would want to make a console or terminal emulator
work with Unicode. I've messed around myself with ideas to make it work
(haven't come up with anything yet though). I say, go for it Johann. If
it turns out to have been a good idea, people will use it. If not, you
will have learned a great deal. It's definitely a no-lose situation.

Jill
-----Original Message-----
Isaac Newton spent an unconscionable amount of time, by our
standards, messing about with astrology and numerology -- far more than
he ever put into physics or calculus. The "standardization" of science
since his day has helped reduce such effects.