Discussion:
Fun with proof by analogy, was Re: Mojibake on my Web pages
j***@spin.ie
2003-09-25 08:51:59 UTC
Permalink
Suppose you made a document and sent it to me via conventional post.
The last agent handling the document would be the mail carrier.
Does the mail carrier have the right to open the mailing and
replace your document with garbage?
No; however, if I receive a letter in the post written in German, I'm going to ask someone to translate it rather than try to cope with a language (cf. encoding) I don't understand.

Besides, what is happening here isn't the server replacing the document with garbage; it's the server misidentifying what the document is. That is analogous either to our hypothetical translator having a breakdown, insisting that all of our mail is German and handing us non sequiturs as "translations", or to the postal service getting the delivery wrong (which has certainly happened to my mail).
Author = Host
Document = Wine
Reader = Guest
Server = Cup
If the host pours a cup of wine for the guest, would we allow a
mere cup to adulterate our wine?
The argument only holds as far as the analogies hold (both the analogy with snail mail and the one you actually refer to as an analogy). These analogies do hold in certain cases, and the case that started the thread is an example, but they do not hold in the general case. In other scenarios better analogies would be:

Author = Scribe
Document = Draft
Reader = em, Reader
Server = Editor.

Or Author = scattered data sources of varying degrees of reliability - Server = researcher.

In general, from the browser's perspective the server is the author (which may or may not be an accurate view of what goes on "behind the scenes"). Re-encoding, if done right, can be very useful in making web documents more widely accessible.

Of course we'll soon be able to just rely on assuming that every step in the process can understand UTF-8 and UTF-16 :)






------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html


Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
Peter Kirk
2003-09-25 11:32:33 UTC
Permalink
Post by j***@spin.ie
Suppose you made a document and sent it to me via conventional post.
The last agent handling the document would be the mail carrier.
Does the mail carrier have the right to open the mailing and
replace your document with garbage?
No, however if I receive a letter in the post written in German I'm going to ask someone to translate it rather than try to cope with a language (c.f. encoding) I don't understand.
Yes, if that's what you ask for. But as I know some German I may prefer
to do my own translation. And if the recipient is a German who knows no
English, they certainly aren't going to be amused if their letters get
translated whether they want them to be or not. So the mail carrier
should do this only if specifically asked to do so.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




j***@spin.ie
2003-09-25 11:05:50 UTC
Permalink
Post by j***@spin.ie
Suppose you made a document and sent it to me via conventional post.
The last agent handling the document would be the mail carrier.
Does the mail carrier have the right to open the mailing and
replace your document with garbage?
No, however if I receive a letter in the post written in German I'm going
to ask someone to translate it rather than try to cope with a language (c.f.
encoding) I don't understand.
Yes, if that's what you ask for. But as I know some German I may prefer
to do my own translation. And if the recipient is a German who knows no
English, they certainly aren't going to be amused if their letters get
translated whether they want them to be or not. So the mail carrier
should do this only if specifically asked to do so.
Indeed. Remember, the problem here isn't a server performing translation, transliteration or re-encoding, but rather a server misidentifying an encoding (hence my analogy of the translator having a nervous breakdown; that, and the fact that the image struck me as funny).

However, to enable a correctly functioning server to perform such re-encoding *when asked to do so*, we have to have the rule that HTTP headers override embedded self-description for text-based formats. This causes problems in cases like those described, but not when the web server has a rough idea of what the hell it is doing.

One could argue against the rule of headers having precedence on the basis that it is brittle, but it is no more brittle than trusting copy-and-paste <meta/> elements, which are also likely to be wrong (trust me, I've seen enough that my anecdotal experience is approaching statistical validity).
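
As a rough illustration of the precedence rule, here is a minimal Python sketch of how a client might choose the encoding for a text document. The function name, the regular expressions and the ISO-8859-1 fallback are assumptions made for this example, not any browser's actual algorithm; real browsers do considerably more.

```python
import re

# Hypothetical, simplified patterns; real parsers are far more forgiving.
_HEADER_CHARSET = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE
)


def effective_charset(content_type_header, body, default="iso-8859-1"):
    """Return the charset a client would use if the HTTP header wins."""
    # 1. A charset parameter on the Content-Type header takes precedence.
    if content_type_header:
        match = _HEADER_CHARSET.search(content_type_header)
        if match:
            return match.group(1).lower()
    # 2. Otherwise fall back to the document's own <meta> declaration.
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Otherwise use a default (assumed here to be ISO-8859-1).
    return default


body = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=utf-8"></head><body>...</body></html>'
)

# A misconfigured server: the page says UTF-8, the header says ISO-8859-1.
print(effective_charset("text/html; charset=ISO-8859-1", body))
# A server that stays silent about the charset: the <meta> element is used.
print(effective_charset("text/html", body))
```

The first call reproduces the situation that started the thread: the document describes itself correctly, but the header is believed instead.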

But one day it will all be Unicode... one day...






P***@sil.org
2003-09-26 07:14:16 UTC
Permalink
Post by j***@spin.ie
The last agent handling the document would be the mail carrier.
Does the mail carrier have the right to open the mailing and
replace your document with garbage?
No, however if I receive a letter in the post written in German I'm
going to ask someone to translate it rather than try to cope with a
language (c.f. encoding) I don't understand.
Unlike James's cup of wine, this really is a good analogy. Suppose the
document is stored on the server in ISO 8859-1 and the browser requesting
the page understands only EBCDIC. The server must convert it -- if it
doesn't, it will appear on the client as complete garbage. As Jon
mentioned, the server is the last one to touch it, and this illustrates
why it is appropriate for the server to touch it.
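
The conversion described here can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper name `transcode` is made up, and `cp037` (one of Python's EBCDIC codecs) stands in for whichever EBCDIC code page the client actually uses.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the stored encoding and re-encode them for the client."""
    return data.decode(source).encode(target)


# The document as stored on the server, in ISO 8859-1.
stored = "Grüße".encode("iso-8859-1")

# What the server would send to an EBCDIC-only client.
sent = transcode(stored, "iso-8859-1", "cp037")

# The two byte sequences differ, but both decode to the same text.
print(stored != sent)
print(sent.decode("cp037"))
```

Note that the server has to know the source encoding correctly for this to work; if it guesses wrongly, the result is garbage in any target encoding.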


Peter


Peter Kirk
2003-09-26 09:21:59 UTC
Permalink
Post by P***@sil.org
Post by j***@spin.ie
The last agent handling the document would be the mail carrier.
Does the mail carrier have the right to open the mailing and
replace your document with garbage?
No, however if I receive a letter in the post written in German I'm
going to ask someone to translate it rather than try to cope with a
language (c.f. encoding) I don't understand.
Unlike James's cup of wine, this really is a good analogy. Suppose the
document is stored on the server in ISO 8859-1 and the browser requesting
the page understands only EBCDIC. The server must convert it -- if it
doesn't, it will appear on the client as complete garbage. As Jon
mentioned, the server is the last one to touch it, and this illustrates
why it is appropriate for the server to touch it.
Peter
Is server software actually obliged to perform such conversions on
request? Surely, rather, browsers should be expected to support a
certain minimum set of encodings, or else it should be left to the
content provider and the reader and/or their software to agree on
something acceptable. After all, if someone in China sends me snail
mail, the mail carrier is not under any obligation to translate it for
me. On the contrary, I would be offended if they tried, without my
explicit permission, on the basis that the content of the letter is none
of their business. I need to agree with the sender to send it in English
rather than Chinese, or else get it translated myself.

In any case, I would assume that in practice any browser can at
least understand ASCII, and if presented with a page in UTF-8 will at
worst display 0020-007F correctly and the rest as some kind of mojibake.
And if it can't understand any other script in UTF-8, chances are it
can't understand it in whatever other encoding it is presented with, so
there is little gained by converting it to some specific code page.
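
A small Python sketch of this worst case, assuming a client that wrongly treats UTF-8 bytes as windows-1252 (the sample string and the choice of windows-1252 are illustrative assumptions):

```python
page = "Price: £5 or €7"

wire = page.encode("utf-8")    # what the server sends
seen = wire.decode("cp1252")   # what a client that guesses wrongly displays

# Bytes below 0x80 mean the same thing in ASCII, windows-1252 and UTF-8,
# so the ASCII part of the page survives untouched.
ascii_part = bytes(b for b in wire if b < 0x80).decode("ascii")

print(seen)        # the pound and euro signs come out as mojibake
print(ascii_part)  # the ASCII characters are intact
```

Only the non-ASCII characters are damaged: each one turns into two or three unrelated characters, while everything in the ASCII range is displayed as intended.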
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




P***@sil.org
2003-09-27 05:28:32 UTC
Permalink
Post by Peter Kirk
Post by P***@sil.org
Unlike James's cup of wine, this really is a good analogy. Suppose the
document is stored on the server in ISO 8859-1 and the browser requesting
the page understands only EBCDIC. The server must convert it -- if it
doesn't, it will appear on the client as complete garbage. As Jon
mentioned, the server is the last one to touch it, and this illustrates
why it is appropriate for the server to touch it.
Is server software actually obliged to perform such conversions on
request? Surely, rather, browsers should be expected to support a
certain minimum set of encodings...
Folks, feel free to spend your time bantering on about whether something
should or shouldn't do this or that. But while you're at it, if you want
to know whether the http encoding declaration is supposed to have
precedence over the encoding declaration inside the HTML doc, go read the
specs to find the definitive answer.



Peter


j***@att.net
2003-09-26 08:46:43 UTC
Permalink
.
Peter Constable wrote,
Post by P***@sil.org
Post by j***@spin.ie
No, however if I receive a letter in the post written in German I'm
going to ask someone to translate it rather than try to cope with a
language (c.f. encoding) I don't understand.
Unlike James's cup of wine, this really is a good analogy. Suppose the
document is stored on the server in ISO 8859-1 and the browser requesting
the page understands only EBCDIC. The server must convert it -- if it
doesn't, it will appear on the client as complete garbage. As Jon
mentioned, the server is the last one to touch it, and this illustrates
why it is appropriate for the server to touch it.
But, this simply isn't the case with Doug Ewell's web pages. Doug's
pages are properly encoded using the world's standard for text
encoding and properly tagged. The server isn't performing any
conversion, it's just adulterating the content of the web pages
by adding an incorrect protocol resulting in the display of
mojibake.

Jon Hanna wrote,
Post by P***@sil.org
However to enable a correctly functioning server to perform
such re-encoding *when asked to do so* we have to have the rule
that HTTP-headers over-ride embedded self-description for
text-based formats. This causes problems in cases like those
described, but not when the webserver has a rough idea of
what the hell it is doing.
This is the operative phrase, "when *asked* to do so".
Post by P***@sil.org
Author = Scribe
Document = Draft
Reader = em, Reader
Server = Editor.
But, the notion that it is acceptable for a server to blithely assume
that any given user is incompetent is repugnant. I no more want
my server to generate incorrect protocols for my web pages than
I want my server to run a spell-checker on the contents.

Fortunately, rather than Doug's server assuming incompetence,
it appears to be merely over-reacting to a mis-perceived
security threat.

Deepayan Sarkar wrote,
Post by P***@sil.org
... Or that any sufficiently advanced cup is allowed to take action
to remove any poisonous substance from the wine served in it.
Ethyl alcohol is toxic, but, as miracles go... our postulated sophisticated
cup's ability to transform wine into water probably wouldn't be as
widely acclaimed as the ability to do the opposite.

Best regards,

James Kass
.


P***@sil.org
2003-09-27 05:28:32 UTC
Permalink
Post by j***@att.net
But, this simply isn't the case with Doug Ewell's web pages. Doug's
pages are properly encoded using the world's standard for text
encoding and properly tagged. The server isn't performing any
conversion, it's just adulterating the content of the web pages
by adding an incorrect protocol resulting in the display of
mojibake.
Doug's server may be doing the wrong thing, but that isn't a
counterargument to the general principle of whether the browser should
believe what the server says or what the document says about the encoding.
That was the question to which I and, I think, Jon were responding.


Peter


j***@spin.ie
2003-09-26 09:29:18 UTC
Permalink
Post by j***@att.net
But, the notion that it is acceptable for a server to blithely assume
that any given user is incompetent is repugnant. I no more want
my server to generate incorrect protocols for my web pages than
I want my server to run a spell-checker on the contents.
Fortunately, rather than Doug's server assuming incompetence,
it appears to be merely over-reacting to a mis-perceived
security threat.
Doug's server isn't assuming incompetence - Doug's server *is* incompetent.

Indeed, when a server does re-encode, it isn't assuming incompetence either; it needs to trust the author about which encoding the source is in, in order to re-encode successfully.

The issue is with the browser trusting the server over the author. The browser doesn't share our knowledge of Doug's competence or of his server's incompetence and assumes the server is reasonably competent (any fall-back behaviour can only kick in if something proves that the server messed up).

For an example of what happens when the browser doesn't trust the server, try sending an HTML source as plain text to IE.






j***@spin.ie
2003-09-26 09:34:29 UTC
Permalink
Post by Peter Kirk
Is server software actually obliged to perform such conversions on
request?
No, there is no obligation.

Post by Peter Kirk
Surely, rather, browsers should be expected to support a
certain minimum set of encodings,
Ah but how minimum is acceptable?

Of course, and I've said this already in this thread, we can now just make sure that every server and every client supports UTF-8 and UTF-16. However, it will be some time before servers can assume all browsers can accept these, and before all browsers can assume that all servers can send them. The rule of HTTP headers overriding embedded self-description is going to be necessary until this has come to pass.
Even after then it's going to be necessary, as there is only one HTTP header which states encoding, but there is an unlimited number of mechanisms for self-description in an unlimited number of potential document types.
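
For illustration, "being asked" can take the form of the HTTP Accept-Charset request header. The following Python sketch shows one way a server might pick an encoding from it. The function name and the list of encodings the server can produce are assumptions for this example, and the parsing is deliberately simplified; in particular, it ignores the special default that HTTP/1.1 gives ISO-8859-1 when it is not mentioned.

```python
def pick_charset(accept_charset, available=("utf-8", "iso-8859-1")):
    """Pick the client's most preferred charset that the server can produce.

    Returns None when the client accepts nothing the server can send.
    """
    prefs = []
    for item in accept_charset.split(","):
        name, _, params = item.strip().partition(";")
        quality = 1.0  # a charset listed without q= has full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat an unparsable q-value as "not acceptable"
        if name:
            prefs.append((quality, name.strip().lower()))
    # Stable sort: header order is kept among equal q-values.
    prefs.sort(key=lambda pref: -pref[0])
    for quality, name in prefs:
        if quality <= 0:
            continue
        if name == "*":
            return available[0]
        if name in available:
            return name
    return None


print(pick_charset("iso-8859-5, utf-8;q=0.8"))
print(pick_charset("utf-8;q=0.5, iso-8859-1"))
print(pick_charset("ebcdic-cp-us"))
```

Whatever the server chooses, it then has to state that choice in the Content-Type header, which is exactly why the header has to be able to override what the document says about itself.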






Peter Kirk
2003-09-26 12:04:22 UTC
Permalink
Post by j***@spin.ie
Post by Peter Kirk
Is server software actually obliged to perform such conversions on
request?
No, there is no obligation.
Post by Peter Kirk
Surely, rather, browsers should be expected to support a
certain minimum set of encodings,
Ah but how minimum is acceptable?
Of course, and I've said this already in this thread, we can now just make sure that every server and every client supports UTF-8 and UTF-16. However, it will be some time before servers can assume all browsers can accept these, and before all browsers can assume that all servers can send them. ...
Since there is plenty of good and free browser software available which
does support UTF-8, perhaps servers should start assuming that browsers
can support it, and that will gently encourage software vendors and
users to upgrade. Well, actually it won't affect most US users as they
mostly only use ASCII. For us in the UK and Ireland, we will just get
mojibake pounds and euros.

Anyway, isn't this the way W3C standards are going? I thought they were
moving to XML compatibility which implies UTF-8 support. Browsers which
can't support the latest W3C standards will surely become obsolete very
quickly.
Post by j***@spin.ie
... The rule of HTTP headers overriding embedded self-description is going to be necessary until this has come to pass.
Even after then it's going to be necessary, as there is only one HTTP header which states encoding, but there is an unlimited number of mechanisms for self-description in an unlimited number of potential document types.
This follows only if you accept the principle that the carrier has the
duty to ensure that the recipient can understand what they receive. I
don't accept that the carrier has the duty or even the right to do that.
In fact I would suggest that it is an infringement of my civil liberties
to do so just as much as it would be for the snail mail service to
censor or even reformat my mail.

Perhaps they have the right to check for security holes, yes, but that
is a separate issue. A carrier has the right to check for security
issues which might compromise their service, yes, and perhaps the right
to check for certain kinds of illegal activity which might include
deliberately spreading viruses - but only if such things are clearly set
out in the carrier's conditions of service or by law. But the carrier
does not have the right to mess around with the content because it
thinks that there might be some possibility of compromising the security
of ill-configured software at the recipient. The recipient is
responsible for their own security, and should not rely on carriers for
this except as part of a specific value-added agreement.

Summary: Servers, keep your hands off my e-mail and web pages! What I
have written I have written, and if it's garbage that's my problem, not
yours. If it really is a virus etc, I'll let you make that an exception,
but you must have real evidence, not just an assumption that anything
with a particular encoding is dangerous.

Well, that's my opinion, anyway.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




John Cowan
2003-09-26 12:33:25 UTC
Permalink
Post by Peter Kirk
Since there is plenty of good and free browser software available which
does support UTF-8, perhaps servers should start assuming that browsers
can support it, and that will gently encourage software vendors and
users to upgrade.
Alas, we are in a cleft stick: almost all users are now using a
browser that is not going to be upgraded, short of replacing their
operating system.
Post by Peter Kirk
Anyway, isn't this the way W3C standards are going? I thought they were
moving to XML compatibility which implies UTF-8 support. Browsers which
can't support the latest W3C standards will surely become obsolete very
quickly.
Technically obsolete does not mean dead.
--
I don't know half of you half as well John Cowan
as I should like, and I like less than half ***@reutershealth.com
of you half as well as you deserve. http://www.ccil.org/~cowan
--Bilbo http://www.reutershealth.com


Peter Kirk
2003-09-26 13:08:37 UTC
Permalink
Post by John Cowan
Post by Peter Kirk
Since there is plenty of good and free browser software available which
does support UTF-8, perhaps servers should start assuming that browsers
can support it, and that will gently encourage software vendors and
users to upgrade.
Alas, we are in a cleft stick: almost all users are now using a
browser that is not going to be upgraded, short of replacing their
operating system.
Almost all users of what? This isn't true of Windows, and for better or
for worse the majority of all browser users use Windows. Windows, at
least 98+, nags you to upgrade to the latest version of IE whether you
want to or not - which is annoying on my old PC which hasn't got the
disk space for IE6. Mozilla also nags you to upgrade, so it's not just a
Microsoft thing. And probably most users of non-Windows systems are
either reasonably computer literate or are supported by IT departments
which should do the upgrade.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




John Cowan
2003-09-26 13:16:39 UTC
Permalink
Post by Peter Kirk
Almost all users of what? This isn't true of Windows, and for better or
for worse the majority of all browser users use Windows. Windows, at
least 98+, nags you to upgrade to the latest version of IE whether you
want to or not - which is annoying on my old PC which hasn't got the
disk space for IE6.
True. But whatever isn't fixed in IE6 won't be fixed at all -- no more
upgrades (except, presumably, security-related ones) after that.
Microsoft has said so.
--
Not to perambulate John Cowan <***@reutershealth.com>
the corridors http://www.reutershealth.com
during the hours of repose http://www.ccil.org/~cowan
in the boots of ascension. --Sign in Austrian ski-resort hotel


Peter Kirk
2003-09-26 16:52:44 UTC
Permalink
Post by John Cowan
Post by Peter Kirk
Almost all users of what? This isn't true of Windows, and for better or
for worse the majority of all browser users use Windows. Windows, at
least 98+, nags you to upgrade to the latest version of IE whether you
want to or not - which is annoying on my old PC which hasn't got the
disk space for IE6.
True. But whatever isn't fixed in IE6 won't be fixed at all -- no more
upgrades (except, presumably, security-related ones) after that.
Microsoft has said so.
Well, looks like Microsoft won the browser wars and nearly killed off
Netscape only to hand over their victory on a plate to whoever feels
like taking the prize. Mozilla will be stepping up to take it. The
problem is that this victory is expensive, and doesn't bring revenue
because people have got used to browsers being free.

But to come back to the issue, IE6 has adequate support for UTF-8 and
many other encodings, so this is no excuse for not serving web pages in
UTF-8.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




Dao Xuan Nam
2003-09-29 10:34:40 UTC
Permalink
----- Original Message -----
From: "Peter Kirk" <***@qaya.org>
To: "John Cowan" <***@mercury.ccil.org>
Cc: <***@unicode.org>
Sent: Friday, September 26, 2003 11:52 PM
Subject: Re: Fun with proof by analogy, was Re: Mojibake on my Web pages
Post by Peter Kirk
Post by John Cowan
Post by Peter Kirk
Almost all users of what? This isn't true of Windows, and for better or
for worse the majority of all browser users use Windows. Windows, at
least 98+, nags you to upgrade to the latest version of IE whether you
want to or not - which is annoying on my old PC which hasn't got the
disk space for IE6.
True. But whatever isn't fixed in IE6 won't be fixed at all -- no more
upgrades (except, presumably, security-related ones) after that.
Microsoft has said so.
Well, looks like Microsoft won the browser wars and nearly killed off
Netscape only to hand over their victory on a plate to whoever feels
like taking the prize. Mozilla will be stepping up to take it. The
problem is that this victory is expensive, and doesn't bring revenue
because people have got used to browsers being free.
But to come back to the issue, IE6 has adequate support for UTF-8 and
many other encodings, so this is no excuse for not serving web pages in
UTF-8.
--
Peter Kirk
http://www.qaya.org/
Michael Everson
2003-09-26 21:31:01 UTC
Permalink
And probably most users of non-Windows systems are either reasonably
computer literate or are supported by IT departments which should do
the upgrade.
Or have Macs and don't need any help. :-)
--
Michael Everson * * Everson Typography * * http://www.evertype.com


Peter Kirk
2003-09-26 23:27:18 UTC
Permalink
Post by Michael Everson
And probably most users of non-Windows systems are either reasonably
computer literate or are supported by IT departments which should do
the upgrade.
Or have Macs and don't need any help. :-)
Indeed, unless they are "still running a Mac Plus with Mac OS 3.1 on
it". I knew I had missed out the Mac community, largely because I don't
know enough about Mac browsers except that there is at least one Unicode
compatible one available, even if it is from Microsoft.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




Michael Everson
2003-09-27 09:19:20 UTC
Permalink
Post by Michael Everson
And probably most users of non-Windows systems are either
reasonably computer literate or are supported by IT departments
which should do the upgrade.
Or have Macs and don't need any help. :-)
Indeed, unless they are "still running a Mac Plus with Mac OS 3.1 on it".
In which case I doubt very much if you can use a browser at all....
I knew I had missed out the Mac community, largely because I don't
know enough about Mac browsers except that there is at least one
Unicode compatible one available, even if it is from Microsoft.
What? Sorry, IE5 is not at all Unicode compatible -- at least it does
a woeful job of displaying Unicode text. And it's no longer being
supported for Mac OS. Safari and OmniWeb in my view do the best
rendering; Safari is better at bidi.
--
Michael Everson * * Everson Typography * * http://www.evertype.com


Peter Kirk
2003-09-27 10:28:01 UTC
Permalink
Post by Michael Everson
...
Post by Peter Kirk
I knew I had missed out the Mac community, largely because I don't
know enough about Mac browsers except that there is at least one
Unicode compatible one available, even if it is from Microsoft.
What? Sorry, IE5 is not at all Unicode compatible -- at least it does
a woeful job of displaying Unicode text. ...
Maybe, but it doesn't present UTF-8 as mojibake, surely, and gives no
justification for servers refusing to serve UTF-8 which was the original
point.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




Michael Everson
2003-09-26 21:27:37 UTC
Permalink
Post by John Cowan
Alas, we are in a cleft stick: almost all users are now using a
browser that is not going to be upgraded, short of replacing their
operating system.
Which they all surely will, at some stage. I mean, who's really still
running a Mac Plus with Mac OS 3.1 on it?
--
Michael Everson * * Everson Typography * * http://www.evertype.com


John Cowan
2003-09-26 23:43:51 UTC
Permalink
Post by Michael Everson
Post by John Cowan
Alas, we are in a cleft stick: almost all users are now using a
browser that is not going to be upgraded, short of replacing their
operating system.
Which they all surely will, at some stage. I mean, who's really still
running a Mac Plus with Mac OS 3.1 on it?
Granted. But it's a long, long time to Longhorn (the next release of
Windows, which will presumably have IE 7.0 in it).
--
John Cowan <***@reutershealth.com> www.ccil.org/~cowan www.reutershealth.com
Micropayment advocates mistakenly believe that efficient allocation of
resources is the purpose of markets. Efficiency is a byproduct of market
systems, not their goal. The reasons markets work are not because users
have embraced efficiency but because markets are the best place to allow
users to maximize their preferences, and very often their preferences are
not for conservation of cheap resources. --Clay Shirky


j***@spin.ie
2003-09-26 13:00:17 UTC
Permalink
Post by j***@spin.ie
Post by j***@spin.ie
... The rule of HTTP headers overriding embedded self-description is going
to be necessary until this has come to pass.
Post by j***@spin.ie
Even after then it's going to be necessary as there is only one http header
which states encoding, but there is an unlimited number of mechanisms for
self-description in an unlimited number of potential document types.
This follows only if you accept the principle that the carrier has the
duty to ensure that the recipient can understand what they receive. I
don't accept that the carrier has the duty or even the right to do that.
In fact I would suggest that it is an infringement of my civil liberties
to do so just as much as it would be for the snail mail service to
censor or even reformat my mail.
My webserver has a duty and a right to do that if I want it to. Again, what is happening in Doug's case is clearly an error on the part of the server; I am only saying that the error is not in the policy of HTTP headers overriding document self-description.

Had Doug more control over his server, he might want it to do this sort of re-encoding when a browser requested it. In other scenarios the server might be the closest thing there is to an "author", or the author might provide some XML giving the core of the document, with the server adding other features such as navigation, records of user comments, etc. That puts the server in a far more authoritative position than when it is used for "straight" file transfer (and much of my bread and butter is such systems).

In Doug's case a server should act exactly as you say and leave well alone, but that is not the only case that the protocols involved have to serve.






j***@att.net
2003-09-27 06:50:36 UTC
Permalink
Peter Constable wrote,
Post by P***@sil.org
Folks, feel free to spend your time bantering on about whether something
should or shouldn't do this or that. But while you're at it, if you want
to know whether the http encoding declaration is supposed to have
precedence over the encoding declaration inside the HTML doc, go read the
specs to find the definitive answer.
Here's one of the specs:

http://www.w3.org/TR/html4/charset.html#h-5.2.2


And, here's a page which extrapolates from the specs, please note
that the "TO SUM IT ALL UP" is most instructive:

http://www.webstandards.org/learn/askw3c/dec2002.html

Best regards,

James
.


j***@att.net
2003-09-27 07:12:43 UTC
Permalink
.
Peter Constable wrote,
Post by P***@sil.org
Doug's server may be doing the wrong thing, but that isn't a
counterargument to the general principle of whether the browser should
believe what the server says or what the document says about the encoding.
That was the question to which I and, I think, Jon were responding.
The specs list an order of priority in which character set information
is sought.

First, the browser checks the HTTP header, then the XML declaration
(which is not relevant to HTML), then the HTML meta tag.

Apparently, upon finding character set information, the operation
stops, so if information is present in the HTTP header, the meta
tag won't be consulted.
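
The lookup order described above can be sketched roughly as follows. This is a minimal illustration, not any browser's actual code: the function names are hypothetical, and the regular expressions are a crude stand-in for real header and markup parsing.

```python
import re

def charset_from_http(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    m = re.search(r'charset\s*=\s*"?([^\s";]+)', content_type, re.IGNORECASE)
    return m.group(1).lower() if m else None

def charset_from_xml_decl(body):
    """Return the encoding named in a leading XML declaration, or None."""
    m = re.match(rb'<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']', body)
    return m.group(1).decode("ascii").lower() if m else None

def charset_from_meta(body):
    """Return the charset named in an HTML meta tag, or None."""
    m = re.search(rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)',
                  body, re.IGNORECASE)
    return m.group(1).decode("ascii").lower() if m else None

def pick_charset(content_type, body):
    """Consult each source in priority order and stop at the first hit."""
    sources = (
        ("http", lambda: charset_from_http(content_type)),
        ("xml", lambda: charset_from_xml_decl(body)),
        ("meta", lambda: charset_from_meta(body)),
    )
    for name, lookup in sources:
        value = lookup()
        if value is not None:
            return name, value  # later sources are never looked at
    return None, None
```

With a header of "text/html; charset=ISO-8859-1" and a meta tag that says utf-8, pick_charset reports the HTTP value and never reaches the meta tag, which is exactly the behaviour being complained about.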

This approach seems flawed -- illustrated by the problems caused
by Adelphia's apparent incompetence in this regard.

All of the data should be consulted and there should be some kind
of protocol in place to handle conflicting character set info.

In the event of a conflict between the HTTP header and the HTML meta
tag, of course the browser should believe the HTML meta tag. After
all, who knows better than the author the encoding used to construct
the file? Where the server has performed a character set conversion
upon request from a browser, then, as a part of the character set
conversion process, the HTML meta tag needs to be re-written in case
the page is archived by the visitor for later off-line viewing.

If this were the case, we wouldn't be having this thread.

Best regards,

James Kass
.


John Cowan
2003-09-27 14:47:51 UTC
Permalink
Post by j***@att.net
First, the browser checks the HTTP header, then the XML declaration
(which is not relevant to HTML), then the HTML meta tag.
Apparently, upon finding character set information, the operation
stops, so if information is present in the HTTP header, the meta
tag won't be consulted.
It's worse than that. If the HTTP header says "text/xml" or "text/html",
and no charset information is provided, a fully conforming browser
MUST treat this as if the charset "us-ascii" is specified. That's
just insane, but such are the rules.

Only if there is no header, or if the header says "application/xml",
do we get to proceed to other sources of knowledge.
Post by j***@att.net
All of the data should be consulted and there should be some kind
of protocol in place to handle conflicting character set info.
It *is* in place and fully specified. It's just that most of us
don't care for the results, and most programs don't fully conform
for that reason.
--
Some people open all the Windows; John Cowan
wise wives welcome the spring ***@reutershealth.com
by moving the Unix. http://www.reutershealth.com
--ad for Unix Book Units (U.K.) http://www.ccil.org/~cowan
(see http://cm.bell-labs.com/cm/cs/who/dmr/unix3image.gif)


Francois Yergeau
2003-09-29 14:31:18 UTC
Permalink
Post by John Cowan
It's worse than that. If the HTTP header says "text/xml" or
"text/html",
and no charset information is provided, a fully conforming browser
MUST treat this as if the charset "us-ascii" is specified.
Nit: this is not the case for text/html, which fortunately took exception
from the MIME specs on this. From the HTML 4.01 spec, 5.2.2:

"Therefore, user agents must not assume any default value for the "charset"
parameter."
Post by John Cowan
That's just insane, but such are the rules.
Correct for text/xml. Better not use that, then, and favor application/xml.
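
As a rough sketch of the defaults discussed here, assuming only the three media types named in this exchange (the function name is hypothetical, and a return value of None stands for "no default, look inside the document"):

```python
def default_charset(content_type):
    """Charset a strict client would take from a Content-Type header alone.

    Returns a charset name, or None when the header gives no answer and the
    client should consult the document itself.
    """
    media_type, _, params = content_type.partition(";")
    media_type = media_type.strip().lower()

    # An explicit charset parameter always wins.
    for param in params.split(";"):
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            return value.strip().strip('"').lower()

    if media_type == "text/xml":
        # The MIME default for text/* applies: treat as us-ascii.
        return "us-ascii"

    # text/html: HTML 4.01 section 5.2.2 says no default may be assumed.
    # application/xml: proceed to the XML declaration and autodetection.
    return None
```

Under these assumptions a bare "text/xml" header pins the document to us-ascii, while "text/html" and "application/xml" leave the question open.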
--
François Yergeau


Francois Yergeau
2003-09-29 14:27:18 UTC
Permalink
Post by j***@att.net
In the event of a conflict between the HTTP header and the HTML meta
tag, of course the browser should believe the HTML meta tag. After
all, who knows better than the author the encoding used to construct
the file?
Who knows better the encoding used to *send* the file? The last server to
touch it.

It used to be common, the norm in fact, for Russian servers to store files
in various legacy encodings (KOI-8, 8859-5, DOS-something,...) and to serve
them in some other encoding, after transcoding on-the-fly based on the
User-Agent. There were also transcoding proxies for Asian character sets
that one could use to overcome the limitations of browsers of that era.
These practices were still around when the HTML 4 spec was released in 1997
and no doubt contributed to getting things as they are.
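
A minimal sketch of that kind of on-the-fly transcoding, assuming the server already knows the stored encoding and has somehow decided which encoding the client wants (how a real server mapped User-Agent strings to encodings is not shown, and the function name is hypothetical):

```python
def transcode_for_client(stored_bytes, stored_charset, client_charset):
    """Re-encode a stored file and build the matching Content-Type value."""
    text = stored_bytes.decode(stored_charset)
    # Characters the target encoding lacks become "?" rather than an error.
    body = text.encode(client_charset, errors="replace")
    header = f"text/html; charset={client_charset}"
    return header, body


# A file stored in KOI8-R, sent to a client that wants ISO-8859-5.
stored = "Привет".encode("KOI8-R")
header, body = transcode_for_client(stored, "KOI8-R", "ISO-8859-5")
```

After this step only the server knows what the bytes on the wire are: any meta tag inside the file still names the stored encoding, which is why the header has to take precedence in this scenario.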
Post by j***@att.net
Where the server has performed a character set conversion
upon request from a browser, then, as a part of the character set
conversion process, the HTML meta tag needs to be re-written in case
the page is archived by the visitor for later off-line viewing.
It takes large amounts of tricky code to reliably parse real-life HTML. It
is unreasonable to expect servers, which have no business parsing HTML, to
contain this code. Browsers have it and *they* should adjust the meta tag
when they do a "Save as..."
Post by j***@att.net
If this were the case, we wouldn't be having this thread.
If servers would just shut up when they don't know (as required by the HTML
spec)....
--
François Yergeau


Peter Kirk
2003-09-29 15:35:38 UTC
Permalink
Post by Francois Yergeau
...
It takes large amounts of tricky code to reliably parse real-life HTML. It
is unreasonable to expect servers, which have no business parsing HTML, to
contain this code. ...
Agreed. But if they don't parse the HTML they don't know what the
content of the document is and so they have no business to mess around
with that content by re-encoding it.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




Jill Ramonsky
2003-09-29 15:01:49 UTC
Permalink
I don't see anything wrong with the spec. So far as I can see it is
doing the right thing, although the behaviour of the described server
could be better.

First point - if no information is present, assume "us-ascii". Sounds
/extremely sensible/ to me. ASCII is the intersection of Latin-1, UTF-8,
and various other commonly used encodings. Moreover, in order to even
/read/ the name of the encoding, the name of the encoding must have
itself been encoded in /something/. It makes sense to me to assume the
absolute minimum. If you want more than the minimum, declare your
encoding. This should not be a problem.
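
The "intersection" claim can be checked with a small sketch. The list of encodings is an arbitrary sample chosen for illustration, and the helper name is hypothetical:

```python
def decodings(data, encodings=("ascii", "latin-1", "utf-8", "koi8-r")):
    """Decode the same bytes under several encodings; None marks a failure."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


# Pure ASCII bytes read the same under every encoding in the sample.
plain = decodings(b"charset=utf-8")

# Bytes outside ASCII do not: this is UTF-8 "é" read four different ways.
accented = decodings("é".encode("utf-8"))
```

Here every entry of plain is the same string, while accented is None under ascii and two-character mojibake under latin-1. That is what makes an encoding name readable before the encoding is known, and also why assuming ASCII says nothing about the rest of the document.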

Second point - the "search order" - (1) server; (2) XML tag; (3) HTML
meta tag. This also makes sense to me. Yes, the document author should
know best, but it is the /_server_/, not the /_client_/, which should
take notice of the meta tag.

As far as the browser is concerned, meta tags in the document _/must
not/_ override the headers, as this could result in security holes
exploitable by attackers.

The issue is slightly more complicated. The browser /must/ believe the
HTTP headers. However, if the meta tags and HTTP headers are in conflict
then I believe _the server is at fault_, in not making the correct
declaration. In other words, if the document author says (in a meta tag)
"this is in UTF-8", then the server should (in my opinion) send the
document to the browser with an encoding type of UTF-8. That is, the
server should (again, in my opinion) ensure that the HTTP header is not
in conflict with a meta tag, by changing the HTTP header to match the
meta tag. Even if a server does not do this, though, the browser must
still believe the HTTP header.
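
One way a server could follow this suggestion is sketched below. It is only an illustration of the idea: the function name is hypothetical, a regular expression over the first kilobyte is a crude stand-in for real HTML parsing, and a production server may reasonably decline to look inside documents at all.

```python
import re

# Matches both <meta http-equiv=... content="...; charset=X"> and <meta charset=X>.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE
)

def content_type_for(body, media_type="text/html"):
    """Build a Content-Type value that agrees with the document's meta tag."""
    match = META_CHARSET.search(body[:1024])
    if match:
        declared = match.group(1).decode("ascii")
        return f"{media_type}; charset={declared}"
    # No declaration found: send no charset rather than guess one.
    return media_type
```

A document declaring UTF-8 in its meta tag then goes out as "text/html; charset=UTF-8", so the header and the document cannot disagree.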

Jill
-----Original Message-----
Sent: Saturday, September 27, 2003 3:48 PM
Subject: Re: Fun with proof by analogy, was Re: Mojibake on
my Web pages
Post by j***@att.net
First, the browser checks the HTTP header, then the XML declaration
(which is not relevant to HTML), then the HTML meta tag.
Apparently, upon finding character set information, the operation
stops, so if information is present in the HTTP header, the meta
tag won't be consulted.
It's worse than that. If the HTTP header says "text/xml" or
"text/html",
and no charset information is provided, a fully conforming browser
MUST treat this as if the charset "us-ascii" is specified. That's
just insane, but such are the rules.
Only if there is no header, or if the header says "application/xml",
do we get to proceed to other sources of knowledge.
Post by j***@att.net
All of the data should be consulted and there should be some kind
of protocol in place to handle conflicting character set info.
It *is* in place and fully specified. It's just that most of us
don't care for the results, and most programs don't fully conform
for that reason.
--
Some people open all the Windows; John Cowan
wise wives welcome the spring ***@reutershealth.com
by moving the Unix. http://www.reutershealth.com
--ad for Unix Book Units (U.K.) http://www.ccil.org/~cowan
(see http://cm.bell-labs.com/cm/cs/who/dmr/unix3image.gif)
Peter Kirk
2003-09-29 16:33:25 UTC
Permalink
Post by Jill Ramonsky
...
As far as the browser is concerned, meta tags in the document _/must
not/_ override the headers, as this could result in security holes
exploitable by attackers.
The issue is slightly more complicated. The browser /must/ believe the
HTTP headers. However, if the meta tags and HTTP headers are in
conflict then I believe _the server is at fault_, in not making the
correct declaration. In other words, if the document author says (in a
meta tag) "this is in UTF-8", then the server should (in my opinion)
send the document to the browser with an encoding type of UTF-8. In
other words, the server should (again, in my opinion), ensure that the
HTTP header is not in conflict with a meta tag, by changing the HTTP
header to match the meta tag. However, if a server does not do this,
still, then the browser must believe the HTTP header.
Jill
I know I don't understand all the issues here, but I think I spot one
flaw in the argument. This seems to imply that all security holes are
the work of the content providers and that none are related to the
servers; in other words, that all servers and their administrators are
entirely trustworthy. This is not necessarily true. And if a content
provider can compromise security by confusing encodings, so can a server.

This could become a significant security hole when we get Unicode domain
names. A malicious server administrator could register the mojibake
equivalent of a legitimate security-sensitive domain name and then
deliberately serve the mojibake version to users, etc.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/




j***@spin.ie
2003-09-29 16:02:25 UTC
Permalink
Post by Peter Kirk
Agreed. But if they don't parse the HTML they don't know what the
content of the document is and so they have no business to mess around
with that content by re-encoding it.
There is no re-encoding! There just might be, is all.

There might also be a lot of other things going on, and hence a lot of headers sent. Those sending the headers have a responsibility to ensure their accuracy, and those receiving them have a responsibility to read and act upon them (though not necessarily with blind trust, if that could raise security issues).

Anyway, browsers aren't required to examine <meta/> elements at all, though they may; it's only XML declarations that have a strong place in the list of sources of encoding information.






Francois Yergeau
2003-09-29 17:21:57 UTC
Permalink
Post by Jill Ramonsky
First point - if no information is present, assume "us-ascii".
Sounds extremely sensible to me.
Sounds very misguided to me.
Post by Jill Ramonsky
ASCII is the intersection of Latin-1, UTF-8, and various other
commonly used encodings.
How does that make it more likely that guessing ASCII would be correct?
Post by Jill Ramonsky
Moreover, in order to even read the name of the encoding, the
name of the encoding must have itself been encoded in something.
See Appendix F of the XML spec for how you can do much better than assuming
ASCII to read the encoding name.
Post by Jill Ramonsky
It makes sense to me to assume the absolute minimum. If you want
more than the minimum, declare your encoding. This should not be
a problem.
It makes much more sense to me to assume UTF-8, as XML does. If you want
*less* than that, declare your encoding. This is not a problem.
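
For anyone wondering what the Appendix F approach looks like in practice, here is a rough sketch in Python. It is a simplification, not the full table from the spec: it leaves out the unusual UCS-4 byte orders and EBCDIC, and for UTF-16 without a byte order mark it does not go on to read the declaration. The function name sniff_xml_encoding is made up for this example.

```python
import re

# Byte signatures loosely following Appendix F of the XML 1.0 spec.
# Order matters: 4-byte BOMs must be tested before the 2-byte ones.
_SIGNATURES = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),  # UCS-4 BOM, big-endian
    (b"\xff\xfe\x00\x00", "utf-32-le"),  # UCS-4 BOM, little-endian
    (b"\xef\xbb\xbf", "utf-8"),          # UTF-8 BOM
    (b"\xfe\xff", "utf-16-be"),          # UTF-16 BOM, big-endian
    (b"\xff\xfe", "utf-16-le"),          # UTF-16 BOM, little-endian
    (b"\x00\x00\x00<", "utf-32-be"),     # no BOM: '<' as four bytes
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),        # no BOM: '<?' as two-byte units
    (b"<\x00?\x00", "utf-16-le"),
]

# Encoding declaration in an ASCII-compatible document.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes."""
    for prefix, name in _SIGNATURES:
        if data.startswith(prefix):
            return name
    match = _DECL.match(data)
    if match:
        # The declaration itself is readable because the byte pattern
        # already told us the document is ASCII-compatible.
        return match.group(1).decode("ascii").lower()
    # No BOM and no declaration: XML's default is UTF-8.
    return "utf-8"
```

The point is that the first few bytes narrow the encoding family enough to read the declaration, so nothing needs to be assumed to be plain ASCII up front.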
--
François
Rick McGowan
2003-09-29 17:39:45 UTC
Permalink
François --

You might be interested to know that all of your recent mail has the
following header attached to it! Sounds to me like your outgoing server is
tagging mail, and it's getting things wrong.

Rick
X-Spam-Report: This mail is probably spam. The original message has been
attached along with this report, so you can recognize or block similar
unwanted mail in future. See http://spamassassin.org/tag/ for more
details. Content preview: Jill Ramonsky wrote: > First point - if no
information is present, assume "us-ascii". > Sounds extremely sensible
to me. Sounds very misguided to me. > ASCII is the intersection of Latin-1,
UTF-8, and various other > commonly used encodings. [...] Content
analysis details: (-109.70 points, 5 required) EMAIL_ATTRIBUTION (-6.5
points) BODY: Contains what looks like an email attribution
QUOTED_EMAIL_TEXT (-3.2 points) BODY: Contains what looks like a quoted
email text USER_IN_WHITELIST (-100.0 points)From: address is in the
user's white-list
Jill Ramonsky
2003-09-30 10:44:50 UTC
Permalink
Good point. But there has to be an actual attacker here, as in, a hacker
engaged in a purposefully malevolent attempt to (say) run arbitrary code
on a victim's machine (the victim being an end-user, a web-page
viewer). To achieve this, the attacker must exploit "features" of the
victim's browser. Yes, I was assuming that the attacker was a document
author -- but if the attacker was a server (or at least, a server
administrator), then it's difficult to see what a document author can do
to guard against this. If the server is an attacker, they could of
course modify all documents served anyway, in any manner they chose. In
such a circumstance, document authors would be well advised to move
their documents to another server ... assuming they ever found out.

The attack is only theoretical, so far as I know, but basically it works
like this: the attacker places a link to (say)
"C:\WINNT\SYSTEM32\CMD.EXE (plus some nasty parameters)" in a hyperlink
and encourages you to click on it. If all is well, the browser should
forbid this. But if the string is written in encoding A, and the
browser parses it assuming it to be encoding B, it is possible that the
browser may not recognise the path as being absolute, and so may allow
it. Of course, you'd have to try /really hard/ to find encodings A and
B such that this becomes feasible, but you never know, it might be
doable. Plus, you'd have to find a user dumb enough to be running a
sufficiently old browser that it was still prone to this exploit. (I'm
pretty sure modern browsers will have closed that hole by now, but
again, you never know). But even a buggy and stupid browser will never
fall victim to this exploit if the browser is able to infer the correct
encoding for the document.
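
To make the "checked as encoding A, used as encoding B" idea concrete, here is a toy sketch in Python. The choice of ASCII and UTF-7 as the two encodings is my own assumption for illustration, and looks_absolute is a hypothetical stand-in for whatever check a browser might apply; no claim is made that any real browser worked this way.

```python
import re


def looks_absolute(path: str) -> bool:
    """Hypothetical safety check: does this look like an absolute Windows path?"""
    return re.match(r"[A-Za-z]:[\\/]", path) is not None


# The same bytes, as an attacker might place them in a link target.
raw = b"C+ADo-+AFw-WINNT"

# Checked under one encoding, used under another.
checked_as = raw.decode("ascii")  # what the check sees
used_as = raw.decode("utf-7")     # what a later stage might see

print(looks_absolute(checked_as))
print(looks_absolute(used_as))
```

Read as ASCII, the string has no drive colon or backslash, so the check prints False. Read as UTF-7, the same bytes become C:\WINNT and the check prints True. If both stages agree on the encoding, the mismatch cannot arise.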

But look at it like this. Suppose an HTML document had a meta tag which
claimed: <META HTTP-EQUIV="Content-length" CONTENT=1>. In this
circumstance, which would you prefer to believe: The HTTP Content-length
header? Or the meta tag? (One can certainly imagine buffer-overrun
exploits if browsers were to make the wrong choice).

Of course, having said that, document authors /can/ affect HTTP headers
directly anyway. If the document were to be written in PHP instead of
HTML then a document author could generate any HTTP headers they wanted!
(I've actually done this to deliver documents in UTF-8 against the
server's default). All I can assume is maybe there's some sort of threat
model in place which assumes that anyone who can code in PHP can't
possibly be an attacker! If so, it's clearly nonsense.
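
The same thing can be done from any server-side script, not just PHP. As a sketch only, here is the idea written as a Python WSGI application; the name utf8_page_app and the sample page are invented for this example.

```python
def utf8_page_app(environ, start_response):
    """Minimal WSGI app that declares its own charset in the HTTP header."""
    body = "<p>caf\u00e9</p>".encode("utf-8")
    headers = [
        # The script, not the server default, decides the declared encoding.
        ("Content-Type", "text/html; charset=utf-8"),
        # Length is counted in bytes after encoding, not in characters.
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

Any WSGI server, for instance the standard library's wsgiref.simple_server, could serve this, and the author-chosen charset goes out in the real HTTP header.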

I still maintain, though (in agreement with Jon) that a server should
obey the document author by taking notice of meta tags and transforming
them into HTTP headers. (At the very /least/, it should take the meta tag
as a hint, and use it as an HTTP header if the hint turns out to be true).
To ignore them altogether is just dumb.
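
Here is a rough sketch of the "take it as a hint" behaviour in Python, to show it need not be complicated. Everything in it is an assumption made for illustration: the function name content_type_from_meta, the decision to scan only the first 1024 bytes, and the use of a regular expression rather than a real HTML parser.

```python
import codecs
import re

# Finds charset=... inside a <meta> tag; a crude pattern, not an HTML parser.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE
)


def content_type_from_meta(document: bytes, default: str = "text/html") -> str:
    """Build a Content-Type value, using the meta charset only if it checks out."""
    match = _META_CHARSET.search(document[:1024])
    if not match:
        return default
    label = match.group(1).decode("ascii")
    try:
        codecs.lookup(label)    # is this an encoding we know at all?
        document.decode(label)  # do the bytes actually decode under it?
    except (LookupError, UnicodeDecodeError):
        return default          # hint is wrong or unusable: say nothing
    return f"{default}; charset={label}"
```

A successful decode is only a plausibility test. Any byte sequence decodes as ISO-8859-1, for example, so a wrong Latin-1 claim would pass. It does catch the common case of a document labelled UTF-8 that isn't.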

Jill

PS. I haven't mentioned Unicode domain names. That's a different kettle
of fish altogether. Maybe we could have another thread for that.
-----Original Message-----
Sent: Monday, September 29, 2003 5:33 PM
To: Jill Ramonsky
Subject: Re: Fun with proof by analogy, was Re: Mojibake on
my Web pages
I know I don't understand all the issues here, but I think I spot one
flaw in the argument. This seems to imply that all security holes are
the work of the content providers and none related to the servers. In
other words, that all servers and their administrators are entirely
trustworthy. This is certainly not necessarily true. And if a content
provider can compromise security by confusing encodings, so
can a server.
This could become a significant security hole when we get
Unicode domain
names. A malicious server administrator could register the mojibake
equivalent of a legitimate security sensitive domain name and then
deliberately serve the mojibake version to users, etc etc.
Peter Kirk
2003-09-30 12:45:30 UTC
Permalink
... Plus, you'd have to find a user dumb enough to be running a
sufficiently old browser that it was still prone to this exploit. (I'm
pretty sure modern browsers will have closed that hole by now, but
again, you never know). ...
But the whole motivation for the server changing the encoding is based
on the assumption that users are likely to be running sufficiently old
browsers that they don't recognise UTF-8 as an encoding.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/