need help understanding diacritical encoding

Discussion:

Steve Pruitt

2003-09-24 18:37:20 UTC

I have a form that posts diacritical characters. For example, when my browser has the encoding set to utf-8 and the form posts the character É,
the post data has these two bytes, C3 and 89, which when echoed back on a new page are displayed as Ã?. Can someone explain how, when the character is converted to two bytes, I get C3 and 89?

Thanks,

Steve Pruitt

------------------------ Yahoo! Groups Sponsor ---------------------~-->
KnowledgeStorm has over 22,000 B2B technology solutions. The most comprehensive IT buyers' information available. Research, compare, decide. E-Commerce | Application Dev | Accounting-Finance | Healthcare | Project Mgt | Sales-Marketing | More
http://us.click.yahoo.com/IMai8D/UYQGAA/cIoLAA/8FfwlB/TM
---------------------------------------------------------------------~->

To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

t***@eatoni.com

2003-09-24 19:17:19 UTC

Steve> I have a form that posts diacritical characters. For example,
Steve> when my browser has the encoding set to utf-8 and the form
Steve> posts the character É the post data has these two bytes C3 and
Steve> 89, which when echoed back on a new page is displayed as Ã?.
Steve> Can someone explain when the character is converted to two
Steve> bytes how I get C3 and 89?

0xC3 0x89 is the two-byte UTF-8 encoding of your Latin-1 É (0xC9)
character. The browser that's displaying these two bytes is
probably interpreting them as two Latin-1 characters: Ã (0xC3)
followed by a control character (0x89), which it's just displaying as a ?

So the data posted by the form is correct (UTF-8), but the browser is
not interpreting that data as UTF-8; it is using some single-byte
encoding instead (likely Latin-1).
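
The effect described above can be reproduced in a few lines of Python. This is only a sketch: the variable names are made up for illustration, and Latin-1 is assumed to be the encoding that was wrongly applied, as suggested above.

```python
# Sketch: reproduce the garbling by decoding UTF-8 bytes as Latin-1.
# "\u00c9" is LATIN CAPITAL LETTER E WITH ACUTE.
posted = "\u00c9".encode("utf-8")      # the bytes the form sends: C3 89

# Wrong: treat each byte as one Latin-1 character.
garbled = posted.decode("latin-1")     # U+00C3 followed by control char U+0089

# Right: decode the pair as a single UTF-8 sequence.
correct = posted.decode("utf-8")       # U+00C9 again

print(posted.hex())                    # c389
print([hex(ord(c)) for c in garbled])  # ['0xc3', '0x89']
print(hex(ord(correct)))               # 0xc9
```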

Get it?

Regards,
Terry

t***@eatoni.com

2003-09-24 19:27:20 UTC

Steve> Can someone explain when the character is converted to two
Steve> bytes how I get C3 and 89?

Just a little more to answer this. Your LATIN CAPITAL LETTER E WITH
ACUTE has the value 0xC9 in Latin-1 (and in various other 8-bit
encodings), which is also its Unicode code point, U+00C9.
This is represented in UTF-8 as two bytes:

110-00011 10-001001

I put a hyphen between the parts of the UTF-8 that have no payload and
the parts that do. The first UTF-8 byte says that the UTF-8 sequence
will have 2 bytes (110). The next 5 bits in that first byte are the
start of the payload (00011). The next byte is marked (10) as carrying
the rest of the payload (001001). Putting the two parts of the payload
together, we have 00011001001. Drop the leading zeroes and you have
11001001, the 8-bit value 0xC9. Hence 0xC3 0x89 is the UTF-8
representation of 0xC9.
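
The same bit manipulation can be written out as a small Python function. This is a minimal sketch that handles only the two-byte case; the function name is invented for illustration, and it does not reject overlong forms as a real decoder would have to.

```python
def decode_two_byte_utf8(b1: int, b2: int) -> int:
    """Return the code point carried by a two-byte UTF-8 sequence."""
    if b1 & 0b11100000 != 0b11000000:
        raise ValueError("first byte is not a two-byte lead (110xxxxx)")
    if b2 & 0b11000000 != 0b10000000:
        raise ValueError("second byte is not a continuation (10xxxxxx)")
    high = b1 & 0b00011111   # 5 payload bits from the first byte
    low = b2 & 0b00111111    # 6 payload bits from the second byte
    return (high << 6) | low  # join the two payload parts


code_point = decode_two_byte_utf8(0xC3, 0x89)
print(hex(code_point))  # 0xc9
```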

Terry

Stephane Bortzmeyer

2003-09-25 08:26:33 UTC

On Wed, Sep 24, 2003 at 02:37:20PM -0400,

> I have a form that posts diacritical characters. For example, when
> my browser has the encoding set to utf-8
                                     ^^^^^
OK

> and the form posts the character É the post data has these two bytes
> C3 and 89,

That seems reasonable: "0xC3 0x89" is UTF-8 for É.

> which when echoed back on a new page is displayed as Ã?.

Your Web browser is not displaying the page as UTF-8 (it is probably
configured to display as Latin-1). The exact fix depends on the browser.

*or*

Your Web server sent back the reply as UTF-8 but tagged it as
Latin-1. Check the HTTP headers to be sure.
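
Once the Content-Type header of the response has been captured (with a browser's developer tools or any HTTP client), the declared charset can be pulled out with the Python standard library. A small sketch; the helper name and the header values are hypothetical examples, not taken from the server in question.

```python
from email.message import Message


def declared_charset(content_type: str):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns None if absent.
    return msg.get_content_charset()


# Hypothetical header values, for illustration only.
print(declared_charset("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(declared_charset("text/html"))                      # None
```

If the header says iso-8859-1 (or names no charset at all) while the body bytes are UTF-8, that mismatch would explain the Ã? seen on the page.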

j***@spin.ie

2003-09-25 09:54:42 UTC

Post by Steve Pruitt
I have a form that posts diacritical characters. For example, when my browser
has the encoding set to utf-8 and the form posts the character É
the post data has these two bytes C3 and 89, which when echoed back on a new
page is displayed as Ã?. Can someone explain when the character is converted
to two bytes how I get C3 and 89?

UTF-8 is explained in section 3.9 of the Unicode standard and elsewhere. RFC 2279 is a heavily referenced document; note that its description includes the encoding of codepoints outside of the Unicode range.

É is U+00C9 and in binary that is:

0000000011001001

UTF-8 encoding results in different numbers of bytes depending on how many bits you have when you remove the leading zeros (8 bits in this case - resulting in two bytes).

It then puts those bits from the codepoint into bytes as so:

         00000000 0xxxxxxx -> 0xxxxxxx
         00000yyy yyxxxxxx -> 110yyyyy 10xxxxxx
         zzzzyyyy yyxxxxxx -> 1110zzzz 10yyyyyy 10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx -> 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx

In the case of U+00C9 the second of these is the shortest form possible, so it is used. The bits 00011 are placed in 110yyyyy to give you 11000011 (0xC3) and the bits 001001 are placed in 10xxxxxx to give you 10001001 (0x89).
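
The four rows of the table translate fairly directly into code. The following Python function is a sketch written for illustration (the name utf8_encode is made up); in practice you would simply call str.encode("utf-8"), which the last lines use as a cross-check.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8, following the table above."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                       # up to 7 bits: one byte
        return bytes([cp])
    if cp < 0x800:                      # up to 11 bits: two bytes
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # up to 16 bits: three bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # up to 21 bits: four bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(0xC9).hex())  # c389
# Cross-check against Python's built-in encoder.
assert utf8_encode(0xC9) == "\u00c9".encode("utf-8")
```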

The problem is that this didn't happen when the bytes went back out again - rather the bytes were interpreted as being part of a string encoded in some other way (most likely ISO 8859-1, which certainly would produce Ã followed by a control character from those bytes). It may be that all you need to do is to correctly report the encoding, by sending an HTTP header with the MIME type and charset (some server-side APIs make this easy, e.g. in ASP you would use Response.Charset = "utf-8"). It may be that you need to do further work (depending on just what it is you are doing with the form).
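
For comparison with the ASP one-liner, here is roughly what reporting the charset looks like in a bare Python WSGI application. This is a sketch under assumptions: the original poster's server technology is not stated in the thread, and the application below is a made-up example that just returns É.

```python
def app(environ, start_response):
    """Minimal WSGI app that returns 'É' and labels the body as UTF-8."""
    body = "\u00c9".encode("utf-8")    # bytes C3 89
    headers = [
        # The charset parameter tells the browser how to decode the bytes.
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the app could be served with wsgiref.simple_server.make_server from the standard library. The key point is only that the declared charset matches the encoding actually used for the body.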
