Discussion:
a little more help understanding diacritical encoding
Steve Pruitt
2003-09-25 16:02:47 UTC
Permalink
Thanks for the excellent responses. I now understand how C3 and 89 are derived. I tried getting everything set the way I interpreted what the list responses said to do. The scenario is:
I have a page with some diacritical characters displayed, an input text box, and a submit button. I copy and paste one of the displayed characters into the input box and then submit. What is submitted gets echoed back. The pages use style sheets, so I cut and pasted the relevant tags, etc.

I thought I had found the problem. My response had a character encoding of null. I read that null defaults to ISO 8859-1, which seemed consistent with my echoed page. So, I explicitly set the response character encoding to UTF-8 via the setContentType method.

I used a TCP tunneler to see what my request and responses look like. My browser is set to utf-8 also.
In the tunneler, my request had the following posted data: v904=%C3%89. This is correct according to how the UTF-8 encoding algorithm was explained.
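
For reference, the posted value can be reproduced with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and it assumes the browser percent-encodes the UTF-8 bytes of the form value, which is what the tunneler output suggests.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FormEncodingSketch {
    // Hypothetical helper: percent-encode a form value from its UTF-8 bytes.
    static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // U+00C9 (É) is two bytes in UTF-8: 0xC3 0x89.
        System.out.println("v904=" + encodeUtf8("\u00C9")); // prints v904=%C3%89
    }
}
```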
The http response had the following:

Content-Type: text/html; charset=UTF-8 (this is correct)

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8"> (this is a child of the <head> tag)

<span class="text29">&#201; &#234; &#235; &#237; &#238; &#239; &#240; &#241; &#243; &#244; &#245; &#246;</span> These are the characters listed on the previous page that I cut and pasted from. They are listed on this page just for reference; #201 (hex C9) is É.

<span class="text17">Accented Characters from&nbsp;&nbsp;previous form:&nbsp;&nbsp;&#195;&#137; </span>
This is what is echoed back. #195 = hex C3 and #137 = hex 89. These, of course, are displayed as Ã?.
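
One way to reproduce this symptom is sketched below. It assumes that something on the server decodes the two UTF-8 bytes as ISO 8859-1, one character per byte; nothing in the captured traffic confirms that assumption. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeSketch {
    // Decode bytes as ISO 8859-1, where every byte becomes exactly one character.
    static String misdecode(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Write each character as a decimal numeric character reference.
    static String toNumericRefs(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append("&#").append((int) s.charAt(i)).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] posted = {(byte) 0xC3, (byte) 0x89}; // the bytes behind %C3%89
        System.out.println(toNumericRefs(misdecode(posted))); // prints &#195;&#137;
    }
}
```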

I checked the browser to be sure, and its encoding is still set to UTF-8. This is everything I know to check. What am I missing?


To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html


Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
j***@spin.ie
2003-09-25 16:12:19 UTC
Permalink
This is likely an issue with whatever you are using to read and echo back the characters. If you just push the exact same bytes back then you will be okay, but anything more clever gives you an opportunity to go wrong - especially if you are using an API that thinks it knows better than you do.

What are you using to write and run this code?
Paul Deuter
2003-09-25 20:31:20 UTC
Permalink
It would appear that your server side is in Java.
There is a well-known issue in older versions of the
Java servlet spec that causes the request class to
assume that %HH-encoded octets are 8859-1 octets.
It seems that this is your problem.
The workaround is to get the parameters from the
request object, turn them back into bytes, and
then reinterpret them as UTF-8 (because that is
what they are).

The code to do that looks like this:

String strFoo = new String(request.getParameter("whatever").getBytes("ISO-8859-1"), "UTF-8");

-Paul
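
The workaround described above can be tried outside a servlet container with a small sketch like the following. The class and method names are hypothetical, and the input string simply stands in for what request.getParameter would return if the container had decoded %C3%89 as ISO 8859-1.

```java
import java.nio.charset.StandardCharsets;

public class ParameterRepairSketch {
    // Turn the mis-decoded characters back into the original bytes,
    // then decode those bytes as UTF-8.
    static String reinterpretAsUtf8(String misdecoded) {
        if (misdecoded == null) {
            return null; // parameter was not present in the request
        }
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for the container's result: U+00C3 U+0089 instead of U+00C9.
        String fromContainer = "\u00C3\u0089";
        String repaired = reinterpretAsUtf8(fromContainer);
        System.out.println((int) repaired.charAt(0)); // prints 201
    }
}
```

This repair is only safe when the parameter really was decoded as ISO 8859-1. Applying it to a string that was already decoded correctly will corrupt it. On containers that implement Servlet 2.3 or later, calling request.setCharacterEncoding("UTF-8") before reading any parameter may make the workaround unnecessary for POSTed form data.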


