(More technical stuff, thanks for your patience).
Problem: you have a page with accented characters that is supposed to open in the UTF-8 character encoding, but sometimes doesn't (maybe I'm the only person who ever had this problem... in my case it was the Xinha HTML editor, opened in a popup).
How can you at least warn your users that there is a problem? Can JavaScript detect the encoding of the page in which it is placed?
Well, no, and yes.
This is the solution I found, which is rather elegant (I feel):
if ('é'.length==2) alert('Houston, we have a problem');
How does it work?
I may be wrong, but as I understand it, 'normal' single-byte text takes up one byte per character, and Unicode in UTF-16 takes up two bytes per character. UTF-8 takes up one byte per character, except for 'problem' characters like accented letters, for which it uses two (or more) bytes.
So if a UTF-8 encoded page is wrongly interpreted as 'normal' single-byte text, an 'é' character (which takes up two bytes) will be interpreted as two one-byte characters.
Conversely, if the browser thinks the page is UTF-8, it will interpret that two-byte combination as the single character 'é'.
So if JavaScript tells you the length is two, you know that you're not in UTF-8.
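To make that concrete, here is a minimal sketch of the same length check wrapped into a function; the function name, the alert wording and the onload wiring are my own illustration, not part of the original snippet:

// Minimal sketch of the trick described above.
// This source file must itself be saved as UTF-8, so 'é' is stored as the
// two bytes 0xC3 0xA9. If the browser wrongly decodes the page as a
// single-byte charset (e.g. ISO-8859-1), those two bytes become the two
// characters 'Ã' and '©', and the string length is 2 instead of 1.
function warnIfNotUtf8() {
  if ('é'.length == 2) {
    alert('Houston, we have a problem: this page was not decoded as UTF-8.');
  }
}

// For example, run the check once the page (or popup) has loaded:
window.onload = warnIfNotUtf8;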
(Last detail, and here I'm in unknown territory - contrary to the postscript in my previous post about UTF-8, it seems you shouldn't save web pages as UTF-8 files but as normal files, with the charset meta tag set to UTF-8. I think I'm right about that, but I have no idea why...)
2 comments:
Ah, I know how you got into this problem.... :-)
astring.match(/[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2}/) is a viable method to detect a UTF-8 string that has been decoded as ISO-8859-1.
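For illustration only, a rough sketch of how that regex could be used; the variable and function names are mine, and 'Ã©' is simply what 'é' looks like once its UTF-8 bytes have been read as ISO-8859-1:

// Sketch: the regex matches byte patterns that form valid UTF-8 multi-byte
// sequences, as they appear when the raw bytes have been decoded as a
// single-byte charset (each byte becomes one character in the \x80-\xFF range).
var utf8Pattern = /[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2}/;

function looksLikeMisdecodedUtf8(astring) {
  return utf8Pattern.test(astring);
}

looksLikeMisdecodedUtf8('Ã©');          // true: 'é' read as ISO-8859-1
looksLikeMisdecodedUtf8('plain ASCII'); // false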