24 August 2006

Google Maps UTF-8 problem

A while ago I came across a problem with the Google geocoder apparently returning Latin1-encoded characters rather than UTF-8. I posted an enquiry to the Google Maps API group but didn't get any responses.

I've now had time to look at this in more detail and have found a fix. My investigation showed that:


  1. wget, curl and requests made with Python urllib2 all returned responses encoded in Latin1. Requests made with Firefox returned responses encoded in UTF-8.

  2. Regardless of the actual encoding returned, the XML always stated encoding="UTF-8".

  3. The Content-Type header in the HTTP response correctly gave the returned encoding (either UTF-8 or ISO-8859-1).
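
Given points 2 and 3, a client is safer trusting the charset in the Content-Type header than the encoding declared inside the XML. Here is a minimal Python sketch of that idea, using only the standard library; the helper names charset_from_content_type and decode_body are made up for illustration.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type header value.

    Falls back to `default` when the header carries no charset.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(failobj=default)


def decode_body(body, content_type):
    """Decode raw response bytes using the charset named in the header,
    ignoring whatever the XML declaration claims."""
    return body.decode(charset_from_content_type(content_type))


if __name__ == "__main__":
    # Simulated Latin1 response body whose XML declaration wrongly says UTF-8.
    raw = '<?xml version="1.0" encoding="UTF-8"?><name>Köln</name>'.encode("latin-1")
    print(decode_body(raw, "text/xml; charset=ISO-8859-1"))
```

Decoding by hand like this sidesteps an XML parser that would otherwise believe the incorrect encoding="UTF-8" declaration.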


So it looked like this had something to do with the headers sent in the HTTP request. I used curl to experiment with these and see if I could get a UTF-8 response. The obvious ones (e.g. Accept-Charset: utf-8) didn't work. What did work was changing the User-Agent header. So, if you want to ensure you get a UTF-8 response, pretend to be Firefox:
curl -H'User-Agent: Mozilla/5.0' 'http://maps.google.com/maps/geo?key=&q=cologne&output=xml'
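
The same trick from Python might look roughly like the sketch below. It uses urllib.request (the Python 3 successor to urllib2) and the URL from the curl command above; the function names build_request and fetch are illustrative, and the endpoint may well no longer behave as described in this post, so treat the network call as untested.

```python
import urllib.request
from urllib.parse import urlencode

GEOCODER_URL = "http://maps.google.com/maps/geo"  # URL taken from the curl example


def build_request(query, key=""):
    """Build a geocoder request that sends a browser-like User-Agent."""
    params = urlencode({"key": key, "q": query, "output": "xml"})
    return urllib.request.Request(
        GEOCODER_URL + "?" + params,
        headers={"User-Agent": "Mozilla/5.0"},
    )


def fetch(query, key=""):
    """Send the request and decode the body with the charset the server reports."""
    with urllib.request.urlopen(build_request(query, key)) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)
```

Building the request separately from sending it makes it easy to inspect the headers without touching the network.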

All this means that you can now search for cologne on worldinpictures.org and it will display Köln rather than K�ln.

9 comments:

maport said...

Comment from Yosh:

Hi!
I just stumbled upon the same problem as you posted in the Google Groups.

I got a hint from another guy.
Try adding "&oe=utf-8" to the calling URL, and you will see that the Google server reports UTF-8 in the response header instead of ISO-8859-1!!!

Regards

maport said...

Cool - I don't have to pretend to be Firefox anymore. Thought there must be a better way.

I'm sure the Google docs say everything should always be UTF8 anyway but that certainly works.

Thanks,

Mike.
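
For anyone building the URL in Python, the oe=utf-8 suggestion could be sketched as follows. The function name geocode_url is made up for illustration; the base URL and the other parameters are the ones from the curl command in the post.

```python
from urllib.parse import urlencode

BASE_URL = "http://maps.google.com/maps/geo"  # from the curl example in the post


def geocode_url(query, key="", output="xml"):
    """Return a geocoder URL that asks for a UTF-8 response via oe=utf-8.

    urlencode percent-encodes the query as UTF-8, so non-ASCII place
    names are safe to pass in directly.
    """
    params = {"key": key, "q": query, "output": output, "oe": "utf-8"}
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(geocode_url("cologne"))
```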

maport said...

Comment from Daniel:

Was beating my head against the wall with this issue. I was trying to scrub output from translate.google.com. Setting a Konqueror user agent would always get me non-Unicode answers. Drove me crazy.
Damn Google for user-agent polling... And great thanks to you for this little post.

maport said...

Comment from mickey:

Thx a bunch 4 this forum and many thx 2 Yosh. I use that parameter "oe=UTF-8" in my HTTP request and my Java API doesn't have problems anymore parsing the XML output stream. I didn't try changing the user agent though. I'll try later.

maport said...

Comment from Christophe:

Thank you Yosh, you just saved us some time & dirty hacks.
Cheers!

maport said...

Comment from Greg:

Many thanks as well! I spent four futile hours, and then 4 wonderful minutes to read this site :)

Anonymous said...

Yes, this was the problem. I am new to this. Just a quick hint to others:
$request_url = $base_url . "&oe=utf-8&q=" . urlencode($address);

Anonymous said...

Thank you so much for this hint

HOrst

Silver Surfer said...

finding this post & comments rescued me from going insane. spent 1 hour digging until i found this post. this should be stated in big red letters in the google documentation.