I've been using the Python ElementTree library for parsing web service responses for my worldinpictures.org site and generally found it reliable and easy to use.
Character encoding issues have caused me a number of problems recently and I've come across another one with ElementTree:
>>> from elementtree import ElementTree as ET
>>> ET.XML('<?xml version="1.0" encoding="utf-8" ?><title>Good morning Mazatl\xc3\xa1n!</title>').text
u'Good morning Mazatl\xe1n!'
>>> ET.XML('<?xml version="1.0" encoding="utf-8" ?><title>Good morning Mazatln!</title>').text
'Good morning Mazatln!'
It seems that if the element text contains any non-ASCII characters, the result is a unicode string; otherwise it is a plain (byte) string.
It would be preferable to have a consistent return type (e.g. always unicode or always in the input encoding).
So, in my case, I pass the result through unicode() to ensure I always get a unicode result.
(There's an issue here with the unicode function and its reliance on the default encoding but that belongs in another post...)
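Here is a minimal sketch of that workaround, written to sidestep the default-encoding question by decoding explicitly. The helper name element_text is made up for illustration, and I'm assuming the standard-library import path xml.etree.ElementTree (available from Python 2.5) rather than the standalone elementtree package used above. It relies on the behaviour observed above, namely that a plain string result only ever contains ASCII.

```python
# Assumption: the stdlib copy of ElementTree; the standalone package
# would be imported as "from elementtree import ElementTree as ET".
import xml.etree.ElementTree as ET


def element_text(xml_bytes):
    """Return the root element's text, always as a unicode string.

    element_text is a hypothetical helper name, not part of ElementTree.
    """
    text = ET.XML(xml_bytes).text
    if text is None:
        # An empty element has no text at all.
        return u''
    if isinstance(text, bytes):
        # Python 2 only: a plain str comes back when the text is pure
        # ASCII, so decoding as ASCII does not depend on the interpreter's
        # default encoding.
        return text.decode('ascii')
    # Already unicode (always the case on Python 3).
    return text
```

Called on either of the two documents above, this should hand back the same type, so the calling code no longer has to care whether the title happened to contain an accented character.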