[openstack-dev] Sprint at Pycon: Port OpenStack to Python 3

Victor Stinner victor.stinner at enovance.com
Tue Apr 1 16:43:00 UTC 2014


Hi,

On Tuesday, 1 April 2014, at 10:48:21, John Dennis wrote:
> Oh, almost forgot. One of the significant issues in Py3 string handling
> occurs when dealing with the underlying OS, specifically Posix, the
> interaction with Posix "objects" such as pathnames, hostnames,
> environment values, etc. Virtually any place where in C you would pass a
> pointer to char in the Posix API where the intention is you're passing a
> character string. Unfortunately Posix does not enforce the concept of a
> character or a character string, the pointer to char ends up being a
> pointer to octets (e.g. binary data) which means you can end up with
> strings that can't be encoded.

I know these issues well because I worked directly on Python to give it the best 
Unicode support for filenames on UNIX and Windows. The summary is "Python 3 
just works with Unicode filenames". You don't have to do anything.

For example, you don't have to worry about the encoding *of the filename* in 
code like this:

    import os
    for name in os.listdir("conf"):
        if name.endswith(".conf"):
            with open(os.path.join("conf", name), encoding="utf-8") as fp: ...

os.listdir() and open() use the same encoding: the filesystem encoding 
(sys.getfilesystemencoding()) with the "surrogateescape" error handler.
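
As a minimal sketch of that round trip: the byte string below is a made-up 
filename, and UTF-8 is spelled out explicitly (an assumption, so that the 
result does not depend on the locale of the machine running it). On UNIX, 
os.fsdecode() and os.fsencode() apply the same idea with the real filesystem 
encoding.

```python
import os

# Hypothetical raw filename as the OS might return it: 0xff is not valid UTF-8.
raw = b"caf\xff.conf"

# Undecodable bytes become lone surrogates (0xff -> U+DCFF) instead of
# raising an error.
name = raw.decode("utf-8", "surrogateescape")
print(ascii(name))

# Encoding with the same error handler restores the original bytes exactly.
assert name.encode("utf-8", "surrogateescape") == raw

# os.fsencode()/os.fsdecode() use the actual filesystem encoding.
assert os.fsdecode(os.fsencode("plain.conf")) == "plain.conf"
```

Because the conversion is lossless, a name returned by os.listdir() can be 
passed back to open() even when it is not valid in the locale encoding.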

If you prefer bytes filenames, they are also supported on UNIX, but deprecated 
on Windows.

> Py3 has attempted to deal with this by introducing something called
> "surrogate escapes" which attempts to preserve non-encodable binary data
> in what is supposed to be a character string so as not to corrupt data
> as it transitions between Py3 and a host OS.
> 
> OpenStack deals a lot with Posix API's, thus this is another area where
> we need to be careful and have clear guidelines. We're going to have to
> deal with the whole problem of encoding/decoding in the presence of
> surrogates.

My policy is to always store filenames as Unicode. It's easy to follow this 
policy since Python 3 returns Unicode filenames (e.g. os.listdir("directory")).

You should not have to worry about invalid filenames or surrogate characters, 
since this case should be very rare.

---

Surrogates are only used if a filename contains a non-ASCII character and the 
locale encoding is unable to decode it. This case should be very rare. 
File content often contains non-ASCII data, but it is rarer in file names, at 
least on a Linux server running OpenStack. It is common in user documents on 
Windows, for example.

You only have to worry about surrogates if you want to display a filename or 
write filenames to a file (e.g. a configuration file). Another major change in 
Python 3 is that the UTF-8 encoder became strict: surrogate characters cannot 
be encoded anymore (unless you use the "surrogatepass" error handler). So if 
your locale encoding is UTF-8, print(filename) or 
open("test", "w").write(filename) will fail if the filename contains a 
surrogate character.
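
A small sketch of the strict encoder, using a made-up filename and encoding 
to UTF-8 by hand rather than through print(), so that it behaves the same 
whatever the locale is:

```python
# Hypothetical filename holding one undecodable byte (0xff -> U+DCFF).
name = b"caf\xff.conf".decode("utf-8", "surrogateescape")

# The strict UTF-8 encoder refuses lone surrogates.
try:
    name.encode("utf-8")
except UnicodeEncodeError as exc:
    strict_error = exc
    print("strict UTF-8 encoding failed:", exc.reason)
else:
    strict_error = None

# "surrogatepass" encodes the surrogate code point itself. Note that the
# result is not the original byte 0xff, and is not valid strict UTF-8 either.
passed = name.encode("utf-8", "surrogatepass")
print(passed)
```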

If you want to display filenames containing surrogate characters, use 
repr(filename) or an error handler: "replace", "backslashreplace" or 
"surrogateescape", depending on your use case.

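As a rough illustration of these display options, the helper below 
(printable() is a hypothetical name, and UTF-8 is assumed as the output 
encoding) makes a filename safe to print by encoding it with a chosen error 
handler and decoding it again:

```python
def printable(name, errors="backslashreplace", encoding="utf-8"):
    """Return a copy of name that the given encoding can always encode."""
    return name.encode(encoding, errors).decode(encoding)

# Hypothetical filename holding one undecodable byte (0xff -> U+DCFF).
name = b"caf\xff.conf".decode("utf-8", "surrogateescape")

print(repr(name))                    # escaped form, quotes included
print(printable(name))               # surrogate shown as a \udcff escape
print(printable(name, "replace"))    # surrogate shown as "?"

# "surrogateescape" gives back the original bytes, which can be written
# unchanged to a binary stream such as sys.stdout.buffer.
original_bytes = name.encode("utf-8", "surrogateescape")
```

"replace" loses information, "backslashreplace" keeps it readable, and 
"surrogateescape" reproduces the raw bytes, so the right choice depends on 
who reads the output.
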
If you want to write filenames containing surrogate characters into a file, you 
might use the "surrogateescape" error handler, but it's probably a bad idea 
because you may get an error when *reading* this file back. It may be better to 
raise an error if a filename is invalid (contains surrogate characters).
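
One possible way to raise such an error, as a sketch only: check_filename() 
is a hypothetical helper, and it relies on the strict UTF-8 encoder to detect 
surrogate characters.

```python
def check_filename(name):
    """Return name unchanged, or raise ValueError if it holds surrogates."""
    try:
        # The strict UTF-8 encoder rejects any lone surrogate character.
        name.encode("utf-8")
    except UnicodeEncodeError:
        raise ValueError("invalid filename: %s" % ascii(name)) from None
    return name

# A valid name passes through untouched.
print(check_filename("nova.conf"))

# A made-up name with an undecodable byte is rejected before it is written.
bad = b"caf\xff.conf".decode("utf-8", "surrogateescape")
try:
    check_filename(bad)
except ValueError as exc:
    print(exc)
```

Calling such a check before writing a filename into a configuration file 
keeps the file readable with a strict decoder later on.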

If your system uses the ASCII locale encoding, another option is to fix 
your configuration to use a locale with the UTF-8 encoding.
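
To find out whether a system is in that situation, something like the 
following sketch could be used. is_ascii_encoding() is a hypothetical helper; 
it goes through codecs.lookup() because the ASCII encoding is reported under 
several names (for example "ANSI_X3.4-1968").

```python
import codecs
import sys

def is_ascii_encoding(encoding):
    """Return True if encoding is an alias of the ASCII codec."""
    return codecs.lookup(encoding).name == "ascii"

fs_encoding = sys.getfilesystemencoding()
if is_ascii_encoding(fs_encoding):
    print("filenames use ASCII: consider switching to a UTF-8 locale")
else:
    print("filesystem encoding:", fs_encoding)
```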

Victor


