[openstack-dev] Sprint at Pycon: Port OpenStack to Python 3
Victor Stinner
victor.stinner at enovance.com
Tue Apr 1 16:43:00 UTC 2014
Hi,
On Tuesday, 1 April 2014, 10:48:21 John Dennis wrote:
> Oh, almost forgot. One of the significant issues in Py3 string handling
> occurs when dealing with the underlying OS, specifically Posix, the
> interaction with Posix "objects" such as pathnames, hostnames,
> environment values, etc. Virtually any place where in C you would pass a
> pointer to char in the Posix API where the intention is you're passing a
> character string. Unfortunately Posix does not enforce the concept of a
> character or a character string, the pointer to char ends up being a
> pointer to octets (e.g. binary data) which means you can end up with
> strings that can't be encoded.
I know these issues well because I worked directly on Python to give it the
best possible Unicode support for filenames on UNIX and Windows. The summary is
"Python 3 just works with Unicode filenames". You don't have to do anything.
For example, you don't have to worry about the encoding *of the filename* in
code like this:
    import os
    for filename in os.listdir("conf"):
        if filename.endswith(".conf"):
            with open(os.path.join("conf", filename), encoding="utf-8") as fp: ...
os.listdir() and open() use the same encoding: the filesystem encoding
(sys.getfilesystemencoding()) with "surrogateescape" error handler.
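To illustrate what the "surrogateescape" error handler does, here is a minimal
sketch. It decodes explicitly as UTF-8 instead of relying on
sys.getfilesystemencoding(), so the result does not depend on the locale. The
byte string is a made-up example of a filename written in Latin-1.

```python
# Bytes as a POSIX call might return them: 0xE9 is "é" in Latin-1,
# but it is not valid UTF-8 on its own.
raw = b"caf\xe9.conf"

# surrogateescape maps the undecodable byte 0xE9 to the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
roundtrip = name.encode("utf-8", "surrogateescape")
```

On UNIX, os.fsdecode() and os.fsencode() apply the same handler with the
filesystem encoding, which is why a name returned by os.listdir() can be passed
back to open() unchanged.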
If you prefer bytes filenames, they are also supported on UNIX, but deprecated
on Windows.
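As a small sketch of the two flavours: os.listdir() returns str names for a str
path and bytes names for a bytes path. The temporary directory and the file
name nova.conf are only there for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file so that there is something to list.
    open(os.path.join(tmp, "nova.conf"), "w").close()

    str_names = os.listdir(tmp)                 # str path -> str names
    bytes_names = os.listdir(os.fsencode(tmp))  # bytes path -> bytes names
```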
> Py3 has attempted to deal with this by introducing something called
> "surrogate escapes" which attempts to preserve non-encodable binary data
> in what is supposed to be a character string so as not to corrupt data
> as it transitions between Py3 and a host OS.
>
> OpenStack deals a lot with Posix API's, thus this is another area where
> we need to be careful and have clear guidelines. We're going to have to
> deal with the whole problem of encoding/decoding in the presence of
> surrogates.
My policy is to always store filenames as Unicode. It's easy to follow this
policy since Python 3 returns Unicode filenames (e.g. os.listdir("directory")).
You should not have to worry about invalid filenames / surrogate characters
since this case should be very rare.
Surrogates are only used if a filename contains non-ASCII bytes that the locale
encoding is unable to decode. This case should be very rare.
File content often contains non-ASCII data, but it is rarer for filenames, at
least on a Linux server running OpenStack. Non-ASCII filenames are common for
user documents on Windows, for example.
You only have to worry about surrogates if you want to display a filename or
write filenames to a file (e.g. a configuration file). Another major change in
Python 3 is that the UTF-8 encoder became strict: surrogate characters can no
longer be encoded (unless you use the "surrogatepass" error handler). So if your
locale encoding is UTF-8, print(filename) or open("test", "w").write(filename)
will fail if the filename contains a surrogate character.
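A minimal sketch of the strict encoder, using a made-up filename that contains
one surrogate character. It encodes explicitly to UTF-8 rather than going
through print(), so it behaves the same whatever the locale is.

```python
# A filename holding one undecodable byte, as os.listdir() could return it.
name = b"caf\xe9.conf".decode("utf-8", "surrogateescape")

# The strict UTF-8 encoder refuses lone surrogates.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "surrogatepass" encodes the surrogate code point itself. The result is
# not valid UTF-8 and does not match the original bytes.
passed = name.encode("utf-8", "surrogatepass")
```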
If you want to display filenames containing surrogate characters, use
repr(filename) or an error handler: "replace", "backslashreplace" or
"surrogateescape", depending on your use case.
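For display, a rough sketch of two of these options next to repr(), again with
a made-up filename. The encode/decode round trip is only there to produce a
string that is safe to print.

```python
name = b"caf\xe9.conf".decode("utf-8", "surrogateescape")

# repr() escapes the surrogate and adds quotes around the name.
as_repr = repr(name)

# "backslashreplace" keeps the information as a \udcXX escape sequence.
as_backslash = name.encode("utf-8", "backslashreplace").decode("utf-8")

# "replace" loses the information and substitutes a question mark.
as_replace = name.encode("utf-8", "replace").decode("utf-8")
```

"backslashreplace" is probably the better choice for logs, since the original
byte can still be recovered from the escape sequence.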
If you want to write filenames containing surrogate characters into a file, you
might use the "surrogateescape" error handler, but it's probably a bad idea
because you may get an error when *reading* this file again. It may be better
to raise an error if a filename is invalid (contains surrogate characters).
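A minimal sketch of such a check, assuming a hypothetical helper named
check_filename() (it is not part of Python or OpenStack). It relies on the
strict UTF-8 encoder to detect surrogate characters.

```python
def check_filename(filename):
    """Return filename unchanged, or raise ValueError if it contains
    surrogate characters (hypothetical helper, for illustration only)."""
    try:
        # The strict UTF-8 encoder rejects any surrogate character.
        filename.encode("utf-8")
    except UnicodeEncodeError as exc:
        # %r escapes the surrogate, so the message itself is printable.
        raise ValueError("invalid filename: %r" % filename) from exc
    return filename
```

You would call it right before writing a filename into a configuration file,
so the error shows up at write time and not when the file is read again.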
If your system uses the ASCII locale encoding, another option is to fix your
configuration to use a locale with the UTF-8 encoding.
Victor