[openstack-dev] [oslo] i18n Message improvements

Doug Hellmann doug.hellmann at dreamhost.com
Fri Oct 18 22:41:09 UTC 2013


On Fri, Oct 18, 2013 at 2:21 PM, John Dennis <jdennis at redhat.com> wrote:

> On 10/18/2013 12:57 PM, Doug Hellmann wrote:
> >
> >
> >
> > On Thu, Oct 17, 2013 at 2:24 PM, John Dennis <jdennis at redhat.com
> > <mailto:jdennis at redhat.com>> wrote:
> >
> >     On 10/17/2013 12:22 PM,  Luis A. Garcia wrote:
> >     > On 10/16/2013 1:11 PM, Doug Hellmann wrote:
> >     >>
> >     >> [snip]
> >     >> Option 3 is closer to the new plan for Icehouse, which is to have
> _()
> >     >> return a Message, allow Message to work in a few contexts like a
> >     string
> >     >> (so that, for example, log calls and exceptions can be left
> >     alone, even
> >     >> if they use % to combine a translated string with arguments), but
> >     then
> >     >> have the logging and API code explicitly handle the translation of
> >     >> Message instances so we can always pass unicode objects outside of
> >     >> OpenStack code (to logging or to web frameworks). Since the
> >     logging code
> >     >> is part of Oslo and the API code can be, this seemed to provide
> >     >> isolation while removing most of the magic.
> >     >>
> >     >
> >     > I think this is exactly what we have right now, inherited from Havana.
> >     > The _() returns a Message that is then translated on-demand by the
> API
> >     > or in a special Translation log handler.
> >     >
> >     > We just did not make Message look and feel enough like a str() and
> >     some
> >     > outside components (jsonifier in Glance and log Formatter all
> >     over) did
> >     > not know how to handle non text types correctly when non-ascii
> >     > characters were present.
> >     >
> >     > I think extending from unicode and removing all the implementations in
> >     > place such that the unicode implementation kicks in for all magic methods
> >     > will solve the problems we saw at the end of Havana.
> >
> >     I'm relatively new to OpenStack so I can't comment on prior OpenStack
> >     implementations but I'm a long standing veteran of Python i18n
> issues.
> >
> >     What you're describing sounds a lot like problems that result from
> the
> >     fact Python's default encoding is ASCII as opposed to the more
> sensible
> >     UTF-8. I have a long write up on this issue from a few years ago but
> >     I'll cut to the chase. Python will attempt to automatically encode
> >     Unicode objects into ASCII during output which will fail if there are
> >     non-ASCII code points in the Unicode. Python does this in two
> >     distinct contexts depending on whether the destination of the output is a
> >     file or terminal. If it's a terminal it attempts to use the encoding
> >     associated with the TTY. Hence you can get different results if you output
> >     to a TTY or a file handle.
> >
> >
> > That was related to the problem we had with logging and Message
> instances.
> >
> >
> >
> >     The simple solution to many of the encoding exceptions that Python
> will
> >     throw is to override the default encoding and change it to UTF-8. But
> >     the default encoding is locked by site.py due to internal Python
> string
> >     optimizations which cache the default encoded version of the string
> so
> >     the encoding happens only once. Changing the default encoding would
> >     invalidate cached strings and there is no mechanism to deal with
> that,
> >     that's why the default encoding is locked. But you can change the
> >     default encoding using this trick if you do it early enough during the
> >     module loading process:
> >
> >
> > I don't think we want to force the encoding at startup. Setting the
> > locale properly through the environment and then using unicode objects
> > also solves the issue without any startup timing issues, and allows
> > deployers to choose the encoding for output.
>
>
> Setting the locale only solves some of the problems; the locale is only
> respected some of the time. The discrepancies and inconsistencies in how
> Unicode conversion occurs in Python2 are maddening and one of the worst
> aspects of Python2. It was never carefully thought out. Unicode in
> Python2 is basically a bolted-on hack that only works if every piece of
> code plays by the exact same rules, which of course they don't and never
> will. I can almost guarantee that unless you attack this problem at the core
> you'll continue to get bitten. Either code is encoding aware and
> explicitly forces a codec (presumably utf-8) or the code is encoding
> naive and allows the default encoding to be applied, except when the
> locale is respected, which overrides the default encoding for the naive
> case.
>

The vast majority of our code should not care at all about encodings or
locales. If we're encoding and decoding strings all over the place, we're
doing it wrong. That's why I wanted Message.__str__() to raise an exception
-- to help us find the places where we are treating something that should
be a unicode string like it is a byte string.
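
To make that concrete, here is a minimal sketch of the idea, not the
actual Oslo code. The Message class, its to_text() method and the
_refuse_bytes helper are hypothetical names made up for illustration,
and the catalog lookup is stubbed out. On Python 2 the byte-string
conversion is __str__; the sketch maps the same refusal onto __bytes__
so it also runs on Python 3.

```python
import sys

PY2 = sys.version_info[0] == 2
# 'unicode' only exists on Python 2; on Python 3 the text type is str.
text_type = unicode if PY2 else str  # noqa: F821


class Message(object):
    """Hypothetical stand-in for a lazily translated message."""

    def __init__(self, msgid):
        self.msgid = msgid

    def to_text(self):
        # A real implementation would look msgid up in a translation
        # catalog here; this sketch just returns it as text.
        return text_type(self.msgid)

    def _refuse_bytes(self):
        # Fail loudly wherever a Message is treated as a byte string.
        raise UnicodeError(
            "Message must be handled as text, not as a byte string")

    if PY2:
        __unicode__ = to_text
        __str__ = _refuse_bytes
    else:
        __str__ = to_text
        __bytes__ = _refuse_bytes
```

With something like this, text_type(Message(u"caf\xe9")) gives back the
text, while a byte-string conversion raises instead of silently going
through the default encoding.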


>
> When Python3 was being worked on one of the major objectives was to
> clean up the horrible state of strings and unicode in Python2. Python3
> to the best of my knowledge has gotten it right. What's the default
> encoding in Python3? UTF-8. Can you change the default encoding in
> Python3? No. It's hardwired to UTF-8, period. You can override the
> encoding at obvious points (e.g. when opening IO streams) or allow
> things like TextIOWrapper to default to what
> locale.getpreferredencoding() returns, but the main point is it's
> consistently applied, it's not the haphazard mess in Python2 where
> you're never quite sure how a Unicode string is going to be encoded (in
> part because it depends on the destination of the IO).
>
> Given UTF-8 is Python3's default, that UTF-8 is the default in virtually
> every network protocol and that UTF-8 is the default in virtually every
> Linux library, making UTF-8 the default in Python2 applications makes
> sense to me. So many problems in Python2 will go away if the default
> encoding is UTF-8 but I realize this is not an opinion shared by
> everyone. [1]
>
> For those who say forcing the default encoding to be UTF-8 early in the
> module load sounds like a terrible hack I would have to agree 100%. But
> things aren't always pretty due to unfortunate history that can't be
> undone, the best you can do is adapt to something sensible given the
> constraints.
>

We had one place where the default encoding came into play, because we were
passing a Message instance to logging and it was being implicitly converted
to a byte string. Adding one line to convert the Message to a unicode
object fixed that. I don't understand why we need to force the default
encoding.
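
Roughly, the fix has this shape. This is a sketch under assumptions,
not the change that actually went into Oslo: Message, to_text() and
TextMessageFilter are hypothetical names, and a logging filter is just
one place such a conversion could live.

```python
import io
import logging


class Message(object):
    """Hypothetical stand-in for a lazily translated message."""

    def __init__(self, msgid):
        self.msgid = msgid

    def to_text(self):
        # A real implementation would translate msgid here.
        return self.msgid


class TextMessageFilter(logging.Filter):
    """Replace Message instances with plain text before formatting."""

    def filter(self, record):
        if isinstance(record.msg, Message):
            # The "one line": hand logging a text object, so no
            # implicit byte-string conversion ever happens.
            record.msg = record.msg.to_text()
        return True


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(TextMessageFilter())

log = logging.getLogger("i18n_sketch")
log.propagate = False
log.setLevel(logging.INFO)
log.addHandler(handler)

log.info(Message(u"d\xe9marrage termin\xe9"))
```

The point of doing the conversion at this boundary is that the rest of
the code never has to think about encodings; only the handler's stream
decides how the text is written out.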


>
> [1] Many of the objections centered around making UTF-8 the
> default for the system supplied Python because every piece of Python
> code ever executed on the platform might be subject to some unexpected
> behavior if the default changed. But we're not in that situation, we're
> running a constrained set of code, we're not trying to support every
> possible piece of Python code written, rather we need to ensure the code
> that executes in OpenStack behaves as we expect it to and that expectation
> is that the encoding is UTF-8.
>
> >
> >
> >
> >     import sys
> >     reload(sys)
> >     sys.setdefaultencoding('utf-8')
> >
> >     The reason this works is because site.py deletes the
> setdefaultencoding
> >     from the sys module, but after reloading sys it's available again.
> One
> >     can also use a tiny CPython module to set the default encoding
> without
> >     having to use the sys reload trick. The following illustrates the
> reload
> >     trick:
> >
> >     $ python
> >     Python 2.7.3 (default, Aug  9 2012, 17:23:57)
> >     [GCC 4.7.1 20120720 (Red Hat 4.7.1-5)] on linux2
> >     Type "help", "copyright", "credits" or "license" for more
> information.
> >     >>> import sys
> >     >>> sys.getdefaultencoding()
> >     'ascii'
> >     >>> sys.setdefaultencoding('utf-8')
> >     Traceback (most recent call last):
> >       File "<stdin>", line 1, in <module>
> >     AttributeError: 'module' object has no attribute 'setdefaultencoding'
> >     >>> reload(sys)
> >     <module 'sys' (built-in)>
> >     >>> sys.setdefaultencoding('utf-8')
> >     >>> sys.getdefaultencoding()
> >     'utf-8'
> >
> >
> >     Not fully understanding the role of Python's default encoding and how
> >     its application differs between terminal and non-terminal output can
> >     cause a lot of confusion and misunderstanding which can sometimes lead
> >     to false conclusions as to what is going wrong.
> >
> >     If I get a chance I'll try to publicly post my write-up on Python
> i18n
> >     issues.
> >
> >
> >
> >     --
> >     John
> >
> >     _______________________________________________
> >     OpenStack-dev mailing list
> >     OpenStack-dev at lists.openstack.org
> >     <mailto:OpenStack-dev at lists.openstack.org>
> >     http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> >
> >
> >
> >
> >
>
>
> --
> John
>