[openstack-dev] UTF-8 required charset/encoding for openstack database?

Clint Byrum clint at fewbar.com
Wed Mar 19 08:20:45 UTC 2014


Excerpts from Doug Hellmann's message of 2014-03-18 15:08:36 -0700:
> On Mon, Mar 10, 2014 at 4:02 PM, Ben Nemec <openstack at nemebean.com> wrote:
> 
> > On 2014-03-10 12:24, Chris Friesen wrote:
> >
> >> Hi,
> >>
> >> I'm using havana and recently we ran into an issue with heat related to
> >> character sets.
> >>
> >> In heat/db/sqlalchemy/api.py in user_creds_get() we call
> >> _decrypt() on an encrypted password stored in the database and then
> >> try to convert the result to unicode.  Today we hit a case where this
> >> errored out with the following message:
> >>
> >> UnicodeDecodeError: 'utf8' codec can't decode byte 0xf2 in position 0:
> >> invalid continuation byte
> >>
> >> We're using postgres and currently all the databases are using
> >> SQL_ASCII as the charset.
> >>
> >> I see that in icehouse heat will complain if you're using mysql and
> >> not using UTF-8.  There doesn't seem to be any checks for other
> >> databases though.
> >>
> >> It looks like devstack creates most databases as UTF-8 but uses latin1
> >> for nova/nova_bm/nova_cell.  I assume this is because nova expects to
> >> migrate the db to UTF-8 later.  Given that those migrations specify a
> >> character set only for mysql, when using postgres should we explicitly
> >> default to UTF-8 for everything?
> >>
> >> Thanks,
> >> Chris
> >>
> >
> > We just had a discussion about this in #openstack-oslo too.  See the
> > discussion starting at 2014-03-10T16:32:26:
> > http://eavesdrop.openstack.org/irclogs/%23openstack-oslo/%23openstack-oslo.2014-03-10.log
> >
> > While it seems Heat does require utf8 (or at least matching character
> > sets) across all tables, I'm not sure the current solution is good.  It
> > seems like we may want a migration to help with this for anyone who might
> > already have mismatched tables.  There's a lot of overlap between that
> > discussion and how to handle Postgres with this, I think.
> >
> > I don't have a definite answer for any of this yet but I think it is
> > something we need to figure out, so hopefully we can get some input from
> > people who know more about the encoding requirements of the Heat and other
> > projects' databases.
> >
> > -Ben
> >
> >
> > _______________________________________________
> > OpenStack-dev mailing list
> > OpenStack-dev at lists.openstack.org
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> >
> 
> Based on the discussion from the project meeting today [1], the Glance
> team is going to write a migration to fix the database as the other
> projects have (we have not seen issues with corrupted data, so we believe
> this to be safe). However, there is one snag. In a follow-up conversation
> with Ben in #openstack-oslo, he pointed out that no migrations will run
> until the encoding is correct, so we do need to make some changes to the db
> code in oslo.
> 


Hi! Thanks for considering the plight of the users that have high-byte
characters, but in reading the referenced IRC log, there was a lot of
"hoping for the best" in the outcome.

However, I think Glance in particular is likely to find the bugs in this
approach, as users are more apt to name images with descriptive words
than networks, servers, and volumes.

Anyway, if you do have latin1 tables that already contain utf-8 encoded
data, you can't just ALTER TABLE your way out. Let me explain the
scenario with a simple copy/paste:

First, let's assume you've done nothing really and the server (and the
table t1) is just set to latin1, but your client is utf-8:

mysql> insert into t1 values (2, '♬ ♭');
Query OK, 1 row affected, 1 warning (0.05 sec)

The warning there is that this is a latin1 table, and those are wide
chars, so they got stripped:

mysql> select * from t1;
+----+---------+
| id | data    |
+----+---------+
|  1 | no utf8 |
|  2 | ? ?     |
+----+---------+
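What mysql did there can be sketched outside the database. This is a
hedged approximation in Python, not MySQL's actual conversion path:
characters with no latin1 representation degrade to '?':

```python
# A utf-8 client sends '♬ ♭' (wide chars) to a latin1 column.
# Characters that don't exist in latin1 can't be stored, so they
# come out as '?' -- an approximation of the server's lossy conversion.
mangled = "♬ ♭".encode("latin-1", errors="replace").decode("latin-1")
print(mangled)  # ? ?
```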

Now you may also have a situation where your client is defaulting to
latin1:

mysql> insert into t1 values (3, '♬ ♭');
Query OK, 1 row affected (0.05 sec)

Note, zero warnings. What happened here?

mysql> select * from t1;
+----+---------+
| id | data    |
+----+---------+
|  1 | no utf8 |
|  2 | ? ?     |
|  3 | ♬ ♭ |
+----+---------+

OH EXCELLENT! My data looks right. Let's ignore that mysql thinks it is
6 chars, not 2, because we probably won't ever notice that.
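The char count is off because the server stored the raw UTF-8 bytes as
if each byte were its own latin1 character. A Python sketch of the
miscount (assuming MySQL's latin1 behaves roughly like cp1252, which it
approximately does):

```python
# '♬' is one character, but three bytes in UTF-8.
raw = "♬".encode("utf-8")
print(len(raw))  # 3

# A latin1 column treats each byte as a separate character, so the
# two symbols in '♬ ♭' are stored (and counted) as 6 characters.
print(raw.decode("cp1252"))  # what those three bytes look like as latin1
```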

Now I fix my clients and they start using utf-8:

mysql> set names 'utf8';
Query OK, 0 rows affected (0.00 sec)

mysql> select * from t1;
+----+-----------------+
| id | data            |
+----+-----------------+
|  1 | no utf8         |
|  2 | ? ?             |
|  3 | â™¬ â™­         |
+----+-----------------+
3 rows in set (0.00 sec)

Doh, what are those trademarked a's?

This is often where app writers give up, switch back to latin1, and
think they're fine because at least data is coming out the way it went
in. But now you have utf-8 in a latin1 table. If you alter this:

mysql> alter table t1 convert to character set 'utf8';
Query OK, 3 rows affected (0.28 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> select * from t1;
+----+-----------------+
| id | data            |
+----+-----------------+
|  1 | no utf8         |
|  2 | ? ?             |
|  3 | ♬ ♭ |
+----+-----------------+
3 rows in set (0.01 sec)

UGH! Oh wait, I'm still latin1... switch client to utf8:

mysql> select * from t1;
+----+-----------------+
| id | data            |
+----+-----------------+
|  1 | no utf8         |
|  2 | ? ?             |
|  3 | â™¬ â™­         |
+----+-----------------+
3 rows in set (0.00 sec)

DOH!

This is really frustrating. Detecting that this situation is even
present is also difficult, because a latin1 byte sequence can happen to
be perfectly valid UTF-8.
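One rough probe, and it is only a heuristic, is to pull the raw bytes
over a latin1 connection and check whether they happen to decode as
UTF-8. A Python sketch (the function name is mine):

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Heuristic: raw latin1 column bytes that decode cleanly as UTF-8
    were probably double-encoded. Only a heuristic, because some
    genuine latin1 text is also valid UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("♬".encode("utf-8")))    # True
print(looks_like_utf8("é".encode("latin-1")))  # False: bare 0xe9 byte
print(looks_like_utf8("é".encode("utf-8")))    # True -- but ambiguous:
# the bytes 0xc3 0xa9 are also a legitimate latin1 string.
```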

What you have to do is create a new table that is utf-8, and then have a
client with two connections:

1) latin1 to the server to select out records incorrectly encoded
2) utf-8 to the server to insert correctly encoded utf-8 into new table.

You can probably do this, but I suggest testing it with high-byte UTF-8
characters in the db before the migration runs, to confirm they are
still displayed correctly afterward.
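Assuming MySQL's latin1 is close to cp1252 (it approximately is), the
per-value repair the two-connection copy performs amounts to undoing
the wrong decode. A hedged Python sketch of just that transformation,
not the migration itself:

```python
def fix_double_encoded(mangled: str) -> str:
    """Undo utf-8-read-as-latin1 mojibake: recover the original bytes,
    then decode them as UTF-8. Only safe when the value really is
    double-encoded; check first (e.g. that the recovered bytes decode
    cleanly as UTF-8), or clean latin1 data will be corrupted."""
    return mangled.encode("cp1252").decode("utf-8")

print(fix_double_encoded("â™¬"))  # ♬
```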


