[Openstack] [Nova] Grizzly -> Havana DB Sync failures...

Jonathan Proulx jon at jonproulx.com
Wed Jan 8 13:39:11 UTC 2014


Hi All,

Last week I tried to upgrade my production system and ran into
https://bugs.launchpad.net/nova/+bug/1245502 (after having run the
test upgrade against a clean Grizzly schema, which turns out to be
insufficient).  The fix for this was in head (now backported to
stable/havana) and only involved one file,
185_rename_unique_constraints.py, which I thought I copied in; I then
reverted the DB from a previous dump and hit the same error.  (I'm not
100% sure I did what I thought I did, since I can't reproduce that
failure in testing, but we'll get to that later.)
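
One thing that might explain that, in hindsight, is a stale compiled
copy shadowing the patched file.  A check worth running (the install
path is my assumption for the Ubuntu cloud archive packaging):

   V=/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/migrate_repo/versions
   ls -l $V/185_rename_unique_constraints.py*
   # remove any stale .pyc so python recompiles the patched file
   rm -f $V/185_rename_unique_constraints.pyc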

Eventually I gave up on the production upgrade, reverted everything to
its pre-upgrade state and moved back into my testing world, but using
the dump of my production DB as the base rather than a clean and empty
Grizzly schema.

The production and test systems are both Ubuntu 12.04 using cloud
archive packages and community puppet modules for management.  The
production system was originally installed with Essex and updated to
Folsom and then Grizzly in turn.  Including the shadow tables, the DB
has history for approx 500k instances.

I've run into a fair number of issues in testing, but I'm dubious
about my test environment, since the first failure in testing was in
v183, which is earlier than anything I saw in production, so that
migration clearly had worked there.  Also, after kludging my way
through that, v185 did apply properly (which may just mean I screwed
up in my previous attempts).  Most strangely, after hacking through as
far as v208 and attempting a fix for some breakage in v209, it started
failing way back in v187.  I'd blame my last kludge for screwing
something up, but it complains that table instance_groups exists,
whereas my last hack was deleting some rows from
instance_actions_events.

I'm stuck at this point: instance_groups is empty, but I can't drop it
due to existing foreign key constraints.  And since the early testing
steps do not match my experience with the production attempt, I fear I
may be chasing ghosts that don't even exist in production, or worse,
missing issues that do.
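
For what it's worth, here's how I'd enumerate the constraints blocking
that drop, assuming stock MySQL information_schema views:

   mysql -e "SELECT TABLE_NAME, CONSTRAINT_NAME
   FROM information_schema.KEY_COLUMN_USAGE
   WHERE REFERENCED_TABLE_SCHEMA='nova'
   AND REFERENCED_TABLE_NAME='instance_groups';"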

Here's a step by step of what I've attempted and brief results at each stage:

----------------------------------------------------------------------

Test upgrade

1) install a Grizzly-based controller node on an OpenStack instance
   using the production puppet config, modulo IP addrs & hostnames

2) reload production DBs into test system

3) fix endpoint URLs to point back to test rather than production
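
   One hedged way to do that, assuming the Grizzly keystone catalog
   stores one URL per endpoint row in a url column (hostnames here
   are placeholders):

   mysql -e "UPDATE endpoint SET url =
   REPLACE(url, 'prod.example.com', 'test.example.com');" keystone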

4) stop all nova services:
   for i in nova-api nova-cert nova-conductor nova-consoleauth \
   nova-novncproxy nova-scheduler nova-objectstore;do service $i \
   stop;done
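
   plus a quick check that nothing survived the loop:

   # lists any remaining nova-* processes; no output means all stopped
   pgrep -lf nova-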

5) mysqldump --all-databases # or at least the nova db
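
   The later steps reload a nova-only nova.sql; that variant looks
   something like this (--single-transaction assumes all-InnoDB
   tables, giving a consistent dump without locking):

   mysqldump --single-transaction --databases nova > nova.sql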

6) snapshot instance

7) run puppet test environment (changes cloud archive source to
   havana, installs new packages and fixes configs).  Expected this to
   fail, as the bug fix isn't packaged yet, but it should have failed
   at v184, not v182!

-> fails, ending at v182

 2014-01-07 19:18:22.193 1463 TRACE nova.db.sqlalchemy.utils
OperationalError: (OperationalError) (1050, "Table
'shadow_security_group_default_rules' already exists") '\nCREATE TABLE
shadow_security_group_default_rules (\n\tcreated_at DATETIME,
\n\tupdated_at DATETIME, \n\tdeleted_at DATETIME, \n\tdeleted INTEGER(11), \n\tid
INTEGER(11) NOT NULL AUTO_INCREMENT, \n\tprotocol VARCHAR(5),
\n\tfrom_port INTEGER(11), \n\tto_port INTEGER(11), \n\tcidr
VARCHAR(43), \n\tPRIMARY KEY (id)\n)ENGINE=InnoDB\n\n' ()
2014-01-07 19:18:22.193 1463 TRACE nova.db.sqlalchemy.utils
Command failed, please check log for more info
2014-01-07 19:18:22.197 1463 CRITICAL nova [-] Shadow table with name
shadow_security_group_default_rules already exists.

 /usr/bin/nova-manage db version
 182
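
 The version itself is just sqlalchemy-migrate bookkeeping in nova's
 migrate_version table, so this should give the same answer straight
 from MySQL:

 mysql -e 'SELECT version FROM migrate_version;' nova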

8) stop all nova-services again

9) grab latest 185_rename_unique_constraints.py from git

    git log 185_rename_unique_constraints.py |head -5
    commit c620cafb700ca195db0bd0ef9d62a0c9459bdc38
    Author: Joshua Hesketh <josh at nitrotech.org>
    Date:   Tue Oct 29 09:40:41 2013 +1100

            Fix migration 185 to work with old fkey names

10) reload nova database as dumped at step 5
    /usr/bin/nova-manage db version
    161
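
    161 is, I believe, the final Grizzly migration, so the reload
    looks sane.  For the record, the reload is just a drop/recreate
    plus the nova-only dump from step 5:

    mysql -e 'DROP DATABASE IF EXISTS nova; CREATE DATABASE nova;'
    mysql nova < nova.sql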


11) nova-manage db sync

    still fails in the same way.

11.1) mysql -e 'drop table  shadow_security_group_default_rules;' nova
      don't care at all about the contents of this table so let's be
      brutal

11.2) try again:
      nova-manage db sync

      fails in a new way (notably, 185 succeeds)

      2014-01-07 20:05:29.157 8499 CRITICAL nova [-] (IntegrityError)
(1452, 'Cannot add or update a child row: a foreign key constraint
fails (`nova`.`block_device_mapping`, CONSTRAINT
`block_device_mapping_instance_uuid_fkey` FOREIGN KEY
(`instance_uuid`) REFERENCES `instances` (`uuid`))') 'INSERT INTO
block_device_mapping (instance_uuid, source_type, destination_type,
device_type, boot_index, image_id) VALUES (%s, %s, %s, %s, %s, %s)'
('0acda551-e1f8-4e29-a7b3-2c8fe9d2fb72', 'image', 'local', 'disk', -1,
'aee1d242-730f-431f-88c1-87630c0f07ba')
      root@test:~# nova-manage db version
      185

      sure enough, there is no instance with uuid
      0acda551-e1f8-4e29-a7b3-2c8fe9d2fb72, but there was: it's now in
      shadow_instances.  Also, the block_device_mapping row this is
      trying to insert corresponds to a row currently sitting in
      shadow_block_device_mapping.
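
      A hedged pair of checks to confirm that reading (column names
      assumed to mirror the live tables, as shadow tables do):

      mysql -e "SELECT COUNT(*) FROM shadow_instances WHERE
      uuid='0acda551-e1f8-4e29-a7b3-2c8fe9d2fb72';" nova
      mysql -e "SELECT COUNT(*) FROM shadow_block_device_mapping WHERE
      instance_uuid NOT IN (SELECT uuid FROM instances);" nova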


11.3) OK, I don't really care about that table either; let's revert
      and drop it along with shadow_security_group_default_rules:

      root@test:~# mysql nova < nova.sql
      root@test:~# mysql -e 'drop table
shadow_security_group_default_rules;drop table
shadow_block_device_mapping;' nova
      root@test:~# nova-manage db sync

11.4) that didn't work because it needs the table; let's try just
      clearing it instead:

      root@test:~# mysql nova < nova.sql
      root@test:~# mysql -e 'drop table
shadow_security_group_default_rules;TRUNCATE TABLE
shadow_block_device_mapping ;' nova
      root@test-nimbus:~# nova-manage db sync

      Failure, but progress:

      Command failed, please check log for more info
      2014-01-07 21:41:05.407 28650 CRITICAL nova [-] (IntegrityError)
(1451, 'Cannot delete or update a parent row: a foreign key constraint
fails (`nova`.`instance_actions_events`, CONSTRAINT
`instance_actions_events_ibfk_1` FOREIGN KEY (`action_id`) REFERENCES
`instance_actions` (`id`))') 'DELETE FROM instance_actions WHERE
instance_actions.instance_uuid NOT IN (SELECT instances.uuid \nFROM
instances)' ()

      root@test:~# nova-manage db version
      208

11.5) rewind and delete all the instance_actions_events rows that
      reference the instance_actions rows this migration wants to
      delete:


      root@test:~# mysql nova < nova.sql
      root@test:~# mysql -e 'drop table
shadow_security_group_default_rules;TRUNCATE TABLE
shadow_block_device_mapping ;DELETE FROM instance_actions_events WHERE
action_id IN (SELECT id FROM instance_actions WHERE
instance_actions.instance_uuid NOT IN (SELECT instances.uuid FROM
instances));' nova
      root@test-nimbus:~# nova-manage db sync


        Insanely, this is now failing earlier:

        root@test-nimbus:~# nova-manage db sync
        Command failed, please check log for more info
        2014-01-07 22:09:00.229 1898 CRITICAL nova [-]
(OperationalError) (1050, "Table 'instance_groups' already exists")
'\nCREATE TABLE instance_groups (\n\tcreated_at DATETIME,
\n\tupdated_at DATETIME, \n\tdeleted_at DATETIME, \n\tdeleted INTEGER,
\n\tid INTEGER NOT NULL AUTO_INCREMENT, \n\tuser_id VARCHAR(255),
\n\tproject_id VARCHAR(255), \n\tuuid VARCHAR(36) NOT NULL, \n\tname
VARCHAR(255), \n\tPRIMARY KEY (id), \n\tCONSTRAINT
uniq_instance_groups0uuid0deleted UNIQUE (uuid,
deleted)\n)ENGINE=InnoDB CHARSET=utf8\n\n' ()

        root@test-nimbus:~# nova-manage db version
        186
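
        If the goal is just to make this migration re-runnable, one
        brute-force idea (untested, and the companion instance-group
        table names are my assumption from the Havana schema) is to
        drop the half-created tables with foreign key checks off:

        mysql -e 'SET foreign_key_checks = 0;
        DROP TABLE IF EXISTS instance_group_member,
        instance_group_policy, instance_group_metadata,
        instance_groups;
        SET foreign_key_checks = 1;' nova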

Since this is all in test and virtualized, I can try any weird thing
anyone might suggest without repercussion, but I'm fairly out of
ideas on my own.  I'm particularly interested in seeing if anyone can
spot a flaw in the initial setup of the test environment that might
make it diverge from my production system in ways I haven't seen.

Thanks,
-Jon



