[Openstack] [Nova] Grizzly -> Havana DB Sync failures...

Jonathan Proulx jon at jonproulx.com
Wed Jan 8 15:41:10 UTC 2014


Found the flaw in my test environment:

used 'mysqldump nova' when I should have used 'mysqldump
--add-drop-database --databases nova'

So while tables that existed at the time of the dump were being
dropped and recreated on restore, the restore was not dropping the
whole database, so tables and constraints left behind by failed
migration attempts accumulated across runs and tangled things up.
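
For reference, the difference between the two invocations (flag
behavior per the mysqldump documentation; --add-drop-database only
takes effect together with --databases or --all-databases):

```shell
# Dumps only the tables that exist at dump time; on restore, any
# tables created later by a failed migration attempt are left in
# place and accumulate.
mysqldump nova > nova.sql

# Emits CREATE DATABASE preceded by DROP DATABASE, so a restore
# wipes the whole database first, including half-migrated tables.
mysqldump --add-drop-database --databases nova > nova.sql
```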

Typical: after a week of banging my head on the problem, I figure it
out about an hour after posting.  I got a 'db sync' to succeed, but it
still involved a bit of database hackery, so I'm going to back up
again, see which of those hacks were really necessary, and file bugs
for any that were.

-Jon

On Wed, Jan 8, 2014 at 8:39 AM, Jonathan Proulx <jon at jonproulx.com> wrote:
> Hi All,
>
> Last week I tried to upgrade my production system and ran into
> https://bugs.launchpad.net/nova/+bug/1245502 (after having run the
> test upgrade against a clean grizzly schema, which turns out to be
> insufficient).  The fix for this was in head (now backported to
> stable/havana) and only involved one file,
> 185_rename_unique_constraints.py, which I thought I copied in.  I
> reverted the DB from a previous dump and then hit the same error
> (I'm not 100% sure I did what I thought, since I can't reproduce
> that failure in testing, but we'll get to that later).
>
> Eventually I gave up on the production upgrade, reverted everything to
> pre upgrade state and moved back into my testing world, but using the
> dump of my production DB as the base rather than a clean and empty
> grizzly schema.
>
> The production and test systems are both Ubuntu 12.04 using cloud
> archive packages and community puppet modules for management.  The
> production system was originally installed with essex and updated for
> folsom and grizzly in turn.  Including the shadow tables, the DB has
> history for approx 500k instances.
>
> I've run into a fair number of issues in testing, but I'm dubious
> about my test environment, since the first failure in testing was at
> v183, which is earlier than I saw in production, so clearly that
> migration had worked there.  Also, after kludging my way through
> that, v185 did apply properly (which may just mean I screwed up in
> my previous attempts).  Most strangely, after hacking through as far
> as v208 and attempting a fix for some breakage in v209, it started
> failing way back in v187.  I'd blame my last kludge for screwing
> something up, but it complains that table instance_groups exists,
> whereas my last hack was only deleting some rows from
> instance_actions_events.
>
> I'm stuck at this point: while instance_groups is empty, I can't
> drop it due to existing foreign key constraints.  And since the
> early testing steps don't match my experience with the production
> attempt, I fear I may be chasing ghosts that don't exist in
> production, or worse, missing issues that do.
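>
> An untested sketch of how the blocking constraints could be found
> and removed (the ALTER TABLE lines are examples, not taken from my
> DB; check the first query's output before dropping anything):
>
> ```shell
> # List every foreign key that references instance_groups
> mysql -e "SELECT table_name, constraint_name
>           FROM information_schema.key_column_usage
>           WHERE table_schema = 'nova'
>             AND referenced_table_name = 'instance_groups';"
>
> # Then drop each reported constraint before dropping the table, e.g.:
> # mysql -e 'ALTER TABLE <child_table> DROP FOREIGN KEY <constraint_name>;' nova
> # mysql -e 'DROP TABLE instance_groups;' nova
> ```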
>
> Here's a step by step of what I've attempted and brief results at each stage:
>
> ----------------------------------------------------------------------
>
> Test upgrade
>
> 1) install Grizzly based controller node on OpenStack instance using
> production puppet config modulo IP addrs & hostnames
>
> 2) reload production DBs into test system
>
> 3) fix endpoint URLs to point back to test rather than production
>
> 4) stop all nova services:
>    for i in nova-api nova-cert nova-conductor nova-consoleauth \
>    nova-novncproxy nova-scheduler nova-objectstore;do service $i \
>    stop;done
>
> 5)  mysqldump --all-databases # or at least the nova db
>
> 6) snapshot instance
>
> 7) run puppet test environment (changes cloud archive source to
>    havana, installs new packages and fixes configs).  Expected to
>    fail, as the bug fix isn't packaged yet, but I expected it to
>    fail at v184, not v182!
>
> -> fails, ending at v182
>
>  2014-01-07 19:18:22.193 1463 TRACE nova.db.sqlalchemy.utils
> OperationalError: (OperationalError) (1050, "Table
> 'shadow_security_group_default_rules' already exists") '\nCREATE TABLE
> shadow_security_group_default_rules (\n\tcreated_at DATETIME,
> \n\tupdated_at DATETI
> ME, \n\tdeleted_at DATETIME, \n\tdeleted INTEGER(11), \n\tid
> INTEGER(11) NOT NULL AUTO_INCREMENT, \n\tprotocol VARCHAR(5),
> \n\tfrom_port INTEGER(11), \n\tto_port INTEGER(11), \n\tcidr
> VARCHAR(43), \n\tPRIMARY KEY (id)\n)ENGINE=InnoDB\n\n' ()
> 2014-01-07 19:18:22.193 1463 TRACE nova.db.sqlalchemy.utils
> Command failed, please check log for more info
> 2014-01-07 19:18:22.197 1463 CRITICAL nova [-] Shadow table with name
> shadow_security_group_default_rules already exists.
>
>  /usr/bin/nova-manage db version
>  182
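>
> One way to see what migration 182 is colliding with (untested
> sketch): list the shadow tables already present, since the migration
> fails on the first CREATE for a shadow table that exists.
>
> ```shell
> # Which shadow_* tables already exist in the nova schema?
> mysql -e "SELECT table_name FROM information_schema.tables
>           WHERE table_schema = 'nova'
>             AND table_name LIKE 'shadow%';"
> ```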
>
> 8) stop all nova-services again
>
> 9) grab latest 185_rename_unique_constraints.py from git
>
>     git log 185_rename_unique_constraints.py |head -5
> 2014-01-07 14:45:59 jon pts/15
>     commit c620cafb700ca195db0bd0ef9d62a0c9459bdc38
>     Author: Joshua Hesketh <josh at nitrotech.org>
>     Date:   Tue Oct 29 09:40:41 2013 +1100
>
>             Fix migration 185 to work with old fkey names
>
> 10) reload nova database as dumped at step 5
>     /usr/bin/nova-manage db version
>     161
>
>
> 11) nova-manage db sync
>
>     still fails in same way.
>
> 11.1) mysql -e 'drop table  shadow_security_group_default_rules;' nova
>       don't care at all about the contents of this table so let's be
>       brutal
>
> 11.2) try again:
>       nova-manage db sync
>
>       fails in new way (notably 185 succeeds)
>
>       2014-01-07 20:05:29.157 8499 CRITICAL nova [-] (IntegrityError)
> (1452, 'Cannot add or update a child row: a foreign key constraint
> fails (`nova`.`block_device_mapping`, CONSTRAINT
> `block_device_mapping_instance_uuid_fkey` FOREIGN KEY
> (`instance_uuid`) REFERENCES `instances` (`uuid`))') 'INSERT INTO
> block_device_mapping (instance_uuid, source_type, destination_type,
> device_type, boot_index, image_id) VALUES (%s, %s, %s, %s, %s, %s)'
> ('0acda551-e1f8-4e29-a7b3-2c8fe9d2fb72', 'image', 'local', 'disk', -1,
> 'aee1d242-730f-431f-88c1-87630c0f07ba')
>       root at test:~# nova-manage db version
>       185
>
>       sure enough, there is no instance with uuid
>       0acda551-e1f8-4e29-a7b3-2c8fe9d2fb72, but there was (it's now
>       in shadow_instances).  Also, the block_device_mapping row this
>       is trying to insert currently lives in
>       shadow_block_device_mapping.
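>
>       The checks described above, spelled out (same uuid as in the
>       trace; the first query returns nothing, the second finds the
>       row in the shadow table):
>
> ```shell
> # No longer in the live instances table...
> mysql -e "SELECT uuid FROM instances
>           WHERE uuid = '0acda551-e1f8-4e29-a7b3-2c8fe9d2fb72';" nova
> # ...but present in the shadow table
> mysql -e "SELECT uuid FROM shadow_instances
>           WHERE uuid = '0acda551-e1f8-4e29-a7b3-2c8fe9d2fb72';" nova
> ```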
>
>
> 11.3) OK, I don't really care about that table either; let's revert
>       and drop it along with shadow_security_group_default_rules:
>
>       root at test:~# mysql nova < nova.sql
>       root at test:~# mysql -e 'drop table
> shadow_security_group_default_rules;drop table
> shadow_block_device_mapping;' nova
>       root at test:~# nova-manage db sync
>
> 11.4) that didn't work because it needs the table; let's try just
>       clearing it instead:
>
>       root at test:~# mysql nova < nova.sql
>       root at test:~# mysql -e 'drop table
> shadow_security_group_default_rules;TRUNCATE TABLE
> shadow_block_device_mapping ;' nova
>       root at test-nimbus:~# nova-manage db sync
>
>       Failure, but progress:
>
>       Command failed, please check log for more info
>       2014-01-07 21:41:05.407 28650 CRITICAL nova [-] (IntegrityError)
> (1451, 'Cannot delete or update a parent row: a foreign key constraint
> fails (`nova`.`instance_actions_events`, CONSTRAINT
> `instance_actions_events_ibfk_1` FOREIGN KEY (`action_id`) REFERENCES
> `instance_actions` (`id`))') 'DELETE FROM instance_actions WHERE
> instance_actions.instance_uuid NOT IN (SELECT instances.uuid \nFROM
> instances)' ()
>
>       root at test:~# nova-manage db version
>       208
>
> 11.5) rewind and delete all the instance_actions_events that reference
>       the instance actions this wants to delete
>
>
>       root at test:~# mysql nova < nova.sql
>       root at test:~# mysql -e 'drop table
> shadow_security_group_default_rules;TRUNCATE TABLE
> shadow_block_device_mapping ;DELETE FROM instance_actions_events WHERE
> action_id IN (SELECT id FROM instance_actions WHERE
> instance_actions.instance_uuid NOT IN (SELECT instances.uuid FROM
> instances));' nova
>       root at test-nimbus:~# nova-manage db sync
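>
>       For anyone repeating this, the DELETE above can be previewed
>       non-destructively first (same subquery, SELECT COUNT(*)
>       instead of DELETE):
>
> ```shell
> mysql -e 'SELECT COUNT(*) FROM instance_actions_events WHERE
>           action_id IN (SELECT id FROM instance_actions WHERE
>           instance_actions.instance_uuid NOT IN
>           (SELECT instances.uuid FROM instances));' nova
> ```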
>
>
>         Insanely, this is now failing earlier:
>
>         root at test-nimbus:~# nova-manage db sync
>         Command failed, please check log for more info
>         2014-01-07 22:09:00.229 1898 CRITICAL nova [-]
> (OperationalError) (1050, "Table 'instance_groups' already exists")
> '\nCREATE TABLE instance_groups (\n\tcreated_at DATETIME,
> \n\tupdated_at DATETIME, \n\tdeleted_at DATETIME, \n\tdeleted INTEGER,
> \n\tid INTEGER NOT NULL AUTO_INCREMENT, \n\tuser_id VARCHAR(255),
> \n\tproject_id VARCHAR(255), \n\tuuid VARCHAR(36) NOT NULL, \n\tname
> VARCHAR(255), \n\tPRIMARY KEY (id), \n\tCONSTRAINT
> uniq_instance_groups0uuid0deleted UNIQUE (uuid,
> deleted)\n)ENGINE=InnoDB CHARSET=utf8\n\n' ()
>
>         root at test-nimbus:~# nova-manage db version
>         186
>
> Since this is all in test and virtualized, I can try any weird thing
> anyone might suggest without repercussions, but I'm fairly out of
> ideas on my own.  I'm particularly interested in seeing if anyone
> can spot a flaw in the initial setup of the test environment that
> might make it diverge from my production system in ways I haven't
> seen.
>
> Thanks,
> -Jon




More information about the Openstack mailing list