[OpenStack-Infra] afs02 r/o volume mirrors - ongoing incident

Ian Wienand iwienand at redhat.com
Thu May 24 07:40:58 UTC 2018


Hi,

We were notified [1] of an issue around 22:45 GMT on 2018-05-23 with
the volumes backing the storage on afs02.dfw.o.o, which holds R/O
mirrors for our AFS volumes.

It seems that a number of "vos release" operations were in flight, or
were started, during this time, and they left volumes in a range of
unreliable states that made them unreleasable (essentially halting
mirror updates).

Several of the volumes were recoverable with a manual "vos unlock" and
re-releasing the volume.  However, others were not.

To keep it short, fairly extensive debugging took place [2], but we
had corrupt volumes and deadlocked transactions between afs01 & afs02
with no reasonable solution.

In an effort to resolve this, the afs01 & 02 servers were restarted to
clear all old transactions, and for the affected mirrors I essentially
removed their read-only copies and re-added them with:

 k5start -t -f /etc/afsadmin.keytab service/afsadmin -- vos unlock $MIRROR
 k5start -t -f /etc/afsadmin.keytab service/afsadmin -- vos remove -server afs02.dfw.openstack.org -partition a -id $MIRROR.readonly
 k5start -t -f /etc/afsadmin.keytab service/afsadmin -- vos release -v $MIRROR
 k5start -t -f /etc/afsadmin.keytab service/afsadmin -- vos addsite -server afs02.dfw.openstack.org -partition a -id $MIRROR
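
For anyone repeating this, the four steps above could be wrapped in a
small helper along the following lines. This is only a sketch: the
function names and the DRY_RUN switch are illustrative additions, not
something that was used during the incident. The vos invocations
themselves are the ones listed above, in the same order.

```shell
#!/bin/bash
# Sketch only. run_vos, recover_mirror and DRY_RUN are hypothetical names
# introduced for illustration; the vos arguments match the commands above.

AFS_SERVER="afs02.dfw.openstack.org"
AFS_PARTITION="a"

# Run one vos subcommand with the admin keytab, or just print it
# when DRY_RUN=1 so the sequence can be reviewed first.
run_vos() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "vos $*"
    else
        k5start -t -f /etc/afsadmin.keytab service/afsadmin -- vos "$@"
    fi
}

# Drop and re-add the read-only copy of one mirror volume on afs02.
# The && chain stops at the first step that fails.
recover_mirror() {
    local mirror="$1"
    run_vos unlock "$mirror" &&
    run_vos remove -server "$AFS_SERVER" -partition "$AFS_PARTITION" -id "$mirror.readonly" &&
    run_vos release -v "$mirror" &&
    run_vos addsite -server "$AFS_SERVER" -partition "$AFS_PARTITION" -id "$mirror"
}
```

With this sourced, "DRY_RUN=1 recover_mirror mirror.pypi" prints the
four vos invocations without contacting either server.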

The following volumes needed to be recovered:

 mirror.fedora
 mirror.pypi
 mirror.ubuntu
 mirror.ubuntu-ports
 mirror.debian

(These are the largest repositories, so maybe it's no surprise that
they are the ones that became corrupt?)

I have placed mirror-update.o.o in the emergency file, and commented
out all cron jobs on it.

Right now, I am running a script in a screen as the root user on
mirror-update.o.o to "vos release" these in sequence
(/root/release.sh).  Hopefully, this brings things back into sync by
recreating the volumes.  If not, more debugging will be required :/
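
The contents of /root/release.sh are not reproduced here. As a rough
idea of what a sequential release loop over the volumes listed above
might look like, here is a minimal sketch; the function names, the
RELEASE_CMD override and the log format are assumptions made for
illustration and are not taken from the real script.

```shell
#!/bin/bash
# Hypothetical sketch; the real /root/release.sh may differ.

MIRRORS="mirror.fedora mirror.pypi mirror.ubuntu mirror.ubuntu-ports mirror.debian"

# Release a single volume. RELEASE_CMD is an assumed hook that lets the
# loop be exercised on a machine without AFS access.
release_one() {
    if [ -n "${RELEASE_CMD:-}" ]; then
        $RELEASE_CMD "$1"
    else
        k5start -t -f /etc/afsadmin.keytab service/afsadmin -- vos release -v "$1"
    fi
}

# Release each volume in turn, one at a time. A failure is logged and the
# loop carries on, so one bad volume does not block the others.
release_all() {
    local failed=0
    local mirror
    for mirror in $MIRRORS; do
        echo "$(date -u +%FT%TZ) releasing $mirror"
        if release_one "$mirror"; then
            echo "$(date -u +%FT%TZ) released $mirror"
        else
            echo "$(date -u +%FT%TZ) FAILED $mirror" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

Calling release_all at the end of such a script would then return
non-zero if any volume failed to release, which makes it easy to spot
when more debugging is needed.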

Please feel free to check in on this; otherwise I will post an update
tomorrow, .au time.

-i

[1] http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2018-05-23.log.html#t2018-05-23T22:43:46
[2] http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2018-05-24.log.html#t2018-05-24T04:01:21


