[OpenStack-Infra] pypi volume downtime
Ian Wienand
iwienand at redhat.com
Mon Dec 5 04:30:36 UTC 2016
Hi,
Today I was alerted on IRC to failing jobs; further investigation
showed the pypi volume was not responding on the mirror
servers.
---
ianw at mirror:/afs/openstack.org/mirror$ ls pypi
ls: cannot access pypi: Connection timed out
---
The bandersnatch logs suggested the vos release was not working, and a
manual attempt confirmed this:
---
root at mirror-update:~# k5start -t -f /etc/afsadmin.keytab service/afsadmin -- vos release -v mirror.pypi
Kerberos initialization for service/afsadmin at OPENSTACK.ORG
mirror.pypi
RWrite: 536870931 ROnly: 536870932 RClone: 536870932
number of sites -> 3
server afs01.dfw.openstack.org partition /vicepa RW Site -- New release
server afs01.dfw.openstack.org partition /vicepa RO Site -- New release
server afs02.dfw.openstack.org partition /vicepa RO Site -- Old release
Failed to start transaction on RW clone 536870932
Volume not attached, does not exist, or not on line
Error in vos release command.
Volume not attached, does not exist, or not on line
---
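Before restarting anything, the volume and server state can be checked
directly. A minimal sketch of the commands I'd reach for (standard
OpenAFS admin tools; written as a function that just prints the
commands, so the sequence can be reviewed before running it against
the real cell):

```shell
#!/bin/sh
# Sketch: print the OpenAFS diagnostic commands for a stuck volume.
# Assumes the standard vos/bos client tools on the admin host.
afs_diag_cmds() {
    volume=$1
    server=$2
    # VLDB entry and per-site status (RW/RO, On-line/Off-line)
    echo "vos examine $volume"
    echo "vos listvldb -name $volume"
    # State of the server processes on the suspect file server
    echo "bos status $server -long"
}

afs_diag_cmds mirror.pypi afs01.dfw.openstack.org
```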
I figured afs01 must be having issues. The problem seems to have
appeared at this point (note these are the .old logs; I restarted
things, which appears to be the point at which they rotate):
--- FileLog.old ---
Sun Dec 4 23:36:06 2016 Volume 536870932 offline: not in service
Sun Dec 4 23:41:03 2016 fssync: breaking all call backs for volume 536870932
Sun Dec 4 23:46:05 2016 fssync: breaking all call backs for volume 536870932
Sun Dec 4 23:46:05 2016 VRequestSalvage: volume 536870932 online salvaged too many times; forced offline.
This then made the volume server unhappy:
--- VolserLog.old ---
Sun Dec 4 23:45:58 2016 1 Volser: Clone: Recloning volume 536870931 to volume 536870932
Sun Dec 4 23:46:11 2016 SYNC_ask: negative response on circuit 'FSSYNC'
Sun Dec 4 23:46:11 2016 FSYNC_askfs: FSSYNC request denied for reason=0
Sun Dec 4 23:46:11 2016 VAttachVolume: attach of volume 536870932 apparently denied by file server
Sun Dec 4 23:46:11 2016 attach2: forcing vol 536870932 to error state (state 0 flags 0x0 ec 103)
As for the root cause, I don't see anything else particularly
insightful in the logs. The salvage server logs, implicated above,
end in February, which isn't very helpful:
--- SalsrvLog.old ---
12/02/2016 04:19:59 SALVAGING VOLUME 536870931.
12/02/2016 04:19:59 mirror.pypi (536870931) updated 12/02/2016 04:15
12/02/2016 04:20:02 totalInodes 1931509
12/02/2016 04:53:31 Salvaged mirror.pypi (536870931): 1931502 files, 442808916 blocks
I looked through syslog & other bits and pieces looking for anything
suspicious around the same time, and didn't see anything.
There may have been a less heavy-handed approach, but I tried
restarting the openafs services on afs01 in the hope the volume would
re-attach, and it appears to have done so. At this point, the mirrors
could access the pypi volume again.
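For the record, the gentler option versus what I effectively did would
look roughly like this (a sketch only, not what was literally typed;
bos salvage and bos restart are the standard OpenAFS admin commands,
with the RO clone ID taken from the logs above):

```shell
#!/bin/sh
# Sketch of the two recovery options; prints the commands rather than
# running them, since both need bos admin access to the cell.
recovery_cmds() {
    server=$1
    # Gentler: salvage just the volume the fileserver forced offline
    echo "bos salvage -server $server -partition /vicepa -volume 536870932"
    # Heavier: restart the fs instance (fileserver/volserver/salvager),
    # roughly what restarting the openafs services amounted to
    echo "bos restart $server -instance fs"
}

recovery_cmds afs01.dfw.openstack.org
```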
I have started a manual vos release on mirror-update.o.o. This seems
to have decided to recreate the volume on afs02.dfw.o.o, which is
still going as I write this:
---
root at mirror-update:~# k5start -t -f /etc/afsadmin.keytab service/afsadmin -- vos release -v mirror.pypi
Kerberos initialization for service/afsadmin at OPENSTACK.ORG
mirror.pypi
RWrite: 536870931 ROnly: 536870932 RClone: 536870932
number of sites -> 3
server afs01.dfw.openstack.org partition /vicepa RW Site -- New release
server afs01.dfw.openstack.org partition /vicepa RO Site -- New release
server afs02.dfw.openstack.org partition /vicepa RO Site -- Old release
This is a completion of a previous release
Starting transaction on cloned volume 536870932... done
Deleting extant RO_DONTUSE site on afs02.dfw.openstack.org... done
Creating new volume 536870932 on replication site afs02.dfw.openstack.org: done
Starting ForwardMulti from 536870932 to 536870932 on afs02.dfw.openstack.org (full release).
[ongoing]
---
That is where we're at right now. I did not really expect that to
happen, and rather stupidly didn't run that "vos release" in a screen
session. The only side-effect at the moment should be that, while the
bandersnatch cron update is running, AFS is locked and the mirrors
will not get a new volume release until this sync is done; i.e. our
pypi mirrors are a bit behind.
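For next time, wrapping the release in screen would avoid the
dangling-session worry. A sketch, assuming the same k5start/keytab
invocation as above (echoed here rather than executed):

```shell
#!/bin/sh
# Sketch: run the long vos release under a detached screen session so
# a dropped ssh connection doesn't kill it. Same keytab path as above.
RELEASE_CMD="k5start -t -f /etc/afsadmin.keytab service/afsadmin -- vos release -v mirror.pypi"
echo "screen -dmS vos-release $RELEASE_CMD"
# Reattach later with: screen -r vos-release
```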
-i