[OpenStack-Infra] pypi volume downtime
Ian Wienand
iwienand at redhat.com
Mon Dec 5 04:30:36 UTC 2016
Hi,
Today I was alerted on IRC to failing jobs; further investigation
showed the pypi volume was not responding on the mirror
servers.
---
ianw at mirror:/afs/openstack.org/mirror$ ls pypi
ls: cannot access pypi: Connection timed out
---
The bandersnatch logs suggested the vos release was not working, and a
manual attempt confirmed this:
---
root at mirror-update:~# k5start -t -f /etc/afsadmin.keytab service/afsadmin -- vos release -v mirror.pypi
Kerberos initialization for service/afsadmin at OPENSTACK.ORG
mirror.pypi
RWrite: 536870931 ROnly: 536870932 RClone: 536870932
number of sites -> 3
server afs01.dfw.openstack.org partition /vicepa RW Site -- New release
server afs01.dfw.openstack.org partition /vicepa RO Site -- New release
server afs02.dfw.openstack.org partition /vicepa RO Site -- Old release
Failed to start transaction on RW clone 536870932
Volume not attached, does not exist, or not on line
Error in vos release command.
Volume not attached, does not exist, or not on line
---
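Before restarting anything, the volume and server state can be checked
directly. A minimal sketch of the commands I'd reach for (standard
OpenAFS admin tools; written as a function that just prints the
commands, so the sequence can be reviewed before running it against
the real cell):

```shell
#!/bin/sh
# Sketch: print the OpenAFS diagnostic commands for a stuck volume.
# Assumes the standard vos/bos client tools on the admin host.
afs_diag_cmds() {
    volume=$1
    server=$2
    # VLDB entry and per-site status (RW/RO, On-line/Off-line)
    echo "vos examine $volume"
    echo "vos listvldb -name $volume"
    # State of the server processes on the suspect file server
    echo "bos status $server -long"
}

afs_diag_cmds mirror.pypi afs01.dfw.openstack.org
```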
I figured afs01 must be having issues. The problem seems to have
appeared at this point (note these are the .old logs; I restarted
things, which appears to be the point at which they rotate):
--- FileLog.old ---
Sun Dec 4 23:36:06 2016 Volume 536870932 offline: not in service
Sun Dec 4 23:41:03 2016 fssync: breaking all call backs for volume 536870932
Sun Dec 4 23:46:05 2016 fssync: breaking all call backs for volume 536870932
Sun Dec 4 23:46:05 2016 VRequestSalvage: volume 536870932 online salvaged too many times; forced offline.
This then made the volume server unhappy:
--- VolserLog.old ---
Sun Dec 4 23:45:58 2016 1 Volser: Clone: Recloning volume 536870931 to volume 536870932
Sun Dec 4 23:46:11 2016 SYNC_ask: negative response on circuit 'FSSYNC'
Sun Dec 4 23:46:11 2016 FSYNC_askfs: FSSYNC request denied for reason=0
Sun Dec 4 23:46:11 2016 VAttachVolume: attach of volume 536870932 apparently denied by file server
Sun Dec 4 23:46:11 2016 attach2: forcing vol 536870932 to error state (state 0 flags 0x0 ec 103)
As for the root cause, I don't see anything else particularly
insightful in the logs. The salvage server logs, implicated above,
end in February, which isn't very helpful:
--- SalsrvLog.old ---
12/02/2016 04:19:59 SALVAGING VOLUME 536870931.
12/02/2016 04:19:59 mirror.pypi (536870931) updated 12/02/2016 04:15
12/02/2016 04:20:02 totalInodes 1931509
12/02/2016 04:53:31 Salvaged mirror.pypi (536870931): 1931502 files, 442808916 blocks
I looked through syslog & other bits and pieces looking for anything
suspicious around the same time, and didn't see anything.
There may have been a less heavy-handed approach, but I tried
restarting the openafs services on afs01 in the hope the volume would
re-attach, and it appears to have done so. At this point, the mirrors
could access the pypi volume again.
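For the record, the gentler option versus what I effectively did would
look roughly like this (a sketch only, not what was literally typed;
bos salvage and bos restart are the standard OpenAFS admin commands,
with the RO clone ID taken from the logs above):

```shell
#!/bin/sh
# Sketch of the two recovery options; prints the commands rather than
# running them, since both need bos admin access to the cell.
recovery_cmds() {
    server=$1
    # Gentler: salvage just the volume the fileserver forced offline
    echo "bos salvage -server $server -partition /vicepa -volume 536870932"
    # Heavier: restart the fs instance (fileserver/volserver/salvager),
    # roughly what restarting the openafs services amounted to
    echo "bos restart $server -instance fs"
}

recovery_cmds afs01.dfw.openstack.org
```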
I have started a manual vos release on mirror-update.o.o. This seems
to have decided to recreate the volume on afs02.dfw.o.o, which is
still going as I write this:
---
root at mirror-update:~# k5start -t -f /etc/afsadmin.keytab service/afsadmin -- vos release -v mirror.pypi
Kerberos initialization for service/afsadmin at OPENSTACK.ORG
mirror.pypi
RWrite: 536870931 ROnly: 536870932 RClone: 536870932
number of sites -> 3
server afs01.dfw.openstack.org partition /vicepa RW Site -- New release
server afs01.dfw.openstack.org partition /vicepa RO Site -- New release
server afs02.dfw.openstack.org partition /vicepa RO Site -- Old release
This is a completion of a previous release
Starting transaction on cloned volume 536870932... done
Deleting extant RO_DONTUSE site on afs02.dfw.openstack.org... done
Creating new volume 536870932 on replication site afs02.dfw.openstack.org: done
Starting ForwardMulti from 536870932 to 536870932 on afs02.dfw.openstack.org (full release).
[ongoing]
---
That is where we're at right now. I did not really expect that to
happen, and rather stupidly didn't run that "vos release" in a screen
session. The only side-effect at the moment should be that, while the
bandersnatch cron update is running, AFS is locked and the mirrors
will not get a new volume release until this sync is done; i.e. our
pypi mirrors are a bit behind.
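For next time, wrapping the release in screen would avoid the
dangling-session worry. A sketch, assuming the same k5start/keytab
invocation as above (echoed here rather than executed):

```shell
#!/bin/sh
# Sketch: run the long vos release under a detached screen session so
# a dropped ssh connection doesn't kill it. Same keytab path as above.
RELEASE_CMD="k5start -t -f /etc/afsadmin.keytab service/afsadmin -- vos release -v mirror.pypi"
echo "screen -dmS vos-release $RELEASE_CMD"
# Reattach later with: screen -r vos-release
```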
-i