[Openstack-operators] Compute nodes reboot periodically by their own

Sarakaitis, Eric Eric.Sarakaitis at cbts.net
Fri Jul 25 18:54:25 UTC 2014


It definitely looks like a failed DIMM.

____________________________
Eric Sarakaitis
Sr. Systems Engineer
419.303.4624 – mobile
513.841.6329 – desk
Eric.Sarakaitis at cbts.net

From: Juan José Pavlik Salles <jjpavlik at gmail.com>
Date: Thursday, July 24, 2014 at 3:29 PM
To: openstack-operators at lists.openstack.org
Subject: Re: [Openstack-operators] Compute nodes reboot periodically by their own

I found the same messages on both servers, and they match the dates we had recorded as reboots. So I think this is the cause of my reboot problem. Thanks guys!


2014-07-24 15:20 GMT-03:00 Juan José Pavlik Salles <jjpavlik at gmail.com>:
Arne, you might have hit the nail on the head here. I installed ipmitool, and look at this:

root@cebolla:~# ipmitool sel elist
...
 47d | 07/15/2014 | 12:53:29 | System Event #0x83 | Timestamp Clock Sync | Asserted
 47e | 07/15/2014 | 12:53:30 | System Event #0x83 | Timestamp Clock Sync | Asserted
 47f | 07/15/2014 | 12:54:10 | System Event #0x83 | OEM System boot event | Asserted
 480 | 07/24/2014 | 06:04:08 | Memory Mmry ECC Sensor | Uncorrectable ECC | Asserted
 481 | 07/24/2014 | 06:05:10 | System Event #0x83 | Timestamp Clock Sync | Asserted
 482 | 07/24/2014 | 06:05:28 | System Event #0x83 | Timestamp Clock Sync | Asserted
 483 | 07/24/2014 | 06:06:05 | System Event #0x83 | OEM System boot event | Asserted
root@cebolla:~#

Going back in the logs, I see the exact same 4 messages on the previous reboot dates: an uncorrectable ECC error, followed by two clock syncs and a boot event. So it seems that one of my DIMMs is dying.
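The pattern described above (an uncorrectable ECC error followed shortly by a boot event) can be searched for automatically. The following is only a minimal sketch: it assumes the pipe-separated layout shown in the `ipmitool sel elist` output above, the function names are hypothetical, and the 300-second window is an arbitrary choice rather than anything the BMC defines.

```python
from datetime import datetime

# Sample lines copied from the SEL output quoted in this thread.
SAMPLE_SEL = """\
 47f | 07/15/2014 | 12:54:10 | System Event #0x83 | OEM System boot event | Asserted
 480 | 07/24/2014 | 06:04:08 | Memory Mmry ECC Sensor | Uncorrectable ECC | Asserted
 481 | 07/24/2014 | 06:05:10 | System Event #0x83 | Timestamp Clock Sync | Asserted
 482 | 07/24/2014 | 06:05:28 | System Event #0x83 | Timestamp Clock Sync | Asserted
 483 | 07/24/2014 | 06:06:05 | System Event #0x83 | OEM System boot event | Asserted
"""


def parse_sel(text):
    """Parse 'sel elist' style lines into dicts; skip lines that do not fit."""
    events = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 6:
            continue  # e.g. the shell prompt or an elided "..." line
        try:
            when = datetime.strptime(fields[1] + " " + fields[2],
                                     "%m/%d/%Y %H:%M:%S")
        except ValueError:
            continue  # date or time not in the assumed format
        events.append({"id": fields[0], "time": when, "sensor": fields[3],
                       "event": fields[4], "state": fields[5]})
    return events


def ecc_before_boot(events, window_seconds=300):
    """Pair each uncorrectable ECC event with the next boot event,
    if that boot happened within window_seconds (assumed threshold)."""
    pairs = []
    for i, ev in enumerate(events):
        if "Uncorrectable ECC" not in ev["event"]:
            continue
        for later in events[i + 1:]:
            if "boot event" in later["event"]:
                delta = (later["time"] - ev["time"]).total_seconds()
                if 0 <= delta <= window_seconds:
                    pairs.append((ev, later))
                break  # only consider the first boot after the error
    return pairs


events = parse_sel(SAMPLE_SEL)
pairs = ecc_before_boot(events)
for ecc, boot in pairs:
    print(f"ECC error {ecc['id']} at {ecc['time']} "
          f"followed by boot {boot['id']} at {boot['time']}")
```

On a real host the text would come from running `ipmitool sel elist` and capturing its output; other BMC vendors may word the event descriptions differently, so the substring matches would need adjusting.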


2014-07-24 15:07 GMT-03:00 Arne Wiebalck <Arne.Wiebalck at cern.ch>:


Does that mean that

$ ipmitool sel elist

returns nothing?

Arne

On 24.07.2014 at 19:29, Juan José Pavlik Salles <jjpavlik at gmail.com> wrote:
No problem Arne. I checked the ipmi config:

root@cebolla:~# grep -v "^#" /etc/default/openipmi
IPMI_SI=yes
DEV_IPMI=yes
IPMI_WATCHDOG=no
IPMI_WATCHDOG_OPTIONS="timeout=60"
IPMI_POWEROFF=no
IPMI_POWERCYCLE=no
IPMI_IMB=no
root@cebolla:~#

Even though the IPMI interface is on, the watchdog is disabled. I'd like to try different hardware just to check, but I don't have any available right now.
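The same check can be scripted when there are several hosts to inspect. This is a rough sketch that assumes the simple KEY=value layout shown in the grep output above; the helper names are hypothetical, and the sample text stands in for the contents of /etc/default/openipmi.

```python
# Sample mirrors the settings quoted in this thread.
SAMPLE_CONFIG = """\
# comment lines are ignored
IPMI_SI=yes
DEV_IPMI=yes
IPMI_WATCHDOG=no
IPMI_WATCHDOG_OPTIONS="timeout=60"
IPMI_POWEROFF=no
IPMI_POWERCYCLE=no
IPMI_IMB=no
"""


def parse_defaults(text):
    """Parse shell-style KEY=value lines, ignoring comments and blanks."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values such as "timeout=60" survive.
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def watchdog_enabled(settings):
    """True only if IPMI_WATCHDOG is explicitly set to 'yes'."""
    return settings.get("IPMI_WATCHDOG", "no").lower() == "yes"


settings = parse_defaults(SAMPLE_CONFIG)
print("watchdog enabled:", watchdog_enabled(settings))
print("watchdog options:", settings.get("IPMI_WATCHDOG_OPTIONS"))
```

Note that this only reads the init-script defaults. It does not show whether the BMC's watchdog timer is actually armed; as far as I know, `ipmitool mc watchdog get` reports that state directly.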


2014-07-24 13:30 GMT-03:00 Arne Wiebalck <Arne.Wiebalck at cern.ch>:

Oops, I apparently wasn't reading carefully enough and mixed your issue with hosts and mine with guests.

Sorry for the noise!
Arne

On 24.07.2014 at 17:42, Tim Bell <Tim.Bell at cern.ch> wrote:

If it is the hypervisors rebooting, a possible scenario would be that you have a BMC with the watchdog enabled. The BMC will reboot the server if the host does not check in with it every 'n' seconds.

If you have a very busy hypervisor, you may need to tune the watchdog timeout.

I suspect something would be logged in the BMC's IPMI SEL (system event log), but I'm not sure.
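To make the watchdog mechanism concrete, here is a toy model of the timer logic, not real BMC code. The class and method names are hypothetical, time is passed in explicitly so the example is deterministic, and the 60-second timeout is simply the value from the IPMI_WATCHDOG_OPTIONS line quoted elsewhere in this thread.

```python
class Watchdog:
    """Toy model of a hardware watchdog: it expires if the host has not
    reset ("petted") it within timeout_seconds."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_reset = 0.0

    def reset(self, now):
        """Called by the host to signal that it is still alive."""
        self.last_reset = now

    def expired(self, now):
        """True once more than `timeout` seconds passed since the last reset.
        A real BMC would power-cycle or reset the server at this point."""
        return (now - self.last_reset) > self.timeout


wd = Watchdog(timeout_seconds=60)
wd.reset(now=0)
healthy = wd.expired(now=30)      # 30 s since last reset
wd.reset(now=50)
still_ok = wd.expired(now=100)    # 50 s since last reset
too_late = wd.expired(now=120)    # 70 s since last reset
print(healthy, still_ok, too_late)
```

This is why a very busy hypervisor can trip the watchdog: if the daemon that resets the timer is starved for longer than the timeout, the BMC cannot tell the difference between an overloaded host and a hung one.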

Tim

> -----Original Message-----
> From: Arne Wiebalck [mailto:Arne.Wiebalck at cern.ch]
> Sent: 24 July 2014 17:10
> To: Juan José Pavlik Salles
> Cc: openstack-operators at lists.openstack.org
> Subject: Re: [Openstack-operators] Compute nodes reboot periodically by their own
>
> Hi,
>
> Do your compute nodes reboot, or are they shut off?
>
> I am currently looking at some cases where VMs seem to spontaneously shut
> themselves off. At least from the nova logs’ perspective there is no difference to
> a normal shutdown, VM owners however confirm that they did not touch their
> VMs. So far I was unable to explain this.
>
> This is with Havana on a RHEL6 derivative, though.
>
> Cheers,
>  Arne
>
> --
> Arne Wiebalck
> CERN IT
>
> On 24 Jul 2014, at 16:46, Juan José Pavlik Salles <jjpavlik at gmail.com> wrote:
>
> > Hello guys, we have had a small Grizzly cloud running since the beginning of
> > 2013 with Ubuntu 12.04: 2 compute nodes, a storage node and a controller,
> > nothing too fancy. Everything works just fine, but... the compute nodes reboot
> > themselves periodically, sometimes every 2 weeks, sometimes once a month.
> > I've done almost everything I can think of: memory checks, analysed the logs,
> > moved all the VMs to one node, and I just can't find the problem.
> >
> > Have you ever heard of this kind of behaviour on compute nodes? Any ideas
> > where I should look for the problem?
> >
> > Thanks in advance.
> >
> > --
> > Pavlik Salles Juan José
> > Blog - http://viviendolared.blogspot.com
> > _______________________________________________
> > OpenStack-operators mailing list
> > OpenStack-operators at lists.openstack.org
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>



--
Pavlik Salles Juan José
Blog - http://viviendolared.blogspot.com


