[Openstack-operators] Compute nodes reboot periodically on their own

Juan José Pavlik Salles jjpavlik at gmail.com
Thu Jul 24 19:29:24 UTC 2014


I found the exact same messages on both servers, and they match the dates we
had registered as reboots. So I think this is my reboot problem. Thanks
guys!
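
In case it helps anyone else: until the DIMM gets replaced I'll keep an eye on
the error counters, roughly like this (a sketch, assuming the edac-utils
package is installed and the EDAC driver is loaded; the sysfs paths can differ
between kernels):

root at cebolla:~# edac-util -v          # per-DIMM corrected/uncorrectable counts
root at cebolla:~# cat /sys/devices/system/edac/mc/mc0/csrow0/ue_count
root at cebolla:~# ipmitool sel elist | grep -i ecc   # new ECC entries in the BMC event log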


2014-07-24 15:20 GMT-03:00 Juan José Pavlik Salles <jjpavlik at gmail.com>:

> Arne, you might have hit the nail on the head here. I installed ipmitool and
> look at this:
>
> root at cebolla:~# ipmitool sel elist
> ...
>  47d | 07/15/2014 | 12:53:29 | System Event #0x83 | Timestamp Clock Sync | Asserted
>  47e | 07/15/2014 | 12:53:30 | System Event #0x83 | Timestamp Clock Sync | Asserted
>  47f | 07/15/2014 | 12:54:10 | System Event #0x83 | OEM System boot event | Asserted
>  480 | 07/24/2014 | 06:04:08 | Memory Mmry ECC Sensor | Uncorrectable ECC | Asserted
>  481 | 07/24/2014 | 06:05:10 | System Event #0x83 | Timestamp Clock Sync | Asserted
>  482 | 07/24/2014 | 06:05:28 | System Event #0x83 | Timestamp Clock Sync | Asserted
>  483 | 07/24/2014 | 06:06:05 | System Event #0x83 | OEM System boot event | Asserted
> root at cebolla:~#
>
> Going back in the logs, I see the exact same 4 messages at each of the
> previous reboot dates. So it seems that one of my DIMMs is dying.
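>
> To pin down which slot it is, the plan is to dump the full record for that
> event (a sketch; 0x480 is just the record id from the listing above, and how
> much DIMM/slot detail it shows depends on the BMC):
>
> root at cebolla:~# ipmitool sel get 0x480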
>
>
> 2014-07-24 15:07 GMT-03:00 Arne Wiebalck <Arne.Wiebalck at cern.ch>:
>
>  Does that mean that
>>
>> $ ipmitool sel elist
>>
>> returns nothing?
>>
>> Arne
>> On 24.07.2014 19:29, Juan José Pavlik Salles <jjpavlik at gmail.com> wrote:
>>  No problem, Arne. I checked the IPMI config:
>>
>> root at cebolla:~# grep -v "^#" /etc/default/openipmi
>> IPMI_SI=yes
>> DEV_IPMI=yes
>> IPMI_WATCHDOG=no
>> IPMI_WATCHDOG_OPTIONS="timeout=60"
>> IPMI_POWEROFF=no
>> IPMI_POWERCYCLE=no
>> IPMI_IMB=no
>> root at cebolla:~#
>>
>>  Even though the IPMI interface is on, the watchdog is disabled. I'd
>> like to try with different hardware just to check, but right now I haven't
>> got any.
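>>
>>  That config only controls the host-side openipmi driver though, so it's
>> probably also worth asking the BMC itself whether a watchdog timer is armed
>> (a sketch; the exact output format varies between BMCs):
>>
>>  root at cebolla:~# ipmitool mc watchdog get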
>>
>>
>> 2014-07-24 13:30 GMT-03:00 Arne Wiebalck <Arne.Wiebalck at cern.ch>:
>>
>>>  Oops, I apparently wasn't reading carefully enough and mixed your
>>> issue with hosts and mine with guests.
>>>
>>> Sorry for the noise!
>>> Arne
>>> On 24.07.2014 17:42, Tim Bell <Tim.Bell at cern.ch> wrote:
>>>
>>> If it is the hypervisors rebooting, a possible scenario would be that you
>>> have a BMC with the watchdog enabled. The watchdog will reboot the server
>>> if it does not call home to the BMC every 'n' seconds.
>>>
>>> If you have a very busy hypervisor, you may need to tune the watchdog
>>> timeout.
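>>>
>>> On Ubuntu that tuning would roughly go through /etc/default/openipmi (a
>>> sketch, based on the defaults file quoted earlier in this thread; the
>>> options are handed to the ipmi_watchdog kernel module and names may vary):
>>>
>>> IPMI_WATCHDOG=yes
>>> IPMI_WATCHDOG_OPTIONS="timeout=300 action=reset"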
>>>
>>> I suspect something would be logged in the BMC's IPMI SEL, but I'm not
>>> sure.
>>>
>>> Tim
>>>
>>> > -----Original Message-----
>>> > From: Arne Wiebalck [mailto:Arne.Wiebalck at cern.ch]
>>> > Sent: 24 July 2014 17:10
>>> > To: Juan José Pavlik Salles
>>> > Cc: openstack-operators at lists.openstack.org
>>> > Subject: Re: [Openstack-operators] Compute nodes reboot periodically on their own
>>> >
>>> > Hi,
>>> >
>>> > Do your compute nodes reboot, or are they shut off?
>>> >
>>> > I am currently looking at some cases where VMs seem to spontaneously
>>> > shut themselves off. At least from the nova logs’ perspective there is
>>> > no difference from a normal shutdown; the VM owners, however, confirm
>>> > that they did not touch their VMs. So far I have been unable to explain
>>> > this.
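>>> >
>>> > The only other place I have left to look is the per-instance libvirt/QEMU
>>> > log, roughly like this (a sketch; the path and instance naming are the
>>> > libvirt/nova defaults and may differ on your setup):
>>> >
>>> > grep -i "shutting down" /var/log/libvirt/qemu/instance-*.log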
>>> >
>>> > This is with Havana on a RHEL6 derivative, though.
>>> >
>>> > Cheers,
>>> >  Arne
>>> >
>>> > --
>>> > Arne Wiebalck
>>> > CERN IT
>>> >
>>> > On 24 Jul 2014, at 16:46, Juan José Pavlik Salles <jjpavlik at gmail.com> wrote:
>>> >
>>> > > Hello guys, we have got a small Grizzly cloud running since the
>>> > > beginning of 2013 with Ubuntu 12.04: 2 compute nodes, a storage node
>>> > > and a controller, nothing too fancy. Everything works just fine,
>>> > > but... the compute nodes reboot themselves periodically, sometimes
>>> > > every 2 weeks, sometimes once a month. I've done almost everything I
>>> > > can think of: memory checks, analysed the logs, moved all the VMs to
>>> > > one node, and I just can't find the problem.
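>>> > >
>>> > > For what it's worth, when analysing the logs I also tried to tell clean
>>> > > shutdowns from hard resets, roughly like this (a sketch; wtmp rotation
>>> > > means older entries may be gone):
>>> > >
>>> > > last -x reboot shutdown | head
>>> > > grep -iE "mce|ecc|panic" /var/log/kern.log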
>>> > >
>>> > > Have you ever heard of this kind of behaviour on compute nodes? Any
>>> > > ideas where I should look for the problem?
>>> > >
>>> > > Thanks in advance.
>>> > >
>>> > > --
>>> > > Pavlik Salles Juan José
>>> > > Blog - http://viviendolared.blogspot.com
>>> > > _______________________________________________
>>> > > OpenStack-operators mailing list
>>> > > OpenStack-operators at lists.openstack.org
>>> > > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>> >
>>> >
>>> > _______________________________________________
>>> > OpenStack-operators mailing list
>>> > OpenStack-operators at lists.openstack.org
>>> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>>
>>
>>
>>
>>  --
>> Pavlik Salles Juan José
>> Blog - http://viviendolared.blogspot.com
>>
>
>
>
> --
> Pavlik Salles Juan José
> Blog - http://viviendolared.blogspot.com
>



-- 
Pavlik Salles Juan José
Blog - http://viviendolared.blogspot.com


More information about the OpenStack-operators mailing list