[Openstack-operators] [scientific] Ironic Summit recap - ops experiences

Peter Love p.love at lancaster.ac.uk
Thu May 12 10:37:34 UTC 2016

Nice talk on this stuff: https://www.youtube.com/watch?v=GZeUntdObCA

On 12 May 2016 at 10:54, Matt Jarvis <matt.jarvis at datacentred.co.uk> wrote:
> Very familiar list Tim, and we end up working around a lot of them with
> horrible hardware-specific code. Our bugbears also include:
> - Required configuration only being available via a web interface, e.g.
>   setting the hostname of the BMC on Supermicro hardware
> - IPMI hanging and requiring complete removal and reload of the kernel
>   modules to enable resetting
> - Undocumented functions requiring raw IPMI commands; again on Supermicro
>   there is some black magic to set dedicated ports, check power supply
>   status etc.
> - Web interfaces requiring Java, and totally broken on mainstream browsers;
>   HP iLOs in particular, which are almost impossible to use with a Mac.
> - Firmware and BIOSes which don't allow command-line updating from inside a
>   running OS
> We're used to being able to flash BIOS images and CMOS settings by writing
> directly to the memory addresses, but more and more modern hardware won't
> let you do this anymore :(
> We're hoping Redfish will solve some of the configuration-related issues,
> although obviously it won't make any difference to flaky BMC implementations
> and proprietary tooling to update firmware.
> On 12 May 2016 at 06:25, Tim Bell <Tim.Bell at cern.ch> wrote:
>> On 12/05/16 06:22, "Stig Telfer" <stig.openstack at telfer.org> wrote:
>> >Hi All -
>> >
>> >Jim Rollenhagen from the Ironic project has just posted a great summit
>> > report of Ironic team activities on the openstack-devs mailing list[1],
>> > which included this item which will be of interest to the Scientific WG
>> > members who are looking to work on bare metal activities this cycle:
>> >
>> >> # Making ops less worse
>> >>
>> >> [Etherpad](https://etherpad.openstack.org/p/ironic-newton-summit-ops)
>> >>
>> >> We discussed some common failure cases that operators see, and how we
>> >> can solve them in code.
>> >>
>> >> We discussed flaky BMCs, which end with the node in maintenance mode,
>> >> and if Ironic can get them out of that mode automagically. We
>> >> identified the need to distinguish between maintenance set by ironic
>> >> and set by operators, and do things like attempt to connect to the BMC
>> >> on a power state request, and turn off maintenance mode if successful.
>> >> JayF is going to write a spec for this differentiation.
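
To make the maintenance distinction above concrete, here is a minimal sketch in Python. It is not Ironic code: `Node`, `MaintenanceSource`, `enter_maintenance` and `on_power_state_request` are hypothetical names, and the BMC probe is passed in as a plain callable. The only idea taken from the summary is that maintenance set by the service may be cleared automatically once the BMC answers again, while maintenance set by an operator is left alone.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class MaintenanceSource(Enum):
    OPERATOR = "operator"  # set by a human; never cleared automatically
    SYSTEM = "system"      # set by the service after a BMC failure


@dataclass
class Node:
    # Hypothetical node record, not the Ironic data model.
    name: str
    maintenance: bool = False
    maintenance_source: Optional[MaintenanceSource] = None


def enter_maintenance(node: Node, source: MaintenanceSource) -> None:
    """Put the node into maintenance and remember who asked for it."""
    node.maintenance = True
    node.maintenance_source = source


def on_power_state_request(node: Node, probe_bmc: Callable[[Node], bool]) -> bool:
    """Probe the BMC and return True if it answered.

    An unreachable BMC puts the node into system-set maintenance.
    A reachable BMC clears maintenance only if the system set it.
    """
    if not probe_bmc(node):
        if not node.maintenance:
            enter_maintenance(node, MaintenanceSource.SYSTEM)
        return False
    if node.maintenance and node.maintenance_source is MaintenanceSource.SYSTEM:
        node.maintenance = False
        node.maintenance_source = None
    return True
```

In this sketch a node that an operator put into maintenance stays there even when the probe succeeds, which is the behaviour the differentiation is meant to allow.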
>> >>
>> >> Folks also expressed the desire to be able to reset the BMC via APIs.
>> >> We have a BMC reset function in the vendor interface for the ipmitool
>> >> driver; dtantsur volunteered to write a spec to promote that method to
>> >> an official ManagementInterface method.
>> >>
>> >> We also talked for a while about stuck states. This has been mostly
>> >> solved in code, but is still a problem for some deployers. We decided
>> >> that we should not have a "reset-state" API like nova does, but rather
>> >> a command line tool to handle this. lintan has volunteered to write a
>> >> proposal for this; I have also posted some [straw man
>> >> code](https://review.openstack.org/#/c/311273/) that someone is welcome
>> >> to take over or use.
>> >
>> >The operator issues already identified cover some things we’ve hit at
>> > Cambridge. Please do scan through and contribute if there is anything they
>> > have not covered.
>> >
>> We have certainly had our share of BMC problems through the years. It is
>> often frustrating as the very time you find you need the console, it is not
>> working. Having Ironic do active monitoring (without overloading the BMC)
>> would be a real help.
>> The other area we’ve found difficult is configuration:
>> - Software maintenance is very limited. Some vendors choose to produce new
>> versions of the BMC microcode without changing the version number reported
>> by the BMC which makes consistent management difficult. There is no common
>> API defined for updating the code.
>> - Implementations vary significantly between IPMI 1.5 and IPMI 2.0, and
>> between commodity white boxes and blades
>> - BMCs have different LAN channels according to manufacturer for remote
>> access
>> - The tty speeds vary which means that the booted OS needs to have
>> different cmdlines for the kernel according to the underlying hardware
>> - the number of additional accounts is limited in some BMCs and password
>> management is very basic. Currently, we define distinct users for read-only
>> access to the SDRs (e.g. monitoring), console and power operations since
>> these need to be kept in different systems. We also have unique passwords
>> for each machine, all of which requires tracking. Foreman helps here but it
>> is not ideal.
>> - BMC replacement is also frequent. A process to re-import a replacement
>> BMC (new MAC, no user accounts defined) without re-installing the box is
>> needed.
>> - we have a fairly complex reset process which hits the BMC with different
>> levels of reset. We’ve also sometimes found the need to reset the IPMI
>> kernel modules, which go into a loop, at the same time.
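
One way to keep the per-hardware differences in the list above (LAN channel, serial device, tty speed) in a single place is a small lookup table. The sketch below is only an illustration: the model names and all values in `PROFILES` are made up, and `BmcProfile`, `kernel_console_args` and `lan_print_command` are hypothetical helpers. The `console=<device>,<baud>n8` form is the standard Linux kernel serial console argument, and `ipmitool lan print <channel>` is the usual way to show a channel's LAN settings.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BmcProfile:
    lan_channel: int     # IPMI LAN channel used for remote access
    serial_device: str   # serial device the BMC redirects for the console
    baud_rate: int       # tty speed expected by the BMC


# Made-up example values; real ones have to come from your own inventory.
PROFILES: Dict[str, BmcProfile] = {
    "vendor-a-whitebox": BmcProfile(lan_channel=1, serial_device="ttyS1", baud_rate=115200),
    "vendor-b-blade": BmcProfile(lan_channel=2, serial_device="ttyS0", baud_rate=9600),
}


def kernel_console_args(hardware_model: str) -> str:
    """Return kernel cmdline console arguments for the given hardware model."""
    profile = PROFILES[hardware_model]
    return f"console=tty0 console={profile.serial_device},{profile.baud_rate}n8"


def lan_print_command(hardware_model: str) -> List[str]:
    """Return the ipmitool argv that prints LAN settings for the model's channel."""
    profile = PROFILES[hardware_model]
    return ["ipmitool", "lan", "print", str(profile.lan_channel)]
```

The functions only build strings and argument lists, so they can be checked without touching a BMC. An unknown model raises `KeyError`, which seems preferable to silently guessing a channel or speed.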
>> I’m not expecting Ironic to fix all of this but it would be great to have
>> a block of code which we can gradually improve together. There are other
>> good initiatives like OpenBMC but they won’t help with the existing boxes.
>> I think my best advice to Ironic for BMC management would be to consider
>> the BMC as a potentially unreliable device. Thus, along with performing the
>> actions, Ironic should check that they completed and probe that a function
>> which was working an hour ago is still working now (but without overloading
>> the BMC). We’ll be looking at Ironic this year so we’ll be able to help on
>> the failure cases.
>> Tim
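
The "potentially unreliable device" advice above can be sketched as two small pieces: run an action and verify it took effect, and probe health no more often than a minimum interval so the BMC is not overloaded. This is a rough illustration under my own assumptions, not Ironic code; `run_and_verify` and `RateLimitedProbe` are hypothetical names, and the real BMC calls are passed in as callables.

```python
import time
from typing import Callable, Optional


def run_and_verify(action: Callable[[], None],
                   verify: Callable[[], bool],
                   attempts: int = 3,
                   delay: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run action, then check it took effect; retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            action()
            if verify():
                return True
        except Exception:
            # A flaky BMC may raise from either call; treat it as a failed attempt.
            pass
        if attempt < attempts - 1:
            sleep(delay * (2 ** attempt))
    return False


class RateLimitedProbe:
    """Cache a health probe so it runs at most once per min_interval seconds."""

    def __init__(self, probe: Callable[[], bool], min_interval: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._probe = probe
        self._min_interval = min_interval
        self._clock = clock
        self._last_time: Optional[float] = None
        self._last_result: Optional[bool] = None

    def check(self) -> Optional[bool]:
        """Return a fresh result if the interval has elapsed, else the cached one."""
        now = self._clock()
        if self._last_time is None or now - self._last_time >= self._min_interval:
            self._last_result = self._probe()
            self._last_time = now
        return self._last_result
```

The `sleep` and `clock` parameters exist only so the behaviour can be exercised without waiting or talking to real hardware.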
>> >Best wishes,
>> >Stig
>> >
>> >[1]
>> > http://lists.openstack.org/pipermail/openstack-dev/2016-May/094658.html
>> >_______________________________________________
>> >OpenStack-operators mailing list
>> >OpenStack-operators at lists.openstack.org
>> >http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
> DataCentred Limited registered in England and Wales no. 05611763