[User-committee] New working group: [fault genes]. Recap from Austin from "Taxonomy of Failure" Ops session and plans going forward as a working group

Nematollah Bidokhti Nematollah.Bidokhti at huawei.com
Thu May 5 01:22:34 UTC 2016


Hi,

This email is a recap of our OpenStack Summit session "Taxonomy of Failure" in Austin, along with our plans going forward.

Between 55 and 60 people attended our session, and we received a number of comments and suggestions. Nearly all of the comments were positive, and attendees felt we are heading in the right direction.

The goal is to look at OpenStack resiliency holistically: identify all possible failure modes (either experienced to date or implied by the design and implementation), classify them, define the ideal mitigation strategy for each, decide how each should be reported, and document how each can be re-created, all with the OpenStack version in mind. The results of this effort will be used throughout the OpenStack lifecycle (design, development, test, deployment).
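
As a rough illustration only, one entry in such a catalogue could be modelled as a small record carrying the attributes listed above. The sketch below is in Python; the class name, every field name, and the grouping helper are assumptions made for illustration and do not reflect the actual spreadsheet columns.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FailureMode:
    """One catalogued failure mode. All field names are hypothetical."""
    description: str        # what went wrong
    classification: str     # category assigned to the failure
    openstack_version: str  # release the entry applies to
    observed: bool          # True if experienced, False if derived from the design
    mitigation: str = ""    # ideal mitigation strategy
    reporting: str = ""     # how the failure should be reported
    recreation_steps: List[str] = field(default_factory=list)  # how to re-create it


def group_by_classification(modes: List[FailureMode]) -> Dict[str, List[FailureMode]]:
    """Group catalogued failure modes by their classification."""
    groups: Dict[str, List[FailureMode]] = defaultdict(list)
    for mode in modes:
        groups[mode.classification].append(mode)
    return dict(groups)


# Placeholder entry: the values are deliberately generic, not real data.
example = FailureMode(
    description="<what failed>",
    classification="<category>",
    openstack_version="<release>",
    observed=True,
    recreation_steps=["<step 1>", "<step 2>"],
)
catalogue = group_by_classification([example])
```

Keeping the version and the re-creation steps on each entry is what would let the same record be reused across design, development, test, and deployment.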

After our meeting I met with 20 companies, all of which encouraged us to complete the effort we have started and expressed interest in supporting it. As a result, we have decided to start a working group, "Fault Genes", to focus on all OpenStack failure modes.

The plan is to start with email communication and with filling out the Google Sheet template we have set up (https://docs.google.com/spreadsheets/d/1sekKLp7C8lsTh-niPHNa2QLk5kzEC_2w_UsG6ifC-Pw/edit#gid=2142834673). We will begin with a weekly meeting, adjusting as the group sees fit, and hold a checkpoint in 3 months to review what we have accomplished. By then we should have a picture of our progress and of where this effort will go, as well as information to present at the OpenStack Summit in Barcelona. Below is the link to the etherpad:

https://etherpad.openstack.org/p/AUS-ops-Taxonomy-of-Failures

For those who were in the meeting or discussed this at the summit and understand the spreadsheet, please take the time to fill it in with the failure modes that you have experienced so far and the related attributes for each failure mode.

I'll schedule a meeting to get those who weren't at the summit informed of the process and how to use the spreadsheet.

Suggestions for meeting times and further discussion here are appreciated and appropriate.

My availability for meetings is:  1600-2359 UTC

Please use this link http://doodle.com/poll/8ymwuqva7itv84p8 to provide your suggested time.

Thanks,

Nemat Bidokhti
Chief Reliability Architect
IT Product Line, Computing Lab
Futurewei Technologies, Inc.
HUAWEI R&D USA
Tel:    +1-408-330-4714
Cell:   +1-408-528-4909
Fax:    +1-408-330-5088
E-mail: nematollah.bidokhti at huawei.com
2330 Central Expressway
Santa Clara, CA 95050
http://www.huawei.com