[openstack-dev] [OpenStack-Dev] [Nova][Neutron][Horizon][Cinder][Keystone][Glance][Ironic][Swift] Fault Classification Input Request

Matt Riedemann mriedemos at gmail.com
Fri Dec 1 16:41:14 UTC 2017


On 11/30/2017 6:05 PM, Nematollah Bidokhti wrote:
> Hi,
> 
> Our [Fault-Genes WG] has been working on defining the fault 
> classifications for key OpenStack projects in an effort to support 
> OpenStack fault management & self-healing.
> 
> We have been using machine learning (unsupervised data) as a method to 
> look into all bugs and issues submitted by the community and it has been 
> very challenging to define the classification completely by the machine.
> 
> We have decided to go with a supervised data set. In order to do this, we 
> need to come up with our training data.
> 
> We need your help to generate the training data set. *Basically, we only 
> need 2 or 3 unique fault classifications with a short description and 
> the associated mitigations _from each member who is familiar with 
> OpenStack design & operation_. This way we can build a focused library 
> of faults & mitigations for each project.*
> 
> Once this data is accumulated, we will develop our own specific 
> algorithms that can be applied to all future OpenStack issues.
> 
> Thanks in advance for your support.
> 
> No. | Project | Fault Classification | Description | Root Cause | Mitigation
> ----+---------+----------------------+-------------+------------+-----------
> 1   |         |                      |             |            |
> 2   |         |                      |             |            |
> 3   |         |                      |             |            |
> 
> Below are examples of what a couple of developers in Neutron have 
> provided. I am sure there are other types of fault classifications in 
> Neutron that have not been captured in this table.
> 
> Fault Classification                                  | Root Cause                                            | Mitigation
> ------------------------------------------------------+-------------------------------------------------------+----------------------------------------------------------------------
> Network Connectivity Issues                           | Virtual interface in the VM admin down                | Un-shut the virtual interface
>                                                       | Virtual interface does not have IP address via DHCP   | Depends on lower level root cause
>                                                       | Virtual network does not have interface to the router | Add virtual network as one of the router interfaces
>                                                       | vNIC port of VM not active (stuck in build)           | Depends on lower level root cause
>                                                       | Security group lock in traffic                        | Fix the security group to allow relevant traffic
> Unable to Add Port to Bridge                          | Libvirtd in AppArmor is blocking                      | Allow the Libvirtd profile in AppArmor
> No Valid Host Found/insufficient hypervisor resources | Compute nodes do not have sufficient resources        | Free up required compute storage and memory resources on compute node
> No Resource                                           | Configuration issues                                  | Change config setting
> Authentication/permissions error                      | Configuration error such as port # or password        | Make sure end points are properly configured
> Gateway access not reachable                          |                                                       | Use custom keep-alive health-check
> Design issue of OpenStack Network node                |                                                       | Out of band health checking mechanism
> Security Group Mis-configuration                      | The security group                                    | Change security rules/Programming the security group
> DNS Attack                                            |                                                       | Implement CERT alerts updates
> Network design issue                                  | Network storm                                         | Reduce L2 broadcast domain
> 
> 
> Nemat
> 
> 
> 
> 

I'm not entirely sure how you classify some of this stuff.

For example, here is a nova/neutron bug in triage:

https://bugs.launchpad.net/nova/+bug/1730637

In this case, the user tries to attach a port to an instance and it 
fails with a port binding failure.

From the nova side, we have no idea if this is a user error or a 
problem in the networking backend. Therefore I wouldn't know how to 
classify this, or describe the root cause or how to mitigate it.
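
As a rough illustration of the problem, one row of the requested table 
could be modelled as below. This is only a sketch: the FaultRecord class 
and its field names are hypothetical and not part of any OpenStack 
project. A row from the quoted Neutron examples fills in cleanly, while 
the port binding bug above is left with only its symptom.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FaultRecord:
    """One row of the requested table (hypothetical field names)."""
    project: str
    description: str = ""                 # the quoted examples have no description column
    classification: Optional[str] = None  # None: not classifiable yet
    root_cause: Optional[str] = None
    mitigation: Optional[str] = None

    def is_labeled(self) -> bool:
        # A row is only useful as supervised training data if all labels exist.
        return None not in (self.classification, self.root_cause, self.mitigation)


# A row taken from the quoted Neutron examples.
neutron_row = FaultRecord(
    project="neutron",
    classification="Network design issue",
    root_cause="Network storm",
    mitigation="Reduce L2 broadcast domain",
)

# Bug 1730637 as seen from the nova side: only the symptom is known.
port_binding_bug = FaultRecord(
    project="nova",
    description="Attaching a port to an instance fails with a port binding failure",
)

print(neutron_row.is_labeled())       # True
print(port_binding_bug.is_labeled())  # False
```

Rows like the second one would have to be triaged on the networking 
side before they could be used as labeled training data.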

-- 

Thanks,

Matt


