[openstack-dev] [all][races][rally] Better way to work on fixing races in gates with Rally
boris at pavlovic.me
Sat Nov 29 15:45:54 UTC 2014
It's not a big secret that working on "races", especially in OpenStack gates,
is quite a complicated task. The current workflow looks like this:
1) Some tempest test run fails
2) Build rules for elastic-recheck & report bug
3) Recheck specific bug XXX
4) Collect stats
5) Attempt to fix BUG, believe that it fixes bug and MERGE IT!
6) Monitor stats: http://status.openstack.org/elastic-recheck/
7) Repeat 5-6 until bug is fixed
With Rally we can improve this workflow.
As you probably know, in many projects we are running rally-job.
Usually it's called "gate-rally-dsvm-<something>"
This job does a simple thing:
1) Run dsvm job that installs OpenStack + Rally
2) Run Rally Task (set of benchmarks) against this Cloud
3) Create a pretty page with results:
4) Put +1/-1 vote depending on criteria of success (sla) of benchmarks
specified in task
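The voting step can be sketched roughly like this. It is only an illustration of the idea, not Rally's actual code: the function name sla_vote, the input format (one error entry per benchmark iteration, None meaning success) and the max_failure_rate threshold are all assumptions made for the example.

```python
def sla_vote(iteration_errors, max_failure_rate=0.0):
    """Return +1 if the share of failed iterations is within the limit, else -1.

    iteration_errors: one entry per benchmark iteration; None (or any falsy
    value) means the iteration succeeded. This input format is hypothetical.
    max_failure_rate: allowed percentage of failed iterations (0..100).
    """
    total = len(iteration_errors)
    if total == 0:
        # Nothing ran, so success cannot be claimed.
        return -1
    failed = sum(1 for err in iteration_errors if err)
    failure_rate = 100.0 * failed / total
    return 1 if failure_rate <= max_failure_rate else -1


# 11 iterations, one of which hit the race.
print(sla_vote([None] * 10 + ["volume was not attached"]))
```

With a threshold of 0, a single failed iteration is enough to turn the vote into -1, which is what you want when hunting a race.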
This job is very precise and flexible, as opposed to the tempest job, which just
runs a set of functional tests predefined in tempest and infra.
1) You have a plugins dir:
where you can put plugins. In Rally almost everything is pluggable:
success criteria, load generators, benchmark scenarios and context,...
2) You have a task file with the specification of what benchmarks to run:
That allows you to specify what benchmarks to run in gates.
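To give a feeling of what such a task file contains, here is a small sketch that generates one as JSON. Treat the details as assumptions: the scenario name "CinderVolumes.create_and_attach_volume", the "size" argument and the exact runner/sla keys are my best guess at the layout, not copied from the patch, and build_task is a helper invented for this example. The values 11 and 4 mirror the demo described later in this message, reading "4 simultaneously" as the runner concurrency.

```python
import json


def build_task(scenario_name, times, concurrency, scenario_args=None):
    """Build a task description: one scenario, a constant-load runner, an SLA.

    The key layout below is an assumption about the task format.
    """
    return {
        scenario_name: [
            {
                "args": scenario_args or {},
                # Run the scenario `times` times, `concurrency` at once.
                "runner": {
                    "type": "constant",
                    "times": times,
                    "concurrency": concurrency,
                },
                # Any failed iteration makes the job vote -1.
                "sla": {"failure_rate": {"max": 0}},
            }
        ]
    }


task = build_task(
    "CinderVolumes.create_and_attach_volume",  # assumed scenario name
    times=11,
    concurrency=4,
    scenario_args={"size": 1},  # hypothetical argument
)
task_json = json.dumps(task, indent=2, sort_keys=True)
print(task_json)
```

The point of keeping this in a file inside the project is that a reviewer can change the load (times, concurrency) in the same patch series as the fix.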
New workflow for fixing races with Rally:
1) Create or use an existing benchmark that exercises the code and reproduces
the race close to 100% of the time.
2) Push the patch to review and ensure that the rally job fails.
3) Push the fix and, in a dependent patch, the changes to the rally task file that reproduce the bug.
4) If the bug is no longer reproduced, merge the first patch and abandon the change to the rally task.
As a demo I made changes to the rally task that reproduce a high-priority cinder
bug (volumes are not attached):
1) So here is the patch:
2) We specify in the rally task to run the benchmark 11 times, with 4 scenarios
running simultaneously; each scenario does: create server, create volume, attach
the volume to the server, detach the volume, delete the server.
3) We can see that the Rally job returns -1. After that we can click on its URL
and see this page:
4) There are 2 interesting links on it:
A) HTML report. It shows which benchmark actually failed:
B) DSVM logs (Logs of all services):
And here you can find the cinder logs and the actual exception that occurs:
So now we can reproduce the race condition in gates with close to 100% likelihood;
in other words, we are able to test that the fix really fixes this issue.
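Why repeating the benchmark gets you close to 100%: if one iteration hits the race with probability p, and iterations are independent (a simplifying assumption, concurrent iterations on one cloud are not really independent), then at least one of n iterations fails with probability 1 - (1 - p)^n. A quick sketch, with made-up values of p:

```python
def reproduce_probability(p_single, runs):
    """Probability that at least one of `runs` independent iterations hits the race."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    return 1.0 - (1.0 - p_single) ** runs


# The per-iteration probabilities are hypothetical; 11 is the number of
# iterations used in the demo task.
for p in (0.1, 0.3, 0.5):
    print("p=%.1f -> %.4f" % (p, reproduce_probability(p, 11)))
```

So even a race that shows up in only about a third of single runs is caught by 11 iterations almost every time, and raising the iteration count in the task file pushes that further.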
Happy bug fixing!=)