[openstack-dev] [all][races][rally] Better way to work on fixing races in gates with Rally

Boris Pavlovic boris at pavlovic.me
Sat Nov 29 15:45:54 UTC 2014

Hi guys,

It's no secret that working on races, especially in the OpenStack gates,
is quite a complicated task.

Current workflow:
1) Some tempest test run fails
2) Build rules for elastic-recheck and report a bug
3) Recheck with specific bug XXX
4) Collect stats
5) Attempt to fix the bug, believe that the patch fixes it, and MERGE IT!
6) Monitor stats: http://status.openstack.org/elastic-recheck/
7) Repeat 5-6 until the bug is fixed

With Rally we can improve this workflow.

As you probably know, many projects run a Rally job.
It is usually called "gate-rally-dsvm-<something>".

This job does a simple thing:
1) Run dsvm job that installs OpenStack + Rally
2) Run Rally Task (set of benchmarks) against this Cloud
3) Create a pretty page with results:
4) Put a +1/-1 vote depending on the success criteria (SLA) of the
benchmarks specified in the task
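
To make step 4 concrete, here is a minimal sketch of how a success
criterion could turn benchmark results into a vote. This is not Rally's
actual implementation; the function name, the result format (one entry per
iteration, None meaning success) and the error string are all assumptions
made for illustration.

```python
def sla_vote(iteration_errors, max_failure_percent=0.0):
    """Return +1 if the share of failed iterations is within the limit, else -1.

    iteration_errors holds one entry per benchmark iteration:
    None for a successful iteration, anything else for a failed one.
    """
    if not iteration_errors:
        # Nothing ran at all: treat it as a failure (an assumption).
        return -1
    failed = sum(1 for err in iteration_errors if err is not None)
    failure_percent = 100.0 * failed / len(iteration_errors)
    return 1 if failure_percent <= max_failure_percent else -1


# Hypothetical results: one of four iterations failed.
results = [None, None, "volume not attached", None]
print(sla_vote(results))                            # strict criterion
print(sla_vote(results, max_failure_percent=50.0))  # relaxed criterion
```

With a strict criterion a single failed iteration is enough to get -1,
which is what you want when the job is supposed to catch a race.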

This job is very precise and flexible, as opposed to the tempest job, which
just runs a set of functional tests predefined in tempest and infra.

You have two things to work with:
1) A plugins dir:
https://github.com/openstack/cinder/tree/master/rally-jobs/plugins
where you can put plugins. In Rally almost everything is pluggable:
success criteria, load generators, benchmark scenarios, contexts, ...
2) A task file with the specification of which benchmarks to run:
That allows you to specify what benchmarks to run in the gates.
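
For those who have not seen a task file, here is a rough sketch of its
shape, built as a Python dict and dumped to JSON. The key names (args,
runner, context, sla) and the scenario name follow the Rally task format as
I remember it and may differ between Rally versions; all the values are
made up for illustration and are not taken from any real gate job.

```python
import json

# One scenario name maps to a list of run configurations.
task = {
    "CinderVolumes.create_and_delete_volume": [
        {
            # Arguments passed to the benchmark scenario.
            "args": {"size": 1},
            # Load generator: run 10 iterations, 2 at a time.
            "runner": {"type": "constant", "times": 10, "concurrency": 2},
            # Environment prepared before the benchmark runs.
            "context": {"users": {"tenants": 1, "users_per_tenant": 1}},
            # Success criterion used for the +1/-1 vote
            # (key name is an assumption, see above).
            "sla": {"failure_rate": {"max": 0}},
        }
    ]
}

print(json.dumps(task, indent=2, sort_keys=True))
```

Raising "times" and "concurrency" is the simplest knob for making a race
show up more often.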

New workflow for fixing races with Rally:
1) Create or use an existing benchmark that exercises the racy code and
reproduces the race with close to 100% likelihood.
2) Push the patch to review and ensure that the rally job fails.
3) Push the fix and, in a dependent patch, the rally task file changes that
reproduce the race.
4) If the bug is not reproduced, merge the first patch (the fix) and abandon
the change with the rally task changes.

As a demo I changed a rally task so that it reproduces a high-priority
cinder bug (volumes are not attached):

So here is the patch:
1) https://review.openstack.org/#/c/137885/
2) In the rally task we specify to run 11 times a benchmark that
simultaneously does 4 scenarios: create server, create volume, attach volume
to server, detach volume, delete server.
3) We can see that the Rally job returns -1. After that we can click on its
URL and see this page:

4) There are 2 interesting links on it:
A) HTML report, which shows which benchmark actually failed:
B) DSVM logs (logs of all services):

And here you can find the cinder logs and the actual exception that occurs:
screen-c-vol.txt.gz ->

So now we can reproduce the race condition in the gates with close to 100%
likelihood; in other words, we are able to test that a fix really fixes the
issue.
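
To see why the reproduction rate matters so much, here is a small
back-of-the-envelope calculation. If an unfixed race shows up with
probability p in each run, then n clean runs in a row happen by luck with
probability (1 - p)^n, so you need n >= ln(1 - c) / ln(1 - p) clean runs to
be sure with confidence c that the fix works. The function name is
hypothetical and the calculation assumes the runs are independent.

```python
import math


def clean_runs_needed(p_repro, confidence=0.95):
    """Clean runs needed before trusting a fix, assuming independent runs.

    p_repro: probability that one run hits the race while it is unfixed.
    confidence: how sure we want to be that the clean runs are not luck.
    """
    if not 0.0 < p_repro <= 1.0:
        raise ValueError("p_repro must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    if p_repro == 1.0:
        # The race always reproduces, so a single clean run is enough.
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_repro))


print(clean_runs_needed(0.01))  # race seen in 1% of runs
print(clean_runs_needed(0.9))   # race seen in 90% of runs
```

A race that fires in 1% of runs needs roughly 300 clean runs for 95%
confidence, while one that fires in 90% of runs needs only a couple. That
is the whole point of pushing the reproduction rate up with a rally task.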

Happy bug fixing!=)

Best regards,
Boris Pavlovic