[openstack-qa] In connection with speed-up and future design of tempest
David Kranz
david.kranz at qrclab.com
Mon Jan 21 20:48:52 UTC 2013
Attila, thanks for your comments. They are a lot to digest! I did have a
particular comment about the performance issue and parallelism.
I ran parallel nose jobs manually for the compute/servers and
compute/images tests and got only a little more than half of the
theoretically available speedup. This was on a real machine with 24
"cpus" running devstack. I believe most of the problem is due to
https://bugs.launchpad.net/nova/+bug/1016633, which limits the speedup of
parallel instance creation on a single server to half of what it could be.
Running on a more constrained jenkins instance could easily be worse.
-David
On 1/21/2013 7:46 AM, Attila Fazekas wrote:
> Hi All,
>
> I have several things where I need some clarification.
> I have heard a lot of different opinions on IRC about the topics below, but I do not know which approaches are good for the majority and which are good only for a minority.
> All of the statements below are just opinions, questions, or suggestions, even if they look otherwise.
> I just want to discuss these topics.
>
> I would like to hear everybody's opinion on these questions. It will help me to see the possible future steps.
>
>
> 1. testtools
> I have seen various attempts at refactoring tempest to be compatible with testr (testrepository, testresources, testtools), but I have not seen any detailed plan for doing it, nor anything about the longer-term goals.
>
> In the https://blueprints.launchpad.net/tempest/+spec/speed-up-tempest blueprint, the full specification link points to the blueprint edit page instead of a wiki page.
> Did I miss something?
>
> I saw many cool features in these tools. I would probably be the first to say "do it yesterday" if I could see exactly how it will improve performance, without additional side effects, resource starvation, or even deadlocks or synchronization issues.
>
> I have only seen testr mentioned in the parallelization context, so I assume we are considering a major refactoring and a switch to testtools just for the parallel execution.
> Please correct me if I am wrong.
>
> I think that for parallel execution alone, this is not the cheapest solution in terms of work hours.
>
>
> 2. Limits must be considered in any new design
>
> 2.1 Limited by machine size
>
> Correct me if I am wrong: Tempest's primary function is the gate jobs; its secondary function is providing test tools for various other test environments and even for production environments.
> We should make performance sacrifices on the primary goal only if they have significant benefits in the other cases and the side effect on the primary case is not significant. (Configurable things are good. :))
>
> The gate jobs nowadays are run on small VMs with minimal resources; I do not know the exact numbers, but it could be about 1 vCPU, 4 GB RAM, and 20 GB of storage.
>
> Tempest mostly just waits for I/O from the other services, but it shares the CPU and I/O resources with those other processes.
> In small environments the performance could be even worse when you try to run in parallel.
>
> 2.2 Quota limits (default 10 for most resources)
>
> The default quota limits should not block a new design.
>
> Since tempest knows the admin password, it can create tenants with higher or unlimited quotas; a sketch follows below.
> I do not see why we need to be limited by the default quotas.
> Tempest can create tenants with lower quotas for quota testing, or that can be done in a periodic test, even by a shell script.
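>
> For example, something like this (an untested sketch using the python-keystoneclient/python-novaclient APIs; the credentials, tenant name, and numbers are made up):
>
>     from keystoneclient.v2_0 import client as keystone_client
>     from novaclient.v1_1 import client as nova_client
>
>     # Admin credentials, as tempest already has them (values made up).
>     keystone = keystone_client.Client(username='admin', password='secret',
>                                       tenant_name='admin',
>                                       auth_url='http://127.0.0.1:5000/v2.0')
>     # A dedicated tenant for the test run...
>     tenant = keystone.tenants.create(tenant_name='tempest-highquota',
>                                      enabled=True)
>     # ...with quotas raised well above the defaults.
>     nova = nova_client.Client('admin', 'secret', 'admin',
>                               'http://127.0.0.1:5000/v2.0')
>     nova.quotas.update(tenant.id, instances=100, cores=200, ram=512000)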
>
> On the gate the maximum number of VMs is limited by the size of the devstack VM. Do not forget: even with small-memory guests (64 MB RAM), we should not run more than about 5 non-idle VMs per CPU thread.
>
> 3. Isolation, side effects and test splitting
>
> Real test isolation would mean installing all services on a fresh machine, running just a single test case, then reinstalling everything and starting again.
> I think nobody wants to go in that direction.
>
> Even now, test cases can fail just because of heavy machine load.
>
> We should not isolate test cases in newly created tenants by default; we need to isolate only the test cases which would otherwise fail or cause others to fail.
> I am saying this because resource creation and deletion can be really expensive, and a resource can be used only within a single tenant.
> We are slowed down primarily by "real world" events; we can probably gain more performance with tricks that make those real-world events cheaper.
>
>
> 4. Resource reuse
>
> I have heard many concerns about resource reuse, but I think that with proper logging and a reuse strategy we can point out which test case dirtied a resource.
> The OpenStack API provides basic and advanced information about resource state, so we can decide whether a resource is in good shape before starting a test.
> If this concept does not work, either we have found a real bug, or the API does not provide enough information, which is also a bug IMHO.
>
> I think we should try to go the resource-reuse way; it has great benefits even with a single tempest thread, but we need to consider a lot of things if we also want to do it in parallel. A sketch of such a state check follows below.
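>
> For example, before a fixture hands out an existing server it could verify the state from the API alone (a rough sketch; get_server returning (resp, body) follows the RestClient convention, and the task-state key is the standard OS-EXT-STS extension attribute):
>
>     def server_is_reusable(servers_client, server_id):
>         """Decide, purely from what the API reports, whether a
>         previously created server is still in good shape for reuse."""
>         try:
>             resp, server = servers_client.get_server(server_id)
>         except Exception:   # e.g. NotFound: someone already deleted it
>             return False
>         # Reuse only servers that are ACTIVE and not in the middle
>         # of some task (resize, reboot, ...).
>         return (server['status'] == 'ACTIVE' and
>                 not server.get('OS-EXT-STS:task_state'))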
>
> 4.1 Challenges with parallel execution and resource reuse together
>
> Right now it is difficult to know, when we start a test case, how much of a resource it will use. But that is a very important question for scheduling.
>
> We probably need to add some attributes to the test functions and/or classes describing the planned resource usage; a sketch of this and of the deletion waiting follows below.
> Do not forget that resource deallocation is not instant, in terms of both system resources and quota usage.
> We should issue the delete request once, and wait for termination (we may retry requests instead of listing; once the resource returns "not found", it is gone) only just before we need a new resource instance that uses the same quota/system resource.
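>
> Roughly like this (a hypothetical sketch; the decorator name, the attribute, and the NotFound class are invented for illustration):
>
>     import time
>
>     class NotFound(Exception):
>         """Stand-in for the client's 404 exception."""
>
>     def needs(servers=0, volumes=0):
>         """Hypothetical decorator: record the planned resource usage
>         on the test function so a scheduler can read it up front."""
>         def decorator(test_func):
>             test_func.resource_needs = {'servers': servers,
>                                         'volumes': volumes}
>             return test_func
>         return decorator
>
>     @needs(servers=2, volumes=2)
>     def test_attach_two_volumes():
>         pass  # the actual test body
>
>     def wait_for_deletion(client, server_id, timeout=60):
>         """Poll GET until the server is really gone; the 404 is the
>         only reliable "it is deleted" signal."""
>         deadline = time.time() + timeout
>         while time.time() < deadline:
>             try:
>                 client.get_server(server_id)
>             except NotFound:
>                 return
>             time.sleep(2)
>         raise RuntimeError('server %s was not deleted in time' % server_id)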
>
> Many test cases allocate resources (like a server) with certain attributes and verify the operation.
> Many other cases just need any server, or are sensitive to only a few properties.
> Some of the test-sensitive parameters can be changed during the resource's lifetime, but others are permanent.
>
> Just labeling the test cases with the types of resources they need, for example "2 active servers and 2 active volumes from the same tenant", is not enough in all cases.
>
> Servers can be allocated by XML, json and EC2 API calls; however, while XML and json see the same server id, EC2 sees a different one.
> The OS API can now show the server's EC2 id as well, but in other cases (images) we might need to use a "whitebox" DB query.
>
> Just saying in a test fixture that it "needs a server" is not enough; sometimes we require a special server. And all servers draw from the same RAM and CPU pools, which are limited by the hardware.
>
> Test fixtures with multiple resource needs can cause deadlocks or unexpected failures if we let them start before we can guarantee the necessary resources.
>
> In a multi-threaded environment all threads should see the same _consistent_ resource information at the same time, so we might need locking or IPC (consider at least threading/multiprocessing if we are only speaking about CPython). A sketch follows below.
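>
> For instance, with the standard multiprocessing module (a sketch only; the resource names and counts are invented):
>
>     import multiprocessing
>
>     manager = multiprocessing.Manager()
>     resources = manager.dict({'servers': 10, 'volumes': 10})  # shared view
>     resources_lock = manager.Lock()
>
>     def claim(name, count):
>         """Atomically claim `count` units of a resource; every worker
>         sees the same consistent counters through the manager."""
>         with resources_lock:
>             free = resources.get(name, 0)
>             if free < count:
>                 return False
>             resources[name] = free - count
>             return True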
>
> Test case ordering has side effects too: if the test executor starts a test case which uses resource "A", and only the last test case will use "A" again, that resource may stay occupied for a long time.
> Wrong ordering can prevent better parallel resource utilization.
>
> An example corner case: our server quota is 3, we have a test case which needs 3 servers with a special attribute that can only be set at creation time, and the already allocated server "A" has a different value for it.
>
> The actual ordering can even depend on the test runner version.
>
> 4.2 All or nothing
>
> If we adopt any system-wide, general parallel resource-reuse solution, we probably need to do it everywhere at once.
> However, for the "one active server" tests we can probably do an ad-hoc solution without significant impact.
> We start one server when it is first needed and we kill it when nothing needs it anymore (a sketch follows below). We sacrifice just one server slot, but after boot it will not eat much CPU.
> We might be able to have the XML and json tests use the same setUpClass (I am not speaking about just slightly better OOP-style refactoring, though that is possible too).
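>
> The ad-hoc idea, roughly (all names are invented; create_server/delete_server stand in for whichever client we end up using):
>
>     import threading
>
>     class SharedServer(object):
>         """Keep exactly one server alive while at least one test needs it."""
>
>         def __init__(self, client):
>             self._client = client
>             self._lock = threading.Lock()
>             self._refcount = 0
>             self._server_id = None
>
>         def acquire(self):
>             with self._lock:
>                 if self._server_id is None:
>                     # First user: boot the single shared server.
>                     self._server_id = self._client.create_server()
>                 self._refcount += 1
>                 return self._server_id
>
>         def release(self):
>             with self._lock:
>                 self._refcount -= 1
>                 if self._refcount == 0:
>                     # Last user: give the server slot back.
>                     self._client.delete_server(self._server_id)
>                     self._server_id = None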
>
> If we just pick a good resource-reuse solution and do not consider how it behaves in a multi-threaded environment, we might be in big trouble.
> If the gate VM cannot get significantly more CPU power (cores), we will probably see only minimal benefit from parallel execution.
>
> 4.3 Manual ordering
>
> As you can see, the problem set is big, and I have probably missed a lot of other things.
> I would not be surprised if we could achieve better performance more easily by "manual" performance tuning,
> i.e. manually assigning test cases to multiple threads while considering resource reuse (a sketch follows below).
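>
> Something as simple as this could already be a starting point (a sketch; the bucket names and the stub tests are placeholders for real test callables):
>
>     import threading
>
>     # Placeholders for real test callables.
>     def test_reboot(): pass
>     def test_rename(): pass
>     def test_attach(): pass
>     def test_detach(): pass
>     def test_list_flavors(): pass
>
>     # Manually curated: tests that reuse the same resource go into the
>     # same bucket; each bucket runs serially on its own thread, and the
>     # buckets run in parallel against each other.
>     buckets = {
>         'shared_server': [test_reboot, test_rename],
>         'shared_volume': [test_attach, test_detach],
>         'independent':   [test_list_flavors],
>     }
>
>     def run_bucket(tests):
>         for test in tests:
>             test()
>
>     threads = [threading.Thread(target=run_bucket, args=(tests,))
>                for tests in buckets.values()]
>     for t in threads:
>         t.start()
>     for t in threads:
>         t.join()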
>
> I can even live without a unittest framework, if that has significant benefits and someone can show me a really good plan for how to do it.
> Minimal requirements (a sketch meeting them follows the list):
> - Report whether everything was OK or not
> - On failure, tell exactly and very verbosely what was not OK (the first failure might be enough)
> - Ability to skip the failed part and test the rest of the system
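>
> Even a loop this dumb satisfies all three points (a sketch, not a proposal for the real implementation):
>
>     import sys
>     import traceback
>
>     def run_all(tests):
>         """Run every callable; report failures verbosely; never stop."""
>         failures = []
>         for test in tests:
>             try:
>                 test()
>             except Exception:
>                 # Requirement 2: record exactly what was not OK.
>                 failures.append((test.__name__, traceback.format_exc()))
>                 # Requirement 3: keep testing the rest of the system.
>         for name, tb in failures:
>             sys.stderr.write('FAILED: %s\n%s\n' % (name, tb))
>         # Requirement 1: overall OK or not.
>         return not failures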
>
> 5. Fear, uncertainty and doubt
>
> It looks like this is not clear to everyone: do we reject patches because they are not testr/testtools ready, or because they are as nose-dependent as the others?
>
> As you can see, a really good solution has a lot of challenges, and I would like to see that we are going in the right direction before doing a major refactoring.
> In my opinion we should keep following the old rule, "a new test case should be similar to the existing ones", until otherwise announced on this mailing list.
> The announcement should happen 7 days before the new rules are enforced.
> I do not expect major changes within a month, but who knows :)
> For now we should spend more time on cleanup, in order to help any possible major change.
>
> 6. Test Images
>
> 6.1 cli/hlt VM
>
> Most of the servers are never connected to in the test cases; we do operations which do not need a working VM, just an ACTIVE one. We probably have ~3 non-skipped cases which are sensitive to having a working VM.
> We should consider using a very, very small test image which halts the VM in a state where it does not consume CPU resources (and will not try to reboot to "fix" the problem); a sketch follows below.
> It would be great if we could find code that works on most architectures, but x86 alone can be enough at first.
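>
> On x86 the whole "guest OS" can literally be four instructions. A sketch that writes such a boot sector (whether a raw 512-byte image is bootable depends on the hypervisor and image format):
>
>     # cli            (0xFA)  disable maskable interrupts
>     # halt: hlt      (0xF4)  stop the CPU
>     # jmp halt  (0xEB 0xFD)  if an NMI wakes us up, halt again
>     code = b'\xfa\xf4\xeb\xfd'
>     sector = code + b'\x00' * (510 - len(code)) + b'\x55\xaa'  # boot signature
>     with open('cli_hlt.img', 'wb') as f:
>         f.write(sector)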
>
> 6.2 buildroot and cirros
> AFAIK the cirros images do not support all the features we need to test; we should create a list of the minimum VM requirements.
> cirros is built with buildroot; I used buildroot in the distant past and I liked it.
> We could probably create a faster-booting VM (I guess just changing the image compression method to lzo would help) with the additional necessary features.
>
>
> 7. Client Library or RestClient
>
> I see both client-library and REST API test cases in tempest.
>
> Using a client library is good because:
> - We can cover the client's code as well
> - We can reuse existing code
>
> Using a client library is not good because:
> - We are not verifying the API's correctness, just the functionality
> - We might not notice an unwanted API change
>
> Which direction should we move in on this question?
>
> We could test the very same feature in many ways:
> - CLI tools (multiple API version)
> - Client library (json) (multiple API version)
> - XML API (multiple API version)
> - json API (multiple API version)
> - boto library/ EC2
> - a reinvented EC2 client of our own (does not exist at the moment)
>
> The client libraries are used by the CLI tools, so we probably cover more code if we exercise them via the CLI tools.
> AFAIK the devstack exercises are the recommended location for CLI testing.
> We should do some minimal smoke tests with the client libraries anyway.
>
> Another possible combination is doing the XML tests with RestClient and the json tests with the client libraries.
>
>
>
> I need a lot of feedback on the above items in order to know what to add to or remove from tempest.
>
> Best Regards,
> Attila
>
>
> PS: I hope I have fewer typos than usual this time. I can rephrase any unclear part; feel free to ask, even on IRC.
> I hope at least I will know what I intended to say :)
>
> _______________________________________________
> openstack-qa mailing list
> openstack-qa at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-qa