On 4/12/19 2:34 PM, Monty Taylor wrote:
On 4/11/19 9:42 PM, Ilya Shakhat wrote:
Distributed tracing is one of the must-have features when one wants to track the full path of a request going through different services and APIs. This makes it similar to a shared request-id, but with nice visualization at the end [1]. In OpenStack, tracing can be achieved via the osprofiler library. The library was introduced 5 years ago, when there was no standard approach to tracing, which is why it stands apart from what has since become mainstream. There is still no single standard, but the major players are the OpenTracing and OpenCensus communities. OpenTracing is represented by Uber's Jaeger, which is the default tracer in the k8s world.
Issues and limitations to be fixed:
1. Compatibility. While the osprofiler library supports many different storage drivers, it has only one way of transferring trace context over the wire. Ideally, the library should be compatible with third-party tracers and allow traces to start in front of OpenStack APIs (e.g. in user apps) and continue after them (e.g. in storage systems or network management tools). [2]
2. Operation mode. With osprofiler, tracing is initiated by a user request, while in industrial solutions tracing can be managed centrally via dynamic sampling policies.
3. In-process trace propagation. Depending on the execution model (threaded, async), the ways of storing the current trace context differ. OSProfiler supports the thread-local model, which recently got broken by the new async implementation in openstacksdk [3].
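To make item 3 concrete, here is a minimal sketch using only the Python standard library. It is not osprofiler's or openstacksdk's actual code; the names `_thread_local`, `_ctx_trace_id`, `read_trace_ids` and `demo` are hypothetical. It shows that a trace id kept in thread-local storage is not visible once work is handed to another thread, whereas a `contextvars` context can be copied and carried along explicitly.

```python
import contextvars
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical storage for the "current trace id"; illustrative names only.
_thread_local = threading.local()
_ctx_trace_id = contextvars.ContextVar("trace_id", default=None)


def read_trace_ids():
    """Return (thread-local value, context-variable value) as seen by the caller."""
    return getattr(_thread_local, "trace_id", None), _ctx_trace_id.get()


def demo():
    # The code handling the request sets the trace id in both storages.
    _thread_local.trace_id = "trace-1"
    _ctx_trace_id.set("trace-1")

    with ThreadPoolExecutor(max_workers=1) as pool:
        # Plain submit: the thread-local value set above is not visible
        # in the worker thread.
        plain = pool.submit(read_trace_ids).result()

        # Copy the caller's context and run the task inside that copy:
        # the context variable travels with it, the thread-local does not.
        ctx = contextvars.copy_context()
        carried = pool.submit(ctx.run, read_trace_ids).result()

    return plain, carried
```

Calling `demo()` returns the pair of values seen by the plain task and by the task run inside the copied context. Whichever storage a tracer uses, code that hands work to other threads or to an event loop has to cooperate with it, which is the kind of interaction the openstacksdk bug [3] is about.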
FWIW - we should have re-fixed that issue in SDK for all instances other than parallel uploading of Large Object segments to swift. The parallelism support now relies on the calling context's parallelism. The large-object segment uploader is something we should make sure gets handled, so that we're not losing those interactions.
That said - if we move forward with this plan - let's make sure it works in openstacksdk - and that we're testing it so that we don't break it.
Do we need to wrap logical operations that may make more than one remote call in a single span? I ask because in the cloud layer of openstacksdk there are methods, like "create_image" or "get_server", which can wind up making multiple calls to multiple services, but each is a single logical operation to the user. I don't know enough about the opentracing best practices - do we care about such aggregations? Or is simply wrapping the http call at the ksa layer enough?
With OpenTracing it is possible to select the appropriate model along with the tracer configuration.
What's the plan: Switching to OpenTracing could be a good option to gain compatibility with 3rd-party solutions. The actual change should go into the osprofiler library, but it indirectly affects all OpenStack projects (should it be a global team goal then?). I'm going to make a PoC of the proposed change, so reviews would be highly appreciated.
Comments, suggestions?
Generally supportive. I have specific implementation feedback - but I'll leave that on the patches.
Thanks, Ilya
[1] e.g. http://logs.openstack.org/15/650915/4/check/tempest-smoke-py3-osprofiler-red...
[2] https://bugs.launchpad.net/osprofiler/+bug/1798565
[3] https://bugs.launchpad.net/osprofiler/+bug/1818493