CSR: Small: A Just-in-Time, Cross-Layer Instrumentation Framework for Diagnosing Performance Problems in Distributed Applications
Sponsor: National Science Foundation
Award Number: 1815323
PI: Raja Sambasivan
Co-Is/Co-PIs: Ayse K. Coskun, Orran Krieger
Abstract: Distributed applications running in data centers are critical to society (e.g., for shopping and banking). Engineers must diagnose and fix problems observed in data centers quickly; however, doing so is extremely challenging. A major hurdle is that engineers must spend significant time and effort exploring what instrumentation (e.g., log messages about specific application behaviors) is needed to provide visibility into a new problem. To assist on this front, this project will develop an instrumentation framework that, in response to a new problem, will automatically search the space of possible instrumentation choices and enable the instrumentation needed to provide insight into it.
This project addresses fundamental challenges associated with creating an automatic instrumentation framework: (a) What algorithms and heuristics are suited for automatically and efficiently exploring the instrumentation search space? (b) What architectural support is needed within the framework to enable automatic exploration? (c) How can the search space be explored without significantly impacting application performance? The project will evaluate the utility of algorithms based on operator knowledge, statistics, and machine learning for exploring the search space. It will build on end-to-end tracing, as this will enable the framework to work for problems that affect different sets of requests.
This project will inform the architecture of next-generation instrumentation frameworks, which are needed to keep pace with the ever-increasing complexity of distributed applications. The critical issues identified in popular open-source distributed applications while evaluating the framework will improve their robustness. Researchers will be able to use the software artifacts released by this project to create novel distributed-application-management tools that build on the framework’s unique capabilities. They will also be able to deploy the framework in research clouds to obtain valuable workload traces from them. The project will generate course modules on diagnosis practices for distributed applications.