Paper Title

ACT now: Aggregate Comparison of Traces for Incident Localization

Authors

Kamala Ramasubramanian, Ashutosh Raina, Jonathan Mace, Peter Alvaro

Abstract

Incidents in production systems are common and downtime is expensive. Applying an appropriate mitigating action quickly, such as changing a specific firewall rule, reverting a change, or diverting traffic to a different availability zone, saves money. Incident localization is time-consuming since a single failure can have many effects, extending far from the site of failure. Knowing how different system events relate to each other is necessary to quickly identify where to mitigate. Our approach, Aggregate Comparison of Traces (ACT), localizes incidents by comparing sets of traces (which capture events and their relationships for individual requests) sampled from the most recent steady-state operation and during an incident. In our quantitative experiments, we show that ACT is able to effectively localize more than 99% of incidents.
