Strange Loop

2009 - 2023

St. Louis, MO

How Tracing Uncovers Half-truths in Slack’s CI Infrastructure

Traditional monitoring tools like logs and metrics were necessary but not sufficient to debug how and where systems failed in CI, which relies on multiple interconnected critical systems (e.g. GHE, Checkpoint, Cypress). In this talk, Frank Chen shares how traces gave us a critical and compounding capability to better understand where, when, how, and why faults occur for our customers in CI. We show how shared tooling for high-dimensionality event traces (using SlackTrace and SpanEvents) could significantly increase our velocity to diagnose code in flight and to debug complex system interactions. We move from stories of early incidents that motivated further investment across Slack’s internal tooling teams to stories of performance and resiliency gains throughout our infrastructure.

Frank Chen

Slack

Speaker site

@frankc

fxchen

Frank is a maker. At Slack, he works in the Developer Productivity group, where he focuses on making engineers' lives simpler, more pleasant, and more productive. Frank builds tools to be force multipliers for performance and resiliency projects, and guides internal teams to adopt observability culture and tooling. Frank helps people make better decisions by designing technologies that connect people to what they want to do. He informs software development with a background in behavior design, engineering leadership, site reliability engineering, and resiliency research. Frank recently moved back to the Bay Area and can frequently be found hiking, running, or woodworking.