© 2020 Strange Loop
Dynamically scheduled tasks are at the heart of PagerDuty's microservices. They deliver incident alerts and on-call notifications, and they handle myriad administrative chores. Historically, these tasks were scheduled and run using an in-house library built on Cassandra, but that solution had begun to show its age.
Early in 2016, the Core team at PagerDuty built a new Task Scheduler using Akka, Kafka, and Cassandra. After six weeks in development, the Scheduler is now running in production. This talk discusses how the strengths of the three technologies were leveraged to solve the challenges of resilient, distributed task scheduling.
This talk will present a number of distributed system concepts in the real-world context of the Scheduler project. How can you dynamically adjust for increased task load with zero downtime? Can you guarantee task ordering across many servers? Do your tasks still run when an entire datacenter goes down? What happens if your tasks are scheduled twice? Attendees can expect to see how all of these challenges were addressed.
Some familiarity with distributed queueing and actor systems will be helpful for attendees of this talk.
As a Software Engineer on the Core Team at PagerDuty, David works on mad-scientist projects designed to increase the capacity and flexibility of other engineering teams. Mostly this means writing libraries, defining best practices, and maintaining shared infrastructure. David lives in Toronto; he enjoys playing board games, cooking, and installing Linux on everything. You can usually find him cycling about the west end, or hiking with his incredibly fluffy dog.