Strange Loop

Scaling with Apache Spark (or a lesson in unintended consequences)

Apache Spark is one of the most popular general-purpose distributed systems of the past few years. It has APIs in Scala, Java, Python, and R, and more recently there have been a few different attempts to provide support for JavaScript, C#, and Julia. This talk looks at Apache Spark from a performance and scaling point of view, and at the work we need to do to handle large datasets. In essence, parts of this talk could be considered "the impact of design decisions from years ago and how to work around them." Rather than dumping a set of boilerplate problems and solutions, this talk will help you understand why Spark is built the way it is, and how that can be awesome and terrible at the same time.

It's not all doom and gloom, though: we will explore the new APIs and the exciting new things we can do with them, with a brief detour into how to work around some of the trade-offs in the new APIs, but mostly focused on the exciting, shiny new things we can play with. A basic background in Apache Spark will probably make the talk more exciting, or more depressing, depending on your point of view, but for those new to Apache Spark, just enough to understand what's going on will be covered at the start.

Holden Karau

Holden Karau is a transgender Canadian, Apache Spark committer, and an active open source contributor. She is a co-author of "Learning Spark" and "High Performance Spark", which she encourages everyone to buy multiple copies of. When not in San Francisco working as a software development engineer at IBM's Spark Technology Center, Holden talks internationally on Spark and holds office hours at coffee shops at home and abroad. She makes frequent contributions to Spark, specializing in PySpark and machine learning. Prior to IBM she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science. Outside of software she enjoys playing with fire, welding, scooters, poutine, and dancing.