Runaway complexity in Big Data... and a plan to stop it
Big Data has dramatically increased the complexity of building data systems. Big Data forces you to leave the comfortable world of ACID, transactions, and relations, and thrusts you into a challenging world of distributed systems, CAP, and restrictive data models.
You cannot battle complexity with ever more complex systems. That approach leads to restrictive systems that are difficult to operate and perform poorly. The only reasonable way to address the complexity of Big Data systems is to fundamentally rethink your approach so that you avoid that complexity in the first place. A key insight is that the ability to store and process very large amounts of data opens up entirely new ways of building systems that were not possible pre-“Big Data”.
NoSQL is not a panacea. Nor is Hadoop, Storm, or any of the other tools out there for Big Data. Yet there is a way to use these tools in conjunction with one another to build complete and robust realtime data systems with a minimum of complexity. These techniques are possible today and can be implemented and operated by small teams.
In this talk you’ll learn:
- How a huge amount of complexity stems from the CRUD paradigm, and why you only need (and want) CR
- Why embracing immutability is the key to simplifying data systems
- Where NoSQL fits into the big picture
- The “Lambda Architecture”: a generic approach to building data systems using a combination of batch processing and realtime processing
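
To make the points above concrete, here is a minimal Python sketch of how they might fit together. It is not code from the talk: the names (`PageView`, `MasterDataset`, `build_batch_view`, `RealtimeView`, `query`) and the page-view counting example are assumptions chosen for illustration, and the in-memory structures stand in for whatever storage, batch, and realtime tools a real system would use.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PageView:
    """An immutable fact (hypothetical example): a URL was viewed at a time."""
    url: str
    timestamp: int


class MasterDataset:
    """Append-only store: supports create and read, never update or delete."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read_all(self):
        # Return a tuple so callers cannot modify the stored log.
        return tuple(self._events)


def build_batch_view(events, cutoff):
    """Batch layer: recompute counts from scratch over all events up to cutoff."""
    counts = defaultdict(int)
    for event in events:
        if event.timestamp <= cutoff:
            counts[event.url] += 1
    return dict(counts)


class RealtimeView:
    """Realtime layer: incrementally counts only events newer than the cutoff."""

    def __init__(self, cutoff):
        self.cutoff = cutoff
        self.counts = defaultdict(int)

    def update(self, event):
        if event.timestamp > self.cutoff:
            self.counts[event.url] += 1


def query(url, batch_view, realtime_view):
    """Answer a query by merging the batch view with the realtime view."""
    return batch_view.get(url, 0) + realtime_view.counts.get(url, 0)


# Usage: the batch view covers timestamps <= 3, the realtime view the rest.
CUTOFF = 3
log = MasterDataset()
realtime_view = RealtimeView(CUTOFF)
for url, ts in [("/home", 1), ("/home", 2), ("/about", 3), ("/home", 5)]:
    event = PageView(url, ts)
    log.append(event)            # every fact is appended to the master dataset
    realtime_view.update(event)  # and also offered to the realtime layer

batch_view = build_batch_view(log.read_all(), CUTOFF)
print(query("/home", batch_view, realtime_view))   # 2 from batch + 1 realtime
print(query("/about", batch_view, realtime_view))  # 1 from batch only
```

In this sketch nothing is ever updated in place: a mistake in the counting logic can be fixed by correcting `build_batch_view` and recomputing from the untouched log, which is one way to read the claim that immutability simplifies data systems.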
Nathan Marz is an engineer at Twitter. Previously, Nathan was the lead engineer of BackType, which was acquired by Twitter in July 2011. Nathan has been involved in the Big Data space for more than four years. He is the author of Cascalog, a high-level abstraction for MapReduce, and Storm, a distributed and fault-tolerant realtime computation system. These projects are relied upon by dozens of companies. He is the author of an upcoming book for Manning Publications called “Big Data: Principles and best practices of scalable realtime data systems” and writes a blog at http://nathanmarz.com.