Confirmed Sessions

We're just now beginning to announce the sessions. 70+ more to go. We'll be adding new sessions every few days.

Who guards the guardians? Designing for resilience in cluster orchestrators

Preetha Appan - HashiCorp

Cluster orchestrators enable reliable and repeatable application deploys and provide fault tolerance without operator intervention. These orchestrators are themselves complex distributed systems like the applications they manage. The blast radius when a cluster orchestrator fails is huge; it could take down all your applications. Designing resilience into the orchestrator is a unique challenge given its critical operational nature.
Preetha Appan outlines various failure modes ranging from network failures to entire server failures in Nomad, an open source scheduler that supports heterogeneous workloads. You’ll discover how building graceful degradation and resilience to address these failures involves looking at the problem as a trade-off between three system features: correctness, performance, and availability. Along the way, Preetha shares examples of design decisions that impact the availability of applications managed by the scheduler and lessons learned that apply to building any complex distributed system.

Improving reliability of your distributed data store

Mehant Baid - Dropbox

Edgestore is a low latency, distributed data store that is one of the largest services developed at Dropbox. It serves 10 million requests per second and stores over 10 trillion objects at rest. In the last few years Edgestore has grown from being used by a handful of services to being the primary data store for all of Dropbox’s metadata needs.
This talk will discuss the challenges we faced as we scaled Edgestore and our journey from an operationally burdensome service plagued with incidents and postmortems to one that’s highly reliable and operationally lightweight.

Isolate Computing

Zack Bloom - Cloudflare

For forty years computation has been built around the idea of a process as the fundamental abstraction of a piece of code to be executed. In that time, how we write code has changed dramatically, culminating with serverless, but the nature of a process has not.
Processes unfortunately incur context-switching overhead as the operating system moves the processor from executing one serverless container to another, wasting CPU cycles. Processes can also only do IO and other critical tasks by firing interrupts into the kernel, which wastes as much as 33% of the execution time of an IO-bound function. Processes also incur startup time as heavyweight virtual machines like NodeJS are initialized, which we experience in the serverless world as a cold start. The fear of cold starts requires us to do complex work to warm serverless functions, and requires even infrequently used functions to consume precious memory to avoid them.
There may be an alternative. Web browsers have solved the same problem, the need to run many instances of untrusted code with minimal overhead and start new code execution lightning-fast, in an entirely different way. They run a single virtual machine, and encapsulate each piece of code not in a process, but in an ‘isolate’. These isolates can be started in 5 milliseconds, 100 times faster than a Node Lambda serverless function. They also consume 1/10 the memory.
Beyond serverless, being able to initiate execution of server-side code in less time than it takes for a web request to connect opens dramatic possibilities. Services can be scaled to millions of requests per second instantaneously. They can be deployed to hundreds of locations around the world with the same economics as deploying to just one. Even better, by eliminating process-related overhead, it brings us close to the economics of running on bare metal, but with the ergonomics of serverless programming.
Attendees will leave this presentation with an understanding of where isolate-based serverless might be more appropriate than other forms of compute. In those situations, they will be able to deploy code that can be run affordably close to every Internet visitor, that can autoscale instantaneously, and that can be as much as three times less expensive per CPU cycle than container-based serverless systems.

How I failed to build a runbook automation system and what I learned

Tim Bonci - Cimpress

Our intentions can be good, the technical ability and time may be there, and we set out to build the thing that makes work easier and more productive, allowing everyone to apply their labor to only the most valuable tasks. Yet sometimes that’s still not enough. This is a postmortem of a solution that was designed to solve a common operational problem but failed. I’ll talk about the scars and, hopefully, provide insights into finding and addressing the right problems in the right places. Those insights should be broadly useful in building and deploying your own transformational processes and tools.
This should be particularly relevant to brownfield teams looking for ways to modernize their processes and to anyone who struggles with needing humans to change how they work.
We’ll go over why shifting human processes to computer automation does not always produce the expected results and how treating non-urgent alerts as a work queue is an antipattern.

Enterprise transformation (and you can too)

Donovan Brown - Microsoft

“That would never work here.” You’ve likely heard this sentiment echo from your company’s conference rooms or board rooms (or maybe you’ve said it yourself). There are always reasons: established processes (with vested interests supporting them), legacy codebases and data centers (both with large install footprints), and scale (for some values of scale), to name just a few.
Good news: change is possible. Donovan Brown walks you through a case study from Microsoft’s Visual Studio Team Services (VSTS). VSTS went from a three-year waterfall delivery cycle to three-week iterations and open sourced the VSTS task library and the Git Virtual File System (GVFS). To make these changes, the team had to question its tool choices, change its processes, and empower its people. You’ll learn why integration of cross-functional teams is key to the continuous delivery of value to end users.

Learning from failure: Why a total site outage can be a good thing

Alex Elman - Indeed

Although an outage is a terrifying prospect, you should embrace it as an opportunity. Failure can expand and improve your understanding of your systems.
Three years ago, Indeed suffered one of the worst outages in its history. No single fault or failure caused this outage. Rather, it was a complex interaction of bugs, design decisions, capacity loss, and poor situational awareness during incident response. Indeed learned valuable lessons from this event. It identified ways to make the systems more resilient and improved the approach to the incident lifecycle within the engineering culture.
Alex Elman uses the narrative of this incident to demonstrate how a site-wide outage can inform increased resilience and reduced operational complexity. Learning from failure is a feedback loop rather than a one-off process. He uses Indeed’s outage as a practical example of what an iteration of this loop can look like, and shares with other SREs the success that has arisen from this failure. Indeed hasn’t had a global site outage in the three years since this event.
Alex begins with a discussion of failure to set the stage for the incident background, then discusses incident response and situational awareness. He explains how to conduct incident postmortems, learn from failure, and design for reliability, including resilience patterns such as circuit breaking and graceful degradation. Finally, he covers resilience testing, running chaos tests, and closing the feedback loop, leaving some time for questions and answers.
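As a rough illustration of the circuit breaking and graceful degradation patterns named above, here is a minimal Python sketch. It is not taken from the talk or from Indeed's systems; the class name, thresholds, and fallback mechanism are all assumptions chosen for the example.

```python
import time


class CircuitBreaker:
    """Illustrative breaker: opens after repeated failures, retries after a cooldown."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # hypothetical threshold
        self.reset_after = reset_after    # hypothetical cooldown in seconds
        self.clock = clock                # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, fallback):
        # While the circuit is open, skip the dependency and degrade gracefully.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Cooldown elapsed: allow a trial call; one more failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        # A success resets the failure count.
        self.failures = 0
        return result
```

A caller would wrap a call to a fragile dependency in `breaker.call(fetch_live, serve_cached)`, where both function names are placeholders, so that users get a degraded answer instead of an error while the dependency recovers.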

Scaling SRE organizations: The journey from 1 to many teams

Gustavo Franco - Google

In this talk, Gustavo Franco of Google will share his experience starting new teams and splitting and moving them, from both technical and non-technical standpoints. The talk is ideal for new leaders in charge of SRE who are wondering when it’s time to grow beyond a single team and how to do it. It is also valuable for SREs who want to know what happens behind the scenes, how to influence such changes, and how they can help while avoiding burnout.

Security in the FaaS lane

Karthik Gaekwad - Oracle

Security in FaaS isn’t what you’re used to. With enterprises quickly moving to serverless, there’s a need to address the topic of security.
Karthik Gaekwad shares what he’s learned in the application security sphere that still applies in the modern world of FaaS. Using lambhack, a sample vulnerable serverless application written in Go, Karthik explains how to think about security in the serverless world and details security strategies and pitfalls viewed through a serverless lens. You’ll leave with a solid understanding of how to approach security conversations about serverless applications in the enterprise.

Chaos Breeding Confidence: Broader Implications of Chaos Engineering

Patrick Higgins - Gremlin

Chaos engineering is fast becoming part of the zeitgeist of SRE and DevOps. While this trend has positive implications for these fields specifically, chaos engineering has more to offer the industry as a whole.
The goal of this talk is to showcase alternative applications of chaos engineering as a discipline and describe how chaos engineering can be applied as a broader practical imperative. The talk will address how ‘optimism bias’ negatively affects the creation of product – from specifications to technical implementation details and resource management. Additionally, the talk will explore how the lessons of ‘progressive enhancement’ in the browser can be applied to the complexity of distributed systems today.
Rather than merely presenting challenges, the talk will offer applied solutions for how failure can be made a first-class citizen throughout every organization.

Data Modeling in the 24th and ½ Century with Apache Cassandra

Amanda Moran - DataStax

Why do I want a cloud-native database? Why all this migration headache? Can’t I just keep my relational database? This talk will focus on Apache Cassandra data modeling, how to do it right, and how to be successful with cloud-native distributed databases by avoiding common mistakes. Some of the topics covered in this session are:
- What needs to be considered when moving from a relational database to Apache Cassandra?
- What needs to be considered when moving from another NoSQL database to Apache Cassandra?
- What is the difference between SQL and CQL?
- How do you do data modeling in Apache Cassandra, and what steps get your data model correct?
- What are the common mistakes, and how do you fix them to be successful?

How to Scale your Customer Experience

Chris McCraw - Netlify

Do you wish your company's Support team was constantly bringing you User Stories and filing better bugs? This talk will demonstrate how to create a better environment for collaborative work across teams, particularly as they grow in size and products grow in complexity. We'll cover topics including:
- helping your support team think like engineers. This leads to better escalations and feedback.
- developing an engineering relationship directly with your customers.
- working to develop a model of actionable feedback (instead of "this is broken", "this could meet more use cases we've heard about if behavior changed")
- developing a better escalation path for customer-facing issues
You'll end up with some thoughts and practices that will help your customer experience and Support team interactions scale as fast as your business.
Intended audience: Engineers/Engineering Managers or Product Managers who want to get "closer to the meat" (receive more direct and actionable customer feedback)

Schema Evolution Patterns

Alex Rasmussen - Bits on Disk

Everybody’s talking about microservices, but nobody seems to agree on how to make them talk to each other. How should you version your APIs, and how does API version deprecation actually work in practice? Do you use plain old JSON, Thrift, protocol buffers, GraphQL? How do teams communicate changes in their services’ interfaces, and how do consumer services respond?
Separately, nobody seems to agree on how to handle migrating a service’s structured data without downtime. Do you write to shadow tables? Chain new tables off the old ones? Just run the migration live and hope nothing bad happens? Switch everything over to NoSQL?
Both these problems are instances of issues with schema evolution: what happens when the structure of your structured data changes. In this talk, rather than taking a prescriptive approach, I’ll try to distill a lot of institutional knowledge and computer science history into a set of patterns and examine the tradeoffs between them.

Ariadne's thread through the labyrinth: Using observability to tame a rogue code base

Isobel Redelmeier - LightStep

Every week, seven brave SWEs and seven brave SREs get sacrificed to the Minotaur: the legendary latency leech and the ravenous resource robber lurking somewhere in the labyrinthine depths of your code base.
You, Theseus, have been tasked with rescuing your sleep-deprived comrades from their midnight pages. But even a hero as brave as you cannot possibly survive the maze without some help! This workshop will be your Ariadne, providing you with the thread to find your way to the beast – and back.
We’ll use observability tools to unravel a number of common problems hidden in many code bases out there – maybe even yours. The workshop will cover the basics of tracing, metrics, and other observability tools, so you don’t need significant prior knowledge of observability concepts (or Greek mythology). For those with experience, the activities should still help you to better leverage your debugging toolkit.
Let’s conquer the Minotaur and root out your performance problems!
Required background: Attendees should be comfortable on the dev and/or ops side of production. They may be in their first year or their 14th on the job, but they're either not as comfortable with observability as they'd like or aren't yet familiar with the term. The theoretical perfect attendee can list off the creative workarounds they've found for grepping through the logs. Maybe they're an SRE at a company that doesn't yet have all the best APM tooling in place; maybe they straddle the world between "dev" and "ops", and would like to be more proactive about their SLIs. Maybe they don't know their SLIs, or even what an SLI is - that's alright!

Kubernetes is Still Hard for App Developers, Let’s Fix That!

Aaron Schlesinger - Microsoft

Kubernetes is all the rage these days, and for good reason. Among other benefits, app development teams get to use battle-hardened infrastructure to build and deploy containers, use modern tech and practices, and lower their cloud bill. But these days, the journey to Kubernetes is long and hard.
In this session, I’ll present some case studies that reveal the general needs of many app developers. Using these case studies, we’ll build a long list of concepts and technologies that folks need to learn before they can even think about deploying their apps on Kubernetes, and we’ll talk about the strategies (and hacks, and shortcuts!) that some teams have taken to get up and running faster.
From there, we’ll talk about some tools that help shorten that list, ease the transition, and make the day to day life of app developers easier on Kubernetes. We’ll wrap up with a holistic view of a team’s needs and how we can meet them with these tools.
The audience will leave with a deep understanding of what teams need to succeed on Kubernetes, some important tools they can use to meet those needs, and how they can use them today to realize the benefits that Kubernetes can bring to their apps right now.
Requirements: Attendees will need to understand cloud computing and some of the major services that cloud vendors offer (e.g. compute, storage, networking, databases). They’ll also need a high-level understanding of containers and container orchestrators.

Serverless Security: Attackers & Defenders

Ory Segal - PureSec

In cloud-native environments in general, and serverless in particular, the cloud provider is responsible for securing the underlying infrastructure, from the data centers all the way up to the container and runtime environment. This relieves the application owner of much of the security burden; however, it also poses many unique challenges when it comes to securing the application layer. In this presentation, we will discuss the most critical challenges related to securing serverless applications – from development to deployment. We will also walk through a live demo of a realistic serverless application that contains several common vulnerabilities, and see how attackers can exploit them and how to secure the application against them.
Key takeaways include:
1) Understand application security challenges for serverless architectures
2) Learn about the key risks and developer mistakes for serverless applications
3) See how an attacker approaches serverless apps, and exploits weaknesses
4) Learn how to protect and defend your serverless code
5) Learn about open source tools that can help

Kubernetes for Stateful MPP systems

Paige Roberts / Deepak Majeti - Vertica

Containers de-couple applications from the underlying infrastructure. With the advent of low-cost public infrastructure providers such as Amazon, Google, etc., many applications are now being modified to run inside containers to enable simpler and faster deployment on any platform. Containers with the aid of deployment tools such as Kubernetes also enable applications to scale quickly on clouds.
De-coupling distributed databases from the underlying infrastructure would provide many benefits. You could run analytics on any hardware at scale, for instance. Kubernetes could also make recoverability on cloud deployments automatic, making applications far more resilient.
However, Kubernetes started out only supporting applications that could be decomposed into micro-services, which are independent and stateless.
Spikes in demand hit database users hard, and node failures can bog down whole clusters without proper recovery. GoodData, for example, saw that node failures on the cloud could affect their Vertica MPP database, which caused a reduction in customer satisfaction.
The Vertica R&D team set out to find a way to make failure handling seamless and node recovery automatic.
Kubernetes was the obvious choice, but Kubernetes is traditionally used for micro-services, not something like a stateful MPP database that might need hundreds of containers. In order to merge the power of an MPP analytics database with the flexibility of Kubernetes, a lot of hurdles had to be overcome.
In this talk, you will learn the challenges with networking, storage, and operational complexity we encountered while extending a stateful distributed database system to work with containers and Kubernetes. We will also describe one implementation, at GoodData, that overcomes these challenges and serves as a practical example of how this can work.
This presentation will explore some of the mistakes we made and the lessons we learned along the way, to save you from making the same mistakes when incorporating Kubernetes into your software architecture.

Base64 is not encryption - a better story for Kubernetes Secrets

Seth Vargo - Google

Secrets are a key pillar of Kubernetes’ security model, used internally (e.g. service accounts) and by users (e.g. API keys), but did you know they are stored in plaintext? That’s right, by default all Kubernetes secrets are base64 encoded and stored as plaintext in etcd. Anyone with access to the etcd cluster has access to all your Kubernetes secrets.
Thankfully there are better ways. This lecture provides an overview of different techniques for more securely managing secrets in Kubernetes including secrets encryption, KMS plugins, and tools like HashiCorp Vault. Attendees will learn the tradeoffs of each approach to make better decisions on how to secure their Kubernetes clusters.
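To make the title's point concrete, the short Python sketch below shows that base64 is a reversible encoding that needs no key. The secret value is made up for illustration, and the snippet uses only the standard library; it does not talk to Kubernetes or etcd.

```python
import base64

# Hypothetical secret value, encoded with plain base64 (no key involved).
stored = base64.b64encode(b"s3cr3t-api-key").decode("ascii")

# Anyone who can read the stored string can reverse it without any key.
recovered = base64.b64decode(stored).decode("ascii")

print(stored)
print(recovered)
```

Because decoding requires nothing but the encoded string, base64 offers no confidentiality on its own, which is why the techniques listed above add real encryption or an external secrets manager.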

Everything is a Little Bit Broken ~or~ The Illusion of Control

Heidi Waterhouse - LaunchDarkly

We never change the amount of work or technical debt, we just shift it, and with it, we change how it emerges and appears.
Our systems don’t have to be perfect to be operational – planes, networks, and elite athletes all function at extremely high levels even though they are not operating at 100%.
As an industry, we have moved the locus of control from hardware to operating system to virtual machine, to container, to orchestration, and now we are approaching serverless. None of that has reduced the amount of work that must happen; it just makes it possible to re-use and conceptually compress the work of others. Since we are making the work in our tools less visible, we also have less control over how they work. We end up assuming that the promises that have been true will continue to be true, but that is not in our control.
How do we handle this level of uncertainty? By adding in error budgets, layered access, and other accommodations for failure, and by designing our systems for function over form or purity.
The audience will leave with some concrete ideas about how to add resiliency to their system by learning to trust but mitigate their reliance on perfect performance of their underlying tools.
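For readers unfamiliar with error budgets, the arithmetic can be sketched as follows. This is a generic illustration rather than material from the talk: the function names and the request-based definition of the budget are assumptions.

```python
def error_budget(slo, total_requests):
    """Number of requests that may fail while still meeting the SLO.

    slo is a fraction, e.g. 0.999 for "three nines" (assumed convention).
    """
    return (1 - slo) * total_requests


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo, total_requests)
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1 - failed_requests / budget
```

With an SLO of 0.999 over one million requests, roughly a thousand requests may fail; after 250 failures about three quarters of the budget is left. The point is that the target is deliberately below 100%, so some failure is planned for rather than treated as an exception.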