Confirmed Sessions

Take advantage of our discount room block at the conference hotel.

Opening Keynote
How to Lose a Launch

Heidi Waterhouse

Many of us have experienced the feeling of gritty eyes, dark fatigue, and late-night anxiety as we try to get a product or feature out the door. Most of the time, it works, and once in a while, it doesn't. What makes a launch fail, or fizzle, or cost you both literal and metaphorical sleep?
A disaster is a collection of accumulating failures. In this heavily fictionalized talk, I'm going to discuss what I've learned about how to make launches go wrong, in order to help you understand how to make your projects go better. We'll talk about checklists, circuit breakers, and progressive deployment as ways to keep your failures from growing into disasters. You'll leave with a better understanding of how to design for rapid recovery and failure avoidance. Launching feels like the most perilous part of our software lifecycle, but as with any other dangerous activity, understanding it precisely helps us make it less risky.

Kubernetes Keynote
Advanced kernel security in Kubernetes with eBPF

Kris Nova

With the endless layers of abstraction involved in running a containerized application in a Kubernetes cluster, how does one approach keeping the cluster secure?
In this presentation we look at how taking the most fundamental component of a Linux system (syscall data) and bubbling it up to an API-driven application in userland has given us powerful visibility and control over the complexity of a Kubernetes cluster.
We learn about eBPF and how it enables us to safely and securely capture this information. Furthermore, we look at how we are able to parse this data at runtime.
The demo
1) We write a C program to use clone(2) to create a new child process using various namespaces in the kernel. We then use cgroups to restrict the resources for our new process. The audience watches a “container” being created from thin air using nothing but Linux and C.
2) We then look at writing an eBPF program while exploring the interface. The new program will parse the syscall information from memory.
3) We then tweak the “container” to do something malicious on the system, and watch as we are able to see the malicious syscall information.
The audience walks away understanding how these primitives work, and how tools like Falco empower us to have never before seen control over a Linux system.
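The “container from thin air” in the demo above is built from kernel namespaces and cgroups. As a rough sketch of the first of those primitives (this is illustrative Python, not the speaker's C demo, and it assumes a Linux `/proc` filesystem), you can inspect the namespaces any process already belongs to:

```python
import os

def list_namespaces(pid="self"):
    """Return the kernel namespaces a process belongs to (Linux only).

    Each entry under /proc/<pid>/ns is a symlink that reads like
    'pid:[4026531836]'; two processes sharing a namespace see the
    same inode number. On non-Linux systems this returns {}.
    """
    ns_dir = f"/proc/{pid}/ns"
    if not os.path.isdir(ns_dir):
        return {}
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}

if __name__ == "__main__":
    # A containerized process would show different inode numbers here
    # than its host for the namespaces it was cloned into.
    for name, ident in list_namespaces().items():
        print(f"{name:8s} {ident}")
```

A program like the demo's would call clone(2) with flags such as CLONE_NEWPID and CLONE_NEWNS so that the child's entries here differ from the parent's.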

Falco Workshop with Kris Nova

While she is in Austin, Kris Nova has graciously agreed to offer a 2 hour Falco Workshop at the Scalability Summit. The workshop will be in the afternoon, and the bar will be open. We will set the room up classroom style, so that you will have a place to set your laptop - and your beer.
Falco is an open source project for intrusion and abnormality detection for Cloud Native platforms such as Kubernetes, Mesosphere, and Cloud Foundry. It detects abnormal application behavior and can alert via Slack, Fluentd, NATS, and more. It allows you to protect your platform by taking action through serverless (FaaS) frameworks or other automation. The Falco project was hatched to understand container behavior and protect your platform from possible malicious activity. Leveraging Sysdig’s open source Linux kernel instrumentation, Falco gains deep insight into system behavior. The rules engine can then detect abnormal activity in applications, containers, the underlying host, and the container platform.

Who guards the guardians? Designing for resilience in cluster orchestrators

Preetha Appan - HashiCorp

Cluster orchestrators enable reliable and repeatable application deploys and provide fault tolerance without operator intervention. These orchestrators are themselves complex distributed systems like the applications they manage. The blast radius when a cluster orchestrator fails is huge; it could take down all your applications. Designing resilience into the orchestrator is a unique challenge given its critical operational nature.
Preetha Appan outlines various failure modes ranging from network failures to entire server failures in Nomad, an open source scheduler that supports heterogeneous workloads. You’ll discover how building graceful degradation and resilience to address these failures involves looking at the problem as a trade-off between three system features: correctness, performance, and availability. Along the way, Preetha shares examples of design decisions that impact the availability of applications managed by the scheduler and lessons learned that apply to building any complex distributed system.

Visiting from out of town? Book a room at the conference hotel, send us your confirmation code, and we'll send you a discount link for a $100 ticket.

Isolate Computing

Zack Bloom - Cloudflare

For forty years computation has been built around the idea of a process as the fundamental abstraction of a piece of code to be executed. In that time, how we write code has changed dramatically, culminating with serverless, but the nature of a process has not.
Processes unfortunately incur a context-switching overhead as the operating system moves the processor from executing one serverless container to another, wasting CPU cycles. Processes can also do I/O and other critical tasks only by firing interrupts into the kernel, which can waste as much as 33% of the execution time of an I/O-bound function. Processes also incur startup time as heavyweight virtual machines like NodeJS are initialized, which we experience in the serverless world as a cold start. The fear of cold starts requires us to do complex work to warm serverless functions, and requires even infrequently used functions to consume precious memory to avoid them.
There may be an alternative. Web browsers have solved the same problem, the need to run many instances of untrusted code with minimal overhead and start new code execution lightning-fast, in an entirely different way. They run a single virtual machine, and encapsulate each piece of code not in a process, but in an ‘isolate’. These isolates can be started in 5 milliseconds, 100 times faster than a Node Lambda serverless function. They also consume 1/10 the memory.
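The startup gap described above is easy to feel for yourself. This back-of-the-envelope sketch (illustrative only, not Cloudflare's implementation) contrasts paying full process and runtime startup cost for each piece of work with running it inside an already-warm runtime, which is the essence of the isolate model:

```python
import subprocess
import sys
import time

def run_in_new_process():
    # Spawn a fresh interpreter just to evaluate one expression:
    # pays process creation plus runtime startup cost every time,
    # analogous to a cold-started process-per-request function.
    subprocess.run([sys.executable, "-c", "print(2 + 2)"],
                   capture_output=True, check=True)

def run_in_same_process():
    # The isolate analogy: same runtime already warm, no new process.
    return 2 + 2

start = time.perf_counter()
run_in_new_process()
spawn_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
run_in_same_process()
inproc_ms = (time.perf_counter() - start) * 1000

print(f"new process: {spawn_ms:.2f} ms, in-process: {inproc_ms:.4f} ms")
```

The exact numbers vary by machine, but the new-process path is reliably orders of magnitude slower, which is the overhead isolates are designed to avoid.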
Beyond serverless, being able to initiate execution of server-side code in less time than it takes for a web request to connect opens up dramatic possibilities. Services can be scaled to millions of requests per second instantaneously. They can be deployed to hundreds of locations around the world with the same economics as deploying to just one. Even better, eliminating process-related overhead brings us close to the economics of running on bare metal, but with the ergonomics of serverless programming.
Attendees will leave this presentation with an understanding of where Isolate-based serverless might be more appropriate than other forms of compute. In those situations, they will be able to deploy code which can be affordably run close to every Internet visitor, which can autoscale instantaneously, and which can be as much as three times less expensive per CPU cycle than container-based serverless systems.

Kubernetes in Production - A Customer’s Journey

Bill Plein - Diamanti

Digital Transformation has many challenges. Enterprises can fall into the trap of reusing existing IT investments in order to reduce up-front costs and training.
Bill Plein will share a successful IT transformation case study that examined the use of infrastructure, tools, and processes to enable a hybrid, modern and open Kubernetes platform for legacy and greenfield applications.

Building Maintainable, Observable Applications on Multi Cloud Serverless Architecture

Park Kittipatkul - SignalFx

Serverless computing has a number of obvious benefits over traditional application infrastructure – you pay only for what you use, scale up or down immediately to match supply with demand, and avoid operating any server infrastructure at all.
However, implementing maintainable and scalable applications using serverless computing services like AWS Lambda or Google Cloud Functions poses a number of challenges. The absence of long-lived, user-managed servers means that state cannot be maintained by the service. Longer function startup times (referred to as cold starts) become very important to track, because they impact the response time of the service and can impose additional cost. Additionally, the transition to smaller individual components (much like breaking a monolithic application into microservices) results in a simpler deployment model, but makes the system as a whole increasingly complex.
With increasing adoption of multiple cloud providers for redundancy and resiliency, serverless architectures now need to be adjusted to work seamlessly across various providers.
In this talk, the speaker will discuss patterns and best practices around architecting and implementing code in serverless environments, specifically around how to build maintainable serverless code and minimize the occurrence of cold starts. Additionally, the speaker will cover how to properly instrument applications and supporting services so that your systems remain easily observable. Finally, the speaker will cover considerations and techniques for building and maintaining multi-cloud serverless services.
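One widely used pattern for softening cold starts, alluded to in the abstract above, is to hoist expensive initialization out of the per-request handler so it runs once per runtime instance and is reused by warm invocations. A hedged sketch (the names `handler` and `get_connection` are illustrative, not a specific provider's API):

```python
import time

_connection = None  # created once per container/runtime instance

def get_connection():
    """Lazily create an expensive resource and cache it for reuse.

    In a real function this might be a database pool or HTTP session;
    here a short sleep stands in for the setup cost.
    """
    global _connection
    if _connection is None:
        time.sleep(0.05)  # stand-in for expensive initialization
        _connection = {"created_at": time.time()}
    return _connection

def handler(event, context=None):
    # Warm invocations skip the init cost entirely.
    conn = get_connection()
    return {"status": 200, "conn_created_at": conn["created_at"]}

# The first call pays the init cost (the "cold start" portion);
# subsequent calls on the same instance reuse the cached resource.
first = handler({})
second = handler({})
```

Because `conn_created_at` is identical across the two calls, you can verify the resource was built exactly once. The same structure works on AWS Lambda or Google Cloud Functions by placing the cached object at module scope.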

How I failed to build a runbook automation system and what I learned

Tim Bonci - Cimpress

Our intentions can be good, the technical ability and time may be there, and we set out to build the thing that makes work easier and more productive, allowing everyone to apply their labor to only the most valuable tasks – yet sometimes it’s still not enough. This is a post-mortem of a solution that was designed to solve a common operational problem but failed. I’ll talk about the scars and hopefully provide insights into finding and addressing the right problems in the right places, insights that should be broadly useful in building and deploying your own transformational processes and tools.
This should be particularly relevant to brownfields teams looking for ways to modernize their processes and anyone who struggles with needing humans to change how they work.
We’ll go over why shifting human processes to computer automation does not always produce the expected results and how treating non-urgent alerts as a work queue is an antipattern.

Scaling SRE organizations: The journey from 1 to many teams

Gustavo Franco - Google

In this talk, Gustavo Franco of Google will share his experience starting new teams and splitting and moving them, from both technical and non-technical standpoints. It is ideal for new leaders in charge of SRE wondering when it’s time to grow beyond a single team and how to do it. It is also very valuable for SREs who want to know what happens behind the scenes, how to influence such changes, and how they can help while avoiding burnout.

Security in the FaaS lane

Karthik Gaekwad - Oracle

Security in FaaS isn’t what you’re used to. With enterprises quickly moving to serverless, there’s a need to address the topic of security.
Karthik Gaekwad shares what he’s learned in the application security sphere that still applies in the modern world of FaaS. Using lambhack, a sample vulnerable serverless application written in Go, Karthik explains how to think about security in the serverless world and details security strategies and pitfalls viewed through a serverless lens. You’ll leave with a solid understanding of how to approach security conversations about serverless applications in the enterprise.

Productionizing Deep Learning in Health Care

Graham Ganssle - Expero

Expero worked with Kinsa Health to design and develop a deep learning system to forecast the spread of illness throughout the United States. The system is based on a geospatially linked mesh of recurrent neural networks, and is accurate beyond the capability of the CDC. That’s wonderful, but no one cares if you can’t deliver the results to users. Join us while we discuss the design process, initial deployment, failure of that deployment, re-architecture, and re-deployment of the system at production scale. Using worst (then best) practices in MLOps, we built a system to operate for all national application users.
The initial failure was based on a common architecture using a task management system which deployed containers onto VMs at runtime. We then pivoted to a fully hardware-managed solution from a major cloud provider to handle scalable deployment and elastic load. After training, we deployed the model using Kubernetes to handle traffic across all regions in the United States. In retrospect, each solution has its benefits and drawbacks; we’ll discuss them all and let the audience decide which is best for their situation.

Case Study: Comprehensive Scaling and Capacity Planning with k8s, Terraform, Ansible and Gatling

Brian Hall - Expero

Now that shared-nothing NoSQL solutions are widely available in the marketplace, deploying clusters of most sizes is both practical and necessary. But how quickly do your clusters need to expand (or contract), and how can you be sure that you’ve properly planned and budgeted for that deluge of traffic after your new product announcement?
In this case study, we’ll walk you through how we set up a dynamically scaling cluster management solution that could add or remove capacity as planned traffic demanded, all of it informed by a sophisticated test harness used to game out possible and probable scenarios.
Topics that will be covered are:
1. Deploying complex configuration dynamically within Ansible and Terraform scripts
2. Automating a fluctuating test harness to model spikes and plateaus within read and write traffic patterns with Kubernetes
3. Keeping a healthy ecosystem through monitoring and proactive maintenance
4. Push button provisioning for handling “Black Friday” type spikes in demand
Intended Audience: architects, DevOps, and technical support personnel tasked with keeping a NoSQL database cluster healthy and properly scaled to meet fluctuating demand
Skills and Concepts Required: audience members should understand shared-nothing big data architectures; cloud automation tools such as Kubernetes, Ansible, and Terraform; and load testing strategies

How to Scale your Customer Experience

Chris McCraw - Netlify

Do you wish your company's Support team was constantly bringing you User Stories and filing better bugs? This talk will instruct and demonstrate how to create a better environment for collaborative work across teams particularly as they grow in size and products grow in complexity. We'll cover topics including:
- helping your support team think like engineers. This leads to better escalations and feedback.
- developing an engineering relationship directly with your customers.
- developing a model for actionable feedback (instead of "this is broken", try "this could meet more use cases we've heard about if the behavior changed")
- developing a better escalation path for customer-facing issues
You'll end up with some thoughts and practices that will help your customer experience and Support team interactions scale as fast as your business.
Intended audience: Engineers/Engineering Managers or Product Managers who want to get "closer to the meat" (receive more direct and actionable customer feedback)

Service Mush: Debugging Istio Workloads

Sandeep Parikh - Google

Microservices often increase complexity and spread it across the organization. In this session, Sandeep will discuss how Istio can help. He will begin with a high-level overview of major components: Traffic, Telemetry, and Security. He will then discuss the challenges that Istio brings with respect to service growth, component scaling, and troubleshooting.
This will be followed with a series of demos:
- Traffic: Why is traffic not routing to my services as specified?
- Metrics: Where are my custom metrics, and why won't Mixer forward my metrics across?
- Security: Why isn't mTLS properly rolling out between my services?
- Tools: Which tools do I need in my toolbox to troubleshoot effectively?
This will be an in-depth and almost completely live debugging session, with just a few slides to open and close.
Skills and Concepts Required: Familiarity with service mesh concepts, Familiarity with Kubernetes primitives (Pods and Services)
Intended audience: Kubernetes cluster operators, Service mesh operators, SREs

Kick-starting a culture of observability and data-driven DevOps

Rajesh Raman - SignalFx

It’s widely recognized that monitoring is a critical aspect of operating a service, but the practice of observability is still relatively nascent in most organizations. While monitoring can indicate a problem, it’s only by making systems observable that teams can understand the behavior of complex systems, isolate causes, and effectively remediate incidents. The resulting insights provide value that extends far beyond the ability to successfully operate systems. Deeper knowledge of system behavior can inform and focus future development efforts, or quantify the value and effect of more recent work.
Rajesh Raman dives deep into the practice of observability, demonstrating how a more analytics-driven approach to metrics, traces, and other monitoring signals improves observability. You’ll learn a framework for kick-starting a culture of observability in your organization, informed by Rajesh’s experience building and deploying observability tools at SignalFx.

Schema Evolution Patterns

Alex Rasmussen - Bits on Disk

Everybody’s talking about microservices, but nobody seems to agree on how to make them talk to each other. How should you version your APIs, and how does API version deprecation actually work in practice? Do you use plain old JSON, Thrift, protocol buffers, GraphQL? How do teams communicate changes in their services’ interfaces, and how do consumer services respond?
Separately, nobody seems to agree on how to handle migrating a service’s structured data without downtime. Do you write to shadow tables? Chain new tables off the old ones? Just run the migration live and hope nothing bad happens? Switch everything over to NoSQL?
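The shadow-table strategy mentioned above can be sketched concretely. This is a minimal illustration using SQLite (a toy schema invented for the example, not from the talk): create the new-schema table alongside the old one, backfill it, then swap names so readers never see a half-migrated table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)",
                 [(1, "ada"), (2, "grace")])

# 1. Create a shadow table with the evolved schema (new email column).
conn.execute("""CREATE TABLE users_new (
    id INTEGER PRIMARY KEY, name TEXT, email TEXT DEFAULT '')""")

# 2. Backfill the shadow table from the old one; the new column
#    picks up its default value.
conn.execute("INSERT INTO users_new (id, name) SELECT id, name FROM users")

# (In production, writes are mirrored to both tables until the
#  backfill catches up; that dual-write step is elided here.)

# 3. Swap the tables, keeping the old one around as a fallback.
conn.execute("ALTER TABLE users RENAME TO users_old")
conn.execute("ALTER TABLE users_new RENAME TO users")

rows = conn.execute("SELECT id, name, email FROM users ORDER BY id").fetchall()
print(rows)  # -> [(1, 'ada', ''), (2, 'grace', '')]
```

The tradeoff, as the talk suggests, is extra storage and write amplification during the migration in exchange for zero read downtime and an easy rollback path (`users_old` is still intact).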
Both these problems are instances of issues with schema evolution: what happens when the structure of your structured data changes. In this talk, rather than taking a prescriptive approach, I’ll try to distill a lot of institutional knowledge and computer science history into a set of patterns and examine the tradeoffs between them.

Kubernetes is Still Hard for App Developers, Let’s Fix That!

Aaron Schlesinger - Microsoft

Kubernetes is all the rage these days, and for good reason. Among other benefits, app development teams get to use battle-hardened infrastructure to build and deploy containers, use modern tech and practices, and lower their cloud bill. But these days, the journey to Kubernetes is long and hard.
In this session, I’ll present some case studies that reveal the general needs of many app developers. Using these case studies, we’ll build a long list of concepts and technologies that folks need to learn before they can even think about deploying their apps on Kubernetes, and we’ll talk about the strategies (and hacks, and shortcuts!) that some teams have taken to get up and running faster.
From there, we’ll talk about some tools that help shorten that list, ease the transition, and make the day to day life of app developers easier on Kubernetes. We’ll wrap up with a holistic view of a team’s needs and how we can meet them with these tools.
The audience will leave with a deep understanding of what teams need to succeed on Kubernetes, some important tools they can use to meet those needs, and how they can use them today to realize the benefits that Kubernetes can bring to their apps right now.
Requirements: Attendees will need to understand cloud computing and some of the major services that cloud vendors offer (e.g. compute, storage, networking, databases). They’ll also need a high level understanding of containers and container orchestrators.

Kubernetes for Stateful MPP systems

Paige Roberts / Deepak Majeti - Vertica

Containers decouple applications from the underlying infrastructure. With the advent of low-cost public infrastructure providers such as Amazon, Google, etc., many applications are now being modified to run inside containers to enable simpler and faster deployment on any platform. With the aid of deployment tools such as Kubernetes, containers also enable applications to scale quickly on clouds.
Decoupling distributed databases from the underlying infrastructure would provide many benefits. You could run analytics on any hardware at scale, for instance. K8s could also make recoverability on cloud deployments automatic, making applications far more resilient.
However, Kubernetes started out only supporting applications that could be decomposed into micro-services, which are independent and stateless.
Spikes in demand hit database users hard, and node failures can bog down whole clusters without proper recovery. GoodData, for example, saw that node failures on the cloud could affect their Vertica MPP database, which caused a reduction in customer satisfaction.
The Vertica R&D team set out to find a way to make failure handling seamless and node recovery automatic.
Kubernetes was the obvious choice, but K8s is traditionally used for micro-services, not something like a stateful MPP database that might need hundreds of containers. In order to merge the power of an MPP analytics database with the flexibility of Kubernetes, a lot of hurdles had to be overcome.
In this talk, you will learn about the challenges with networking, storage, and operational complexity we encountered while extending a stateful distributed database system to work with containers and Kubernetes. We will also describe one implementation, at GoodData, that overcomes these challenges and serves as a practical example of how this can work.
This presentation will explore some of the mistakes we made, and lessons we learned along the way to save you from having to make the same mistakes when incorporating Kubernetes into your software architecture.

Base64 is not encryption - a better story for Kubernetes Secrets

Seth Vargo - Google

Secrets are a key pillar of Kubernetes' security model, used internally (e.g. service accounts) and by users (e.g. API keys), but did you know they are stored in plaintext? That's right, by default all Kubernetes secrets are base64-encoded and stored as plaintext in etcd. Anyone with access to the etcd cluster has access to all of your Kubernetes secrets.
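The talk's title is easy to demonstrate. Base64 is a reversible encoding, not encryption: recovering the original value requires no key at all. A quick sketch (the secret value here is made up for illustration):

```python
import base64

# A value as it might appear in a Kubernetes Secret manifest
# (this particular string is a fabricated example).
stored = "cGFzc3dvcmQxMjM="

# "Decrypting" it takes one function call and zero keys --
# anyone who can read etcd can read the secret.
plaintext = base64.b64decode(stored).decode()
print(plaintext)  # -> password123
```

This is exactly why the techniques covered below (envelope encryption at rest, KMS plugins, external secret stores like Vault) exist: the encoding protects against nothing but accidental shoulder-surfing.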
Thankfully there are better ways. This lecture provides an overview of different techniques for more securely managing secrets in Kubernetes including secrets encryption, KMS plugins, and tools like HashiCorp Vault. Attendees will learn the tradeoffs of each approach to make better decisions on how to secure their Kubernetes clusters.
This lecture and discussion outlines the current state of Kubernetes' security with respect to managing and securing Kubernetes Secrets.
First, attendees will learn the current state of the world: a default Kubernetes cluster has secrets pretty widely exposed. We will talk briefly about how some cloud providers add additional layers of security, but the default is insecure.
Next, attendees will learn about features released in Kubernetes 1.7 to allow for application-layer encryption of secrets. We will discuss the pros and cons of this approach, and walk through some code and a live demo of it working in action.
Next, attendees will learn about features released in Kubernetes 1.10 to allow for delegated application-layer encryption of secrets to a KMS provider. We will again discuss the pros and cons of this approach, show some code and live demos.
Finally, attendees will see an example of full secrets management on Kubernetes, using the open source HashiCorp Vault tool.
Importantly, this talk is not a general security talk - it is specific to Kubernetes secrets. In particular, there are no plans to discuss cluster-level security, firewall security, ACLs, or RBAC.