Confirmed Sessions

We continue to confirm and add sessions. Check back for updates.

Kubernetes Keynote
Advanced kernel security in Kubernetes with eBPF

Kris Nova

With the endless layers of abstraction running a containerized application in a Kubernetes cluster, how does one approach keeping the cluster secure?
In this presentation we look at how taking the most fundamental component of a Linux system (syscall data) and bubbling that up to an API-driven application in Userland has given us powerful visibility and control over the complexity of a Kubernetes cluster.
We learn about the eBPF protocol, and how it enables us to safely and securely parse this information. Furthermore we look at how we are able to parse this data at runtime.
The demo
1) We write a C program to use clone(2) to create a new child process using various namespaces in the Kernel. We then use cgroups to restrict the resources for our new process. The audience watches a “container” being created from thin air using nothing but Linux and C.
2) We then look at writing an eBPF program while exploring the interface. The new program will parse the syscall information from memory.
3) We then tweak the “container” to do something malicious on the system, and watch as we are able to see the malicious syscall information.
The audience walks away understanding how these primitives work, and how tools like Falco empower us to have never before seen control over a Linux system.

Falco Workshop with Kris Nova

While she is in Austin, Kris Nova has graciously agreed to offer a 2-hour Falco Workshop at the Scalability Summit. The workshop will be in the afternoon, and the bar will be open. We will set the room up classroom style, so that you will have a place to set your laptop - and your beer.
Falco ( https://falco.org/ ) is an open source project for intrusion and abnormality detection for Cloud Native platforms such as Kubernetes, Mesosphere, and Cloud Foundry. It detects abnormal application behavior and can alert via Slack, Fluentd, NATS, and more. It allows you to protect your platform by taking action through serverless (FaaS) frameworks, or other automation. The Falco project was hatched to understand container behavior and protect your platform from possible malicious activity. Leveraging Sysdig’s open source Linux kernel instrumentation, Falco gains deep insight into system behavior. The rules engine can then detect abnormal activity in applications, containers, the underlying host, and the container platform.

Who guards the guardians? Designing for resilience in cluster orchestrators

Preetha Appan - HashiCorp

Cluster orchestrators enable reliable and repeatable application deploys and provide fault tolerance without operator intervention. These orchestrators are themselves complex distributed systems like the applications they manage. The blast radius when a cluster orchestrator fails is huge; it could take down all your applications. Designing resilience into the orchestrator is a unique challenge given its critical operational nature.
Preetha Appan outlines various failure modes ranging from network failures to entire server failures in Nomad, an open source scheduler that supports heterogeneous workloads. You’ll discover how building graceful degradation and resilience to address these failures involves looking at the problem as a trade-off between three system features: correctness, performance, and availability. Along the way, Preetha shares examples of design decisions that impact the availability of applications managed by the scheduler and lessons learned that apply to building any complex distributed system.

Visiting from out of town? Book a room at the conference hotel, send us your confirmation code, and we'll send you a discount link for a $100 ticket.

Improving reliability of your distributed data store

Mehant Baid - Dropbox

Edgestore is a low latency, distributed data store that is one of the largest services developed at Dropbox. It serves 10 million requests per second and stores over 10 trillion objects at rest. In the last few years Edgestore has grown from being used by a handful of services to being the primary data store for all of Dropbox’s metadata needs.
This talk will discuss the challenges we faced as we scaled Edgestore and our journey from being an operationally burdensome service that’s plagued with incidents and postmortems to a service that’s highly reliable and operationally lightweight.

Isolate Computing

Zack Bloom - Cloudflare

For forty years computation has been built around the idea of a process as the fundamental abstraction of a piece of code to be executed. In that time, how we write code has changed dramatically, culminating with serverless, but the nature of a process has not.
Processes unfortunately incur a context-switching overhead as the operating system moves the processor from executing one serverless container to another, wasting CPU cycles. Processes also can only do IO and other critical tasks by firing interrupts into the kernel, which waste as much as 33% of the execution time of an IO-bound function. Processes also incur startup time as heavyweight virtual machines like NodeJS are initialized, which we experience in the serverless world as a cold start. The fear of cold starts requires us to do complex work to warm serverless functions, and requires even infrequently used functions to consume precious memory to avoid them.
There may be an alternative. Web browsers have solved the same problem, the need to run many instances of untrusted code with minimal overhead and start new code execution lightning-fast, in an entirely different way. They run a single virtual machine, and encapsulate each piece of code not in a process, but in an ‘isolate’. These isolates can be started in 5 milliseconds, 100 times faster than a Node Lambda serverless function. They also consume 1/10 the memory.
Beyond serverless, being able to initiate execution of server-side code in less time than it takes for a web request to connect opens dramatic possibilities. Services can be scaled to millions of requests per second instantaneously. They can be deployed to hundreds of locations around the world with the same economics as deploying to just one. Even better, by eliminating process-related overhead, it brings us close to the economics of running on bare metal, but with the ergonomics of serverless programming.
Leaving this presentation, attendees will have an understanding of where Isolate-based serverless might be more appropriate than other forms of compute. In those situations, they will be able to deploy code which can be affordably run close to every Internet visitor, which can autoscale instantaneously, and which can be as much as three times less expensive than container-based serverless systems per CPU-cycle.

How I failed to build a runbook automation system and what I learned

Tim Bonci - Cimpress

Our intentions can be good and the technical ability and time may be there and we’re going to build the thing to make work easier and more productive, allowing everyone to apply their labor to only the most valuable tasks – yet sometimes it’s still not enough. This is a post-mortem of a solution that was designed to solve a common operational problem but failed. I’ll talk about the scars and hopefully provide insights into finding and addressing the right problems in the right places that should be broadly useful in building and deploying your own transformational processes and tools.
This should be particularly relevant to brownfields teams looking for ways to modernize their processes and anyone who struggles with needing humans to change how they work.
We’ll go over why shifting human processes to computer automation does not always produce the expected results and how treating non-urgent alerts as a work queue is an antipattern.

Enterprise transformation (and you can too)

Donovan Brown - Microsoft

“That would never work here.” You’ve likely heard this sentiment echo from your company’s conference rooms or board rooms (or maybe you’ve said it yourself). There are always reasons: established processes (with vested interests supporting them), legacy codebases and data centers (both with large install footprints), and scale (for some values of scale), to name just a few.
Good news: change is possible. Donovan Brown walks you through a case study from Microsoft’s Visual Studio Team Services (VSTS). VSTS went from a three-year waterfall delivery cycle to three-week iterations and open sourced the VSTS task library and the Git Virtual File System (GVFS). To make these changes, the team had to question its tool choices, change its processes, and empower its people. You’ll learn why integration of cross-functional teams is key to the continuous delivery of value to end users.

Scaling SRE organizations: The journey from 1 to many teams

Gustavo Franco - Google

In this talk, Gustavo Franco of Google will share his experience starting new teams, and splitting and moving them, from both technical and non-technical standpoints. This is ideal for new leaders in charge of SRE wondering when it’s time to grow beyond a single team and how to do it. This is also very valuable for SREs who are interested in knowing what happens behind the scenes, how to influence such changes and how they can help while avoiding burnout.

Security in the FaaS lane

Karthik Gaekwad - Oracle

Security in FaaS isn’t what you’re used to. With enterprises quickly moving to serverless, there’s a need to address the topic of security.
Karthik Gaekwad shares what he’s learned in the application security sphere that still applies in the modern world of FaaS. Using lambhack, a sample vulnerable serverless application written in Go, Karthik explains how to think about security in the serverless world and details security strategies and pitfalls viewed through a serverless lens. You’ll leave with a solid understanding of how to approach security conversations about serverless applications in the enterprise.

Productionizing Deep Learning in Health Care

Graham Ganssle - Expero

Expero worked with Kinsa Health to design and develop a deep learning system to forecast the spread of illness throughout the United States. The system is based on a geospatially linked mesh of recurrent neural networks, and is accurate beyond the capability of the CDC. That’s wonderful, but no one cares if you can’t deliver the results to users. Join us while we discuss the design process, initial deployment, failure of that deployment, re-architecture, and re-deployment of the system at production scale. Using worst (then best) practices in MLOps, we built a system to operate for all national application users.
The initial failure was based on a common architecture using a task management system which deployed containers onto VMs at runtime. We then pivoted towards using a fully hardware-managed solution by a major cloud provider to handle scalable deployment and elastic load. After training, we deployed the model using Kubernetes to handle traffic across all regions in the United States. In retrospect, each solution has its benefits and detriments; we’ll discuss them all, and let the audience decide which is best for their solution.

Case Study: Comprehensive Scaling and Capacity Planning with k8s, Terraform, Ansible and Gatling

Brian Hall - Expero

Now that shared-nothing NoSQL solutions are widely available in the marketplace, deploying clusters of most sizes is both practical and necessary. But how quickly do your clusters need to expand (or contract), and how can you be sure that you’ve properly planned and budgeted for that deluge of traffic after your new product announcement?
In this case study, we’ll walk you through how we set up a dynamically scaling cluster management solution that could add or remove capacity as planned traffic demands. All of this was informed by a sophisticated test harness used to game out possible and probable scenarios.
Topics that will be covered are:
1. Deploying complex configuration dynamically within Ansible and Terraform scripts
2. Automating a fluctuating test harness to model spikes and plateaus within read and write traffic patterns with Kubernetes
3. Keeping a healthy ecosystem through monitoring and proactive maintenance
4. Push button provisioning for handling “Black Friday” type spikes in demand
Intended Audience: architects, devops and technical support personnel tasked with keeping a NoSQL database cluster healthy and properly scaled to meet fluctuating demand
Skills and Concepts Required: audience members should understand shared-nothing big data architectures, cloud automation scripting tools such as Kubernetes, Ansible and Terraform and load testing strategies

How to Scale your Customer Experience

Chris McCraw - Netlify

Do you wish your company's Support team was constantly bringing you User Stories and filing better bugs? This talk will instruct and demonstrate how to create a better environment for collaborative work across teams particularly as they grow in size and products grow in complexity. We'll cover topics including:
- helping your support team think like engineers. This leads to better escalations and feedback.
- developing an engineering relationship directly with your customers.
- working to develop a model of actionable feedback (instead of "this is broken", "this could meet more use cases we've heard about if behavior changed")
- developing a better escalation path for customer-facing issues
You'll end up with some thoughts and practices that will help your customer experience and Support team interactions scale as fast as your business.
Intended audience: Engineers/Engineering Managers or Product Managers who want to get "closer to the meat" (receive more direct and actionable customer feedback)

Service Mush: Debugging Istio Workloads

Sandeep Parikh - Google

Microservices often increase complexity and spread it across the organization. In this session, Sandeep will discuss how Istio can help. He will begin with a high level overview of major components: Traffic, Telemetry, and Security. He will then discuss the challenges that Istio brings with respect to service growth, component scaling, and troubleshooting.
This will be followed with a series of demos:
- Traffic: Why is traffic not routing to my services as specified?
- Metrics: Where are my custom metrics, and why won't Mixer forward my metrics across?
- Security: Why isn't mTLS properly rolling out between my services?
- Tools: Which tools do I need in my toolbox to troubleshoot effectively?
This will be an in-depth and almost completely live debugging session with just a few slides to open and close.
Skills and Concepts Required: Familiarity with service mesh concepts, Familiarity with Kubernetes primitives (Pods and Services)
Intended audience: Kubernetes cluster operators, Service mesh operators, SREs

Kick-starting a culture of observability and data-driven DevOps

Rajesh Raman - SignalFx

It’s widely recognized that monitoring is a critical aspect of operating a service, but the practice of observability is still relatively nascent in most organizations. While monitoring can indicate a problem, it’s only by making systems observable that teams can understand the behavior of complex systems, isolate causes, and effectively remediate incidents. The resulting insights provide value that extends far beyond the ability to successfully operate systems. Deeper knowledge of system behavior can inform and focus future development efforts, or quantify the value and effect of more recent work.
Rajesh Raman dives deep into the practice of observability, demonstrating how a more analytics-driven approach to metrics, traces, and other monitoring signals improves observability. You’ll learn a framework for kick-starting a culture of observability in your organization, informed by Rajesh’s experience building and deploying observability tools at SignalFx.

Schema Evolution Patterns

Alex Rasmussen - Bits on Disk

Everybody’s talking about microservices, but nobody seems to agree on how to make them talk to each other. How should you version your APIs, and how does API version deprecation actually work in practice? Do you use plain old JSON, Thrift, protocol buffers, GraphQL? How do teams communicate changes in their services’ interfaces, and how do consumer services respond?
Separately, nobody seems to agree on how to handle migrating a service’s structured data without downtime. Do you write to shadow tables? Chain new tables off the old ones? Just run the migration live and hope nothing bad happens? Switch everything over to NoSQL?
Both these problems are instances of issues with schema evolution: what happens when the structure of your structured data changes. In this talk, rather than taking a prescriptive approach, I’ll try to distill a lot of institutional knowledge and computer science history into a set of patterns and examine the tradeoffs between them.
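As a small illustration of what one schema evolution pattern can look like, the sketch below shows a "tolerant reader": a consumer that fills in defaults for fields added in a later version and ignores fields it does not know. This is a generic example, not necessarily a pattern covered in the talk; the `User` record, its field names, and `parse_user` are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class User:
    # Hypothetical record: "name" existed in v1, "email" was added in v2.
    name: str
    email: str = ""  # default keeps v1 payloads readable by v2 code


def parse_user(payload: str) -> User:
    """Parse a JSON payload written by any schema version.

    Missing optional fields fall back to defaults (old writer, new reader),
    and unknown fields are ignored (new writer, old reader).
    """
    raw = json.loads(payload)
    return User(name=raw["name"], email=raw.get("email", ""))


# A v1 payload (no email) and a later payload with an extra, unknown field.
old_style = parse_user('{"name": "ada"}')
new_style = parse_user('{"name": "ada", "email": "ada@example.com", "plan": "pro"}')
print(old_style)
print(new_style)
```

The tradeoff is that a tolerant reader hides drift between services: a misspelled field name is silently dropped rather than rejected, so this approach is usually paired with some form of contract checking.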

Kubernetes is Still Hard for App Developers, Let’s Fix That!

Aaron Schlesinger - Microsoft

Kubernetes is all the rage these days, and for good reason. Among other benefits, app development teams get to use battle-hardened infrastructure to build and deploy containers, use modern tech and practices, and lower their cloud bill. But these days, the journey to Kubernetes is long and hard.
In this session, I’ll present some case studies that reveal the general needs of many app developers. Using these case studies, we’ll build a long list of concepts and technologies that folks need to learn before they can even think about deploying their apps on Kubernetes, and we’ll talk about the strategies (and hacks, and shortcuts!) that some teams have taken to get up and running faster.
From there, we’ll talk about some tools that help shorten that list, ease the transition, and make the day to day life of app developers easier on Kubernetes. We’ll wrap up with a holistic view of a team’s needs and how we can meet them with these tools.
The audience will leave with a deep understanding of what teams need to succeed on Kubernetes, some important tools they can use to meet those needs, and how they can use them today to realize the benefits that Kubernetes can bring to their apps right now.
Requirements: Attendees will need to understand cloud computing and some of the major services that cloud vendors offer (e.g. compute, storage, networking, databases). They’ll also need a high level understanding of containers and container orchestrators.

Serverless Security: Attackers & Defenders

Ory Segal - PureSec

In cloud-native environments in general, and serverless in particular, the cloud provider is responsible for securing the underlying infrastructure, from the data centers all the way up to the container and runtime environment. This relieves much of the security burden from the application owner, however it also poses many unique challenges when it comes to securing the application layer. In this presentation, we will discuss the most critical challenges related to securing serverless applications – from development to deployment. We will also walk through a live demo of a realistic serverless application that contains several common vulnerabilities, and see how they can be exploited by attackers, and how to secure them.
Key takeaways include:
1) Understand application security challenges for serverless architectures
2) Learn about the key risks and developer mistakes for serverless applications
3) See how an attacker approaches serverless apps, and exploits weaknesses
4) Learn how to protect and defend your serverless code
5) Learn about open source tools that can help

Kubernetes for Stateful MPP systems

Paige Roberts / Deepak Majeti - Vertica

Containers de-couple applications from the underlying infrastructure. With the advent of low-cost public infrastructure providers such as Amazon, Google, etc., many applications are now being modified to run inside containers to enable simpler and faster deployment on any platform. Containers with the aid of deployment tools such as Kubernetes also enable applications to scale quickly on clouds.
De-coupling distributed databases from the underlying infrastructure would provide many benefits. You could run analytics on any hardware at scale, for instance. Kubernetes could also make recoverability on cloud deployments automatic, making applications far more resilient.
However, Kubernetes started out only supporting applications that could be decomposed into micro-services, which are independent and stateless.
Spikes in demand hit database users hard, and node failures can bog down whole clusters without proper recovery. GoodData, for example, saw that node failures on the cloud could affect their Vertica MPP database, which caused a reduction in customer satisfaction.
The Vertica R & D team set out to find a way to make failure handling seamless and node recovery automatic.
Kubernetes was the obvious choice, but it is traditionally used for micro-services, not something like a stateful MPP database that might need hundreds of containers. In order to merge the power of an MPP analytics database with the flexibility of Kubernetes, a lot of hurdles had to be overcome.
In this talk, you will learn the challenges with networking, storage, and operational complexity we encountered while extending a stateful distributed database system to work with containers and Kubernetes. We will also describe one implementation, at GoodData, that overcomes these challenges and serves as a practical example of how this can work.
This presentation will explore some of the mistakes we made, and lessons we learned along the way to save you from having to make the same mistakes when incorporating Kubernetes into your software architecture.

Base64 is not encryption - a better story for Kubernetes Secrets

Seth Vargo - Google

Secrets are a key pillar of Kubernetes’ security model, used internally (e.g. service accounts) and by users (e.g. API keys), but did you know they are stored in plaintext? That’s right, by default all Kubernetes secrets are base64 encoded and stored as plaintext in etcd. Anyone with access to the etcd cluster has access to all your Kubernetes secrets.
Thankfully there are better ways. This lecture provides an overview of different techniques for more securely managing secrets in Kubernetes including secrets encryption, KMS plugins, and tools like HashiCorp Vault. Attendees will learn the tradeoffs of each approach to make better decisions on how to secure their Kubernetes clusters.
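To make the title's point concrete, here is a minimal sketch using only Python's standard `base64` module. It shows that a base64-encoded value can be reversed by anyone who can read it, with no key involved. The helper names and the sample value are hypothetical and are not taken from the talk or from Kubernetes itself.

```python
import base64


def encode_secret_value(plaintext: str) -> str:
    """Base64-encode a value, the way Secret data appears in a manifest."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Reverse the encoding. Note that no key or password is required."""
    return base64.b64decode(encoded).decode("utf-8")


stored = encode_secret_value("password")  # hypothetical secret value
print(stored)                       # cGFzc3dvcmQ=
print(decode_secret_value(stored))  # password
```

Because decoding needs nothing but the encoded string, base64 only protects against accidental display, which is why the abstract points to encryption at rest, KMS plugins, or an external secrets manager.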

Everything is a Little Bit Broken ~or~ The Illusion of Control

Heidi Waterhouse - LaunchDarkly

We never change the amount of work or technical debt, we just shift it, and with it, we change how it emerges and appears.
Our systems don’t have to be perfect to be operational – planes, networks, and elite athletes all function at extremely high levels even though they are not operating at 100%.
As an industry, we have moved the locus of control from hardware to operating system to virtual machine, to container, to orchestration, and now we are approaching serverless. None of that has reduced the amount of work that must happen, it just makes it possible to re-use and conceptually compress the work of others. Since we are making the work in our tools less visible, we also have less control over how they work. We end up assuming that the promises that have been true will continue to be true, but that is not in our control.
How do we handle this level of uncertainty? By adding in error budgets, layered access, and other accommodations for failure, and by designing our systems for function over form or purity.
The audience will leave with some concrete ideas about how to add resiliency to their system by learning to trust but mitigate their reliance on perfect performance of their underlying tools.
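For readers new to error budgets, the arithmetic behind one can be sketched in a few lines. This is a generic illustration under assumed numbers (a 99.9% availability target over a 30-day window), not figures or code from the talk; the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability target.

    slo is a fraction such as 0.999 (meaning 99.9%).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Budget left after the downtime already spent; negative means overspent."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


# Assumed example: a 99.9% target over 30 days allows about 43.2 minutes.
print(round(error_budget_minutes(0.999), 1))
print(round(budget_remaining(0.999, downtime_minutes=10.0), 1))
```

The useful property is that the budget is never zero unless the target is 100%, which matches the abstract's premise that systems can be operational without being perfect.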