Litestream is a standalone streaming replication tool for SQLite. It runs as a background process and safely replicates changes incrementally to another file or S3.
The goal of Landlock is to enable to restrict ambient rights (e.g. global filesystem access) for a set of processes. Because Landlock is a stackable LSM, it makes possible to create safe security sandboxes as new security layers in addition to the existing system-wide access-controls. This kind of sandbox is expected to help mitigate the security impact of bugs or unexpected/malicious behaviors in user space applications. Landlock empowers any process, including unprivileged ones, to securely restrict themselves
A library to do type-safe and low-boilerplate bit level serialization
We present a federated, asynchronous, memory-limited algorithm for online task scheduling across large-scale networks of hundreds of workers. This is achieved through recent advancements in federated edge computing that unlocks the ability to incrementally compute local model updates within each node separately. This local model is then used along with incoming data to generate a rejection signal which reflects the overall node responsiveness and if it is able to accept an incoming task without resulting in degraded performance. Through this innovation, we allow each node to execute scheduling decisions on whether to accept an incoming job independently based on the workload seen thus far. Further, using the aggregate of the iterates a global view of the system can be constructed, as needed, and could be used to produce a holistic perspective of the system. We complement our findings, by an empirical evaluation on a large-scale real-world dataset of traces from a virtualized production data center that shows, while using limited memory, that our algorithm exhibits state-of-the-art performance. Concretely, it is able to predict changes in the system responsiveness ahead of time based on the industry-standard CPU-Ready metric and, in turn, can lead to better scheduling decisions and overall utilization of the available resources. Finally, in the absence of communication latency, it exhibits attractive horizontal scalability.
As distributed systems grow in scale and complexity, the need for flexible automation of systems management functions also grows. We outline a framework for building tools that provide distributed, scalable, declarative, modular, and continuous automation for distributed systems. We focus on four points of design: 1) a state-management approach that prescribes source-of-truth for configured and discovered system states; 2) a technique to solve the declarative unification problem for a class of automation problems, providing state convergence and modularity; 3) an eventual-consistency approach to state synchronization which provides automation at scale; 4) an event-driven architecture that provides always-on state enforcement. We describe the methodology, software architecture for the framework, and constraints for these techniques to apply to an automation problem. We overview a reference application built on this framework that provides state-aware system provisioning and node lifecycle management, highlighting key advantages. We conclude with a discussion of current and future applications
Modern web applications heavily rely on in-memory key-value caches to deliver low-latency, high-throughput services. In-memory caches store small objects of size in the range of 10s to 1000s of bytes, and use TTLs widely for data freshness and implicit delete. Current solutions have relatively large per-object metadata and cannot remove expired objects promptly without incurring a high overhead. We present Segcache, which uses a segment-structured design that stores data in fixed-size segments with three key features: (1) it groups objects with similar creation and expiration time into the segments for efficient expiration and eviction, (2) it approximates some and lifts most per-object metadata into the shared segment header and shared information slot in the hash table for object metadata reduction, and (3) it performs segment-level bulk expiration and eviction with tiny critical sections for high scalability. Evaluation using production traces shows that Segcache uses 22-60% less memory than state-of-the-art designs for a variety of workloads. Segcache simultaneously delivers high throughput, up to 40% better than Memcached on a single thread. It exhibits close-to-linear scalability, providing a close to 8× speedup over Memcached with 24 threads.
Our take is that you should use systemd-resolved, and if you need user-friendly network configuration, a very recent version of NetworkManager (1.26.6 or better). This will give your distro state-of-the-art DNS capabilities, and make implementers of networking software much happier
Quorum systems are a powerful mechanism for ensuring the consistency of replicated data. Production systems usually opt for majority quorums due to their simplicity and fault tolerance, but majority quorum systems provide poor throughput and scalability. Alternatively, researchers have invented a number of theoretically "optimal" quorum systems, but the underlying theory ignores many practical complexities such as machine heterogeneity and workload skew. In this paper, we conduct a pragmatic re-examination of quorum systems. We enrich the current theory on quorum systems with a number of practical refinements to find quorum systems that provide higher throughput, lower latency, and lower network load. We also develop a library Quoracle that precisely quantifies the available trade-offs between quorum systems to empower engineers to find optimal quorum systems, given a set of objectives for specific deployments and workloads. Our tool is available online at: https://github.com/mwhittaker/quoracle.
Framework for evaluating (planet-scale) consensus protocols Tempo, Atlas, EPaxos, FPaxos, Caesar, Janus
Unikraft automatically builds extremely efficient and secure software. By tailoring the operating system, libraries and tools to the particular needs of your application, it vastly reduces virtual machine and container image sizes to a few KBs, drastically cutting down your software stack's attack surface
Modern web applications replicate their data across the globe and require strong consistency guarantees for their most critical data. These guarantees are usually provided via state-machine replication (SMR). Recent advances in SMR have focused on leaderless protocols, which improve the availability and performance of traditional Paxos-based solutions. We propose Tempo - a leaderless SMR protocol that, in comparison to prior solutions, achieves superior throughput and offers predictable performance even in contended workloads. To achieve these benefits, Tempo timestamps each application command and executes it only after the timestamp becomes stable, i.e., all commands with a lower timestamp are known. Both the timestamping and stability detection mechanisms are fully decentralized, thus obviating the need for a leader replica. Our protocol furthermore generalizes to partial replication settings, enabling scalability in highly parallel workloads. We evaluate the protocol in both real and simulated geo-distributed environments and demonstrate that it outperforms state-of-the-art alternatives.
Maelstrom is a workbench for learning distributed systems by writing your own. It uses the Jepsen testing library to test toy implementations of distributed systems. Maelstrom provides standardized tests for things like "a commutative set" or "a transactional key-value store", and lets you learn by writing implementations which those test suites can exercise.
[A] simple mechanism for detection of uninitialized reads in userspace with minimal performance overhead
Leaders make decisions that set context for further decisions. They enable something strategic, but also constrain and shape —but only as essential to system outcomes. And decisions that cross contexts or boundaries, need leadership —to help others understand the need and outcomes, and consequences, and what their role is in making the decision effective. Leaders communicate strategic intent and decisions, and foster organizational will and goodwill, to facilitate work towards coherent strategic outcomes.
We can easily build accurate monitoring for all traffic in only 40 lines of code for bpftrace.
A small object optimization for polymorphic types is a very useful construct so it important to understand how it can be done correctly.
In the course of trying to figure out how to smoothly zoom timelines of a billion trace events, I figured out a cool tree structure that I can’t find elsewhere online, which it turned out two of my friends have independently derived after not finding anything on their own searches. It’s a way of implementing an index for doing range aggregations on an array (e.g “find the sum/max of elements [7,12]”) in O(log N) time, with amortized constant time appends, a simple implementation (around 50 lines of Rust), low constant factors, and low memory overhead.
Recent years have seen Kubernetes emerge as a primary choice for container orchestration. Kubernetes largely targets the cloud environment but new use cases require performant, available and scalable orchestration at the edge. Kubernetes stores all cluster state in etcd, a strongly consistent key-value store. We find that at larger etcd cluster sizes, offering higher availability, write request latency significantly increases and throughput decreases similarly. Coupled with approximately 30% of Kubernetes requests being writes, this directly impacts the request latency and availability of Kubernetes, reducing its suitability for the edge. We revisit the requirement of strong consistency and propose an eventually consistent approach instead. This enables higher performance, availability and scalability whilst still supporting the broad needs of Kubernetes. This aims to make Kubernetes much more suitable for performance-critical, dynamically-scaled edge solutions.
Wuffs is a memory-safe programming language (and a standard library written in that language) for wrangling untrusted file formats safely. Wrangling includes parsing, decoding and encoding. Example file formats include images, audio, video, fonts and compressed archives
Over the years, platforms and application requirements change. As they do, technologies come, go, and return again as the preferred solution to certain system problems. In each of its incarnations, the technology’s details change but the principles remain the same. One such technology is distributed transactions on middle-tier servers. Here, we argue that after a 15-year decline, they need to return to the mainstream
This is a C++ class library for using the Single Instruction Multiple Data (SIMD) instructions to improve performance on modern microprocessors with the x86 or x86/64 instruction set on Windows, Linux, and Mac platforms. There are no plans to support ARM or other instruction sets.
When it comes to having durable data, there are four ways to do it: undo log, redo log, shadow copy and shadow data.
High-resolution timers and a few oddities in the way the x86_64 emulation on the M1 presents itself, that lead to some potential “gotchas”.
There are a lot of different application-layer protocols, and it seems they all end up evolving solutions to many of the same problems. This article seeks to identify the aspects of functionality commonly found in many application-layer protocol, and serves as a dictionary of those functions.
Cuckoo Index (CI) is a lightweight secondary index structure that represents the many-to-many relationship between keys and partitions of columns in a highly space-efficient way. At its core, CI associates variable-sized fingerprints in a Cuckoo filter with compressed bitmaps indicating qualifying partitions.
A clear picture of the hidden algorithm that runs for every function call at compile time.
EBPFSnitch is a Linux Application Level Firewall based on eBPF and NFQUEUE. It is inspired by OpenSnitch, and Douane, but utilizing modern kernel abstractions, without a kernel module. The eBPFSnitch daemon is implemented in C++ 20. The control interface is implemented in Python 3 utilizing Qt5
Smaller cells offer lower availability in the face of small numbers of uncorrelated node failures, but better availability when the proportion of node failure exceeds 50%. While such high failure rates are rare, they do happen in practice
CRAQ is a replication protocol that allows reads from any replica while still maintaining strong consistency. CRAQ should provide better read throughput than Raft and Paxos. Read performance grows linearly with the number of nodes added to the system. Network chatter is significantly lower compared to Raft and Paxos.
Serverless is an increasingly popular choice for service architects because it can provide elasticity and load-based billing with minimal developer effort. A common and important use case is to compose serverless functions and cloud storage into reliable workflows. However, existing solutions for authoring workflows provide a rudimentary experience compared to writing standard code in a modern programming language. Furthermore, executing workflows reliably in an elastic serverless environment poses significant performance challenges. To address these, we propose Durable Functions, a programming model for serverless workflows, and Netherite, a distributed execution engine to execute them efficiently. Workflows in Durable Functions are expressed as task-parallel code in a host language of choice. Internally, the workflows are translated to fine-grained stateful communicating processes, which are load-balanced over an elastic cluster. The main challenge is to minimize the cost of reliably persisting progress to storage while supporting elastic scale. Netherite solves this by introducing partitioning, recovery logs, asynchronous snapshots, and speculative communication. Our results show that Durable Functions simplifies the expression of complex workflows, and that Netherite achieves lower latency and higher throughput than the prevailing approaches for serverless workflows in Azure and AWS, by orders of magnitude in some cases.
Replicated Object Notation (RON) is a format for distributed live data. RON’s primary mission is continuous data synchronization. A RON object may naturally have any number of replicas, which may synchronize in real-time or intermittently. JSON, protobuf, and many other formats implicitly assume serialization of separate state snapshots. RON has versioning and addressing metadata, so state and updates can be always pieced together. RON handles state and updates all the same: state is change and change is state.
It is a good idea to equip yourself with at least a basic understanding of lockless primitives. Throughout this series I will describe what acquire and release semantics are really about, and present five relatively simple patterns that alone can cover most uses of the primitives.
Rasmussen’s migration model illustrates that small optimizations and adaptations can accumulate over time, taking the system far from its initial design parameters. If there is no counterweight to this “practical drift” from operations staff who are alert to the possibility and the dangers of the normalization of deviance, from the safety function, or from an effective regulator, systems are likely to drift towards catastrophe.
Fully managed, horizontally scalable, multitenant, persistent distributed priority queue built on top of sharded MySQL that enables developers at Facebook to decouple and scale microservices and distributed systems.
Vendor-neutral log-linear histograms for the compression, mergeability, and analysis of telemetry data
I often declare these kinds of wars on unneeded copies and allocations. Talking to the global allocator is not exactly cheap and making extra copies is just not elegant. Sometimes it means using a bump allocator, or even writing an extension for it. Sometimes it means extensive use of borrowing.
This technique optimizes the expressiveness of the call site at the expense of the one of the interface and implementation. Indeed, the interface needs naming and a comment to help clarify how to use it, and the implementation has more code to turn the parameters around.
Get started with your own BPF application quickly and painlessly with libbpf-bootstrap scaffolding, which takes care of all the mundane setup steps and lets you dive right into BPF fun and minimize the necessary boilerplate. We'll take a look at what libbpf-bootstrap provides and how everything is tied together.
Linux has a mechanism that can turn asynchronous signals into synchronous events that can be handled by epoll. It’s a great idea, really. If signals can be delivered via a file descriptor, much like data, then existing I/O multiplexing mechanisms like epoll can be used to handle signals much like we handle other file descriptors that represent local files or sockets. This is exactly the idea behind Linux’s signalfd(2) system call
This article will describe thread-local storage on ELF platforms in detail, and touch on other related topics, such as: thread-specific data keys and Windows/macOS TLS.
Here’s a problem. I have a set of keys and values. I also have some servers for a key-value store. This could be memcached, Redis, MySQL, whatever. I want to distribute the keys across the servers so…
We present a novel dynamic reconfiguration protocol for the MongoDB replication system that extends and generalizes the single server reconfiguration protocol of the Raft consensus algorithm. Our protocol decouples the processing of configuration changes from the main database operation log, which allows reconfigurations to proceed in cases when the main log is prevented from processing new operations. Additionally, this decoupling allows for configuration state to be managed by a logless replicated state machine, by optimizing away the explicit log and storing only the latest version of the configuration, avoiding the complexities of a log-based protocol. We provide a formal specification of the protocol along with results from automated verification of its safety properties. We also provide an experimental evaluation of the protocol benefits, showing how reconfigurations are able to quickly restore a system to healthy operation in scenarios where node failures have stalled the main operation log.
Libcsp is a high performance concurrency C library influenced by the CSP model.
Cosmopolitan Libc makes C a build-once run-anywhere language, like Java, except it doesn't need an interpreter or virtual machine. Instead, it reconfigures stock GCC and Clang to output a POSIX-approved polyglot format that runs natively on Linux + Mac + Windows + FreeBSD + OpenBSD + NetBSD + BIOS with the best possible performance and the tiniest footprint imaginable.
In this article, we discuss how AWS leverages idempotent API operations to mitigate some of the potential undesirable side effects of retrying in order to deliver a more robust service while still leveraging the benefits of retries to simplifying client-side code.
Exploring how buffer pool management works in databases by building one
We present Hemlock, a novel mutual exclusion locking algorithm that is extremely compact, requiring just one word per thread plus one word per lock, but which still provides local spinning in most circumstances, high throughput under contention, and low latency in the uncontended case. Hemlock is context-free -- not requiring any information to be passed from a lock operation to the corresponding unlock -- and FIFO. The performance of Hemlock is competitive with and often better than the best scalable spin locks.
Atkinson Hyperlegible font is named after Braille Institute founder, J. Robert Atkinson. What makes it different from traditional typography design is that it focuses on letterform distinction to increase character recognition, ultimately improving readability. We are making it free for anyone to use!
The Piecewise Geometric Model index (PGM-index) is a data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes while providing the same worst-case query time guarantees
Key-value stores are everywhere. They power a diverse set of data-driven applications across both industry and science.Key-value stores are used as stand-alone NoSQL systems but they are also used as a part of more complex pipelines and systems such as machine learning and relational systems.In this tutorial, we survey the state-of-the-art approaches on how the core storage engine of a key-value store system is designed. We focus on several critical components oft he engine, starting with the core data structures to lay out data across the memory hierarchy. We also discuss design issues related to caching, timestamps, concurrency control,updates, shifting workloads, as well as mixed workloads withboth analytical and transactional characteristics. We cover designs that are read-optimized, write-optimized as well as hybrids. We draw examples from several state-of-the-art systems but we also put everything together in a general framework which allows us to model storage engine designs under a single unified model and reason about the expected behavior of diverse designs. In addition, we show that given the vast number of possible storage engine designs and their complexity, there is a need to be able to describe and communicate design decisions at a high level descriptive language and we present a first version of such a language. We then use that framework to present several open challenges in the field especially in terms of supporting increasingly more diverse and dynamic applications in the era of data science and AI, including neural networks, graphs, and data versioning.
Guidelines for RMM Level 2 API design
Modern web services use in-memory caching extensively to increase throughput and reduce latency. There have been several workload analyses of production systems that have fueled research in improving the effectiveness of in-memory caching systems. However, the coverage is still sparse considering the wide spectrum of industrial cache use cases. In this work, we significantly further the understanding of real-world cache workloads by collecting production traces from 153 in-memory cache clusters at Twitter, sifting through over 80 TB of data, and sometimes interpreting the workloads in the context of the business logic behind them. We perform a comprehensive analysis to characterize cache workloads based on traffic pattern, time-to-live (TTL), popularity distribution, and size distribution. A fine-grained view of different workloads uncover the diversity of use cases: many are far more write-heavy or more skewed than previously shown and some display unique temporal patterns. We also observe that TTL is an important and sometimes defining parameter of cache working sets. Our simulations show that ideal replacement strategy in production caches can be surprising, for example, FIFO works the best for a large number of workloads.
A curated selection of distributed transactions protocols
LeanStore is a high-performance OLTP storage engine optimized for many-core CPUs and NVMe SSDs. Our goal is to achieve performance comparable to in-memory systems when the data set fits into RAM, while being able to fully exploit the bandwidth of fast NVMe SSDs for large data sets. While LeanStore is currently a research prototype, we hope to make it usable in production in the future.
Varlink is an interface description format and protocol that aims to make services accessible to both humans and machines in the simplest feasible way.A varlink interface combines the classic UNIX command line options, STDIN/OUT/ERROR text formats, man pages, service metadata and provides the equivalent over a single file descriptor, a.k.a. “FD3”. Varlink is plain-text, type-safe, discoverable, self-documenting, remotable, testable, easy to debug. Varlink is accessible from any programming environment
APIs offered by the kernel for file descriptor transfer between processes have been awkward to use and riddled with a number of gotchas. On newer versions of Linux (5.6 and above), a far better API exists
In this post, we will consider whether Raft can guarantee liveness, and specifically whether it can establish a stable leader, if a network fault means that some servers are no longer connected to each other.
PCIe-attached solid-state drives offer high throughput and largecapacity at low cost. Modern servers can easily host 4 or 8 suchSSDs, resulting in an aggregated bandwidth that hitherto was onlyachievable using DRAM. In this paper we study how to exploit suchDirectly-Attached NVMe Arrays (DANA) in database systems. Wefind that DANA presents new challenges that require rethinking theway I/O operations are performed at both the database and operat-ing system layer
Serverless computing offers an event driven pay-as-you-go framework for application development. A key selling point is the concept of no back-end server management, allowing developers to focus on application functionality. This is achieved through severe abstraction of the underlying architecture the functions run on. We examine the underlying architecture and report on the performance of serverless functions and how they are effected by certain factors such as memory allocation and interference caused by load induced by other users on the platform. Specifically, we focus on the serverless offerings of the four largest platforms; AWS Lambda, Google Cloud Functions, Microsoft Azure Functions and IBM Cloud Functions}. In this paper, we observe and contrast between these platforms in their approach to the common issue of "cold starts", we devise a means to unveil the underlying architecture serverless functions execute on and we investigate the effects of interference from load on the platform over the time span of one month.
Last time in the 1st article I briefly introduced Firecracker as a lightweight virtualization/containerization solution for extreme-scale (like AWS Lambda functions). This time around I am going to dig a bit deeper into the API and the management of microVMs. I am going to install Alpine on RPI, install Rust and Python, Docker, get the Linux kernel source and compile a new kernel with minimal config, compile our own Firecracker and then create a new rootfs to be able to boot up a guest. Most of these steps are optional, you can use the stock kernel the Firecracker team provides or download Firecracker release from Github.
Noise is a framework that can be used to construct secure channel protocols. Noise takes a fairly basic set of cryptographic operations and allows them to be combined in ways that provide various security properties. Noise is not a protocol itself: it is a protocol framework, so by "filling in the blanks" you get a concrete protocol that has essentially no knobs to twist
With the rise of data-centric process management paradigms, interdependent processes, such as artifacts or object lifecycles, form a business process through their interactions. Coordination processes may be used to coordinate these interactions, guiding the overall business process towards a meaningful goal. A coordination process model specifies coordination constraints between the interdependent processes in terms of semantic relationships. At run-time, these coordination constraints must be enforced by a coordination process instance. As the coordination of multiple interdependent processes is a complex endeavor, several challenges need to be fulfilled to achieve optimal process coordination. For example, processes must be allowed to run asynchronously and concurrently, taking their complex relations into account. This paper contributes the operational semantics of coordination processes, which enforces the coordination constraints at run-time. Coordination processes form complex structures to adequately represent processes and their relations, specifically supporting many-to-many relationships. Based on these complex structures, markings and process rules allow for the flexible enactment of the interdependent processes while fulfilling all challenges. Coordination processes represent a sophisticated solution to the complex problem of coordinating interdependent, concurrently running processes.
CUE is an open source language, with a rich set of APIs and tooling, for defining, generating, and validating all kinds of data: configuration, APIs, database schemas, code, … you name it.
P is a language for asynchronous event-driven programming. P allows the programmer to specify the system as a collection of interacting state machines, which communicate with each other using events. P unifies modeling and programming into one activity for the programmer. Not only can a P program be compiled into executable code, but it can also be systematically tested using Model Checking.
Strong typedefs in Rust. Make invalid states unrepresentable
I wanted to gather some notes I've kept in my head over the years about Interval Tree Clocks. Interval Tree clocks have been presented in a 2008 paper by Paulo Sérgio Almeida, Carlos Baquero, and Victor Fonte. It's one of the most interesting and readable papers I've seen, but I don't recall seeing many production systems using the mechanism, and I'm not fully sure why
Modern NVMe devices change the nature of how to best perform I/O in stateful applications. This trend has been going on for a while but has been so far masked by the fact that the APIs, especially the higher level ones, haven’t evolved to match what has been happening in the device — and more recently Linux Kernel layers. With the right set of APIs, Direct I/O is the new black.
Testing for causality between events in distributed executions is a fundamental problem. Vector clocks solve this problem but do not scale well. The probabilistic Bloom clock can determine causality between events with lower space, time, and message-space overhead than vector clock; however, predictions suffer from false positives. We give the protocol for the Bloom clock based on Counting Bloom filters and study its properties including the probabilities of a positive outcome and a false positive. We show the results of extensive experiments to determine how these above probabilities vary as a function of the Bloom timestamps of the two events being tested, and to determine the accuracy, precision, and false positive rate of a slice of the execution containing events in the temporal proximity of each other. Based on these experiments, we make recommendations for the setting of the Bloom clock parameters. We postulate the causality spread hypothesis from the application's perspective to indicate whether Bloom clocks will be suitable for correct predictions with high confidence. The Bloom clock design can serve as a viable space-, time-, and message-space-efficient alternative to vector clocks if false positives can be tolerated by an application.
Preloading is a feature of the dynamic linker (ld). On most UNIX systems ld allows you to load a number of libraries before those that are required by an executable. The cwrap project uses this feature to implement special libraries which can be used to fake certain behavior of the system. Think of it like "The Matrix" where reality is simulated and everything is a lie!
Consensus-based replicated systems are complex, monolithic, and difficult to upgrade once deployed. As a result, deployed systems do not benefit from innovative research, and new consensus protocols rarely reach production. We propose virtualizing consensus by virtualizing the shared log API, allowing services to change consensus protocols without downtime. Virtualization splits the logic of consensus into the VirtualLog, a generic and reusable reconfiguration layer; and pluggable ordering protocols called Loglets. Loglets are simple, since they do not need to support reconfiguration or leader election; diverse, consisting of different protocols, codebases, and even deployment modes; and composable, via RAID-like stacking and striping. We describe a production database called Delos which leverages virtual consensus for rapid, incremental development and deployment. Delos reached production within 8 months, and 4 months later upgraded its consensus protocol without downtime for a 10X latency improvement. Delos can dynamically change its performance properties by changing consensus protocols: we can scale throughput by up to 10X by switching to a disaggregated Loglet, and double the failure threshold of an instance without sacrificing throughput via a striped Loglet.