Over the last few years, we have implemented MySQL Raft, a Raft consensus engine that was integrated with MySQL to build a replicated state machine. We have migrated a large portion of our deployment to MySQL Raft and plan to fully replace the current MySQL semisynchronous databases with it. The project has delivered significant benefits to the MySQL deployment at Meta, including higher reliability, provable safety, significant improvements in failover time, and operational simplicity — all with equal or comparable write performance.
Introduced in 2019 (kernel 5.1) by Jens Axboe, io_uring (henceforth uring) is a system for providing the kernel with a schedule of system calls, and receiving the results as they're generated. Whereas epoll and kqueue support multiplexing, where you're told when you can usefully perform a system call using some set of filters, uring allows you to specify the system calls themselves (and dependencies between them), and execute the schedule at the kernel "dataflow limit". It combines asynchronous I/O, system call polybatching, and flexible buffer management, and is IMHO the most substantial development in the Linux I/O model since Berkeley sockets
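As a rough feel for that submission/completion model, here is a minimal liburing sketch, assuming liburing is installed (link with -luring); the file path and queue depth are arbitrary. It queues one read SQE, submits it, and waits for the matching CQE:

```c
/* Minimal liburing sketch (assumption: liburing available; link with -luring). */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0)            /* 8-entry SQ/CQ rings */
        return 1;

    int fd = open("/etc/hostname", O_RDONLY);             /* arbitrary example file */
    if (fd < 0)
        return 1;

    char buf[256];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);   /* grab a submission slot */
    io_uring_prep_read(sqe, fd, buf, sizeof buf, 0);      /* describe the read */
    io_uring_submit(&ring);                               /* hand the schedule to the kernel */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                       /* block until the completion arrives */
    if (cqe->res > 0)
        fwrite(buf, 1, (size_t)cqe->res, stdout);
    io_uring_cqe_seen(&ring, cqe);                        /* mark the CQE as consumed */

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```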
In this paper, we describe our changes to the original Raft algorithm required for supporting flexible data commit quorums. We discuss the impact of these changes on workload performance, fault tolerance and ease of integration into the existing production setup.
What every web developer must know about mobile networks, protocols, and APIs provided by the browser to deliver the best user experience.
This post is best described as a technology demonstration; it melds together web servers, plugins, WebAssembly, Go, Rust and ABIs
ast-grep (sg) is a lightning-fast, user-friendly tool for code searching, linting, and rewriting at large scale.
Bcachefs is a modern, general-purpose, copy-on-write filesystem descended from bcache, a block-layer cache. The internal architecture is very different from most existing filesystems, where the inode is central and many data structures hang off of the inode. Instead, bcachefs is architected more like a filesystem on top of a relational database, with tables for the different filesystem data types: extents, inodes, dirents, xattrs, et cetera. Bcachefs supports almost all of the same features as other modern COW filesystems, such as ZFS and btrfs, but in general with a cleaner, simpler, higher-performance design.
Lightweight fuzzing of a memory snapshot using KVM. Snapchange provides the ability to load a raw memory dump and register state into a KVM virtual machine (VM) for execution. At a point in execution, this VM can be reset to its initial state by resetting the dirty pages found by KVM or pages manually dirtied by a fuzzer.
Looking up a string in a static set is a common problem we encounter when parsing textual formats. Such sets are often the keywords of a programming language or a protocol. Parsing HTTP verbs turned out to be fastest when using a compile-time trie: a series of nested switch statements. I could not believe that a perfect hash function is not better, and that led to a novel hashing approach based on the PEXT (Parallel Bits Extract) instruction.
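A minimal sketch of what such a nested-switch trie can look like for a handful of HTTP verbs; the enum and function names are illustrative, not taken from the article:

```c
/* Hypothetical compile-time trie: dispatch one character at a time via nested switches. */
#include <string.h>

enum http_verb { VERB_UNKNOWN, VERB_GET, VERB_PUT, VERB_POST, VERB_PATCH };

static enum http_verb parse_verb(const char *s)
{
    switch (s[0]) {
    case 'G':
        return strcmp(s + 1, "ET") == 0 ? VERB_GET : VERB_UNKNOWN;
    case 'P':
        switch (s[1]) {
        case 'U': return strcmp(s + 2, "T") == 0 ? VERB_PUT : VERB_UNKNOWN;
        case 'O': return strcmp(s + 2, "ST") == 0 ? VERB_POST : VERB_UNKNOWN;
        case 'A': return strcmp(s + 2, "TCH") == 0 ? VERB_PATCH : VERB_UNKNOWN;
        }
        return VERB_UNKNOWN;
    default:
        return VERB_UNKNOWN;
    }
}
```

The compiler turns the outer switches into jump tables or comparison ladders, which helps explain why this form can compete with hashing for small keyword sets.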
Otel-desktop-viewer is a CLI tool that receives OpenTelemetry traces on your local machine and helps you visualize and explore your trace data without needing to send it on to a telemetry vendor.
We will be blunt: if someone is going to build a new MVCC DBMS today, they should not do it the way PostgreSQL does (e.g., append-only storage with autovacuum). In our 2018 VLDB paper (aka “the best paper ever on MVCC”), we did not find another DBMS doing MVCC the way PostgreSQL does it. Its design is a relic of the 1980s, predating the proliferation of log-structured system patterns in the 1990s.
The full-text search (FTS) features available in Postgres and how you might use them to build a hypothetical search API endpoint
For a strongly consistent, distributed metadata store such as Edgestore—serving 10 million requests per second and storing multiple petabytes of metadata—writes spanning multiple physical storage nodes are an inevitability. Although our initial “best-effort” approach to multi-shard writes worked well for most use cases, over time the balance of complexity shifted too heavily on developers. Therefore, we decided to tackle the problem of building a scalable primitive for multi-shard writes and implemented cross-shard transactions.
We present two abstractions for designing modular state machine replication (SMR) protocols: trees and turtles. A tree captures the set of possible state machine histories, while a turtle represents a subprotocol that tries to find agreement in this tree. We showcase the applicability of these abstractions by constructing crash-tolerant SMR protocols out of abstract tree turtles and providing examples of tree turtle implementations. The modularity of tree turtles allows a generic approach for adding a leader for liveness. We expect that these abstractions will simplify reasoning and formal verification of SMR protocols as well as facilitate innovation in protocol designs.
A distributed multi-writer multi-reader (MWMR) atomic register is an important primitive that enables a wide range of distributed algorithms. Hence, improving its performance can have large-scale consequences. Since the seminal work of ABD emulation in the message-passing networks [JACM '95], many researchers study fast implementations of atomic registers under various conditions. "Fast" means that a read or a write can be completed with 1 round-trip time (RTT), by contacting a simple majority. In this work, we explore an atomic register with optimal resilience and "optimistically fast" read and write operations. That is, both operations can be fast if there is no concurrent write. This paper has three contributions: (i) We present Gus, the emulation of an MWMR atomic register with optimal resilience and optimistically fast reads and writes when there are up to 5 nodes; (ii) We show that when there are > 5 nodes, it is impossible to emulate an MWMR atomic register with both properties; and (iii) We implement Gus in the framework of EPaxos and Gryff, and show that Gus provides lower tail latency than state-of-the-art systems such as EPaxos, Gryff, Giza, and Tempo under various workloads in the context of geo-replicated object storage systems.
How is tail recursion different from regular recursion? What do continuations have to do with this, what is CPS, and how do trampolines help? This article provides an introduction, with code samples in Python and Clojure
In this post we're going to focus on the ways that a single load balancer might distribute HTTP requests to a set of servers. We'll start from the bottom and work our way up to modern load balancing algorithms.
Did you know there’s a function in Postgres that lets you write data which you can’t query? A function that lets you persist data in all kinds and shapes but which will never show up in any table? Let me tell you about pg_logical_emit_message()! It’s a Postgres function that allows you to write messages to the write-ahead log (WAL) of the database
In this paper, we propose LightIOV, a novel software-based NVMe virtualization mechanism that achieves high performance and scalability without consuming valuable CPU resources and without requiring special hardware support. LightIOV can support thousands of VMs on each server. The key idea behind LightIOV is NVMe hardware I/O queues passthrough, which enables VMs to directly access I/O queues of NVMe devices, thus eliminating virtualization overhead and providing near-native performance
Auto-scaling, multi-tenant data ingestion pipelines that guarantee exactly-once ingestion of every event
Parquet storage hierarchy: Root > Parquet Files > Row Groups > Columns > Data Pages
Your default hash table should be open-addressed, using Robin Hood linear probing with backward-shift deletion. When prioritizing deterministic performance over memory efficiency, two-way chaining is also a good choice.
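For illustration, a hedged sketch of backward-shift deletion over an open-addressed table of 64-bit keys; the hash, table size, and use of 0 as the empty marker are simplifications, not the article's code:

```c
#include <stdint.h>
#include <stddef.h>

#define CAP 1024u                      /* power-of-two capacity */
static uint64_t slots[CAP];            /* 0 marks an empty slot (keys must be nonzero) */

static size_t home(uint64_t key) { return (size_t)(key * 0x9E3779B97F4A7C15ull) & (CAP - 1); }
static size_t probe_dist(uint64_t key, size_t i) { return (i + CAP - home(key)) & (CAP - 1); }

/* Backward-shift deletion: no tombstones. After removing the entry at slot i,
 * pull the following cluster back one position until we reach an empty slot
 * or an element already sitting in its home slot (probe distance 0). */
static void erase_slot(size_t i)       /* i: slot where the key to remove was found */
{
    size_t next = (i + 1) & (CAP - 1);
    while (slots[next] != 0 && probe_dist(slots[next], next) != 0) {
        slots[i] = slots[next];        /* shift the displaced entry backward */
        i = next;
        next = (next + 1) & (CAP - 1);
    }
    slots[i] = 0;                      /* close the hole */
}
```

Avoiding tombstones this way keeps probe sequences short even after many deletions, which is part of why Robin Hood probing with backward shift behaves so predictably.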
The revolutionary ideas that empowered the Web. A simpler approach to building applications on the Web and beyond with htmx and Hyperview. Enhancing web applications without using SPA frameworks
CRDT info aggregation
sqlc: generate type-safe code from SQL.
In distributed systems, logical clocks play a key role in the ordering of system events. What are the various logical clock designs, and how do they help with event ordering? This article answers these questions
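As one concrete design, a Lamport clock fits in a few lines; this is an illustrative C sketch, not code from the article:

```c
#include <stdint.h>

typedef struct { uint64_t time; } lamport_clock;

/* Local event or message send: advance the counter and stamp the event. */
static uint64_t lamport_tick(lamport_clock *c) { return ++c->time; }

/* Message receive: take the max of local and received timestamps, then advance. */
static uint64_t lamport_recv(lamport_clock *c, uint64_t msg_time)
{
    if (msg_time > c->time)
        c->time = msg_time;
    return ++c->time;
}
```

If event a causally precedes event b, a's timestamp is guaranteed to be smaller; the converse does not hold, which is where vector clocks and other designs come in.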
A simple, modern and secure encryption tool (and Go library) with small explicit keys, no config options, and UNIX-style composability.
Effect systems refine types with information about the behaviour of programs. They have been used for many purposes, such as optimizing programs, determining resource usage, and finding bugs. So far, however, work on effect systems has largely concentrated on call-by-value languages. We consider the problem of designing an effect system for a lazy language. This is more challenging because it depends on the ability to locate the first use of each variable. Coeffect systems, which track contextual requirements of programs, provide a method of doing this. We describe how to track variable usage in a coeffect system that can be instantiated for different reduction strategies, including call-by-need. We then add effects to the result, allowing work that has been done on effect systems for call-by-value languages to be applied to lazy languages.
Coeffects are Tomas Petricek's PhD research project. They are a programming language abstraction for understanding how programs access the context or environment in which they execute
As a proponent of statically typed functional languages I believe that a context-aware programming language should capture such context information in the type system and make sure that basic errors (like the ones demonstrated in the four examples above) are ruled out at compile time. This is essentially the idea behind coeffects
Run AlloyDB anywhere—on-premises, at the edge, across clouds, or on developer laptops, with full PostgreSQL compatibility and Google Cloud support.
MRSK deploys web apps anywhere from bare metal to cloud VMs using Docker with zero downtime. It uses the dynamic reverse-proxy Traefik to hold requests while the new application container is started and the old one is stopped. It works seamlessly across multiple hosts, using SSHKit to execute commands. It was built for Rails applications, but works with any type of web app that can be containerized with Docker
Ventoy is an open source tool to create bootable USB drives for ISO/WIM/IMG/VHD(x)/EFI files
Tectonic is a modernized, complete, self-contained TeX/LaTeX engine, powered by XeTeX and TeXLive.
PostgreSQL wire protocol implemented as a Rust library.
This article will demonstrate a simple, practical, and robust approach to spawning and managing threads using only raw system calls. It only takes about a dozen lines of C, including 4 or so inline assembly instructions.
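The article works with raw system calls and inline assembly; as a hedged approximation of the same idea, here is a sketch using glibc's clone(2) wrapper instead (flags, stack size, and the worker function are illustrative, not the article's code):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)

static int worker(void *arg)
{
    printf("hello from the child, arg=%s\n", (const char *)arg);
    return 0;
}

int main(void)
{
    /* Allocate a stack for the child; clone() takes a pointer to its top. */
    char *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED) { perror("mmap"); return 1; }

    /* Share the address space, open files, and filesystem info with the child;
     * SIGCHLD in the low byte lets the parent reap it with waitpid(). */
    pid_t pid = clone(worker, stack + STACK_SIZE,
                      CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD, "demo");
    if (pid == -1) { perror("clone"); return 1; }

    waitpid(pid, NULL, 0);
    munmap(stack, STACK_SIZE);
    return 0;
}
```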
Boxxy puts bad Linux applications in a box with only their files.
Typst is a new markup-based typesetting system that is designed to be as powerful as LaTeX while being much easier to learn and use
Flexible Paxos is the simple observation that it is not necessary to require all quorums in Paxos to intersect. It is sufficient to require that the quorum used by the leader election phase (phase-1) will overlap with the quorums used by previous replication phases (phase-2). Majority quorums are one such way to meet this requirement, but many more exist. Thus, Paxos is just a single point on a broad spectrum of possibilities for safely reaching distributed consensus. To learn more about Flexible Paxos and how it is used to build more resilient distributed systems, take a look at the papers, talks, and open source projects below.
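For plain counting quorums, the requirement boils down to arithmetic: a phase-1 quorum of size q1 and a phase-2 quorum of size q2 over n acceptors always intersect exactly when q1 + q2 > n. A tiny illustrative check (not from the linked material):

```c
#include <stdbool.h>
#include <stdio.h>

/* Flexible Paxos safety condition for counting quorums: every phase-1 quorum
 * must intersect every phase-2 quorum, i.e. q1 + q2 > n. */
static bool fpaxos_quorums_safe(int n, int q1, int q2)
{
    return q1 + q2 > n;
}

int main(void)
{
    printf("n=6 q1=4 q2=4 -> %d\n", fpaxos_quorums_safe(6, 4, 4)); /* majorities: safe */
    printf("n=6 q1=5 q2=2 -> %d\n", fpaxos_quorums_safe(6, 5, 2)); /* smaller phase-2 quorum: still safe */
    printf("n=6 q1=3 q2=3 -> %d\n", fpaxos_quorums_safe(6, 3, 3)); /* two disjoint halves: unsafe */
    return 0;
}
```

Trading a larger leader-election quorum for a smaller replication quorum is exactly the kind of point on the spectrum that Flexible Paxos opens up.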
WeasyPrint is a smart solution helping web developers to create PDF documents. It’s free and open-source software that can be easily plugged into your applications and websites, turning simple HTML pages into gorgeous statistical reports, invoices, and tickets.
Mountpoint for Amazon S3 is a simple, high-throughput file client for mounting an Amazon S3 bucket as a local file system
An LD_PRELOAD-able shared library that intercepts file operations: if a program tries to open a dotfile in $HOME, it is redirected to $XDG_CONFIG_HOME (as defined by freedesktop).
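A hedged sketch of how such an interposer can work; the redirection rule here is simplified, and the real library's behavior, naming, and build flags may differ (compile with something like `gcc -shared -fPIC -o libredirect.so redirect.c -ldl` and run with LD_PRELOAD pointing at it):

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <limits.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Interpose open(): if the path is a dotfile directly under $HOME,
 * try the corresponding name under $XDG_CONFIG_HOME instead. */
int open(const char *path, int flags, ...)
{
    static int (*real_open)(const char *, int, ...);
    if (!real_open)
        real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

    mode_t mode = 0;
    if (flags & O_CREAT) {              /* open() is variadic when O_CREAT is set */
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }

    const char *home = getenv("HOME");
    const char *xdg  = getenv("XDG_CONFIG_HOME");
    if (home && xdg) {
        size_t n = strlen(home);
        if (strncmp(path, home, n) == 0 && path[n] == '/' && path[n + 1] == '.') {
            char redirected[PATH_MAX];
            snprintf(redirected, sizeof redirected, "%s/%s", xdg, path + n + 2);
            return real_open(redirected, flags, mode);   /* e.g. ~/.foorc -> $XDG_CONFIG_HOME/foorc */
        }
    }
    return real_open(path, flags, mode);
}
```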
This library is yet another implementation of delimited continuations for the JVM using bytecode instrumentation. It primarily implements resumable exceptions, and there is also a classical multi-prompt delimited continuations implementation built on top of the resumable exceptions. Use plain Java exception handling to capture, compose, and run parts of a program's call stack. No functional programming knowledge is required.
Cloud computing environments allow customers to dynamically scale their applications. The key problem is how to lease the right amount of resources, on a pay-as-you-go basis. Application re-dimensioning can be implemented effortlessly, adapting the resources assigned to the application to the incoming user demand. However, the identification of the right amount of resources to lease in order to meet the required Service Level Agreement, while keeping the overall cost low, is not an easy task. Many techniques have been proposed for automating application scaling. We propose a classification of these techniques into five main categories: static threshold-based rules, control theory, reinforcement learning, queuing theory and time series analysis. Then we use this classification to carry out a literature review of proposals for auto-scaling in the cloud
Algebraic effects and handlers are an emerging new feature to model effectful computations and attract attention not only from researchers but also from programmers. They are implemented in various ways as part of compilers, interpreters, or as libraries. We present a direct embedding of one-shot algebraic effects and handlers in a language which has asymmetric coroutines. The key observation is that, by restricting the use of continuations to be one-shot, we obtain a simple and sufficiently general implementation via coroutines, which are available in many modern programming languages. We have implemented our embedding as a library in Lua and Ruby, which allows one to write effectful programs in a modular way using algebraic effects and handlers
How to use and configure a Yubikey
This is a help target for GNU Makefiles that allows for automatic documentation
Wasm Workers Server (wws) is a framework to develop and run serverless applications in WebAssembly.
Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, branching, and bottomless storage.
In a partitioned Bloom Filter (PBF) the bit vector is split into disjoint parts, one per hash function. Contrary to hardware designs, where they prevail, software implementations mostly ignore PBFs, considering them worse than standard Bloom filters (SBF), due to the slightly larger false positive rate (FPR). In this paper, by performing an in-depth analysis, first we show that the FPR advantage of SBFs is smaller than thought; more importantly, by deriving the per-element FPR, we show that SBFs have weak spots in the domain: elements that test as false positives much more frequently than expected. This is relevant in scenarios where an element is tested against many filters. Moreover, SBFs are prone to exhibit extremely weak spots if naive double hashing is used, something occurring in mainstream libraries. PBFs exhibit a uniform distribution of the FPR over the domain, with no weak spots, even using naive double hashing. Finally, we survey scenarios beyond set membership testing, identifying many advantages of having disjoint parts, in designs using SIMD techniques, for filter size reduction, test of set disjointness, and duplicate detection in streams. PBFs are better, and should replace SBFs, in general purpose libraries and as the base for novel designs
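To make the "one disjoint part per hash function" idea concrete, here is an illustrative C sketch of a partitioned Bloom filter; the sizes and hash mixing are arbitrary, not from the paper:

```c
#include <stdbool.h>
#include <stdint.h>

#define K 4                              /* hash functions == partitions */
#define PART_BITS 1024                   /* bits per partition */

static uint8_t parts[K][PART_BITS / 8];  /* disjoint bit vectors, one per hash */

/* Cheap 64-bit mixer standing in for K independent hash functions. */
static uint32_t hash_i(uint64_t key, int i)
{
    uint64_t h = key ^ (0x9E3779B97F4A7C15ull * (uint64_t)(i + 1));
    h ^= h >> 33; h *= 0xFF51AFD7ED558CCDull; h ^= h >> 33;
    return (uint32_t)(h % PART_BITS);
}

static void pbf_add(uint64_t key)
{
    for (int i = 0; i < K; i++) {
        uint32_t b = hash_i(key, i);
        parts[i][b / 8] |= (uint8_t)(1u << (b % 8));   /* one bit set in each partition */
    }
}

static bool pbf_maybe_contains(uint64_t key)
{
    for (int i = 0; i < K; i++) {
        uint32_t b = hash_i(key, i);
        if (!(parts[i][b / 8] & (1u << (b % 8))))
            return false;                /* definitely absent */
    }
    return true;                         /* possibly present */
}
```

Because each hash function owns its own partition, every element sets exactly one bit per part, which is what keeps the per-element false-positive rate uniform across the domain.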
Recent high-performance storage devices have exposed software inefficiencies in existing storage stacks, leading to a new breed of I/O stacks. The newest storage API of the Linux kernel is io_uring. We perform one of the first in-depth studies of io_uring, and compare its performance and disadvantages with the established libaio and SPDK APIs. Our key findings reveal that (i) polling design significantly impacts performance; (ii) with enough CPU cores, io_uring can deliver performance close to that of SPDK; and (iii) performance scalability over multiple CPU cores and devices requires careful consideration and necessitates a hybrid approach. Last, we provide design guidelines for developers of storage-intensive applications.
Kernel-bypass techniques for high-speed network packet processing
We believe that querying data in Apache Parquet files directly can achieve storage efficiency and query performance similar to or better than that of most specialized file formats. While it requires significant engineering effort, the benefits of Parquet’s open format and broad ecosystem support make it the obvious choice for a wide class of data systems. In this article we explain several advanced techniques needed to query data stored in the Parquet format quickly, which we implemented in the Apache Arrow Rust Parquet reader. Together these techniques make the Rust implementation one of the fastest, if not the fastest, implementations for querying Parquet files — be it on local disk or remote object storage.
Asynchronously replicated primary-backup databases are commonly deployed to improve availability and offload read-only transactions. To both apply replicated writes from the primary and serve read-only transactions, the backups implement a cloned concurrency control protocol. The protocol ensures read-only transactions always return a snapshot of state that previously existed on the primary. This compels the backup to exactly copy the commit order resulting from the primary's concurrency control. Existing cloned concurrency control protocols guarantee this by limiting the backup's parallelism. As a result, the primary's concurrency control executes some workloads with more parallelism than these protocols. In this paper, we prove that this parallelism gap leads to unbounded replication lag, where writes can take arbitrarily long to replicate to the backup and which has led to catastrophic failures in production systems. We then design C5, the first cloned concurrency protocol to provide bounded replication lag. We implement two versions of C5. Our evaluation in MyRocks, a widely deployed database, demonstrates C5 provides bounded replication lag. Our evaluation in Cicada, a recent in-memory database, demonstrates C5 keeps up with even the fastest of primaries.
Everyone knows that if you use TCP to transfer data across the Internet any corrupted TCP segment is detected and retransmitted. Well, like so many other things, what everyone knows is not always correct.
Pgec is an Apache-licensed, real-time, in-memory PostgreSQL logical replication cache with Redis, Memcached, and REST APIs. It supports column lists and row filters, the latest replication features in PostgreSQL 15.
In the last few months there has been a lot of hype about “passkeys” and how they are going to change authentication forever. But that hype will come at a cost. Obsession with passkeys is about to turn your security keys (yubikeys, feitian, nitrokeys, …) into obsolete and useless junk. It all comes down to one thing: resident keys.
This document aims to highlight some of the features that are available and tailored to networked applications, and assumes that the reader is already familiar with io_uring basics. It does not attempt to be a case study in gains achievable by switching from epoll to io_uring.
Since developers keep trying to use mmap in new DBMSs, we wrote this paper to provide a warning to others that mmap is not a suitable replacement for a traditional buffer pool. We discuss the main shortcomings of mmap in detail, and our experimental analysis demonstrates clear performance limitations. Based on these findings, we conclude with a prescription for when DBMS developers might consider using mmap for file I/O.
We enhance the well-established software combining synchronization technique to create combining funnels. Previous software combining methods used a statically assigned tree whose depth was logarithmic in the total number of processors in the system. On shared memory multiprocessors the new method allows one to dynamically build combining trees with depth logarithmic in the actual number of processors concurrently accessing the data structure. The structure is comprised from a series of combining layers through which processors' requests are funneled. These layers use randomization instead of a rigid tree structure to allow processors to find partners for combining. By using an adaptive scheme the funnel can change width and depth to accommodate different access frequencies without requiring global agreement as to its size. Rather, processors choose parameters of the protocol privately, making this scheme very simple to implement and tune. When we add an “elimination” mechanism to the funnel structure, the randomly constructed “tree” is transformed into a “forest” of disjoint (and on average shallower) trees of requests, thus enhancing the level of parallelism and decreasing latency. We present two new linearizable combining funnel based data structures: a fetch-and-add object and a stack. We study the performance of these structures by benchmarking them against the most efficient software implementations of fetch-and-add and stacks known to date, combining trees and elimination trees, on a simulated shared memory multiprocessor using Proteus. Our empirical data shows that combining funnel-based fetch-and-add outperforms combining trees of fixed height by as much as 70%. In fact, even compared to combining trees optimized for a given load, funnel performance is the same or better. Elimination trees, which are not linearizable, are 10% faster than funnels under highest load, but as load drops, combining funnels adapt their size, giving them a 34% lead in latency
This repo contains the scripts necessary to install and run a tailscale instance on your Unifi Dream Machine (UDM/UDM Pro/UDR/UDM-SE).
Minimal, self-hosted, 0-config alternative to ngrok. Caddy+OpenSSH+50 lines of Python.
Nelua is a systems programming language for performance sensitive applications, like real-time applications and game engines. Its syntax and semantics are similar to Lua, but its garbage collection is optional, it provides optional type notations, and it is free from an interpreter. Nelua uses ahead-of-time compilation to generate optimized native binaries. It is metaprogrammable at compile-time using Lua and it is simple and easy to use.
A Development Container (or Dev Container for short) allows you to use a container as a full-featured development environment. It can be used to run an application, to separate tools, libraries, or runtimes needed for working with a codebase, and to aid in continuous integration and testing. Dev containers can be run locally or remotely, in a private or public cloud.
Handshake is a decentralized, permissionless naming protocol where every peer is validating and in charge of managing the root DNS naming zone with the goal of creating an alternative to existing Certificate Authorities and naming systems
This paper develops a theory of deadlock and leak freedom for higher-order locks in a shared memory concurrent setting. Higher-order locks allow sharing not only of basic values but also of other locks and channels, and are themselves first-class citizens. The theory is based on the notion of a sharing topology, administrating who is permitted to access shared data at what point in the program. The paper first develops higher-order locks for acyclic sharing topologies, instantiated in a 𝜆-calculus with higher-order locks and message-passing concurrency. The paper then extends the calculus to support circular dependencies with dynamic lock orders, which we illustrate with a dynamic version of Dijkstra’s dining philosophers problem. Well-typed programs in the resulting calculi are shown to be free of deadlocks and memory leaks, with proofs mechanized in the Coq proof assistant.
We describe the internal structure of boost::unordered_flat_map and provide theoretical analyses and benchmarking data to help readers gain insights into the key design elements behind this container's excellent performance. Interface and behavioral differences with the standard are also discussed.
In this series of posts we’ll write a program that reads lots of files stored on disk using Linux io_uring and C++20 coroutines. The main motivation comes from the fact that there are lots of in-depth resources for both io_uring and C++20 coroutines, but practically nothing showing how to combine them. We will discover that asynchronous I/O and coroutines go together as well as bread and butter.
The task-based dataflow programming model has emerged as an alternative to the process-centric programming model for extreme-scale applications. However, load balancing is still a challenge in task-based dataflow runtimes. In this paper, we present extensions to the PaRSEC runtime to demonstrate that distributed work stealing is an effective load-balancing method for task-based dataflow runtimes. In contrast to shared-memory work stealing, we find that each process should consider future tasks and the expected waiting time for execution when determining whether to steal. We demonstrate the effectiveness of the proposed work-stealing policies for a sparse Cholesky factorization, which shows a speedup of up to 35% compared to a static division of work.