Matt McShane

Restartable Sequences
- concurrency
- linux
2026-05-31T16:28:35+00:00
The best kept secret at the frontier of system programming right now is the Linux 4.18+ (c. 2018) concept of restartable sequences or rseq for short. They allow you to create thread-safe data structures without locks or atomics which scale to microprocessors with many cores.
Yggdrasil Network as an Embedded Go Library
- overlay
- go
2026-05-14T09:10:34+00:00
Yggdrasil is an experimental overlay IPv6 mesh network. In short, it lets you build a "network on top of a network": each node gets a stable IPv6 address derived from its public key, and that address does not depend on where the node is physically located or what external IP address it currently has. Nodes can connect to public peers, to each other directly, or discover each other on the local network. Once connectivity is established, ordinary TCP/UDP applications can communicate as if they were simply using another IPv6 network. In the classic setup, Yggdrasil is a daemon that creates a virtual network interface in the operating system. But sometimes it would be useful to embed Yggdrasil directly into an application. For example, into Matrix clients, or into web applications.
Designing microkernel IPC
- ipc
2026-05-14T09:08:35+00:00
Over the past few weeks, I had a lot of fun simplifying the IPC design in the FTL operating system. While IPC is a simple memory copy operation between processes, you'll run into interesting problems to consider.
PGKeeper: Building the Bouncer We Needed for Postgres
- postgresql
- go
- queuing
- load-balancing
2026-05-05T00:46:37+00:00
Our database layer has to contend with an onslaught of novel workloads and increased traffic. It was clear we were outgrowing PgBouncer, a lightweight and widely adopted PostgreSQL connection pooler. That led us to build PGKeeper, a new connection and load management service that replaces PgBouncer in front of our Postgres fleet. In this post, we walk through how we designed and rolled it out.
Zero-config Go heap profiling
- go
- memory
- observability
2026-05-04T14:41:57+00:00
Heap profiles for your Go services no longer require pprof endpoints, scraping configuration, or a deploy. Coroot picks them up automatically from whatever is already running on your nodes, with no code changes, no annotations, and no restart.
C8s: A Confidential Kubernetes Architecture
2026-05-04T00:55:40+00:00
This paper presents C8s, a confidential computing architecture for Kubernetes that provides cryptographically rooted confidentiality, integrity, and verifiability guarantees for Kubernetes clusters from infrastructure operators. These guarantees are cryptographically provable to any independent third party verifier. The architecture is built on hardware Trusted Execution Environments (TEEs), specifically AMD SEV-SNP, Intel TDX, and NVIDIA Confidential Computing support, to establish an attestation-rooted trust boundary around confidential VMs. This design is compatible with managed Kubernetes services such as Amazon EKS, Google GKE, and Microsoft AKS, where the control plane cannot be attested. Under this boundary, three groups gain guarantees that are absent from conventional deployments. Data and artifact owners can deploy sensitive workloads and proprietary artifacts on third-party infrastructure without risking exfiltration. Compute providers can offer execution services without revealing workloads to cloud operators. End users can submit requests that remain opaque to all parties except the attested TEE processing them. Representative workloads include AI inference, securing AI model weights, and training or fine-tuning on sensitive data.
Pandoc for the people
- pandoc
- wasm
2026-04-30T17:07:33+00:00
Convert documents without leaving the browser
Project Hummingbird
- containers
- deployment
2026-04-28T23:42:57+00:00
Minimal, hardened, and secure container images
Chamelio: A Fast Shared Cloud Network Stack for Isolated Tenant-Defined Protocols
2026-04-28T02:18:19+00:00
Conventional cloud network virtualization sends packets through multiple guest and host layers, inflating CPU cost and tail latency. Shared host datapaths collapse this layering into one optimized path across tenants, but existing shared stacks are fixed-function: tenants cannot specialize their protocols. eBPF is the natural vehicle for restoring programmability to a shared datapath, but today's extensions are hook-sized, and its verifier provides safety -- not performance isolation: one tenant's per-packet work can inflate every other tenant's tail latency. Chamelio is a programmable shared network stack that lets tenants implement full protocols through a bounded eBPF fast path and a tenant slow path, while approaching the performance and preserving the strong isolation of fixed shared stacks. It combines three ideas: a shared-stack architecture for tenant-defined protocols; joint optimisation of tenant handlers with provider infrastructure and co-resident tenants in the shared fast path; and a bounded fast path contract with runtime cycle accounting that keeps tenant programmability compatible with strong performance isolation. A tenant programmable TCP on Chamelio reaches 9.2 Mreq/s, matching the hand-tuned TAS stack; joint compilation shrinks the programmability tax from 23.9% to 3.8%; and under a scaling TCP adversary that drives uninstrumented stacks to 154 microseconds, Chamelio bounds victim tail latency at 46 microseconds.
The Git Commands I Run Before Reading Any Code
- git
- coding
2026-04-27T15:48:14+00:00
Five git log commands that diagnose a new codebase before you open a single file: code churn hotspots, bus factor, bug clusters, and crisis patterns.
Magic Number Design
2026-04-27T15:42:07+00:00
Another day, another binary file format with a badly designed magic number
Unbound
- dns
2026-04-27T15:40:46+00:00
Unbound is a validating, recursive, caching DNS resolver. It is designed to be fast and lean and incorporates modern features based on open standards.
Towards Principled, Practical Document Database Design
- database
2026-04-27T15:39:07+00:00
Relational database design is a well-understood process enabled by a combination of database theory (e.g., normal forms) as well as conceptual modeling (e.g., ER-based design). In contrast, database design for NoSQL databases, notably document databases, is often approached in a much more ad hoc manner. It is frequently driven by application details and physical considerations that muddy the design process in ways all too reminiscent of the pre-relational database era. In this paper, we argue for a return to sanity - for a logical, data-first, conceptually grounded approach to document database design. We explain how such an approach can work, yielding a clean, query-friendly document database design. We also highlight a collection of document (JSON) anti-patterns to avoid. The process and the anti-patterns both stem from the authors' experiences in current and past lives when dealing with a wide variety of JSON document data from commercial applications, government applications, and university research applications.
Mitigating Application Resource Overload with Targeted Task Cancellation
2026-04-27T15:37:24+00:00
Real systems often run into overload because one or two unluckily timed requests monopolize an internal logical resource (like buffer pools, locks, and thread-pool queues). These few rogue whales have nonlinear effects. A single ill-timed dump query can thrash the buffer pool and cut throughput in half [..] Atropos proposes a simple fix to this problem. Rather than throttling or dropping victims at the front of the system, it continuously monitors how tasks use internal logical resources and cancels the ones most responsible for the collapse.
Reciprocating Locks
- concurrency
2026-04-27T15:35:31+00:00
We present "Reciprocating Locks", a novel mutual exclusion locking algorithm, targeting cache-coherent shared memory (CC), that enjoys a number of desirable properties. The doorway arrival phase and the release operation both run in constant time. Waiting threads use local spinning and only a single waiting element is required per thread, regardless of the number of locks a thread might hold at a given time. While our lock does not provide strict FIFO admission, it bounds bypass and has strong anti-starvation properties. The lock is compact, space efficient, and has been intentionally designed to be readily usable in real-world general purpose computing environments such as the Linux kernel, pthreads, or C++. We show the lock exhibits high throughput under contention and low latency in the uncontended case. The performance of Reciprocating Locks is competitive with and often better than the best state-of-the-art scalable spin locks.
Linux Kernel vs DPDK: HTTP Performance Showdown
- kernel-bypass
- linux
2026-04-27T15:27:23+00:00
The Linux kernel is designed to be fast, but it is also designed to be multi-purpose, so it isn't perfectly optimized for high-speed networking by default. On the other hand, kernel-bypass technologies like DPDK take a single-minded approach to networking performance. An entire network interface is dedicated to a single application, and aggressive busy polling is used to achieve high throughput and low latency. For this post I wanted to see what the performance gap would look like when a finely tuned kernel/application goes head to head with kernel-bypass in a no holds barred fight.
Modern Hardware for Future Databases
- database
- performance
2026-04-27T15:25:55+00:00
We’re in an exciting era for databases where advancements are coming along each major resource front, each of which has the potential to shape what an optimal database architecture would be. All combined, I’m hopeful that we’ll see some interesting architectural shifts in databases over the next decade, but I’m uncertain if the necessary hardware will be accessible.
When GPUs Fail Quietly: Observability-Aware Early Warning Beyond Numeric Telemetry
- monitoring
- fault-forecasting
2026-04-27T15:03:45+00:00
GPU nodes are central to modern HPC and AI workloads, yet many failures do not manifest as immediate hard faults. While some instabilities emerge gradually as weak thermal or efficiency drift, a significant class occurs abruptly with little or no numeric precursor. In these detachment-class failures, GPUs become unavailable at the driver or interconnect level and the dominant observable signal is structural, including disappearance of device metrics and degradation of monitoring payload integrity. This paper proposes an observability-aware early-warning framework that jointly models (i) utilization-aware thermal drift signatures in GPU telemetry and (ii) monitoring-pipeline degradation indicators such as scrape latency increase, sample loss, time-series gaps, and device-metric disappearance. The framework is evaluated on production telemetry from GPU nodes at GWDG, where GPU, node, monitoring, and scheduler signals can be correlated. Results show that detachment failures exhibit minimal numeric precursor and are primarily observable through structural telemetry collapse.
Postgres's lateral joins allow for quite the good eDSL
- postgresql
- sql
2026-04-27T14:35:18+00:00
This is actually really useful as it provides a way to solve an expressivity problem that I think most ORMs and query builders have: that queries are difficult to compose.
The fastest Linux timestamps
- linux
- performance
- time
2026-04-27T14:30:10+00:00
We can speed up timestamps on x86 Linux by 30% and maintain the same precision as the standard system clock by implementing our own timers without relying on vDSO. Almost nobody should do this.
Behind the scenes: How Database Traffic Control works
- postgresql
- admission
2026-04-27T14:10:10+00:00
Database Traffic Control™, a feature for mitigating and preventing database overload due to unexpectedly expensive SQL queries.
Integrated Gauges: Lessons Learned Monitoring Seastar's IO Stack
2026-04-27T12:58:47+00:00
Relying solely on instantaneous metrics, or gauges, to monitor rapidly fluctuating system parameters very often leads to misleading data that poorly reflects the system’s actual behavior over a period of time. While solutions like histograms offer better statistical insights, they incur notable resource overhead. For metrics where the individual values can be aggregated (like request latencies), converting them into cumulative counters and deriving a rate provides a much more stable and representative average over the scraping interval. This offers a very efficient compromise between informative granularity and resource consumption. For metrics where instantaneous values cannot be simply summed (such as queue lengths), the concept of the Integrated Gauge offers a generalization of the same efficiency. In essence, by treating the gauge not as a point-in-time value but as a continuously accumulating measure of time-at-value, integrated gauges provide a highly reliable and definitive representation of a parameter’s average behavior across any given measurement interval.
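The time-at-value idea above can be sketched in a few lines. This is a minimal illustration, not Seastar's implementation: the class name `IntegratedGauge`, the helper `average`, and the explicit `now` arguments are all hypothetical and chosen so the example is deterministic.

```python
class IntegratedGauge:
    """Accumulates value * elapsed-time so a scraper can derive a true average.

    Hypothetical sketch: names and API are illustrative assumptions.
    """

    def __init__(self, now: float, value: float = 0.0) -> None:
        self._value = value    # current instantaneous value (e.g. queue length)
        self._last = now       # time of the last update
        self._integral = 0.0   # accumulated value * seconds

    def _advance(self, now: float) -> None:
        # Credit the time spent at the current value since the last update.
        self._integral += self._value * (now - self._last)
        self._last = now

    def set(self, now: float, value: float) -> None:
        self._advance(now)
        self._value = value

    def read(self, now: float) -> float:
        """Return the cumulative integral, exported like a counter."""
        self._advance(now)
        return self._integral


def average(prev_integral: float, prev_time: float,
            cur_integral: float, cur_time: float) -> float:
    """Average gauge value between two scrapes, derived like a rate."""
    return (cur_integral - prev_integral) / (cur_time - prev_time)


# A queue that is empty for 9 seconds, holds 100 items for 1 second, then drains.
gauge = IntegratedGauge(now=0.0)
first = gauge.read(0.0)
gauge.set(9.0, 100.0)
gauge.set(10.0, 0.0)
second = gauge.read(10.0)
print(average(first, 0.0, second, 10.0))  # 10.0
```

A plain gauge scraped at t=0 and t=10 would report 0 both times and miss the spike entirely, while the integral yields the 10-second average of 10.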
Effect Systems vs Print Debugging: A Pragmatic Solution
- algebraic-effects
2026-04-27T12:47:54+00:00
But what happens when you lie to the effect system? Nothing good.
Topics in High-Performance Messaging
- messaging
2026-04-25T14:05:42+00:00
The Informatica Ultra Messaging team has worked together in the field of high-performance messaging for many years, and in that time, has seen some messaging systems that worked well and some that didn't. Successful deployment of a messaging system requires background information that is not easily available; most of what we know, we had to learn in the school of hard knocks. To save others a knock or two, we have collected here the essential background information and commentary on some of the issues involved in successful deployments. This information is organized as a series of topics around which there seems to be confusion or uncertainty.
The Serial Safety Net
2026-04-24T20:25:51+00:00
Concurrency control (CC) algorithms must trade off strictness for performance, with serializable schemes generally paying high cost (both in runtime overhead such as contention on lock tables, and in wasted efforts by aborting transactions) to prevent anomalies. We propose the serial safety net (SSN), a serializability-enforcing certifier for modern hardware with substantial core count and large main memory. SSN can be applied with minimal overhead on top of various CC schemes that offer higher performance but admit anomalies, e.g., snapshot isolation and read committed. We demonstrate the efficiency, accuracy and robustness of SSN using a memory-optimized OLTP engine with different CC schemes. We find that SSN is a promising approach to serializability with low abort rates and robust performance for various workloads.
gabagool
- wasm
- process-migration
2026-04-24T20:24:49+00:00
Snapshotable WebAssembly interpreter from scratch. Includes a time travel debugger.
Building a Database on S3
- faas
- database
- storage
2026-04-24T20:24:21+00:00
Hold your horses, though. I'm not unveiling a new S3-native database. This paper is from 2008. Many of its protocols feel clunky today. Yet it nails the core idea that defines modern cloud-native databases: separate storage from compute. The authors propose a shared-disk design over Amazon S3, with stateless clients executing transactions. The paper provides a blueprint for serverless before the term existed.
Kovan: From Production MVCC Systems to Wait-Free Memory Reclamation
2026-04-24T20:23:18+00:00
I came across “Crystalline: Fast and Memory Efficient Wait-Free Reclamation” by Nikolaev & Ravindran (DISC 2021). The paper has formal proofs of wait-freedom and bounded memory, and benchmarks that show Crystalline matching or beating epoch-based reclamation in read-heavy workloads — which is exactly the case that matters most in practice. Turning a paper into something that actually runs on ARM64 under production load is a different problem. Memory ordering, ABA issues under high contention, the gap between what the proof assumes and what hardware actually does. That took a while.
The inner workings of TCP zero-copy
- linux
- memory
- tcp
2026-04-24T20:21:14+00:00
TCP zero-copy is a feature of the Linux kernel that makes it possible to send and receive data without incurring an extra copy between kernel memory and the memory buffer that holds the final data (in userspace, or even in the memory of a different device on the system). Copying data adds overhead, so avoiding it is appealing. The kernel features that enable this are quite new, and figuring out exactly how they work under the hood is not trivial. So in this post I’ll try to summarise what exactly is going on under the hood when these features are used.
Lamport's Arrow of Time: The Category Mistake in Logical Clocks
- time
2026-04-24T20:20:38+00:00
Lamport's 1978 paper introduced the happens-before relation and logical clocks, freeing distributed systems from dependence on synchronized physical clocks. This is widely understood as a move away from Newtonian absolute time. We argue that Lamport's formalism retains a deeper and largely unexamined assumption: that causality induces a globally well-defined directed acyclic graph (DAG) over events -- a forward-in-time-only (FITO) structure that functions as an arrow of time embedded at the semantic level. Following Ryle's analysis of category mistakes, we show that this assumption conflates an epistemic construct (the logical ordering of messages) with an ontic claim (that physical causality is globally acyclic and monotonic). We trace this conflation through Shannon's channel model, TLA+, Bell's theorem, and the impossibility results of Fischer-Lynch-Paterson and Brewer's CAP theorem. We then show that special and general relativity permit only local causal structure, and that recent work on indefinite causal order demonstrates that nature admits correlations with no well-defined causal ordering. We propose that mutual information conservation, rather than temporal precedence, provides a more fundamental primitive for distributed consistency.
You don't need free lists! | Jakub's tech blog
- memory
2026-04-24T20:20:04+00:00
The usual way to implement the slot-reuse mechanism is to store some kind of a list of currently unused slots ... But I have been using a very different approach lately, which is NOT based on any kind of free-list-like system.
An Auction-Based Mechanism for Optimal Task Allocation and Resource Aware Containerization
- scheduling
- auction
2026-04-24T20:18:15+00:00
Distributed computing has enabled cooperation between multiple computing devices for the simultaneous execution of resource-hungry tasks. Such execution also plays a pivotal role in the parallel execution of numerous tasks in the Internet of Things (IoT) environment. Leveraging the computing resources of multiple devices, the offloading and processing of computation-intensive tasks can be carried out more efficiently. However, managing resources and optimizing costs remain challenging for successfully executing tasks in cloud-based containerization for IoT. This paper proposes AUC-RAC, an auction-based mechanism for efficient offloading of computation tasks among multiple local servers in the context of IoT devices. The approach leverages the concept of Docker swarm, which connects multiple local servers in the form of Manager Node (MN) and Worker Nodes (WNs). It uses Docker containerization to execute tasks simultaneously. In this system, IoT devices send tasks to the MN, which then sends the task details to all its WNs to participate in the auction-based bidding process. The auction-based bidding process optimizes the allocation of computation tasks among multiple systems, considering their resource sufficiency. The experimental analysis establishes that the approach offers improved offloading and computation-intensive services for IoT devices by enabling cooperation between local servers.
It's not a lie if you don't get caught: simplifying reconfiguration in SMR through dirty logs
- smr
2026-04-24T20:17:51+00:00
Production state-machine replication (SMR) implementations are complex, multi-layered architectures comprising data dissemination, ordering, execution, and reconfiguration components. Existing research consensus protocols rarely discuss reconfiguration. Those that do tightly couple membership changes to a specific algorithm. This prevents the independent upgrade of individual building blocks and forces expensive downtime when transitioning to new protocol implementations. Instead, modularity is essential for maintainability and system evolution in production deployments. We present Gauss, a reconfiguration engine designed to treat consensus protocols as interchangeable modules. By introducing a distinction between a consensus protocol's inner log and a sanitized outer log exposed to the RSM node, Gauss allows engineers to upgrade membership, failure thresholds, and the consensus protocol itself independently and with minimal global downtime. Our initial evaluation on the Rialo blockchain shows that this separation of concerns enables a seamless evolution of the SMR stack across a sequence of diverse protocol implementations.
Distributing Go binaries like sqlite-scanner through PyPI using go-to-wheel
- go
- python
- deployment
2026-04-24T20:17:30+00:00
It turns out publishing Go binaries to PyPI means any Go binary can be just a uvx package-name call away.
CockroachDB Serverless: Sub-second Scaling from Zero with Multi-region Cluster Virtualization
- database
- scaling
2026-04-24T20:15:44+00:00
Dismissing this as fake disaggregation would miss the point. I came to appreciate this design choice as I read the paper (a well written paper!). This logical disaggregation (the paper calls it cluster virtualization) provides a pragmatic evolution of the shared-nothing model. CRDB pushes the SQL–KV boundary (as in systems like TiDB and FoundationDB) to its logical extreme to provide the basis for a multi-tenant storage layer. From here on, they solve the sub-second cold starts problem and admission control problems with good engineering rather than an architectural overhaul.
Creating virtual block devices with ublk - Jiri Pospisil
- linux
- io_uring
2026-04-24T20:13:59+00:00
Ublk is a new framework, available since Linux v6.0, for creating virtual block devices using io_uring in user space.
HTTP RateLimit headers
- http
- load-balancing
2026-04-24T20:13:32+00:00
When a client has more work to do than will fit in a single window’s quota, linear rate limit algorithms such as GCRA encourage the client to smooth out its requests nicely. In this article I’ll describe how a server can use a linear rate limit algorithm with HTTP RateLimit headers.
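To make the GCRA mention concrete, here is a minimal sketch of the algorithm in its "theoretical arrival time" form. It is an illustration under assumptions, not the article's code: the names `GCRA`, `Decision`, `remaining`, and `retry_after` are hypothetical, and mapping these values onto the actual RateLimit header fields is left to the HTTP specification the article discusses.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    allowed: bool
    remaining: int       # requests that could still be sent right now
    retry_after: float   # seconds until the next request would be allowed


class GCRA:
    """Generic cell rate algorithm: `limit` requests per `window` seconds."""

    def __init__(self, limit: int, window: float) -> None:
        self.interval = window / limit            # emission interval per request
        self.tolerance = window - self.interval   # how far ahead a burst may run
        self.tat = 0.0                            # theoretical arrival time

    def check(self, now: float) -> Decision:
        tat = max(self.tat, now)
        if tat - now > self.tolerance:
            # Too far ahead of the smooth schedule: reject and say when to retry.
            return Decision(False, 0, tat - now - self.tolerance)
        self.tat = tat + self.interval
        slack = self.tolerance - (self.tat - now)
        remaining = int(slack // self.interval) + 1 if slack >= 0 else 0
        return Decision(True, remaining, 0.0)


limiter = GCRA(limit=3, window=3.0)
print([limiter.check(0.0).allowed for _ in range(4)])  # [True, True, True, False]
print(limiter.check(1.0).allowed)                      # True
```

The quota refills linearly, one request per `interval`, rather than all at once at a window boundary. A client that honors `retry_after` therefore ends up spacing its requests evenly, which is the smoothing behavior the excerpt describes.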
Just the Browser
- home-it
2026-04-24T20:12:51+00:00
Just the Browser helps you remove AI features, telemetry data reporting, sponsored content, product integrations, and other annoyances from desktop web browsers. The goal is to give you "just the browser" and nothing else, using hidden settings in web browsers intended for companies and other organizations.
High-Performance DBMSs with io_uring: When and How to use it
- io
- io_uring
- database
2026-04-24T20:12:23+00:00
We study how modern database systems can leverage the Linux io_uring interface for efficient, low-overhead I/O. io_uring is an asynchronous system call batching interface that unifies storage and network operations, addressing limitations of existing Linux I/O interfaces. However, naively replacing traditional I/O interfaces with io_uring does not necessarily yield performance benefits. To demonstrate when io_uring delivers the greatest benefits and how to use it effectively in modern database systems, we evaluate it in two use cases: Integrating io_uring into a storage-bound buffer manager and using it for high-throughput data shuffling in network-bound analytical workloads. We further analyze how advanced io_uring features, such as registered buffers and passthrough I/O, affect end-to-end performance. Our study shows when low-level optimizations translate into tangible system-wide gains and how architectural choices influence these benefits. Building on these insights, we derive practical guidelines for designing I/O-intensive systems using io_uring and validate their effectiveness in a case study of PostgreSQL's recent io_uring integration, where applying our guidelines yields a performance improvement of 14%.
Fun with Algebraic Effects
- algebraic-effects
2026-04-24T20:12:04+00:00
Algebraic effects were originally added to OCaml for general-purpose concurrent execution of programs for OCaml 5, which supports thread-level parallelism. The fact that they can be repurposed for Hardcaml simulations speaks to how well-thought-out and general a language feature this is.
Local Rendezvous Hashing: Bounded Loads and Minimal Churn via Cache-Local Candidates
- consistent-hashing
2026-04-24T20:11:12+00:00
Consistent hashing is fundamental to distributed systems, but ring-based schemes can exhibit high peak-to-average load ratios unless they use many virtual nodes, while multi-probe methods improve balance at the cost of scattered memory accesses. This paper introduces Local Rendezvous Hashing (LRH), which preserves a token ring but restricts Highest Random Weight (HRW) selection to a cache-local window of C distinct neighboring physical nodes. LRH locates a key by one binary search, enumerates exactly C distinct candidates using precomputed next-distinct offsets, and chooses the HRW winner (optionally weighted). Lookup cost is O(log|R| + C). Under fixed-topology liveness changes, fixed-candidate filtering remaps only keys whose original winner is down, yielding zero excess churn. In a benchmark with N=5000, V=256 (|R|=1.28M), K=50M and C=8, LRH reduces Max/Avg load from 1.2785 to 1.0947 and achieves 60.05 Mkeys/s, about 6.8x faster than multi-probe consistent hashing with 8 probes (8.80 Mkeys/s) while approaching its balance (Max/Avg 1.0697). A microbenchmark indicates multi-probe assignment is dominated by repeated ring searches and memory traffic rather than probe-generation arithmetic.
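The lookup path in the abstract (one binary search, C distinct candidates, HRW winner) can be sketched as follows. This is a simplified illustration, not the paper's implementation: it walks the ring linearly instead of using precomputed next-distinct offsets, it omits weighting, and `LocalRendezvousRing`, `h64`, and the SHA-256 hash are assumptions made for the example.

```python
import bisect
import hashlib


def h64(*parts: str) -> int:
    """64-bit hash of the joined parts (SHA-256 is an arbitrary choice here)."""
    digest = hashlib.sha256("|".join(parts).encode()).digest()
    return int.from_bytes(digest[:8], "big")


class LocalRendezvousRing:
    """Token ring whose lookups run HRW over a window of nearby distinct nodes."""

    def __init__(self, nodes, vnodes: int = 64, window: int = 4) -> None:
        distinct = sorted(set(nodes))  # assumed non-empty
        self.window = min(window, len(distinct))
        ring = sorted((h64("token", n, str(i)), n)
                      for n in distinct for i in range(vnodes))
        self.tokens = [t for t, _ in ring]
        self.owners = [n for _, n in ring]

    def candidates(self, key: str):
        """Binary-search the ring, then collect `window` distinct nodes clockwise."""
        i = bisect.bisect_left(self.tokens, h64("key", key)) % len(self.tokens)
        found = []
        while len(found) < self.window:
            owner = self.owners[i]
            if owner not in found:
                found.append(owner)
            i = (i + 1) % len(self.tokens)
        return found

    def lookup(self, key: str, down=frozenset()):
        """HRW winner among live candidates; the candidate set itself stays fixed."""
        live = [n for n in self.candidates(key) if n not in down]
        if not live:
            return None  # every candidate is down; a fallback policy is out of scope
        return max(live, key=lambda n: h64("hrw", n, key))


ring = LocalRendezvousRing([f"node{i}" for i in range(10)], vnodes=32, window=4)
print(ring.candidates("user:42"), ring.lookup("user:42"))
```

Because the candidate window is computed from the fixed topology and only then filtered for liveness, marking a node as down changes the result only for keys whose winner was that node, which is the zero-excess-churn property the abstract claims.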
LeaseGuard: Raft Leases Done Right!
- raft
- consensus
2026-04-24T20:09:50+00:00
We designed a simple lease protocol tailored for Raft, called LeaseGuard. Our main innovation is to rely on Raft-specific guarantees to design a simpler lease protocol that recovers faster from a leader crash.
vm.overcommit_memory=2 is always the right setting for servers
- memory
- linux
2026-04-24T20:09:11+00:00
While overcommit is convenient for application developers, it fundamentally changes the contract of memory allocation: a successful allocation no longer represents an atomic acquisition of a real resource. Instead, the returned mapping serves as a deferred promise, which will only be fulfilled by the page fault handler if and when the memory is first accessed. This is an important distinction, as it means overcommit effectively replaces a fail-fast transactional allocation model with a best-effort one where failures are only caught after the fact rather than at the point of allocation.
Seamless Transitions: A Comprehensive Review of Live Migration Technologies
- process-migration
2026-04-24T20:08:37+00:00
Live migration, a technology enabling seamless transition of operational computational entities between various hosts while preserving continuous functionality and client connectivity, has been the subject of extensive research. However, existing reviews often overlook critical technical aspects and practical challenges integral to the usage of live migration techniques in real-world scenarios. This work bridges this gap by integrating the aspects explored in existing reviews together with a comprehensive analysis of live migration technologies across multiple dimensions, with focus on migration techniques, migration units, and infrastructure characteristics. Despite efforts to make live migration widely accessible, its reliance on multiple system factors can create challenges. In certain cases, the complexities and resource demands outweigh the benefits, making its implementation hard to justify. The focus of this work is mainly on container based and virtual machine-based migration technologies, examining the current state of the art and the disparity in adoption between these two approaches. Furthermore, this work explores the impact of migration objectives and operational constraints on the usability and efficacy of existing technologies. By outlining current technical challenges and providing guidelines for future research and development directions, this work serves a dual purpose: first, to equip enthusiasts with a valuable resource on live migration, and second, to contribute to the advancement of live migration technologies and their practical implementation across diverse computing environments.
randoneering/pgFirstAid
- postgresql
2026-04-24T20:05:58+00:00
Easy to deploy, open source, postgresql function that provides a prioritized list of actions to take to improve stability and performance.
No Cords Attached: Coordination-Free Concurrent Lock-Free Queues
2026-04-24T20:05:25+00:00
The queue is conceptually one of the simplest data structures: a basic FIFO container. However, ensuring correctness in the presence of concurrency makes existing lock-free implementations significantly more complex than their original form. Coordination mechanisms introduced to prevent hazards such as ABA, use-after-free, and unsafe reclamation often dominate the design, overshadowing the queue itself. Many schemes compromise strict FIFO ordering, unbounded capacity, or lock-free progress to mask coordination overheads. Yet the true source of complexity lies in the pursuit of infinite protection against reclamation hazards, which is theoretically sound but impractical and costly. This pursuit not only drives unnecessary complexity but also creates a protection paradox where excessive protection reduces system resilience rather than improving it. While such costs may be tolerable in conventional workloads, the AI era has shifted the paradigm: training and inference pipelines involve hundreds to thousands of concurrent threads per node, and at this scale, protection and coordination overheads dominate, often far heavier than the basic queue operations themselves. This paper introduces Cyclic Memory Protection (CMP), a coordination-free queue that preserves strict FIFO semantics, unbounded capacity, and lock-free progress while restoring simplicity. CMP reclaims the strict FIFO that other approaches sacrificed through bounded protection windows that provide practical reclamation guarantees. We prove strict FIFO and safety via linearizability and bounded reclamation analysis, and show experimentally that CMP outperforms state-of-the-art lock-free queues by up to 1.72-4x under high contention while maintaining scalability to hundreds of threads. Our work demonstrates that highly concurrent queues can return to their fundamental simplicity without weakening queue semantics.
Making Democracy Work: Fixing and Simplifying Egalitarian Paxos (Extended Version)
- paxos
- consensus
2026-04-24T20:03:50+00:00
Classical state-machine replication protocols, such as Paxos, rely on a distinguished leader process to order commands. Unfortunately, this approach makes the leader a single point of failure and increases the latency for clients that are not co-located with it. As a response to these drawbacks, Egalitarian Paxos introduced an alternative, leaderless approach, that allows replicas to order commands collaboratively. Not relying on a single leader allows the protocol to maintain non-zero throughput with up to f crashes of any processes out of a total of n = 2f+1. The protocol furthermore allows any process to execute a command c fast, in 2 message delays, provided no more than e = \lceil\frac{f+1}{2}\rceil other processes fail, and all concurrently submitted commands commute with c; the latter condition is often satisfied in practical systems. Egalitarian Paxos has served as a foundation for many other replication protocols. But unfortunately, the protocol is very complex, ambiguously specified and suffers from nontrivial bugs. In this paper, we present EPaxos* -- a simpler and correct variant of Egalitarian Paxos. Our key technical contribution is a simpler failure-recovery algorithm, which we have rigorously proved correct. Our protocol also generalizes Egalitarian Paxos to cover the whole spectrum of failure thresholds f and e such that n \ge \max\{2e+f-1, 2f+1\} -- the number of processes that we show to be optimal.
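The two threshold formulas in the abstract are easy to misread in raw form, so here is a small worked calculation. It only evaluates the expressions quoted above; the function names `fast_threshold` and `min_processes` are hypothetical.

```python
def fast_threshold(f: int) -> int:
    """e = ceil((f + 1) / 2), written with integer arithmetic."""
    return (f + 2) // 2


def min_processes(f: int, e: int) -> int:
    """Smallest n satisfying n >= max(2e + f - 1, 2f + 1)."""
    return max(2 * e + f - 1, 2 * f + 1)


for f in (1, 2, 3):
    e = fast_threshold(f)
    print(f"f={f} e={e} n={min_processes(f, e)}")
# f=1 e=1 n=3
# f=2 e=2 n=5
# f=3 e=2 n=7

# Asking for a larger fast threshold than the default costs extra processes:
print(min_processes(2, 3))  # 7
```

With the original choice of e, the bound collapses to the familiar n = 2f+1. The generalized bound matters only when e is pushed above that default, as in the last line.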
WireGuard topologies for self-hosting at home
2026-04-24T20:02:19+00:00
I recently migrated my self-hosted services from a VPS (virtual private server) at a remote data center to a physical server at home. This change was motivated by wanting to be in control of the hardware and network where said services run, while trying to keep things as simple as possible. What follows is a walk-through of how I reasoned through different WireGuard topologies for the VPN (virtual private network) in which my devices and services reside.
Just use cURL
- http
- fun
2026-04-24T19:57:47+00:00
What the fuck happened to making HTTP requests? You used to just type curl example.com and boom, you got your goddamn response. Now everyone's downloading 500MB Electron monstrosities that take 3 minutes to boot up just to send a fucking GET request.
OSWALD—Object Storage Write-Ahead Log Device
- smr
- wal
- storage
2026-04-24T19:53:24+00:00
OSWALD is a Write-Ahead Log (WAL) design built exclusively on object storage primitives. It works with any object storage service that provides read-after-write consistency and compare-and-swap operations, including AWS S3, Google Cloud Storage, and Azure Blob Storage. The design supports checkpointing and garbage collection, making it suitable for State Machine Replication (SMR).
Rendezvous Hashing Explained - Randorithms
2026-04-24T19:50:27+00:00
Rendezvous hashing is an algorithm to solve the distributed hash table problem, a common and general pattern in distributed systems.
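As a rough illustration of the idea, here is a minimal Python sketch of rendezvous (highest-random-weight) hashing: each key is scored against every node, and the node with the highest score owns the key. The node names, the `score` and `pick_node` helpers, and the choice of SHA-256 are assumptions made for this example, not details taken from the linked article.

```python
import hashlib


def score(key: str, node: str) -> int:
    """Deterministic pseudo-random weight for a (node, key) pair."""
    digest = hashlib.sha256(f"{node}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def pick_node(key: str, nodes: list[str]) -> str:
    """The node with the highest weight for this key owns the key."""
    return max(nodes, key=lambda node: score(key, node))


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]
    keys = [f"user-{i}" for i in range(1000)]
    before = {k: pick_node(k, nodes) for k in keys}
    after = {k: pick_node(k, nodes + ["node-d"]) for k in keys}
    moved = [k for k in keys if before[k] != after[k]]
    # Only keys that the new node wins are reassigned.
    print(len(moved), all(after[k] == "node-d" for k in moved))
```

Because a key's scores against the existing nodes do not change when a node joins, the only keys that move are those for which the new node scores highest.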
WebAssembly and Unikernels: A Comparative Study for Serverless at the Edge
- wasm
- virtualization
2026-04-24T19:49:41+00:00
Serverless computing at the edge requires lightweight execution environments to minimize cold start latency, especially in Urgent Edge Computing (UEC). This paper compares WebAssembly and unikernel-based MicroVMs for serverless workloads. We present Limes, a WebAssembly runtime built on Wasmtime, and evaluate it against the Firecracker-based environment used in SPARE. Results show that WebAssembly offers lower cold start times for lightweight functions but suffers with complex workloads, while Firecracker provides higher, but stable, cold starts and better execution performance, particularly for I/O-heavy tasks.
Towards Deterministic Sub-0.5 us Response on Linux through Interrupt Isolation
- linux
- latency
- scheduling
2026-04-24T19:49:00+00:00
Real-time responsiveness in Linux is often constrained by interrupt contention and timer handling overhead, making it challenging to achieve sub-microsecond latency. This work introduces an interrupt isolation approach that centralizes and minimizes timer interrupt interference across CPU cores. By enabling a dedicated API to selectively invoke timer handling routines and suppress non-critical inter-processor interrupts, our design significantly reduces jitter and response latency. Experiments conducted on an ARM-based multicore platform demonstrate that the proposed mechanism consistently achieves sub-0.5 us response times, outperforming conventional Linux PREEMPT-RT configurations. These results highlight the potential of interrupt isolation as a lightweight and effective strategy for deterministic real-time workloads in general-purpose operating systems.
Batched Critical Sections
2026-04-24T19:47:40+00:00
Critical sections don’t have to be scheduler-bottlenecked
Bringing restartable sequences out of the niche
- concurrency
- non-blocking
2026-04-24T19:45:50+00:00
Any user-space lockless algorithm must work correctly in an environment where code execution can be interrupted at any time. Restartable sequences are one solution to this problem. To use this feature, an application must designate a critical section that does some work, culminating in a single atomic instruction that commits whatever change is being made.
Don't pick weird subnets for embedded networks, use VRFs
2026-04-24T19:44:27+00:00
So instead of having more exotic addressing inside your embedded network there's also a solution that contains all the weirdness to only the router. It is possible to configure a router so that both networks it connects to have the same subnet by having separate routing tables for the interfaces. This means your internal network can be 10.0.0.0/24 and the venue network can be 10.0.0.0/24 and it all just works.
Notes on io-uring
- io_uring
- rust
- linux
2026-04-24T19:43:20+00:00
So I think this is the solution we should all adopt and move forward with: io-uring controls the buffers, the fastest interfaces on io-uring are the buffered interfaces, the unbuffered interfaces make an extra copy. We can stop being mired in trying to force the language to do something impossible. But there are still many many interesting questions ahead.
eBPF Networking Techniques - Packet Redirection
- ebpf
- linux
- networking
2026-04-24T19:37:59+00:00
eBPF packet redirection is a common technique, especially in container orchestration software like Kubernetes. Cilium, which I work on as my day job, uses this all the time to move packets between containers.
Index 1,600,000,000 Keys with Automata and Rust
- fsm
- rust
2026-04-24T19:34:18+00:00
It turns out that finite state machines are useful for things other than expressing computation. Finite state machines can also be used to compactly represent ordered sets or maps of strings that can be searched very quickly. In this article, I will teach you about finite state machines as a data structure for representing ordered sets and maps.
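To make the idea concrete, here is a small Python sketch of an ordered string set stored as a deterministic acyclic automaton. It is deliberately simplified: it is a plain trie that shares only prefixes, whereas a minimized automaton would also share suffixes. The `Automaton` class and its methods are hypothetical names for this illustration and are not the API of any library.

```python
class Automaton:
    """Ordered set of strings stored as states and labeled transitions."""

    def __init__(self, keys):
        self.transitions = [{}]  # state id -> {character: next state id}
        self.final = set()       # states at which a stored key ends
        for key in keys:
            self._insert(key)

    def _insert(self, key):
        state = 0
        for ch in key:
            nxt = self.transitions[state].get(ch)
            if nxt is None:
                # No transition yet: allocate a fresh state.
                nxt = len(self.transitions)
                self.transitions.append({})
                self.transitions[state][ch] = nxt
            state = nxt
        self.final.add(state)

    def __contains__(self, key):
        state = 0
        for ch in key:
            state = self.transitions[state].get(ch)
            if state is None:
                return False
        return state in self.final

    def keys(self, state=0, prefix=""):
        """Yield stored keys in lexicographic order."""
        if state in self.final:
            yield prefix
        for ch in sorted(self.transitions[state]):
            yield from self.keys(self.transitions[state][ch], prefix + ch)


if __name__ == "__main__":
    months = Automaton(["jul", "jun", "jan", "mar", "may"])
    print("jun" in months, "ju" in months)
    print(list(months.keys()))
```

A lookup follows one transition per character, so its cost depends on the length of the key rather than on how many keys are stored.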
Bypass PostgreSQL catalog overhead with direct partition hash calculations
- postgresql
- scalability
2026-04-24T19:29:59+00:00
PostgreSQL’s hash partitioning distributes rows across partitions using deterministic hash functions. When you query through the parent table, PostgreSQL must perform catalog lookups to route each query to the correct partition. This results in measurable overhead for high-throughput applications, especially if you decide to use multi-level partitioning schemes where PostgreSQL must traverse deeper catalog structures to identify the target partition. Let’s take a look at some findings on speeding up the part where you already know the partition key values.
It's time for modern CSS to kill the SPA
- web
- browser
- css
2026-04-24T19:27:03+00:00
The assumption was simple: Seamless navigation requires us to build an app. That assumption is now obsolete.
Design and Reliability of a User Space Write-Ahead Log in Rust
- rust
- wal
- storage
2026-04-24T19:23:38+00:00
Write-ahead logs (WALs) are a fundamental fault-tolerance technique found in many areas of computer science. WALs must be reliable while maintaining high performance, because all operations will be written to the WAL to ensure their stability. Without reliability a WAL is useless, because its utility is tied to its ability to recover data after a failure. In this paper we describe our experience creating a prototype user space WAL in Rust. We observed that Rust is easy to use, compact and has a very rich set of libraries. More importantly, we have found that the overhead is minimal, with the WAL prototype operating at basically the expected performance of the stable memory device.
LazyLog: A New Shared Log Abstraction for Low-Latency Applications
- data
- consistency
2026-04-24T19:22:37+00:00
Shared logs offer linearizable total order across storage shards. However, they enforce this order eagerly upon ingestion, leading to high latencies. We observe that in many modern shared-log applications, while linearizable ordering is necessary, it is not required eagerly when ingesting data but only later when data is consumed. Further, readers are naturally decoupled in time from writers in these applications. Based on this insight, we propose LazyLog, a novel shared log abstraction. LazyLog lazily binds records (across shards) to linearizable global positions and enforces this before a log position can be read. Such lazy ordering enables low ingestion latencies. Given the time decoupling, LazyLog can establish the order well before reads arrive, minimizing overhead upon reads. We build two LazyLog systems that provide linearizable total order across shards. Our experiments show that LazyLog systems deliver significantly lower latencies than conventional, eager-ordering shared logs.
Systems Thinking Explained: From Events to Systemic Structures
- systems
2026-04-24T17:29:50+00:00
Understanding those structures requires a different kind of thinking, and that’s exactly what systems thinking is: the ability to shift from reacting to events through responsive patterns of behaviors to generating improved systemic structures.
Exploring Micro Frontends: A Case Study Application in E-Commerce
- micro-frontend
- web
2026-04-24T13:53:44+00:00
In the micro frontends architectural style, the frontend is divided into smaller components, which can range from a simple button to an entire page. The goal is to improve scalability, resilience, and team independence, albeit at the cost of increased complexity and infrastructure demands. This paper seeks to understand when it is worth adopting micro frontends, particularly in the context of industry. To achieve this, we conducted an investigation into the state of the art of micro frontends, based on both academic and gray literature. We then implemented this architectural style in a marketplace for handcrafted products, which already used microservices. Finally, we evaluated the implementation through a semi-open questionnaire with the developers. At the studied marketplace company, the need for architectural change arose due to the tight coupling between their main system (a Java monolith) and a dedicated frontend system. Additionally, there were deprecated technologies and poor developer experience. To address these issues, the micro frontends architecture was adopted, along with the API Gateway and Backend for Frontend patterns, and technologies such as Svelte and Fastify. Although the adoption of Micro Frontends was successful, it was not strictly necessary to meet the company's needs. According to the analysis of the mixed questionnaire responses, other alternatives, such as a monolithic frontend, could have achieved comparable results. What made adopting micro frontends the most convenient choice in the company's context was the monolith strangulation and microservices adoption, which facilitated implementation through infrastructure reuse and knowledge sharing between teams.
Implementation and Evaluation of Fast Raft for Hierarchical Consensus
- raft
- consensus
2026-04-24T13:53:16+00:00
We present the first open-source implementation and evaluation of Fast Raft, a hierarchical consensus protocol designed for dynamic, distributed environments. Fast Raft reduces the number of message rounds needed to commit log entries compared to standard Raft by introducing a fast-track mechanism and reducing leader dependence. Our implementation uses gRPC and Kubernetes-based deployment across AWS availability zones. Experimental results demonstrate a throughput improvement and reduced commit latency under low packet loss conditions, while maintaining Raft's safety and liveness guarantees.
How to Grow an LSM-tree? Towards Bridging the Gap Between Theory and Practice
- lsm
- storage
- database
2026-04-24T13:52:56+00:00
LSM-tree based key-value stores are widely adopted as the data storage backend in modern big data applications. The LSM-tree grows with data ingestion, by either adding levels with fixed level capacities (dubbed as vertical scheme) or increasing level capacities with fixed number of levels (dubbed as horizontal scheme). The vertical scheme leads the trend in recent system designs in RocksDB, LevelDB, and WiredTiger, whereas the horizontal scheme shows a decline in being adopted in the industry. The growth scheme profoundly impacts the LSM system performance in various aspects such as read, write and space costs. This paper attempts to give a new insight into a fundamental design question -- how to grow an LSM-tree to attain more desirable performance? Our analysis highlights the limitations of the vertical scheme in achieving an optimal read-write trade-off and the horizontal scheme in managing space cost effectively. Building on the analysis, we present a novel approach, Vertiorizon, which combines the strengths of both the vertical and horizontal schemes to achieve a superior balance between lookup, update, and space costs. Its adaptive design makes it highly compatible with a wide spectrum of workloads. Compared to the vertical scheme, Vertiorizon significantly improves the read-write performance trade-off. In contrast to the horizontal scheme, Vertiorizon greatly extends the trade-off range by a non-trivial generalization of Bentley and Saxe's theory, while substantially reducing space costs. When integrated with RocksDB, Vertiorizon demonstrates better write performance than the vertical scheme, while incurring about six times less additional space cost compared to the horizontal scheme.
Mastering Postgres Replication Slots: Preventing WAL Bloat and Other Production Issues
2026-04-23T21:51:26+00:00
If you are not careful, a replication slot may cause the database to retain an unduly large number of WAL segments. This post describes best practices that help prevent this and other issues, discussing aspects like heartbeats, replication slot failover, monitoring, the management of Postgres publications, and more. While this is primarily based on my experience of using replication slots via Debezium’s Postgres connector, the principles are generally applicable and are also worth considering when using other CDC tools for Postgres based on logical replication.
Burn rate is a better error rate
- metrics
- slo
2026-04-23T18:10:28+00:00
Burn rate is how fast, relative to the SLO, the service consumes the error budget. Put simply, burn rate is the ratio of your error rate to your error budget (the percentage). Though burn rate is nearly identical to error rate, our stance is that burn rate is a better error rate. Looking at an error rate in isolation, you might ask: is this error rate too high? Burn rate answers this question by taking into account the expected error rate—i.e., the error budget—which was decided on by the service owners in accordance with their service reliability goals. The ratio of error rate to error budget is useful because if it is at or below one, the error rate is acceptable. If it’s higher than one, the error rate is higher than the service owner would expect.
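The ratio described above is simple enough to show as a short Python sketch. The function name `burn_rate` and the example figures (a 99.9% SLO, so a 0.1% error budget) are assumptions chosen for illustration, not values from the linked post.

```python
def burn_rate(error_rate: float, error_budget: float) -> float:
    """Ratio of the observed error rate to the error budget.

    Both arguments are fractions of requests, e.g. 0.001 for 0.1%.
    """
    if error_budget <= 0:
        raise ValueError("error budget must be positive")
    return error_rate / error_budget


if __name__ == "__main__":
    budget = 0.001  # assumed 99.9% SLO, so 0.1% of requests may fail
    for observed in (0.0005, 0.001, 0.002):
        rate = burn_rate(observed, budget)
        verdict = "acceptable" if rate <= 1 else "too high"
        print(f"error rate {observed:.2%}: burn rate {rate:.1f} ({verdict})")
```

Reading the result follows the rule in the text: a value at or below one means the error rate is within what the service owners budgeted for, and a value above one means the budget is being consumed faster than expected.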
magic namerefs
- bash
2026-04-23T18:08:24+00:00
Namerefs (introduced in bash 4.3) act as aliases for other variables.