Concurrency control (CC) algorithms must trade off strictness for performance, with serializable schemes generally paying a high cost---both in runtime overhead such as contention on lock tables, and in wasted effort from aborted transactions---to prevent anomalies. We propose the serial safety net (SSN), a serializability-enforcing certifier for modern hardware with substantial core count and large main memory. SSN can be applied with minimal overhead on top of various CC schemes that offer higher performance but admit anomalies, e.g., snapshot isolation and read committed. We demonstrate the efficiency, accuracy and robustness of SSN using a memory-optimized OLTP engine with different CC schemes. We find that SSN is a promising approach to serializability with low abort rates and robust performance for various workloads.
Snapshotable WebAssembly interpreter from scratch. Includes a time travel debugger
Hold your horses, though. I'm not unveiling a new S3-native database. This paper is from 2008. Many of its protocols feel clunky today. Yet it nails the core idea that defines modern cloud-native databases: separate storage from compute. The authors propose a shared-disk design over Amazon S3, with stateless clients executing transactions. The paper provides a blueprint for serverless before the term existed.
I came across “Crystalline: Fast and Memory Efficient Wait-Free Reclamation” by Nikolaev & Ravindran (DISC 2021). The paper has formal proofs of wait-freedom and bounded memory, and benchmarks that show Crystalline matching or beating epoch-based reclamation in read-heavy workloads — which is exactly the case that matters most in practice. Turning a paper into something that actually runs on ARM64 under production load is a different problem. Memory ordering, ABA issues under high contention, the gap between what the proof assumes and what hardware actually does. That took a while.
TCP zero-copy is a feature of the Linux kernel that makes it possible to send and receive data without incurring an extra copy between kernel memory and the memory buffer that holds the final data (in userspace, or even in the memory of a different device on the system). Copying data adds overhead, so avoiding it is appealing. The kernel features that enable this are quite new, and figuring out exactly how they work under the hood is not trivial. So in this post I’ll try to summarise what exactly is going on under the hood when these features are used.
Lamport's 1978 paper introduced the happens-before relation and logical clocks, freeing distributed systems from dependence on synchronized physical clocks. This is widely understood as a move away from Newtonian absolute time. We argue that Lamport's formalism retains a deeper and largely unexamined assumption: that causality induces a globally well-defined directed acyclic graph (DAG) over events -- a forward-in-time-only (FITO) structure that functions as an arrow of time embedded at the semantic level. Following Ryle's analysis of category mistakes, we show that this assumption conflates an epistemic construct (the logical ordering of messages) with an ontic claim (that physical causality is globally acyclic and monotonic). We trace this conflation through Shannon's channel model, TLA+, Bell's theorem, and the impossibility results of Fischer-Lynch-Paterson and Brewer's CAP theorem. We then show that special and general relativity permit only local causal structure, and that recent work on indefinite causal order demonstrates that nature admits correlations with no well-defined causal ordering. We propose that mutual information conservation, rather than temporal precedence, provides a more fundamental primitive for distributed consistency.
The usual way to implement the slot-reuse mechanism is to store some kind of a list of currently unused slots ... But I have been using a very different approach lately, which is NOT based on any kind of free-list-like system.
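For reference, the conventional free-list approach mentioned above can be sketched as follows. This is a minimal illustration, not the author's alternative; the `SlotPool` class and its method names are hypothetical.

```python
class SlotPool:
    """Slot array that reuses freed indices through a free list (hypothetical names)."""

    def __init__(self):
        self.slots = []  # slot storage, indexed by slot number
        self.free = []   # stack of slot numbers that are currently unused

    def insert(self, value):
        if self.free:
            index = self.free.pop()   # reuse the most recently freed slot
            self.slots[index] = value
        else:
            index = len(self.slots)   # no free slot, so grow the array
            self.slots.append(value)
        return index

    def remove(self, index):
        value = self.slots[index]
        self.slots[index] = None      # clear the slot
        self.free.append(index)       # remember it for reuse
        return value


pool = SlotPool()
a = pool.insert("a")
b = pool.insert("b")
pool.remove(a)
c = pool.insert("c")  # lands in the slot that "a" vacated
```

The free list costs one extra container and a branch on every insert, which is the bookkeeping a free-list-free design would aim to avoid.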
Distributed computing has enabled cooperation between multiple computing devices for the simultaneous execution of resource-hungry tasks. Such execution also plays a pivotal role in the parallel execution of numerous tasks in the Internet of Things (IoT) environment. Leveraging the computing resources of multiple devices, the offloading and processing of computation-intensive tasks can be carried out more efficiently. However, managing resources and optimizing costs remain challenging for successfully executing tasks in cloud-based containerization for IoT. This paper proposes AUC-RAC, an auction-based mechanism for efficient offloading of computation tasks among multiple local servers in the context of IoT devices. The approach leverages the concept of Docker swarm, which connects multiple local servers in the form of Manager Node (MN) and Worker Nodes (WNs). It uses Docker containerization to execute tasks simultaneously. In this system, IoT devices send tasks to the MN, which then sends the task details to all its WNs to participate in the auction-based bidding process. The auction-based bidding process optimizes the allocation of computation tasks among multiple systems, considering their resource sufficiency. The experimental analysis establishes that the approach offers improved offloading and computation-intensive services for IoT devices by enabling cooperation between local servers.
Production state-machine replication (SMR) implementations are complex, multi-layered architectures comprising data dissemination, ordering, execution, and reconfiguration components. Existing research consensus protocols rarely discuss reconfiguration. Those that do tightly couple membership changes to a specific algorithm. This prevents the independent upgrade of individual building blocks and forces expensive downtime when transitioning to new protocol implementations. Instead, modularity is essential for maintainability and system evolution in production deployments. We present Gauss, a reconfiguration engine designed to treat consensus protocols as interchangeable modules. By introducing a distinction between a consensus protocol's inner log and a sanitized outer log exposed to the RSM node, Gauss allows engineers to upgrade membership, failure thresholds, and the consensus protocol itself independently and with minimal global downtime. Our initial evaluation on the Rialo blockchain shows that this separation of concerns enables a seamless evolution of the SMR stack across a sequence of diverse protocol implementations.
It turns out publishing Go binaries to PyPI means any Go binary can be just a uvx package-name call away.
Dismissing this as fake disaggregation would miss the point. I came to appreciate this design choice as I read the paper (a well written paper!). This logical disaggregation (the paper calls it cluster virtualization) provides a pragmatic evolution of the shared-nothing model. CRDB pushes the SQL–KV boundary (as in systems like TiDB and FoundationDB) to its logical extreme to provide the basis for a multi-tenant storage layer. From here on, they solve the sub-second cold start and admission control problems with good engineering rather than an architectural overhaul.
Ublk is a new framework, available since Linux v6.0, for creating virtual block devices using io_uring in user space
When a client has more work to do than will fit in a single window’s quota, linear rate limit algorithms such as GCRA encourage the client to smooth out its requests nicely. In this article I’ll describe how a server can use a linear rate limit algorithm with HTTP RateLimit headers.
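To make the linear behaviour concrete, here is a minimal GCRA sketch. It assumes a quota of `limit` requests per `period` seconds with the full quota allowed as an initial burst; the `Gcra` class, its fields, and the returned tuple are hypothetical, and the mapping of these values onto HTTP RateLimit header fields is not shown.

```python
class Gcra:
    """Generic cell rate algorithm with one stored value: the theoretical arrival time."""

    def __init__(self, limit, period):
        self.interval = period / limit  # ideal spacing between requests
        self.tolerance = period         # how far ahead of schedule a client may run
        self.tat = 0.0                  # theoretical arrival time of the next request

    def check(self, now):
        """Return (allowed, remaining, retry_after) for a request arriving at `now`."""
        new_tat = max(self.tat, now) + self.interval
        allow_at = new_tat - self.tolerance
        if now < allow_at:
            # Too far ahead of schedule: reject and say how long to wait.
            return False, 0, allow_at - now
        self.tat = new_tat
        remaining = int((now - allow_at) // self.interval)
        return True, remaining, 0.0


limiter = Gcra(limit=5, period=10.0)
burst = [limiter.check(0.0) for _ in range(6)]  # five allowed, the sixth rejected
later = limiter.check(2.0)                      # one interval later, one more is allowed
```

After the initial burst is spent, the client is admitted once per interval rather than having to wait for a whole window to reset, which is the smoothing effect described above.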
Just the Browser helps you remove AI features, telemetry data reporting, sponsored content, product integrations, and other annoyances from desktop web browsers. The goal is to give you "just the browser" and nothing else, using hidden settings in web browsers intended for companies and other organizations.
We study how modern database systems can leverage the Linux io_uring interface for efficient, low-overhead I/O. io_uring is an asynchronous system call batching interface that unifies storage and network operations, addressing limitations of existing Linux I/O interfaces. However, naively replacing traditional I/O interfaces with io_uring does not necessarily yield performance benefits. To demonstrate when io_uring delivers the greatest benefits and how to use it effectively in modern database systems, we evaluate it in two use cases: Integrating io_uring into a storage-bound buffer manager and using it for high-throughput data shuffling in network-bound analytical workloads. We further analyze how advanced io_uring features, such as registered buffers and passthrough I/O, affect end-to-end performance. Our study shows when low-level optimizations translate into tangible system-wide gains and how architectural choices influence these benefits. Building on these insights, we derive practical guidelines for designing I/O-intensive systems using io_uring and validate their effectiveness in a case study of PostgreSQL's recent io_uring integration, where applying our guidelines yields a performance improvement of 14%.
Algebraic effects were originally added to OCaml for general-purpose concurrent execution of programs for OCaml 5, which supports thread-level parallelism. The fact that they can be repurposed for Hardcaml simulations speaks to how well-thought-out and general a language feature this is.
Consistent hashing is fundamental to distributed systems, but ring-based schemes can exhibit high peak-to-average load ratios unless they use many virtual nodes, while multi-probe methods improve balance at the cost of scattered memory accesses. This paper introduces Local Rendezvous Hashing (LRH), which preserves a token ring but restricts Highest Random Weight (HRW) selection to a cache-local window of C distinct neighboring physical nodes. LRH locates a key by one binary search, enumerates exactly C distinct candidates using precomputed next-distinct offsets, and chooses the HRW winner (optionally weighted). Lookup cost is O(log|R| + C). Under fixed-topology liveness changes, fixed-candidate filtering remaps only keys whose original winner is down, yielding zero excess churn. In a benchmark with N=5000, V=256 (|R|=1.28M), K=50M and C=8, LRH reduces Max/Avg load from 1.2785 to 1.0947 and achieves 60.05 Mkeys/s, about 6.8x faster than multi-probe consistent hashing with 8 probes (8.80 Mkeys/s) while approaching its balance (Max/Avg 1.0697). A microbenchmark indicates multi-probe assignment is dominated by repeated ring searches and memory traffic rather than probe-generation arithmetic.
We designed a simple lease protocol tailored for Raft, called LeaseGuard. Our main innovation is to rely on Raft-specific guarantees to design a simpler lease protocol that recovers faster from a leader crash.
While overcommit is convenient for application developers, it fundamentally changes the contract of memory allocation: a successful allocation no longer represents an atomic acquisition of a real resource. Instead, the returned mapping serves as a deferred promise, which will only be fulfilled by the page fault handler if and when the memory is first accessed. This is an important distinction, as it means overcommit effectively replaces a fail-fast transactional allocation model with a best-effort one where failures are only caught after the fact rather than at the point of allocation.
Live migration, a technology enabling seamless transition of operational computational entities between various hosts while preserving continuous functionality and client connectivity, has been the subject of extensive research. However, existing reviews often overlook critical technical aspects and practical challenges integral to the usage of live migration techniques in real-world scenarios. This work bridges this gap by integrating the aspects explored in existing reviews together with a comprehensive analysis of live migration technologies across multiple dimensions, with focus on migration techniques, migration units, and infrastructure characteristics. Despite efforts to make live migration widely accessible, its reliance on multiple system factors can create challenges. In certain cases, the complexities and resource demands outweigh the benefits, making its implementation hard to justify. The focus of this work is mainly on container based and virtual machine-based migration technologies, examining the current state of the art and the disparity in adoption between these two approaches. Furthermore, this work explores the impact of migration objectives and operational constraints on the usability and efficacy of existing technologies. By outlining current technical challenges and providing guidelines for future research and development directions, this work serves a dual purpose: first, to equip enthusiasts with a valuable resource on live migration, and second, to contribute to the advancement of live migration technologies and their practical implementation across diverse computing environments.
Easy to deploy, open source PostgreSQL function that provides a prioritized list of actions to take to improve stability and performance.
The queue is conceptually one of the simplest data structures: a basic FIFO container. However, ensuring correctness in the presence of concurrency makes existing lock-free implementations significantly more complex than their original form. Coordination mechanisms introduced to prevent hazards such as ABA, use-after-free, and unsafe reclamation often dominate the design, overshadowing the queue itself. Many schemes compromise strict FIFO ordering, unbounded capacity, or lock-free progress to mask coordination overheads. Yet the true source of complexity lies in the pursuit of infinite protection against reclamation hazards--theoretically sound but impractical and costly. This pursuit not only drives unnecessary complexity but also creates a protection paradox where excessive protection reduces system resilience rather than improving it. While such costs may be tolerable in conventional workloads, the AI era has shifted the paradigm: training and inference pipelines involve hundreds to thousands of concurrent threads per node, and at this scale, protection and coordination overheads dominate, often far heavier than the basic queue operations themselves. This paper introduces Cyclic Memory Protection (CMP), a coordination-free queue that preserves strict FIFO semantics, unbounded capacity, and lock-free progress while restoring simplicity. CMP reclaims the strict FIFO that other approaches sacrificed through bounded protection windows that provide practical reclamation guarantees. We prove strict FIFO and safety via linearizability and bounded reclamation analysis, and show experimentally that CMP outperforms state-of-the-art lock-free queues by up to 1.72-4x under high contention while maintaining scalability to hundreds of threads. Our work demonstrates that highly concurrent queues can return to their fundamental simplicity without weakening queue semantics.
Classical state-machine replication protocols, such as Paxos, rely on a distinguished leader process to order commands. Unfortunately, this approach makes the leader a single point of failure and increases the latency for clients that are not co-located with it. As a response to these drawbacks, Egalitarian Paxos introduced an alternative, leaderless approach, that allows replicas to order commands collaboratively. Not relying on a single leader allows the protocol to maintain non-zero throughput with up to f crashes of any processes out of a total of n = 2f+1. The protocol furthermore allows any process to execute a command c fast, in 2 message delays, provided no more than e = \lceil\frac{f+1}{2}\rceil other processes fail, and all concurrently submitted commands commute with c; the latter condition is often satisfied in practical systems. Egalitarian Paxos has served as a foundation for many other replication protocols. But unfortunately, the protocol is very complex, ambiguously specified and suffers from nontrivial bugs. In this paper, we present EPaxos* -- a simpler and correct variant of Egalitarian Paxos. Our key technical contribution is a simpler failure-recovery algorithm, which we have rigorously proved correct. Our protocol also generalizes Egalitarian Paxos to cover the whole spectrum of failure thresholds f and e such that n \ge \max\{2e+f-1, 2f+1\} -- the number of processes that we show to be optimal.
I recently migrated my self-hosted services from a VPS (virtual private server) at a remote data center to a physical server at home. This change was motivated by wanting to be in control of the hardware and network where said services run, while trying to keep things as simple as possible. What follows is a walk-through of how I reasoned through different WireGuard topologies for the VPN (virtual private network) in which my devices and services reside.
What the fuck happened to making HTTP requests? You used to just type curl example.com and boom, you got your goddamn response. Now everyone's downloading 500MB Electron monstrosities that take 3 minutes to boot up just to send a fucking GET request.
OSWALD is a Write-Ahead Log (WAL) design built exclusively on object storage primitives. It works with any object storage service that provides read-after-write consistency and compare-and-swap operations, including AWS S3, Google Cloud Storage, and Azure Blob Storage. The design supports checkpointing and garbage collection, making it suitable for State Machine Replication (SMR).
Rendezvous hashing is an algorithm to solve the distributed hash table problem - a common and general pattern in distributed systems
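A minimal sketch of the idea: every node gets a score for a given key, and the key is assigned to the highest-scoring node. The function names, the use of SHA-256, and the `node:key` encoding are illustrative assumptions, not part of any particular implementation.

```python
import hashlib


def score(node, key):
    """Pseudo-random weight for a (node, key) pair, derived from a hash."""
    digest = hashlib.sha256(f"{node}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def pick_node(nodes, key):
    """Highest-random-weight selection: the node with the top score owns the key."""
    return max(nodes, key=lambda node: score(node, key))


nodes = ["a", "b", "c", "d"]
keys = [f"key-{i}" for i in range(200)]
before = {k: pick_node(nodes, k) for k in keys}

# Remove node "c": only keys that "c" owned should move.
survivors = [n for n in nodes if n != "c"]
after = {k: pick_node(survivors, k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Because each key's winner depends only on per-node scores, removing a node leaves every other key's assignment untouched. The cost is a scan over all nodes per lookup.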
Serverless computing at the edge requires lightweight execution environments to minimize cold start latency, especially in Urgent Edge Computing (UEC). This paper compares WebAssembly and unikernel-based MicroVMs for serverless workloads. We present Limes, a WebAssembly runtime built on Wasmtime, and evaluate it against the Firecracker-based environment used in SPARE. Results show that WebAssembly offers lower cold start times for lightweight functions but suffers with complex workloads, while Firecracker provides higher, but stable, cold starts and better execution performance, particularly for I/O-heavy tasks.
Real-time responsiveness in Linux is often constrained by interrupt contention and timer handling overhead, making it challenging to achieve sub-microsecond latency. This work introduces an interrupt isolation approach that centralizes and minimizes timer interrupt interference across CPU cores. By enabling a dedicated API to selectively invoke timer handling routines and suppress non-critical inter-processor interrupts, our design significantly reduces jitter and response latency. Experiments conducted on an ARM-based multicore platform demonstrate that the proposed mechanism consistently achieves sub-0.5 us response times, outperforming conventional Linux PREEMPT-RT configurations. These results highlight the potential of interrupt isolation as a lightweight and effective strategy for deterministic real-time workloads in general-purpose operating systems.
Critical sections don’t have to be scheduler-bottlenecked
Any user-space lockless algorithm must work correctly in an environment where code execution can be interrupted at any time. Restartable sequences are one solution to this problem. To use this feature, an application must designate a critical section that does some work, culminating in a single atomic instruction that commits whatever change is being made
So instead of having more exotic addressing inside your embedded network there's also a solution that contains all the weirdness to only the router. It is possible to configure a router so that both networks it connects to have the same subnet by having separate routing tables for the interfaces. This means your internal network can be 10.0.0.0/24 and the venue network can be 10.0.0.0/24 and it all just works.
So I think this is the solution we should all adopt and move forward with: io-uring controls the buffers, the fastest interfaces on io-uring are the buffered interfaces, the unbuffered interfaces make an extra copy. We can stop being mired in trying to force the language to do something impossible. But there are still many many interesting questions ahead.
eBPF packet redirection is a common technique, especially in container orchestration software like Kubernetes. Cilium, which I work on as my day job, uses this all the time to move packets between containers.
It turns out that finite state machines are useful for things other than expressing computation. Finite state machines can also be used to compactly represent ordered sets or maps of strings that can be searched very quickly. In this article, I will teach you about finite state machines as a data structure for representing ordered sets and maps
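As a rough illustration, a trie is the simplest acyclic state machine that represents an ordered set of strings: each state has labelled transitions, and some states are marked final. The sketch below shares prefixes only and does no minimization, so it is an assumption-laden starting point rather than the structure the article builds; the `TrieSet` class is hypothetical.

```python
class TrieSet:
    """Ordered string set stored as a deterministic acyclic state machine (a trie)."""

    def __init__(self):
        self.root = {"final": False, "next": {}}

    def add(self, word):
        state = self.root
        for ch in word:
            # Follow the transition for ch, creating the target state if needed.
            state = state["next"].setdefault(ch, {"final": False, "next": {}})
        state["final"] = True

    def __contains__(self, word):
        state = self.root
        for ch in word:
            state = state["next"].get(ch)
            if state is None:
                return False  # no transition: the machine rejects the word
        return state["final"]

    def __iter__(self):
        # Depth-first walk taking transitions in sorted order yields keys in lexicographic order.
        stack = [("", self.root)]
        while stack:
            prefix, state = stack.pop()
            if state["final"]:
                yield prefix
            for ch in sorted(state["next"], reverse=True):
                stack.append((prefix + ch, state["next"][ch]))


months = TrieSet()
for name in ["oct", "jul", "jun", "mar", "may", "ju"]:
    months.add(name)
```

Membership tests take time proportional to the length of the key, independent of how many keys are stored.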
PostgreSQL’s hash partitioning distributes rows across partitions using deterministic hash functions. When you query through the parent table, PostgreSQL must perform catalog lookups to route each query to the correct partition. This results in measurable overhead for high-throughput applications, especially if you decide to use multi-level partitioning schemes where PostgreSQL must traverse deeper catalog structures to identify the target partition. Let’s take a look at some findings on speeding up the part where you already know the partition key values.
The assumption was simple: Seamless navigation requires us to build an app. That assumption is now obsolete.
Write-ahead logs (WALs) are a fundamental fault-tolerance technique found in many areas of computer science. WALs must be reliable while maintaining high performance, because all operations will be written to the WAL to ensure their stability. Without reliability a WAL is useless, because its utility is tied to its ability to recover data after a failure. In this paper we describe our experience creating a prototype user space WAL in Rust. We observed that Rust is easy to use, compact and has a very rich set of libraries. More importantly, we have found that the overhead is minimal, with the WAL prototype operating at basically the expected performance of the stable memory device.
Shared logs offer linearizable total order across storage shards. However, they enforce this order eagerly upon ingestion, leading to high latencies. We observe that in many modern shared-log applications, while linearizable ordering is necessary, it is not required eagerly when ingesting data but only later when data is consumed. Further, readers are naturally decoupled in time from writers in these applications. Based on this insight, we propose LazyLog, a novel shared log abstraction. LazyLog lazily binds records (across shards) to linearizable global positions and enforces this before a log position can be read. Such lazy ordering enables low ingestion latencies. Given the time decoupling, LazyLog can establish the order well before reads arrive, minimizing overhead upon reads. We build two LazyLog systems that provide linearizable total order across shards. Our experiments show that LazyLog systems deliver significantly lower latencies than conventional, eager-ordering shared logs.
Understanding those structures requires a different kind of thinking, and that’s exactly what systems thinking is: the ability to shift from reacting to events through responsive patterns of behaviors to generating improved systemic structures.
In the micro frontends architectural style, the frontend is divided into smaller components, which can range from a simple button to an entire page. The goal is to improve scalability, resilience, and team independence, albeit at the cost of increased complexity and infrastructure demands. This paper seeks to understand when it is worth adopting micro frontends, particularly in the context of industry. To achieve this, we conducted an investigation into the state of the art of micro frontends, based on both academic and gray literature. We then implemented this architectural style in a marketplace for handcrafted products, which already used microservices. Finally, we evaluated the implementation through a semi-open questionnaire with the developers. At the studied marketplace company, the need for architectural change arose due to the tight coupling between their main system (a Java monolith) and a dedicated frontend system. Additionally, there were deprecated technologies and poor developer experience. To address these issues, the micro frontends architecture was adopted, along with the API Gateway and Backend for Frontend patterns, and technologies such as Svelte and Fastify. Although the adoption of micro frontends was successful, it was not strictly necessary to meet the company's needs. According to the analysis of the mixed questionnaire responses, other alternatives, such as a monolithic frontend, could have achieved comparable results. What made adopting micro frontends the most convenient choice in the company's context was the monolith strangulation and microservices adoption, which facilitated implementation through infrastructure reuse and knowledge sharing between teams.
We present the first open-source implementation and evaluation of Fast Raft, a hierarchical consensus protocol designed for dynamic, distributed environments. Fast Raft reduces the number of message rounds needed to commit log entries compared to standard Raft by introducing a fast-track mechanism and reducing leader dependence. Our implementation uses gRPC and Kubernetes-based deployment across AWS availability zones. Experimental results demonstrate a throughput improvement and reduced commit latency under low packet loss conditions, while maintaining Raft's safety and liveness guarantees.
LSM-tree based key-value stores are widely adopted as the data storage backend in modern big data applications. The LSM-tree grows with data ingestion, by either adding levels with fixed level capacities (dubbed as vertical scheme) or increasing level capacities with fixed number of levels (dubbed as horizontal scheme). The vertical scheme leads the trend in recent system designs in RocksDB, LevelDB, and WiredTiger, whereas the horizontal scheme shows a decline in being adopted in the industry. The growth scheme profoundly impacts the LSM system performance in various aspects such as read, write and space costs. This paper attempts to give a new insight into a fundamental design question -- how to grow an LSM-tree to attain more desirable performance? Our analysis highlights the limitations of the vertical scheme in achieving an optimal read-write trade-off and the horizontal scheme in managing space cost effectively. Building on the analysis, we present a novel approach, Vertiorizon, which combines the strengths of both the vertical and horizontal schemes to achieve a superior balance between lookup, update, and space costs. Its adaptive design makes it highly compatible with a wide spectrum of workloads. Compared to the vertical scheme, Vertiorizon significantly improves the read-write performance trade-off. In contrast to the horizontal scheme, Vertiorizon greatly extends the trade-off range by a non-trivial generalization of Bentley and Saxe's theory, while substantially reducing space costs. When integrated with RocksDB, Vertiorizon demonstrates better write performance than the vertical scheme, while incurring about six times less additional space cost compared to the horizontal scheme.
If you are not careful, a replication slot may cause unduly large amounts of WAL segments to be retained by the database. This post describes best practices helping to prevent this and other issues, discussing aspects like heartbeats, replication slot failover, monitoring, the management of Postgres publications, and more. While this is primarily based on my experience of using replication slots via Debezium’s Postgres connector, the principles are generally applicable and are worth considering also when using other CDC tools for Postgres based on logical replication.
Burn rate is how fast, relative to the SLO, the service consumes the error budget. Put simply, burn rate is the ratio of your error rate to your error budget (the percentage). Though burn rate is nearly identical to error rate, our stance is that burn rate is a better error rate. Looking at an error rate in isolation, you might ask: is this error rate too high? Burn rate answers this question by taking into account the expected error rate—i.e., the error budget—which was decided on by the service owners in accordance with their service reliability goals. The ratio of error rate to error budget is useful because if it is at or below one, the error rate is acceptable. If it’s higher than one, the error rate is higher than the service owner would expect.
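The ratio can be written out as a small calculation. This sketch assumes the SLO is expressed as a success-rate target (for example 0.999) and that errors and total requests are counted over the same window; the function and parameter names are hypothetical.

```python
def burn_rate(errors, total, slo_target):
    """Return error rate divided by error budget for one measurement window."""
    if total == 0:
        return 0.0  # no traffic, so no budget was consumed
    error_rate = errors / total
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_rate / error_budget


# 20 failures out of 10,000 requests against a 99.9% SLO: about twice the budgeted rate.
rate = burn_rate(errors=20, total=10_000, slo_target=0.999)
print(f"burn rate: {rate:.2f}")
```

A value at or below one means the window's error rate is within budget; a value above one means the budget is being spent faster than the SLO allows.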
Namerefs (introduced in bash 4.3) act as aliases for other variables
How to speed up your Postgres insert performance
BPF arenas are areas of memory where the verifier can safely relax its checking of pointers, allowing programmers to write arbitrary data structures in BPF
This library is a fork of Bits and Blooms that uses an alternative backing bitset based on Go's sync/atomic.Int64 rather than a bare slice of integers. This allows for concurrent addition and testing of filters without creating memory safety issues or race conditions by leveraging hardware support for atomic Load and Or operations on Int64s.
We present Laminar, the first TCP stack that delivers ASIC-class performance and energy efficiency on programmable Reconfigurable Match-Action Table (RMT) pipelines, providing flexibility while retaining standard TCP semantics and POSIX socket compatibility. The key challenge to Laminar is reconciling TCP's complex dependent state updates with RMT's unidirectional, lock-step execution model. To overcome this challenge, Laminar introduces three novel techniques: optimistic concurrency (speculative updates validated downstream), pseudo-segment injection (circular dependency resolution without stalls), and bump-in-the-wire processing (single-pass segment handling). Together, these enable TCP processing, including retransmission, reassembly, flow, and congestion control, as a pipeline of simple match-action operations. Our Intel Tofino 2 prototype demonstrates Laminar's scalability to terabit speeds, flexibility, and robustness to network dynamics. Laminar matches RDMA performance and efficiency for both RPC and streaming workloads (including NVMe-oF with SPDK), while maintaining TCP/POSIX compatibility. Laminar saves up to 16 host CPU cores versus state-of-the-art kernel-bypass TCP, while achieving 5x lower 99.99p tail latency and 2x better throughput-per-watt for key-value stores. At scale, Laminar drives nearly Bpps at 20 µs RPC tail latency. Unlike fixed-function offloads, Laminar supports transport evolution through in-data-path extensions (selective ACKs, congestion control variants, application co-design for shared logs). Finally, Laminar generalizes to FPGA SmartNICs, outperforming ToNIC's monolithic design by under equal timing.
This post gives an overview of how to build a CPU-local data structure on modern Linux. The exposition will be for x86, but other than the small bits of assembly you need to write, the technique is architecture-independent.
Priority queues are used in a wide range of applications, including prioritized online scheduling, discrete event simulation, and greedy algorithms. In parallel settings, classical priority queues often become a severe bottleneck, resulting in low throughput. Consequently, there has been significant interest in concurrent priority queues with relaxed semantics. In this article, we present the MultiQueue, a flexible approach to relaxed priority queues that uses multiple internal sequential priority queues. The scalability of the MultiQueue is enhanced by buffering elements, batching operations on the internal queues, and optimizing access patterns for high cache locality. We investigate the complementary quality criteria of rank error, which measures how close deleted elements are to the global minimum, and delay, which quantifies how many smaller elements were deleted before a given element. Extensive experimental evaluation shows that the MultiQueue outperforms competing approaches across several benchmarks. This includes shortest-path and branch-and-bound benchmarks that resemble real applications. Moreover, the MultiQueue can be configured easily to balance throughput and quality according to the application's requirements. We employ a seemingly paradoxical technique of wait-free locking that might be of broader interest for converting sequential data structures into relaxed concurrent data structures.
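The core idea of using multiple internal sequential queues can be sketched in a few lines. This is a deliberately simplified, single-threaded illustration: it inserts into a randomly chosen heap and deletes from the better of two randomly chosen heaps, and it omits the locking, buffering, batching, and cache-locality optimizations the abstract describes. The class and method names are hypothetical.

```python
import heapq
import random


class MultiQueue:
    """Relaxed priority queue built from several sequential binary heaps."""

    def __init__(self, num_queues, seed=0):
        if num_queues < 2:
            raise ValueError("need at least two internal queues")
        self.heaps = [[] for _ in range(num_queues)]
        self.rng = random.Random(seed)

    def insert(self, item):
        # Spread elements by pushing onto a randomly chosen internal heap.
        heapq.heappush(self.rng.choice(self.heaps), item)

    def delete_min(self):
        # Look at two random heaps and pop from the one with the smaller minimum.
        first, second = self.rng.sample(self.heaps, 2)
        candidates = [h for h in (first, second) if h]
        if not candidates:
            # Both samples were empty: fall back to any non-empty heap.
            candidates = [h for h in self.heaps if h]
            if not candidates:
                raise IndexError("delete_min from empty MultiQueue")
        best = min(candidates, key=lambda h: h[0])
        return heapq.heappop(best)


mq = MultiQueue(num_queues=4, seed=1)
for value in random.Random(2).sample(range(100), 100):
    mq.insert(value)
drained = [mq.delete_min() for _ in range(100)]
```

The drained sequence contains every inserted element but is only approximately sorted; how far each deletion is from the true minimum is what the rank error criterion measures.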
We pulled data from over 100,000 real incidents and built a benchmark report that skips the vanity metrics and focuses on what good incident response actually looks like.
When you are auditing or writing alerting rules, consider these things to keep your on-call rotation happier.
We present a new technique, Safe Concurrent Optimistic Traversals (SCOT), to address a well-known problem related to optimistic traversals with classical and more recent safe memory reclamation (SMR) schemes, such as Hazard Pointers (HP), Hazard Eras (HE), Interval-Based Reclamation (IBR), and Hyaline. Unlike Epoch-Based Reclamation (EBR), these (robust) schemes protect against stalled threads but lack support for well-known data structures with optimistic traversals, e.g., Harris' list and the Natarajan-Mittal tree. Such schemes are either incompatible with them or need changes with performance trade-offs (e.g., the Harris-Michael list). SCOT keeps existing SMR schemes intact and retains performance benefits of original data structures. We implement and evaluate SCOT with Harris' list and the Natarajan-Mittal tree, but it is also applicable to other data structures. Furthermore, we provide a simple modification for wait-free traversals. We observe similar performance speedups (e.g., Harris vs. Harris-Michael lists) that were previously available only to EBR users. Our version of the tree also achieves very high throughput, comparable to that of EBR, which is often treated as a practical upper bound.
S6-overlay is an easy-to-install (just extract a tarball or two!) set of scripts and utilities allowing you to use existing Docker images while using s6 as a pid 1 for your container and process supervisor for your services.
Macaroon tokens are bearer tokens (like JWTs) that use a cute chained-HMAC construction that allows an end-user to take any existing token they have and scope it down, all on their own. You can minimize your token before every API operation so that you’re only ever transmitting the least amount of privilege needed for what you’re actually doing, even if the token you were issued was an admin token. And they have a user-serviceable plug-in interface!
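The chained-HMAC construction can be sketched in a few lines. This is an illustration under assumptions, not any particular library's API: the function names (`mint`, `attenuate`, `verify`), the tuple token layout, and the byte-string caveat format are all hypothetical, and third-party caveats are left out.

```python
import hashlib
import hmac


def _mac(key, msg):
    return hmac.new(key, msg, hashlib.sha256).digest()


def mint(root_key, identifier):
    """Issuer: the initial signature is HMAC(root_key, identifier)."""
    return (identifier, [], _mac(root_key, identifier))


def attenuate(token, caveat):
    """Holder: add a caveat using only the current signature as the HMAC key."""
    identifier, caveats, sig = token
    return (identifier, caveats + [caveat], _mac(sig, caveat))


def verify(root_key, token, caveat_holds):
    """Issuer: replay the HMAC chain and check every caveat against the request."""
    identifier, caveats, sig = token
    expected = _mac(root_key, identifier)
    for caveat in caveats:
        if not caveat_holds(caveat):
            return False
        expected = _mac(expected, caveat)
    return hmac.compare_digest(expected, sig)
```

Because each new signature is keyed by the previous one, a holder can append caveats without the root key but cannot remove one: dropping a caveat leaves a signature that no longer matches the replayed chain.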
We’re announcing the release of Hyperlight Wasm: a Hyperlight virtual machine (VM) “micro-guest” that can run wasm component workloads written in many programming languages. If you’d like to dive straight in, you can visit the hyperlight-wasm repo on GitHub. In the remainder of this post we’ll cover the basics of how Hyperlight Wasm works and then walk through how to build a Rust example step-by-step.
The more you learn about STPA, and the more you see results from successful application of STPA to your business or industry, the more you realize you can't afford not to use it.
Software organizations tend to value measurement, iteration, and improvement based on data. These are great things for an organization to focus on; however, this has led to an industry practice of calculating and tracking Mean Time to Resolve, or MTTR. While it’s understandable to want to have a clear metric for tracking incident resolution, MTTR is problematic for a number of reasons.
This paper introduces a new data-structural object that we call the tiny pointer. In many applications, traditional -bit pointers can be replaced with -bit tiny pointers at the cost of only a constant-factor time overhead. We develop a comprehensive theory of tiny pointers, and give optimal constructions for both fixed-size tiny pointers (i.e., settings in which all of the tiny pointers must be the same size) and variable-size tiny pointers (i.e., settings in which the average tiny-pointer size must be small, but some tiny pointers can be larger). If a tiny pointer references an element in an array filled to load factor , then the optimal tiny-pointer size is bits in the fixed-size case, and expected bits in the variable-size case. Our tiny-pointer constructions also require us to revisit several classic problems having to do with balls and bins; these results may be of independent interest. Using tiny pointers, we revisit five classic data-structure problems: the data-retrieval problem, succinct dynamic binary search trees, space-efficient stable dictionaries, space-efficient dictionaries with variable-size keys, and the internal-memory stash problem. These are all well-studied problems, and in each case tiny pointers allow us to take a natural space-inefficient solution that uses pointers and make it space-efficient for free.
Context switching is known to be one of the most expensive operations performed by the operating system kernel, and it can kill the performance of many systems. It is a necessary evil on a busy system to keep it responsive, and to allow all the processes to make progress. But what makes it so expensive? This article decodes the hardware and software dynamics underlying context switching.
WireMock is a popular tool to simulate APIs and software dependencies, typically used by teams that need more power and functionality compared to simple mocking. It allows you to test applications without dependencies, simulate edge cases, and develop dependent features in parallel.
In this manuscript I overview my work on developing a Theory for Distributed Systems -- work that has involved many students and other collaborators. This effort started at Georgia Tech in the late 1970s, and has continued at MIT since 1981. This manuscript emphasizes the earlier contributions, and their impact on the directions of the field. These contributions include new distributed algorithms; rigorous proofs and analysis; discovery of errors in previous algorithms; lower bounds and other impossibility results expressing inherent limitations on the power of distributed systems; general mathematical foundations for modeling and analyzing distributed systems; and applications of these methods to understanding a variety of practical distributed systems, including distributed data-management systems, wired and wireless communication systems, and biological systems.
Let’s implement leader election using Amazon S3’s If-Match condition by building a distributed lock with it.
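A rough sketch of how a conditional write can act as a lock for leader election follows. It does not call S3: `FakeBucket` is a hypothetical in-memory stand-in whose `put` succeeds only if the caller's `if_match` equals the object's current ETag. Treating `if_match=None` as "the object must not exist yet" is a simplification made here. The lease format, `try_acquire`, and the explicit `now` argument are likewise assumptions for illustration.

```python
import itertools
import json


class PreconditionFailed(Exception):
    """Stands in for the HTTP 412 returned when a conditional write fails."""


class FakeBucket:
    """Hypothetical in-memory object store with compare-and-swap style writes."""

    def __init__(self):
        self._objects = {}  # key -> (etag, body)
        self._versions = itertools.count(1)

    def get(self, key):
        return self._objects.get(key)  # (etag, body), or None if absent

    def put(self, key, body, if_match=None):
        current = self._objects.get(key)
        current_etag = current[0] if current else None
        if if_match != current_etag:
            raise PreconditionFailed(key)
        etag = str(next(self._versions))  # every write gets a fresh ETag
        self._objects[key] = (etag, body)
        return etag


def try_acquire(bucket, key, node_id, now, ttl):
    """Try to become (or remain) leader; return True on success."""
    current = bucket.get(key)
    etag = None
    if current is not None:
        etag, body = current
        lease = json.loads(body)
        if lease["expires"] > now and lease["leader"] != node_id:
            return False  # another node holds a live lease
    new_body = json.dumps({"leader": node_id, "expires": now + ttl}).encode()
    try:
        bucket.put(key, new_body, if_match=etag)
        return True
    except PreconditionFailed:
        return False  # someone else wrote between our read and our write
```

The conditional write is what makes the read-then-write safe: if two nodes both see an expired lease, only the first write matches the ETag they read, so at most one of them wins. A real deployment would also have to account for clock skew between nodes, which this sketch sidesteps by passing `now` explicitly.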
We saw last time that with linear types, we could precisely capture the state of sockets in their types. In this post, I want to use the same idea of tracking states in types, but applied to a more unusual example from our paper: sending rich structured data types across the network and back with as little copying as possible.
In this post, I’ll go through some of the perhaps obscure Git config settings that I have personally globally enabled and go into them to explain what they do and why they should probably be the default settings.
Fixi.js is an experimental, minimalist implementation of generalized hypermedia controls.
Actor systems are a flexible model of concurrent and distributed programming, which are efficiently implementable, and avoid many classic concurrency bugs by construction. However, actor systems must still deal with the challenge of messages arriving in unexpected orderings. We describe an approach to restricting the orders in which actors send messages to each other, by equipping actor references -- the handle used to address another actor -- with a protocol restricting which message types can be sent to another actor and in which order using that particular actor reference. This endows the actor references with the properties of static (flow-sensitive) capabilities, which we call actor capabilities. By sending other actors only restricted actor references, they may control which messages are sent in which orders by other actors. Rules for duplicating (splitting) actor references ensure that these restrictions apply even in the presence of delegation. The capabilities themselves restrict message ordering, which may form the foundation for stronger forms of reasoning. We demonstrate this by layering an effect system over the base type system, where the relationships enforced between the actor capabilities and the effects of an actor's behaviour ensure that an actor's behaviour is always prepared to handle any message that may arrive.
We present a practical model of non-transactional consistency levels in the context of distributed data replication. Unlike prior work, our simple Shared Object Pool (SOP) model defines common consistency levels in a unified framework centered around the single concept of ordering. This naturally reflects modern cloud object storage services and is thus easy to understand. We show that a consistency level can be intuitively defined by specifying two types of constraints on the validity of orderings allowed by the level: convergence, which bounds the lineage shape of the ordering, and relationship, which bounds the relative positions between operations. We give examples of representative protocols and systems, and discuss their availability upper bound. To further demonstrate the expressiveness and practical relevance of our model, we use it to implement a Jepsen-integrated consistency checker for the four most common levels (linearizable, sequential, causal+, and eventual); the checker analyzes consistency conformity for small-scale histories of real system runs (etcd, ZooKeeper, and RabbitMQ).