How to speed up your Postgres insert performance
BPF arenas are areas of memory where the verifier can safely relax its checking of pointers, allowing programmers to write arbitrary data structures in BPF
This library is a fork of Bits and Blooms that uses an alternative backing bitset based on Go's sync/atomic.Int64 rather than a bare slice of integers. This allows for concurrent addition and testing of filters without creating memory safety issues or race conditions by leveraging hardware support for atomic Load and Or operations on Int64s.
We present Laminar, the first TCP stack that delivers ASIC-class performance and energy efficiency on programmable Reconfigurable Match-Action Table (RMT) pipelines, providing flexibility while retaining standard TCP semantics and POSIX socket compatibility. The key challenge to Laminar is reconciling TCP's complex dependent state updates with RMT's unidirectional, lock-step execution model. To overcome this challenge, Laminar introduces three novel techniques: optimistic concurrency (speculative updates validated downstream), pseudo-segment injection (circular dependency resolution without stalls), and bump-in-the-wire processing (single-pass segment handling). Together, these enable TCP processing, including retransmission, reassembly, flow, and congestion control, as a pipeline of simple match-action operations. Our Intel Tofino 2 prototype demonstrates Laminar's scalability to terabit speeds, flexibility, and robustness to network dynamics. Laminar matches RDMA performance and efficiency for both RPC and streaming workloads (including NVMe-oF with SPDK), while maintaining TCP/POSIX compatibility. Laminar saves up to 16 host CPU cores versus state-of-the-art kernel-bypass TCP, while achieving 5 lower 99.99p tail latency and 2 better throughput-per-watt for key-value stores. At scale, Laminar drives nearly Bpps at 20 s RPC tail latency. Unlike fixed-function offloads, Laminar supports transport evolution through in-data-path extensions (selective ACKs, congestion control variants, application co-design for shared logs). Finally, Laminar generalizes to FPGA SmartNICs, outperforming ToNIC's monolithic design by under equal timing.
This post gives an overview of how to build a CPU-local data structure on modern Linux. The exposition will be for x86, but other than the small bits of assembly you need to write, the technique is architecture-independent.
Priority queues are used in a wide range of applications, including prioritized online scheduling, discrete event simulation, and greedy algorithms. In parallel settings, classical priority queues often become a severe bottleneck, resulting in low throughput. Consequently, there has been significant interest in concurrent priority queues with relaxed semantics. In this article, we present the MultiQueue, a flexible approach to relaxed priority queues that uses multiple internal sequential priority queues. The scalability of the MultiQueue is enhanced by buffering elements, batching operations on the internal queues, and optimizing access patterns for high cache locality. We investigate the complementary quality criteria of rank error, which measures how close deleted elements are to the global minimum, and delay, which quantifies how many smaller elements were deleted before a given element. Extensive experimental evaluation shows that the MultiQueue outperforms competing approaches across several benchmarks. This includes shortest-path and branch-and-bound benchmarks that resemble real applications. Moreover, the MultiQueue can be configured easily to balance throughput and quality according to the application's requirements. We employ a seemingly paradoxical technique of wait-free locking that might be of broader interest for converting sequential data structures into relaxed concurrent data structures.
We pulled data from over 100,000 real incidents and built a benchmark report that skips the vanity metrics and focuses on what good incident response actually looks like.
When you are auditing or writing alerting rules, consider these things to keep your oncall rotation happier
We present a new technique, Safe Concurrent Optimistic Traversals (SCOT), to address a well-known problem related to optimistic traversals with classical and more recent safe memory reclamation (SMR) schemes, such as Hazard Pointers (HP), Hazard Eras (HE), Interval-Based Reclamation (IBR), and Hyaline. Unlike Epoch-Based Reclamation (EBR), these (robust) schemes protect against stalled threads but lack support for well-known data structures with optimistic traversals, e.g., Harris' list and the Natarajan-Mittal tree. Such schemes are either incompatible with them or need changes with performance trade-offs (e.g., the Harris-Michael list). SCOT keeps existing SMR schemes intact and retains performance benefits of original data structures. We implement and evaluate SCOT with Harris' list and the Natarajan-Mittal tree, but it is also applicable to other data structures. Furthermore, we provide a simple modification for wait-free traversals. We observe similar performance speedups (e.g., Harris vs. Harris-Michael lists) that were previously available only to EBR users. Our version of the tree also achieves very high throughput, comparable to that of EBR, which is often treated as a practical upper bound.
S6-overlay is an easy-to-install (just extract a tarball or two!) set of scripts and utilities allowing you to use existing Docker images while using s6 as a pid 1 for your container and process supervisor for your services.
Macaroon tokens are bearer tokens (like JWTs) that use a cute chained-HMAC construction that allows an end-user to take any existing token they have and scope it down, all on their own. You can minimize your token before every API operation so that you’re only ever transmitting the least amount of privilege needed for what you’re actually doing, even if the token you were issued was an admin token. And they have a user-serviceable plug-in interface!
We’re announcing the release of Hyperlight Wasm: a Hyperlight virtual machine (VM) “micro-guest” that can run wasm component workloads written in many programming languages. If you’d like to dive straight in, you can visit the hyperlight-wasm repo on GitHub. In the remainder of this post we’ll cover the basics of how Hyperlight Wasm works and then walk through how to build a Rust example step-by-step.
The more you learn about STPA, and the more you see results from successful application of STPA to your business or industry, the more you realize you can't afford not to use it.
Software organizations tend to value measurement, iteration, and improvement based on data. These are great things for an organization to focus on; however, this has led to an industry practice of calculating and tracking Mean Time to Resolve, or MTTR. While it’s understandable to want to have a clear metric for tracking incident resolution, MTTR is problematic for a number of reasons.
This paper introduces a new data-structural object that we call the tiny pointer. In many applications, traditional -bit pointers can be replaced with -bit tiny pointers at the cost of only a constant-factor time overhead. We develop a comprehensive theory of tiny pointers, and give optimal constructions for both fixed-size tiny pointers (i.e., settings in which all of the tiny pointers must be the same size) and variable-size tiny pointers (i.e., settings in which the average tiny-pointer size must be small, but some tiny pointers can be larger). If a tiny pointer references an element in an array filled to load factor , then the optimal tiny-pointer size is bits in the fixed-size case, and expected bits in the variable-size case. Our tiny-pointer constructions also require us to revisit several classic problems having to do with balls and bins; these results may be of independent interest. Using tiny pointers, we revisit five classic data-structure problems: the data-retrieval problem, succinct dynamic binary search trees, space-efficient stable dictionaries, space-efficient dictionaries with variable-size keys, and the internal-memory stash problem. These are all well-studied problems, and in each case tiny pointers allow for us to take a natural space-inefficient solution that uses pointers and make it space-efficient for free.
Context switching is known to be one of the most expensive operations performed by the operating system kernel which can kill the performance of many systems. It is a necessary evil on a busy system to keep it responsive, and to allow all the processes to make progress. But what makes it so expensive? This article decodes the hardware and software dynamics underlying context switching.
WireMock is a popular tool to simulate APIs and software dependencies, typically used by teams that need more power and functionality compared to simple mocking. It allows you to test applications without dependencies, simulate edge cases, and develop dependent features in parallel.
In this manuscript I overview my work on developing a Theory for Distributed Systems -- work that has involved many students and other collaborators. This effort started at Georgia Tech in the late 1970s, and has continued at MIT since 1981. This manuscript emphasizes the earlier contributions, and their impact on the directions of the field. These contributions include new distributed algorithms; rigorous proofs and analysis; discovery of errors in previous algorithms; lower bounds and other impossibility results expressing inherent limitations on the power of distributed systems; general mathematical foundations for modeling and analyzing distributed systems; and applications of these methods to understanding a variety of practical distributed systems, including distributed data-management systems, wired and wireless communication systems, and biological systems.
Let’s implement leader election using Amazon S3’s If-Match condition by building a distributed lock with it.
We saw last time that with linear types, we could precisely capture the state of sockets in their types. In this post, I want to use the same idea of tracking states in types, but applied to a more unusual example from our paper: sending rich structured data types across the network and back with as little copying as possible.
In this post, I’ll go through some of the perhaps obscure Git config settings that I have personally globally enabled and go into them to explain what they do and why they should probably be the default settings.
Fixi.js is an experimental, minimalist implementation of generalized hypermedia controls
Actor systems are a flexible model of concurrent and distributed programming, which are efficiently implementable, and avoid many classic concurrency bugs by construction. However actor systems must still deal with the challenge of messages arriving in unexpected orderings. We describe an approach to restricting the orders in which actors send messages to each other, by equipping actor references -- the handle used to address another actor -- with a protocol restricting which message types can be sent to another actor and in which order using that particular actor reference. This endows the actor references with the properties of static (flow-sensitive) capabilities, which we call actor capabilities. By sending other actors only restricted actor references, they may control which messages are sent in which orders by other actors. Rules for duplicating (splitting) actor references ensure that these restrictions apply even in the presence of delegation. The capabilities themselves restrict message ordering, which may form the foundation for stronger forms of reasoning. We demonstrate this by layering an effect system over the base type system, where the relationships enforced between the actor capabilities and the effects of an actor's behaviour ensure that an actor's behaviour is always prepared to handle any message that may arrive.
We present a practical model of non-transactional consistency levels in the context of distributed data replication. Unlike prior work, our simple Shared Object Pool (SOP) model defines common consistency levels in a unified framework centered around the single concept of ordering. This naturally reflects modern cloud object storage services and is thus easy to understand. We show that a consistency level can be intuitively defined by specifying two types of constraints on the validity of orderings allowed by the level: convergence, which bounds the lineage shape of the ordering, and relationship, which bounds the relative positions between operations. We give examples of representative protocols and systems, and discuss their availability upper bound. To further demonstrate the expressiveness and practical relevance of our model, we use it to implement a Jepsen-integrated consistency checker for the four most common levels (linearizable, sequential, causal+, and eventual); the checker analyzes consistency conformity for small-scale histories of real system runs (etcd, ZooKeeper, and RabbitMQ).
After a decade of research in userspace network stacks, why do new solutions remain inaccessible to most developers? We argue that this is because they ignored (1) the hardware constraints of public cloud NICs (vNICs) and (2) the flexibility required by applications. Concerning the former, state-of-the-art proposals rely on specific NIC features (e.g., flow steering, deep buffers) that are not broadly available in vNICs. As for the latter, most of these stacks enforce a restrictive execution model that does not align well with cloud application requirements. We propose a new userspace network stack, Machnet, built for public cloud VMs. Central to Machnet is a new ''Least Common Denominator'' model, a conceptual NIC with a minimal feature set supported by all kernel-bypass vNICs. The challenge is to build a new solution with performance comparable to existing stacks while relying only on basic features (e.g., no flow steering, no RSS reconfiguration). Machnet uses a microkernel design to provide higher flexibility in application execution compared to a library OS design; we show that microkernels' inter-process communication overhead is negligible on large cloud networks.
SwiftPaxos: Fast Geo-Replicated State Machines, in NSDI 2024, proposes a Paxos variant for networks with high latency, and different latencies between different pairs of nodes. Here’s a video of my presentation to the DistSys Reading Group, and a written review of the paper below.
Pwru is an eBPF-based tool for tracing network packets in the Linux kernel with advanced filtering capabilities. It allows fine-grained introspection of kernel state to facilitate debugging network connectivity issues.
Recently, we introduced the concept of graceful degradation: when a service encounters unexpected events, it’s often better to give some response than no response at all. We also introduced the most common approach to handle graceful degradation with load shedding. Yet, this is only one possible method, and today, we will explore another one brought by Facebook: adaptive LIFO (Last-In, First-Out).
Virtual routing and forwarding (VRF) is a network virtualization technique allowing traffic isolation at the layer 3 (L3) of the OSI model by creating independent routing and forwarding domains. It is essentially the combination of a separate routing table and associated network interfaces.
At first glance, context switching seems straightforward—save the current process's registers, switch page tables and stacks, and restore the new process's registers. However, the reality is much more complex, involving multiple data structures, hardware state management, and memory organization. To fully grasp context switching, we need to understand few key foundational concepts about the Linux kernel and X86-64 architecture. We'll explore these through a series of articles:
Google SRE has embraced systems theory and control theory. We have adopted the STAMP (System-Theoretic Accident Model and Processes) framework, developed by Professor Nancy Leveson at MIT, which shifts the focus from preventing individual component failures to understanding and managing complex system interactions. STAMP incorporates tools like Causal Analysis based on Systems Theory (CAST) for post-incident investigations and System-Theoretic Process Analysis (STPA) for hazard analysis.
We present PReQuaL (Probing to Reduce Queuing and Latency), a load balancer for distributed multi-tenant systems. PReQuaL is designed to minimize real-time request latency in the presence of heterogeneous server capacities and non-uniform, time-varying antagonist load. To achieve this, PReQuaL actively probes server load and leverages the power of d choices paradigm, extending it with asynchronous and reusable probes. Cutting against received wisdom, PReQuaL does not balance CPU load, but instead selects servers according to estimated latency and active requests-in-flight (RIF). We explore the major design features of PReQuaL on a testbed system and describe our experience using it to balance load within YouTube, where it has been running for more than a year. PReQuaL has dramatically decreased tail latency, error rates, and resource use, enabling YouTube and other production systems at Google to run at much higher utilization.
SQLite is already fast. But can we make it even faster? Researchers at the University of Helsinki and Cambridge began with this question and published a paper, “Serverless Runtime / Database Co-Design With Asynchronous I/O”. They demonstrate up to a 100x reduction in tail latency
Almost every database management system (DBMS) supporting transactions created in the last decade implements multi-version concurrency control (MVCC). Still, these systems rely on physical data structures (e.g., B+trees, hash tables) that do not natively support multi-versioning. As a result, there is a disconnect between the logical semantics of transactions and the DBMS’s underlying implementation. System developers must invest engineering efforts in coordinating transactional access to these data structures and nontransactional maintenance tasks. This burden leads to challenges when reasoning about the system’s correctness and performance and inhibits its modularity. In this paper, we propose the Deferred Action Framework (DAF), a new system architecture for scheduling maintenance tasks in an MVCC DBMS integrated with the system’s transactional semantics. DAF allows the system to register arbitrary actions and then defer their processing until they are deemed safe by transactional processing. We show that DAF can support garbage collection and index cleaning without compromising performance while facilitating higher-level implementation goals, such as non-blocking schema changes.
This zine is an introduction to choreographic programming, and is the result of a six-month collaboration between my student collaborator Ali Ali and me, published in December 2024. Choreographic programming lets programmers describe the behavior of a message-passing system as a single, unified program that describes all the participants in the system and how they interact!
Instead of counting offsets, keyset pagination uses a compound cursor
The volume of data generated and stored in contemporary global data centers is experiencing exponential growth. This rapid data growth necessitates efficient processing and analysis to extract valuable business insights. In distributed data processing systems, data undergoes exchanges between the compute servers that contribute significantly to the total data processing duration in adequately large clusters, necessitating efficient data transport protocols. Traditionally, data transport frameworks such as JDBC and ODBC have used TCP/IP-over-Ethernet as their underlying network protocol. Such frameworks require serializing the data into a single contiguous buffer before handing it off to the network card, primarily due to the requirement of contiguous data in TCP/IP. In OLAP use cases, this serialization process is costly for columnar data batches as it involves numerous memory copies that hurt data transport duration and overall data processing performance. We study the serialization overhead in the context of a widely-used columnar data format, Apache Arrow, and propose leveraging RDMA to transport Arrow data over Infiniband in a zero-copy manner. We design and implement Thallus, an RDMA-based columnar data transport protocol for Apache Arrow based on the Thallium framework from the Mochi ecosystem, compare it with a purely Thallium RPC-based implementation, and show substantial performance improvements can be achieved by using RDMA for columnar data transport.
Here’s an attempt at a basic tutorial for using Typst with Pandoc to produce beautiful – and easily customizable – PDFs and indeed printed pages.
In this blog post, I will motivate the problem: why we need a high-level interface for kernel-bypass I/O. The Demikernel OS architecture is a solution: it defines a high-level I/O abstraction for kernel-bypass and supports this abstraction across different kernel-bypass devices with library OSes. Read the HotOS paper if you are interested in our proposed solution.
Many concurrent algorithms require processes to perform fetch-and-add operations on a single memory location, which can be a hot spot of contention. We present a novel algorithm called Aggregating Funnels that reduces this contention by spreading the fetch-and-add operations across multiple memory locations. It aggregates fetch-and-add operations into batches so that the batch can be performed by a single hardware fetch-and-add instruction on one location and all operations in the batch can efficiently compute their results by performing a fetch-and-add instruction on a different location. We show experimentally that this approach achieves higher throughput than previous combining techniques, such as Combining Funnels, and is substantially more scalable than applying hardware fetch-and-add instructions on a single memory location. We show that replacing the fetch-and-add instructions in the fastest state-of-the-art concurrent queue by our Aggregating Funnels eliminates a bottleneck and greatly improves the queue's overall throughput.
In this article, I want to show how to integrate ephemeral PostgreSQL instances into your test setup. The examples are all specific to Go, but I expect that users of other programming languages and environments can benefit from some of these techniques as well.
Range filters are probabilistic data structures that answer approximate range emptiness queries. They aid in avoiding processing empty range queries and have use cases in many application domains such as key-value stores and social web analytics. However, current range filter designs do not support dynamically changing and growing datasets. Moreover, several of these designs also exhibit impractically high false positive rates under correlated workloads, which are common in practice. These impediments restrict the applicability of range filters across a wide range of use cases. We introduce Memento filter, the first range filter to offer dynamicity, fast operations, and a robust false positive rate guarantee for any workload. Memento filter partitions the key universe and clusters its keys according to this partitioning. For each cluster, it stores a fingerprint and a list of key suffixes contiguously. The encoding of these lists makes them amenable to existing dynamic filter structures. Due to the well-defined one-to-one mapping from keys to suffixes, Memento filter supports inserts and deletes and can even expand to accommodate a growing dataset. We implement Memento filter on top of a Rank-and-Select Quotient filter and InfiniFilter and demonstrate that it achieves competitive false positive rates and performance with the state-of-the-art while also providing dynamicity. Due to its dynamicity, Memento filter is the first range filter applicable to B-Trees. We showcase this by integrating Memento filter into WiredTiger, a B-Tree-based key-value store. Memento filter doubles WiredTiger's range query throughput when 50% of the queries are empty while keeping all other cost metrics unharmed.
Secure Shell (SSH) is a popular network protocol for accessing a shell on (remote) systems. Users might not pay enough attention to detail when asked to verify the system’s authenticity on the first connection to a system, thereby risking a network-level security breach. In this article, I will explain how SSHFP DNS records can help mitigate such risks and share the results of our large-scale analysis. At the time of measurement, only 1 in 10,000 domains used SSHFP records and less than 50% used a correct configuration.
The rapid evolution of cloud-native applications, characterized by dynamic, interconnected services, presents significant challenges for maintaining trustworthy and auditable systems, especially in sensitive contexts, such as finance or healthcare. Traditional methods of verification and certification are often inadequate due to the fast-past and dynamic development practices common in cloud computing. This paper introduces Advocate, a novel agent-based system designed to generate verifiable evidence of cloud-native application operations. By integrating with existing infrastructure tools, such as Kubernetes and distributed tracing systems, Advocate captures, authenticates, and stores evidence trails in a tamper-resistant manner. This approach not only supports the auditing process but also allows for privacy-preserving evidence aggregation. Advocate's extensible architecture facilitates its deployment in diverse environments, enabling the verification and adherence to policies and enhance trust in cloud services.
While Function as a Service (FaaS) platforms can initialize function sandboxes on worker nodes in 10-100s of milliseconds, the latency to schedule functions in real FaaS clusters can be orders of magnitude higher. The current approach of building FaaS cluster managers on top of legacy orchestration systems (e.g., Kubernetes) leads to high scheduling delays when clusters experience high sandbox churn, which is common for FaaS. Generic cluster managers use many hierarchical abstractions and internal components to manage and reconcile cluster state with frequent persistent updates. This becomes a bottleneck for FaaS since the cluster state frequently changes as sandboxes are created on the critical path of requests. Based on our root cause analysis of performance issues in existing FaaS cluster managers, we propose Dirigent, a clean-slate system architecture for FaaS orchestration with three key principles. First, Dirigent optimizes internal cluster manager abstractions to simplify state management. Second, it eliminates persistent state updates on the critical path of function invocations, leveraging the fact that FaaS abstracts sandbox locations from users to relax exact state reconstruction guarantees. Finally, Dirigent runs monolithic control and data planes to minimize internal communication overheads and maximize throughput. We compare Dirigent to state-of-the-art FaaS platforms and show that Dirigent reduces 99th percentile per-function scheduling latency for a production workload by 2.79x compared to AWS Lambda. Dirigent can spin up 2500 sandboxes per second at low latency, which is 1250x more than Knative.
Slim Reader/Writer Locks have a fantastic interface: pointer-sized (“slim”), zero-initialized, and non-allocating. Lacking cleanup, they compose naturally with arena allocation. Sounds like a futex? That’s because they’re built on futexes introduced at the same time. They’re also complemented by condition variables with the same desirable properties.
Extended Berkeley Packet Filter (eBPF) is a runtime that enables users to load programs into the operating system (OS) kernel, like Linux or Windows, and execute them safely and efficiently at designated kernel hooks. Each program passes through a verifier that reasons about the safety guarantees for execution. Hosting a safe virtual machine runtime within the kernel makes it dynamically programmable. Unlike the popular approach of bypassing or completely replacing the kernel, eBPF gives users the flexibility to modify the kernel on the fly, rapidly experiment and iterate, and deploy solutions to achieve their workload-specific needs, while working in concert with the kernel. In this paper, we present the first comprehensive description of the design and implementation of the eBPF runtime in the Linux kernel. We argue that eBPF today provides a mature and safe programming environment for the kernel. It has seen wide adoption since its inception and is increasingly being used not just to extend, but program entire components of the kernel, while preserving its runtime integrity. We outline the compelling advantages it offers for real-world production usage, and illustrate current use cases. Finally, we identify its key challenges, and discuss possible future directions.
We describe the design and implementation of Walter, a key-value store that supports transactions and replicates data across distant sites. A key feature behind Walter is a new property called Parallel Snapshot Isolation (PSI). PSI allows Walter to replicate data asynchronously, while providing strong guarantees within each site. PSI precludes write-write conflicts, so that developers need not worry about conflict-resolution logic. To prevent write-write conflicts and implement PSI, Walter uses two new and simple techniques: preferred sites and counting sets. We use Walter to build a social networking application and port a Twitter-like application
There is a process known as the Fiat-Shamir heuristic by which you can automatically transform certain interactive identification protocols into a non-interactive signature scheme. You can’t do this for every protocol, only ones that have a certain structure, but Schnorr identification meets the criteria. The resulting signature scheme is known, amazingly, as the Schnorr signature scheme.
If you’re considering adopting Sans I/O, here’s what you should be aware of.
Conflict-free Replicated Data Types (CRDTs) allow optimistic replication in a principled way. Different replicas can proceed independently, being available even under network partitions, and always converging deterministically: replicas that have received the same updates will have equivalent state, even if received in different orders. After a historical tour of the evolution from sequential data types to CRDTs, we present in detail the two main approaches to CRDTs, operation-based and state-based, including two important variations, the pure operation-based and the delta-state based. Intended as a tutorial for prospective CRDT researchers and designers, it provides solid coverage of the essential concepts, clarifying some misconceptions which frequently occur, but also presents some novel insights gained from considerable experience in designing both specific CRDTs and approaches to CRDTs.
After being a bit inspired by some ideas from people at work, the hackerspace and toots on mastodon, I figure out a SSH certificate authority would be a cool small project to hack on. Last year I wrote an SSH agent with TPM bound keys so this would nicely fit into the existing tooling.
We help you find European alternatives for digital service and products, like cloud services and SaaS products
Erasure codes are the way to more generally describe the space of trade-offs between storage efficiency and fault tolerance. One can say "I’d like this file carved into chunks, such that it can still be reconstructed with any chunks destroyed", and there’s an erasure code with those parameters which will provide the minimum-sized chunks necessary to meet that goal.
MultiPaxos, while a fundamental Replicated State Machine algorithm, suffers from a dearth of comprehensive guidelines for achieving a complete and correct implementation. This deficiency has hindered MultiPaxos' practical utility and adoption and has resulted in flawed claims about its capabilities. Our paper aims to bridge the gap between MultiPaxos' complexity and practical implementation through a meticulous and detailed design process spanning more than a year. It carefully dissects each phase of MultiPaxos and offers detailed step-by-step pseudocode -- in addition to a complete open-source implementation -- for all components, including the leader election, the failure detector, and the commit phase. The implementation of our complete design also provides better performance stability, resource usage, and network partition tolerance than naive MultiPaxos versions. Our specification includes a lightweight log compaction approach that avoids taking repeated snapshots, significantly improving resource usage and performance stability. Our failure detector, integrated into the commit phase of the algorithm, uses variable and adaptive heartbeat intervals to settle on a better leader under partial connectivity and network partitions, improving liveness under such conditions.
In this article we will learn about Packets of data and how do they travel peer-to-peer between 2 devices through corporate firewalls, Multiple NATs, flaky connections and unknown network addresses ports.
Listen/notify in Postgres is an incredible feature that makes itself useful in all kinds of situations. I’ve been using it a long time, started taking it for granted long ago, and was somewhat shocked recently looking into MySQL and SQLite to learn that even in 2024, no equivalent exists.
MVCC (Multi-Version Concurrency Control) is a method in which each write operation creates a “new version” of the data while retaining the “old version”. This allows concurrent read and write operations without blocking each other. PostgreSQL uses a variant of MVCC, also called Snapshot Isolation to isolate concurrent transactions. So, it is possible that a single piece of data could have multiple “versions” of it, and it is PostgreSQL’s responsibility to determine which ‘version’ shall be presented to the user based on multiple factors. This act is also known as the “visibility check” or “visibility control” In this blog, we will dive into PostgreSQL’s visibility check mechanism to understand how it works.
We present Trust<T>, a general, type- and memory-safe alternative to locking in concurrent programs. Instead of synchronizing multi-threaded access to an object of type T with a lock, the programmer may place the object in a Trust<T>. The object is then no longer directly accessible. Instead a designated thread, the object's trustee, is responsible for applying any requested operations to the object, as requested via the Trust<T> API. Locking is often said to offer a limited throughput per lock. Trust<T> is based on delegation, a message-passing technique which does not suffer this per-lock limitation. Instead, per-object throughput is limited by the capacity of the object's trustee, which is typically considerably higher. Our evaluation shows Trust<T> consistently and considerably outperforming locking where lock contention exists, with up to 22x higher throughput in microbenchmarks, and 5-9x for a home grown key-value store, as well as memcached, in situations with high lock contention. Moreover, Trust<T> is competitive with locks even in the absence of lock contention.
Ares is a modular framework, designed to implement dynamic, reconfigurable, fault-tolerant, read/write and strongly consistent distributed shared memory objects. Recent enhancements of the framework have realized the efficient implementation of large objects, by introducing versioning and data striping techniques. In this work, we identify performance bottlenecks of the Ares's variants by utilizing distributed tracing, a popular technique for monitoring and profiling distributed systems. We then propose optimizations across all versions of Ares, aiming in overcoming the identified flaws, while preserving correctness. We refer to the optimized version of Ares as Ares II, which now features a piggyback mechanism, a garbage collection mechanism, and a batching reconfiguration technique for improving the performance and storage efficiency of the original Ares. We rigorously prove the correctness of Ares II, and we demonstrate the performance improvements by an experimental comparison (via distributed tracing) of the Ares II variants with their original counterparts.
In this thesis, we introduce replay clocks (RepCl), a novel clock infrastructure that allows us to do offline analyses of distributed computations. The replay clock structure provides a methodology to replay a computation as it happened, with the ability to represent concurrent events effectively. It builds on the structures introduced by vector clocks (VC) and the Hybrid Logical Clock (HLC), combining their infrastructures to provide efficient replay. With such a clock, a user can replay a computation whilst considering multiple paths of executions, and check for constraint violations and properties that potential pathways could take in the presence of concurrent events. Specifically, if event e must occur before f then the replay clock must ensure that e is replayed before f. On the other hand, if e and f could occur in any order, replay should not force an order between them. We demonstrate that RepCl can be implemented with less than four integers for 64 processes for various system parameters if clocks are synchronized within 1ms. Furthermore, the overhead of RepCl (for computing timestamps and message size) is proportional to the size of the clock. Using simulations in a custom distributed system and NS-3, a state-of-the-art network simulator, we identify the expected overhead of RepCl. We also identify how a user can then identify feasibility region for RepCl, where unabridged replay is possible. Using the RepCl, we provide a tracer for distributed computations, that allows any computation using the RepCl to be replayed efficiently. The visualization allows users to analyze specific properties and constraints in an online fashion, with the ability to consider concurrent paths independently. The visualization provides per-process views and an overarching view of the whole computation based on the time recorded by the RepCl for each event.
Most tracers are designed around functions. You attach a probe to a function and then when the function is called and the probe is run, you access the function arguments and/or the return value. Function arguments are sufficient for the majority of use cases, as function bodies typically process and manipulate the arguments. But sometimes they are not enough. Sometimes we want to access the local variables defined in the function body. This post walks through a hypothetical example of accessing local variables. More specifically, we will use bpftrace to trace bpftrace.
This article examines the significant challenges encountered in implementing sharding within distributed replication systems. It identifies the impediments of achieving consensus among large participant sets, leading to scalability, throughput, and performance limitations. These issues primarily arise due to the message complexity inherent in consensus mechanisms. In response, we investigate the potential of sharding to mitigate these challenges, analyzing current implementations within distributed replication systems. Additionally, we offer a comprehensive review of replication systems, encompassing both classical distributed databases as well as Distributed Ledger Technologies (DLTs) employing sharding techniques. Through this analysis, the article aims to provide insights into addressing the scalability and performance concerns in distributed replication systems.
This is a short course on getting started with understanding how a TPM 2.0 works. In this course we explain a number of the features of the TPM 2.0 through the TPM2_Tools through examples and, optionally, exercises.
Pyodide is a port of CPython to WebAssembly. It interprets Python code, without any need to precompile the Python code itself to any other format. It runs in a web browser — check out this REPL. It is true to the CPython that Python developers know and expect, providing most of the Python Standard Library. It provides a foreign function interface (FFI) to JavaScript, allowing you to call JavaScript APIs directly from Python — more on this below. It provides popular open-source packages, and can import pure Python packages directly from PyPI.
MongoDB is a distributed database that supports replication and horizontal partitioning (sharding). MongoDB replica sets consist of a primary that accepts all client writes and then propagates those writes to the secondaries. Each member of the replica set contains the same set of data. For horizontal partitioning, each shard (or partition) is a replica set. This paper discusses the design and rationale behind MongoDB's implementation of a cluster-wide logical clock and causal consistency. The design leveraged ideas from across the research community to ensure that the implementation adds minimal processing overhead, tolerates possible operator errors, and gives protection against non-trusted client attacks. While the goal of the team was not to discover or test new algorithms, the practical implementation necessitated a novel combination of ideas from the research community on causal consistency, security, and minimal performance overhead at scale. This paper describes a large scale, practical implementation of causal consistency using a hybrid logical clock, adding the signing of logical time ranges to the protocol, and introducing performance optimizations necessary for systems at scale. The implementation seeks to define an event as a state change and as such must make forward progress guarantees even during periods of no state changes for a partition of data.
In this article I want to present you the tiny utility mdbooker. It allows me to convert my project’s README.md into a beautiful documentation site makesure.dev. The utility works in conjunction with the amazing mdBook tool.
Many modern key-value stores, such as RocksDB, rely on log-structured merge trees (LSMs). Originally designed for spinning disks, LSMs optimize for write performance by only making sequential writes. But this optimization comes at the cost of reads: LSMs must rely on expensive compaction jobs and Bloom filters---all to maintain reasonable read performance. For NVMe SSDs, we argue that trading off read performance for write performance is no longer always needed. With enough parallelism, NVMe SSDs have comparable random and sequential access performance. This change makes update-in-place designs, which traditionally provide excellent read performance, a viable alternative to LSMs. In this paper, we close the gap between log-structured and update-in-place designs on modern SSDs with the help of new components that take advantage of data and workload patterns. Specifically, we explore three key ideas: (A) record caching for efficient point operations, (B) page grouping for high-performance range scans, and (C) insert forecasting to reduce the reorganization costs of accommodating new records. We evaluate these ideas by implementing them in a prototype update-in-place key-value store called TreeLine. On YCSB, we find that TreeLine outperforms RocksDB and LeanStore by 2.20× and 2.07× respectively on average across the point workloads, and by up to 10.95× and 7.52× overall.
When you have an outage caused by a performance issue, you don't want to lose precious time just to install the tools needed to diagnose it. Here is a list of "crisis tools" I recommend installing on your Linux servers by default (if they aren't already), along with the (Ubuntu) package names that they come from
The sequential semantics of many concurrent data structures, such as stacks and queues, inevitably lead to memory contention in parallel environments, thus limiting scalability. Semantic relaxation has the potential to address this issue, increasing the parallelism at the expense of weakened semantics. Although prior research has shown that improved performance can be attained by relaxing concurrent data structure semantics, there is no one-size-fits-all relaxation that adequately addresses the varying needs of dynamic executions. In this paper, we first introduce the concept of elastic relaxation and consequently present the Lateral structure, which is an algorithmic component capable of supporting the design of elastically relaxed concurrent data structures. Using the Lateral , we design novel elastically relaxed, lock-free queues and stacks capable of reconfiguring relaxation during run time. We establish linearizability and define upper bounds for relaxation errors in our designs. Experimental evaluations show that our elastic designs hold up against state-of-the-art statically relaxed designs, while also swiftly managing trade-offs between relaxation and operational latency. We also outline how to use the Lateral to design elastically relaxed lock-free counters and deques.