We describe the design and implementation of Walter, a key-value store that supports transactions and replicates data across distant sites. A key feature behind Walter is a new property called Parallel Snapshot Isolation (PSI). PSI allows Walter to replicate data asynchronously, while providing strong guarantees within each site. PSI precludes write-write conflicts, so that developers need not worry about conflict-resolution logic. To prevent write-write conflicts and implement PSI, Walter uses two new and simple techniques: preferred sites and counting sets. We use Walter to build a social networking application and port a Twitter-like application
There is a process known as the Fiat-Shamir heuristic by which you can automatically transform certain interactive identification protocols into a non-interactive signature scheme. You can’t do this for every protocol, only ones that have a certain structure, but Schnorr identification meets the criteria. The resulting signature scheme is known, amazingly, as the Schnorr signature scheme.
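As a rough illustration of the transformation described above, here is a minimal sketch of a Schnorr signature in which the verifier's random challenge is replaced by a hash of the commitment and the message. The group parameters (`P`, `Q`, `G`), the use of SHA-256, and all function names are assumptions chosen for illustration; the parameters are far too small to be secure.

```python
import hashlib
import secrets

# Toy parameters (assumed for illustration, NOT secure):
# P = 2*Q + 1 with P and Q prime; G = 4 generates the subgroup of order Q.
P = 2039
Q = 1019
G = 4

def challenge(commitment: int, message: bytes) -> int:
    """Fiat-Shamir step: hash the commitment and message instead of asking a verifier."""
    data = commitment.to_bytes(2, "big") + message
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % Q

def keygen():
    """Return (secret key x, public key y = G^x mod P)."""
    x = secrets.randbelow(Q - 1) + 1  # x in [1, Q-1]
    return x, pow(G, x, P)

def sign(x: int, message: bytes):
    """Commit to a fresh nonce k, derive the challenge e by hashing, respond with s."""
    k = secrets.randbelow(Q - 1) + 1
    r = pow(G, k, P)            # commitment
    e = challenge(r, message)   # challenge, computed non-interactively
    s = (k + e * x) % Q         # response
    return e, s

def verify(y: int, message: bytes, signature) -> bool:
    """Recompute the commitment as G^s * y^(-e) and check that it hashes to e."""
    e, s = signature
    r = (pow(G, s, P) * pow(y, -e, P)) % P
    return challenge(r, message) == e
```

In the interactive identification protocol the verifier would pick `e` at random; here the signer derives it from the hash, which is what removes the interaction.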
If you’re considering adopting Sans I/O, here’s what you should be aware of.
Conflict-free Replicated Data Types (CRDTs) allow optimistic replication in a principled way. Different replicas can proceed independently, being available even under network partitions, and always converging deterministically: replicas that have received the same updates will have equivalent state, even if received in different orders. After a historical tour of the evolution from sequential data types to CRDTs, we present in detail the two main approaches to CRDTs, operation-based and state-based, including two important variations, the pure operation-based and the delta-state based. Intended as a tutorial for prospective CRDT researchers and designers, it provides solid coverage of the essential concepts, clarifying some misconceptions which frequently occur, but also presents some novel insights gained from considerable experience in designing both specific CRDTs and approaches to CRDTs.
After being a bit inspired by some ideas from people at work, the hackerspace and toots on Mastodon, I figured an SSH certificate authority would be a cool small project to hack on. Last year I wrote an SSH agent with TPM-bound keys, so this would fit nicely into the existing tooling.
We help you find European alternatives for digital services and products, like cloud services and SaaS products.
Erasure codes are the way to more generally describe the space of trade-offs between storage efficiency and fault tolerance. One can say "I’d like this file carved into chunks, such that it can still be reconstructed with any chunks destroyed", and there’s an erasure code with those parameters which will provide the minimum-sized chunks necessary to meet that goal.
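The simplest point in that trade-off space is a single XOR parity chunk, which tolerates the loss of any one chunk. The sketch below is an assumption-laden illustration (the function names and zero-padding scheme are invented here); general erasure codes such as Reed-Solomon tolerate more losses but need more machinery.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int) -> list:
    """Split data into k equal-sized chunks (zero-padded) plus one XOR parity chunk."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    chunks = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(xor_bytes, chunks)
    return chunks + [parity]

def recover(chunks: list) -> list:
    """Rebuild the single missing chunk (marked None) by XOR-ing all survivors."""
    missing = chunks.index(None)
    survivors = [c for c in chunks if c is not None]
    repaired = list(chunks)
    repaired[missing] = reduce(xor_bytes, survivors)
    return repaired
```

With `k` data chunks and one parity chunk, the storage overhead is `1/k` and any single destroyed chunk, data or parity, can be reconstructed.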
MultiPaxos, while a fundamental Replicated State Machine algorithm, suffers from a dearth of comprehensive guidelines for achieving a complete and correct implementation. This deficiency has hindered MultiPaxos' practical utility and adoption and has resulted in flawed claims about its capabilities. Our paper aims to bridge the gap between MultiPaxos' complexity and practical implementation through a meticulous and detailed design process spanning more than a year. It carefully dissects each phase of MultiPaxos and offers detailed step-by-step pseudocode -- in addition to a complete open-source implementation -- for all components, including the leader election, the failure detector, and the commit phase. The implementation of our complete design also provides better performance stability, resource usage, and network partition tolerance than naive MultiPaxos versions. Our specification includes a lightweight log compaction approach that avoids taking repeated snapshots, significantly improving resource usage and performance stability. Our failure detector, integrated into the commit phase of the algorithm, uses variable and adaptive heartbeat intervals to settle on a better leader under partial connectivity and network partitions, improving liveness under such conditions.
In this article we will learn about packets of data and how they travel peer-to-peer between two devices through corporate firewalls, multiple NATs, flaky connections, and unknown network addresses and ports.
Listen/notify in Postgres is an incredible feature that makes itself useful in all kinds of situations. I’ve been using it a long time, started taking it for granted long ago, and was somewhat shocked recently looking into MySQL and SQLite to learn that even in 2024, no equivalent exists.
MVCC (Multi-Version Concurrency Control) is a method in which each write operation creates a “new version” of the data while retaining the “old version”. This allows concurrent read and write operations to proceed without blocking each other. PostgreSQL uses a variant of MVCC, also called Snapshot Isolation, to isolate concurrent transactions. A single piece of data can therefore have multiple “versions”, and it is PostgreSQL’s responsibility to determine which “version” shall be presented to the user based on multiple factors. This act is also known as the “visibility check” or “visibility control”. In this blog, we will dive into PostgreSQL’s visibility check mechanism to understand how it works.
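To make the idea of a visibility check concrete, here is a heavily simplified sketch. It assumes each row version records the transaction that created it (`xmin`) and the one that deleted or superseded it (`xmax`), and it models a snapshot as just the set of transactions the reader may see. The class and function names are hypothetical, and PostgreSQL's real check considers considerably more state than this.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

@dataclass(frozen=True)
class TupleVersion:
    value: str
    xmin: int                   # transaction that created this version
    xmax: Optional[int] = None  # transaction that deleted or superseded it, if any

@dataclass(frozen=True)
class Snapshot:
    committed: FrozenSet[int]   # transactions whose effects this snapshot may see

def is_visible(version: TupleVersion, snapshot: Snapshot) -> bool:
    """A version is visible if its creator is seen and its deleter (if any) is not."""
    created = version.xmin in snapshot.committed
    deleted = version.xmax is not None and version.xmax in snapshot.committed
    return created and not deleted

def read(versions: List[TupleVersion], snapshot: Snapshot) -> List[str]:
    """Return the values of all versions visible to the given snapshot."""
    return [v.value for v in versions if is_visible(v, snapshot)]
```

Two readers with different snapshots can thus see different versions of the same row without blocking the writer that produced the newer one.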
We present Trust<T>, a general, type- and memory-safe alternative to locking in concurrent programs. Instead of synchronizing multi-threaded access to an object of type T with a lock, the programmer may place the object in a Trust<T>. The object is then no longer directly accessible. Instead a designated thread, the object's trustee, is responsible for applying any requested operations to the object, as requested via the Trust<T> API. Locking is often said to offer a limited throughput per lock. Trust<T> is based on delegation, a message-passing technique which does not suffer this per-lock limitation. Instead, per-object throughput is limited by the capacity of the object's trustee, which is typically considerably higher. Our evaluation shows Trust<T> consistently and considerably outperforming locking where lock contention exists, with up to 22x higher throughput in microbenchmarks, and 5-9x for a home grown key-value store, as well as memcached, in situations with high lock contention. Moreover, Trust<T> is competitive with locks even in the absence of lock contention.
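The delegation idea can be sketched in a few lines: instead of taking a lock, callers send closures to a single thread that owns the object. This is only a hypothetical Python analogue; the class name `Trust`, the `apply` and `close` methods, and the queue-based transport are assumptions for illustration and do not reflect the paper's API or its performance characteristics.

```python
import queue
import threading
from concurrent.futures import Future

class Trust:
    """Wraps an object so that only one thread, the trustee, ever touches it."""

    def __init__(self, obj):
        self._obj = obj
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self):
        # Trustee loop: apply requested operations one at a time, in arrival order.
        while True:
            item = self._requests.get()
            if item is None:  # shutdown sentinel
                break
            fn, fut = item
            try:
                fut.set_result(fn(self._obj))
            except Exception as exc:
                fut.set_exception(exc)

    def apply(self, fn):
        """Ask the trustee to run fn(obj) and wait for the result."""
        fut = Future()
        self._requests.put((fn, fut))
        return fut.result()

    def close(self):
        """Stop the trustee thread after it drains earlier requests."""
        self._requests.put(None)
        self._thread.join()
```

Because every operation runs on the trustee thread, the wrapped object needs no lock of its own; callers only contend on the request queue.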
Ares is a modular framework, designed to implement dynamic, reconfigurable, fault-tolerant, read/write and strongly consistent distributed shared memory objects. Recent enhancements of the framework have realized the efficient implementation of large objects, by introducing versioning and data striping techniques. In this work, we identify performance bottlenecks of the Ares variants by utilizing distributed tracing, a popular technique for monitoring and profiling distributed systems. We then propose optimizations across all versions of Ares, aiming to overcome the identified flaws while preserving correctness. We refer to the optimized version of Ares as Ares II, which now features a piggyback mechanism, a garbage collection mechanism, and a batching reconfiguration technique for improving the performance and storage efficiency of the original Ares. We rigorously prove the correctness of Ares II, and we demonstrate the performance improvements by an experimental comparison (via distributed tracing) of the Ares II variants with their original counterparts.
In this thesis, we introduce replay clocks (RepCl), a novel clock infrastructure that allows us to do offline analyses of distributed computations. The replay clock structure provides a methodology to replay a computation as it happened, with the ability to represent concurrent events effectively. It builds on the structures introduced by vector clocks (VC) and the Hybrid Logical Clock (HLC), combining their infrastructures to provide efficient replay. With such a clock, a user can replay a computation whilst considering multiple paths of executions, and check for constraint violations and properties that potential pathways could take in the presence of concurrent events. Specifically, if event e must occur before f then the replay clock must ensure that e is replayed before f. On the other hand, if e and f could occur in any order, replay should not force an order between them. We demonstrate that RepCl can be implemented with less than four integers for 64 processes for various system parameters if clocks are synchronized within 1ms. Furthermore, the overhead of RepCl (for computing timestamps and message size) is proportional to the size of the clock. Using simulations in a custom distributed system and NS-3, a state-of-the-art network simulator, we identify the expected overhead of RepCl. We also show how a user can identify the feasibility region for RepCl, where unabridged replay is possible. Using RepCl, we provide a tracer for distributed computations that allows any computation using RepCl to be replayed efficiently. The visualization allows users to analyze specific properties and constraints in an online fashion, with the ability to consider concurrent paths independently. The visualization provides per-process views and an overarching view of the whole computation based on the time recorded by the RepCl for each event.
Most tracers are designed around functions. You attach a probe to a function and then when the function is called and the probe is run, you access the function arguments and/or the return value. Function arguments are sufficient for the majority of use cases, as function bodies typically process and manipulate the arguments. But sometimes they are not enough. Sometimes we want to access the local variables defined in the function body. This post walks through a hypothetical example of accessing local variables. More specifically, we will use bpftrace to trace bpftrace.
This article examines the significant challenges encountered in implementing sharding within distributed replication systems. It identifies the impediments of achieving consensus among large participant sets, leading to scalability, throughput, and performance limitations. These issues primarily arise due to the message complexity inherent in consensus mechanisms. In response, we investigate the potential of sharding to mitigate these challenges, analyzing current implementations within distributed replication systems. Additionally, we offer a comprehensive review of replication systems, encompassing both classical distributed databases as well as Distributed Ledger Technologies (DLTs) employing sharding techniques. Through this analysis, the article aims to provide insights into addressing the scalability and performance concerns in distributed replication systems.
This is a short course on getting started with understanding how a TPM 2.0 works. In this course we explain a number of the features of the TPM 2.0 using the TPM2_Tools, through examples and, optionally, exercises.
Pyodide is a port of CPython to WebAssembly. It interprets Python code, without any need to precompile the Python code itself to any other format. It runs in a web browser — check out this REPL. It is true to the CPython that Python developers know and expect, providing most of the Python Standard Library. It provides a foreign function interface (FFI) to JavaScript, allowing you to call JavaScript APIs directly from Python — more on this below. It provides popular open-source packages, and can import pure Python packages directly from PyPI.
MongoDB is a distributed database that supports replication and horizontal partitioning (sharding). MongoDB replica sets consist of a primary that accepts all client writes and then propagates those writes to the secondaries. Each member of the replica set contains the same set of data. For horizontal partitioning, each shard (or partition) is a replica set. This paper discusses the design and rationale behind MongoDB's implementation of a cluster-wide logical clock and causal consistency. The design leveraged ideas from across the research community to ensure that the implementation adds minimal processing overhead, tolerates possible operator errors, and gives protection against non-trusted client attacks. While the goal of the team was not to discover or test new algorithms, the practical implementation necessitated a novel combination of ideas from the research community on causal consistency, security, and minimal performance overhead at scale. This paper describes a large scale, practical implementation of causal consistency using a hybrid logical clock, adding the signing of logical time ranges to the protocol, and introducing performance optimizations necessary for systems at scale. The implementation seeks to define an event as a state change and as such must make forward progress guarantees even during periods of no state changes for a partition of data.
In this article I want to present the tiny utility mdbooker. It allows me to convert my project’s README.md into a beautiful documentation site, makesure.dev. The utility works in conjunction with the amazing mdBook tool.
Many modern key-value stores, such as RocksDB, rely on log-structured merge trees (LSMs). Originally designed for spinning disks, LSMs optimize for write performance by only making sequential writes. But this optimization comes at the cost of reads: LSMs must rely on expensive compaction jobs and Bloom filters---all to maintain reasonable read performance. For NVMe SSDs, we argue that trading off read performance for write performance is no longer always needed. With enough parallelism, NVMe SSDs have comparable random and sequential access performance. This change makes update-in-place designs, which traditionally provide excellent read performance, a viable alternative to LSMs. In this paper, we close the gap between log-structured and update-in-place designs on modern SSDs with the help of new components that take advantage of data and workload patterns. Specifically, we explore three key ideas: (A) record caching for efficient point operations, (B) page grouping for high-performance range scans, and (C) insert forecasting to reduce the reorganization costs of accommodating new records. We evaluate these ideas by implementing them in a prototype update-in-place key-value store called TreeLine. On YCSB, we find that TreeLine outperforms RocksDB and LeanStore by 2.20× and 2.07× respectively on average across the point workloads, and by up to 10.95× and 7.52× overall.
When you have an outage caused by a performance issue, you don't want to lose precious time just to install the tools needed to diagnose it. Here is a list of "crisis tools" I recommend installing on your Linux servers by default (if they aren't already), along with the (Ubuntu) package names that they come from.
The sequential semantics of many concurrent data structures, such as stacks and queues, inevitably lead to memory contention in parallel environments, thus limiting scalability. Semantic relaxation has the potential to address this issue, increasing the parallelism at the expense of weakened semantics. Although prior research has shown that improved performance can be attained by relaxing concurrent data structure semantics, there is no one-size-fits-all relaxation that adequately addresses the varying needs of dynamic executions. In this paper, we first introduce the concept of elastic relaxation and consequently present the Lateral structure, which is an algorithmic component capable of supporting the design of elastically relaxed concurrent data structures. Using the Lateral, we design novel elastically relaxed, lock-free queues and stacks capable of reconfiguring relaxation during run time. We establish linearizability and define upper bounds for relaxation errors in our designs. Experimental evaluations show that our elastic designs hold up against state-of-the-art statically relaxed designs, while also swiftly managing trade-offs between relaxation and operational latency. We also outline how to use the Lateral to design elastically relaxed lock-free counters and deques.
The pidfd abstraction is a Linux-specific way of referring to processes that avoids the race conditions inherent in Unix process ID numbers. Since a pidfd is a file descriptor, it needs a filesystem to implement the usual operations performed on files. As the use of pidfds has grown, they have stressed the limits of the simple filesystem that was created for them. Christian Brauner has created a new filesystem for pidfds that seems likely to debut in the 6.9 kernel, but it ran into a little bump along the way, demonstrating that things you cannot see can still hurt you.
“Prolly Tree” is short for “Probabilistic B-tree”. “Prolly Tree” was coined by the good folks who built Noms, who as far as we can tell invented the data structure. ... A Prolly Tree is a data structure closely related to a B-tree. Prolly Trees are generally useful but have proven particularly effective as the basis of the storage engine for version controlled databases. This article explains Prolly Trees in detail.
This report explores the use of kernel-bypass networking in FaaS runtimes and demonstrates how using Junction, a novel kernel-bypass system, as the backend for executing components in faasd can enhance performance and isolation. Junction achieves this by reducing network and compute overheads and minimizing interactions with the host operating system. Junctiond, the integration of Junction with faasd, reduces median and P99 latency by 37.33% and 63.42%, respectively, and can handle 10 times more throughput while decreasing latency by 2x at the median and 3.5 times at the tail.
This blog discusses what we learned building this feature, and walks you step-by-step through how to build a replication system consuming Postgres’s logical replication protocol in Go. We’ll be using the jackc/pglogrepl library to exchange messages with the Postgres primary, but most of the lessons generalize to other clients as well.
What is it that Erlang’s releases and hot swapping facilities do? Can we steal those ideas and build upon them? These are the main questions that motivated me in writing this post. Let’s take a step back, ignoring Erlang for a moment, and ask ourselves: what would good support for upgrades look like? Zero-downtime: seamless, don’t interrupt existing client connections or sessions; If there’s any state then migrate it in a type-safe way; Backwards and forwards compatibility: old clients should be able to talk to newer servers, and newer clients should be able to talk to old servers; Atomicity: upgrades either succeed, or fail and rollback any changes; Downgrades: even if an upgrade succeeds we might want to rollback to an earlier version.
Large-scale data analytics frameworks are shifting towards shorter task durations and larger degrees of parallelism to provide low latency. Scheduling highly parallel jobs that complete in hundreds of milliseconds poses a major challenge for task schedulers, which will need to schedule millions of tasks per second on appropriate machines while offering millisecond-level latency and high availability. We demonstrate that a decentralized, randomized sampling approach provides near-optimal performance while avoiding the throughput and availability limitations of a centralized design. We implement and deploy our scheduler, Sparrow, on a 110-machine cluster and demonstrate that Sparrow performs within 12% of an ideal scheduler.
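The intuition behind randomized sampling can be shown with a small simulation: each task probes a few randomly chosen workers and joins the shortest queue among them, with no central view of cluster load. This is a sketch of plain per-task sampling under assumed names (`place_tasks`, `probes`) and a simplified model in which tasks never finish; it is not Sparrow's actual algorithm.

```python
import random

def place_tasks(num_tasks: int, num_workers: int, probes: int, rng: random.Random) -> list:
    """Assign each task to the least-loaded of `probes` randomly sampled workers.

    Returns the final queue length of every worker. With probes=1 this
    degenerates to purely random placement.
    """
    queues = [0] * num_workers
    for _ in range(num_tasks):
        candidates = [rng.randrange(num_workers) for _ in range(probes)]
        target = min(candidates, key=lambda w: queues[w])
        queues[target] += 1
    return queues
```

Comparing `max(place_tasks(10000, 100, 1, rng))` with `max(place_tasks(10000, 100, 2, rng))` typically shows that even two probes per task flatten the longest queue considerably relative to random placement.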
Fully-strict fork-join parallelism is a powerful model for shared-memory programming due to its optimal time scaling and strong bounds on memory scaling. The latter is rarely achieved due to the difficulty of implementing continuation stealing in traditional High Performance Computing (HPC) languages -- where it is often impossible without modifying the compiler or resorting to non-portable techniques. We demonstrate how stackless coroutines (a new feature in C++20) can enable fully-portable continuation stealing and present libfork, a lock-free fine-grained parallelism library, combining coroutines with user-space, geometric segmented-stacks. We show our approach is able to achieve optimal time/memory scaling, both theoretically and empirically, across a variety of benchmarks. Compared to OpenMP (libomp), libfork is on average 7.2x faster and consumes 10x less memory. Similarly, compared to Intel's TBB, libfork is on average 2.7x faster and consumes 6.2x less memory. Additionally, we introduce non-uniform memory access (NUMA) optimizations for schedulers that demonstrate performance matching busy-waiting schedulers.
Zoned Namespace (ZNS) defines a new abstraction for host software to flexibly manage storage in flash-based SSDs as append-only zones. It also provides a Zone Append primitive to further boost the write performance of ZNS SSDs by exploiting intra-zone parallelism. However, making Zone Append effective for reliable and scalable storage, in the form of a RAID array of multiple ZNS SSDs, is non-trivial, since Zone Append offloads address management to ZNS SSDs and requires hosts to specifically manage RAID stripes across multiple drives. We propose ZapRAID, a high-performance log-structured RAID system for ZNS SSDs by carefully exploiting Zone Append to achieve high write parallelism and lightweight stripe management. ZapRAID adopts a group-based data layout with a coarse-grained ordering across multiple groups of stripes, such that it can use small-size metadata for stripe management on a per-group basis under Zone Append. It further adopts hybrid data management to simultaneously achieve intra-zone and inter-zone parallelism through a careful combination of both Zone Write and Zone Append primitives. We implement ZapRAID as a user-space block device, and evaluate ZapRAID using microbenchmarks, trace-driven experiments, and real-application experiments. Our evaluation results show that ZapRAID achieves high write throughput and maintains high performance in normal reads, degraded reads, crash recovery, and full-drive recovery
Shared software datapaths underpin modern datacentre networking. They implement mechanisms such as virtual switching, network virtualisation tunneling, or reliable transport, and enforce policies, such as tenant rate limits, virtual network isolation, or congestion control. However, because multiple applications, containers, or VMs share them, often across tenants, they pose a tail latency isolation challenge. Current isolation approaches either sacrifice efficiency via coarse-grained core partitioning or provide weak tail latency isolation when sharing cores with basic rate limits. This paper presents Virtuoso, a time protection mechanism for shared software datapaths that provides strong cross-tenant tail latency isolation while preserving low overhead and microsecond-scale latency. Our key insight is that tail latency is fundamentally a time metric, so byte or packet throughput is the wrong metric for controlling interference when packet processing costs vary. Our design instead enforces isolation through per-tenant CPU-time budgets at datapath intervention points within run-to-completion loops, without relying on preemption. In a case study, we instantiate Virtuoso in the TAS TCP stack and demonstrate a 7.8X reduction in victim tail latency under adversarial interference while keeping throughput within 5% of unmodified TAS. We also observe a 3X per-core efficiency improvement compared to siloed datapaths under bursty workloads.
The Raft consensus protocol (2014) provided a dynamic reconfiguration algorithm (a critical safety bug was found later, showing that reconfiguration protocols are tricky). Raft uses the main operation log (oplog) for both normal operations and reconfiguration operations. This coupling imposes fundamental restrictions on the operation of the two logs. MongoRaftReconfig avoids this by separating the oplog and "config state machine" (CSM), allowing reconfigurations to bypass the oplog SMR.
After some research, trial, and error, I finally built and ran a relatively low-cost cluster with a high-speed, full-mesh interconnected network. The most interesting part is that the networking is based on a USB4 ethernet bridge instead of a conventional ethernet switch and cables. I tested the network speed, and it can hit 11Gbps.
We focus on the well-studied problem of distributed overlay network construction. We consider a synchronous gossip-based communication model where in each round a node can send a message of small size to another node whose identifier it knows. The network is assumed to be reconfigurable, i.e., a node can add new connections (edges) to other nodes whose identifier it knows or drop existing connections. Each node initially has only knowledge of its own identifier and the identifiers of its neighbors. The overlay construction problem is, given an arbitrary (connected) graph, to reconfigure it to obtain a bounded-degree expander graph as efficiently as possible. The overlay construction problem is relevant to building real-world peer-to-peer network topologies that have desirable properties such as low diameter, high conductance, robustness to adversarial deletions, etc. Our main result shows that, starting from any arbitrary (connected) graph G on n nodes and m edges, we can construct an overlay network that is a constant-degree expander in polylog n rounds using only Õ(n) messages. Our time and message bounds are both essentially optimal (up to polylogarithmic factors). Our distributed overlay construction protocol is very lightweight as it uses gossip (each node communicates with only one neighbor in each round) and also scalable as it uses only Õ(n) messages, which is sublinear in m (even when m is moderately dense). To the best of our knowledge, this is the first result that achieves overlay network construction in polylog n rounds and o(m) messages. Our protocol uses graph sketches in a novel way to construct an expander overlay that is both time and communication efficient. A consequence of our overlay construction protocol is that distributed computation can be performed very efficiently in this model.
This book is a journey through the design space and history of programming languages from the perspective of control structures: the language mechanisms that enable programs to control their execution flows. Starting with the “goto” jumps of early programming languages and the emergence of structured programming in the 1960s, the book explores advanced control structures for imperative languages such as generators and coroutines, then develops alternate views of control in functional languages, first as continuations and their control operators, then as algebraic effects and effect handlers. Blending history, code examples, and theory, the book offers an original, comparative perspective on programming languages, as well as an extensive introduction to algebraic effects and other contemporary research topics in P.L.
Algebraic effects are not just a research concept anymore. You can use them in real software, today.
Queues form, timeouts propagate, retries synchronize, and a minor disturbance becomes a major incident. The task is to prevent such peaks when possible and to drain safely when they occur, with mechanisms that are fair to clients and disciplined about capacity.
The futex essentially separates the locking from waiting (and waking) tasks. The flexibility you get from separating those two concerns is key to good lock performance. It becomes much easier to avoid unnecessary delays (like sleeps with exponential backoffs) and bottlenecks, particularly system calls themselves, which are quite expensive compared to most of the code involved in locking.
The MPTCP protocol is complex, mainly to be able to survive on the Internet where middleboxes such as NATs, firewalls, IDS or proxies can modify parts of the TCP packets. In the worst case, an MPTCP connection should fall back to “plain” TCP. Today, such fallbacks are rarer than before (probably because MPTCP has been used since 2013 on millions of Apple smartphones worldwide), but they can still happen, e.g. on some mobile networks using Performance Enhancing Proxies (PEPs) where MPTCP connections are not bypassed. In such cases, a solution to continue benefiting from MPTCP is to tunnel the MPTCP connections. Different solutions exist, but they usually add extra layers, and require setting up a virtual private network (VPN) with private IP addresses between the client and the server. Here, a simpler solution is presented: TCP-in-UDP. This solution relies on eBPF, doesn’t add extra data per packet, and doesn’t require a virtual private network. Read on to find out more about that!
Algebraic effects (a.k.a. effect handlers) are a very useful up-and-coming feature that I personally think will see a huge surge in popularity in the programming languages of tomorrow. They’re one of the core features of Ante, as well as being the focus of many research languages including Koka, Effekt, Eff, and Flix. However, while many articles or documentation snippets try to explain what effect handlers are (including Ante’s own documentation), few really go in-depth on why you would want to use them. In this post I’ll explain exactly that and will include as complete a list as possible of all the use-cases of algebraic effects.
Synit is an experiment in applying pervasive reactivity and object capabilities to the System Layer of an operating system for personal computers, including laptops, desktops, and mobile phones. Its architecture follows the principles of the Syndicated Actor Model. Synit builds upon the Linux kernel, but replaces many pieces of familiar Linux software, including systemd, NetworkManager, D-Bus, and so on. It makes use of many concepts that will be familiar to Linux users, but also incorporates many ideas drawn from programming languages and operating systems not closely connected with Linux’s Unix heritage.
TinyKVM perhaps surprisingly places itself among the smallest serious sandboxing solutions out there, and may also be the fastest. It takes security seriously and tries to avoid complex guest features and kernel mode in general. TinyKVM has a minimal attack surface and no ambition to grow further in complexity, outside of nicer user-facing APIs and ports to other architectures.
Sequin is a tool for capturing changes and streaming data out of your Postgres database, guaranteeing exactly-once processing.
He suggested that subinterpreters, which fairly recently each got their own GIL, might be the right approach.
This is a PostgreSQL Docker container (based on postgres:16-alpine) that automatically upgrades your database. When it starts, it checks if your database files are for an older version (from PostgreSQL 9.5 onwards), and upgrades them (if needed), then starts the database server. If the database files don't need upgrading when it starts, then it skips the upgrade process and just starts PostgreSQL. The upgrade process uses the pg_upgrade utility behind the scenes, with the --link option enabled. This does an in-place upgrade for the quickest possible upgrade times.
GCRA is the “generic cell rate algorithm”, a rate-limiting algorithm that came from ATM. GCRA does the same job as the better-known leaky bucket algorithm, but using half the storage and with much less code.
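A minimal sketch of GCRA follows, to show why it needs so little state: the only value kept per client is a single timestamp, the theoretical arrival time (TAT). The class name, the `period` and `burst` parameters, and passing `now` explicitly are assumptions made for illustration and testability.

```python
class GCRA:
    """Allow one request per `period` seconds on average, with bursts of up to `burst`."""

    def __init__(self, period: float, burst: int):
        self.interval = period                 # emission interval between conforming requests
        self.tolerance = period * (burst - 1)  # how far the TAT may run ahead of `now`
        self.tat = 0.0                         # theoretical arrival time: the only stored state

    def allow(self, now: float) -> bool:
        """Return True and advance the TAT if a request at time `now` conforms."""
        tat = max(self.tat, now)               # an idle client's TAT catches up to now
        if tat - now > self.tolerance:
            return False                       # too far ahead of schedule: reject, state unchanged
        self.tat = tat + self.interval
        return True
```

For example, with `GCRA(period=1.0, burst=3)`, three requests at time 0 are allowed and a fourth is rejected; one more is allowed at time 1.0 once a slot has drained.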
Microservices are commonly used in modern cloud-native applications to achieve agility. However, the complexity of service dependencies in large-scale microservices systems can lead to anomaly propagation, making fault troubleshooting a challenge. To address this issue, distributed tracing systems have been proposed to trace complete request execution paths, enabling developers to troubleshoot anomalous services. However, existing distributed tracing systems have limitations such as invasive instrumentation, trace loss, or inaccurate trace correlation. To overcome these limitations, we propose a new tracing system based on eBPF (extended Berkeley Packet Filter), named Nahida, that can track complete requests in the kernel without intrusion, regardless of programming language or implementation. Our evaluation results show that Nahida can track over 92% of requests with stable accuracy, even under the high concurrency of user requests, while the state-of-the-art non-invasive approaches can not track any of the requests. Importantly, Nahida can track requests served by a multi-threaded application that none of the existing invasive tracing systems can handle by instrumenting tracing codes into libraries. Moreover, the overhead introduced by Nahida is negligible, increasing service latency by only 1.55%-2.1%. Overall, Nahida provides an effective and non-invasive solution for distributed tracing.
In this paper we build a case for providing job completion time predictions to cloud users, similar to the delivery date of a package or arrival time of a booked ride. Our analysis reveals that providing predictability can come at the expense of performance and fairness. Existing cloud scheduling systems optimize for extreme points in the trade-off space, making them either extremely unpredictable or impractical. To address this challenge, we present PCS, a new scheduling framework that aims to provide predictability while balancing other traditional objectives. The key idea behind PCS is to use Weighted-Fair-Queueing (WFQ) and find a suitable configuration of different WFQ parameters (e.g., class weights) that meets specific goals for predictability. It uses a simulation-aided search strategy to efficiently discover WFQ configurations that lie on the Pareto front of the trade-off space between these objectives. We implement and evaluate PCS in the context of DNN job scheduling on GPUs. Our evaluation, on a small-scale GPU testbed and larger-scale simulations, shows that PCS can provide accurate completion time estimates while marginally compromising on performance and fairness.
Let’s Encrypt is proud to introduce Sunlight, a new implementation of a Certificate Transparency log that we built from the ground up with modern Web PKI opportunities and constraints in mind.
Consistent hashing is employed in distributed systems and networking applications to evenly and effectively distribute data across a cluster of nodes. This paper introduces BinomialHash, a consistent hashing algorithm that operates in constant time and requires minimal memory. We provide a detailed explanation of the algorithm, offer a pseudo-code implementation, and formally establish its strong theoretical guarantees.
The distribution of keys to a given number of buckets is a fundamental task in distributed data processing and storage. A simple, fast, and therefore popular approach is to map the hash values of keys to buckets based on the remainder after dividing by the number of buckets. Unfortunately, these mappings are not stable when the number of buckets changes, which can lead to severe spikes in system resource utilization, such as network or database requests. Consistent hash algorithms can minimize remappings, but are either significantly slower than the modulo-based approach, require floating-point arithmetic, or are based on a family of hash functions rarely available in standard libraries. This paper introduces JumpBackHash, which uses only integer arithmetic and a standard pseudorandom generator. Due to its speed and simple implementation, it can safely replace the modulo-based approach to improve assignment and system stability. A production-ready Java implementation of JumpBackHash has been released as part of the Hash4j open source library.
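The instability of the modulo approach is easy to see in a small experiment. The sketch below is not JumpBackHash; for comparison it uses the older jump consistent hash of Lamping and Veach, which relies on floating-point arithmetic. The helper names (`hash64`, `modulo_bucket`, `jump_bucket`, `moved_fraction`) are assumptions made for illustration.

```python
import hashlib

MASK64 = 0xFFFFFFFFFFFFFFFF


def hash64(key: str) -> int:
    """Derive a 64-bit integer hash from a string key."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


def modulo_bucket(key: str, n: int) -> int:
    """Modulo-based assignment: fast, but unstable when n changes."""
    return hash64(key) % n


def jump_bucket(key: str, n: int) -> int:
    """Jump consistent hash (Lamping and Veach), shown only for comparison."""
    state = hash64(key)
    b, j = -1, 0
    while j < n:
        b = j
        # 64-bit linear congruential step, as in the published algorithm.
        state = (state * 2862933555777941757 + 1) & MASK64
        j = int((b + 1) * (float(1 << 31) / float((state >> 33) + 1)))
    return b


def moved_fraction(bucket_fn, keys, n_old: int, n_new: int) -> float:
    """Fraction of keys whose bucket changes when the bucket count changes."""
    moved = sum(1 for k in keys if bucket_fn(k, n_old) != bucket_fn(k, n_new))
    return moved / len(keys)
```

Growing from 10 to 11 buckets, the modulo mapping should move roughly 10 in 11 keys, while a consistent hash should move roughly 1 in 11, and only into the new bucket.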
The PSP security protocol (PSP) is a way to transparently encrypt packets by efficiently offloading encryption and decryption to the network interface cards (NICs) that Google uses for connections inside its data centers. The protocol is similar to IPsec, in that it allows for wrapping arbitrary traffic in a layer of encryption. The difference is that PSP is encapsulated in UDP, and designed from the beginning to reduce the amount of state that NICs have to track in order to send and receive encrypted traffic, allowing for more simultaneous connections.
There are a bunch of posts on the internet about using the git worktree command. As far as I can tell, most of them are primarily about using worktrees as a replacement for, or a supplement to, git branches. Instead of switching branches, you just change directories. This is also how I originally used worktrees, but that didn’t stick, and I abandoned them. But recently worktrees grew on me, though my new use-case is unlike branching.
A semigroup describes an operation of appending two values of some type to get a value of the same type.
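A small sketch of the idea, with hypothetical helper names (`combine_all`, `is_associative_on`, `concat`). The one law a semigroup operation must satisfy is associativity, which is what lets a sequence of values be combined without caring how the pairs are grouped.

```python
from functools import reduce
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def combine_all(append: Callable[[T, T], T], values: Iterable[T]) -> T:
    """Fold a non-empty sequence with a semigroup operation.

    A semigroup has no identity element, so an empty input raises TypeError.
    """
    return reduce(append, values)


def is_associative_on(append: Callable[[T, T], T], a: T, b: T, c: T) -> bool:
    """Spot-check the semigroup law (a <> b) <> c == a <> (b <> c) on one triple."""
    return append(append(a, b), c) == append(a, append(b, c))


def concat(x, y):
    """Concatenation: a semigroup operation on lists and on strings."""
    return x + y
```

List concatenation and `max` on integers both qualify; subtraction does not, because grouping changes the result.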
"Get or create" is a very common operation for syncing data in the database, but implementing it correctly may be trickier than you may expect. If you ever had to implement it in a real system with real-life load, you may have overlooked potential race conditions, concurrency issues and even bloat!
The central component for handling HTAP workloads is our hybrid column-row storage engine that is able to manage hot and cold data in two different storage formats. In OLTP workloads, data access is typically focused on a small hot subset of the data. To efficiently support OLTP transactions, we store this hot data in an uncompressed, row-based format. Cold data, which OLTP queries do not access frequently, is stored as large and encoded (i.e., compressed to allow processing without decompression) column chunks, an adaptation from data blocks
OpenObserve is a simple yet sophisticated log search, infrastructure monitoring, and APM solution. It is a full-fledged observability platform that can reduce your storage costs by ~140x compared to other solutions and requires much lower resource utilization resulting in much lower cost.
This library offers a simple protocol to encode/decode messages and exchange them between processes on a socket (inet or local).
A classic problem in parallel computing is to take a high-level parallel program written, for example, in nested-parallel style with fork-join constructs and run it efficiently on a real machine. The problem could be considered solved in theory, but not in practice, because the overheads of creating and managing parallel threads can overwhelm their benefits. Developing efficient parallel codes therefore usually requires extensive tuning and optimizations to reduce parallelism just to a point where the overheads become acceptable. In this paper, we present a scheduling technique that delivers provably efficient results for arbitrary nested-parallel programs, without the tuning needed for controlling parallelism overheads. The basic idea behind our technique is to create threads only at a beat (which we refer to as the “heartbeat”) and make sure to do useful work in between.
CRIB, for Checkpoint/Restore in (naturally) BPF. It is far from clear that CRIB will replace the existing solutions, but it is an interesting look at a different way of solving the problem.
High-performance, lightweight and cross-platform QUIC library
How to configure your C++ toolchain to produce binaries that are highly-debuggable with respect to your current bug.
This work aims to bridge the existing knowledge gap in the optimisation of latency-critical code, specifically focusing on high-frequency trading (HFT) systems. The research culminates in three main contributions: the creation of a Low-Latency Programming Repository, the optimisation of a market-neutral statistical arbitrage pairs trading strategy, and the implementation of the Disruptor pattern in C++. The repository serves as a practical guide and is enriched with rigorous statistical benchmarking, while the trading strategy optimisation led to substantial improvements in speed and profitability. The Disruptor pattern showcased significant performance enhancement over traditional queuing methods. Evaluation metrics include speed, cache utilisation, and statistical significance, among others. Techniques like Cache Warming and Constexpr showed the most significant gains in latency reduction. Future directions involve expanding the repository, testing the optimised trading algorithm in a live trading environment, and integrating the Disruptor pattern with the trading algorithm for comprehensive system benchmarking. The work is oriented towards academics and industry practitioners seeking to improve performance in latency-sensitive applications.
Ibid
Systemd uses DBus as the mechanism to interact with it. This article introduces just enough DBus concepts and the usage of busctl to communicate with systemd. These concepts should be useful when using DBus libraries.
How we efficiently store memory snapshots for VMs, and how we lazily load them to resume VMs within a second.
Reasoning about the use of external resources is an important aspect of many practical applications. Effect systems enable tracking such information in types, but at the cost of complicating signatures of common functions. Capabilities coupled with escape analysis offer safety and natural signatures, but are often overly coarse grained and restrictive. We present System C, which builds on and generalizes ideas from type-based escape analysis and demonstrates that capabilities and effects can be reconciled harmoniously. By assuming that all functions are second class, we can admit natural signatures for many common programs. By introducing a notion of boxed values, we can lift the restrictions of second-class values at the cost of needing to track degree-of-impurity information in types. The system we present is expressive enough to support effect handlers in full capacity. We practically evaluate System C in an implementation and prove its soundness.
Serverless computing has become increasingly popular for machine learning inference. However, current serverless platforms lack efficient support for GPUs, limiting their ability to deliver low-latency inference. In this paper, we propose FaaSwap, a GPU-efficient serverless inference platform. FaaSwap employs a holistic approach to system and algorithm design. It maintains models in main memory and dynamically swaps them onto GPUs upon request arrivals (i.e., late binding), thereby enabling a large number of inference functions to efficiently share a node's GPUs. FaaSwap uses various techniques, including asynchronous API redirection, GPU runtime sharing, pipelined model execution, and efficient GPU memory management, to achieve the optimal performance. We also develop an interference-aware request scheduling algorithm that allows FaaSwap to meet the latency SLOs for individual inference functions. We have implemented FaaSwap as a prototype on a leading commercial serverless platform. Experimental evaluations demonstrate that, with model swapping, FaaSwap can concurrently serve hundreds of functions on a single worker node with 4 V100 GPUs, while achieving inference performance comparable to native execution (where each function runs on a dedicated GPU). When deployed on a 6-node production testbed, FaaSwap meets the latency SLOs for over 1k functions, the maximum that the testbed can handle concurrently.
In kernel-centric operations, the uprobe component of eBPF frequently encounters performance bottlenecks, largely attributed to the overheads borne by context switches. Transitioning eBPF operations to user space bypasses these hindrances, thereby optimizing performance. This also enhances configurability and obviates the necessity for root access or privileges for kernel eBPF, subsequently minimizing the kernel attack surface. This paper introduces bpftime, a novel user-space eBPF runtime, which leverages binary rewriting to implement uprobe and syscall hook capabilities. Through bpftime, userspace uprobes achieve a 10x speed enhancement compared to their kernel counterparts without requiring dual context switches. Additionally, this runtime facilitates the programmatic hooking of syscalls within a process, both safely and efficiently. Bpftime can be seamlessly attached to any running process, limiting the need for either a restart or manual recompilation. Our implementation also extends to interprocess eBPF Maps within shared memory, catering to summary aggregation or control plane communication requirements. Compatibility with existing eBPF toolchains such as clang and libbpf is maintained, not only simplifying the development of user-space eBPF without necessitating any modifications but also supporting CO-RE through BTF. Through bpftime, we not only enhance uprobe performance but also extend the versatility and user-friendliness of eBPF runtime in user space, paving the way for more efficient and secure kernel operations.