We enhance the well-established software combining synchronization technique to create combining funnels. Previous software combining methods used a statically assigned tree whose depth was logarithmic in the total number of processors in the system. On shared memory multiprocessors the new method allows one to dynamically build combining trees with depth logarithmic in the actual number of processors concurrently accessing the data structure. The structure is comprised from a series of combining layers through which processors' requests are funneled. These layers use randomization instead of a rigid tree structure to allow processors to find partners for combining. By using an adaptive scheme the funnel can change width and depth to accommodate different access frequencies without requiring global agreement as to its size. Rather, processors choose parameters of the protocol privately, making this scheme very simple to implement and tune. When we add an “elimination” mechanism to the funnel structure, the randomly constructed “tree” is transformed into a “forest” of disjoint (and on average shallower) trees of requests, thus enhancing the level of parallelism and decreasing latency. We present two new linearizable combining funnel based data structures: a fetch-and-add object and a stack. We study the performance of these structures by benchmarking them against the most efficient software implementations of fetch-and-add and stacks known to date, combining trees and elimination trees, on a simulated shared memory multiprocessor using Proteus. Our empirical data shows that combining funnel-based fetch-and-add outperforms combining trees of fixed height by as much as 70%. In fact, even compared to combining trees optimized for a given load, funnel performance is the same or better. Elimination trees, which are not linearizable, are 10% faster than funnels under highest load, but as load drops, combining funnels adapt their size, giving them a 34% lead in latency
This repo contains the scripts necessary to install and run a tailscale instance on your Unifi Dream Machine (UDM/UDM Pro/UDR/UDM-SE).
Minimal, self-hosted, 0-config alternative to ngrok. Caddy+OpenSSH+50 lines of Python.
Nelua is a systems programming language for performance sensitive applications, like real-time applications and game engines. Its syntax and semantics are similar to Lua, but its garbage collection is optional, it provides optional type notations, and it is free from an interpreter. Nelua uses ahead-of-time compilation to generate optimized native binaries. It is metaprogrammable at compile-time using Lua and it is simple and easy to use.
A Development Container (or Dev Container for short) allows you to use a container as a full-featured development environment. It can be used to run an application, to separate tools, libraries, or runtimes needed for working with a codebase, and to aid in continuous integration and testing. Dev containers can be run locally or remotely, in a private or public cloud.
Handshake is a decentralized, permissionless naming protocol where every peer is validating and in charge of managing the root DNS naming zone with the goal of creating an alternative to existing Certificate Authorities and naming systems
This paper develops a theory of deadlock and leak freedom for higher-order locks in a shared memory concurrent setting. Higher-order locks allow sharing not only of basic values but also of other locks and channels, and are themselves first-class citizens. The theory is based on the notion of a sharing topology, administrating who is permitted to access shared data at what point in the program. The paper first develops higher-order locks for acyclic sharing topologies, instantiated in a 𝜆-calculus with higher-order locks and message-passing concurrency. The paper then extends the calculus to support circular dependencies with dynamic lock orders, which we illustrate with a dynamic version of Dijkstra’s dining philosophers problem. Well-typed programs in the resulting calculi are shown to be free of deadlocks and memory leaks, with proofs mechanized in the Coq proof assistant.
We describe the internal structure of boost::unordered_flat_map and provide theoretical analyses and benchmarking data to help readers gain insights into the key design elements behind this container's excellent performance. Interface and behavioral differences with the standard are also discussed.
In this series of posts we’ll write a program that reads lots of files stored on disk using Linux io_uring and C++20 coroutines. The main motivation comes from the fact that there are lots of in-depth resources for both io_uring and C++20 coroutines, but practically nothing showing how to combine both. We will discover that asynchronous I/O and coroutines go as well together as bread and butter.
The task-based dataflow programming model has emerged as an alternative to the process-centric programming model for extreme-scale applications. However, load balancing is still a challenge in task-based dataflow runtimes. In this paper, we present extensions to the PaR-SEC runtime to demonstrate that distributed work stealing is an effective load-balancing method for task-based dataflow runtimes. In contrast to shared-memory work stealing, we find that each process should consider future tasks and the expected waiting time for execution when determining whether to steal. We demonstrate the effectiveness of the proposed work-stealing policies for a sparse Cholesky factorization, which shows a speedup of up to 35% compared to a static division of work.
Cozo is a general-purpose, transactional, relational database that uses Datalog for query, is embeddable but can also handle huge amounts of data and concurrency, and focuses on graph data and algorithms. It supports time travel and it is performant!
We present nQUIC, a variant of QUIC-TLS that uses the Noise protocol framework for its key exchange and basis of its packet protector with no semantic transport changes. nQUIC is designed for deployment in systems and for applications that assert trust in raw public keys rather than PKI-based certificate chains. It uses a fixed key exchange algorithm, compromising agility for implementation and verification ease. nQUIC provides mandatory server and optional client authentication, resistance to Key Compromise Impersonation attacks, and forward and future secrecy of traffic key derivation, which makes it favorable to QUIC-TLS for long-lived QUIC connections in comparable applications. We developed two interoperable prototype implementations written in Go and Rust. Experimental results show that nQUIC finishes its handshake in a comparable amount of time as QUIC-TLS.
In serverless computing, applications are executed under lightweight virtualization and isolation environments, such as containers or micro virtual machines. Typically, their memory allocation is set by the user before deployment. All other resources, such as CPU, are allocated by the provider statically and proportionally to memory allocations. This contributes to either under-utilization or throttling. The former significantly impacts the provider, while the latter impacts the client. To solve this problem and accommodate both clients and providers, a solution is dynamic CPU allocation achieved through autoscaling. Autoscaling has been investigated for long-running applications using history-based techniques and prediction. However, serverless applications are short-running workloads, where such techniques are not well suited. In this paper, we investigate tiny autoscalers and how dynamic CPU allocation techniques perform for short-running serverless workloads. We experiment with Kubernetes as the underlying platform and implement using its vertical pod autoscaler several dynamic CPU rightsizing techniques. We compare these techniques using state-of-the-art serverless workloads. Our experiments show that dynamic CPU allocation for short-running serverless functions is feasible and can be achieved with lightweight algorithms that offer good performance.
The paper talks about the connection between Paxos and primary-backup replication applications, and presents vertical Paxos as a way of bridging these two.
A number of trusted execution environments (TEEs) have been proposed by both academia and industry. However, most of them require specific hardware or firmware changes and are bound to specific hardware vendors (such as Intel, AMD, ARM, and IBM). In this paper, we propose HyperEnclave, an open and cross-platform process-based TEE that relies on the widely-available virtualization extension to create the isolated execution environment. In particular, HyperEnclave is designed to support the flexible enclave operation modes to fulfill the security and performance demands under various enclave workloads. We provide the enclave SDK to run existing SGX programs on HyperEnclave with little or no source code changes. We have implemented HyperEnclave on commodity AMD servers and deployed the system in a world-leading FinTech company to support real-world privacy-preserving computations. The evaluation on both micro-benchmarks and application benchmarks shows the design of HyperEnclave introduces only a small overhead.
This pattern language1 describes a fourth option. It avoids all the above problems: it doesn’t use broad tests, doesn’t use mocks, doesn’t ignore infrastructure, and doesn’t require architectural changes. It has the speed, reliability, and maintainability of unit tests and the power of broad tests. But it’s not without tradeoffs of its own.
This paper introduces nonblocking transaction composition (NBTC), a new methodology for atomic composition of nonblocking operations on concurrent data structures. Unlike previous software transactional memory (STM) approaches, NBTC leverages the linearizability of existing nonblocking structures, reducing the number of memory accesses that must be executed together, atomically, to only one per operation in most cases (these are typically the linearizing instructions of the constituent operations). Our obstruction-free implementation of NBTC, which we call Medley, makes it easy to transform most nonblocking data structures into transactional counterparts while preserving their liveness and high concurrency. In our experiments, Medley outperforms Lock-Free Transactional Transform (LFTT), the fastest prior competing methodology, by 40--170%. The marginal overhead of Medley's transactional composition, relative to separate operations performed in succession, is roughly 2.2×. For persistent data structures, we observe that failure atomicity for transactions can be achieved "almost for free" with epoch-based periodic persistence. Toward that end, we integrate Medley with nbMontage, a general system for periodically persistent data structures. The resulting txMontage provides ACID transactions and achieves throughput up to two orders of magnitude higher than that of the OneFile persistent STM system.
We propose an asynchronous iterative scheme that allows a set of interconnected nodes to distributively reach an agreement within a pre-specified bound in a finite number of steps. While this scheme could be adopted in a wide variety of applications, we discuss it within the context of task scheduling for data centers. In this context, the algorithm is guaranteed to approximately converge to the optimal scheduling plan, given the available resources, in a finite number of steps. Furthermore, by being asynchronous, the proposed scheme is able to take into account the uncertainty that can be introduced from straggler nodes or communication issues in the form of latency variability while still converging to the target objective. In addition, by using extensive empirical evaluation through simulations we show that the proposed method exhibits state-of-the-art performance.
Wildebeest is an ActivityPub and Mastodon-compatible server whose goal is to allow anyone to operate their Fediverse server and identity on their domain without needing to keep infrastructure, with minimal setup and maintenance, and running in minutes. Wildebeest runs on top Cloudflare's Supercloud, uses Workers and Pages, the D1 database to store metadata and configurations, Zero Trust Access to handle authentication and Images for media handling.
Protohackers challenges you to create servers for network protocols. We give you the protocol spec. You write the server and host it. We automatically test it. There's a global leaderboard for the fastest solve times and a new problem every ~3 weeks.
This repo contains data structure implementations in golang. These implementations are usually focused on use in network related services (i.e. load balancers, etc) and storing stuff like IP addresses rather than any other type of data.
Demonstrate how to implement the Simple Query Protocol, where Java is an implementation detail
Hermit launches linux x86_64 programs in a special, hermetically isolated sandbox to control their execution. Hermit translates normal, nondeterministic behavior, into deterministic, repeatable behavior. This can be used for various applications, including replay-debugging, reproducible artifacts, chaos mode concurrency testing and bug analysis
For a data format, yaml is extremely complicated. It aims to be a human-friendly format, but in striving for that it introduces so much complexity, that I would argue it achieves the opposite result. Yaml is full of footguns and its friendliness is deceptive. In this post I want to demonstrate this through an example
Introducing server migration into QUIC enables a server to change address during a session, without disrupting the connection or requiring an additional handshake with the client. Differently from TCP, this mechanism allows preserving open connections during live relocation of server instances, removing the need of keeping the same IP address across machines, employing SDN solutions to transparently redirect traffic, or re-establishing sessions at application level. Migration is handled directly at transport level, so that relocation appears to an application as a period of peer unreachability.
Rasdaemon is a RAS (Reliability, Availability and Serviceability) logging tool. It records memory errors, using the EDAC tracing events. EDAC is a Linux kernel subsystem with handles detection of ECC errors from memory controllers for most chipsets on i386 and x86_64 architectures. EDAC drivers for other architectures like arm also exists.
Linkding is a simple bookmark service that you can host yourself. It's designed be to be minimal, fast, and easy to set up using Docker.
Backend subsetting—a technique for reducing the number of connections when connecting services together—is useful for reducing costs and may even be necessary for operating within the system limits. For more than a decade, Google used deterministic subsetting as its default backend subsetting algorithm, but although this algorithm balances the number of connections per backend task, deterministic subsetting has a high level of connection churn. Our goal at Google was to design an algorithm with reduced connection churn that could replace deterministic subsetting as the default backend subsetting algorithm
Reconfiguration is a rich source of faults in distributed systems. If you don't reconfigure, you lose availability when you lose additional nodes. But the reconfiguration algorithms are subtle, and it is easy to get them wrong, and lose consistency during reconfiguration. This is because a reconfiguration operation is run concurrently with the read/write operations and with potentially other reconfiguration operations, and they run when the system is already in some turmoil due to faults. Even the consensus papers/protocols do a bad job of addressing the reconfiguration problem; reconfiguration is always addressed as secondarily, incompletely, and many times incorrectly.
In this practical book, Mara Bos, team lead of the Rust library team, helps Rust programmers of all levels gain a clear understanding of low-level concurrency. You’ll learn everything about atomics and memory ordering and how they're combined with basic operating system APIs to build common primitives like mutexes and condition variables. Once you’re done, you’ll have a firm grasp of how Rust’s memory model, the processor, and the role of the operating system all fit together
Suppose that I give you a set of reference strings (“ftp”, “file”, “http”, “https”, “ws”, “wss”). Given a new string, you want to quickly tell whether it is part of this set.
Eio provides an effects-based direct-style IO stack for OCaml 5.0. For example, you can use Eio to read and write files, make network connections, or perform CPU-intensive calculations, running multiple operations at the same time. It aims to be easy to use, secure, well documented, and fast
Fortunately, the Linux Kernel Key Retention Service can perform all the functions of a typical agent process and probably even more! Initially it was designed for kernel services like dm-crypt/ecryptfs, but later was opened to use by userspace programs
In this blog post, we will walk through a new client-side load balancing technique we’ve developed and deployed widely at Twitter which has allowed our microservice architecture to efficiently scale clusters to thousands of instances. We call this new technique deterministic aperture
Homa/Linux is a Linux kernel module that implements the Homa transport protocol. Measurements of Homa/Linux reconfirm Homa's superior performance compared to TCP and DCTCP. In a cluster benchmark with 40 nodes, Homa/Linux provided lower latency than both TCP and DCTCP for all message sizes; for short messages, Homa's 99th percentile tail latency was 7–83x lower than TCP and DCTCP. The benchmarks also show that Homa has eliminated network congestion as a significant performance limitation. Both tail latency and throughput are now limited by software overheads, particularly software congestion caused by imperfect load balancing of the protocol stack across cores. Another factor of 5–10x in performance can be achieved if software overheads can be eliminated in the future.
Pg_crdt is an experimental extension adding support for conflict-free replicated data types (CRDTs) in Postgres.
For a long time I've been thinking that using a closed loop (sync) for measuring latency is wrong
FaaSnap is a VM snapshot-based platform that uses a set of complementary optimizations to improve function cold-start performance for Function-as-a-Service (FaaS) applications. Compact loading set files take better advantage of prefetching. Per-region memory mapping tailors page fault handling depending on the contents of different guest VM memory regions. Hierarchical overlapping memory-mapped regions simplify the mapping process. Concurrent paging allows the guest VM to start execution immediately, rather than pausing until the working set is loaded. Altogether, FaaSnap significantly reduces guest VM page fault handling time on the critical path and improves overall function loading performance. Experiments on serverless benchmarks show that it reduces end-to-end function execution by up to 3.5x compared to state-of-the-art, and on average is only 3.5% slower than snapshots cached in memory. Moreover, we show that FaaSnap is resilient to changes of working set and remains efficient under bursty workloads and when snapshots are located in remote storage
Firecracker has the ability to restore a MicroVM snapshot in as little as 4ms (or about 10ms for a full decent-sized Linux system), and it's no doubt possible to optimize this further. I expect that sub-millisecond restore times are possible, as are restore times with a CPU cost not much higher than a traditional fork (or even a traditional thread start). This reality changes the way we think about what VMs can be used for - making them useful for much smaller, shorter-lived, and transient applications than most would assume.
DAMON is a data access monitoring framework subsystem for the Linux kernel
One important aspect that is not often discussed in depth is the role of huge pages and the translation lookaside buffer (TLB). In this series of posts, we’ll explain what they are, why they matter, and how they can be used.
SplinterDB is a key-value store designed for high performance on fast storage devices
Git notes are powerful tools. And they could solve so many problems—if only they were better known and easier to use.
The ultimate goal of this project is to implement a multi-tier method-JIT for Lua. We employ an unique approach where the interpreter and the JIT tiers are automatically generated from a semantical description of the bytecodes. We believe this will ultimately result in less engineering cost, cleaner and more maintainable code, as well as the generalizability to support other languages.
My recommendation is to use a 160-bit (20 byte) random value that is then URL-safe base64-encoded. The URL-safe base64 variant can be used pretty much anywhere, and is reasonably compact.
Contemporary pervasive computing environments demand mechanism for coherently addressing high-level user needs despite changing availability of resources. We propose the formalization of goals as the semantic basis for this mechanism, and sketch a system architecture that separates policy-rich goals-level planning code from a policy-neutral component assembly model.
Below, I will propose a new programming technique that I call goal-oriented programming, which goes back to the roots of step-wise refinement of programs. I will show that this technique is not only easier than programming with threads, but it also provides better performance and portability. I will illustrate this by means of an example that I will work out first using threads, then using goal-oriented programming. I will conclude with a description of my experience with a transactional file system I have built this way.
Crsql is a run time loadable extension for SQLite that adds CRDT and sync support.
DDlog is a programming language for incremental computation. It is well suited for writing programs that continuously update their output in response to input changes. With DDlog, the programmer does not need to worry about writing incremental algorithms. Instead they specify the desired input-output mapping in a declarative manner, using a dialect of Datalog. The DDlog compiler then synthesizes an efficient incremental implementation
The 5-layer TCP and 7-layer OSI models are taught as high-level frameworks in which the various protocols that are used in computer networks operate. These models provide valid insights in the organization of network functionalities and protocols; however, the difficulties to fit some crucial technologies within them hints that they don't provide a complete model for the organization of -- and relationships between -- different mechanisms in a computer network. Recently, a recursive model for computer networks was proposed, which organizes networks in layers that conceptually provide the same mechanisms through a common interface. Instead of defined by function, these layers are distinguished by scope. We report our research on a model for computer networks. Following a rigorous regime alternating design with the evaluation of its implications in an implementation, we converged on a recursive architecture, named Ouroboros. One of our main main objectives was to disentangle the fundamental mechanisms that are found in computer networks as much as possible. Its distinguishing feature is the separation of unicast and broadcast as different mechanisms, giving rise to two different types of layers. These unicast and broadcast layers can easily be spotted in today's networks. This article presents the concepts underpinning Ouroboros, details its organization and interfaces, and introduces the free software prototype. We hope the insights it provides can guide future network design and implementation.
The rlite project provides a lightweight Free and Open Source implementation of the Recursive InterNetwork Architecture (RINA) for GNU/Linux operating systems
IRATI is an open source implementation of the RINA architecture targeted to the OS/Linux system, initially developed by the FP7-IRATI project
A Configuration Management System for computers that are Pets, not Cattle. This is for people who need to administer a handful of machines, all fairly different from each other and all Very Important. Those systems are not Cattle! They’re actually a bit more than Pets. They’re almost Family. For example: a laptop, workstation, and that personal tiny server in Sweden. They are all named after something dear. pets works on Linux systems, both Debian-like (APT) and RedHat-like (YUM).
Low-cost, easy to use KVM over IP devices
Let's talk about loadbalancing techniques and how they evolved at Google in response to various practical failure modes, from 2008 to 2012.
Clearly, it's important for us to figure out the right order to update our cells. Broadly, there are two solutions to this problem: dirty marking and topological sorting
There is a common view out there that the QUIC transport protocol (RFC 9000) is just another refinement to the original TCP transport protocol [1] [2]. I find it hard to agree with this sentiment, and for me QUIC represents a significant shift in the set of transport capabilities available to applications in terms of communication privacy, session control integrity and flexibility. QUIC embodies a different communications model that makes intrinsically useful to many more forms of application behaviours. Oh, yes. It’s also faster than TCP! In my opinion It’s likely that over time QUIC will replace TCP in the public Internet. So, for me QUIC is a lot more than just a few tweaks to TCP. Here we will describe both TCP and QUIC and look at the changes that QUIC has bought to the transport table.
Link speeds are now over 100Gbps and rising with hardware round-trip times (RTTs) of five or ten microseconds, which may go lower over the next few years. But that raw network speed is not accessible by applications; in particular, the latency and throughput for small messages is not anywhere near what the hardware numbers would support. "We're one to two orders of magnitude off—or more." The problem is the overhead in the software network stacks
This essentially makes WebAssembly a register machine without liveness analysis, but not only that, it’s a register machine that isn’t even in SSA form - both of the tools at our disposal to do optimisation are unavailable. In a true, optimising compiler we can recreate that information, but WebAssembly was already emitted by a compiler that generated that information once. There’s no technical reason why we should have to do that work again, it’s just a deficiency in the language. A compiler that has to act on a stream of WebAssembly has less ability to recreate this information and will end up generating significantly worse code for essentially no good reason.
Sharding is widely used to scale an application. Despite a decade of effort to build generic sharding frameworks that can be reused across different applications, the extent of their success remains unclear. We attempt to answer a fundamental question: what barriers prevent a sharding framework from getting adopted by the majority of sharded applications?
Our solution was bufferv2. Bufferv2 is a non-contiguous, reference counted, pooled, copy-on-write, buffer-like data structure.
Papers on leaderless consensus protocols such as EPaxos and Atlas describe the protocols from the ground up. This approach helps a reader without prior knowledge to understand how they work, but at the same time it makes it hard for an expert to grok them because they need to trace and check all the assumptions from scratch rather than rely on already proved protocols. This may explain why the leaderless consensus protocols didn't get traction in the industry despite being known since 2013. In this post I present a leaderless consensus protocol built in a modular way on top of known consensus protocol. Its modularity helps in understanding and retrofitting into existing solutions.
Ibid
Snapshot Isolation (SI), is a multi-version concurrency control algorithm introduced in [BBGMOO95] and later implemented by Oracle. SI avoids many concurrency errors, and it never delays read-only transactions. However it does not guarantee serializability. It has been widely assumed that, under SI, read-only transactions always execute serializably provided the concurrent update transactions are serializable. The reason for this is that all SI reads return values from a single instant of time when all committed transactions have completed their writes and no writes of non-committed transactions are visible. This seems to imply that read-only transactions will not read anomalous results so long as the update transactions with which they execute do not write such results. In the current note, however, we exhibit an example contradicting these assumptions: it is possible for an SI history to be non-serializable while the sub-history containing all update transactions is serializable.
AWS Lambda offers scalable serverless functions as a service, with the ability to scale up and add capacity in hundreds of milliseconds. When we launched Lambda in 2015, we supported functions up to 250MB in size. In 2020, we set out to allow Lambda customers to bring arbitrary container images, up to 10GBs, without adding additional scale-up latency. This talk covers how we did that, using erasure coding, convergent encryption, deterministic filesystem flattening, and some other unique tools. We’ll cover how our virtualization-based environment, and the Firecracker VMM, allowed us to hide all that magic from customer’s functions.
Awesome list of distributed transactions.
A list of papers about distributed consensus (heidihoward)
Mkcert is a simple tool for making locally-trusted development certificates. It requires no configuration.
Compensating transactions are intended to handle situations where it is required to undo either committedor uncommitted transactions that affect other transactions, without resorting to cascading aborts. This stands in sharp contrast to the standard approach to transaction recovery where cascading aborts are avoided by requiring transactions to read only committed data, and where committed transactions are treated as permanent and irreversible. We argue that this standard approach to recovery is not suitable for a wide range of advanced database applications, in particular those applications that incorporate long-duration or nested transactions. We show how compensating transactions can be effectively used to handle these types of applications. We present a model that allows the definition of a variety of types of correct compensation. These types of compensation range from traditional undo, at one extreme, to application-dependent, special-purpose compensating transactions, at the other extreme.
Aurae is a free and open source Rust project which houses a memory-safe systems runtime daemon built specifically for enterprise distributed systems called auraed. The auraed daemon can be ran as a pid 1 on a Linux kernel and manages containers, virtual machines, and spawning short-lived nested virtual instances of itself for an additional layer of isolation. Aurae is designed to work well with (but is deliberately decoupled from) Kubernetes. The auraed daemon runs "under" Kubernetes and exposes the Aurae Standard Library over an mTLS authenticated gRPC server