We’ve been hard at work on the next major revision of Tokio, Rust’s asynchronous runtime. Today, a complete rewrite of the scheduler has been submitted as a pull request. The result is huge performance and latency improvements.
In general, gRPC and RSocket attempt to solve different problems. gRPC is an RPC framework using HTTP/2. RSocket is useful.
CoreNIC is the product name of Netronome's NIC firmware implementation for Agilio SmartNICs. It provides a network interface compatible with the nfp Linux driver and DPDK
The Eytzinger layout offers the best all-around performance over a wide range of array lengths
We introduce protocol-aware recovery(PAR), a new approach that exploits protocol-specific knowledge to correctly recover from storage faults in distributed systems.
In this paper, we argue that server CPUs are ill-suited to run serverless workloads (i.e., lambdas) and present λ-NIC, an open-source framework, that runs interactive workloads directly on a SmartNIC
Rendering text, how hard could it be? As it turns out, incredibly hard! To my knowledge, literally no system renders text "perfectly". It's all best-effort, although some efforts are more important than others.
If we want to achieve a result like docker, without paying the full complexity price of docker, it pays to ask ourselves “what is docker made of?”
A new edition of the book Modern C is now available under a CC license
Tecton is a language and tool framework whose purpose is to foster structured development of computational systems (both hardware designs and software), using abstraction and specialization as the key structuring mechanisms
A Go cache library worthy of being compared to non-Go cache implementations
Welcome to FBSwiki. This is the wiki for the Second Edition of text Feedback Systems by Karl J. Åström and Richard M. Murray.
Cloud computing has motivated renewed interest in resource allocation problems with new consumption models. A common goal is to share a resource, such as CPU or I/O bandwidth, among distinct users with different demand patterns as well as different quality of service requirements. To ensure these service requirements, cloud offerings often come with a service level agreement (SLA) between the provider and the users. An SLA specifies the amount of a resource a user is entitled to utilize. In many cloud settings, providers would like to operate resources at high utilization while simultaneously respecting individual SLAs. There is typically a tradeoff between these two objectives; for example, utilization can be increased by shifting away resources from idle users to "scavenger" workload, but with the risk of the former then becoming active again. We study this fundamental tradeoff by formulating a resource allocation model that captures basic properties of cloud computing systems, including SLAs, highly limited feedback about the state of the system, and variable and unpredictable input sequences. Our main result is a simple and practical algorithm that achieves near-optimal performance on the above two objectives. First, we guarantee nearly optimal utilization of the resource even if compared to the omniscient offline dynamic optimum. Second, we simultaneously satisfy all individual SLAs up to a small error. The main algorithmic tool is a multiplicative weight update algorithm, and a duality argument to obtain its guarantees. Experiments on both synthetic and real production traces demonstrate the merits of our algorithm in practical settings.
We present a new lock-free multiple-producer and multiple-consumer (MPMC) FIFO queue design which is scalable and, unlike existing high-performant queues, very memory efficient. Moreover, the design is ABA safe and does not require any external memory allocators or safe memory reclamation techniques, typically needed by other scalable designs. In fact, this queue itself can be leveraged for object allocation and reclamation, as in data pools. We use FAA (fetch-and-add), a specialized and more scalable than CAS (compare-and-set) instruction, on the most contended hot spots of the algorithm. However, unlike prior attempts with FAA, our queue is both lock-free and linearizable. We propose a general approach, SCQ, for bounded queues. This approach can easily be extended to support unbounded FIFO queues which can store an arbitrary number of elements. SCQ is portable across virtually all existing architectures and flexible enough for a wide variety of uses. We measure the performance of our algorithm on the x86-64 and PowerPC architectures. Our evaluation validates that our queue has exceptional memory efficiency compared to other algorithms and its performance is often comparable to, or exceeding that of state-of-the-art scalable algorithms.
What are the uses to which gossip is particularly well-matched, and what are its limitations? What alternatives are there to gossip-based solutions, and when would we be better-off using a non-gossip protocol? When, in effect, is gossip the technology of choice?
This style of short-term planning, direct customer contact, and continuous iteration is well suited to software with a simple core and lots of customer visible features that are incrementally useful. It is not so well suited to software which has a very simple interface and tons of hidden internal complexity, software which isn’t useful until it’s fairly complete, or leapfrog solutions the customer can’t imagine.
The Haskell Phrasebook is a free quick-start Haskell guide comprised of a sequence of small annotated programs. It provides a cursory overview of selected Haskell features, jumping-off points for further reading, and recommendations to help get you writing programs as soon as possible
Andjelko Iharos explores the goals, design and the choices behind the implementations of EBtree, and how they produce a very fast and versatile data storage for many of HAProxys advanced features. EBtree is a different take on the ubiquitous tree data structure, and has been helping HAProxy, a high performance open source software load balancer, to keep up with demands for over a decade.
In this post, I describe a tri-factor model which I find more useful in the analysis of hash table algorithms and discuss several state-of-the-art algorithms in the context of this model.
The backoff algorithm is a part of Media Access Control (MAC) protocol which used to avoid collision in the Mobile Ad hoc Network (MANET). When the nodes in the network try to access the channel, one of these nodes gains access the channel while the other nodes still contend for a time period. Many backoff algorithms have been proposed to improve network performance. One of these algorithms is Fibonacci increment backoff (FIB), FIB algorithm achieves higher throughput than the exponential backoff that is used by the standard IEEE 802.11 when it used in a mobile ad hoc network. The Pessimistic Linear-exponential Backoff (PLEB) is another proposed backoff algorithm which uses a combination of two increment behaviors; Exponential backoff and Linear backoff this scheme merges the advantages of the two increment behaviors. Exponential increments give enough backoff time to enhance the network throughput by reducing the number of transmission failures, and the linear increment reduces the average packet delay. Ad hoc On demand Distance Vector (AODV) routing protocol use a demand-driven route establishment procedure. AODV maintain the route table at each node. This paper uses different backoff algorithms at different values of rebroadcast probability.
Say no to $ apt install vim in containers! cntr is a replacement for docker exec that brings all your developers tools with you. This allows to ship minimal runtime image in production and limit the surface for exploits.
In this article, I want to highlight how simple linker map files are and how much they can teach you about the program you are working on.
With the rapid developments of virtualization techniques, cloud data centers have had enabled cost effective, flexible, and customizable deployments of applications in virtualized infrastructure. Virtual machine placement problem is a problem of paramount importance to the design of cloud data centers. It aims to assign each virtual machine to a server in the cloud environment. Typically, the problem involves complex relations and multiple design factors as well as local policies that govern the assignment decisions. It also involves different parties including cloud administrators and customers that we need to consider their preferences while opting for a solution. Thus, it is significant to not only return an optimized solution to the underlying placement problem but also a solution that reflects the given preferences of these parties. In this paper, we provide a detailed review on the role of preferences in the current literature of the virtual machine placement problem. We further discuss some challenges and identify possible research opportunities to better incorporate preferences within the problem.
Real-time performance monitoring, done right
A circular buffer written in C using Posix calls to create a contiguously mapped memory space.
The wait-free hierarchy hrm classifies AsynchronousSharedMemory object types T by consensus number, where a type T has consensus number n if with objects of type T and atomic registers (all initialized to appropriate values) it is possible to solve wait-free consensus (i.e., agreement, validity, wait-free termination) for n processes but not for n 1 processes. The consensus number of any type is at least 1, since 1-process consensus requires no interaction, and may range up to ∞ for particularly powerful objects.
Highly available and high-performance message logging system is critical building block for various use cases that require global ordering, especially for deterministic distributed transactions. To achieve availability, we maintain multiple replicas that have the same payloads in exactly the same order. This introduces various challenging issues such as consistency between replicas after failure, while minimizing performance degradation. Replicated state machine-based consensus protocols are the most suitable candidates to fulfill those requirements, but double-write problem and different logging granularity make it hard to keep the system efficient. This paper suggests a novel way to build a replicated log store on top of Raft consensus protocol, aiming at providing the same level of consistency as well as fault-tolerance without sacrificing the throughput of the system.
Distributed systems are hard to reason about largely because of uncertainty about what may go wrong in a particular execution, and about whether the system will mitigate those faults. Tools that perturb executions can help test whether a system is robust to faults, while tools that observe executions can help better understand their system-wide effects. We present Box of Pain, a tracer and fault injector for unmodified distributed systems that addresses both concerns by interposing at the system call level and dynamically reconstructing the partial order of communication events based on causal relationships. Box of Pain’s lightweight approach to tracing and focus on simulating the effects of partial failures on communication rather than the failures themselves sets it apart from other tracing and fault injection systems. We present evidence of the promise of Box of Pain and its approach to lightweight observation and perturbation of distributed systems.
OneFile is a Software Transactional Memory (STM) meant to make it easy to implement lock-free and wait-free data structures
This book is a guide to how we do product development at Basecamp. It’s also a toolbox full of techniques that you can apply in your own way to your own process.
Partisan is the design of an alternative runtime system for improved scalability and reduced latency in actor applications
The difficulty with implementing saturated arithmetic is creating all the checks for overflow. The naive method requires several branches in the code. Since jumps tend to be slow to execute, the resulting math executes slowly. For performance, branch-free code is suggested. Another problem is that the overflow or carry flags are not directly accessible from high level languages like C. This means we may need rather complex round-about methods to determine whether or not an overflow occurred. Such code is difficult to make fast.
Priority queues are fundamental abstract data structures, often used to manage limited resources in parallel programming. Several proposed parallel priority queue implementations are based on skiplists, harnessing the potential for parallelism of the add() operations. In addition, methods such as Flat Combining have been proposed to reduce contention by batching together multiple operations to be executed by a single thread. While this technique can decrease lock-switching overhead and the number of pointer changes required by the removeMin() operations in the priority queue, it can also create a sequential bottleneck and limit parallelism, especially for non-conflicting add() operations. In this paper, we describe a novel priority queue design, harnessing the scalability of parallel insertions in conjunction with the efficiency of batched removals. Moreover, we present a new elimination algorithm suitable for a priority queue, which further increases concurrency on balanced workloads with similar numbers of add() and removeMin() operations. We implement and evaluate our design using a variety of techniques including locking, atomic operations, hardware transactional memory, as well as employing adaptive heuristics given the workload.
Package file handles write-ahead logs and space management of os.File-like entities
A date and time library based on the C 11/14/17 <chrono> header. Slightly modified versions of "date.h" and "tz.h" were voted into the C 20 working draft
Arachne is a new user-level implementation of threads that provides both low latency and high throughput for applications with extremely short-lived threads (only a few microseconds). Arachne is core-aware: each application determines how many cores it needs, based on its load; it always knows exactly which cores it has been allocated, and it controls the placement of its threads on those cores. A central core arbiter allocates cores between applications. Adding Arachne to memcached improved SLO-compliant throughput by 37%, reduced tail latency by more than 10x, and allowed memcached to coexist with background applications with almost no performance impact. Adding Arachne to the RAMCloud storage system increased its write throughput by more than 2.5x. The Arachne threading library is optimized to minimize cache misses; it can initiate a new user thread on a different core (with load balancing) in 320 ns. Arachne is implemented entirely at user level on Linux; no kernel modifications are needed
High-performance servers are non-uniform memory access (NUMA) machines. To fully leverage these machines, programmers need efficient concurrent data structures that are aware of the NUMA performance artifacts. We propose Node Replication (NR), a black-box approach to obtaining such data structures. NR takes an arbitrary sequential data structure and automatically transforms it into a NUMA-aware concurrent data structure satisfying linearizability. Using NR requires no expertise in concurrent data structure design, and the result is free of concurrency bugs. NR draws ideas from two disciplines: shared-memory algorithms and distributed systems. Briefly, NR implements a NUMA-aware shared log, and then uses the log to replicate data structures consistently across NUMA nodes. NR is best suited for contended data structures, where it can outperform lock-free algorithms by 3.1x, and lock-based solutions by 30x. To show the benefits of NR to a real application, we apply NR to the data structures of Redis, an in-memory storage system. The result outperforms other methods by up to 14x. The cost of NR is additional memory for its log and replicas.
Mimalloc (pronounced "me-malloc") is a general purpose allocator with excellent performance characteristics. Initially developed by Daan Leijen for the run-time systems of the Koka and Lean languages. It is a drop-in replacement for malloc and can be used in other programs without code changes
In this paper, we propose a sizing of network buffers that is easy to deploy, requires no application-level state or semantics, requires no measurements from the network, and is fully decentralized. Our approach does not rely on machine learning or other complex algorithms. The tradeoff inherent in our proposed buffer size is that it is very likely inefficient across a wide range of deployments. However, there is likely a small, finite number of networks where our proposed buffer size is optimal. This paper is tailored to such networks.
In this paper, we perform the first systematic study onconcurrency bugs in real Go programs. We studied six pop-ular Go software including Docker, Kubernetes, and gRPC.We analyzed 171 concurrency bugs in total, with more thanhalf of them caused by non-traditional, Go-specific problems.Apart from root causes of these bugs, we also studied theirfixes, performed experiments to reproduce them, and eval-uated them with two publicly-available Go bug detectors.Overall, our study provides a better understanding on Go’sconcurrency models and can guide future researchers andpractitioners in writing better, more reliable Go softwareand in developing debugging and diagnosis tools for Go.
After ten years in print, our publisher decided against further printings and has reverted the rights to us. We are publishing Elements of Programming in two forms: a free PDF and a paperback.
Some software, languages and models that have HobbyHorses.
Determining whether online users are authorized to access digital objects is central to preserving privacy. This paper presents the design, implementation, and deployment of Zanzibar, a global system for storing and evaluating access control lists. Zanzibar provides a uniform data model and configuration language for expressing a wide range of access control policies from hundreds of client services at Google, including Calendar, Cloud, Drive, Maps, Photos, and YouTube. Its authorization decisions respect causal ordering of user actions and thus provide external consistency amid changes to access control lists and object contents. Zanzibar scales to trillions of access control lists and millions of authorization requests per second to support services used by billions of people. It has maintained 95th-percentile latency of less than 10 milliseconds and availability of greater than 99.999% over 3 years of production use.
The Berkeley Tree DataBase provides very fast storage of scalar-valued timeseries data
Snmalloc is a research allocator. Its key design features are: Memory that is freed by the same thread that allocated it does not require any synchronising operations. Freeing memory in a different thread to initially allocated it, does not take any locks and instead uses a novel message passing scheme to return the memory to the original allocator, where it is recycled. The allocator uses large ranges of pages to reduce the amount of meta-data required.
Caching is a common approach for improving performance, yet most implementations use strictly classical techniques. In this article we will explore the modern methods used by Caffeine, an open-source Java caching library, that yield high hit rates and excellent concurrency.
Shuffle Sharding is a general-purpose technique, and you can also choose to Shuffle Shard across many kinds of resources, including pure in-memory data-structures such as queues, rate-limiters, locks and other contended resources.
In the popular computer game of Tetris, the player is given a sequence of tetromino pieces and must pack them into a rectangular gameboard initially occupied by a given configuration of filled squares; any completely filled row of the gameboard is cleared and all pieces above it drop by one row. We prove that in the offline version of Tetris, it is NP-complete to maximize the number of cleared rows, maximize the number of tetrises (quadruples of rows simultaneously filled and cleared), minimize the maximum height of an occupied square, or maximize the number of pieces placed before the game ends. We furthermore show the extreme inapproximability of the first and last of these objectives to within a factor of p^(1-epsilon), when given a sequence of p pieces, and the inapproximability of the third objective to within a factor of (2 - epsilon), for any epsilon>0. Our results hold under several variations on the rules of Tetris, including different models of rotation, limitations on player agility, and restricted piece sets.
Current cloud providers use fixed-price based mechanisms to allocate Virtual Machine (VM) instances to their users. The fixed-price based mechanisms do not provide an efficient allocation of resources and do not maximize the revenue of the cloud providers. A better alternative would be to use combinatorial auction-based resource allocation mechanisms. In this PhD dissertation we will design, study and implement combinatorial auction-based mechanisms for efficient provisioning and allocation of VM instances in cloud computing environments. We present our preliminary results consisting of three combinatorial auction-based mechanisms for VM provisioning and allocation.We also present an efficient bidding algorithm that can be used by the cloud users to decide on how to bid for their requested bundles of VM instances
Same as the other ones but Heidi Howard so ¯\_(ツ)_/¯
This is the missing manual. I reckon most engineers can wrap their heads around all the most important concepts and common quirks in less than an hour. That’s our goal here. An hour is a pretty small investment to learn something you literally can’t do any other way.
Distributed consensus, the ability to reach agreement in the face of failures and asynchrony, is a fundamental primitive for constructing reliable distributed systems from unreliable components. The Paxos algorithm is synonymous with distributed consensus, yet it performs poorly in practice and is famously difficult to understand. In this paper, we re-examine the foundations of distributed consensus. We derive an abstract solution to consensus, which utilises immutable state for intuitive reasoning about safety. We prove that our abstract solution generalises over Paxos as well as the Fast Paxos and Flexible Paxos algorithms. The surprising result of this analysis is a substantial weakening to the quorum requirements of these widely studied algorithms.
Workload predictions in cloud computing is obviously an important topic. Most of the existing publications employ various time series techniques, that might be difficult to implement. We suggest here another route, which has already been successfully used in financial engineering and photovoltaic energy. No mathematical modeling and machine learning procedures are needed. Our computer simulations via realistic data, which are quite convincing, show that a setting mixing algebraic estimation techniques and the daily seasonality behaves much better. An application to the computing resource allocation, via virtual machines, is sketched out.
Fine-grained, cgroup-based tool for profiling memory usage over time of a process tree
Fast, portable, non-Turing complete expression evaluation with gradual typing (Go) - google/cel-go
We prove that no fully transactional system can provide fast read transactions (including read-only ones that are considered the most frequent in practice). Specifically, to achieve fast read transactions, the system has to give up support of transactions that write more than one object. We prove this impossibility result for distributed storage systems that are causally consistent, i.e., they do not require to ensure any strong form of consistency. Therefore, our result holds also for any system that ensures a consistency level stronger than causal consistency, e.g., strict serializability. The impossibility result holds even for systems that store only two objects (and support at least two servers and at least four clients). It also holds for systems that are partially replicated. Our result justifies the design choices of state-of-the-art distributed transactional systems and insists that system designers should not put more effort to design fully-functional systems that support both fast read transactions and ensure causal or any stronger form of consistency.
Distributed systems are hard to reason about largely because of uncertainty about what may go wrong in a particular execution, and about whether the system will mitigate those faults. Tools that perturb executions can help test whether a system is robust to faults, while tools that observe executions can help better understand their system-wide effects. We present Box of Pain, a tracer and fault injector for unmodified distributed systems that addresses both concerns by interposing at the system call level and dynamically reconstructing the partial order of communication events based on causal relationships. Box of Pain's lightweight approach to tracing and focus on simulating the effects of partial failures on communication rather than the failures themselves sets it apart from other tracing and fault injection systems. We present evidence of the promise of Box of Pain and its approach to lightweight observation and perturbation of distributed systems.
The classification of the most used load balancing algorithms in distributed systems (including cloud technology, cluster systems, grid systems) is described. Comparative analysis of types of the load balancing algorithms is conducted in accordance with the classification, the advantages and drawbags of each type of the algorithms are shown. Performance indicators characterizing each algorithm are indicated.
Smoltcp is a standalone, event-driven TCP/IP stack that is designed for bare-metal, real-time systems. Its design goals are simplicity and robustness. Its design anti-goals include complicated compile-time computations, such as macro or type tricks, even at cost of performance degradation.
In this paper we study the properties of eventually consistent distributed systems that feature arbitrarily complex semantics and mix eventual and strong consistency. These systems execute requests in a highly-available, weakly-consistent fashion, but also enable stronger guarantees through additional inter-replica synchronization mechanisms that require the ability to solve distributed consensus. We use the seminal Bayou system as a case study, and then generalize our findings to a whole class of systems. We show dubious and unintuitive behaviour exhibited by those systems and provide a theoretical framework for reasoning about their correctness. We also state an impossibility result that formally proves the inherent limitation of such systems, namely temporary operation reordering, which admits interim disagreement between replicas on the relative order in which the client requests were executed.
This article is about some of the little tricks that I use in Vim. None of them are deep dives, and I encourage you to learn more about whatever’s interesting. They also aren’t connected to each other. But that’s fine. In total, they’re more than enough to help a lot.
The growing popularity of workflows in the cloud domain promoted the development of sophisticated autoscaling policies that allow automatic allocation and deallocation of resources. However, many state-of-the-art autoscaling policies for workflows are mostly plan-based or designed for batches (ensembles) of workflows. This reduces their flexibility when dealing with workloads of workflows, as the workloads are often subject to unpredictable resource demand fluctuations. Moreover, autoscaling in clouds almost always imposes budget constraints that should be satisfied. The budget-aware autoscalers for workflows usually require task runtime estimates to be provided beforehand, which is not always possible when dealing with workloads due to their dynamic nature. To address these issues, we propose a novel Performance-Feedback Autoscaler (PFA) that is budget-aware and does not require task runtime estimates for its operation. Instead, it uses the performance-feedback loop that monitors the average throughput on each resource type. We implement PFA in the popular Apache Airflow workflow management system, and compare the performance of our autoscaler with other two state-of-the-art autoscalers, and with the optimal solution obtained with the Mixed Integer Programming approach. Our results show that PFA outperforms other considered online autoscalers, as it effectively minimizes the average job slowdown by up to 47% while still satisfying the budget constraints. Moreover, PFA shows by up to 76% lower average runtime than the competitors.
TLA and Alloy are good for absolutes. You can say “there exists an error in this design”. But you can’t express statistical properties, like “this should take an average of less than 10 minutes” or “it is more likely this recovers than crashes.” Nor can you assign weights to actions, like “this has an 80% chance of continuing and a 20% chance of failing.” For that, we need probabilistic modeling (PM). PRISM is one such modeling language, and comes with some really powerful tools to inspect probabilities. But there’s a fundamental tradeoff in modeling: the more powerful the checker, the less expressive the language. And to get such a powerful checker, PRISM has to make some serious expressivity tradeoffs.
Build and run tiny vms from Dockerfiles. Small and sleek.
EventCount allows to wait for arbitrary predicates in non-blocking algorithms. Think of condition variable, but wait predicate does not need to be protected by a mutex.
In this paper, we study the problem of computing aggregates with gossip-style protocols. Our first contribution is an analysis of simple gossip-based protocols for the computations of sums, averages, random samples, quantiles, and other aggregate functions, and we show that our protocols converge exponentially fast to the true answer when using uniform gossip.Our second contribution is the definition of a precise notion of the speed with which a node’s data diffuses through the network. We show that this diffusion speed is at theheart of the approximation guarantees for all of the above problems. We analyze the diffusion speed of uniform gossip in the presence of node and link failures, as well as for flooding-based mechanisms. The latter expose interesting connections to random walks on graphs
With the advent of Software Defined Networks (SDN), Network Function Virtualisation (NFV) or Service Function Chaining (SFC), operators expect networks to support flexible services beyond the mere forwarding of packets. The network programmability framework which is being developed within the IETF by leveraging IPv6 Segment Routing enables the realisation of in-network functions. In this paper, we demonstrate that this vision of in-network programmability can be realised. By leveraging the eBPF support in the Linux kernel, we implement a flexible framework that allows network operators to encode their own network functions as eBPF code that is automatically executed while processing specific packets. Our lab measurements indicate that the overhead of calling such eBPF functions remains acceptable. Thanks to eBPF, operators can implement a variety of network functions. We describe the architecture of our implementation in the Linux kernel. This extension has been released with Linux 4.18. We illustrate the flexibility of our approach with three different use cases: delay measurements, hybrid networks and network discovery. Our lab measurements also indicate that the performance penalty of running eBPF network functions on Linux routers does not incur a significant overhead.
Whether you’re a Site Reliability Engineer (SRE), developer, or executive, as a service provider you have a vested interest in (or responsibility for) ensuring system reliability. However, “system reliability” in and of itself can be a vague and subjective term that depends on the specific needs of the enterprise. So, SLOs are necessary because they define your Quality of Service (QoS) and reliability goals in concrete, measurable, objective terms.