I often declare these kinds of wars on unneeded copies and allocations. Talking to the global allocator is not exactly cheap and making extra copies is just not elegant. Sometimes it means using a bump allocator, or even writing an extension for it. Sometimes it means extensive use of borrowing.
This technique optimizes the expressiveness of the call site at the expense of that of the interface and implementation. Indeed, the interface needs a name and a comment to help clarify how to use it, and the implementation has more code to turn the parameters around.
Get started with your own BPF application quickly and painlessly with libbpf-bootstrap scaffolding, which takes care of all the mundane setup steps and lets you dive right into BPF fun with minimal boilerplate. We'll take a look at what libbpf-bootstrap provides and how everything is tied together.
Linux has a mechanism that can turn asynchronous signals into synchronous events that can be handled by epoll. It’s a great idea, really. If signals can be delivered via a file descriptor, much like data, then existing I/O multiplexing mechanisms like epoll can be used to handle signals much like we handle other file descriptors that represent local files or sockets. This is exactly the idea behind Linux’s signalfd(2) system call.
This article will describe thread-local storage on ELF platforms in detail, and touch on other related topics, such as thread-specific data keys and Windows/macOS TLS.
Here’s a problem. I have a set of keys and values. I also have some servers for a key-value store. This could be memcached, Redis, MySQL, whatever. I want to distribute the keys across the servers so…
We present a novel dynamic reconfiguration protocol for the MongoDB replication system that extends and generalizes the single server reconfiguration protocol of the Raft consensus algorithm. Our protocol decouples the processing of configuration changes from the main database operation log, which allows reconfigurations to proceed in cases when the main log is prevented from processing new operations. Additionally, this decoupling allows for configuration state to be managed by a logless replicated state machine, by optimizing away the explicit log and storing only the latest version of the configuration, avoiding the complexities of a log-based protocol. We provide a formal specification of the protocol along with results from automated verification of its safety properties. We also provide an experimental evaluation of the protocol benefits, showing how reconfigurations are able to quickly restore a system to healthy operation in scenarios where node failures have stalled the main operation log.
Libcsp is a high-performance concurrency library for C, influenced by the CSP model.
Cosmopolitan Libc makes C a build-once run-anywhere language, like Java, except it doesn't need an interpreter or virtual machine. Instead, it reconfigures stock GCC and Clang to output a POSIX-approved polyglot format that runs natively on Linux + Mac + Windows + FreeBSD + OpenBSD + NetBSD + BIOS with the best possible performance and the tiniest footprint imaginable.
In this article, we discuss how AWS leverages idempotent API operations to mitigate some of the potentially undesirable side effects of retrying, in order to deliver a more robust service while still leveraging the benefits of retries to simplify client-side code.
Exploring how buffer pool management works in databases by building one
We present Hemlock, a novel mutual exclusion locking algorithm that is extremely compact, requiring just one word per thread plus one word per lock, but which still provides local spinning in most circumstances, high throughput under contention, and low latency in the uncontended case. Hemlock is context-free -- not requiring any information to be passed from a lock operation to the corresponding unlock -- and FIFO. The performance of Hemlock is competitive with and often better than the best scalable spin locks.
Atkinson Hyperlegible font is named after Braille Institute founder, J. Robert Atkinson. What makes it different from traditional typography design is that it focuses on letterform distinction to increase character recognition, ultimately improving readability. We are making it free for anyone to use!
The Piecewise Geometric Model index (PGM-index) is a data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes while providing the same worst-case query time guarantees.
Key-value stores are everywhere. They power a diverse set of data-driven applications across both industry and science. Key-value stores are used as stand-alone NoSQL systems, but they are also used as part of more complex pipelines and systems such as machine learning and relational systems. In this tutorial, we survey the state-of-the-art approaches to designing the core storage engine of a key-value store system. We focus on several critical components of the engine, starting with the core data structures used to lay out data across the memory hierarchy. We also discuss design issues related to caching, timestamps, concurrency control, updates, shifting workloads, as well as mixed workloads with both analytical and transactional characteristics. We cover designs that are read-optimized, write-optimized, as well as hybrids. We draw examples from several state-of-the-art systems, but we also put everything together in a general framework which allows us to model storage engine designs under a single unified model and reason about the expected behavior of diverse designs. In addition, we show that, given the vast number of possible storage engine designs and their complexity, there is a need to describe and communicate design decisions in a high-level descriptive language, and we present a first version of such a language. We then use that framework to present several open challenges in the field, especially in terms of supporting increasingly more diverse and dynamic applications in the era of data science and AI, including neural networks, graphs, and data versioning.
Guidelines for RMM Level 2 API design
Modern web services use in-memory caching extensively to increase throughput and reduce latency. There have been several workload analyses of production systems that have fueled research in improving the effectiveness of in-memory caching systems. However, the coverage is still sparse considering the wide spectrum of industrial cache use cases. In this work, we significantly further the understanding of real-world cache workloads by collecting production traces from 153 in-memory cache clusters at Twitter, sifting through over 80 TB of data, and sometimes interpreting the workloads in the context of the business logic behind them. We perform a comprehensive analysis to characterize cache workloads based on traffic pattern, time-to-live (TTL), popularity distribution, and size distribution. A fine-grained view of different workloads uncovers the diversity of use cases: many are far more write-heavy or more skewed than previously shown, and some display unique temporal patterns. We also observe that TTL is an important and sometimes defining parameter of cache working sets. Our simulations show that the ideal replacement strategy in production caches can be surprising; for example, FIFO works best for a large number of workloads.
A curated selection of distributed transactions protocols
LeanStore is a high-performance OLTP storage engine optimized for many-core CPUs and NVMe SSDs. Our goal is to achieve performance comparable to in-memory systems when the data set fits into RAM, while being able to fully exploit the bandwidth of fast NVMe SSDs for large data sets. While LeanStore is currently a research prototype, we hope to make it usable in production in the future.
Varlink is an interface description format and protocol that aims to make services accessible to both humans and machines in the simplest feasible way. A varlink interface combines the classic UNIX command-line options, STDIN/OUT/ERROR text formats, man pages, and service metadata, and provides the equivalent over a single file descriptor, a.k.a. “FD3”. Varlink is plain-text, type-safe, discoverable, self-documenting, remotable, testable, and easy to debug. Varlink is accessible from any programming environment.
APIs offered by the kernel for file descriptor transfer between processes have been awkward to use and riddled with a number of gotchas. On newer versions of Linux (5.6 and above), a far better API exists.
In this post, we will consider whether Raft can guarantee liveness, and specifically whether it can establish a stable leader, if a network fault means that some servers are no longer connected to each other.
PCIe-attached solid-state drives offer high throughput and large capacity at low cost. Modern servers can easily host 4 or 8 such SSDs, resulting in an aggregated bandwidth that hitherto was only achievable using DRAM. In this paper we study how to exploit such Directly-Attached NVMe Arrays (DANA) in database systems. We find that DANA presents new challenges that require rethinking the way I/O operations are performed at both the database and operating system layer.
Serverless computing offers an event-driven pay-as-you-go framework for application development. A key selling point is the concept of no back-end server management, allowing developers to focus on application functionality. This is achieved through severe abstraction of the underlying architecture the functions run on. We examine the underlying architecture and report on the performance of serverless functions and how they are affected by certain factors such as memory allocation and interference caused by load induced by other users on the platform. Specifically, we focus on the serverless offerings of the four largest platforms: AWS Lambda, Google Cloud Functions, Microsoft Azure Functions and IBM Cloud Functions. In this paper, we observe and contrast how these platforms approach the common issue of "cold starts", we devise a means to unveil the underlying architecture serverless functions execute on, and we investigate the effects of interference from load on the platform over the time span of one month.
Last time, in the 1st article, I briefly introduced Firecracker as a lightweight virtualization/containerization solution for extreme scale (like AWS Lambda functions). This time around I am going to dig a bit deeper into the API and the management of microVMs. I am going to install Alpine on a RPi, install Rust, Python, and Docker, get the Linux kernel source and compile a new kernel with a minimal config, compile our own Firecracker, and then create a new rootfs to be able to boot up a guest. Most of these steps are optional; you can use the stock kernel the Firecracker team provides or download a Firecracker release from GitHub.
Noise is a framework that can be used to construct secure channel protocols. Noise takes a fairly basic set of cryptographic operations and allows them to be combined in ways that provide various security properties. Noise is not a protocol itself: it is a protocol framework, so by "filling in the blanks" you get a concrete protocol that has essentially no knobs to twist.
With the rise of data-centric process management paradigms, interdependent processes, such as artifacts or object lifecycles, form a business process through their interactions. Coordination processes may be used to coordinate these interactions, guiding the overall business process towards a meaningful goal. A coordination process model specifies coordination constraints between the interdependent processes in terms of semantic relationships. At run-time, these coordination constraints must be enforced by a coordination process instance. As the coordination of multiple interdependent processes is a complex endeavor, several challenges need to be met to achieve optimal process coordination. For example, processes must be allowed to run asynchronously and concurrently, taking their complex relations into account. This paper contributes the operational semantics of coordination processes, which enforces the coordination constraints at run-time. Coordination processes form complex structures to adequately represent processes and their relations, specifically supporting many-to-many relationships. Based on these complex structures, markings and process rules allow for the flexible enactment of the interdependent processes while fulfilling all challenges. Coordination processes represent a sophisticated solution to the complex problem of coordinating interdependent, concurrently running processes.
CUE is an open source language, with a rich set of APIs and tooling, for defining, generating, and validating all kinds of data: configuration, APIs, database schemas, code, … you name it.
P is a language for asynchronous event-driven programming. P allows the programmer to specify the system as a collection of interacting state machines, which communicate with each other using events. P unifies modeling and programming into one activity for the programmer. Not only can a P program be compiled into executable code, but it can also be systematically tested using Model Checking.
Strong typedefs in Rust. Make invalid states unrepresentable
I wanted to gather some notes I've kept in my head over the years about Interval Tree Clocks. Interval Tree Clocks were presented in a 2008 paper by Paulo Sérgio Almeida, Carlos Baquero, and Victor Fonte. It's one of the most interesting and readable papers I've seen, but I don't recall seeing many production systems using the mechanism, and I'm not fully sure why.
Modern NVMe devices change the nature of how to best perform I/O in stateful applications. This trend has been going on for a while but has been so far masked by the fact that the APIs, especially the higher level ones, haven’t evolved to match what has been happening in the device — and more recently Linux Kernel layers. With the right set of APIs, Direct I/O is the new black.
Testing for causality between events in distributed executions is a fundamental problem. Vector clocks solve this problem but do not scale well. The probabilistic Bloom clock can determine causality between events with lower space, time, and message-space overhead than vector clocks; however, predictions suffer from false positives. We give the protocol for the Bloom clock based on Counting Bloom filters and study its properties including the probabilities of a positive outcome and a false positive. We show the results of extensive experiments to determine how these probabilities vary as a function of the Bloom timestamps of the two events being tested, and to determine the accuracy, precision, and false positive rate of a slice of the execution containing events in the temporal proximity of each other. Based on these experiments, we make recommendations for the setting of the Bloom clock parameters. We postulate the causality spread hypothesis from the application's perspective to indicate whether Bloom clocks will be suitable for correct predictions with high confidence. The Bloom clock design can serve as a viable space-, time-, and message-space-efficient alternative to vector clocks if false positives can be tolerated by an application.
Preloading is a feature of the dynamic linker (ld). On most UNIX systems ld allows you to load a number of libraries before those that are required by an executable. The cwrap project uses this feature to implement special libraries which can be used to fake certain behavior of the system. Think of it like "The Matrix" where reality is simulated and everything is a lie!
Consensus-based replicated systems are complex, monolithic, and difficult to upgrade once deployed. As a result, deployed systems do not benefit from innovative research, and new consensus protocols rarely reach production. We propose virtualizing consensus by virtualizing the shared log API, allowing services to change consensus protocols without downtime. Virtualization splits the logic of consensus into the VirtualLog, a generic and reusable reconfiguration layer; and pluggable ordering protocols called Loglets. Loglets are simple, since they do not need to support reconfiguration or leader election; diverse, consisting of different protocols, codebases, and even deployment modes; and composable, via RAID-like stacking and striping. We describe a production database called Delos which leverages virtual consensus for rapid, incremental development and deployment. Delos reached production within 8 months, and 4 months later upgraded its consensus protocol without downtime for a 10X latency improvement. Delos can dynamically change its performance properties by changing consensus protocols: we can scale throughput by up to 10X by switching to a disaggregated Loglet, and double the failure threshold of an instance without sacrificing throughput via a striped Loglet.
Our interest lies in load balancing jobs in large scale systems consisting of multiple dispatchers and FCFS servers. In the absence of any information on job sizes, dispatchers typically use queue length information reported by the servers to assign incoming jobs. When job sizes are highly variable, using only queue length information is clearly suboptimal and performance can be improved if some indication can be provided to the dispatcher about the size of an ongoing job. In an FCFS server, measuring the attained service time of the ongoing job is easy, and servers can therefore report this attained service time together with the queue length when queried by a dispatcher. In this paper we propose and analyse a variety of load balancing policies that exploit both the queue length and attained service time to assign jobs, as well as policies for which only the attained service time of the job in service is used. We present a unified analysis for all these policies in a large scale system under the usual asymptotic independence assumptions. The accuracy of the proposed analysis is illustrated using simulation.
State machine replication is a powerful primitive for distributed systems, which has the power to provide the strongest possible consistency guarantees even in the face of message delays and failures. However, the performance limitations of state machine replication have long limited its adoption. I believe that Fast Flexible Paxos is the first step towards building more performant state machine replication protocols which can support fast replication, thus addressing the leader bottleneck, without requiring most servers to participate in replication.
An opinionated list of CLI utilities for monitoring and inspecting Linux/BSD systems.
This article looks at a few approaches Amazon has taken to manage API requests to its systems to avoid overload by implementing API rate limiting (also referred to as “throttling” or “admission control”). Without these kinds of protection, a system becomes overloaded when more traffic flows into the system than it is scaled to handle at that time. API rate limiting lets us shape incoming traffic in different ways, such as prioritizing the client workloads that remain within their planned usage, while applying backpressure to client workloads that spike unpredictably.
Structslop is a static analyzer for Go that recommends struct field rearrangements to provide for maximum space/allocation efficiency
“Microservices” was a good idea taken too far and applied too bluntly. The fix isn’t just to dial back the granularity knob but instead to 1) focus on the split-join criteria as opposed to size; and 2) differentiate between the project model and the deployment model when applying them.
Transferring TCP sockets over a Unix domain socket is actually a tried-and-tested method to implement “hot restarts” or “zero downtime restarts”. Popular proxies like HAProxy and Envoy use very similar mechanisms to drain connections from one instance of the proxy to another without dropping any connections. However, many of these features are not very widely known.
Linux has rich virtual networking capabilities that are used as basis for hosting VMs and containers, as well as cloud environments. In this post, I will give a brief introduction to all commonly used virtual network interface types
This post will hopefully collect some of my thoughts on changelog automation so that I can go about writing a tool (or composing preexisting tools) that solves changelog management for me. Maybe it’ll be useful to others too!
We present Breakwater, an overload control scheme that can prevent overload in microsecond-scale services through a new, server-driven admission control scheme that issues credits based on server-side queueing delay. Breakwater contributes several techniques to amortize communication costs. It engages in demand speculation, where it assumes clients have unmet demand and issues additional credits when the server is not overloaded. Moreover, it piggybacks client-side demand information in RPC requests and credits in RPC responses. To cope with the occasional bursts in load caused by demand speculation, Breakwater drops requests when overloaded using active queue management.
In this paper, we show that resource partitioning is neither necessary nor sufficient. Many applications experience bursty request patterns or phased behavior, drastically changing the amount and type of resources they need. Unfortunately, partitioning-based systems fail to react quickly enough to keep up with these changes, resulting in extreme spikes in latency and lost opportunities to increase CPU utilization. Caladan is a new CPU scheduler that can achieve significantly better quality of service (tail latency, throughput, etc.) through a collection of control signals and policies that rely on fast core allocation instead of resource partitioning
We measure the impact of thread-per-core architecture on application tail latency by implementing a key-value store that uses application-level partitioning, and inter-thread messaging and compare its tail latency to Memcached which uses a traditional key-value store design. We show in an experimental evaluation that our approach reduces tail latency by up to 71% compared to baseline Memcached running on commodity hardware and Linux. However, we observe that the thread-per-core approach is held back by request steering and OS interfaces, and it could be further improved with NIC hardware offload
Large-scale interactive web services and advanced AI applications make sophisticated decisions in real-time, based on executing a massive amount of computation tasks on thousands of servers. Task schedulers, which often operate in heterogeneous and volatile environments, require high throughput, i.e., scheduling millions of tasks per second, and low latency, i.e., incurring minimal scheduling delays for millisecond-level tasks. Scheduling is further complicated by other users' workloads in a shared system, other background activities, and the diverse hardware configurations inside datacenters. We present Rosella, a new self-driving, distributed approach for task scheduling in heterogeneous clusters. Our system automatically learns the compute environment and adjusts its scheduling policy in real-time. The solution provides high throughput and low latency simultaneously, because it runs in parallel on multiple machines with minimum coordination and only performs simple operations for each scheduling decision. Our learning module monitors total system load, and uses the information to dynamically determine the optimal estimation strategy for the backends' compute power. Our scheduling policy generalizes power-of-two-choice algorithms to handle heterogeneous workers, reducing the max queue length from the O(log n) obtained by prior algorithms to O(log log n). We implement a Rosella prototype and evaluate it with a variety of workloads. Experimental results show that Rosella significantly reduces task response times, and adapts to environment changes quickly.
SDLang is a simple and concise way to textually represent data. It has an XML-like structure – tags, values and attributes – which makes it a versatile choice for data serialization, configuration files, or declarative languages. Its syntax was inspired by the C family of languages (C/C++, C#, D, Java, …).
Nebula is a scalable overlay networking tool with a focus on performance, simplicity and security. It lets you seamlessly connect computers anywhere in the world. Nebula is portable, and runs on Linux, OSX, Windows, iOS, and Android. It can be used to connect a small number of computers, but is also able to connect tens of thousands of computers. Nebula incorporates a number of existing concepts like encryption, security groups, certificates, and tunneling, and each of those individual pieces existed before Nebula in various forms. What makes Nebula different to existing offerings is that it brings all of these ideas together, resulting in a sum that is greater than its individual parts.
Over the years, as we’ve expanded in scale and functionalities, Facebook has evolved from a basic web server architecture into a complex one with thousands of services working behind the scenes. It’s no trivial task to scale the wide range of back-end services needed for Facebook’s products. And we found that many of our teams were building their own custom sharding solutions with overlapping functionalities. To address this problem, we built Shard Manager as a generic platform that facilitates efficient development and operation of reliable sharded applications.
Sharding is a fundamental building block of large-scale applications, but most have their own custom, ad-hoc implementations. Our goal is to make sharding as easily reusable as a filesystem or lock manager. Slicer is Google’s general purpose sharding service. It monitors signals such as load hotspots and server health to dynamically shard work over a set of servers. Its goals are to maintain high availability and reduce load imbalance while minimizing churn from moved work.
In a partitioned Bloom Filter the m bit vector is split into k disjoint m/k sized parts, one per hash function. Contrary to hardware designs, where they prevail, software implementations mostly adopt standard Bloom filters, considering partitioned filters slightly worse, due to the slightly larger false positive rate (FPR). In this paper, by performing an in-depth analysis, first we show that the FPR advantage of standard Bloom filters is smaller than thought; more importantly, by studying the per-element FPR, we show that standard Bloom filters have weak spots in the domain: elements which will be tested as false positives much more frequently than expected. This is relevant in scenarios where an element is tested against many filters, e.g., in packet forwarding. Moreover, standard Bloom filters are prone to exhibit extremely weak spots if naive double hashing is used, something occurring in several, even mainstream, libraries. Partitioned Bloom filters exhibit a uniform distribution of the FPR over the domain and are robust to the naive use of double hashing, having no weak spots. Finally, by surveying several usages other than testing set membership, we point out the many advantages of having disjoint parts: they can be individually sampled, extracted, added or retired, leading to superior designs for, e.g., SIMD usage, size reduction, test of set disjointness, or duplicate detection in streams. Partitioned Bloom filters are better, and should replace the standard form, both in general purpose libraries and as the base for novel designs.
Ever wanted to create a cool QR code for your guests? But never wanted to type in your WiFi credentials into a form that submits them to a remote webserver to render the QR code? QiFi to the rescue! It will render the code in your browser, on your machine, so the WiFi stays as secure as it was before (read the code if you do not trust text on the internet :-))!
OpenMPTCProuter uses MultiPath TCP (MPTCP) on top of OpenWrt to really aggregate multiple Internet connections.
Atomic operations (atomics) such as Compare-and-Swap (CAS) or Fetch-and-Add (FAA) are ubiquitous in parallel programming. Yet, performance tradeoffs between these operations and various characteristics of such systems, such as the structure of caches, are unclear and have not been thoroughly analyzed. In this paper we establish an evaluation methodology, develop a performance model, and present a set of detailed benchmarks for latency and bandwidth of different atomics. We consider various state-of-the-art x86 architectures: Intel Haswell, Xeon Phi, Ivy Bridge, and AMD Bulldozer. The results unveil surprising performance relationships between the considered atomics and architectural properties such as the coherence state of the accessed cache lines. One key finding is that all the tested atomics have comparable latency and bandwidth even if they are characterized by different consensus numbers. Another insight is that the hardware implementation of atomics prevents any instruction-level parallelism even if there are no dependencies between the issued operations. Finally, we discuss solutions to the discovered performance issues in the analyzed architectures. Our analysis enables simpler and more effective parallel programming and accelerates data processing on various architectures deployed in both off-the-shelf machines and large compute systems.
You can make any protocol work with a custom proxy. Take DNS: your edge servers listen for UDP packets, slap PROXY headers on them, relay the packets to worker servers, unwrap them, and deliver them to containers. You can intercept all of UDP with AF_PACKET sockets, and write the last-hop packet that way too to fake addresses out... But there's a problem with this approach... it's not fun.
Lunatic is a set of libraries and a WebAssembly runtime which allows developers to build resilient actor systems.
OSv is an open-source, versatile, modular unikernel designed to run a single unmodified Linux application securely as a microVM on top of a hypervisor, in contrast to traditional operating systems, which were designed for a vast range of physical machines. Built from the ground up for effortless deployment and management of microservices and serverless apps, with superior performance.
Nowadays, loss-based TCP congestion controls in general and CUBIC specifically became the de facto standard for the Internet. BBR congestion control challenges the loss-based approach by modeling the network based on estimated bandwidth and round-trip time. At Dropbox, we've been using BBRv1 since 2017 and are accustomed to its pros and cons. BBRv2 introduces a set of improvements to network modeling (explicit loss targets and inflight limits) and fairness (differential probing and headroom for new flows). In this paper, we go over experimental data gathered on the Dropbox Edge Network. We compare BBRv2 to BBRv1 and CUBIC showing that BBRv2 is a definite improvement over both of them. We also show that BBRv2 experimental results match its theoretical design principles.
This post details my adventures with the Linux virtual memory subsystem, and my discovery of a creative way to taunt the OOM (out of memory) killer by accumulating memory in the kernel, rather than in userspace.
One of the fastest compact embeddable key-value ACID databases, without a WAL.
This paper introduces Beldi, a library and runtime system for writing and composing fault-tolerant and transactional stateful serverless functions. Beldi runs on existing providers and lets developers write complex stateful applications that require fault tolerance and transactional semantics without the need to deal with tasks such as load balancing or maintaining virtual machines. Beldi's contributions include extending the log-based fault-tolerant approach in Olive (OSDI 2016) with new data structures, transaction protocols, function invocations, and garbage collection. They also include adapting the resulting framework to work over a federated environment where each serverless function has sovereignty over its own data. We implement three applications on Beldi, including a movie review service, a travel reservation system, and a social media site. Our evaluation on 1,000 AWS Lambdas shows that Beldi's approach is effective and affordable.
Low-latency online services have strict Service Level Objectives (SLOs) that require datacenter systems to support high throughput at microsecond-scale tail latency. Dataplane operating systems have been designed to scale up multi-core servers with minimal overhead for such SLOs. However, as application demands continue to increase, scaling up is not enough, and serving larger demands requires these systems to scale out to multiple servers in a rack. We present RackSched, the first rack-level microsecond-scale scheduler that provides the abstraction of a rack-scale computer (i.e., a huge server with hundreds to thousands of cores) to an external service with network-system co-design. The core of RackSched is a two-layer scheduling framework that integrates inter-server scheduling in the top-of-rack (ToR) switch with intra-server scheduling in each server. We use a combination of analytical results and simulations to show that it provides performance near that of centralized scheduling policies, and is robust for both low-dispersion and high-dispersion workloads. We design a custom switch data plane for the inter-server scheduler, which realizes power-of-k-choices, ensures request affinity, and tracks server loads accurately and efficiently. We implement a RackSched prototype on a cluster of commodity servers connected by a Barefoot Tofino switch. End-to-end experiments on a twelve-server testbed show that RackSched improves the throughput by up to 1.44x, and scales out the throughput near linearly, while maintaining the same tail latency as one server until the system is saturated.
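RackSched realizes its inter-server policy in the switch data plane, but the power-of-k-choices policy itself is easy to state in plain code. The sketch below (a simplified, software-only model, not RackSched's switch implementation) dispatches each request to the least-loaded of k randomly sampled servers:

```python
import random

def power_of_k_choices(server_loads, k, rng=random):
    """Sample k distinct servers at random and pick the least loaded.

    server_loads: list of current queue lengths, one per server.
    Returns the index of the chosen server.
    """
    candidates = rng.sample(range(len(server_loads)), k)
    return min(candidates, key=lambda i: server_loads[i])

# Simulate dispatching 1000 requests across 8 servers with k=2.
loads = [0] * 8
rng = random.Random(42)
for _ in range(1000):
    chosen = power_of_k_choices(loads, k=2, rng=rng)
    loads[chosen] += 1
```

Even with k=2, sampling two servers instead of one keeps the load spread across servers close to uniform, which is why the policy is attractive to implement in a ToR switch with limited state.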
MyST allows you to write Sphinx documentation entirely in Markdown. MyST Markdown provides a Markdown equivalent of the reStructuredText syntax, meaning that you can do anything in MyST that you can do with reStructuredText. It is an attempt to have the best of both worlds: the flexibility and extensibility of Sphinx with the simplicity and readability of Markdown.
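The core of that equivalence is that MyST maps Sphinx directives and roles onto fenced blocks and inline syntax. A small illustrative fragment (directive and role names here are standard Sphinx ones; see the MyST docs for the full syntax):

````markdown
```{note}
Directives written as `.. note::` in reStructuredText
become fenced blocks in MyST.
```

Inline roles also carry over, e.g. {ref}`my-target` for
cross-references.
````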
The L4 microkernel has undergone 20 years of use and evolution. It has an active user and developer community, and there are commercial versions that are deployed on a large scale and in safety-critical systems. In this article we examine the lessons learnt in those 20 years about microkernel design and implementation. We revisit the L4 design papers, and examine the evolution of design and implementation from the original L4 to the latest generation of L4 kernels. We specifically look at seL4, which has pushed the L4 model furthest and was the first OS kernel to undergo a complete formal verification of its implementation as well as a sound analysis of worst-case execution times. We demonstrate that while much has changed, the fundamental principles of minimality, generality and high inter-process communication (IPC) performance remain the main drivers of design and implementation decisions.
The shared log abstraction has proved to be an important building block for distributed systems. However, none of the existing implementations achieve all three of low latency, high throughput, and total order. Ziplog is a new implementation of a totally ordered shared log that achieves latency and throughput comparable to what today can only be delivered by systems that optimize only one of these metrics at the expense of the other. Ziplog achieves these results through a new API that, instead of adding new records to the log through a linearizable Append operation, relies on a linearizable InsertAfter operation that specifies the log position past which the new record should be inserted. This new API allows Ziplog to totally order records across shards without needing cross-shard coordination and with an average latency of fewer than three message delays.
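To make the API difference concrete, here is a toy, single-process model of InsertAfter-style semantics (the class and method names are illustrative; Ziplog's actual sharded, replicated protocol is far more involved):

```python
class TotallyOrderedLog:
    """Toy model of a log exposing InsertAfter instead of Append.

    The client names the position its record must come after; the
    log places it at the first free slot past that position, so no
    central sequencer has to serialize concurrent appends.
    """

    def __init__(self):
        self._records = []  # index in this list == log position

    def insert_after(self, position, record):
        """Insert `record` at the first free slot past `position`."""
        slot = max(position + 1, len(self._records))
        # Pad any gap so the record lands at the requested slot.
        while len(self._records) < slot:
            self._records.append(None)
        self._records.append(record)
        return len(self._records) - 1

    def read(self, position):
        return self._records[position]
```

The point of the shape of this API is visible even in the toy: two clients can issue InsertAfter for positions past different predecessors without coordinating with each other, whereas a linearizable Append forces them through a single serialization point.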
This tutorial will help you use pandoc to generate pdf and epub from a GitHub-style Markdown file. The main motivation for this blog post is to highlight the customizations I made to generate pdf and epub versions for self-publishing my ebooks. It wasn't easy to arrive at the set-up I ended up with, so I hope this will be useful for those looking to use pandoc to generate pdf and epub formats. This guide is specifically aimed at technical books that have code snippets.
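As a starting point before any book-specific customization, the baseline invocations look roughly like this (the filenames and highlight style are placeholders; the flags are standard pandoc options):

```shell
# PDF via a LaTeX engine, with a table of contents and
# syntax highlighting for code snippets.
pandoc book.md -o book.pdf \
    --from gfm \
    --pdf-engine=xelatex \
    --toc \
    --highlight-style=tango

# EPUB from the same source.
pandoc book.md -o book.epub \
    --from gfm \
    --toc \
    --metadata title="My Book"
```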
While there have been efforts to build high-performance data structures with near-data processing (NDP), prior designs have mostly failed to consider data access patterns and the impact of caches. We propose the hybrid skiplist, a concurrent skiplist algorithm that takes advantage of both the cache-friendly nature of lock-free skiplists and the low-latency memory access of NDP. We also propose the hybrid biased skiplist, where frequently accessed nodes are dynamically promoted to higher levels of the skiplist for better performance.
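The biasing idea, stripped of the concurrency and NDP placement that make the real design interesting, is just "hot nodes get taller." A simplified, non-concurrent sketch of that promotion heuristic (the class, the promotion threshold, and the level cap are all made up for illustration):

```python
import random

class BiasedSkiplistSketch:
    """Sketch of biasing a skiplist by access frequency.

    Only the level bookkeeping is modeled; a real skiplist also
    maintains per-level forward pointers so that taller (hotter)
    nodes are reached in fewer steps during search.
    """

    MAX_LEVEL = 8
    PROMOTE_EVERY = 4  # promote one level every N accesses (made up)

    def __init__(self):
        self.level = {}     # key -> current height
        self.accesses = {}  # key -> access count

    def insert(self, key):
        # Standard randomized height on insert (p = 1/2).
        h = 1
        while h < self.MAX_LEVEL and random.random() < 0.5:
            h += 1
        self.level[key] = h
        self.accesses[key] = 0

    def access(self, key):
        self.accesses[key] += 1
        if (self.accesses[key] % self.PROMOTE_EVERY == 0
                and self.level[key] < self.MAX_LEVEL):
            self.level[key] += 1  # hot node climbs one level
```

The interesting part of the paper is everything this sketch omits: doing the promotion concurrently and lock-free, and deciding which levels live near the processing units versus in cache-friendly host memory.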
Zheap is a way to keep table bloat under control by implementing a new PostgreSQL storage engine capable of running UPDATE-intensive workloads more efficiently.