Consensus-based replicated systems are complex, monolithic, and difficult to upgrade once deployed. As a result, deployed systems do not benefit from innovative research, and new consensus protocols rarely reach production. We propose virtualizing consensus by virtualizing the shared log API, allowing services to change consensus protocols without downtime. Virtualization splits the logic of consensus into the VirtualLog, a generic and reusable reconfiguration layer; and pluggable ordering protocols called Loglets. Loglets are simple, since they do not need to support reconfiguration or leader election; diverse, consisting of different protocols, codebases, and even deployment modes; and composable, via RAID-like stacking and striping. We describe a production database called Delos which leverages virtual consensus for rapid, incremental development and deployment. Delos reached production within 8 months, and 4 months later upgraded its consensus protocol without downtime for a 10X latency improvement. Delos can dynamically change its performance properties by changing consensus protocols: we can scale throughput by up to 10X by switching to a disaggregated Loglet, and double the failure threshold of an instance without sacrificing throughput via a striped Loglet.
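To make the split concrete, here is a Go sketch of the shape a pluggable Loglet API could take; the method names are assumptions based on the paper's description, not actual Delos code:

```go
package loglet

// Loglet is a single pluggable ordering protocol. It needs no leader election
// or reconfiguration logic of its own; the VirtualLog handles those by sealing
// one Loglet and chaining to the next.
type Loglet interface {
	Append(entry []byte) (pos uint64, err error)      // totally order one entry
	CheckTail() (tail uint64, sealed bool, err error) // current tail and seal status
	ReadNext(lo, hi uint64) (entry []byte, err error) // read within [lo, hi)
	Seal() error                                      // make the Loglet immutable
}
```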
Our interest lies in load balancing jobs in large-scale systems consisting of multiple dispatchers and FCFS servers. In the absence of any information on job sizes, dispatchers typically use queue length information reported by the servers to assign incoming jobs. When job sizes are highly variable, using only queue length information is clearly suboptimal, and performance can be improved if some indication can be provided to the dispatcher about the size of an ongoing job. In an FCFS server, measuring the attained service time of the ongoing job is easy, and servers can therefore report this attained service time together with the queue length when queried by a dispatcher. In this paper we propose and analyse a variety of load balancing policies that exploit both the queue length and attained service time to assign jobs, as well as policies for which only the attained service time of the job in service is used. We present a unified analysis for all these policies in a large scale system under the usual asymptotic independence assumptions. The accuracy of the proposed analysis is illustrated using simulation.
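A minimal Go sketch of a dispatcher that combines the two signals. The additive scoring rule is an illustrative assumption, not one of the paper's specific policies: under highly variable (e.g. heavy-tailed) job sizes, a long attained service suggests a long remaining service time, so it adds a penalty to the server's score.

```go
package dispatch

import "math"

type ServerState struct {
	QueueLen        int
	AttainedService float64 // seconds the job in service has already received
}

// pickServer returns the index of the server with the lowest score, where
// penalty maps attained service time to an estimate of remaining work.
func pickServer(states []ServerState, penalty func(float64) float64) int {
	best, bestScore := 0, math.Inf(1)
	for i, s := range states {
		if score := float64(s.QueueLen) + penalty(s.AttainedService); score < bestScore {
			best, bestScore = i, score
		}
	}
	return best
}
```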
State machine replication is a powerful primitive for distributed systems, able to provide the strongest possible consistency guarantees even in the face of message delays and failures. However, its performance limitations have long hindered its adoption. I believe that Fast Flexible Paxos is the first step towards building more performant state machine replication protocols that support fast replication, thus addressing the leader bottleneck, without requiring most servers to participate in replication.
An opinionated list of CLI utilities for monitoring and inspecting Linux/BSD systems.
This article looks at a few approaches Amazon has taken to manage API requests to its systems to avoid overload by implementing API rate limiting (also referred to as “throttling” or “admission control”). Without these kinds of protection, a system becomes overloaded when more traffic flows into the system than it is scaled to handle at that time. API rate limiting lets us shape incoming traffic in different ways, such as prioritizing the client workloads that remain within their planned usage, while applying backpressure to client workloads that spike unpredictably.
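The mechanics are easiest to see in a token bucket, one common way to implement this kind of throttling. This Go sketch is a generic illustration, not Amazon's implementation: each client accrues tokens at a steady rate up to a burst allowance, and a request is admitted only if a token is available.

```go
package ratelimit

import (
	"sync"
	"time"
)

type TokenBucket struct {
	mu     sync.Mutex
	tokens float64
	last   time.Time
	rate   float64 // tokens added per second (the planned usage)
	burst  float64 // maximum bucket size (tolerated burstiness)
}

func NewTokenBucket(rate, burst float64) *TokenBucket {
	return &TokenBucket{tokens: burst, last: time.Now(), rate: rate, burst: burst}
}

// Allow consumes a token and returns true if the request is within budget;
// a false return is where the server applies backpressure (e.g. a 429).
func (b *TokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}
```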
Structslop is a static analyzer for Go that recommends struct field rearrangements to maximize space and allocation efficiency.
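A hypothetical example (not structslop output) of the kind of rearrangement such a tool suggests: on 64-bit platforms, `Poor` takes 24 bytes because of alignment padding, while the reordered `Better` takes 16.

```go
package main

import (
	"fmt"
	"unsafe"
)

type Poor struct {
	a bool  // 1 byte + 7 bytes padding (to align b)
	b int64 // 8 bytes
	c bool  // 1 byte + 7 bytes trailing padding
}

type Better struct {
	b int64 // 8 bytes
	a bool  // 1 byte
	c bool  // 1 byte + 6 bytes trailing padding
}

func main() {
	fmt.Println(unsafe.Sizeof(Poor{}), unsafe.Sizeof(Better{})) // 24 16
}
```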
“Microservices” was a good idea taken too far and applied too bluntly. The fix isn’t just to dial back the granularity knob but instead to 1) focus on the split-join criteria as opposed to size; and 2) differentiate between the project model and the deployment model when applying them.
Transferring TCP sockets over a Unix domain socket is, actually, a tried and tested method to implement “hot restarts” or “zero downtime restarts”. Popular proxies like HAProxy and Envoy use very similar mechanisms to drain connections from one instance of the proxy to another without dropping any connections. However, many of these features are not very widely known.
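A minimal Go sketch of the sending half, using the standard library's SCM_RIGHTS support (an illustration of the mechanism, not HAProxy's or Envoy's actual code):

```go
package hotrestart

import (
	"net"
	"syscall"
)

// sendListener passes a TCP listener's file descriptor to another process
// over a connected Unix domain socket.
func sendListener(ln *net.TCPListener, uds *net.UnixConn) error {
	f, err := ln.File() // dup of the listener's fd
	if err != nil {
		return err
	}
	defer f.Close()
	rights := syscall.UnixRights(int(f.Fd()))
	// The single data byte is a placeholder; the fd travels in the
	// out-of-band (control message) data.
	_, _, err = uds.WriteMsgUnix([]byte{0}, rights, nil)
	return err
}
```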
Linux has rich virtual networking capabilities that are used as the basis for hosting VMs and containers, as well as cloud environments. In this post, I will give a brief introduction to all commonly used virtual network interface types.
This post will hopefully collect some of my thoughts on changelog automation so that I can go about writing a tool (or composing preexisting tools) that solves changelog management for me. Maybe it’ll be useful to others too!
We present Breakwater, an overload control scheme that can prevent overload in microsecond-scale services through a new, server-driven admission control scheme that issues credits based on server-side queueing delay. Breakwater contributes several techniques to amortize communication costs. It engages in demand speculation, where it assumes clients have unmet demand and issues additional credits when the server is not overloaded. Moreover, it piggybacks client-side demand information in RPC requests and credits in RPC responses. To cope with the occasional bursts in load caused by demand speculation, Breakwater drops requests when overloaded using active queue management.
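An illustrative Go sketch of the credit loop's spirit: grow the credit pool (including a speculative overcommit for unmet demand) while measured queueing delay is under target, and shed credits when it is not. The constants and update rule here are assumptions for illustration, not Breakwater's actual algorithm.

```go
package breakwater

import "time"

// nextCredits returns the size of the credit pool for the next interval.
func nextCredits(queueingDelay, target time.Duration, current int) int {
	if queueingDelay < target {
		return current + 1 + current/10 // additive growth plus ~10% speculative overcommit
	}
	if c := current - current/10; c > 0 { // back off under overload
		return c
	}
	return 1
}
```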
In this paper, we show that resource partitioning is neither necessary nor sufficient. Many applications experience bursty request patterns or phased behavior, drastically changing the amount and type of resources they need. Unfortunately, partitioning-based systems fail to react quickly enough to keep up with these changes, resulting in extreme spikes in latency and lost opportunities to increase CPU utilization. Caladan is a new CPU scheduler that can achieve significantly better quality of service (tail latency, throughput, etc.) through a collection of control signals and policies that rely on fast core allocation instead of resource partitioning.
We measure the impact of thread-per-core architecture on application tail latency by implementing a key-value store that uses application-level partitioning and inter-thread messaging, and compare its tail latency to Memcached, which uses a traditional key-value store design. We show in an experimental evaluation that our approach reduces tail latency by up to 71% compared to baseline Memcached running on commodity hardware and Linux. However, we observe that the thread-per-core approach is held back by request steering and OS interfaces, and that it could be further improved with NIC hardware offload.
Large-scale interactive web services and advanced AI applications make sophisticated decisions in real-time, based on executing a massive amount of computation tasks on thousands of servers. Task schedulers, which often operate in heterogeneous and volatile environments, require high throughput, i.e., scheduling millions of tasks per second, and low latency, i.e., incurring minimal scheduling delays for millisecond-level tasks. Scheduling is further complicated by other users' workloads in a shared system, other background activities, and the diverse hardware configurations inside datacenters. We present Rosella, a new self-driving, distributed approach for task scheduling in heterogeneous clusters. Our system automatically learns the compute environment and adjusts its scheduling policy in real-time. The solution provides high throughput and low latency simultaneously, because it runs in parallel on multiple machines with minimal coordination and only performs simple operations for each scheduling decision. Our learning module monitors total system load, and uses the information to dynamically determine the optimal estimation strategy for the backends' compute power. Our scheduling policy generalizes power-of-two-choices algorithms to handle heterogeneous workers, reducing the max queue length from the O(log n) obtained by prior algorithms to O(log log n). We implement a Rosella prototype and evaluate it with a variety of workloads. Experimental results show that Rosella significantly reduces task response times, and adapts to environment changes quickly.
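For reference, the classic homogeneous form of power-of-two-choices fits in a few lines of Go; Rosella's contribution is generalizing this to heterogeneous workers, which the sketch below does not attempt.

```go
package p2c

import "math/rand"

// assign samples two queues uniformly at random and picks the shorter one,
// which is known to shrink the maximum queue length exponentially compared
// to a single random choice.
func assign(queueLens []int) int {
	i := rand.Intn(len(queueLens))
	j := rand.Intn(len(queueLens))
	if queueLens[j] < queueLens[i] {
		return j
	}
	return i
}
```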
SDLang is a simple and concise way to textually represent data. It has an XML-like structure – tags, values and attributes – which makes it a versatile choice for data serialization, configuration files, or declarative languages. Its syntax was inspired by the C family of languages (C/C++, C#, D, Java, …).
Nebula is a scalable overlay networking tool with a focus on performance, simplicity and security. It lets you seamlessly connect computers anywhere in the world. Nebula is portable, and runs on Linux, OSX, Windows, iOS, and Android. It can be used to connect a small number of computers, but is also able to connect tens of thousands of computers. Nebula incorporates a number of existing concepts like encryption, security groups, certificates, and tunneling, and each of those individual pieces existed before Nebula in various forms. What makes Nebula different from existing offerings is that it brings all of these ideas together, resulting in a sum that is greater than its individual parts.
Over the years, as we’ve expanded in scale and functionalities, Facebook has evolved from a basic web server architecture into a complex one with thousands of services working behind the scenes. It’s no trivial task to scale the wide range of back-end services needed for Facebook’s products. And we found that many of our teams were building their own custom sharding solutions with overlapping functionalities. To address this problem, we built Shard Manager as a generic platform that facilitates efficient development and operation of reliable sharded applications.
Sharding is a fundamental building block of large-scale applications, but most have their own custom, ad-hoc implementations. Our goal is to make sharding as easily reusable as a filesystem or lock manager. Slicer is Google’s general purpose sharding service. It monitors signals such as load hotspots and server health to dynamically shard work over a set of servers. Its goals are to maintain high availability and reduce load imbalance while minimizing churn from moved work.
In a partitioned Bloom filter, the m-bit vector is split into k disjoint parts of size m/k, one per hash function. Contrary to hardware designs, where they prevail, software implementations mostly adopt standard Bloom filters, considering partitioned filters slightly worse due to their slightly larger false positive rate (FPR). In this paper, by performing an in-depth analysis, we first show that the FPR advantage of standard Bloom filters is smaller than thought; more importantly, by studying the per-element FPR, we show that standard Bloom filters have weak spots in the domain: elements which will be tested as false positives much more frequently than expected. This is relevant in scenarios where an element is tested against many filters, e.g., in packet forwarding. Moreover, standard Bloom filters are prone to exhibit extremely weak spots if naive double hashing is used, something occurring in several, even mainstream, libraries. Partitioned Bloom filters exhibit a uniform distribution of the FPR over the domain and are robust to the naive use of double hashing, having no weak spots. Finally, by surveying several usages other than testing set membership, we point out the many advantages of having disjoint parts: they can be individually sampled, extracted, added or retired, leading to superior designs for, e.g., SIMD usage, size reduction, test of set disjointness, or duplicate detection in streams. Partitioned Bloom filters are better, and should replace the standard form, both in general purpose libraries and as the base for novel designs.
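A small Go sketch of the partitioned layout: hash function i may only set and test bits in part i. It uses double hashing (h1 + i·h2) purely for brevity, which per the paper is safe in the partitioned form even though it creates weak spots in standard filters.

```go
package pbloom

import "hash/maphash"

type PartitionedBloom struct {
	parts        [][]bool // k disjoint parts of m/k bits (bools for clarity)
	seed1, seed2 maphash.Seed
}

func New(k, partSize int) *PartitionedBloom {
	p := &PartitionedBloom{seed1: maphash.MakeSeed(), seed2: maphash.MakeSeed()}
	p.parts = make([][]bool, k)
	for i := range p.parts {
		p.parts[i] = make([]bool, partSize)
	}
	return p
}

// idx picks the bit for hash function i within part i only.
func (p *PartitionedBloom) idx(i int, key string) int {
	h1 := maphash.String(p.seed1, key)
	h2 := maphash.String(p.seed2, key) | 1
	return int((h1 + uint64(i)*h2) % uint64(len(p.parts[i])))
}

func (p *PartitionedBloom) Add(key string) {
	for i := range p.parts {
		p.parts[i][p.idx(i, key)] = true
	}
}

func (p *PartitionedBloom) MayContain(key string) bool {
	for i := range p.parts {
		if !p.parts[i][p.idx(i, key)] {
			return false
		}
	}
	return true
}
```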
Ever wanted to create a cool QR code for your guests? But never wanted to type your WiFi credentials into a form that submits them to a remote webserver to render the QR code? QiFi to the rescue! It will render the code in your browser, on your machine, so the WiFi stays as secure as it was before (read the code if you do not trust text on the internet :-))!
OpenMPTCProuter uses MultiPath TCP (MPTCP) and OpenWrt to truly aggregate multiple Internet connections.
Atomic operations (atomics) such as Compare-and-Swap (CAS) or Fetch-and-Add (FAA) are ubiquitous in parallel programming. Yet, performance tradeoffs between these operations and various characteristics of such systems, such as the structure of caches, are unclear and have not been thoroughly analyzed. In this paper we establish an evaluation methodology, develop a performance model, and present a set of detailed benchmarks for latency and bandwidth of different atomics. We consider various state-of-the-art x86 architectures: Intel Haswell, Xeon Phi, Ivy Bridge, and AMD Bulldozer. The results unveil surprising performance relationships between the considered atomics and architectural properties such as the coherence state of the accessed cache lines. One key finding is that all the tested atomics have comparable latency and bandwidth even if they are characterized by different consensus numbers. Another insight is that the hardware implementation of atomics prevents any instruction-level parallelism even if there are no dependencies between the issued operations. Finally, we discuss solutions to the discovered performance issues in the analyzed architectures. Our analysis enables simpler and more effective parallel programming and accelerates data processing on various architectures deployed in both off-the-shelf machines and large compute systems.
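The two operations compared in the paper, shown here as exposed by Go's sync/atomic: Fetch-and-Add succeeds unconditionally, while Compare-and-Swap only succeeds if the location still holds the expected value, so it typically runs in a retry loop.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

func main() {
	var n int64
	atomic.AddInt64(&n, 1) // FAA: unconditional increment

	// CAS: retried until no other thread changed n between the load and the swap.
	for {
		old := atomic.LoadInt64(&n)
		if atomic.CompareAndSwapInt64(&n, old, old*2) {
			break
		}
	}
	fmt.Println(n) // 2
}
```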
You can make any protocol work with a custom proxy. Take DNS: your edge servers listen for UDP packets, slap PROXY headers on them, relay the packets to worker servers, unwrap them, and deliver them to containers. You can intercept all of UDP with AF_PACKET sockets, and write the last hop packet that way too to fake addresses out ... But there's a problem with this approach .... it's not fun.
Lunatic is a set of libraries and a WebAssembly runtime which allows developers to build resilient actor systems.
OSv is an open-source, versatile, modular unikernel designed to run a single unmodified Linux application securely as a microVM on top of a hypervisor, in contrast to traditional operating systems, which were designed for a vast range of physical machines. It is built from the ground up for effortless deployment and management of microservices and serverless apps, with superior performance.
Nowadays, loss-based TCP congestion controls in general, and CUBIC specifically, have become the de facto standard for the Internet. BBR congestion control challenges the loss-based approach by modeling the network based on estimated bandwidth and round-trip time. At Dropbox, we've been using BBRv1 since 2017 and are accustomed to its pros and cons. BBRv2 introduces a set of improvements to network modeling (explicit loss targets and inflight limits) and fairness (differential probing and headroom for new flows). In this paper, we go over experimental data gathered on the Dropbox Edge Network. We compare BBRv2 to BBRv1 and CUBIC, showing that BBRv2 is a definite improvement over both of them. We also show that BBRv2's experimental results match its theoretical design principles.
This post details my adventures with the Linux virtual memory subsystem, and my discovery of a creative way to taunt the OOM (out of memory) killer by accumulating memory in the kernel, rather than in userspace.
One of the fastest compact embeddable key-value ACID databases, with no WAL.
This paper introduces Beldi, a library and runtime system for writing and composing fault-tolerant and transactional stateful serverless functions. Beldi runs on existing providers and lets developers write complex stateful applications that require fault tolerance and transactional semantics without the need to deal with tasks such as load balancing or maintaining virtual machines. Beldi's contributions include extending the log-based fault-tolerant approach in Olive (OSDI 2016) with new data structures, transaction protocols, function invocations, and garbage collection. They also include adapting the resulting framework to work over a federated environment where each serverless function has sovereignty over its own data. We implement three applications on Beldi, including a movie review service, a travel reservation system, and a social media site. Our evaluation on 1,000 AWS Lambdas shows that Beldi's approach is effective and affordable.
Low-latency online services have strict Service Level Objectives (SLOs) that require datacenter systems to support high throughput at microsecond-scale tail latency. Dataplane operating systems have been designed to scale up multi-core servers with minimal overhead for such SLOs. However, as application demands continue to increase, scaling up is not enough, and serving larger demands requires these systems to scale out to multiple servers in a rack. We present RackSched, the first rack-level microsecond-scale scheduler that provides the abstraction of a rack-scale computer (i.e., a huge server with hundreds to thousands of cores) to an external service with network-system co-design. The core of RackSched is a two-layer scheduling framework that integrates inter-server scheduling in the top-of-rack (ToR) switch with intra-server scheduling in each server. We use a combination of analytical results and simulations to show that it provides near-optimal performance, matching centralized scheduling policies, and is robust for both low-dispersion and high-dispersion workloads. We design a custom switch data plane for the inter-server scheduler, which realizes power-of-k-choices, ensures request affinity, and tracks server loads accurately and efficiently. We implement a RackSched prototype on a cluster of commodity servers connected by a Barefoot Tofino switch. End-to-end experiments on a twelve-server testbed show that RackSched improves the throughput by up to 1.44x, and scales out the throughput near-linearly, while maintaining the same tail latency as one server until the system is saturated.
MyST allows you to write Sphinx documentation entirely in markdown. MyST markdown provides a markdown equivalent of the reStructuredText syntax, meaning that you can do anything in MyST that you can do with reStructuredText. It is an attempt to have the best of both worlds: the flexibility and extensibility of Sphinx with the simplicity and readability of Markdown.
The L4 microkernel has undergone 20 years of use and evolution. It has an active user and developer community, and there are commercial versions that are deployed on a large scale and in safety-critical systems. In this article we examine the lessons learnt in those 20 years about microkernel design and implementation. We revisit the L4 design papers, and examine the evolution of design and implementation from the original L4 to the latest generation of L4 kernels. We specifically look at seL4, which has pushed the L4 model furthest and was the first OS kernel to undergo a complete formal verification of its implementation as well as a sound analysis of worst-case execution times. We demonstrate that while much has changed, the fundamental principles of minimality, generality and high inter-process communication (IPC) performance remain the main drivers of design and implementation decisions.
The shared log abstraction has proved to be an important building block for distributed systems. However, none of the existing implementations achieve all three of low latency, high throughput, and total order. Ziplog is a new implementation of a totally ordered shared log that achieves latency and throughput comparable to what today can only be delivered by systems that optimize only one of these metrics at the expense of the other. Ziplog achieves these results through a new API that, instead of adding new records to the log through a linearizable Append operation, relies on a linearizable InsertAfter operation that specifies the log position past which the new record should be inserted. This new API allows Ziplog to totally order records across shards without needing cross-shard coordination and with an average latency of fewer than three message delays.
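The API shift, sketched as a Go interface; the signatures are assumptions drawn from the abstract, not Ziplog's actual code.

```go
package sharedlog

type SharedLog interface {
	// Classic approach: a linearizable Append in which the log picks the position.
	Append(rec []byte) (pos uint64, err error)
	// Ziplog's approach: the client names the position past which the record
	// must land, letting shards order records without cross-shard coordination.
	InsertAfter(after uint64, rec []byte) (pos uint64, err error)
	Read(pos uint64) (rec []byte, err error)
}
```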
This tutorial will help you use pandoc to generate pdf and epub files from a GitHub-style markdown file. The main motivation for this blog post is to highlight the customizations I did to generate pdf and epub versions for self-publishing my ebooks. It wasn't easy to arrive at the set-up I ended up with, so I hope this will be useful for those looking to use pandoc to generate pdf and epub formats. This guide is specifically aimed at technical books that have code snippets.
While there have been efforts to build high-performance data structures with near-data-processing (NDP), prior designs have mostly failed to consider data access patterns and the impact of caches. We propose the hybrid skiplist, a concurrent skiplist algorithm that takes advantage of both the cache-friendly nature of lock-free skiplists and the low-latency memory access of NDP. We also propose the hybrid biased skiplist, where frequently-accessed nodes are dynamically promoted to higher levels of the skiplist for better performance.
Zheap is a way to keep table bloat under control by implementing a new PostgreSQL storage engine capable of running UPDATE-intense workloads more efficiently.
It’s been brought to the attention of many people in the game development industry that you can use fibers to parallelize your engines and games. One presentation in particular, Parallelizing the Naughty Dog engine using fibers, has been a talking point since it came out in 2015. However, it doesn’t go into quite enough detail to explain why fibers are a good fit or how to actually implement them. In my pursuit to educate myself and take a stab at implementing them, I’ve concluded that there’s a distinct lack of good information online. In response to the lack of literature on this topic, I’ve decided to explain it from multiple angles and provide less-known information on the subject. This document serves as a sourcebook for those who really want to determine whether fibers are the right fit for them and how to go about actually using them.
One great feature of Go is the built-in http.Server. It allows each app to serve HTTP and HTTPS traffic without having to put a reverse proxy such as Nginx in front of it.
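A minimal use of the built-in server, with plain HTTP shown here; ListenAndServeTLS serves HTTPS directly given a certificate and key, no reverse proxy needed.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello")
	})
	srv := &http.Server{
		Addr:         ":8080",
		Handler:      mux,
		ReadTimeout:  5 * time.Second, // always set timeouts on internet-facing servers
		WriteTimeout: 10 * time.Second,
	}
	log.Fatal(srv.ListenAndServe())
	// For HTTPS: log.Fatal(srv.ListenAndServeTLS("cert.pem", "key.pem"))
}
```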
This paper contains a set of related patterns. All the patterns are concerned with how you, a programmer, get the information you need to understand and fix a software bug. The solutions revolve around the use of a ring buffer to log key events in the life of a system.
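A minimal in-memory ring-buffer log sketch in Go, illustrating the core of the pattern: the last N events are always retained and can be dumped after a crash or on demand.

```go
package main

import "fmt"

type RingLog struct {
	events []string
	next   int
	full   bool
}

func NewRingLog(n int) *RingLog { return &RingLog{events: make([]string, n)} }

// Log overwrites the oldest entry once the buffer is full.
func (r *RingLog) Log(e string) {
	r.events[r.next] = e
	r.next = (r.next + 1) % len(r.events)
	if r.next == 0 {
		r.full = true
	}
}

// Dump returns events oldest-first.
func (r *RingLog) Dump() []string {
	if !r.full {
		return append([]string(nil), r.events[:r.next]...)
	}
	return append(append([]string(nil), r.events[r.next:]...), r.events[:r.next]...)
}

func main() {
	r := NewRingLog(3)
	for _, e := range []string{"open", "read", "parse", "close"} {
		r.Log(e)
	}
	fmt.Println(r.Dump()) // [read parse close]
}
```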
Distributed software is everywhere. We consume or provide APIs, connect to databases, and load files written by others. As these systems evolve, we need to change data schemas while maintaining compatibility between different versions and components. But we don’t have powerful tools for handling this schema evolution, so we usually resort to ad hoc solutions, like conditionals peppered throughout the code and complex multi-step migration processes. We propose a principled replacement for these messy solutions: an isolated software layer that translates data between schemas on demand. This layer allows developers to maintain strong compatibility with many schema versions without complicating the main codebase. Translation logic is defined by composing bidirectional lenses, a kind of data transformation that can run both forward and backward.
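A tiny Go sketch of the bidirectional-lens idea, illustrating the concept rather than the project's actual library: a rename lens runs forward (old schema to new) and backward (new to old), and lenses of this shape compose.

```go
package main

import "fmt"

type Doc = map[string]any

type Lens struct {
	Forward, Backward func(Doc) Doc
}

// Rename builds a lens that renames a field in one direction and
// undoes the rename in the other.
func Rename(from, to string) Lens {
	move := func(a, b string) func(Doc) Doc {
		return func(d Doc) Doc {
			if v, ok := d[a]; ok {
				d[b] = v
				delete(d, a)
			}
			return d
		}
	}
	return Lens{Forward: move(from, to), Backward: move(to, from)}
}

func main() {
	l := Rename("title", "name")
	d := Doc{"title": "Cambria"}
	fmt.Println(l.Forward(d))  // map[name:Cambria]
	fmt.Println(l.Backward(d)) // map[title:Cambria]
}
```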
Descriptions of many algorithms and data structures that are especially popular in the field of competitive programming.
Argbash (https://argbash.io) is a bash code generator that can assist you in writing scripts that accept arguments. You declare the arguments your script should use in a few lines, and then you run Argbash on those declarations to get parsing code that can be used on all platforms that have bash.
This new long version of my 1983 paper suggests the goals you might have for your system (Simple, Timely, Efficient, Adaptable, Dependable, Yummy: STEADY) and effective techniques for achieving them (Approximate, Incremental, Divide & Conquer: AID). It gives a few principles for system design that are more than just hints, and many examples of how to apply the hints and principles.
Scipio (pronounced skip-io, or |skɪpjəʊ|) is a cooperative thread-per-core crate for Rust and Linux based on io_uring. Like other Rust asynchronous crates, it allows one to write asynchronous code that takes advantage of Rust's async/await, but unlike its counterparts it doesn't use helper threads anywhere.
This project aims to generate reproducible workloads that are as close to real life as possible, while being able to efficiently verify the cluster state against the model without pausing the workload itself.
We report an organization’s method for recruiting additional, specialized human resources during anomaly handling. The method has been tailored to encourage sharing adaptive capacity across organizational units. As predicted by Woods’ theory, this case shows that sharing adaptive capacity allows graceful extensibility that is particularly useful when a system is challenged by frequent but unpredictably severe events. We propose that (1) the ability to borrow adaptive capacity from other units is a hallmark of resilient systems and (2) the deliberate adjustment of adaptive capacity sharing is a feature of some forms of resilience engineering. Some features of this domain may lead to the discovery of resilience and promote resilience engineering in other settings, notably hospital emergency rooms.
Network coordinate (NC) systems provide a lightweight and scalable way for predicting the distances, i.e., round-trip latencies, among Internet hosts. Most existing NC systems embed hosts into a low-dimensional Euclidean space. Unfortunately, the persistent occurrence of Triangle Inequality Violation (TIV) on the Internet largely limits the distance prediction accuracy of those NC systems. Some alternative systems aim at handling the persistent TIV; however, they only achieve prediction accuracy comparable to Euclidean-distance-based NC systems. In this paper, we propose an NC system called Phoenix, which is based on the matrix factorization model. Phoenix introduces a weight to each reference NC and trusts the NCs with higher weight values more than the others. The weight-based mechanism can substantially reduce the impact of error propagation. Using representative aggregate data sets and a newly measured dynamic data set collected from the Internet, our simulations show that Phoenix achieves significantly higher prediction accuracy than other NC systems. We also show that Phoenix quickly converges to steady state, performs well under host churn, handles the drift of the NCs successfully by using regularization, and is robust against measurement anomalies. Phoenix achieves scalable yet accurate end-to-end distance monitoring. In addition, we study how well an NC system can characterize the TIV property on the Internet by introducing two new quantitative metrics, RERPL and AERPL. We show that Phoenix is able to characterize TIV better than other existing NC systems.
A network coordinates (NC) system is an efficient mechanism for Internet distance prediction with scalable measurements. The intrinsic cause of the unsatisfactory accuracy of simulation-based NC algorithms has been identified. Pharos, a fully decentralised and hierarchical scheme, is then proposed to solve this problem. Pharos leverages multiple coordinate sets at different distance scales, with the right scale being chosen for prediction each time. We evaluate the performance of the Pharos system with the King data set and latency data from PlanetLab, and compare it with the representative NC system Vivaldi. The experimental results show that Pharos greatly outperforms Vivaldi in Internet distance prediction without adding any significant overhead. Our extensive evaluation results also demonstrate that Pharos can significantly improve performance in distributed Internet applications, such as overlay multicast and server selection.
Network Coordinate (NC) systems provide an efficient and scalable mechanism to estimate latencies among hosts. However, many popular algorithms like Vivaldi suffer greatly from the existence of Triangle Inequality Violations (TIVs). Two-layer systems like Pharos and hierarchical Vivaldi have been proposed to remedy the impact of TIVs. They divide the whole space into several location-based clusters and run NC systems on both the global layer and the local layer. However, the two-layer model is only able to optimize the intra-cluster links relating to a limited portion of TIV triangles. In this paper, we propose a new NC system, Tarantula, which divides the space in a novel way. By categorizing the TIVs into three classes, we show that Tarantula handles a much larger portion of existing TIVs than two-layer systems. Moreover, we present two techniques to further strengthen the Tarantula system: 1) relate the updating step size in the Vivaldi algorithm used in Tarantula to ground-truth latency so as to improve the prediction for short links; 2) propose Dynamic Cluster Optimization to dynamically adjust the clustering of hosts. Our experimental results show that Tarantula outperforms Pharos and Vivaldi significantly in terms of estimation accuracy. When implementing different NC systems in the application of server selection and detour finding, Tarantula again performs the best.
There is an alternative called the “Min-Max Heap” that doesn’t have pretty code, but it has shorter dependency chains, which is important on modern hardware. As a result it often ends up faster than a binary heap, even though it allows you to pop from both ends.
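The structural invariant behind it, sketched in Go: levels alternate between min and max, so the minimum sits at the root, the maximum is one of the root's children, and both ends of one array can be popped. (A full implementation also needs the trickle-down routines, omitted here.)

```go
package minmaxheap

import "math/bits"

// level of node i in a 0-indexed, array-backed heap.
func level(i int) int { return bits.Len(uint(i)+1) - 1 }

// Nodes on even (min) levels are <= all their descendants;
// nodes on odd (max) levels are >= all their descendants.
func isMinLevel(i int) bool { return level(i)%2 == 0 }
```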
In this system, your questions might change from the form “is it possible to reach a bad state” to “what is the probability of reaching a bad state?” Unfortunately, these types of questions just cannot be answered within the nondeterministic model used above. You cannot model probability with nondeterminism. We must use a new type of model: a state machine that handles probability directly.
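A minimal Go sketch of such a probabilistic state machine: states and transition probabilities are made up for illustration, and the reachability probability of the absorbing Bad state is computed by fixed-point iteration.

```go
package main

import "fmt"

func main() {
	// States: 0=Ok, 1=Degraded, 2=Bad (absorbing), 3=Done (absorbing).
	type edge struct {
		to int
		p  float64
	}
	t := [][]edge{
		0: {{3, 0.5}, {1, 0.5}}, // Ok: finish, or degrade
		1: {{0, 0.8}, {2, 0.2}}, // Degraded: recover, or fail
	}
	// p[s] = probability of eventually reaching Bad from state s.
	p := []float64{0, 0, 1, 0}
	for iter := 0; iter < 1000; iter++ { // iterate to a fixed point
		for s, edges := range t {
			sum := 0.0
			for _, e := range edges {
				sum += e.p * p[e.to]
			}
			p[s] = sum
		}
	}
	fmt.Printf("P(reach Bad | Ok) = %.4f\n", p[0]) // ~0.1667
}
```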
Pebble is a LevelDB/RocksDB inspired key-value store focused on performance and internal usage by CockroachDB. Pebble inherits the RocksDB file formats and a few extensions such as range deletion tombstones, table-level bloom filters, and updates to the MANIFEST format.
ankerl::nanobench is a platform-independent microbenchmarking library for C++11/14/17/20.
In this paper, we present Bipartisan Paxos: a family of state machine replication protocols that achieve low latency, high throughput, and simplicity. The Bipartisan Paxos protocols can commit a command in two message delays (the theoretical minimum), achieving low latency. They do not depend on a distinguished leader for normal processing or conflict resolution, achieving high throughput. They are designed with modularity and incrementality in mind, achieving simplicity
Providing ACID transactions under conflicts across globally distributed data is the Everest of transaction processing protocols. Transaction processing in this scenario is particularly costly due to the high latency of cross-continent network links, which inflates concurrency control and data replication overheads. To mitigate the problem, we introduce Ocean Vista – a novel distributed protocol that guarantees strict serializability. We observe that concurrency control and replication address different aspects of resolving the visibility of transactions, and we address both concerns using a multi-version protocol that tracks visibility using version watermarks and arrives at correct visibility decisions using efficient gossip.
Much like on-premises systems, the natural choice for running database analytics workloads in the cloud is to provision a cluster of nodes to run a database instance. However, analytics workloads are often bursty or low volume, leaving clusters idle much of the time, meaning customers pay for compute resources even when unused. The ability of cloud function services, such as AWS Lambda or Azure Functions, to run small, fine-granularity tasks makes them appear to be a natural choice for query processing in such settings. But implementing an analytics system on cloud functions comes with its own set of challenges. These include managing hundreds of tiny stateless resource-constrained workers, handling stragglers, and shuffling data through opaque cloud services. In this paper we present Starling, a query execution engine built on cloud function services that employs a number of techniques to mitigate these challenges, providing interactive query latency at a lower total cost than provisioned systems with low-to-moderate utilization. In particular, on a 1TB TPC-H dataset in cloud storage, Starling is less expensive than the best provisioned systems for workloads where queries arrive 1 minute apart or more. Starling also has lower latency than competing systems reading from cloud object stores and can scale to larger datasets.
We explicitly motivate the subtle intricacies of Hinze and Paterson’s Finger Tree data structure by step-wise refining a naive implementation. The result is a new explanation of how Finger Trees work and why they have the particular structure they have, and also a small simplification of the original implementation.
Three insights motivate our work: (1) the live user traffic accessing a web service provides the most current target workload possible, (2) we can empirically test the system to identify its scalability limits, and (3) the user impact and operational overhead of empirical testing can be largely eliminated by building automation which adjusts live traffic based on feedback. We build on these insights in Kraken, a new system that runs load tests by continually shifting live user traffic to one or more data centers.
Amazon Web Services (AWS) took a fresh look at the network to provide the consistently low latency required for supercomputing applications, while keeping the benefits of public cloud: scalability, elastic on-demand capacity, cost effectiveness, and fast adoption of newer CPUs and GPUs. We built a new network transport protocol, Scalable Reliable Datagram (SRD), designed to utilize modern commodity multi-tenant datacenter networks (with a large number of network paths) while overcoming their limitations (load imbalance and inconsistent latency when unrelated flows collide). Instead of preserving packet order, SRD sends packets over as many network paths as possible, while avoiding overloaded paths. To minimize jitter and ensure the fastest response to network congestion fluctuations, SRD is implemented in the AWS custom Nitro networking card. SRD is used by HPC/ML frameworks on EC2 hosts via the AWS Elastic Fabric Adapter (EFA) kernel-bypass interface.
VinylDNS is a vendor agnostic front-end for enabling self-service DNS and streamlining DNS operations. It is designed to integrate with your existing DNS infrastructure, and provides extensibility to fit your installation. VinylDNS manages millions of DNS records supporting thousands of engineers in production at Comcast. The platform provides fine-grained access controls, auditing of changes, a self-service user interface, secure RESTful API, and integration with infrastructure automation tools like Ansible and Terraform.
Interactive services often have large-scale parallel implementations. To deliver fast responses, the median and tail latencies of a service’s components must be low. In this paper, we explore the hardware, OS, and application-level sources of poor tail latency in high-throughput servers executing on multi-core machines. We model these network services as a queuing system in order to establish the best-achievable latency distribution. Using fine-grained measurements of three different servers (a null RPC service, Memcached, and Nginx) on Linux, we then explore why these servers exhibit significantly worse tail latencies than queuing models alone predict. The underlying causes include interference from background processes, request re-ordering caused by poor scheduling or constrained concurrency models, suboptimal interrupt routing, CPU power saving mechanisms, and NUMA effects.
The CHERI architecture allows pointers to be implemented as capabilities (rather than integer virtual addresses) in a manner that is compatible with, and strengthens, the semantics of the C language. In addition to the spatial protections offered by conventional fat pointers, CHERI capabilities offer strong integrity, enforced provenance validity, and access monotonicity. The stronger guarantees of these architectural capabilities must be reconciled with the real-world behavior of operating systems, run-time environments, and applications. When the process model, user-kernel interactions, dynamic linking, and memory management are all considered, we observe that simple derivation of architectural capabilities is insufficient to describe appropriate access to memory. We bridge this conceptual gap with a notional abstract capability that describes the accesses that should be allowed at a given point in execution, whether in the kernel or userspace. To investigate this notion at scale, we describe the first adaptation of a full C-language operating system (FreeBSD) with an enterprise database (PostgreSQL) for complete spatial and referential memory safety. We show that awareness of abstract capabilities, coupled with CHERI architectural capabilities, can provide more complete protection, strong compatibility, and acceptable performance overhead compared with the pre-CHERI baseline and software-only approaches. Our observations also have potentially significant implications for other mitigation techniques.
You'll be familiar with web bugs, the transparent images which track when someone opens an email. They work by embedding a unique URL in a page's image tag, and monitoring incoming GET requests. Imagine doing that, but for file reads, database queries, process executions, patterns in log files, Bitcoin transactions or even LinkedIn profile views. Canarytokens does all this and more, letting you implant traps in your production systems rather than setting up separate honeypots.
Work-stealing is a popular technique to implement dynamic load balancing in a distributed manner. In this approach, each process owns a set of tasks that have to be executed. The owner of the set can put tasks in it and can take tasks from it to execute them. When a process runs out of tasks, instead of being idle, it becomes a thief to steal tasks from a victim. Thus, a work-stealing algorithm provides three high-level operations: Put and Take, which can be invoked only by the owner, and Steal, which can be invoked by a thief. One of the main targets when designing work-stealing algorithms is to make Put and Take as simple and efficient as possible. Unfortunately, it has been shown that any work-stealing algorithm in the standard asynchronous model must use expensive Read-After-Write synchronization patterns or atomic Read-Modify-Write instructions (e.g. Compare&Swap or Test&Set), which may be costly in practice. Thus, prior research has proposed idempotent work-stealing, a relaxation for which there are algorithms with Put and Take devoid of Read-Modify-Write atomic instructions and Read-After-Write synchronization patterns; however, Put uses fences among Write instructions, and Steal uses Compare&Swap and fences among Read instructions. This paper considers work-stealing with multiplicity, a relaxation in which every task is taken by at least one operation, with the requirement that any process can extract a task at most once. Three versions of the relaxation are considered and fully Read/Write algorithms are presented in the standard asynchronous model, all of them devoid of Read-After-Write synchronization patterns; the last algorithm is also fully fence-free.
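The operation surface the paper studies, as a Go interface sketch; in the multiplicity relaxations, a stolen task may be returned by more than one operation, but each process extracts a given task at most once.

```go
package worksteal

type Task func()

type WorkStealing interface {
	Put(t Task)          // owner only: add a task to the owned set
	Take() (Task, bool)  // owner only: false when the set is empty
	Steal() (Task, bool) // any thief: false when the set is empty
}
```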
Paxos, the de facto standard approach to solving distributed consensus, operates in two phases, each of which requires an intersecting quorum of nodes. Multi-Paxos reduces this to one phase by electing a leader but this leader is also a performance bottleneck. Fast Paxos bypasses the leader but has stronger quorum intersection requirements. In this paper we observe that Fast Paxos' intersection requirements can be safely relaxed, reducing to just one additional intersection requirement between phase-1 quorums and any pair of fast round phase-2 quorums. We thus find that the quorums used with Fast Paxos are larger than necessary, allowing alternative quorum systems to obtain new tradeoffs between performance and fault-tolerance.
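Formalizing the relaxed condition stated above (my notation, not the paper's):

```latex
% Relaxed Fast Paxos requirement: every phase-1 quorum must intersect
% every *pair* of fast-round phase-2 quorums, in addition to the usual
% Flexible Paxos phase-1/phase-2 intersection.
\forall Q_1 \in \mathcal{Q}_1,\;
\forall Q_f, Q_f' \in \mathcal{Q}_2^{\text{fast}}:\quad
Q_1 \cap Q_f \cap Q_f' \neq \emptyset
```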
In this post, I want to explore some of the features of Unix domain sockets that make it a suitable candidate for several of these use-cases, especially transferring a socket (or any file descriptor, for that matter) from one process to another where a parent-child relationship doesn’t necessarily exist between the two processes.
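To complement the sending-side sketch above, here is the receiving half in Go: read the SCM_RIGHTS control message and rebuild a net.Listener from the passed descriptor. Again, this illustrates the standard-library mechanism, not any particular proxy's code.

```go
package hotrestart

import (
	"net"
	"os"
	"syscall"
)

func recvListener(uds *net.UnixConn) (net.Listener, error) {
	buf, oob := make([]byte, 1), make([]byte, 128)
	_, oobn, _, _, err := uds.ReadMsgUnix(buf, oob)
	if err != nil {
		return nil, err
	}
	msgs, err := syscall.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return nil, err
	}
	fds, err := syscall.ParseUnixRights(&msgs[0])
	if err != nil {
		return nil, err
	}
	f := os.NewFile(uintptr(fds[0]), "listener")
	defer f.Close() // net.FileListener dups the fd
	return net.FileListener(f)
}
```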
Research has begun to reveal that many algorithms can be expressed as matrix multiplication, suggesting an unrealized connection between linear algebra and computer science. I speculate graphs are the missing piece of the puzzle. Graphs are not only useful as cognitive aides, but are suitable data structures for a wide variety of tasks, particularly on modern parallel processing hardware. In this essay, I explore the virtues of graphs, algebra, types, and show how these concepts can help us reason about programs. I propose a computational primitive based on graph signal processing, linking software engineering, graphs, and linear algebra. Finally, I share my predictions for the path ahead, which I consider to be the start of an exciting new chapter in computing history.
In this paper, we present an implementation of a cuckoo filter for membership testing, optimized for distributed data stores operating under high workloads. In large databases, querying becomes inefficient using traditional search methods. To achieve optimal performance it is necessary to use probabilistic data structures to test the membership of a given key, at the cost of getting false positives while querying data. The widely used Bloom filters can be used for this, but they have limitations, like no support for deletes. To improve upon this, we use a modified version of the cuckoo filter that gives better amortized times for search, with fewer false positives.
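The core of a standard (partial-key) cuckoo filter, sketched in Go; this shows the base technique, not the paper's modified version. Each key stores only a small fingerprint in one of two buckets, and the second bucket index is derived from the first index and the fingerprint alone, which is what makes relocation and deletion possible.

```go
package cuckoo

import "hash/maphash"

type CuckooFilter struct {
	buckets [][4]uint16 // 4 fingerprint slots per bucket; len must be a power of two
	seed    maphash.Seed
}

func (c *CuckooFilter) hashFp(fp uint16) uint64 {
	var h maphash.Hash
	h.SetSeed(c.seed)
	h.WriteByte(byte(fp))
	h.WriteByte(byte(fp >> 8))
	return h.Sum64()
}

// indexes derives the fingerprint and its two candidate buckets. Because i2
// depends only on i1 and fp, an entry can be moved (or deleted) without
// knowing the original key, unlike in a plain Bloom filter.
func (c *CuckooFilter) indexes(key string) (fp uint16, i1, i2 int) {
	mask := len(c.buckets) - 1
	h := maphash.String(c.seed, key)
	fp = uint16(h>>48) | 1 // reserve 0 to mean "empty slot"
	i1 = int(h) & mask
	i2 = (i1 ^ int(c.hashFp(fp))) & mask
	return
}

func (c *CuckooFilter) Lookup(key string) bool {
	fp, i1, i2 := c.indexes(key)
	for _, i := range []int{i1, i2} {
		for _, slot := range c.buckets[i] {
			if slot == fp {
				return true
			}
		}
	}
	return false
}
```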
We present Elle: a novel checker which infers an Adya-style dependency graph between client-observed transactions. It does so by carefully selecting database objects and operations when generating histories, so as to ensure that the results of database reads reveal information about their version history. Elle can detect every anomaly in Adya et al's formalism [Adya et al. 2000] (except for predicates), discriminate between them, and provide concise explanations of each. This paper makes the following contributions: we present Elle, demonstrate its soundness, measure its efficiency against the current state of the art, and give evidence of its effectiveness via a case study of four real databases.
In this post, we'll talk about how to establish a peer-to-peer connection between two machines, in spite of all the obstacles in the way.