Network coordinate (NC) systems provide a lightweight and scalable way to predict the distances, i.e., round-trip latencies, among Internet hosts. Most existing NC systems embed hosts into a low-dimensional Euclidean space. Unfortunately, the persistent occurrence of Triangle Inequality Violations (TIVs) on the Internet largely limits the distance prediction accuracy of those NC systems. Some alternative systems aim at handling persistent TIVs; however, they only achieve prediction accuracy comparable to Euclidean-distance-based NC systems. In this paper, we propose an NC system, called Phoenix, which is based on the matrix factorization model. Phoenix introduces a weight to each reference NC and trusts the NCs with higher weight values more than the others. This weight-based mechanism can substantially reduce the impact of error propagation. Using representative aggregate data sets and a newly measured dynamic data set collected from the Internet, our simulations show that Phoenix achieves significantly higher prediction accuracy than other NC systems. We also show that Phoenix converges quickly to a steady state, performs well under host churn, handles the drift of the NCs successfully by using regularization, and is robust against measurement anomalies. Phoenix achieves scalable yet accurate end-to-end distance monitoring. In addition, we study how well an NC system can characterize the TIV property of the Internet by introducing two new quantitative metrics, RERPL and AERPL. We show that Phoenix characterizes TIVs better than other existing NC systems.
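To make the model concrete, here is a minimal sketch of matrix-factorization-based distance prediction (toy data and a hypothetical embedding dimension; Phoenix's actual weighted update rules are not reproduced here). Each host i keeps an outgoing vector x_i and an incoming vector y_j, and the predicted RTT from i to j is the dot product x_i · y_j, which, unlike a Euclidean embedding, can represent asymmetric and TIV-ridden distance matrices.

```python
import numpy as np

# Toy "true" RTT matrix (ms) for 4 hosts: asymmetric, and it violates
# the triangle inequality -- something no Euclidean embedding can fit.
D = np.array([[0., 10., 40., 25.],
              [12., 0., 30., 60.],
              [38., 28., 0., 15.],
              [22., 55., 18., 0.]])

n, d = D.shape[0], 2          # d: embedding dimension (hypothetical)
rng = np.random.default_rng(0)
X = rng.random((n, d))        # outgoing coordinates, one row per host
Y = rng.random((n, d))        # incoming coordinates

# Alternating least squares for D ~ X @ Y.T: solve for X with Y fixed,
# then for Y with X fixed, and repeat until it settles.
for _ in range(200):
    X = D @ Y @ np.linalg.pinv(Y.T @ Y)
    Y = D.T @ X @ np.linalg.pinv(X.T @ X)

print(np.round(X @ Y.T, 1))   # predicted RTT(i, j) = x_i . y_j
```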
A network coordinates (NC) system is an efficient mechanism for Internet distance prediction with scalable measurements. We first identify the intrinsic cause of the unsatisfactory accuracy of simulation-based NC algorithms. We then propose Pharos, a fully decentralised and hierarchical scheme, to solve this problem. Pharos leverages multiple coordinate sets at different distance scales, with the right scale being chosen for each prediction. We evaluate the performance of the Pharos system with the King data set and latency data from PlanetLab, and compare it with the representative NC system Vivaldi. The experimental results show that Pharos greatly outperforms Vivaldi in Internet distance prediction without adding any significant overhead. Our extensive evaluation results also demonstrate that Pharos can significantly improve the performance of distributed Internet applications such as overlay multicast and server selection.
Network Coordinate (NC) systems provide an efficient and scalable mechanism to estimate latencies among hosts. However, many popular algorithms like Vivaldi suffer greatly from the existence of Triangle Inequality Violations (TIVs). Two-layer systems like Pharos and hierarchical Vivaldi have been proposed to remedy the impact of TIVs. They divide the whole space into several location-based clusters and run NC systems on both a global layer and a local layer. However, the two-layer model is only able to optimize the intra-cluster links, which relate to a limited portion of TIV triangles. In this paper, we propose a new NC system, Tarantula, which divides the space in a novel way. By categorizing the TIVs into three classes, we show that Tarantula handles a much larger portion of existing TIVs than two-layer systems. Moreover, we present two techniques to further strengthen the Tarantula system: 1) relate the updating step size in the Vivaldi algorithm used in Tarantula to the ground-truth latency, so as to improve the prediction of short links; 2) propose Dynamic Cluster Optimization to dynamically adjust the clustering of hosts. Our experimental results show that Tarantula outperforms Pharos and Vivaldi significantly in terms of estimation accuracy. When implementing different NC systems in the applications of server selection and detour finding, Tarantula again performs the best.
There is an alternative called the “Min-Max Heap” that doesn’t have pretty code, but it has shorter dependency chains, which is important on modern hardware. As a result it often ends up faster than a binary heap, even though it allows you to pop from both ends.
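To make the structure concrete (my own sketch, not the article's code): a min-max heap stores min levels and max levels alternately in the usual implicit array, which is what lets one array serve pops from both ends. A small checker for that invariant:

```python
def is_min_max_heap(h):
    """Check the min-max heap invariant on an implicit binary heap array.

    Nodes on even depths (min levels) must be <= everything in their
    subtree; nodes on odd depths (max levels) must be >= everything in
    theirs. Comparing each node with its children and grandchildren
    suffices, because the alternation propagates the property down.
    """
    n = len(h)
    for i in range(n):
        depth = (i + 1).bit_length() - 1
        on_min_level = depth % 2 == 0
        kids = [c for c in (2 * i + 1, 2 * i + 2) if c < n]
        grandkids = [g for g in range(4 * i + 3, 4 * i + 7) if g < n]
        for j in kids + grandkids:
            if on_min_level and h[i] > h[j]:
                return False
            if not on_min_level and h[i] < h[j]:
                return False
    return True

# The minimum sits at index 0; the maximum at index 1 or 2 -- so both
# pop-min and pop-max start from the top of the same array.
assert is_min_max_heap([1, 9, 8, 2, 3, 4, 5])
```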
In this system your questions might change from the form “is it possible to reach a bad state” to “what is the probability of reaching a bad state?” Unfortunately these types of questions just cannot be answered within the nondeterministic model used above. You cannot model probability with nondeterminism. We must use a new type of model, a state machine that handles probability directly.
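To illustrate with a toy example of such a model (a made-up three-state probabilistic state machine, not one from the post): once transitions carry probabilities, the question becomes a simple computation over the chain.

```python
# States: 0 = ok, 1 = degraded, 2 = bad (absorbing).
# P[i][j] is the probability of stepping from state i to state j.
P = [
    [0.90, 0.09, 0.01],
    [0.50, 0.40, 0.10],
    [0.00, 0.00, 1.00],
]

# p[i] = probability of having reached "bad" within k steps from i.
# Because "bad" is absorbing, iterating p <- P @ p (with p[bad] = 1)
# k times answers the quantitative question the nondeterministic
# model cannot: not "can we reach bad?" but "how likely, and how soon?"
p = [0.0, 0.0, 1.0]
for _ in range(100):
    p = [sum(P[i][j] * p[j] for j in range(3)) for i in range(3)]
print(f"P(reach bad within 100 steps | start ok) = {p[0]:.3f}")
```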
Pebble is a LevelDB/RocksDB-inspired key-value store focused on performance and internal use by CockroachDB. Pebble inherits the RocksDB file formats and adds a few extensions, such as range deletion tombstones, table-level bloom filters, and updates to the MANIFEST format.
ankerl::nanobench is a platform-independent microbenchmarking library for C++11/14/17/20.
In this paper, we present Bipartisan Paxos: a family of state machine replication protocols that achieve low latency, high throughput, and simplicity. The Bipartisan Paxos protocols can commit a command in two message delays (the theoretical minimum), achieving low latency. They do not depend on a distinguished leader for normal processing or conflict resolution, achieving high throughput. They are designed with modularity and incrementality in mind, achieving simplicity.
Providing ACID transactions under conflicts across globally distributed data is the Everest of transaction processing protocols. Transaction processing in this scenario is particularly costly due to the high latency of cross-continent network links, which inflates concurrency control and data replication overheads. To mitigate the problem, we introduce Ocean Vista – a novel distributed protocol that guarantees strict serializability. We observe that concurrency control and replication address different aspects of resolving the visibility of transactions, and we address both concerns using a multi-version protocol that tracks visibility using version watermarks and arrives at correct visibility decisions using efficient gossip.
Much like on-premises systems, the natural choice for running database analytics workloads in the cloud is to provision a cluster of nodes to run a database instance. However, analytics workloads are often bursty or low volume, leaving clusters idle much of the time, meaning customers pay for compute resources even when unused. The ability of cloud function services, such as AWS Lambda or Azure Functions, to run small, fine-granularity tasks makes them appear to be a natural choice for query processing in such settings. But implementing an analytics system on cloud functions comes with its own set of challenges. These include managing hundreds of tiny stateless resource-constrained workers, handling stragglers, and shuffling data through opaque cloud services. In this paper we present Starling, a query execution engine built on cloud function services that employs a number of techniques to mitigate these challenges, providing interactive query latency at a lower total cost than provisioned systems with low-to-moderate utilization. In particular, on a 1TB TPC-H dataset in cloud storage, Starling is less expensive than the best provisioned systems for workloads where queries arrive 1 minute apart or more. Starling also has lower latency than competing systems reading from cloud object stores and can scale to larger datasets.
We explicitly motivate the subtle intricacies of Hinze and Paterson's Finger Tree data structure by step-wise refining a naive implementation. The result is a new explanation of how Finger Trees work and why they have the particular structure they have, and also a small simplification of the original implementation.
Three insights motivate our work: (1) the live user traffic accessing a web service provides the most current target workload possible, (2) we can empirically test the system to identify its scalability limits, and (3) the user impact and operational overhead of empirical testing can be largely eliminated by building automation which adjusts live traffic based on feedback. We build on these insights in Kraken, a new system that runs load tests by continually shifting live user traffic to one or more data centers.
Amazon Web Services (AWS) took a fresh look at the network to provide the consistently low latency required for supercomputing applications, while keeping the benefits of public cloud: scalability, elastic on-demand capacity, cost effectiveness, and fast adoption of newer CPUs and GPUs. We built a new network transport protocol, Scalable Reliable Datagram (SRD), designed to utilize modern commodity multi-tenant datacenter networks (with a large number of network paths) while overcoming their limitations (load imbalance and inconsistent latency when unrelated flows collide). Instead of preserving packet order, SRD sends packets over as many network paths as possible, while avoiding overloaded paths. To minimize jitter and to ensure the fastest response to network congestion fluctuations, SRD is implemented in the AWS custom Nitro networking card. SRD is used by HPC/ML frameworks on EC2 hosts via the AWS Elastic Fabric Adapter (EFA) kernel-bypass interface.
VinylDNS is a vendor agnostic front-end for enabling self-service DNS and streamlining DNS operations. It is designed to integrate with your existing DNS infrastructure, and provides extensibility to fit your installation. VinylDNS manages millions of DNS records supporting thousands of engineers in production at Comcast. The platform provides fine-grained access controls, auditing of changes, a self-service user interface, secure RESTful API, and integration with infrastructure automation tools like Ansible and Terraform.
Interactive services often have large-scale parallel implementations. To deliver fast responses, the median and tail latencies of a service's components must be low. In this paper, we explore the hardware, OS, and application-level sources of poor tail latency in high-throughput servers executing on multi-core machines. We model these network services as a queuing system in order to establish the best-achievable latency distribution. Using fine-grained measurements of three different servers (a null RPC service, Memcached, and Nginx) on Linux, we then explore why these servers exhibit significantly worse tail latencies than queuing models alone predict. The underlying causes include interference from background processes, request re-ordering caused by poor scheduling or constrained concurrency models, suboptimal interrupt routing, CPU power saving mechanisms, and NUMA effects.
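For a feel of what the queuing baseline looks like: in the simplest M/M/1 model the sojourn time is exponentially distributed, so the best-achievable tail is easy to compute directly (hypothetical arrival and service rates):

```python
import math

lam = 800.0   # arrival rate (requests/s), hypothetical
mu = 1000.0   # service rate (requests/s), hypothetical

# M/M/1 sojourn time T is exponential with rate (mu - lam):
# P(T > t) = exp(-(mu - lam) * t)
def quantile(p):
    """Latency t such that a fraction p of requests finish within t."""
    return -math.log(1.0 - p) / (mu - lam)

print(f"median = {quantile(0.50) * 1e3:.2f} ms")
print(f"p99    = {quantile(0.99) * 1e3:.2f} ms")
print(f"p99.9  = {quantile(0.999) * 1e3:.2f} ms")
# Even this ideal model has a tail: p99.9 is ~10x the median.
# Real servers do markedly worse, which is the gap the paper explains.
```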
The CHERI architecture allows pointers to be implemented as capabilities (rather than integer virtual addresses) in a manner that is compatible with, and strengthens, the semantics of the C language. In addition to the spatial protections offered by conventional fat pointers, CHERI capabilities offer strong integrity, enforced provenance validity, and access monotonicity. The stronger guarantees of these architectural capabilities must be reconciled with the real-world behavior of operating systems, run-time environments, and applications. When the process model, user-kernel interactions, dynamic linking, and memory management are all considered, we observe that simple derivation of architectural capabilities is insufficient to describe appropriate access to memory. We bridge this conceptual gap with a notional abstract capability that describes the accesses that should be allowed at a given point in execution, whether in the kernel or userspace. To investigate this notion at scale, we describe the first adaptation of a full C-language operating system (FreeBSD) with an enterprise database (PostgreSQL) for complete spatial and referential memory safety. We show that awareness of abstract capabilities, coupled with CHERI architectural capabilities, can provide more complete protection, strong compatibility, and acceptable performance overhead compared with the pre-CHERI baseline and software-only approaches. Our observations also have potentially significant implications for other mitigation techniques.
You'll be familiar with web bugs, the transparent images which track when someone opens an email. They work by embedding a unique URL in a page's image tag and monitoring incoming GET requests. Imagine doing that, but for file reads, database queries, process executions, patterns in log files, Bitcoin transactions, or even LinkedIn profile views. Canarytokens does all this and more, letting you implant traps in your production systems rather than setting up separate honeypots.
Work-stealing is a popular technique to implement dynamic load balancing in a distributed manner. In this approach, each process owns a set of tasks that have to be executed. The owner of the set can put tasks in it and can take tasks from it to execute them. When a process runs out of tasks, instead of being idle, it becomes a thief to steal tasks from a victim. Thus, a work-stealing algorithm provides three high-level operations: Put and Take, which can be invoked only by the owner, and Steal, which can be invoked by a thief. One of the main targets when designing work-stealing algorithms is to make Put and Take as simple and efficient as possible. Unfortunately, it has been shown that any work-stealing algorithm in the standard asynchronous model must use expensive Read-After-Write synchronization patterns or atomic Read-Modify-Write instructions (e.g. Compare&Swap or Test&Set), which may be costly in practice. Thus, prior research has proposed idempotent work-stealing, a relaxation for which there are algorithms with Put and Take devoid of Read-Modify-Write atomic instructions and Read-After-Write synchronization patterns; however, Put uses fences among Write instructions, and Steal uses Compare&Swap and fences among Read instructions. This paper considers work-stealing with multiplicity, a relaxation in which every task is taken by at least one operation, with the requirement that any process can extract a task at most once. Three versions of the relaxation are considered and fully Read/Write algorithms are presented in the standard asynchronous model, all of them devoid of Read-After-Write synchronization patterns; the last algorithm is also fully fence-free.
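For reference, the three-operation interface looks like this (a deliberately naive toy model guarded by one lock; it sidesteps exactly the synchronization costs the paper is about, and is here only to pin down the semantics of Put, Take, and Steal):

```python
import threading
from collections import deque

class ToyWorkStealingDeque:
    """Toy model of the Put/Take/Steal interface only.

    A single lock makes this trivially correct but misses the point of
    the paper, which is achieving these operations without expensive
    Read-Modify-Write instructions or Read-After-Write synchronization.
    """
    def __init__(self):
        self._tasks = deque()
        self._lock = threading.Lock()

    def put(self, task):             # owner only
        with self._lock:
            self._tasks.append(task)

    def take(self):                  # owner only; works the LIFO end
        with self._lock:
            return self._tasks.pop() if self._tasks else None

    def steal(self):                 # any thief; works the FIFO end
        with self._lock:
            return self._tasks.popleft() if self._tasks else None
```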
Paxos, the de facto standard approach to solving distributed consensus, operates in two phases, each of which requires an intersecting quorum of nodes. Multi-Paxos reduces this to one phase by electing a leader but this leader is also a performance bottleneck. Fast Paxos bypasses the leader but has stronger quorum intersection requirements. In this paper we observe that Fast Paxos' intersection requirements can be safely relaxed, reducing to just one additional intersection requirement between phase-1 quorums and any pair of fast round phase-2 quorums. We thus find that the quorums used with Fast Paxos are larger than necessary, allowing alternative quorum systems to obtain new tradeoffs between performance and fault-tolerance.
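The relaxed condition is easy to state executably. A small brute-force checker (hypothetical quorum system: every 4-of-5 subset is both a phase-1 and a fast phase-2 quorum):

```python
from itertools import combinations

nodes = {1, 2, 3, 4, 5}
q1s = [set(c) for c in combinations(nodes, 4)]   # phase-1 quorums
fq2s = [set(c) for c in combinations(nodes, 4)]  # fast phase-2 quorums

def fast_paxos_safe(q1s, fq2s):
    """The relaxed requirement described above: every phase-1 quorum
    must intersect the intersection of any *pair* of fast-round
    phase-2 quorums."""
    return all(q1 & qa & qb
               for q1 in q1s
               for qa, qb in combinations(fq2s, 2))

print(fast_paxos_safe(q1s, fq2s))  # True: 4-of-5 quorums satisfy it
```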
In this post, I want to explore some of the features of Unix domain sockets that make it a suitable candidate for several of these use-cases, especially transferring a socket (or any file descriptor, for that matter) from one process to another where a parent-child relationship doesn’t necessarily exist between the two processes.
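As a minimal sketch of the primitive itself, using the send_fds/recv_fds convenience wrappers around SCM_RIGHTS that Python 3.9+ provides (a socketpair within one process stands in for the two unrelated processes; the file path is arbitrary):

```python
import os
import socket

# In real use the two ends would live in different processes,
# e.g. connected over a named AF_UNIX socket rather than a socketpair.
parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)

with open("/etc/hostname", "rb") as f:  # any fd works: sockets too
    # Send the fd itself (not its contents) as SCM_RIGHTS ancillary data.
    socket.send_fds(parent, [b"here is an fd"], [f.fileno()])

msg, fds, flags, addr = socket.recv_fds(child, 1024, maxfds=1)
received = os.fdopen(fds[0], "rb")  # a brand-new fd to the same open file
print(msg, received.read())
```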
Research has begun to reveal that many algorithms can be expressed as matrix multiplication, suggesting an unrealized connection between linear algebra and computer science. I speculate that graphs are the missing piece of the puzzle. Graphs are not only useful as cognitive aides, but are suitable data structures for a wide variety of tasks, particularly on modern parallel processing hardware. In this essay, I explore the virtues of graphs, algebra, and types, and show how these concepts can help us reason about programs. I propose a computational primitive based on graph signal processing, linking software engineering, graphs, and linear algebra. Finally, I share my predictions for the path ahead, which I consider to be the start of an exciting new chapter in computing history.
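One concrete instance of that connection (my example, not the essay's): a step of breadth-first search is a matrix-vector product, with OR playing the role of addition and AND of multiplication.

```python
import numpy as np

# Adjacency matrix of a 4-node directed graph: A[i][j] = 1 if j -> i.
# Edges here: 0->1, 0->2, 1->3, 2->3.
A = np.array([[0, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [0, 1, 1, 0]])

x = np.array([1, 0, 0, 0])  # indicator vector of the frontier: node 0
visited = x.astype(bool)
while x.any():
    # One BFS step = one mat-vec product over the boolean semiring.
    reached = ((A @ x) > 0) & ~visited
    visited |= reached
    x = reached.astype(int)
print(visited)  # all four nodes are reachable from node 0
```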
In this paper, we present an implementation of a cuckoo filter for membership testing, optimized for distributed data stores operating under high workloads. In large databases, querying becomes inefficient using traditional search methods. To achieve optimal performance it is necessary to use probabilistic data structures to test the membership of a given key, at the cost of getting false positives while querying data. The widely used Bloom filters can serve this purpose, but they have limitations, such as no support for deletes. To improve upon this, we use a modified version of the cuckoo filter that gives better amortized times for search, with fewer false positives.
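A minimal sketch of the underlying idea, partial-key cuckoo hashing (illustrative parameters and hash choices, not the paper's modified filter): each key is reduced to a small fingerprint that can live in one of two buckets, and the XOR trick lets an eviction compute the partner bucket from the fingerprint alone.

```python
import random

class CuckooFilter:
    """Minimal cuckoo filter sketch with 8-bit fingerprints and two
    candidate buckets per item. Illustrative only."""

    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500):
        assert num_buckets & (num_buckets - 1) == 0  # power of two
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, item):
        return (hash(("fp", item)) & 0xFF) or 1  # 8 bits, never 0

    def _partner(self, i, fp):
        # XOR trick: depends only on (i, fp), so evictions can find the
        # partner bucket without the key; applying it twice returns i.
        return i ^ (hash(("alt", fp)) % self.num_buckets)

    def _candidates(self, item):
        fp = self._fingerprint(item)
        i1 = hash(("idx", item)) % self.num_buckets
        return fp, (i1, self._partner(i1, fp))

    def insert(self, item):
        fp, (i1, i2) = self._candidates(item)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):      # evict a random resident
            j = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._partner(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # table is too full

    def contains(self, item):  # false positives possible, no false negatives
        fp, (i1, i2) = self._candidates(item)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def delete(self, item):    # the operation plain Bloom filters lack
        fp, idxs = self._candidates(item)
        for i in idxs:
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False

cf = CuckooFilter()
cf.insert("user:42")
assert cf.contains("user:42")
cf.delete("user:42")
```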
We present Elle: a novel checker which infers an Adya-style dependency graph between client-observed transactions. It does so by carefully selecting database objects and operations when generating histories, so as to ensure that the results of database reads reveal information about their version history. Elle can detect every anomaly in Adya et al.'s formalism [Adya et al. 2000] (except for predicates), discriminate between them, and provide concise explanations of each. This paper makes the following contributions: we present Elle, demonstrate its soundness, measure its efficiency against the current state of the art, and give evidence of its effectiveness via a case study of four real databases.
In this post, we'll talk about how to establish a peer-to-peer connection between two machines, in spite of all the obstacles in the way.
Given a system that needs to be modeled using software stability, how do you translate the components of the system into stability artifacts? A software stability model (SSM) is partitioned into three different levels: Enduring Business Themes (EBTs), Business Objects (BOs), and Industrial Objects (IOs). The EBTs represent elements that remain stable internally and externally. The BOs are objects that are internally adaptable but externally stable, and the IOs are the external interface of the system.
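A tiny illustration of the three levels (hypothetical domain and names, only to make the layering concrete): the EBT names the enduring concept, the BO wraps it behind a stable interface, and the IO is the replaceable concrete edge of the system.

```python
from abc import ABC, abstractmethod

class Payment(ABC):
    """EBT: the enduring business theme -- 'payment' as a concept
    stays stable no matter how the technology changes."""
    @abstractmethod
    def settle(self, amount: float) -> None: ...

class PaymentChannel(ABC):
    @abstractmethod
    def transfer(self, amount: float) -> None: ...

class PaymentProcessor(Payment):
    """BO: externally stable, internally adaptable -- it delegates
    to whatever concrete channel is plugged in."""
    def __init__(self, channel: PaymentChannel):
        self.channel = channel
    def settle(self, amount: float) -> None:
        self.channel.transfer(amount)

class CreditCardGateway(PaymentChannel):
    """IO: an industrial object at the system's edge; swapping it out
    leaves the EBT and BO untouched."""
    def transfer(self, amount: float) -> None:
        print(f"charging card: {amount:.2f}")

PaymentProcessor(CreditCardGateway()).settle(9.99)
```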
The Linux kernel has always been an ideal place to implement monitoring/observability, networking, and security. Unfortunately this was often impractical as it required changing kernel source code or loading kernel modules, and resulted in layers of abstractions stacked on top of each other. eBPF is a revolutionary technology that can run sandboxed programs in the Linux kernel without changing kernel source code or loading kernel modules. By making the Linux kernel programmable, infrastructure software can leverage existing layers, making them more intelligent and feature-rich without continuing to add additional layers of complexity to the system or compromising execution efficiency and safety.
Among the various kinds of locks in the Linux kernel code base, lock_sock() is probably the weirdest one (if RCU is not even weirder). As we all know, there are basically two categories of locks in the Linux kernel: blocking ones, like a mutex or a semaphore, and non-blocking ones, like a spinlock or a read-write lock. Which one you pick largely depends on the context in which you plan to use it. The weird part of this sock lock is that it is actually both blocking and non-blocking, depending on its context.
STX is a collection of libraries and utilities designed to make working with C++ easier and less error-prone. This includes, but is not limited to, well-proven and widely adopted paradigms, data structures, and designs from other prominent projects and programming languages across the software engineering community. At their core, all STX libraries are no_std: no RTTI, no memory allocation, and no exceptions.
Automated monitoring, automated remediation workflows (for example, traffic shifting), and automated deployment systems are critical to detecting and resolving the vast majority of issues at this scale. However, for many reasons we still need to be able to see what these services, workflows, and deployments are doing at any moment in time.
This guide covers timeless ideas that are helpful to keep in mind while working with systems where performance matters. Many of these ideas are fairly "durable" and will apply regardless of what hardware, programming language, operating system, or decade you are working in.
In this post we will set out to establish a WireGuard tunnel between dynamically addressed peers that are both sitting behind a NAT. One of the primary goals for achieving this is to stick with WireGuard in its purest form, the code that now ships with the Linux Kernel. We do not want to compromise it in any fashion to achieve our goals, although we could get very creative with its user space implementation
This guide describes how to tune your AMD64/x86_64 hardware and Linux system for running real-time or low latency workloads.
Large-scale decentralized systems of autonomous agents interacting via asynchronous communication often experience the following self-healing dilemma: fault detection inherits network uncertainties, making a faulty process indistinguishable from a slow one. The implications can be dramatic: self-healing mechanisms become biased and cost-ineffective. In particular, triggering an undesirable fault correction introduces new faults that could have been prevented with fault tolerance instead. Nevertheless, fault tolerance alone, without eventually correcting persistent faults, leaves systems underperforming as well. Measuring, understanding, and resolving such self-healing dilemmas is a timely challenge and a critical requirement given the rise of distributed ledgers, edge computing, and the Internet of Things in application domains spanning energy, transport, and health. This paper introduces a novel and general-purpose model of fault scenarios that can accurately measure and predict inconsistencies.
Edge computing moves the computation closer to the data and the data closer to the user to overcome the high-latency communication of cloud computing. Storage at the edge allows data access at the high speeds that enable latency-sensitive applications in areas such as autonomous driving and smart grids. However, several distributed services are typically designed for the cloud, and building an efficient edge-enabled storage system is challenging because of the distributed and heterogeneous nature of the edge and its limited resources. In this paper, we propose EdgeKV, a decentralized storage system designed for the network edge. EdgeKV offers fast and reliable storage, utilizing data replication with strong consistency guarantees. With a location-transparent and interface-based design, EdgeKV can scale with a heterogeneous system of edge nodes. We implement a prototype of the EdgeKV modules in Golang and evaluate it in both edge and cloud settings on the Grid'5000 testbed. We utilize the Yahoo! Cloud Serving Benchmark (YCSB) to analyze the system's performance under realistic workloads. Our evaluation results show that EdgeKV outperforms the cloud storage setting for both local and global data access, with average write response time and throughput improvements of 26% and 19%, respectively, under the same settings. Our evaluations also show that EdgeKV can scale with the number of clients without sacrificing performance. Finally, we discuss the energy efficiency improvement gained by utilizing edge resources with EdgeKV instead of a centralized cloud.
Traditional approaches to replication require client requests to be ordered before making them durable by copying them to replicas. As a result, clients must wait for two round-trip times (RTTs) before updates complete. In this paper, we show that this entanglement of ordering and durability is unnecessary for strong consistency. Consistent Unordered Replication Protocol (CURP) allows clients to replicate requests that have not yet been ordered, as long as they are commutative. This strategy allows most operations to complete in 1 RTT (the same as an unreplicated system). We implemented CURP in the Redis and RAMCloud storage systems. In RAMCloud, CURP improved write latency by ~2x (14us -> 7.1us) and write throughput by 4x. Compared to unreplicated RAMCloud, CURP's latency overhead for 3-way replication is just 1us (6.1us vs 7.1us). CURP transformed a non-durable Redis cache into a consistent and durable storage system with only a small performance overhead.
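A toy model of the witness-side check that makes the 1-RTT path safe (my own sketch of the idea, not RAMCloud's code): a witness records an unordered request only if it commutes with everything it already holds, which for a key-value store reduces to touching a different key.

```python
class Witness:
    """Accepts unsynced records only while they all commute."""
    def __init__(self):
        self.unsynced = {}  # key -> operation

    def record(self, key, op):
        if key in self.unsynced:
            # Non-commutative with an unsynced op: reject, so the
            # client falls back to the slow (ordered, 2-RTT) path.
            return False
        self.unsynced[key] = op
        return True

    def gc(self, synced_keys):
        # The master has made these durable and ordered; drop them.
        for k in synced_keys:
            self.unsynced.pop(k, None)

w = Witness()
assert w.record("x", ("set", 1))      # fast path: completes in 1 RTT
assert w.record("y", ("set", 2))      # different key: commutes
assert not w.record("x", ("set", 3))  # conflicts: needs ordering first
```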
State machine replication protocols, like MultiPaxos and Raft, are at the heart of nearly every strongly consistent distributed database. To tolerate machine failures, these protocols must replace failed machines with live machines, a process known as reconfiguration. Reconfiguration has become increasingly important over time as the need for frequent reconfiguration has grown. Despite this, reconfiguration has largely been neglected in the literature. In this paper, we present Matchmaker Paxos and Matchmaker MultiPaxos, a reconfigurable consensus and state machine replication protocol respectively. Our protocols can perform a reconfiguration with little to no impact on the latency or throughput of command processing; they can perform a reconfiguration in one round trip (theoretically) and a few milliseconds (empirically); they provide a number of theoretical insights; and they present a framework that can be generalized to other replication protocols in a way that previous reconfiguration techniques can not. We provide proofs of correctness for the protocols and optimizations, and present empirical results from an open source implementation.
In this post we'll design a break glass procedure for reaching SSH hosts in an emergency, using security keys that you can store offline. This is just one approach, but you can adapt it to your circumstances. We will store an offline SSH Certificate Authority on a hardware security key, and have our hosts trust that CA.
This is a short guide describing the latency implications of the virtual memory abstraction. If you are building systems requiring low and predictable latency such as realtime audio processing, control and high frequency trading (HFT) / algorithmic trading systems this guide will be useful to you. It is written from the perspective Linux kernel running on AMD64 / x86-64 architecture, but the general concepts applies to most operating systems and CPU architectures.
This paper presents a system called NetKernel that decouples the network stack from the guest virtual machine and offers it as an independent module. NetKernel represents a new paradigm where network stack can be managed as part of the virtualized infrastructure. It provides important efficiency benefits: By gaining control and visibility of the network stack, operator can perform network management more directly and flexibly, such as multiplexing VMs running different applications to the same network stack module to save CPU. Users also benefit from the simplified stack deployment and better performance. For example mTCP can be deployed without API change to support nginx natively, and shared memory networking can be readily enabled to improve performance of colocated VMs. Testbed evaluation using 100G NICs shows that NetKernel preserves the performance and scalability of both kernel and userspace network stacks, and provides the same isolation as the current architecture
State machine replication (SMR) is a well-known approach to implementing fault-tolerant services, providing high availability and strong consistency. To boost the performance of SMR, some proposals execute independent commands concurrently, while dependent commands execute sequentially in the total delivery order. The most general approach to handling command dependencies resorts to a directed acyclic graph (DAG), where nodes represent commands and edges represent dependencies. In this paper we show that, due to the command arrival and multithreaded execution rates of SMR, a highly concurrent implementation of a DAG is needed. We show that a typical coarse-grained DAG implementation, where the whole graph is a critical section, results in a bottleneck in the replica. We propose two improvements to the coarse-grained DAG approach: fine-grained algorithms, using lock coupling, and lock-free algorithms. Our fine-grained algorithms lock individual vertices in the DAG. The lock-free algorithms use nonblocking synchronization, with atomic operations, and lazy synchronization to postpone the physical removal of nodes. All algorithms were integrated in a parallel SMR prototype. Experimental evaluation revealed that the fine-grained algorithms are also subject to a bottleneck. The lock-free implementation, however, sports linear speedup with the number of working threads, in some cases scaling up to 64 threads.
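To make the coarse-grained baseline concrete (a toy sketch, not the paper's implementation): with the whole graph behind a single lock, every insertion serializes on the conflict scan, which is the replica bottleneck the paper measures.

```python
import threading

class CoarseGrainedDAG:
    """Whole-graph critical section: simple, but a serialization point."""
    def __init__(self, conflicts):
        self.conflicts = conflicts      # conflicts(a, b) -> bool
        self.lock = threading.Lock()
        self.pending = []               # commands not yet executed

    def add(self, cmd):
        with self.lock:                 # every caller queues up here
            # Edges to every pending command this one depends on.
            deps = [c for c in self.pending if self.conflicts(cmd, c)]
            self.pending.append(cmd)
            return deps

    def remove(self, cmd):              # called after cmd executes
        with self.lock:
            self.pending.remove(cmd)

# e.g. commands conflict when they touch the same key:
dag = CoarseGrainedDAG(lambda a, b: a["key"] == b["key"])
dag.add({"op": "set", "key": "x"})
print(dag.add({"op": "get", "key": "x"}))  # depends on the first command
```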
Barrierd offers the same functionality as membarrier(2)'s regular (non-EXPEDITED) asymmetric barriers. However, by tracking interrupts instead of waiting for a full RCU grace period, the barrier conditions are satisfied more quickly (on the order of 0.1-4 ms on my machine, rather than 25-80 ms).
The membarrier system call is used for synchronizing access to data structures shared by multiple threads. Its main use case is implementing synchronization primitives that can be split into fast and slow paths, for example the read-copy-update (RCU) algorithm. As the name implies, it's used to implement memory barrier semantics without actually using memory barrier instructions.
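A minimal sketch of calling it from userspace via ctypes (assumes Linux on x86-64, where the membarrier syscall number is 324; command values are from <linux/membarrier.h>):

```python
import ctypes

libc = ctypes.CDLL(None, use_errno=True)

NR_MEMBARRIER = 324           # x86-64 syscall number; arch-specific
MEMBARRIER_CMD_QUERY = 0      # returns a bitmask of supported commands
MEMBARRIER_CMD_GLOBAL = 1     # slow global barrier (non-expedited)

supported = libc.syscall(NR_MEMBARRIER, MEMBARRIER_CMD_QUERY, 0)
if supported >= 0 and supported & MEMBARRIER_CMD_GLOBAL:
    # After this returns, every running thread has executed a full
    # memory barrier -- the expensive half of a fast/slow asymmetric
    # scheme, letting the fast path use plain loads and stores.
    libc.syscall(NR_MEMBARRIER, MEMBARRIER_CMD_GLOBAL, 0)
```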
Back in February 2020, Blelloch and Wei submitted this cool preprint: Concurrent Reference Counting and Resource Management in Wait-free Constant Time. Their work mostly caught my attention because they propose a wait-free implementation of hazard pointers for safe memory reclamation. Safe memory reclamation (PDF) is a key component in lock-free algorithms when garbage collection isn't an option, and hazard pointers (PDF) let us bound the amount of resources stranded by delayed cleanups much more tightly than, e.g., epoch reclamation (PDF). However, the usual implementation has a loop in its read barriers (in the garbage collection sense), which can be annoying for code generation and bad for worst-case time bounds. Blelloch and Wei's wait-free algorithm eliminates that loop.
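The loop in question is the classic hazard-pointer acquire protocol, sketched schematically below (plain Python stands in for atomic loads and stores; this is the standard retry-loop construction, not Blelloch and Wei's wait-free one):

```python
# One published hazard slot for this thread; a shared pointer cell.
hazard = [None]          # reclaimers scan every thread's slot
src = {"ptr": object()}  # stand-in for an atomic pointer

def protect():
    """Classic hazard-pointer read barrier: loop until the pointer we
    published is still the current one, so reclaimers must spare it."""
    while True:
        p = src["ptr"]        # 1. read the pointer
        hazard[0] = p         # 2. publish it as hazardous
        if src["ptr"] is p:   # 3. re-check: did it change underneath us?
            return p          #    no -> p is protected from reclamation
        # yes -> retry; this unbounded loop is exactly what wait-free
        # constructions like Blelloch and Wei's eliminate.

node = protect()
hazard[0] = None  # release the protection when done with node
```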
This package provides a simple way to enable basic seccomp system call filtering in any application (even proprietary ones) via environment variables.
Ck_ec implements 32- and 64-bit event counts. Event counts let us easily integrate OS-level blocking (e.g., futexes) into lock-free protocols. Waiters block conditionally, only if the event count's value is still equal to some old value.
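A sketch of the event-count idea in plain Python, with a condition variable standing in for the futex that ck_ec actually uses:

```python
import threading

class EventCount:
    """Block only if the counter still equals the value we observed."""
    def __init__(self):
        self.value = 0
        self.cond = threading.Condition()

    def wait(self, old):
        with self.cond:
            # If someone advanced the count since we read `old`, the
            # event we were waiting for already happened: don't sleep.
            while self.value == old:
                self.cond.wait()

    def advance(self):
        with self.cond:
            self.value += 1
            self.cond.notify_all()

ec = EventCount()
old = ec.value   # the lock-free fast path reads the count,
#                  checks its own predicate, and only then blocks:
threading.Timer(0.1, ec.advance).start()
ec.wait(old)     # wakes once advance() bumps the count
```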
Gcassert is a program for making assertions about compiler decisions in Go programs, via inline comment directives like //gcassert:inline.
We present a general transformation that takes any concurrent data structure written using CAS and adds wait-free linearizable query operations to it. These query operations may access arbitrary parts of the data structure, and do not interfere with the progress or running time of other operations. For example, our transformation can be used to add efficient and linearizable range queries, predecessor queries, and top-k queries to existing concurrent set data structures. We achieve this by presenting an efficient technique for taking lazy snapshots of CAS-based data structures.
Big Data is defined as a high volume and variety of data with an exponential growth rate. Data are amalgamated to generate revenue, which results in large data silos. Data are the oil of the modern IT industry, and they are growing at an exponential pace. The access mechanism for these data silos is defined by metadata, which is decoupled from the data servers for various practical reasons, for instance ease of maintenance. The metadata are stored in a metadata server (MDS); therefore, the study of the MDS is mandatory in the design of a large-scale storage system. The architecture of an MDS depends on the storage system's requirements, and MDSs are categorized in various ways depending on the underlying architecture and design methodology. This article surveys the various kinds of MDS architectures, designs, and methodologies. It emphasizes clustered MDS (cMDS), and the reports are organized as a) Bloom filter-based MDS, b) Client-funded MDS, c) Geo-aware MDS, d) Cache-aware MDS, e) Load-aware MDS, f) Hash-based MDS, and g) Tree-based MDS. Additionally, the article presents the issues and challenges of MDS for mammoth-sized data.
Modern computing systems are highly concurrent. Threads run concurrently in shared-memory multi-core systems, and programs run in different servers communicating by sending messages to each other. Concurrent programming is hard because it requires coping with the many possible, unpredictable behaviors of the processes and the communication media. The article argues that right from the start in the 1960s, the main way of dealing with concurrency has been by reduction to sequential reasoning. It traces this history and illustrates it through several examples, from early ideas based on mutual exclusion (which was initially introduced to access shared physical resources), passing through consensus and concurrent objects (which are immaterial data), up to today's distributed ledgers. A discussion is also presented which addresses the limits that this approach encounters, related to fault-tolerance, performance, and inherently concurrent problems.
In this article, we focus on latencies rather than throughput. We first document the fact that LSM KVs exhibit high tail latencies. The techniques that have been proposed for optimizing throughput do not address this issue, and, in fact, in some cases, exacerbate it. The root cause of these high tail latencies is interference between client writes, flushes, and compactions. Another major cause for tail latency is the heterogeneous nature of the workloads in terms of operation mix and item sizes whereby a few more computationally heavy requests slow down the vast majority of smaller requests. We introduce the notion of an Input/Output (I/O) bandwidth scheduler for an LSM-based KV store to reduce tail latency caused by interference of flushing and compactions and by workload heterogeneity. We explore three techniques as part of this I/O scheduler: (1) opportunistically allocating more bandwidth to internal operations during periods of low load, (2) prioritizing flushes and compactions at the lower levels of the tree, and (3) separating client requests by size and by data access path. SILK+ is a new open-source LSM KV that incorporates this notion of an I/O scheduler.
The authors proposed four desirable properties of transaction processing systems for achieving low latency of read transactions, with asynchronous and reliable communication, and referred to them collectively as the SNOW properties. The properties, in the context of an execution, are: (i) strict serializability (S), where read and write transactions appear to occur atomically; (ii) non-blocking (N), meaning that for every read operation during a read transaction, the response at the corresponding server is non-blocking; (iii) one version and one round (O), meaning that every read operation during a read transaction completes in one round of client-server communication and the respective server responds with only one version of the object value; and (iv) concurrent write transactions (W), which states that read transactions can have concurrent write transactions. They then argued that it is impossible to implement all four properties in the same system, even with at least three clients. They referred to this result as the SNOW theorem, and posed the two-client setting as an open question.
The exponential growth of digital information is imposing increasing scale and efficiency demands on modern storage infrastructures. As infrastructure complexity increases, so does the difficulty in ensuring quality of service, maintainability, and resource fairness, raising unprecedented performance, scalability, and programmability challenges. Software-Defined Storage (SDS) addresses these challenges by cleanly disentangling control and data flows, easing management, and improving the control functionality of conventional storage systems. Despite its momentum in the research community, many aspects of the paradigm are still unclear, undefined, and unexplored, leading to misunderstandings that hamper the research and development of novel SDS technologies. In this article, we present an in-depth study of SDS systems, providing a thorough description and categorization of each plane of functionality. Further, we propose a taxonomy and classification of existing SDS solutions according to different criteria. Finally, we provide key insights about the paradigm and discuss potential future research directions for the field.
This guide walks through creating PlantUML diagrams.
In today's enterprise storage systems, supported data services such as snapshot delete or drive rebuild can cause tremendous performance interference if executed inline along with heavy foreground IO, often leading to missed SLOs (Service Level Objectives). Typical storage system applications such as web or VDI (Virtual Desktop Infrastructure) follow a repetitive high/low workload pattern that can be learned and forecast. We propose a priority-based background scheduler that learns this repetitive pattern and allows storage systems to maintain peak performance and in turn meet service level objectives (SLOs) while supporting a number of data services. When foreground IO demand intensifies, system resources are dedicated to servicing foreground IO requests, and any background processing that can be deferred is recorded to be processed in future idle cycles, as long as the forecast shows that the storage pool has remaining capacity. The smart background scheduler adopts a resource partitioning model that allows both foreground and background IO to execute together as long as foreground IOs are not impacted, with the scheduler harnessing any free cycles to clear background debt. Using traces from a VDI application, we show how our technique surpasses a method that statically limits the deferred background debt, reducing SLO violations from 54.6% when using a fixed background-debt watermark to merely 6.2% when the watermark is set dynamically by our smart background scheduler.
Cloud providers offer a variety of execution platforms in the form of bare metal, VMs, and containers. However, due to the pros and cons of each execution platform, choosing the appropriate platform for a specific cloud-based application has become a challenge for solution architects. The possibility of combining these platforms (e.g. deploying containers within VMs) offers new capacities that make the challenge even more complicated. However, there is little study in the literature on the pros and cons of deploying different application types on the various execution platforms. In particular, the evaluation of diverse hardware configurations and different CPU provisioning methods, such as CPU pinning, has not been sufficiently studied in the literature. In this work, the performance overheads of container, VM, and bare-metal execution platforms are measured and analyzed for four categories of real-world applications, namely video processing, parallel processing (MPI), web processing, and NoSQL, respectively representing CPU-intensive, parallel-processing, and two IO-intensive workloads. Our analyses reveal a set of interesting and sometimes counterintuitive findings that can be used as best practices by solution architects to efficiently deploy cloud-based applications. Here are some notable mentions: (A) under specific circumstances, containers can impose a higher overhead than VMs; (B) containers on top of VMs can mitigate the overhead of VMs for certain applications; (C) containers with a large number of cores impose a lower overhead than those with a few cores.
The efficient implementation of function calls and non-local control transfers is a critical part of modern language implementations and is important in the implementation of everything from recursion, higher-order functions, concurrency and coroutines, to task-based parallelism. In a compiler, these features can be supported by a variety of mechanisms, including call stacks, segmented stacks, and heap-allocated continuation closures. An implementor of a high-level language with advanced control features might ask the question "what is the best choice for my implementation?" Unfortunately, the current literature does not provide much guidance, since previous studies suffer from various flaws in methodology and are outdated for modern hardware. In the absence of recent, well-normalized measurements and a holistic overview of their implementation specifics, the path of least resistance when choosing a strategy is to trust folklore, but the folklore is also suspect. This paper attempts to remedy this situation by providing an "apples-to-apples" comparison of six different approaches to implementing call stacks and continuations. This comparison uses the same source language, compiler pipeline, LLVM-backend, and runtime system, with the only differences being those required by the differences in implementation strategy. We compare the implementation challenges of the different approaches, their sequential performance, and their suitability to support advanced control mechanisms, including supporting heavily threaded code. In addition to the comparison of implementation strategies, the paper's contributions also include a number of useful implementation techniques that we discovered along the way.
Smoltcp is a standalone, event-driven TCP/IP stack designed for bare-metal, real-time systems. Its design goals are simplicity and robustness. Its design anti-goals include complicated compile-time computations, such as macro or type tricks, even at the cost of performance degradation.
The libpoireau library intercepts a small fraction of calls to malloc/calloc/etc., to generate a statistically representative overview of an application's heap footprint
In this paper, we set out to revisit the results of "Staring into the Abyss [...] of Concurrency Control with 1000 Cores" and analyse in-memory DBMSs on today's large hardware. Despite the original assumption of the authors, today we do not see single-socket CPUs with 1000 cores. Instead, multi-socket hardware made its way into production data centres. Hence, we follow up on this prior work with an evaluation of the characteristics of concurrency control schemes on real production multi-socket hardware with 1568 cores. To our surprise, we made several interesting findings which we report on in this paper.
Hash tables are an essential data structure for numerous networking applications (e.g., connection tracking, firewalls, network address translators). Among these, cuckoo hash tables provide excellent performance by allowing lookups to be processed with very few memory accesses (2 to 3 per lookup). Yet, for large tables, cuckoo hash tables remain memory bound, and each memory access impacts performance. In this paper, we propose algorithmic improvements to cuckoo hash tables that eliminate some unnecessary memory accesses; these changes are made without altering the properties of the original cuckoo hash table, so that all existing theoretical analysis remains applicable. On a single core, our hash table achieves 37M lookups per second for positive lookups (i.e., when the key looked up is present in the table), and 60M lookups per second for negative lookups, a 50% improvement over the implementation included in DPDK. On 18 cores, with mostly positive lookups, our implementation achieves 496M lookups per second, a 45% improvement over DPDK.
The fundamental challenges that every researcher, systems architect, or developer faces when designing a new access method are how to minimize: i) read times (R), ii) update cost (U), and iii) memory (or storage) overhead (M). In this project, we first conjecture that when optimizing the read-update-memory overheads, optimizing for any two areas negatively impacts the third.
A collection of tools for developers who have little to no artistic talent.
Htmx allows you to access AJAX, WebSockets, and Server-Sent Events directly in HTML, using attributes, so you can build modern user interfaces with the simplicity and power of hypertext.
DuckDB is an embedded database designed to execute analytical SQL queries fast while running inside another process.
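For example, a minimal session with DuckDB's Python API (table and data made up):

```python
import duckdb

con = duckdb.connect()  # in-memory database inside this process
con.execute("CREATE TABLE hits(url VARCHAR, ms DOUBLE)")
con.execute("INSERT INTO hits VALUES ('/a', 12.5), ('/a', 8.1), ('/b', 30.0)")

# Analytical queries run in-process; no server, no network round trips.
rows = con.execute(
    "SELECT url, avg(ms) AS avg_ms FROM hits GROUP BY url ORDER BY url"
).fetchall()
print(rows)  # [('/a', 10.3), ('/b', 30.0)]
```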
C library for accessing the PostgreSQL parser outside of the server. This library uses the actual PostgreSQL server source to parse SQL queries and return the internal PostgreSQL parse tree.
"Distroless" images contain only your application and its runtime dependencies. They do not contain package managers, shells or any other programs you would expect to find in a standard Linux distribution.
For decades, applications deployed on a world-wide scale have been forced to give up at least one of (1) strict serializability (2) low latency writes (3) high transactional throughput. In this paper we discuss SLOG: a system that avoids this tradeoff for workloads which contain physical region locality in data access. SLOG achieves high-throughput, strictly serializable ACID transactions at geo-replicated distance and scale for all transactions submitted across the world, all the while achieving low latency for transactions that initiate from a location close to the home region for data they access. Experiments find that SLOG can reduce latency by more than an order of magnitude relative to state-of-the-art strictly serializable geo-replicated database systems such as Spanner and Calvin, while maintaining high throughput under contention.
The shared log paradigm is at the heart of modern distributed applications in the growing cloud computing industry. Often, application logs must be stored durably for analytics, regulations, or failure recovery, and their smooth operation depends closely on how the log is implemented. Scalog is a new implementation of the shared log abstraction that offers an unprecedented combination of features for continuous smooth delivery of service: Scalog allows applications to customize data placement, supports reconfiguration with no loss in availability, and recovers quickly from failures. At the same time, Scalog provides high throughput and total order.
Llvm-mca is a performance analysis tool that uses information available in LLVM (e.g. scheduling models) to statically measure the performance of machine code in a specific CPU. Performance is measured in terms of throughput as well as processor resource consumption. The tool currently works for processors with an out-of-order backend, for which there is a scheduling model available in LLVM.
An error kernel is the part of an application or system that should never fail. It should ideally be as small as possible, since the less surface area it has, the less likely catastrophic failures are to happen. And when they do happen, the core system is small enough to reason about, which greatly helps with triage and the cauterizing of wounds.