What are the uses to which gossip is particularly well-matched, and what are its limitations? What alternatives are there to gossip-based solutions, and when would we be better-off using a non-gossip protocol? When, in effect, is gossip the technology of choice?
This style of short-term planning, direct customer contact, and continuous iteration is well suited to software with a simple core and lots of customer visible features that are incrementally useful. It is not so well suited to software which has a very simple interface and tons of hidden internal complexity, software which isn’t useful until it’s fairly complete, or leapfrog solutions the customer can’t imagine.
The Haskell Phrasebook is a free quick-start Haskell guide comprised of a sequence of small annotated programs. It provides a cursory overview of selected Haskell features, jumping-off points for further reading, and recommendations to help get you writing programs as soon as possible
Andjelko Iharos explores the goals, design and the choices behind the implementations of EBtree, and how they produce a very fast and versatile data storage for many of HAProxys advanced features. EBtree is a different take on the ubiquitous tree data structure, and has been helping HAProxy, a high performance open source software load balancer, to keep up with demands for over a decade.
In this post, I describe a tri-factor model which I find more useful in the analysis of hash table algorithms and discuss several state-of-the-art algorithms in the context of this model.
The backoff algorithm is a part of Media Access Control (MAC) protocol which used to avoid collision in the Mobile Ad hoc Network (MANET). When the nodes in the network try to access the channel, one of these nodes gains access the channel while the other nodes still contend for a time period. Many backoff algorithms have been proposed to improve network performance. One of these algorithms is Fibonacci increment backoff (FIB), FIB algorithm achieves higher throughput than the exponential backoff that is used by the standard IEEE 802.11 when it used in a mobile ad hoc network. The Pessimistic Linear-exponential Backoff (PLEB) is another proposed backoff algorithm which uses a combination of two increment behaviors; Exponential backoff and Linear backoff this scheme merges the advantages of the two increment behaviors. Exponential increments give enough backoff time to enhance the network throughput by reducing the number of transmission failures, and the linear increment reduces the average packet delay. Ad hoc On demand Distance Vector (AODV) routing protocol use a demand-driven route establishment procedure. AODV maintain the route table at each node. This paper uses different backoff algorithms at different values of rebroadcast probability.
Say no to $ apt install vim in containers! cntr is a replacement for docker exec that brings all your developers tools with you. This allows to ship minimal runtime image in production and limit the surface for exploits.
In this article, I want to highlight how simple linker map files are and how much they can teach you about the program you are working on.
With the rapid developments of virtualization techniques, cloud data centers have had enabled cost effective, flexible, and customizable deployments of applications in virtualized infrastructure. Virtual machine placement problem is a problem of paramount importance to the design of cloud data centers. It aims to assign each virtual machine to a server in the cloud environment. Typically, the problem involves complex relations and multiple design factors as well as local policies that govern the assignment decisions. It also involves different parties including cloud administrators and customers that we need to consider their preferences while opting for a solution. Thus, it is significant to not only return an optimized solution to the underlying placement problem but also a solution that reflects the given preferences of these parties. In this paper, we provide a detailed review on the role of preferences in the current literature of the virtual machine placement problem. We further discuss some challenges and identify possible research opportunities to better incorporate preferences within the problem.
Real-time performance monitoring, done right
A circular buffer written in C using Posix calls to create a contiguously mapped memory space.
The wait-free hierarchy hrm classifies AsynchronousSharedMemory object types T by consensus number, where a type T has consensus number n if with objects of type T and atomic registers (all initialized to appropriate values) it is possible to solve wait-free consensus (i.e., agreement, validity, wait-free termination) for n processes but not for n 1 processes. The consensus number of any type is at least 1, since 1-process consensus requires no interaction, and may range up to ∞ for particularly powerful objects.
Highly available and high-performance message logging system is critical building block for various use cases that require global ordering, especially for deterministic distributed transactions. To achieve availability, we maintain multiple replicas that have the same payloads in exactly the same order. This introduces various challenging issues such as consistency between replicas after failure, while minimizing performance degradation. Replicated state machine-based consensus protocols are the most suitable candidates to fulfill those requirements, but double-write problem and different logging granularity make it hard to keep the system efficient. This paper suggests a novel way to build a replicated log store on top of Raft consensus protocol, aiming at providing the same level of consistency as well as fault-tolerance without sacrificing the throughput of the system.
Distributed systems are hard to reason about largely because of uncertainty about what may go wrong in a particular execution, and about whether the system will mitigate those faults. Tools that perturb executions can help test whether a system is robust to faults, while tools that observe executions can help better understand their system-wide effects. We present Box of Pain, a tracer and fault injector for unmodified distributed systems that addresses both concerns by interposing at the system call level and dynamically reconstructing the partial order of communication events based on causal relationships. Box of Pain’s lightweight approach to tracing and focus on simulating the effects of partial failures on communication rather than the failures themselves sets it apart from other tracing and fault injection systems. We present evidence of the promise of Box of Pain and its approach to lightweight observation and perturbation of distributed systems.
OneFile is a Software Transactional Memory (STM) meant to make it easy to implement lock-free and wait-free data structures
This book is a guide to how we do product development at Basecamp. It’s also a toolbox full of techniques that you can apply in your own way to your own process.
Partisan is the design of an alternative runtime system for improved scalability and reduced latency in actor applications
The difficulty with implementing saturated arithmetic is creating all the checks for overflow. The naive method requires several branches in the code. Since jumps tend to be slow to execute, the resulting math executes slowly. For performance, branch-free code is suggested. Another problem is that the overflow or carry flags are not directly accessible from high level languages like C. This means we may need rather complex round-about methods to determine whether or not an overflow occurred. Such code is difficult to make fast.
Priority queues are fundamental abstract data structures, often used to manage limited resources in parallel programming. Several proposed parallel priority queue implementations are based on skiplists, harnessing the potential for parallelism of the add() operations. In addition, methods such as Flat Combining have been proposed to reduce contention by batching together multiple operations to be executed by a single thread. While this technique can decrease lock-switching overhead and the number of pointer changes required by the removeMin() operations in the priority queue, it can also create a sequential bottleneck and limit parallelism, especially for non-conflicting add() operations. In this paper, we describe a novel priority queue design, harnessing the scalability of parallel insertions in conjunction with the efficiency of batched removals. Moreover, we present a new elimination algorithm suitable for a priority queue, which further increases concurrency on balanced workloads with similar numbers of add() and removeMin() operations. We implement and evaluate our design using a variety of techniques including locking, atomic operations, hardware transactional memory, as well as employing adaptive heuristics given the workload.
Package file handles write-ahead logs and space management of os.File-like entities
A date and time library based on the C 11/14/17 <chrono> header. Slightly modified versions of "date.h" and "tz.h" were voted into the C 20 working draft
Arachne is a new user-level implementation of threads that provides both low latency and high throughput for applications with extremely short-lived threads (only a few microseconds). Arachne is core-aware: each application determines how many cores it needs, based on its load; it always knows exactly which cores it has been allocated, and it controls the placement of its threads on those cores. A central core arbiter allocates cores between applications. Adding Arachne to memcached improved SLO-compliant throughput by 37%, reduced tail latency by more than 10x, and allowed memcached to coexist with background applications with almost no performance impact. Adding Arachne to the RAMCloud storage system increased its write throughput by more than 2.5x. The Arachne threading library is optimized to minimize cache misses; it can initiate a new user thread on a different core (with load balancing) in 320 ns. Arachne is implemented entirely at user level on Linux; no kernel modifications are needed
High-performance servers are non-uniform memory access (NUMA) machines. To fully leverage these machines, programmers need efficient concurrent data structures that are aware of the NUMA performance artifacts. We propose Node Replication (NR), a black-box approach to obtaining such data structures. NR takes an arbitrary sequential data structure and automatically transforms it into a NUMA-aware concurrent data structure satisfying linearizability. Using NR requires no expertise in concurrent data structure design, and the result is free of concurrency bugs. NR draws ideas from two disciplines: shared-memory algorithms and distributed systems. Briefly, NR implements a NUMA-aware shared log, and then uses the log to replicate data structures consistently across NUMA nodes. NR is best suited for contended data structures, where it can outperform lock-free algorithms by 3.1x, and lock-based solutions by 30x. To show the benefits of NR to a real application, we apply NR to the data structures of Redis, an in-memory storage system. The result outperforms other methods by up to 14x. The cost of NR is additional memory for its log and replicas.
Mimalloc (pronounced "me-malloc") is a general purpose allocator with excellent performance characteristics. Initially developed by Daan Leijen for the run-time systems of the Koka and Lean languages. It is a drop-in replacement for malloc and can be used in other programs without code changes
In this paper, we propose a sizing of network buffers that is easy to deploy, requires no application-level state or semantics, requires no measurements from the network, and is fully decentralized. Our approach does not rely on machine learning or other complex algorithms. The tradeoff inherent in our proposed buffer size is that it is very likely inefficient across a wide range of deployments. However, there is likely a small, finite number of networks where our proposed buffer size is optimal. This paper is tailored to such networks.
In this paper, we perform the first systematic study onconcurrency bugs in real Go programs. We studied six pop-ular Go software including Docker, Kubernetes, and gRPC.We analyzed 171 concurrency bugs in total, with more thanhalf of them caused by non-traditional, Go-specific problems.Apart from root causes of these bugs, we also studied theirfixes, performed experiments to reproduce them, and eval-uated them with two publicly-available Go bug detectors.Overall, our study provides a better understanding on Go’sconcurrency models and can guide future researchers andpractitioners in writing better, more reliable Go softwareand in developing debugging and diagnosis tools for Go.
After ten years in print, our publisher decided against further printings and has reverted the rights to us. We are publishing Elements of Programming in two forms: a free PDF and a paperback.
Some software, languages and models that have HobbyHorses.
Determining whether online users are authorized to access digital objects is central to preserving privacy. This paper presents the design, implementation, and deployment of Zanzibar, a global system for storing and evaluating access control lists. Zanzibar provides a uniform data model and configuration language for expressing a wide range of access control policies from hundreds of client services at Google, including Calendar, Cloud, Drive, Maps, Photos, and YouTube. Its authorization decisions respect causal ordering of user actions and thus provide external consistency amid changes to access control lists and object contents. Zanzibar scales to trillions of access control lists and millions of authorization requests per second to support services used by billions of people. It has maintained 95th-percentile latency of less than 10 milliseconds and availability of greater than 99.999% over 3 years of production use.
The Berkeley Tree DataBase provides very fast storage of scalar-valued timeseries data
Snmalloc is a research allocator. Its key design features are: Memory that is freed by the same thread that allocated it does not require any synchronising operations. Freeing memory in a different thread to initially allocated it, does not take any locks and instead uses a novel message passing scheme to return the memory to the original allocator, where it is recycled. The allocator uses large ranges of pages to reduce the amount of meta-data required.
Caching is a common approach for improving performance, yet most implementations use strictly classical techniques. In this article we will explore the modern methods used by Caffeine, an open-source Java caching library, that yield high hit rates and excellent concurrency.
Shuffle Sharding is a general-purpose technique, and you can also choose to Shuffle Shard across many kinds of resources, including pure in-memory data-structures such as queues, rate-limiters, locks and other contended resources.
In the popular computer game of Tetris, the player is given a sequence of tetromino pieces and must pack them into a rectangular gameboard initially occupied by a given configuration of filled squares; any completely filled row of the gameboard is cleared and all pieces above it drop by one row. We prove that in the offline version of Tetris, it is NP-complete to maximize the number of cleared rows, maximize the number of tetrises (quadruples of rows simultaneously filled and cleared), minimize the maximum height of an occupied square, or maximize the number of pieces placed before the game ends. We furthermore show the extreme inapproximability of the first and last of these objectives to within a factor of p^(1-epsilon), when given a sequence of p pieces, and the inapproximability of the third objective to within a factor of (2 - epsilon), for any epsilon>0. Our results hold under several variations on the rules of Tetris, including different models of rotation, limitations on player agility, and restricted piece sets.
Current cloud providers use fixed-price based mechanisms to allocate Virtual Machine (VM) instances to their users. The fixed-price based mechanisms do not provide an efficient allocation of resources and do not maximize the revenue of the cloud providers. A better alternative would be to use combinatorial auction-based resource allocation mechanisms. In this PhD dissertation we will design, study and implement combinatorial auction-based mechanisms for efficient provisioning and allocation of VM instances in cloud computing environments. We present our preliminary results consisting of three combinatorial auction-based mechanisms for VM provisioning and allocation.We also present an efficient bidding algorithm that can be used by the cloud users to decide on how to bid for their requested bundles of VM instances
Same as the other ones but Heidi Howard so ¯\_(ツ)_/¯
This is the missing manual. I reckon most engineers can wrap their heads around all the most important concepts and common quirks in less than an hour. That’s our goal here. An hour is a pretty small investment to learn something you literally can’t do any other way.
Distributed consensus, the ability to reach agreement in the face of failures and asynchrony, is a fundamental primitive for constructing reliable distributed systems from unreliable components. The Paxos algorithm is synonymous with distributed consensus, yet it performs poorly in practice and is famously difficult to understand. In this paper, we re-examine the foundations of distributed consensus. We derive an abstract solution to consensus, which utilises immutable state for intuitive reasoning about safety. We prove that our abstract solution generalises over Paxos as well as the Fast Paxos and Flexible Paxos algorithms. The surprising result of this analysis is a substantial weakening to the quorum requirements of these widely studied algorithms.
Workload predictions in cloud computing is obviously an important topic. Most of the existing publications employ various time series techniques, that might be difficult to implement. We suggest here another route, which has already been successfully used in financial engineering and photovoltaic energy. No mathematical modeling and machine learning procedures are needed. Our computer simulations via realistic data, which are quite convincing, show that a setting mixing algebraic estimation techniques and the daily seasonality behaves much better. An application to the computing resource allocation, via virtual machines, is sketched out.
Fine-grained, cgroup-based tool for profiling memory usage over time of a process tree
Fast, portable, non-Turing complete expression evaluation with gradual typing (Go) - google/cel-go
We prove that no fully transactional system can provide fast read transactions (including read-only ones that are considered the most frequent in practice). Specifically, to achieve fast read transactions, the system has to give up support of transactions that write more than one object. We prove this impossibility result for distributed storage systems that are causally consistent, i.e., they do not require to ensure any strong form of consistency. Therefore, our result holds also for any system that ensures a consistency level stronger than causal consistency, e.g., strict serializability. The impossibility result holds even for systems that store only two objects (and support at least two servers and at least four clients). It also holds for systems that are partially replicated. Our result justifies the design choices of state-of-the-art distributed transactional systems and insists that system designers should not put more effort to design fully-functional systems that support both fast read transactions and ensure causal or any stronger form of consistency.
Distributed systems are hard to reason about largely because of uncertainty about what may go wrong in a particular execution, and about whether the system will mitigate those faults. Tools that perturb executions can help test whether a system is robust to faults, while tools that observe executions can help better understand their system-wide effects. We present Box of Pain, a tracer and fault injector for unmodified distributed systems that addresses both concerns by interposing at the system call level and dynamically reconstructing the partial order of communication events based on causal relationships. Box of Pain's lightweight approach to tracing and focus on simulating the effects of partial failures on communication rather than the failures themselves sets it apart from other tracing and fault injection systems. We present evidence of the promise of Box of Pain and its approach to lightweight observation and perturbation of distributed systems.
The classification of the most used load balancing algorithms in distributed systems (including cloud technology, cluster systems, grid systems) is described. Comparative analysis of types of the load balancing algorithms is conducted in accordance with the classification, the advantages and drawbags of each type of the algorithms are shown. Performance indicators characterizing each algorithm are indicated.
Smoltcp is a standalone, event-driven TCP/IP stack that is designed for bare-metal, real-time systems. Its design goals are simplicity and robustness. Its design anti-goals include complicated compile-time computations, such as macro or type tricks, even at cost of performance degradation.
In this paper we study the properties of eventually consistent distributed systems that feature arbitrarily complex semantics and mix eventual and strong consistency. These systems execute requests in a highly-available, weakly-consistent fashion, but also enable stronger guarantees through additional inter-replica synchronization mechanisms that require the ability to solve distributed consensus. We use the seminal Bayou system as a case study, and then generalize our findings to a whole class of systems. We show dubious and unintuitive behaviour exhibited by those systems and provide a theoretical framework for reasoning about their correctness. We also state an impossibility result that formally proves the inherent limitation of such systems, namely temporary operation reordering, which admits interim disagreement between replicas on the relative order in which the client requests were executed.
This article is about some of the little tricks that I use in Vim. None of them are deep dives, and I encourage you to learn more about whatever’s interesting. They also aren’t connected to each other. But that’s fine. In total, they’re more than enough to help a lot.
The growing popularity of workflows in the cloud domain promoted the development of sophisticated autoscaling policies that allow automatic allocation and deallocation of resources. However, many state-of-the-art autoscaling policies for workflows are mostly plan-based or designed for batches (ensembles) of workflows. This reduces their flexibility when dealing with workloads of workflows, as the workloads are often subject to unpredictable resource demand fluctuations. Moreover, autoscaling in clouds almost always imposes budget constraints that should be satisfied. The budget-aware autoscalers for workflows usually require task runtime estimates to be provided beforehand, which is not always possible when dealing with workloads due to their dynamic nature. To address these issues, we propose a novel Performance-Feedback Autoscaler (PFA) that is budget-aware and does not require task runtime estimates for its operation. Instead, it uses the performance-feedback loop that monitors the average throughput on each resource type. We implement PFA in the popular Apache Airflow workflow management system, and compare the performance of our autoscaler with other two state-of-the-art autoscalers, and with the optimal solution obtained with the Mixed Integer Programming approach. Our results show that PFA outperforms other considered online autoscalers, as it effectively minimizes the average job slowdown by up to 47% while still satisfying the budget constraints. Moreover, PFA shows by up to 76% lower average runtime than the competitors.
TLA and Alloy are good for absolutes. You can say “there exists an error in this design”. But you can’t express statistical properties, like “this should take an average of less than 10 minutes” or “it is more likely this recovers than crashes.” Nor can you assign weights to actions, like “this has an 80% chance of continuing and a 20% chance of failing.” For that, we need probabilistic modeling (PM). PRISM is one such modeling language, and comes with some really powerful tools to inspect probabilities. But there’s a fundamental tradeoff in modeling: the more powerful the checker, the less expressive the language. And to get such a powerful checker, PRISM has to make some serious expressivity tradeoffs.
Build and run tiny vms from Dockerfiles. Small and sleek.
EventCount allows to wait for arbitrary predicates in non-blocking algorithms. Think of condition variable, but wait predicate does not need to be protected by a mutex.
In this paper, we study the problem of computing aggregates with gossip-style protocols. Our first contribution is an analysis of simple gossip-based protocols for the computations of sums, averages, random samples, quantiles, and other aggregate functions, and we show that our protocols converge exponentially fast to the true answer when using uniform gossip.Our second contribution is the definition of a precise notion of the speed with which a node’s data diffuses through the network. We show that this diffusion speed is at theheart of the approximation guarantees for all of the above problems. We analyze the diffusion speed of uniform gossip in the presence of node and link failures, as well as for flooding-based mechanisms. The latter expose interesting connections to random walks on graphs
With the advent of Software Defined Networks (SDN), Network Function Virtualisation (NFV) or Service Function Chaining (SFC), operators expect networks to support flexible services beyond the mere forwarding of packets. The network programmability framework which is being developed within the IETF by leveraging IPv6 Segment Routing enables the realisation of in-network functions. In this paper, we demonstrate that this vision of in-network programmability can be realised. By leveraging the eBPF support in the Linux kernel, we implement a flexible framework that allows network operators to encode their own network functions as eBPF code that is automatically executed while processing specific packets. Our lab measurements indicate that the overhead of calling such eBPF functions remains acceptable. Thanks to eBPF, operators can implement a variety of network functions. We describe the architecture of our implementation in the Linux kernel. This extension has been released with Linux 4.18. We illustrate the flexibility of our approach with three different use cases: delay measurements, hybrid networks and network discovery. Our lab measurements also indicate that the performance penalty of running eBPF network functions on Linux routers does not incur a significant overhead.
Whether you’re a Site Reliability Engineer (SRE), developer, or executive, as a service provider you have a vested interest in (or responsibility for) ensuring system reliability. However, “system reliability” in and of itself can be a vague and subjective term that depends on the specific needs of the enterprise. So, SLOs are necessary because they define your Quality of Service (QoS) and reliability goals in concrete, measurable, objective terms.
A comprehensive 10-page probability cheatsheet that covers a semester's worth of introduction to probability.
This handbook is a guide to achieving consistent and reliable low-latency operation of applications. It explores how modern operating systems schedule user applications and how to observe and modify application operating parameters; and introduces techniques for making sure that system hardware is working cooperatively with your application
To achieve good performance, modern applications often par-tition their state across multiple geographically distributednodes. While this approach reduces latency in the commoncase, it can be challenging for programmers to use correctly,especially in applications that require strong consistency. Weintroducepredictive treaties, a mechanism that can signifi-cantly reduce distributed coordination without losing strongconsistency. The central insight behind our approach is thatmany computations can be expressed in terms of predicatesover distributed state that can be partitioned and enforcedlocally. Predictive treaties improve on previous work by al-lowing the locally enforced predicates to depend on time.Intuitively, by predicting the evolution of system state, coor-dination can be significantly reduced compared to static ap-proaches. We implemented predictive treaties in a distributedsystem that exposes them in an intuitive programming model.We evaluate performance on several benchmarks, includingTPC-C, showing that predictive treaties can significantly in-crease performance by orders of magnitude and can evenoutperform customized algorithms.
Delta-t logically supports a permanent, reliable, flow controlled, full duplex, labeled bit stream connection between two ports. There is no extra packet exchange overhead to reliably manage connection state as in other stream oriented protocols. Therefore Delta-t can support an efficient, low delay, minimum packet exchange, reliable transaction oriented service as well as high stream throughput.
In my opinion we need to remove veto power from workers and architect systems in which the system does not have freedom to abort a transaction
The ALTS trust model has been tailored for cloud-like containerized applications. Identities are bound to entities instead of to a specific server name or host. This trust model facilitates seamless microservice replication, load balancing, and rescheduling across hosts
Event counts let waiters detect when they would have been woken up (the event count’s version counter has changed), and thus patch up this window where waiters can miss wake-ups for data changes they have yet to observe. Crucially, waiters detect lost wake-ups, rather than preventing them by locking writers out. Event counts thus preserve lock-freedom (and even wait-freedom!).
Internet applications increasingly employ TCP not as a stream abstraction, but as a substrate for application-level transports, a use that converts TCP's in-order semantics from a convenience blessing to a performance curse. As Internet evolution makes TCP's use as a substrate likely to grow, we offer Minion, an architecture for backward-compatible out-of-order delivery atop TCP and TLS. Small OS API extensions allow applications to manage TCP's send buffer and to receive TCP segments out-of-order. Atop these extensions, Minion builds application-level protocols offering true unordered datagram delivery, within streams preserving strict wire-compatibility with unsecured or TLS-secured TCP connections. Minion's protocols can run on unmodified TCP stacks, but benefit incrementally when either endpoint is upgraded, for a backward-compatible deployment path. Experiments suggest that Minion can noticeably improve performance of applications such as conferencing, virtual private networking, and web browsing, while incurring minimal CPU or bandwidth costs.
Beam is a distributed knowledge graph store, sometimes called an RDF store or a triple store
We therefore proposea structure for an OS called parakernel, which eliminates most OS abstractions and provides interfaces for applications to leverage the full potential of the underlying hardware. The parakernel facilitates application-level parallelism by securely partitioning the resources and multiplexing only those resources that are not partitioned.
This brief tutorial shows where some of the most used and quoted sysctl/network parameters are located into the Linux network flow
The algorithm claims to implement fault-tolerant distributed locks (or rather, leases) on top of Redis, and the page asks for feedback from people who are into distributed systems. The algorithm instinctively set off some alarm bells in the back of my mind, so I spent a bit of time thinking about it and writing up these notes.
Programs written in C/C++ can suffer from serious memory fragmentation, leading to low utilization of memory, degraded performance, and application failure due to memory exhaustion. This paper introduces Mesh, a plug-in replacement for malloc that, for the first time, eliminates fragmentation in unmodified C/C++ applications. Mesh combines novel randomized algorithms with widely-supported virtual memory operations to provably reduce fragmentation, breaking the classical Robson bounds with high probability. Mesh generally matches the runtime performance of state-of-the-art memory allocators while reducing memory consumption; in particular, it reduces the memory of consumption of Firefox by 16% and Redis by 39%.
We present ixy, a userspace network driver designed for simplicity and educationalpurposes to show that fast packet IO is not black magic butcareful engineering. Ixy focuses on the bare essentials of user space packet processing: a packet forwarder including the whole NIC driver uses less than 1000 lines of C code.
With this design, the optimistic case (where the thread isn't interrupted) is extremely fast because expensive locks and atomic instructions can be entirely avoided