I/O is often the performance bottleneck of a web server. The traditional Linux I/O path copies the same data several times between buffers, which costs performance, so a technique that avoids those copies is needed: zero-copy.
I was recently confronted with a nice example of how adding FOR UPDATE to a query can introduce transaction anomalies. This article will explain how that happens and how you can cope with the problem.
TLA+ is a “formal specification language”, a means of designing systems that lets you directly test those designs. Developed by the Turing award-winner Leslie Lamport, TLA+ has been endorsed by companies like AWS, Microsoft, and Crowdstrike. TLA+ doesn’t replace our engineering skill but augments it. With TLA+ we can design systems faster and more confidently.
This post attempts to explain how Lisp condition systems surpass ordinary exception handling. Condition systems don't unwind the stack by default and thus allow computations to be restarted, which is a useful tool.
Recently, while idly browsing through the source code of Python, I came upon an interesting comment in the bytecode VM implementation (Python/ceval.c) about using the computed gotos extension of GCC. Driven by curiosity, I decided to code a simple example to evaluate the difference between using a computed goto and a traditional switch statement for a simple VM. This post is a summary of my findings.
K-Sortable Globally Unique IDs
Systemd (since version 239) supports a concept of “Portable Services”. “Portable Services” are a delivery method for system services that uses two specific features of container management: (1) Applications are bundled. I.e. multiple services, their binaries and all their dependencies are packaged in an image, and are run directly from it. (2) Stricter default security policies, i.e. sand-boxing of applications. The primary tool for interacting with Portable Services is portablectl, and they are managed by the systemd-portabled service.
Dragonfly is a modern in-memory datastore, fully compatible with Redis and Memcached APIs. Dragonfly implements novel algorithms and data structures on top of a multi-threaded, shared-nothing architecture. As a result, Dragonfly reaches 25x the performance of Redis and supports millions of QPS on a single instance.
Karmem is a fast binary serialization format. The priority of Karmem is to be easy to use while being as fast as possible. It's optimized to get the maximum performance out of Golang and TinyGo and is efficient for repeatable reads, reading different content of the same type. Karmem has proven to be ten times faster than Google Flatbuffers, with the additional overhead of bounds-checking included.
The standardization of NVMe Zoned Namespaces (ZNS) in the NVMe 2.0 specification presents a unique new addition to storage devices. Unlike traditional SSDs, where the flash media management idiosyncrasies are hidden behind a flash translation layer (FTL) inside the device, ZNS devices push certain operations regarding data placement and garbage collection out from the device to the host. This allows the host to achieve more optimal data placement and predictable garbage collection overheads, along with lower device write amplification, which in turn increases flash media lifetime. As a result, ZNS devices are gaining significant attention in the research community. However, with the current software stack there are numerous ways of integrating ZNS devices into a host system. In this work, we begin to systematically analyze the integration options, report on the current software support for ZNS devices in the Linux Kernel, and provide an initial set of performance measurements. Our main findings show that larger I/O sizes are required to saturate the ZNS device bandwidth, and configuration of the I/O scheduler can provide workload dependent performance gains, requiring careful consideration of ZNS integration and configuration depending on the application workload and its access patterns. Our dataset and code are available at https://github.com/nicktehrany/ZNS-Study.
Ipftrace2 is a tool that lets you trace the journey of packets inside the Linux kernel. It is similar to ftrace in some sense, but it traces which flows have gone through which kernel functions. For network people, that is usually more important than the "which functions were called" information that ftrace provides.
Today, companies and data centers are moving towards distributed and serverless storage systems instead of traditional file systems. As a result of such transition, allocating sufficient resources to users and parties to satisfy their service level demands has become crucial in distributed storage systems. The Quality of Service (QoS) is a research area that tries to tackle such challenges. The schedulability of system components and requests is of great importance to achieve the QoS goals in a distributed storage. Many QoS solutions are designed and implemented through request scheduling at different levels of system architecture. However, the bufferbloat phenomenon in storage backends can compromise the request schedulability of the system. In a storage server, bufferbloat happens when the server submits all requests immediately to the storage backend due to a too large buffer in the storage backend. In recent decades, many research works tried to solve the bufferbloat problem for network systems. Nevertheless, none of these works are suitable for storage system environments and workloads. This paper presents the SF_CoDel algorithm, an adaptive extension of the Controlled Delay (CoDel) algorithm, to mitigate the bufferbloat for different workloads in storage systems. SF_CoDel manages this purpose by controlling the amount of work submitted to the storage backend. The evaluation of our algorithm indicates that SF_CoDel can mitigate the bufferbloat in storage servers.
128-bit compatibility with UUID; 1.21e+24 unique ULIDs per millisecond; lexicographically sortable; canonically encoded as a 26-character string, as opposed to the 36-character UUID; uses Crockford's base32 for better efficiency and readability (5 bits per character); case insensitive; no special characters (URL safe); monotonic sort order (correctly detects and handles the same millisecond).
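The ULID properties above follow from the layout: a 48-bit millisecond timestamp followed by 80 bits of randomness (2^80 is the 1.21e+24 figure), written as 26 Crockford base32 characters. Below is a minimal sketch of that encoding. The function names `encode_ulid` and `new_ulid` are made up for illustration, and the monotonic same-millisecond handling is deliberately left out.

```python
import os
import time

# Crockford's base32 alphabet: digits plus letters, without I, L, O, U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def encode_ulid(timestamp_ms: int, randomness: bytes) -> str:
    """Encode a 48-bit timestamp and 80 random bits as 26 characters."""
    if not 0 <= timestamp_ms < (1 << 48):
        raise ValueError("timestamp must fit in 48 bits")
    if len(randomness) != 10:
        raise ValueError("randomness must be exactly 10 bytes (80 bits)")
    # Timestamp goes in the high bits, so string order follows time order.
    value = (timestamp_ms << 80) | int.from_bytes(randomness, "big")
    chars = []
    for _ in range(26):  # 26 characters * 5 bits covers the 128-bit value
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))

def new_ulid() -> str:
    """Generate an ID from the current time and OS randomness."""
    return encode_ulid(time.time_ns() // 1_000_000, os.urandom(10))

print(new_ulid())
```

Because the timestamp occupies the leading 10 characters, plain string comparison sorts IDs by creation time down to the millisecond.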
Our engineering team focuses on getting the maximum amount of information from the network while sending as little traffic as possible. This lean approach to network discovery is driven by our goal of being fast and safe for all networks. The more we can learn about a system from a single measurement, the less traffic we create and the quicker things run. In this post, I want to share how Rumble uses one of the most common network protocols to obtain a wealth of information about network-attached devices.
ReadySet is a lightweight SQL caching engine that precomputes frequently-accessed query results and automatically keeps these results up-to-date over time as the underlying data in your database changes. ReadySet is wire-compatible with MySQL and Postgres and can be adopted without code changes
TCPLS solves the same issue as QUIC, is in userspace, provides new and modern transport features that directly work on the Internet, and is ensured to work through time thanks to the randomized protocol information.
A slim framework for building browser and gRPC-compatible HTTP APIs. Connect is production-ready — focused, simple, and debuggable — and it's fully compatible with gRPC clients and servers. If you're frustrated by the complexity and instability of today's gRPC libraries, we think you'll find Connect a breath of fresh air.
Libpas is a fast and memory-efficient memory allocation toolkit capable of supporting many heaps at once, engineered with the hopes that someday it'll be used for comprehensive isoheaping of all malloc/new callsites in C/C++ programs.
Lunatic is an Erlang-inspired runtime for WebAssembly. By combining the fault-tolerance and massive concurrency of Erlang with the capability-based security of WebAssembly, it creates a powerful programming model.
In this demonstration a client connects to a server, negotiates a QUIC connection with TLS encryption, sends "ping", receives "pong", then terminates the connection.
Of course, this trick violates every coding standard in the book. Try doing this in your company's code and you will probably be subject to a stern telling off if not disciplinary action! You have embedded unmatched braces in macros, used case within sub-blocks, and as for the crReturn macro with its terrifyingly disruptive contents . . . It's a wonder you haven't been fired on the spot for such irresponsible coding practice. You should be ashamed of yourself.
This is a tool that allows you to specify the glibc version that you want to link against, regardless of what version is installed on your machine. This allows you to make portable Linux binaries, without having to build your binaries on an ancient distro (which is the current standard practice).
A shell script which checks your $HOME for unwanted files and directories.
How to gracefully transition from hard to soft deletion in a real application?
Modern customization point objects ([customization.point.object]) were a step forward over raw ADL for making libraries customizable. However, there are a couple of problems they leave unsolved. [...] This paper presents a solution: a single ADL customization point named tag_invoke that takes as its first argument a CPO that is used as a tag to select an overload.
The unshare command creates new namespaces and then executes the specified program.
Algebraic effect handlers are a powerful way to incorporate effects in a programming language. Sometimes perhaps even too powerful. In this article we define a restriction of general effect handlers with scoped resumptions. We argue one can still express all important effects, while improving reasoning about effect handlers. Using the newly gained guarantees, we define a sound and coherent evidence translation for effect handlers, which directly passes the handlers as evidence to each operation. We prove full soundness and coherence of the translation into plain lambda calculus. The evidence in turn enables efficient implementations of effect operations; in particular, we show we can execute tail-resumptive operations in place (without needing to capture the evaluation context), and how we can replace the runtime search for a handler by indexing with a constant offset.
Let’s start with a very basic implementation
A tool for automatically converting mitmproxy captures to OpenAPI 3.0 specifications. This means that you can automatically reverse-engineer REST APIs by just running the apps and capturing the traffic.
Traditional NoSQL systems scale by sharding data across multiple servers and by performing each operation on a small number of servers. Because transactions on multiple keys necessarily require coordination across multiple servers, NoSQL systems often explicitly avoid making transactional guarantees in order to avoid such coordination. Past work on transactional systems controls this coordination by either increasing the granularity at which transactions are ordered, sacrificing serializability, or by making clock synchronicity assumptions. This paper presents a novel protocol for providing serializable transactions on top of a sharded data store. Called acyclic transactions, this protocol allows multiple transactions to prepare and commit simultaneously, improving concurrency in the system, while ensuring that no cycles form between concurrently-committing transactions. We have fully implemented acyclic transactions in a document store called Warp. Experiments show that Warp achieves 4 times higher throughput than Sinfonia's mini-transactions on the standard TPC-C benchmark with no aborts. Further, the system achieves 75% of the throughput of the non-transactional key-value store it builds upon.
Reading and writing binary formats is hard, especially if it’s an interchange format that should work across a multitude of platforms and languages. Kaitai Struct tries to make this job easier — you only have to describe the binary format once and then everybody can use it from their programming languages — cross-language, cross-platform.
rogpeppe/go-internal: Selected Go-internal packages factored out from the standard library
The OpenZiti project is an open source initiative focused on bringing Zero Trust to any application. The project provides all the pieces required to implement or integrate Zero Trust into your solutions: the overlay network, tunneling applications for all operating systems, and numerous SDKs that make it easy to add Zero Trust concepts directly into your application. Ziti makes it easy to embed Zero Trust, programmable networking directly into your app. With Ziti you can have Zero Trust, high performance networking on any Internet connection, without VPNs!
ArcticDB is an embeddable columnar database written in Go. It features semi-structured schemas (could also be described as typed wide-columns), and uses Apache Parquet for storage and Apache Arrow at query time. Building on top of Apache Arrow, ArcticDB provides a query builder and various optimizers (reminiscent of DataFrame-like APIs).
Snare is a minimalistic GitHub webhooks runner daemon. When snare receives a webhook event from a given repository, it authenticates the request, and then executes a user-defined “per-repo program” with information about the webhook event.
An open source, self-hosted implementation of the Tailscale control server
Netmaker makes networks with WireGuard. Netmaker automates fast, secure, and distributed virtual networks.
The XDG Base Directory specification ($XDG_CONFIG_HOME etc.) is a great thing: configs are separated from user data and cache, with no clutter in the home directory. Unfortunately, many programs still don't respect it, including Vim. But what would our favourite text editor be if we couldn't reconfigure it!
An isolated software layer that translates data between schemas on demand. This layer allows developers to maintain strong compatibility with many schema versions without complicating the main codebase. Translation logic is defined by composing bidirectional lenses, a kind of data transformation that can run both forward and backward.
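Composing bidirectional lenses can be illustrated with a small sketch. The `Lens` class, the `rename` helper, and the field names below are hypothetical and only model the idea of a transformation that runs both forward and backward; they are not the project's actual API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Lens:
    """A pair of functions translating a document between two schemas."""
    forward: Callable[[dict], dict]
    backward: Callable[[dict], dict]

    def then(self, other: "Lens") -> "Lens":
        """Compose: forward runs self then other, backward runs in reverse."""
        return Lens(
            forward=lambda doc: other.forward(self.forward(doc)),
            backward=lambda doc: self.backward(other.backward(doc)),
        )

def rename(old: str, new: str) -> Lens:
    """Lens that renames one field; running it backward undoes the rename."""
    def move(src: str, dst: str) -> Callable[[dict], dict]:
        def apply(doc: dict) -> dict:
            out = dict(doc)  # never mutate the caller's document
            if src in out:
                out[dst] = out.pop(src)
            return out
        return apply
    return Lens(forward=move(old, new), backward=move(new, old))

# Hypothetical schema change from v1 to v2, expressed as two composed lenses.
v1_to_v2 = rename("title", "name").then(rename("done", "complete"))

old_doc = {"title": "write docs", "done": False}
new_doc = v1_to_v2.forward(old_doc)
print(new_doc)
print(v1_to_v2.backward(new_doc))
```

Since each lens carries its own inverse, a chain of lenses between any two schema versions can be walked in either direction without version checks in the main codebase.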
This repository contains the trustfall query engine, which can be used to query any data source or combination of data sources: databases, APIs, raw files (JSON, CSV, etc.), git version control, etc.
Verneuil is a VFS (OS abstraction layer) for SQLite that accesses local database files like the default unix VFS while asynchronously replicating snapshots to S3-compatible blob stores. We wrote it to improve the scalability and availability of pre-existing services for which SQLite is a good fit, at least for single-node deployments.
XTDB is a general-purpose bitemporal database for SQL, Datalog & graph queries
Lots of servers need to get the "real" client IP from X-Forwarded-For, Forwarded, and other HTTP headers. It seems like it should be easy to do so and lots of developers assume it is, but... it's not, and it gets done incorrectly far too often. This can and will lead to bugs and vulnerabilities.
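One commonly recommended way to avoid trusting spoofed values is to walk the hop list from the right and stop at the first address that is not one of your own proxies. The sketch below assumes that approach; the function name `client_ip` and the trusted range are illustrative, and the right policy depends on how your proxies actually append to the header.

```python
import ipaddress

def client_ip(xff_headers, remote_addr, trusted_networks):
    """Return the rightmost address that is not a trusted proxy.

    xff_headers: X-Forwarded-For header values in the order received.
    remote_addr: the peer address of the TCP connection.
    trusted_networks: CIDR strings for proxies you operate (assumption).
    """
    trusted = [ipaddress.ip_network(net) for net in trusted_networks]

    def is_trusted(addr):
        return any(addr in net for net in trusted)

    # Leftmost entries are client-supplied and can be forged; entries
    # appended by our own proxies are on the right.
    hops = []
    for header in xff_headers:
        hops.extend(part.strip() for part in header.split(","))
    hops.append(remote_addr)

    for raw in reversed(hops):
        try:
            addr = ipaddress.ip_address(raw)
        except ValueError:
            return None  # garbage in the trusted part of the chain
        if not is_trusted(addr):
            return str(addr)
    return None  # every hop was one of our own proxies

print(client_ip(["6.6.6.6, 203.0.113.7", "10.0.0.2"], "10.0.0.1", ["10.0.0.0/8"]))
```

Note that the forged leftmost entry (`6.6.6.6` here) is never consulted, which is the point of reading from the right.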
The Zoned Namespace (ZNS) interface represents a new division of functionality between host software and flash-based SSDs. Current flash-based SSDs maintain the decades-old block interface, which comes at substantial expense in terms of capacity over-provisioning, DRAM for page mapping tables, garbage collection overheads, and host software complexity attempting to mitigate garbage collection. ZNS offers shelter from this ever-rising block interface tax. This paper describes the ZNS interface and explains how it affects both SSD hardware/firmware and host software. By exposing flash erase block boundaries and write-ordering rules, the ZNS interface requires the host software to address these issues while continuing to manage media reliability within the SSD. We describe how storage software can be specialized to the semantics of the ZNS interface, often resulting in significant efficiency benefits. We show the work required to enable support for ZNS SSDs, and show how modified versions of f2fs and RocksDB take advantage of a ZNS SSD to achieve higher throughput and lower tail latency as compared to running on a block-interface SSD with identical physical hardware. For example, we find that the 99.9th-percentile random-read latency for our zone-specialized RocksDB is at least 2–4× lower on a ZNS SSD compared to a block-interface SSD, and the write throughput is 2× higher.
Zoned Storage is a class of storage devices that enables host and storage devices to cooperate to achieve higher storage capacities, increased throughput, and lower latencies. The zoned storage interface is available through the SCSI Zoned Block Commands (ZBC) and Zoned Device ATA Command Set (ZAC) standards for Shingled Magnetic Recording (SMR) hard disks and with the NVMe Zoned Namespaces (ZNS) standard for NVMe Solid State Disks.
The paper shows that anti-entropy protocols can process only a limited rate of updates, and proposes and evaluates a new state reconciliation mechanism as well as a flow control scheme for anti-entropy protocols [scuttlebutt]
Mobile Edge Computing (MEC) has been gaining significant interest from first responders and tactical teams, primarily because they can employ handheld mobile devices to form a computing cluster (for computing tasks like face/scene recognition, virtual assistance) when connectivity to the cloud is not present or it is limited. High user mobility in first responder or tactical environments makes MEC challenging, as wireless links observe substantial fluctuations. Typical cloud-based coordination (e.g., ZooKeeper-based service discovery and coordination, device naming, security) needed by edge computing tasks cannot work in these environments. Driven by the need for a resilient and lightweight coordination service, in this paper, we design and implement EdgeKeeper to provide cloud-like coordination for MEC systems. It provides naming, network management, application coordination, and security to distributed edge computing applications. It maintains an edge cluster among devices and intelligently stores its data on a group of replicas to guard against node failure and disconnections. We provide a full-system implementation of EdgeKeeper for Android and Linux platforms. We have integrated EdgeKeeper with existing MEC applications and performed real-world performance evaluations in a wide-area search and rescue operation conducted by first responders, which proves it to be lightweight and suitable for mobile devices.
ZeroBin is a minimalist, open source online pastebin/discussion board where the server has zero knowledge of hosted data. Data is encrypted/decrypted in the browser using 256-bit AES.
Magic-trace collects and displays high-resolution traces of what a process is doing
Make vim follow your system-wide dark mode preference
Memray is a memory profiler for Python. It can track memory allocations in Python code, in native extension modules, and in the Python interpreter itself. It can generate several different types of reports to help you analyze the captured memory usage data. While commonly used as a CLI tool, it can also be used as a library to perform more fine-grained profiling tasks.
Dynamic techniques are a scalable and effective way to analyze concurrent programs. Instead of analyzing all behaviors of a program, these techniques detect errors by focusing on a single program execution. Often a crucial step in these techniques is to define a causal ordering between events in the execution, which is then computed using vector clocks, a simple data structure that stores logical times of threads. The two basic operations of vector clocks, namely join and copy, require Θ(k) time, where k is the number of threads. Thus they are a computational bottleneck when k is large. In this work, we introduce tree clocks, a new data structure that replaces vector clocks for computing causal orderings in program executions. Joining and copying tree clocks takes time that is roughly proportional to the number of entries being modified, and hence the two operations do not suffer the a-priori Θ(k) cost per application. We show that when used to compute the classic happens-before (HB) partial order, tree clocks are optimal, in the sense that no other data structure can lead to smaller asymptotic running time. Moreover, we demonstrate that tree clocks can be used to compute other partial orders, such as schedulable-happens-before (SHB) and the standard Mazurkiewicz (MAZ) partial order, and thus are a versatile data structure. Our experiments show that just by replacing vector clocks with tree clocks, the computation becomes 2.02× (MAZ), 2.66× (SHB), and 2.97× (HB) faster on average per benchmark. These results illustrate that tree clocks have the potential to become a standard data structure with wide applications in concurrent analyses.
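For context, here is a minimal sketch of the baseline the abstract refers to: a plain vector clock whose join and copy each touch all k entries. This is not the tree clock from the paper; the class and method names are illustrative.

```python
class VectorClock:
    """Plain vector clock: one logical time per thread."""

    def __init__(self, num_threads):
        self.times = [0] * num_threads

    def tick(self, thread):
        """Advance the local time of one thread."""
        self.times[thread] += 1

    def join(self, other):
        """Pointwise maximum; visits all k entries even if few change."""
        self.times = [max(a, b) for a, b in zip(self.times, other.times)]

    def copy_from(self, other):
        """Overwrite with another clock; also visits all k entries."""
        self.times = list(other.times)

    def happens_before(self, other):
        """True if self is pointwise <= other and the clocks differ."""
        return (all(a <= b for a, b in zip(self.times, other.times))
                and self.times != other.times)

a, b = VectorClock(3), VectorClock(3)
a.tick(0)          # thread 0 does some work
b.tick(1)
b.tick(1)          # thread 1 does some work, e.g. releases a lock
a.join(b)          # thread 0 acquires the lock and learns thread 1's time
print(a.times, b.happens_before(a))
```

The `join` loop is where the Θ(k) cost comes from: it runs over every thread's entry regardless of how many entries actually change.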
How do we authenticate the user on an insecure connection without revealing the password?
Snid is a lightweight proxy server that forwards TLS connections based on the server name indication (SNI) hostname
The conventional solution for this problem is HTTP reverse proxying, but I want to do better. I want to be able to act like IPv6 really is ubiquitous, but continue to support IPv4-only clients with a minimum amount of complexity and mental overhead. To accomplish this, I've turned to SNI-based proxying.
This page describes an overlay network based on stateless IPv6 tunnels, which have better reliability and scalability characteristics than stateful IPv4 overlays. It uses IETF protocols that are natively supported by the Linux kernel, and since it is independent of Kubernetes itself, it can support communication between processes both inside and outside of containers.
Warpgate is a smart SSH bastion host for Linux that can be used with any SSH client.
Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.
User-defined effects and effect handlers are advertised and advocated as a relatively easy-to-understand and modular approach to delimited control. They offer the ability of suspending and resuming a computation and allow information to be transmitted both ways between the computation, which requests a certain service, and the handler, which provides this service. Yet, a key question remains, to this day, largely unanswered: how does one modularly specify and verify programs in the presence of both user-defined effect handlers and primitive effects, such as heap-allocated mutable state? We answer this question by presenting a Separation Logic with built-in support for effect handlers, both shallow and deep. The specification of a program fragment includes a protocol that describes the effects that the program may perform as well as the replies that it can expect to receive. The logic allows local reasoning via a frame rule and a bind rule. It is based on Iris and inherits all of its advanced features, including support for higher-order functions, user-defined ghost state, and invariants. We illustrate its power via several case studies, including (1) a generic formulation of control inversion, which turns a producer that “pushes” elements towards a consumer into a producer from which one can “pull” elements on demand, and (2) a simple system for cooperative concurrency, where several threads execute concurrently, can spawn new threads, and communicate via promises.
We explore asynchronous programming with algebraic effects. We complement their conventional synchronous treatment by showing how to naturally also accommodate asynchrony within them, namely, by decoupling the execution of operation calls into signalling that an operation’s implementation needs to be executed, and interrupting a running computation with the operation’s result, to which the computation can react by installing interrupt handlers. We formalise these ideas in a small core calculus, called λæ. We demonstrate the flexibility of λæ using examples ranging from a multi-party web application, to preemptive multi-threading, to remote function calls, to a parallel variant of runners of algebraic effects. In addition, the paper is accompanied by a formalisation of λæ’s type safety proofs in Agda, and a prototype implementation of λæ in OCaml.
Algebraic effects and handlers are an emerging new feature to model effectful computations and attract attention not only from researchers but also from programmers. They are implemented in various ways as part of compilers, interpreters, or as libraries. We present a direct embedding of one-shot algebraic effects and handlers in a language which has asymmetric coroutines. The key observation is that, by restricting the use of continuations to be one-shot, we obtain a simple and sufficiently general implementation via coroutines, which are available in many modern programming languages. We have implemented our embedding as a library in Lua and Ruby, which allows one to write effectful programs in a modular way using algebraic effects and handlers.
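The key observation can be tried out with Python generators, which behave as asymmetric coroutines. This is a simplified sketch and not the paper's Lua or Ruby library: the `handle` function and the effect names are hypothetical, effects are plain `(name, payload)` tuples, and every handler resumes the computation exactly once with its return value.

```python
def handle(computation, handlers):
    """Run a generator, treating each yielded (name, payload) as an effect.

    The matching handler's return value is sent back into the generator,
    resuming it exactly once (a one-shot continuation).
    """
    try:
        request = next(computation)
        while True:
            name, payload = request
            request = computation.send(handlers[name](payload))
    except StopIteration as stop:
        return stop.value  # the computation's final result

def greet():
    """Effectful program: it requests services without implementing them."""
    name = yield ("ask", "name")
    yield ("log", f"hello {name}")
    return len(name)

log = []
result = handle(greet(), {
    "ask": lambda key: "world",   # how "ask" is answered is decided here
    "log": log.append,
})
print(result, log)
```

Swapping the handler dictionary changes how the effects are interpreted (for example, reading from a config file instead of a constant) without touching `greet`.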
This article advocates the revival of coroutines as a convenient general control abstraction. After proposing a new classification of coroutines, we introduce the concept of full asymmetric coroutines and provide a precise definition for it through an operational semantics. We then demonstrate that full coroutines have an expressive power equivalent to one-shot continuations and one-shot delimited continuations. We also show that full asymmetric coroutines and one-shot delimited continuations have many similarities, and therefore present comparable benefits. Nevertheless, coroutines are easier to implement and understand, especially in the realm of procedural languages.
We developed an easy-to-use checkpoint/restore tool that uses the CRIU engine
This article explains a straightforward approach for generating Perfect Hash Functions
We are looking to implement a library through which side-effect “requests” can be separated from their implementation.
Bubblewrap works by creating a new, completely empty mount namespace where the root is on a tmpfs that is invisible from the host and will be automatically cleaned up when the last process exits. You can then use command-line options to construct the root filesystem and process environment, and to specify the command to run in the namespace.
Stop building slow, complex, fragile software systems. Safely run your application on a single server. Fully-replicated database with no pain and little cost.
Flagsmith is an open source, fully featured, Feature Flag and Remote Config service. Use our hosted API, deploy to your own private cloud, or run on-premise.
Advanced multi-threaded PostgreSQL connection pooler and request router