Matt McShane

Introducing TCP-in-UDP solution
- tcp
- protocol
- ebpf
2025-07-18T13:14:19+00:00
The MPTCP protocol is complex, mainly to be able to survive on the Internet where middleboxes such as NATs, firewalls, IDS or proxies can modify parts of the TCP packets. Worst case scenario, an MPTCP connection should fall back to “plain” TCP. Today, such fallbacks are rarer than before – probably because MPTCP has been used since 2013 on millions of Apple smartphones worldwide – but they can still exist, e.g. on some mobile networks using Performance Enhancing Proxies (PEPs) where MPTCP connections are not bypassed. In such cases, a solution to continue benefiting from MPTCP is to tunnel the MPTCP connections. Different solutions exist, but they usually add extra layers, and require setting up a virtual private network (VPN) with private IP addresses between the client and the server. Here, a simpler solution is presented: TCP-in-UDP. This solution relies on eBPF, doesn’t add extra data per packet, and doesn’t require a virtual private network. Read on to find out more about that!
Why Algebraic Effects?
- algebraic-effects
2025-05-25T21:33:01+00:00
Algebraic effects (a.k.a. effect handlers) are a very useful up-and-coming feature that I personally think will see a huge surge in popularity in the programming languages of tomorrow. They’re one of the core features of Ante, as well as being the focus of many research languages including Koka, Effekt, Eff, and Flix. However, while many articles or documentation snippets try to explain what effect handlers are (including Ante’s own documentation), few really go in-depth on why you would want to use them. In this post I’ll explain exactly that and will include as complete a list as possible of all the use-cases of algebraic effects.
SYNIT
- linux
- os
- actors
- capabilities
2025-04-27T16:31:27+00:00
Synit is an experiment in applying pervasive reactivity and object capabilities to the System Layer of an operating system for personal computers, including laptops, desktops, and mobile phones. Its architecture follows the principles of the Syndicated Actor Model. Synit builds upon the Linux kernel, but replaces many pieces of familiar Linux software, including systemd, NetworkManager, D-Bus, and so on. It makes use of many concepts that will be familiar to Linux users, but also incorporates many ideas drawn from programming languages and operating systems not closely connected with Linux’s Unix heritage.
TinyKVM: The Fastest Sandbox
- sandbox
- virtualization
2025-03-14T19:58:15+00:00
TinyKVM perhaps surprisingly places itself among the smallest serious sandboxing solutions out there, and may also be the fastest. It takes security seriously and tries to avoid complex guest features and kernel mode in general. TinyKVM has a minimal attack surface and no ambition to grow further in complexity, outside of nicer user-facing APIs and ports to other architectures.
Sequin
- cdc
- event
- postgresql
2024-11-01T03:12:16+00:00
Sequin is a tool for capturing changes and streaming data out of your Postgres database, guaranteeing exactly once processing.
MemHive: sharing immutable data between Python subinterpreters
- python
- concurrency
2024-10-18T00:55:25+00:00
He suggested that subinterpreters, which fairly recently each got their own GIL, might be the right approach
pgautoupgrade
- postgresql
2024-09-11T18:24:15+00:00
This is a PostgreSQL Docker container (based on postgres:16-alpine) that automatically upgrades your database. When it starts, it checks if your database files are for an older version (from PostgreSQL 9.5 onwards), and upgrades them (if needed), then starts the database server. If the database files don't need upgrading when it starts, then it skips the upgrade process and just starts PostgreSQL. The upgrade process uses the pg_upgrade utility behind the scenes, with the --link option enabled. This does an in-place upgrade for the quickest possible upgrade times.
GCRA: leaky buckets without the buckets
- algorithm
- load-balancing
2024-09-04T13:19:36+00:00
GCRA is the “generic cell rate algorithm”, a rate-limiting algorithm that came from ATM. GCRA does the same job as the better-known leaky bucket algorithm, but using half the storage and with much less code.
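As a rough illustration of the "without the buckets" idea, here is a minimal sketch of a GCRA-style limiter that keeps a single timestamp, the theoretical arrival time (TAT), per client. The class name, the `period` and `burst` parameters, and the caller-supplied clock are assumptions made for this example; they are not taken from the linked article.

```python
class GcraLimiter:
    """Sketch of a GCRA-style rate limiter (hypothetical names throughout).

    The only per-client state is one float: the theoretical arrival
    time (TAT) of the next request that would be perfectly on schedule.
    """

    def __init__(self, period: float, burst: int) -> None:
        if period <= 0 or burst < 1:
            raise ValueError("period must be positive and burst at least 1")
        self.period = period                    # seconds between requests at the sustained rate
        self.tolerance = period * (burst - 1)   # how far ahead of schedule a client may run
        self.tat = 0.0                          # theoretical arrival time

    def allow(self, now: float) -> bool:
        """Return True and advance the TAT if a request at time `now` conforms."""
        tat = max(self.tat, now)                # an idle client does not bank extra credit
        if tat - now > self.tolerance:
            return False                        # too far ahead of schedule: reject, state unchanged
        self.tat = tat + self.period
        return True
```

Passing `now` explicitly keeps the sketch deterministic; a real caller would likely pass `time.monotonic()`. With `period=1.0` and `burst=3`, three requests at the same instant are accepted and the fourth is rejected until time has advanced.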
Nahida: In-Band Distributed Tracing with eBPF
- observability
- ebpf
2024-08-19T01:49:26+00:00
Microservices are commonly used in modern cloud-native applications to achieve agility. However, the complexity of service dependencies in large-scale microservices systems can lead to anomaly propagation, making fault troubleshooting a challenge. To address this issue, distributed tracing systems have been proposed to trace complete request execution paths, enabling developers to troubleshoot anomalous services. However, existing distributed tracing systems have limitations such as invasive instrumentation, trace loss, or inaccurate trace correlation. To overcome these limitations, we propose a new tracing system based on eBPF (extended Berkeley Packet Filter), named Nahida, that can track complete requests in the kernel without intrusion, regardless of programming language or implementation. Our evaluation results show that Nahida can track over 92% of requests with stable accuracy, even under the high concurrency of user requests, while the state-of-the-art non-invasive approaches can not track any of the requests. Importantly, Nahida can track requests served by a multi-threaded application that none of the existing invasive tracing systems can handle by instrumenting tracing codes into libraries. Moreover, the overhead introduced by Nahida is negligible, increasing service latency by only 1.55%-2.1%. Overall, Nahida provides an effective and non-invasive solution for distributed tracing.
Towards providing reliable job completion time predictions using PCS
- scheduling
2024-08-19T01:46:40+00:00
In this paper we build a case for providing job completion time predictions to cloud users, similar to the delivery date of a package or arrival time of a booked ride. Our analysis reveals that providing predictability can come at the expense of performance and fairness. Existing cloud scheduling systems optimize for extreme points in the trade-off space, making them either extremely unpredictable or impractical. To address this challenge, we present PCS, a new scheduling framework that aims to provide predictability while balancing other traditional objectives. The key idea behind PCS is to use Weighted-Fair-Queueing (WFQ) and find a suitable configuration of different WFQ parameters (e.g., class weights) that meets specific goals for predictability. It uses a simulation-aided search strategy, to efficiently discover WFQ configurations that lie on the Pareto front of the trade-off space between these objectives. We implement and evaluate PCS in the context of DNN job scheduling on GPUs. Our evaluation, on a small scale GPU testbed and larger-scale simulations, shows that PCS can provide accurate completion time estimates while marginally compromising on performance and fairness.
Introducing Sunlight, a CT implementation built for scalability, ease of operation, and reduced cost - Let's Encrypt
- identity
- integrity
- pki
2024-08-19T01:38:23+00:00
Let’s Encrypt is proud to introduce Sunlight, a new implementation of a Certificate Transparency log that we built from the ground up with modern Web PKI opportunities and constraints in mind
BinomialHash: A Constant Time, Minimal Memory Consistent Hash Algorithm
- consistent-hashing
- algorithm
2024-08-19T01:25:04+00:00
Consistent hashing is employed in distributed systems and networking applications to evenly and effectively distribute data across a cluster of nodes. This paper introduces BinomialHash, a consistent hashing algorithm that operates in constant time and requires minimal memory. We provide a detailed explanation of the algorithm, offer a pseudo-code implementation, and formally establish its strong theoretical guarantees.
JumpBackHash: Say Goodbye to the Modulo Operation to Distribute Keys Uniformly to Buckets
- algorithm
- consistent-hashing
2024-08-19T01:24:33+00:00
The distribution of keys to a given number of buckets is a fundamental task in distributed data processing and storage. A simple, fast, and therefore popular approach is to map the hash values of keys to buckets based on the remainder after dividing by the number of buckets. Unfortunately, these mappings are not stable when the number of buckets changes, which can lead to severe spikes in system resource utilization, such as network or database requests. Consistent hash algorithms can minimize remappings, but are either significantly slower than the modulo-based approach, require floating-point arithmetic, or are based on a family of hash functions rarely available in standard libraries. This paper introduces JumpBackHash, which uses only integer arithmetic and a standard pseudorandom generator. Due to its speed and simple implementation, it can safely replace the modulo-based approach to improve assignment and system stability. A production-ready Java implementation of JumpBackHash has been released as part of the Hash4j open source library.
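The instability of the modulo-based mapping described above can be seen in a small experiment. The sketch below compares it with Jump consistent hash (Lamping and Veach), one of the existing consistent hash algorithms that relies on floating-point arithmetic; it is not JumpBackHash itself. The helper names and the choice of SHA-256 for key hashing are assumptions made for this example.

```python
import hashlib
from typing import Callable, List

MASK64 = (1 << 64) - 1


def stable_hash(text: str) -> int:
    """64-bit hash that is stable across runs (unlike Python's built-in hash)."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def modulo_bucket(key: int, num_buckets: int) -> int:
    """The simple approach: remainder after dividing by the bucket count."""
    return key % num_buckets


def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Jump consistent hash (Lamping and Veach); note the floating-point division."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & MASK64
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b


def fraction_moved(assign: Callable[[int, int], int], keys: List[int],
                   old_n: int, new_n: int) -> float:
    """Share of keys whose bucket changes when the bucket count changes."""
    moved = sum(assign(k, old_n) != assign(k, new_n) for k in keys)
    return moved / len(keys)
```

Growing from 10 to 11 buckets, the modulo mapping is expected to move roughly 10 out of every 11 keys, while a consistent hash should move only about 1 in 11.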
Offload-friendly network encryption in the kernel
2024-08-19T01:23:35+00:00
The PSP security protocol (PSP) is a way to transparently encrypt packets by efficiently offloading encryption and decryption to the network interface cards (NICs) that Google uses for connections inside its data centers. The protocol is similar to IPsec, in that it allows for wrapping arbitrary traffic in a layer of encryption. The difference is that PSP is encapsulated in UDP, and designed from the beginning to reduce the amount of state that NICs have to track in order to send and receive encrypted traffic, allowing for more simultaneous connections.
How I Use Git Worktrees
- git
2024-08-19T01:19:02+00:00
There are a bunch of posts on the internet about using the git worktree command. As far as I can tell, most of them are primarily about using worktrees as a replacement for, or a supplement to, git branches. Instead of switching branches, you just change directories. This is also how I originally had used worktrees, but that didn’t stick, and I abandoned them. But recently worktrees grew on me, though my new use-case is unlike branching.
Pragmatic Category Theory | Part 1: Semigroup Intro
- math
2024-08-19T01:16:23+00:00
A semigroup describes an operation of appending two values of some type to get a value of the same type.
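A small sketch may make the definition concrete. It treats a semigroup as nothing more than a two-argument `append` function on one type, which is also expected to be associative (the standard semigroup law). The function names here are hypothetical and not taken from the linked series.

```python
from functools import reduce
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def combine_all(append: Callable[[T, T], T], first: T, rest: Iterable[T]) -> T:
    """Fold a non-empty collection using the semigroup's append operation."""
    return reduce(append, rest, first)


def is_associative_on(append: Callable[[T, T], T], a: T, b: T, c: T) -> bool:
    """Spot-check the associativity law on one triple of sample values."""
    return append(append(a, b), c) == append(a, append(b, c))


def concat(a: str, b: str) -> str:
    """String concatenation: appending two strings yields a string."""
    return a + b
```

For instance, `combine_all(concat, "a", ["b", "c"])` yields `"abc"`, and `max` on integers fits the same shape. Subtraction takes and returns integers too, but fails the associativity check, so it is not a semigroup operation.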
How to Get or Create in PostgreSQL
- sql
- postgresql
- database
2024-08-19T01:14:46+00:00
"Get or create" is a very common operation for syncing data in the database, but implementing it correctly may be trickier than you may expect. If you ever had to implement it in a real system with real-life load, you may have overlooked potential race conditions, concurrency issues and even bloat!
Can You Do Both: Fast Scans and Fast Writes in a Single System?
- storage
2024-08-19T01:13:01+00:00
The central component for handling HTAP workloads is our hybrid column-row storage engine that is able to manage hot and cold data in two different storage formats. In OLTP workloads, data access is typically focused on a small hot subset of the data. To efficiently support OLTP transactions, we store this hot data in an uncompressed, row-based format. Cold data, which OLTP queries do not access frequently, is stored as large and encoded (i.e., compressed to allow processing without decompression) column chunks, an adaptation from data blocks
OpenObserve
- observability
- open-telemetry
2024-08-15T18:51:18+00:00
OpenObserve is a simple yet sophisticated log search, infrastructure monitoring, and APM solution. It is a full-fledged observability platform that can reduce your storage costs by ~140x compared to other solutions and requires much lower resource utilization resulting in much lower cost.
Printf Oriented Message Protocol
- messaging
- serialization
2024-08-13T15:06:58+00:00
This library offers a simple protocol to encode/decode messages and exchange them between processes on a socket (inet or local).
Heartbeat Scheduling
- concurrency
- scheduling
2024-08-13T03:38:49+00:00
A classic problem in parallel computing is to take a high-level parallel program written, for example, in nested-parallel style with fork-join constructs and run it efficiently on a real machine. The problem could be considered solved in theory, but not in practice, because the overheads of creating and managing parallel threads can overwhelm their benefits. Developing efficient parallel codes therefore usually requires extensive tuning and optimizations to reduce parallelism just to a point where the overheads become acceptable. In this paper, we present a scheduling technique that delivers provably efficient results for arbitrary nested-parallel programs, without the tuning needed for controlling parallelism overheads. The basic idea behind our technique is to create threads only at a beat (which we refer to as the “heartbeat”) and make sure to do useful work in between.
CRIB: checkpoint/restore in BPF
2024-08-08T03:31:36+00:00
CRIB, for Checkpoint/Restore in (naturally) BPF. It is far from clear that CRIB will replace the existing solutions, but it is an interesting look at a different way of solving the problem.
TQUIC
- rust
- quic
2024-08-08T03:17:10+00:00
High-performance, lightweight and cross-platform QUIC library
How to build highly-debuggable C++ binaries
- c++
- debugging
- fault-removal
- build
2024-07-29T15:37:35+00:00
How to configure your C++ toolchain to produce binaries that are highly-debuggable with respect to your current bug.
C++ Design Patterns for Low-latency Applications Including High-frequency Trading
- c++
- performance
2024-07-08T20:23:54+00:00
This work aims to bridge the existing knowledge gap in the optimisation of latency-critical code, specifically focusing on high-frequency trading (HFT) systems. The research culminates in three main contributions: the creation of a Low-Latency Programming Repository, the optimisation of a market-neutral statistical arbitrage pairs trading strategy, and the implementation of the Disruptor pattern in C++. The repository serves as a practical guide and is enriched with rigorous statistical benchmarking, while the trading strategy optimisation led to substantial improvements in speed and profitability. The Disruptor pattern showcased significant performance enhancement over traditional queuing methods. Evaluation metrics include speed, cache utilisation, and statistical significance, among others. Techniques like Cache Warming and Constexpr showed the most significant gains in latency reduction. Future directions involve expanding the repository, testing the optimised trading algorithm in a live trading environment, and integrating the Disruptor pattern with the trading algorithm for comprehensive system benchmarking. The work is oriented towards academics and industry practitioners seeking to improve performance in latency-sensitive applications
CPython Internals
2024-06-12T13:06:13+00:00
Ibid
DBus and systemd
- dbus
- systemd
- ipc
- linux
2024-06-12T03:58:25+00:00
Systemd uses DBus as the mechanism to interact with it. This article introduces just enough DBus concepts and the usage of busctl to communicate with systemd. These concepts should be useful when using DBus libraries
How we scale our microVM infrastructure using low-latency memory decompression
2024-06-12T03:56:04+00:00
How we efficiently store memory snapshots for VMs, and how we lazily load them to resume VMs within a second.
Effects, capabilities, and boxes: from scope-based reasoning to type-based reasoning and back
2024-06-12T03:51:26+00:00
Reasoning about the use of external resources is an important aspect of many practical applications. Effect systems enable tracking such information in types, but at the cost of complicating signatures of common functions. Capabilities coupled with escape analysis offer safety and natural signatures, but are often overly coarse grained and restrictive. We present System C, which builds on and generalizes ideas from type-based escape analysis and demonstrates that capabilities and effects can be reconciled harmoniously. By assuming that all functions are second class, we can admit natural signatures for many common programs. By introducing a notion of boxed values, we can lift the restrictions of second-class values at the cost of needing to track degree-of-impurity information in types. The system we present is expressive enough to support effect handlers in full capacity. We practically evaluate System C in an implementation and prove its soundness.
FaaSwap: SLO-Aware, GPU-Efficient Serverless Inference via Model Swapping
- FaaS
- scheduling
2024-06-12T03:12:42+00:00
Serverless computing has become increasingly popular for machine learning inference. However, current serverless platforms lack efficient support for GPUs, limiting their ability to deliver low-latency inference. In this paper, we propose FaaSwap, a GPU-efficient serverless inference platform. FaaSwap employs a holistic approach to system and algorithm design. It maintains models in main memory and dynamically swaps them onto GPUs upon request arrivals (i.e., late binding), thereby enabling a large number of inference functions to efficiently share a node's GPUs. FaaSwap uses various techniques, including asynchronous API redirection, GPU runtime sharing, pipelined model execution, and efficient GPU memory management, to achieve the optimal performance. We also develop an interference-aware request scheduling algorithm that allows FaaSwap to meet the latency SLOs for individual inference functions. We have implemented FaaSwap as a prototype on a leading commercial serverless platform. Experimental evaluations demonstrate that, with model swapping, FaaSwap can concurrently serve hundreds of functions on a single worker node with 4 V100 GPUs, while achieving inference performance comparable to native execution (where each function runs on a dedicated GPU). When deployed on a 6-node production testbed, FaaSwap meets the latency SLOs for over 1k functions, the maximum that the testbed can handle concurrently.
bpftime: userspace eBPF Runtime for Uprobe, Syscall and Kernel-User Interactions
- ebpf
2024-06-12T03:03:22+00:00
In kernel-centric operations, the uprobe component of eBPF frequently encounters performance bottlenecks, largely attributed to the overheads borne by context switches. Transitioning eBPF operations to user space bypasses these hindrances, thereby optimizing performance. This also enhances configurability and obviates the necessity for root access or privileges for kernel eBPF, subsequently minimizing the kernel attack surface. This paper introduces bpftime, a novel user-space eBPF runtime, which leverages binary rewriting to implement uprobe and syscall hook capabilities. Through bpftime, userspace uprobes achieve a 10x speed enhancement compared to their kernel counterparts without requiring dual context switches. Additionally, this runtime facilitates the programmatic hooking of syscalls within a process, both safely and efficiently. Bpftime can be seamlessly attached to any running process, limiting the need for either a restart or manual recompilation. Our implementation also extends to interprocess eBPF Maps within shared memory, catering to summary aggregation or control plane communication requirements. Compatibility with existing eBPF toolchains such as clang and libbpf is maintained, not only simplifying the development of user-space eBPF without necessitating any modifications but also supporting CO-RE through BTF. Through bpftime, we not only enhance uprobe performance but also extend the versatility and user-friendliness of eBPF runtime in user space, paving the way for more efficient and secure kernel operations.
ttl.sh
- docker
- containers
2024-06-12T03:01:31+00:00
Anonymous & ephemeral Docker image registry
ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models
- FaaS
- scheduling
2024-06-12T02:56:28+00:00
This paper presents ServerlessLLM, a locality-enhanced serverless inference system for Large Language Models (LLMs). ServerlessLLM exploits the substantial capacity and bandwidth of storage and memory devices available on GPU servers, thereby reducing costly remote checkpoint downloads and achieving efficient checkpoint loading. ServerlessLLM achieves this through three main contributions: (i) fast LLM checkpoint loading via a novel loading-optimized checkpoint format design, coupled with an efficient multi-tier checkpoint loading system; (ii) locality-driven LLM inference with live migration, which allows ServerlessLLM to effectively achieve locality-driven server allocation while preserving the low latency of ongoing LLM inference; and (iii) locality-aware server allocation, enabling ServerlessLLM to evaluate the status of each server in a cluster and effectively schedule model startup time to capitalize on local checkpoint placement. Our comprehensive experiments, which include microbenchmarks and real-world traces, show that ServerlessLLM surpasses state-of-the-art systems by 10 - 200X in latency performance when running various LLM inference workloads
High-Speed Packet Transmission in Go: From net.Dial to AF_XDP
2024-06-12T02:50:45+00:00
Pushing limits in Go: from net.Dial to syscalls, AF_PACKET, and lightning-fast AF_XDP. Benchmarking packet sending performance.
Linux Crisis Tools
- linux
2024-06-12T02:48:30+00:00
When you have an outage caused by a performance issue, you don't want to lose precious time just to install the tools needed to diagnose it. Here is a list of "crisis tools" I recommend installing on your Linux servers by default
43 Years of Actors: A Taxonomy of Actor Models and Their Key Properties
- concurrency
- actors
2024-06-12T02:44:51+00:00
The Actor Model is a message passing concurrency model that was originally proposed by Hewitt et al. in 1973. It is now 43 years later and since then researchers have explored a plethora of variations on this model. This paper presents a history of the Actor Model throughout those years. The goal of this paper is not to provide an exhaustive overview of every actor system in existence but rather to give an overview of some of the exemplar languages and libraries that influenced the design and rationale of other actor systems throughout those years. This paper therefore shows that most actor systems can be roughly classified into four families, namely: Classic Actors, Active Objects, Processes and Communicating Event-Loops. This paper also defines the Isolated Turn Principle as a unifying principle across those four families. Additionally this paper lists some of the key properties along which actor systems can be evaluated and formulates some general insights about the design and rationale of the different actor families across those dimensions
KnapsackLB: Enabling Performance-Aware Layer-4 Load Balancing
- load-balancing
2024-06-12T02:41:38+00:00
The Layer-4 load balancer (LB) is a key building block of online services. In this paper, we empower such LBs to adapt to different and dynamic performance of backend instances (DIPs). Our system, KNAPSACKLB, is generic (can work with a variety of LBs), does not require agents on DIPs, LBs or clients, and scales to large numbers of DIPs. KNAPSACKLB uses judicious active probes to learn a mapping from LB weights to the response latency of each DIP, and then applies Integer Linear Programming (ILP) to calculate LB weights that optimize latency, using an iterative method to scale the computation to large numbers of DIPs. Using testbed experiments and simulations, we show that KNAPSACKLB load balances traffic as per the performance and cuts average latency by up to 45% compared to existing designs.
acme-dns
- acme
- dns
2024-06-12T02:39:35+00:00
Limited DNS server with RESTful HTTP API to handle ACME DNS challenges easily and securely
ssh authentication via Yubikeys
- integrity
- authentication
- yubi
- ssh
2024-06-12T02:35:42+00:00
SSH public key authentication can be hardened to require a hardware token such as a YubiKey (series 5 onwards).
Dispatch
2024-06-12T02:33:46+00:00
You don’t have to rewrite your code to leverage Dispatch, nor are there any new APIs to learn. Our Python SDK exposes a single decorator that wraps your function to add automatic retries, execution resumability, rate limiting and asynchronous execution.
SFVInt: Simple, Fast and Generic Variable-Length Integer Decoding using Bit Manipulation Instructions
- serialization
2024-06-12T02:30:56+00:00
The ubiquity of variable-length integers in data storage and communication necessitates efficient decoding techniques. In this paper, we present SFVInt, a simple and fast approach to decode the prevalent Little Endian Base-128 (LEB128) varints. Our approach effectively utilizes the Bit Manipulation Instruction Set 2 (BMI2) in modern Intel and AMD processors, achieving significant performance improvement while maintaining simplicity and avoiding overengineering. SFVInt, with its generic design, effectively processes both 32-bit and 64-bit unsigned integers using a unified code template, marking a significant leap forward in varint decoding efficiency. We thoroughly evaluate SFVInt's performance across various datasets and scenarios, demonstrating that it achieves up to a 2x increase in decoding speed when compared to varint decoding methods used in established frameworks like Facebook Folly and Google Protobuf.
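For readers unfamiliar with the format, the sketch below shows a plain byte-at-a-time encoder and decoder for unsigned LEB128. This is the conventional scalar baseline, not the BMI2-based technique the paper presents. The function names are hypothetical.

```python
from typing import Tuple


def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128 (7 data bits per byte)."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F          # take the low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(byte)         # final byte: continuation bit clear
            return bytes(out)


def decode_uleb128(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode one varint starting at `offset`; return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]          # raises IndexError on truncated input
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7
```

The per-byte branch on the continuation bit in `decode_uleb128` is the kind of data-dependent loop that a bit-manipulation approach aims to avoid.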
Crimes with Python's Pattern Matching
- python
2024-06-04T15:00:54+00:00
That made me wonder if ABCs could “hijack” a pattern match ....
tantivy
- search
2024-05-27T20:28:01+00:00
Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
signify
- integrity
2024-05-16T22:30:35+00:00
OpenBSD tool to sign and verify signatures on files
Entity Attestation Token White Paper
2024-05-03T14:43:24+00:00
As the digital world has changed and adapted, we have found ourselves with many connected devices that are quite different to the typical laptop or phone that we are used to. To make use of these new devices, Cloud Service Providers (CSPs) need to integrate them into their cloud platforms and require a way to identify, characterize and authenticate them. The usual Internet protocols and services for security and authentication are not device-oriented and do not suit this purpose. This white paper introduces the idea of device attestation, the upcoming Entity Attestation Token (EAT) standard and how the PSA Certified ecosystem is planning to support it.
Common DB schema change mistakes
- postgresql
2024-04-29T16:53:46+00:00
In his article "Lesser Known PostgreSQL Features", @be_haki describes 18 Postgres features many people don't know. I enjoyed that article, and it inspired me to write about "anti-features" – things that everyone should avoid when working in probably the riskiest field of application development – so-called "schema migrations".
Python AppImage
- python
- deployment
- linux
2024-04-26T12:29:59+00:00
AppImage distributions of Python
Firejail
- linux
- sandbox
2024-04-26T12:27:42+00:00
Firejail is a SUID program that reduces the risk of security breaches by restricting the running environment of untrusted applications using Linux namespaces and seccomp-bpf. It allows a process and all its descendants to have their own private view of the globally shared kernel resources, such as the network stack, process table, mount table.…
RFC 9052 - CBOR Object Signing and Encryption (COSE)
2024-04-22T15:59:37+00:00
This specification describes how to create and process signatures, message authentication codes, and encryption using CBOR for serialization. This specification additionally describes how to represent cryptographic keys using CBOR.
Device Onboarding Using FDO and the Untrusted Installer Model
- authentication
- device-onboarding
2024-04-22T15:56:57+00:00
All the devices, in all these fields, share an important characteristic: They were all manufactured under some initial ownership, and they were all transferred into their target application, coming under another ownership. Only in the target context does the IoT device perform its intended function while interacting with supporting servers. The challenge is to set up this interaction in a manner that is fast, reliable, and secure. This process is referred to as onboarding.
Device Onboarding Overview
- authentication
- device-onboarding
2024-04-22T15:56:02+00:00
Device onboarding is the process of installing secrets and configuration data into a device so that the device is able to connect and interact securely with cloud and edge management platforms. The platform is used by the device owner to manage the device by: patching security vulnerabilities; installing or updating software; retrieving sensor data; interacting with actuators; etc. FIDO Device Onboard (FDO) is an automatic onboarding mechanism, meaning that it is invoked autonomously and performs only limited, specific interactions with its environment to complete. FIDO Device Onboard permits late binding of device credentials, so that one manufactured device may be onboarded, without modification, to many different cloud and edge management platforms.
Remote ATtestation ProcedureS (rats)
2024-04-22T15:54:53+00:00
In network protocol exchanges, it is often the case that one entity (a Relying Party) requires evidence about the remote peer (and system components [RFC4949] thereof), in order to assess the trustworthiness of the peer. Remote attestation procedures (RATS) determine whether relying parties can establish a level of confidence in the trustworthiness of remote peers, called Attesters. The objective is achieved by a two-stage appraisal procedure facilitated by a trusted third party, called Verifier, with trusted links to the supply chain.
capslock
- go
- capabilities
- sandbox
2024-04-22T00:05:56+00:00
Capslock is a capability analysis CLI for Go packages that informs users of which privileged operations a given package can access. This works by classifying the capabilities of Go packages by following transitive calls to privileged standard library operations
Historical records with PostgreSQL, temporal tables and SQL:2011
- postgresql
- bitemporal
2024-03-26T13:08:25+00:00
Sometimes you need to find out what a record looked like at some point in the past. This is known as the Slowly Changing Dimension problem. Most database models - by design - don’t keep the history of a record when it’s updated. But there are plenty of reasons why you might need to do this, such as audit/security purposes, implementing an undo functionality, showing a model’s change over time for stats or comparison. There are a few ways to do this in PostgreSQL, but this article is going to focus on the implementation provided by the SQL:2011 standard, which added support for temporal databases. It’s also going to focus on actually querying that historical data, with some real-life use cases. PostgreSQL doesn’t support these features natively, but the temporal tables extension approximates them.
How we built a fair multi-tenant queuing system
2024-03-26T13:06:43+00:00
A fair, low-latency, multi-tenant queue which operates with multiple shared-nothing workers that claim jobs in an (almost) contention-free way.
The design and implementation of a lock-free ring-buffer with contiguous reservations
- non-blocking
- datastructure
2024-03-26T13:05:31+00:00
A BipBuffer is a bi-partite circular buffer that always supports writing a contiguous chunk of data, instead of potentially splitting a write in two chunks when it straddles the buffer's boundaries. Circular buffers are a common primitive for asynchronous (inter- or intra- thread) communication. Let's start with a very abstract, idealised view of the circular buffer interface, and then consider real-world constraints one by one, till we get to the BipBuffer design.
Sparrow: Distributed, Low Latency Scheduling
- scheduling
- load-balancing
2024-03-26T13:04:14+00:00
Large-scale data analytics frameworks are shifting towards shorter task durations and larger degrees of parallelism to provide low latency. Scheduling highly parallel jobs that complete in hundreds of milliseconds poses a major challenge for task schedulers, which will need to schedule millions of tasks per second on appropriate machines while offering millisecond-level latency and high availability. We demonstrate that a decentralized, randomized sampling approach provides near-optimal performance while avoiding the throughput and availability limitations of a centralized design. We implement and deploy our scheduler, Sparrow, on a 110-machine cluster and demonstrate that Sparrow performs within 12% of an ideal scheduler.
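A minimal sketch of the randomized sampling idea mentioned above, assuming a simple per-task "probe a few random workers, pick the shortest queue" rule. This is an illustration only, not Sparrow's actual scheduler; the function name `place_tasks` and its parameters are hypothetical.

```python
import random


def place_tasks(num_tasks, queue_lengths, probes=2, rng=None):
    """Place tasks one at a time using randomized sampling.

    For each task, probe `probes` distinct workers chosen uniformly at
    random and enqueue the task on the probed worker with the shortest
    queue. `queue_lengths` is updated in place. All names here are
    hypothetical and chosen for illustration.
    """
    rng = rng or random.Random()
    placements = []
    for _ in range(num_tasks):
        # Sample a small set of candidate workers instead of scanning all of them.
        candidates = rng.sample(range(len(queue_lengths)), probes)
        # Pick the least loaded of the probed workers.
        chosen = min(candidates, key=lambda worker: queue_lengths[worker])
        queue_lengths[chosen] += 1
        placements.append(chosen)
    return placements
```

No central component needs a global view of the cluster: each placement decision only looks at the handful of workers it probed.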
talc
- rust
- memory
2024-02-29T02:24:35+00:00
A fast and flexible allocator for no_std and WebAssembly
CRIU -- Checkpoint Restore in Userspace for computational simulations and scientific applications
- checkpointing
2024-02-29T02:16:29+00:00
Creating new materials, discovering new drugs, and simulating systems are essential processes for research and innovation and require substantial computational power. While many applications can be split into many smaller independent tasks, some cannot and may take hours or weeks to run to completion. To better manage those longer-running jobs, it would be desirable to stop them at any arbitrary point in time and later continue their computation on another compute resource; this is usually referred to as checkpointing. While some applications can manage checkpointing programmatically, it would be preferable if the batch scheduling system could do that independently. This paper evaluates the feasibility of using CRIU (Checkpoint Restore in Userspace), an open-source tool for the GNU/Linux environments, emphasizing the OSG's OSPool HTCondor setup. CRIU allows checkpointing the process state into a disk image and can deal with both open files and established network connections seamlessly. Furthermore, it can checkpoint traditional Linux processes and containerized workloads. The functionality seems adequate for many scenarios supported in the OSPool. However, some limitations prevent it from being usable in all circumstances.
DotSlash: Simplified executable deployment
- deployment
2024-02-29T02:15:39+00:00
With DotSlash, a set of platform-specific executables is replaced with a single script containing descriptors for the supported platforms. DotSlash transparently handles fetching, decompressing, and verifying the appropriate remote artifact for the current operating system and CPU.
IfState
- linux
- python
- networking
2024-02-29T02:12:06+00:00
IfState is a Python 3 utility to configure the Linux network stack in a declarative manner. It is a frontend for the kernel’s netlink protocol.
Multi-Objective Optimization of Consumer Group Autoscaling in Message Broker Systems
- queuing
- scaling
2024-02-29T02:11:12+00:00
Message brokers often mediate communication between data producers and consumers by adding variable-sized messages to ordered distributed queues. Our goal is to determine the number of consumers and consumer-partition assignments needed to ensure that the rate of data consumption keeps up with the rate of data production. We model the problem as a variable item size bin packing problem. As the rate of production varies, new consumer-partition assignments are computed, which may require rebalancing a partition from one consumer to another. While rebalancing a queue, the data being produced into the queue is not read, leading to additional latency costs. As such, we focus on the multi-objective optimization cost of minimizing both the number of consumers and queue migrations. We present a variety of algorithms and compare them to established bin packing heuristics for this application. Compared with Kafka's commonly employed assignment strategy, our proposed consumer group assignment strategy achieves a 90th percentile latency of 4.52s against Kafka's 217s, with both using the same number of consumers. Kafka's assignment strategy only improved the consumer group's performance with regard to latency with configurations that used at least 60% more resources than our approach.
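To make the bin packing framing concrete, here is a hedged sketch of first-fit decreasing, a classic bin packing heuristic, applied to partitions and consumers. It is not one of the paper's proposed algorithms and it ignores migration cost entirely; `first_fit_decreasing`, `partition_rates`, and `consumer_capacity` are hypothetical names.

```python
def first_fit_decreasing(partition_rates, consumer_capacity):
    """Assign partitions to consumers with first-fit decreasing.

    `partition_rates` maps a partition name to its production rate and
    `consumer_capacity` is the rate a single consumer can sustain (both
    in the same, assumed, unit). Returns one list of partition names
    per consumer.
    """
    consumers = []  # each entry: {"load": total rate, "partitions": [names]}
    # Consider the heaviest partitions first.
    ordered = sorted(partition_rates.items(), key=lambda kv: kv[1], reverse=True)
    for name, rate in ordered:
        if rate > consumer_capacity:
            raise ValueError(f"partition {name!r} exceeds consumer capacity")
        for consumer in consumers:
            # Use the first existing consumer with enough spare capacity.
            if consumer["load"] + rate <= consumer_capacity:
                consumer["partitions"].append(name)
                consumer["load"] += rate
                break
        else:
            # No existing consumer fits, so start a new one.
            consumers.append({"load": rate, "partitions": [name]})
    return [consumer["partitions"] for consumer in consumers]
```

Recomputing the assignment from scratch like this whenever rates change can move many partitions at once, which is the rebalancing cost the abstract describes as the second objective.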
pgtemp
- postgresql
- testing
2024-02-29T02:08:58+00:00
Pgtemp is a Rust library and CLI tool that allows you to easily create temporary PostgreSQL servers for testing without using Docker. The pgtemp Rust library allows you to spawn a PostgreSQL server in a temporary directory and get back a full connection URI with the host, port, username, and password.
Virtio live migration technical deep dive
- virtualization
2024-02-29T02:08:07+00:00
This article explores the live migration steps QEMU performs and how it tracks the information it needs to make the process transparent. It explains how QEMU coordinates with vhost-kernel, the device already described in the vhost-net deep dive. I will show how the device can report all the data required for the destination QEMU to continue the device operation. I will also explain how the guest can switch device properties, such as MAC address or number of active queues, and resume the workload seamlessly in the destination.
A Systematic Literature Review on Task Allocation and Performance Management Techniques in Cloud Data Center
- scheduling
2024-02-29T02:06:42+00:00
As cloud computing usage grows, cloud data centers play an increasingly important role. To maximize resource utilization, ensure service quality, and enhance system performance, it is crucial to allocate tasks and manage performance effectively. The purpose of this study is to provide an extensive analysis of task allocation and performance management techniques employed in cloud data centers. The aim is to systematically categorize and organize previous research by identifying the cloud computing methodologies, categories, and gaps. A literature review was conducted, which included the analysis of 463 task allocation and 480 performance management papers. The review revealed three task allocation research topics and seven performance management methods. Task allocation research areas are resource allocation, load balancing, and scheduling. Performance management includes monitoring and control, power and energy management, resource utilization optimization, quality of service management, fault management, virtual machine management, and network management. The study proposes new techniques to enhance cloud computing work allocation and performance management. Shortcomings in each approach can guide future research. The research's findings on cloud data center task allocation and performance management can assist academics, practitioners, and cloud service providers in optimizing their systems for dependability, cost-effectiveness, and scalability. Innovative methodologies can steer future research to fill gaps in the literature.
openzfs-nvme-databases
- database
- storage
- zfs
2024-02-29T02:06:14+00:00
This documents the settings we use at Let's Encrypt to create ZFS backing storage for MariaDB, and the tips and best practices that led us here.
Standard Webhooks
- web
- webhooks
2024-02-29T01:59:37+00:00
Standard Webhooks is a set of open source tools and guidelines to send webhooks easily, securely and reliably. Webhooks are becoming increasingly popular, though every webhooks provider implements them differently and with varying quality. This makes it hard for providers who need to reinvent the wheel every time and repeat the same costly mistakes, and annoying for consumers who need to have a different implementation for each provider. It's also holding back the ecosystem as a whole, as these incompatibilities mean that no tools are being built to help senders send, consumers consume, and for everyone to innovate on top.
OpenPubKey SSH
- openpubkey
- ssh
2024-02-29T01:55:24+00:00
The following guide covers how to install and deploy OpenPubkey SSH to enable SSH access without the use of SSH keys.
Schism: a workload-driven approach to database replication and partitioning
- sharding
- scalability
2024-02-29T01:52:54+00:00
We present Schism, a novel workload-aware approach for database partitioning and replication designed to improve scalability of shared-nothing distributed databases. Because distributed transactions are expensive in OLTP settings (a fact we demonstrate through a series of experiments), our partitioner attempts to minimize the number of distributed transactions, while producing balanced partitions. Schism consists of two phases: i) a workload-driven, graph-based replication/partitioning phase and ii) an explanation and validation phase. The first phase creates a graph with a node per tuple (or group of tuples) and edges between nodes accessed by the same transaction, and then uses a graph partitioner to split the graph into k balanced partitions that minimize the number of cross-partition transactions. The second phase exploits machine learning techniques to find a predicate-based explanation of the partitioning strategy (i.e., a set of range predicates that represent the same replication/partitioning scheme produced by the partitioner). The strengths of Schism are: i) independence from the schema layout, ii) effectiveness on n-to-n relations, typical in social network databases, iii) a unified and fine-grained approach to replication and partitioning. We implemented and tested a prototype of Schism on a wide spectrum of test cases, ranging from classical OLTP workloads (e.g., TPC-C and TPC-E), to more complex scenarios derived from social network websites (e.g., Epinions.com), whose schema contains multiple n-to-n relationships, which are known to be hard to partition. Schism consistently outperforms simple partitioning schemes, and in some cases proves superior to the best known manual partitioning, reducing the cost of distributed transactions by up to 30%.
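The graph construction in the first phase can be sketched roughly as follows, assuming each transaction is given simply as a list of tuple identifiers. The graph partitioner itself is omitted; the second helper only scores a given tuple-to-partition assignment. Both function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_coaccess_graph(transactions):
    """Build a weighted co-access graph from a transaction trace.

    Every tuple id is a node. An edge (a, b) with a < b is added for
    each pair of tuples touched by the same transaction, and its weight
    counts how many transactions touched both.
    """
    edges = Counter()
    for txn in transactions:
        for a, b in combinations(sorted(set(txn)), 2):
            edges[(a, b)] += 1
    return edges


def count_distributed(transactions, assignment):
    """Count transactions whose tuples span more than one partition."""
    return sum(
        1
        for txn in transactions
        if len({assignment[tuple_id] for tuple_id in txn}) > 1
    )
```

For example, with `transactions = [[1, 2], [2, 3], [1, 2], [4, 5]]`, an assignment that keeps tuples 1, 2, 3 together and 4, 5 together yields no distributed transactions, while separating tuple 1 from tuple 2 makes two of the four transactions distributed.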
Choreographic Programming
- choreographic-programming
- concurrency
2024-02-29T01:47:54+00:00
Choreographies are coordination plans for concurrent and distributed systems. A choreography defines the roles of the involved participants and how they are supposed to work together. In the emerging paradigm of choreographic programming (CP), choreographies are programs that can be compiled to executable implementations.