Htmx allows you to access AJAX, WebSockets and Server Sent Events directly in HTML, using attributes, so you can build modern user interfaces with the simplicity and power of hypertext
DuckDB is an embedded database designed to execute analytical SQL queries fast while embedded in another process
C library for accessing the PostgreSQL parser outside of the server. This library uses the actual PostgreSQL server source to parse SQL queries and return the internal PostgreSQL parse tree.
"Distroless" images contain only your application and its runtime dependencies. They do not contain package managers, shells or any other programs you would expect to find in a standard Linux distribution.
This page provides an overview of barebones drop-in minimal CSS boilerplate frameworks
For decades, applications deployed on a world-wide scale have been forced to give up at least one of (1) strict serializability (2) low latency writes (3) high transactional throughput. In this paper we discuss SLOG: a system that avoids this tradeoff for workloads which contain physical region locality in data access. SLOG achieves high-throughput, strictly serializable ACID transactions at geo-replicated distance and scale for all transactions submitted across the world, all the while achieving low latency for transactions that initiate from a location close to the home region for data they access. Experiments find that SLOG can reduce latency by more than an order of magnitude relative to state-of-the-art strictly serializable geo-replicated database systems such as Spanner and Calvin, while maintaining high throughput under contention.
The shared log paradigm is at the heart of modern distributed applications in the growing cloud computing industry. Often, application logs must be stored durably for analytics, regulations, or failure recovery, and their smooth operation depends closely on how the log is implemented. Scalog is a new implementation of the shared log abstraction that offers an unprecedented combination of features for continuous smooth delivery of service: Scalog allows applications to customize data placement, supports reconfiguration with no loss in availability, and recovers quickly from failures. At the same time, Scalog provides high throughput and total order.
Llvm-mca is a performance analysis tool that uses information available in LLVM (e.g. scheduling models) to statically measure the performance of machine code in a specific CPU. Performance is measured in terms of throughput as well as processor resource consumption. The tool currently works for processors with an out-of-order backend, for which there is a scheduling model available in LLVM.
An error kernel is the part of an application or system that should never fail. This is ideally as small as possible since the less surface area that it has, the less likely catastrophic failures will happen. When they do happen, the core system is small enough to reason about which greatly helps with triage and cauterizing of wounds.
C++ reactors, coros, datastructures, etc
Today I want to criticize the whole logging culture and provide a bunch of alternatives.
This is a collection of lectures and labs Linux kernel topics. The lectures focus on theoretical and Linux kernel exploration. The labs focus on device drivers topics and they resemble “howto” style documentation
MeiliSearch is a RESTful search API. It aims to be a ready-to-go solution for everyone who wants a fast and relevant search experience for their end-users
In this document, the internals of PostgreSQL for database administrators and system developers are described.
Quicly is a QUIC implementation, written from the ground up to be used within the H2O HTTP server.
Quiche is an implementation of the QUIC transport protocol and HTTP/3 as specified by the IETF. It provides a low level API for processing QUIC packets and handling connection state. The application is responsible for providing I/O (e.g. sockets handling) as well as an event loop with support for timers.
Random replication is widely used in data center storage systems to prevent data loss. However, random replication is almost guaranteed to lose data in the common scenario of simultaneous node failures due to cluster-wide power outages. Due to the high fixed cost of each incident of data loss, many data center operators prefer to minimize the frequency of such events at the expense of losing more data in each event. We present Copyset Replication, a novel general-purpose replication technique that significantly reduces the frequency of data loss events
This article discusses a new replica placement technique that reduces the probability of data loss.
I am a strong believer that Bourne-derived languages are extremely bad, on the same order of badness as Perl, for programming, and consider programming sh for any purpose other than as a super-portable, lowest-common-denominator platform for build or bootstrap scripts and the like, as an extremely misguided endeavor.
Lightweight, fault-tolerant message streams
Library for Restartable Sequences.
Data-oriented design is inspired by high-performance computing techniques, database design, and functional programming values. It provides a practical methodology that reduces complexity while improving performance of both your development team and your product. Understand the goal, understand the data, understand the hardware, develop the solution
Tracking, Benchmarking and Sharing Information about an open source embedded data storage engines, internals, architectures, data storage and transaction processing.
MsQuic is a Microsoft implementation of the IETF QUIC protocol. It is cross platform, written in C and designed to be a general purpose QUIC library
NPF is a layer 3 packet filter, supporting stateful packet inspection, IPv6, NAT, IP sets, extensions and many more. It uses BPF as its core engine and it was designed with a focus on high performance, scalability, multi-threading and modularity. NPF was written from scratch in 2009. It is written in C99 and distributed under the 2-clause BSD license. NPF is provided as a userspace library to be used in a bespoke application to process packets. It can run on Linux, typically, in combination with such frameworks like Data Plane Development Kit (DPDK) or netmap.
This informational document is an attempt to make the case for implementing network protocols without performing I/O, and to provide concrete assistance and instructions for doing so in Python
A small and efficient runtime for WebAssembly & WASI
WAVM is a WebAssembly virtual machine, designed for use in non-web applications.
Lucet is a native WebAssembly compiler and runtime. It is designed to safely execute untrusted WebAssembly programs inside your application
Krustlet acts as a Kubelet by listening on the event stream for new pods that the scheduler assigns to it based on specific Kubernetes tolerations. The default implementation of Krustlet listens for the architecture wasm32-wasi and schedules those workloads to run in a wasmtime-based runtime instead of a container runtime
I’m writing the book in Markdown, then use Pandoc to convert it to PDF and EPUB.
Setting up a new C++ project usually requires a significant amount of preparation and boilerplate code, even more so for modern C++ projects with tests, executables and contiguous integration. This template is the result of learnings from many previous projects and should help reduce the work required to setup up a modern C++ project.
Pg_auto_failover is an extension and service for PostgreSQL that monitors and manages automated failover for a Postgres cluster. It is optimized for simplicity and correctness and supports Postgres 10 and newer
The notion of consistency is used across different computer science disciplines from distributed systems to database systems to computer architecture. It turns out that consistency can mean quite different things across these disciplines, depending on who uses it and in what context it appears. We identify two broad types of consistency, state consistency and operation consistency, which differ fundamentally in meaning and scope. We explain how these types map to the many examples
This is a list of resources related to LD_PRELOAD, a mechanism for changing application behavior at run-time. Libraries can override specified functions with another, for example, making time(3) always return 0. This is often useful for testing or modifying application behavior without source code changes.
Distributed consensus is a fundamental primitive for constructing fault-tolerant, strongly-consistent distributed systems. Though many distributed consensus algorithms have been proposed, just two dominate production systems: Paxos, the traditional, famously subtle, algorithm; and Raft, a more recent algorithm positioned as a more understandable alternative to Paxos. In this paper, we consider the question of which algorithm, Paxos or Raft, is the better solution to distributed consensus? We analyse both to determine exactly how they differ by describing a simplified Paxos algorithm using Raft's terminology and pragmatic abstractions. We find that both Paxos and Raft take a very similar approach to distributed consensus, differing only in their approach to leader election. Most notably, Raft only allows servers with up-to-date logs to become leaders, whereas Paxos allows any server to be leader provided it then updates its log to ensure it is up-to-date. Raft's approach is surprisingly efficient given its simplicity as, unlike Paxos, it does not require log entries to be exchanged during leader election. We surmise that much of the understandability of Raft comes from the paper's clear presentation rather than being fundamental to the underlying algorithm being presented.
It is a reliable replication protocol inspired by the multiprocessor's cache-coherence protocols. Hermes combines a broadcast-based, invalidating design with logical timestamps and the idea of early value propagation to achieve the following desired properties for reliable datastores.
KCP is a fast and reliable protocol that can achieve the transmission effect of a reduction of the average latency by 30% to 40% and reduction of the maximum delay by a factor of three, at the cost of 10% to 20% more bandwidth wasted than TCP. It is implemented by using the pure algorithm, and is not responsible for the sending and receiving of the underlying protocol (such as UDP), requiring the users to define their own transmission mode for the underlying data packet, and provide it to KCP in the way of callback. Even the clock needs to be passed in from the outside, without any internal system calls.
SSH is a powerful tool which often grants a lot of access to anyone using it to log into a server. In this post, I’m going to talk about a few different ways that you can easily improve the security of your SSH model without needing to deploy a new application or make any huge changes to user experience.
Nested Intervals generalize Nested Sets. They are immune to hierarchy reorganization problem. They allow answering ancestor path hierarchical queries algorithmically - without accessing the stored hierarchy relation.
Io_uring is a clever new, high-performance interface for asynchronous I/O for Linux without the drawbacks of the aio set of APIs. In this 3-part article series, we look at how to use io_uring to get the most common programming tasks done under Linux
People often ask us for an overview of how Tailscale works. We’ve been putting off answering that, because we kept changing it! But now things have started to settle down. Let’s go through the entire Tailscale system from bottom to top, the same way we built it (but skipping some zigzags we took along the way).
And one of the most difficult and subtle topics in all of networking
Draw AWS, Azure, GCP, Kubernetes diagrams
He C++ programming language offers a strong exception mechanism for error handling at the language level, improving code readability, safety, and maintainability. However, current C++ implementations are targeted at general-purpose systems, often sacrificing code size, memory usage, and resource determinism for the sake of performance. This makes C++ exceptions a particularly undesirable choice for embedded applications where code size and resource determinism are often paramount. Consequently, embedded coding guidelines either forbid the use of C++ exceptions, or embedded C++ tool chains omit exception handling altogether. In this paper, we develop a novel implementation of C++ exceptions that eliminates these issues, and enables their use for embedded systems. We combine existing stack unwinding techniques with a new approach to memory management and run-time type information (RTTI). In doing so we create a compliant C++ exception handling implementation, providing bounded runtime and memory usage, while reducing code size requirements by up to 82%, and incurring only a minimal runtime overhead for the common case of no exceptions.
Serverless containers and functions are widely used for deploying and managing software in the cloud. Their popularity is due to reduced cost of operations, improved utilization of hardware, and faster scaling than traditional deployment methods. The economics and scale of serverless applications demand that workloads from multiple customers run on the same hardware with minimal overhead, while preserving strong security and performance isolation. The traditional view is that there is a choice between virtualization with strong security and high overhead, and container technologies with weaker security and minimal overhead. This tradeoff is unacceptable to public infrastructure providers, who need both strong security and minimal overhead. To meet this need, we developed Fire-cracker, a new open source Virtual Machine Monitor (VMM)specialized for serverless workloads, but generally useful for containers, functions and other compute workloads within a reasonable set of constraints. We have deployed Firecracker in two publically available serverless compute services at Amazon Web Services (Lambda and Fargate), where it supports millions of production workloads, and trillions of requests per month. We describe how specializing for serverless in-formed the design of Firecracker, and what we learned from seamlessly migrating Lambda customers to Firecracker.
Debezium is an open source distributed platform for change data capture. Start it up, point it at your databases, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases. Debezium is durable and fast, so your apps can respond quickly and never miss an event, even when things go wrong.
Recutils is a collection of tools, like recins, recdel, and recsel used to manage these recfiles/databases. They allow for all the normal basic relational database operations, typing, auto-incrementing, and even field-level crypto. All of this power is yours with the bonus that your database is a human-readable text file that you can grep/awk/sed freely, and a line-oriented structure makes it perfect for version control systems.
Function-as-a-Service (FaaS) platforms and "serverless" cloud computing are becoming increasingly popular. Current FaaS offerings are targeted at stateless functions that do minimal I/O and communication. We argue that the benefits of serverless computing can be extended to a broader range of applications and algorithms. We present the design and implementation of Cloudburst, a stateful FaaS platform that provides familiar Python programming with low-latency mutable state and communication, while maintaining the autoscaling benefits of serverless computing. Cloudburst accomplishes this by leveraging Anna, an autoscaling key-value store, for state sharing and overlay routing combined with mutable caches co-located with function executors for data locality. Performant cache consistency emerges as a key challenge in this architecture. To this end, Cloudburst provides a combination of lattice-encapsulated state and new definitions and protocols for distributed session consistency. Empirical results on benchmarks and diverse applications show that Cloudburst makes stateful functions practical, reducing the state-management overheads of current FaaS platforms by orders of magnitude while also improving the state of the art in serverless consistency
Bloom filters (BF) are widely used for approximate membership queries over a set of elements. BF variants allow removals, sets of unbounded size or querying a sliding window over an unbounded stream. However, for this last case the best current approaches are dictionary based (e.g., based on Cuckoo Filters or TinyTable), and it may seem that BF-based approaches will never be competitive to dictionary-based ones. In this paper we present Age-Partitioned Bloom Filters, a BF-based approach for duplicate detection in sliding windows that not only is competitive in time-complexity, but has better space usage than current dictionary-based approaches (e.g., SWAMP), at the cost of some moderate slack. APBFs retain the BF simplicity, unlike dictionary-based approaches, important for hardware-based implementations, and can integrate known improvements such as double hashing or blocking. We present an Age-Partitioned Blocked Bloom Filter variant which can operate with 2-3 cache-line accesses per insertion and around 2-4 per query, even for high accuracy filters.
We are open-sourcing Rezolus, our high-resolution systems performance telemetry agent. Rezolus began in an effort to help uncover performance anomalies and utilization spikes that were too brief to be captured through our normal observability and metrics systems. It has proven to be useful to help us quantify workload characteristics, provide data to drive optimization efforts, and has been used to diagnose runtime performance issues
In this blog post, we will walk through a new client-side load balancing technique we’ve developed and deployed widely at Twitter which has allowed our microservice architecture to efficiently scale clusters to thousands of instances. We call this new technique deterministic aperture
This paper discusses the ways in which automation of industrial processes may expand rather than eliminate problems with the human operator.
While a lot of work has gone into optimizing TCP implementations as much as possible over the years, including building offloading capabilities in both software (like in operating systems) and hardware (like in network interfaces), UDP hasn't received quite as much attention as TCP, which puts QUIC at a disadvantage. In this post we'll look at a few tricks that help mitigate this disadvantage for UDP, and by association QUIC.
Faker is a Python package that generates fake data for you. Whether you need to bootstrap your database, create good-looking XML documents, fill-in your persistence to stress test it, or anonymize data taken from a production service, Faker is for you.
Resilience engineering papers. Contribute to lorin/resilience-engineering development by creating an account on GitHub.
The three fundamental building blocks of computers, CPU, I/O and RAM can come under pressure due to contention and this is not uncommon. To both accurately size workloads and to increase hardware utilization, having information on how much pressure workloads are causing can come in very handy
A fully-featured event store and message store implemented in PostgreSQL for Pub/Sub, Event Sourcing, Messaging, and Evented Microservices applications.
When writing a bump allocator, always bump downwards. That is, allocate from high addresses, down towards lower addresses by decrementing the bump pointer. Although it is perhaps less natural to think about, it is more efficient than incrementing the bump pointer and allocating from lower addresses up to higher ones
Flat combining is a concurrency threaded technique whereby one thread performs all the operations in batch by scanning a queue of operations to-be-done and performing them together. Flat combining makes sense as long as k operations each taking O(n) separately can be batched together and done in less than O(k*n). Red black tree is a balanced binary search tree with permanent balancing warranties. Operations in red black tree are hard to batch together: for example inserting nodes in two different branches of the tree affect different areas of the tree. In this paper we investigate alternatives to making a flat combine approach work for red black trees.
In this article, I’d like to share some of our firsthand experience in designing a large-scale distributed storage system based on the Raft consensus algorithm.
How about building a tiny virtual machine manager (VMM) and a super tiny “operating system” to understand how KVM really works? That’s exactly what we’ll be doing with Sparkler.
We present Taiji, a new system for managing user traffic for large-scale Internet services that accomplishes two goals: 1) balancing the utilization of data centers and 2) minimizing network latency of user requests.
This paper presents our design and experience with a microkernel-inspired approach to host networking called Snap. Snap is a userspace networking system that supports Google’s rapidly evolving needs with flexible modules that implement a range of network functions, including edge packet switching, virtualization for our cloud platform, traffic shaping policy enforcement, and a high-performance reliable messaging and RDMA-like service
Channels are the main synchronization and communication primitive in Go, they need to be fast and scalable
The concurrent 2-trie is a dictionary-like data structure that is designed and optimized specifically to be used for translation tables in file buffer pools. When the concurrent 2-trie replaced the lock-striped hop-scotch hash tables in the Neo4j file buffer pool, file buffer accesses became 30% faster. The concurrent 2-trie optimizes for the domain of file buffer translation tables by making the following observations: First, the file page identifiers form a sequence, starting from zero until the last page in the file. Second, files can only grow by being extended at the end. Third, most database deployments are able to fit the majority of their data in memory, and often the data set fits entirely in memory. This means the translation tables are usually densely packed.
Detecting that effective “learning” from an incident has taken place is quite difficult to do. Making progress in learning from incidents is difficult to capture and characterize. However, there are a number of potential indicators that, taken together, could provide evidence of progress in learning from incidents.
An event loop for C using io_uring
Jupyter kernel for TLA⁺ and Pluscal specification languages