> **This page is a list of papers, articles, and code that I’m reading. My goal is to update it several times a week. Summaries included at my discretion.**

**Topics Include:**

- Systems, ML, Database papers
- Tech / VC Industry
- Engineering Blogs / HackerNews Links
- Interesting GitHub projects
- Economics, Literature, Culture

## Backlog

- [Billion-scale similarity search with GPUs (PDF)](https://arxiv.org/pdf/1702.08734.pdf)
- [FreshDiskANN: A Fast and Accurate Graph-Based ANN Index](https://arxiv.org/abs/2105.09613)
- [Photon: A Fast Query Engine for Lakehouse Systems](https://15721.courses.cs.cmu.edu/spring2023/papers/20-databricks/sigmod_photon.pdf)
- [AST vs Bytecode: Interpreters in the Age of Meta-Compilation (PDF)](https://stefan-marr.de/downloads/oopsla23-larose-et-al-ast-vs-bytecode-interpreters-in-the-age-of-meta-compilation.pdf)
- [Netherite: Efficient Execution of Serverless Workflows](https://www.microsoft.com/en-us/research/uploads/prod/2022/07/p1591-burckhardt.pdf)
- [“One Size Fits All”: An Idea Whose Time Has Come and Gone](https://cs.brown.edu/~ugur/fits_all.pdf)
- [Cloud-Native Transactions and Analytics in SingleStore](https://dl.acm.org/doi/pdf/10.1145/3514221.3526055)
- [We built a new SQL Engine on Arrow and DataFusion](https://www.arroyo.dev/blog/why-arrow-and-datafusion)

---

# 2025

## May 12

- https://read.engineerscodex.com/p/how-cursor-indexes-codebases-fast

## March 30

- https://lucumr.pocoo.org/2025/3/27/any-upcast/

```rust
use std::any::Any;
use std::fmt::Debug;

trait AnyDebug: Debug + Any {}
struct BoxAnyDebug(Box<dyn AnyDebug>);
impl<T: Any + Debug + 'static> AnyDebug for T {}
```

# 2024

## September 2

- [https://datatracker.ietf.org/doc/html/rfc8878](https://datatracker.ietf.org/doc/html/rfc8878)

## June 2

- [[Minimizing S3 API Cost with distributed mmap]]

## May 30

- [[The Original sin of Cloud Infrastructure (WarpStream)]]

## May 29

- [[Tiered storage won't fix Kafka]]

## May 27

- [How We Migrated Our Static Analyzer From Java To Rust](https://www.datadoghq.com/blog/engineering/how-we-migrated-our-static-analyzer-from-java-to-rust/)
  - An overview from the Datadog team of the process of migrating a complex Java project to Rust.

## May 23

- [[Parallelism without row groups]]

## May 21

https://xcancel.com/richardartoul/status/1793007476612206747?s=46

#### [The worst bug we faced at Antithesis](https://antithesis.com/blog/worst_bug/)

- Antithesis is a software testing company that runs your distributed systems through millions of hours of simulated tests. To achieve this, they built their own hypervisor, running on top of AWS EC2 Bare Metal instances which run FreeBSD.
- They started receiving weird intermittent machine crashes, which traced back to the `send` syscall.
- The error was being thrown from the AWS Elastic Network Adapter kernel driver. Every 30 minutes, a DHCP lease refresh would reset the MTU of the device. The MTU reset caused a driver restart, and during the few µs of restart time, a concurrent `send` call could observe a disabled driver, leading to the `ENODEV` error being returned.

## May 17

- [Golang is evil on shitty networks](https://withinboredom.info/2022/12/29/golang-is-evil-on-shitty-networks/)
  - A very timely and interesting counterpoint to the item below. Golang [by default has TCP_NODELAY enabled](https://github.com/golang/go/blob/master/src/net/tcpsock.go#L259-L271) for all TCP sockets, and the author has a shitty home router.
  - The app generated a large number of tinygrams, which overwhelmed the router’s buffers and led to frequent dropped packets. This caused the TCP connection to continually revert to slow-start.
  - The application the author measured was `git-lfs`, a CLI app written in Go. While it probably makes sense for RPC servers to turn on `TCP_NODELAY`, for CLIs and other software without reasonable expectations about the network, you may want to leave Nagle on (see the sketch below).
- [[It's always TCP_NODELAY. Every Damn Time.]]
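For contrast, a minimal Rust sketch of the same knob: Rust’s std leaves Nagle’s algorithm on unless you opt out per socket, the opposite of Go’s default (the helper and its `latency_sensitive` parameter are my illustration, not from the article):

```rust
use std::net::TcpStream;

/// Connect and decide explicitly what to do with Nagle's algorithm.
/// Rust leaves TCP_NODELAY off by default; Go enables it on every socket.
fn connect(addr: &str, latency_sensitive: bool) -> std::io::Result<TcpStream> {
    let stream = TcpStream::connect(addr)?;
    // Disable Nagle only for chatty, latency-sensitive protocols (e.g. RPC).
    // Bulk transfers over lossy links are better off coalescing tinygrams.
    stream.set_nodelay(latency_sensitive)?;
    Ok(stream)
}
```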
## May 2

- [Rust 1.78 announcement](https://blog.rust-lang.org/2024/05/02/Rust-1.78.0.html)
  - Most exciting thing here is the newly stabilized `#[diagnostic::on_unimplemented]` attribute, which lets library authors customize the compiler error message for things like unimplemented traits.

## May 1

- [How Postgres Makes Transactions Atomic](https://brandur.org/postgres-atomicity)
  - Oldie (2017) but goodie

## April 10

Thread from the author on Meta’s new Nimble file format. [https://twitter.com/andreweduffy/status/1778054712517857458](https://twitter.com/andreweduffy/status/1778054712517857458)

## March 31

- [MonetDB/X100: Hyper-Pipelining Query Execution](https://15721.courses.cs.cmu.edu/spring2024/papers/04-execution1/boncz-cidr2005.pdf)

This is a 2005 paper from the CMU Advanced Database Systems syllabus. The authors are from CWI (the same research lab behind DuckDB). The purpose of the paper is to demonstrate how DBMSs of the time were poorly suited to modern super-scalar CPUs, and to give an overview of the design and implementation of a new execution engine for MonetDB with better mechanical sympathy.

What are super-scalar CPUs?

- They are the processors we’re most familiar with today (Intel, ARM)
- Designed with multiple **pipelines**, which are able to reorder code execution
- Having more pipelines can be better than having a faster clock speed in many cases
- Data dependencies (e.g. a load following a store) create **pipeline breaks**, which force execution to stall, adding clock cycles

Why were contemporary DBs poorly designed for super-scalar CPUs?

- They weren’t built with pipeline awareness. This led to some bad outcomes:
  - Filters were implemented with branching, which causes pipeline breaks
  - Tuple-at-a-time processing forces you to do LD, OP, STR in sequence. This creates a pipeline blocker for every single tuple
  - Column-at-a-time has the opposite problem: it leads to wasteful materialization, because you end up materializing a lot of data that doesn’t need to make it to the final result set → **memory bandwidth constraint**!

The authors built X100 to incorporate ideas that make MonetDB work on modern hardware:

- Vectorized processing engine. Each operator receives and populates a batch of rows, which they call a vector
- Gives developers the ability to define custom operators, and to build fused operators which inline a sequence of operations into a single vectorized operation
- Minimizes branching and allows the CPU to run operators without jumps wherever possible

How do you get around the wasteful materialization problem?

- Every vector emitted by a node has a **selection vector** attached to it: a set/bitmap that indicates which of the tuples should be ignored (see the Rust sketch just after this section)
- Why do this instead of having dynamically-sized vectors?
  - Because we want to pre-allocate memory and let the CPU do straight-line execution.
- Mapping operators that transform data element `i` will write a result to output element `i`
- Aggregation/final materialization ignores elements that are not in the selection vector/bitmap

How does it do?

- They compare against DB2, and against the original version of MonetDB with the old execution engine
- X100 did ~100x better than the old MonetDB engine for most queries at small scales, and 5-100x better than DB2 at high scale
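A minimal sketch of the selection-vector pattern (illustrative Rust only: X100 is C++, and these kernel names are made up). Filters compact live indices branch-free, maps compute over every lane, and only the final aggregation consults the selection:

```rust
const VECTOR_SIZE: usize = 1024;

/// Filter kernel: compacts the indices of qualifying lanes into `sel`,
/// returning the new selection count. No data-dependent branch in the loop.
fn filter_gt(col: &[i64; VECTOR_SIZE], threshold: i64, sel: &mut [usize; VECTOR_SIZE], n: usize) -> usize {
    let mut out = 0;
    for i in 0..n {
        let idx = sel[i];
        sel[out] = idx;
        // Advance the write cursor by 0 or 1 instead of branching.
        out += (col[idx] > threshold) as usize;
    }
    out
}

/// Map kernel: writes output element `i` from input element `i` for *all*
/// lanes; wasted work on unselected lanes is cheaper than branching.
fn map_add(a: &[i64; VECTOR_SIZE], b: &[i64; VECTOR_SIZE], out: &mut [i64; VECTOR_SIZE]) {
    for i in 0..VECTOR_SIZE {
        out[i] = a[i] + b[i];
    }
}

/// Aggregation ignores lanes outside the selection vector, so unselected
/// data never reaches the final result.
fn sum_selected(col: &[i64; VECTOR_SIZE], sel: &[usize; VECTOR_SIZE], n: usize) -> i64 {
    (0..n).map(|i| col[sel[i]]).sum()
}

fn main() {
    let mut a = [0i64; VECTOR_SIZE];
    let b = [1i64; VECTOR_SIZE];
    let mut out = [0i64; VECTOR_SIZE];
    for (i, v) in a.iter_mut().enumerate() { *v = i as i64; }
    let mut sel = [0usize; VECTOR_SIZE];
    for (i, s) in sel.iter_mut().enumerate() { *s = i; } // all lanes start live
    map_add(&a, &b, &mut out);
    let n = filter_gt(&a, 1000, &mut sel, VECTOR_SIZE);
    println!("{} selected, sum = {}", n, sum_selected(&out, &sel, n));
}
```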
## March 29

- Twitter thread by yours truly about using SIMD to optimize a `memchr`-like algorithm: [https://twitter.com/andreweduffy/status/1773835645762281852](https://twitter.com/andreweduffy/status/1773835645762281852)

## March 26

- [Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics](https://15721.courses.cs.cmu.edu/spring2024/papers/01-modern/armbrust-cidr21.pdf) (2021 CIDR)
  - From the Databricks founders
  - Main idea: proposes a new era of data systems named _**Lakehouse Architectures**_
  - What are its distinguishing features?
    - (1) Provides **direct I/O access to storage**
    - (2) Fast SQL support (competitive with purpose-built data warehouses)
    - (3) Better support for ML/analytics workloads than traditional data warehouses
  - Where does Lakehouse fit into the existing taxonomy?
    - Data Lake → large schemaless store of files
      - Examples: all of the cloud storage systems
      - Data is shipped there via ETL jobs from OLTP systems
      - No idea of governance or transactions
      - Just a bunch of files. If you want to query it, use Spark, Presto, Hive
    - Data Warehouses → curated data usable for BI and analytics
      - Examples: Redshift, Greenplum
      - Implements transactionality
      - Often uses proprietary storage formats
      - Data is perpetually stale b/c it is 2 ETL/ELT jobs away from the production data source
    - **Lakehouse → provides the best of both worlds**
      - Uses open storage formats for landing data
      - Implements a **metadata layer** on top of the files abstraction to allow for row-level transactionality
      - By using open file formats, provides advanced workflows (ML, Pandas) with the ability to read directly (with transaction support)
  - How do Lakehouses provide reliability?
    - There is a metadata layer, which provides a consistent, transactional view on top of the objects stored in the cheap cloud storage layer (S3, ADLS, GCS, etc.)
    - This metadata layer can be stored either alongside the data files (Delta Lake) or in a separate transactional DB (Snowflake)
    - Also performs schema enforcement on ingestion to make sure you can only land valid data into the table
  - How do Lakehouses ensure competitive SQL performance?
    - **Caching**: providing faster local access to whole files or subsets of objects that are stored in S3
    - **Storing auxiliary data**: things like column stats that can be used by the planner to reduce remote cloud storage I/O
    - **Data layout optimization**: this could be clustering the data in a special way (space-filling curves as well as sorting are given as examples), or projecting different columns in different orders in the object store
  - How will data systems research adapt with Lakehouse dominance?
    - Need new innovations in storage formats. Parquet and ORC are from the previous era and don’t account for GPUs or newer hardware
    - New research into query engines that abstract away storage systems
    - For ML and advanced analytics workloads, providing a DataFrame API is the gold standard
    - The authors point out existing ML-centric data stack tools such as Feature Stores, and claim that these can easily be implemented as configuration on top of a proper Lakehouse system.
- [An Empirical Evaluation of Columnar Storage Formats](https://15721.courses.cs.cmu.edu/spring2024/papers/02-data1/p148-zeng.pdf)
  - Starts with a recognition: Parquet and ORC, the dominant storage formats in the modern data stack, were built a decade ago.
  - Goal of the paper is to evaluate how well the formats hold up against modern workloads and newer hardware (e.g. ML workloads, GPU decoding), and use that information to inform how next-generation columnar formats may improve on the status quo.
  - Using a number of real-world datasets (no synthetics), they run a series of experiments pitting Parquet against ORC in a variety of scenarios:
    - **Experiment**: 1MM-row dataset. Full sequential table scan.
      - **Outcome:** Parquet is modestly faster than ORC due to lightweight encoding schemes (ORC’s aggressive use of RLE slows decoding)
      - **Upshot:** It’s important to keep the encoding scheme simple so decoding can be done quickly
    - **Experiment**: Compare block compression algos (Zstd vs. Snappy)
      - **Outcome**: Zstd achieves a better compression ratio than Snappy. But block compression slows down reads considerably when using fast I/O devices.
      - **Upshot:** Block compression is not worth it; future formats should use lightweight compression (dictionary, bitpacking, etc.)
    - **Experiment**: Generate a dataset with up to 10,000 columns. Sequential scan while selecting 10 columns. This is meant to simulate ML training/inference scenarios where you are selecting a small number of potential features from large event payloads.
      - **Outcome:** Decoding time scaled with the number of _total_ columns and not with the number of selected columns. This is b/c footer metadata describing columns cannot be randomly accessed in either format.
      - **Upshot:** Future formats should enable efficient random access to per-column metadata.
    - **Experiment**: Store ML embeddings as `List<Float>` columns, simulate vector DB lookups by trying to read random rows by ID.
      - **Outcome**: ORC was able to select faster than Parquet when serving from NVMe b/c of its fine-grained zone maps. However, it was considerably slower against S3 b/c it requires 4x more random reads.
      - **Upshot**: Fine-grained indexes (zone maps) are critical for latency guarantees. Limiting small I/Os is important to avoid slowness when hosting in cloud storage.
    - **Experiment**: Store large images as `BINARY` columns alongside other structured data, perform a filtered scan with varying levels of selectivity.
      - **Outcome**: Smaller row groups led to more efficient queries
      - **Upshot**: It’s better to separate unstructured from structured data in the physical format.

## Feb 20

- [The Absolute Minimum Every Software Developer Must Know About Unicode, No Excuses!](https://tonsky.me/blog/unicode/)
  - UTF-8 represents strings as a sequence of _codepoints_
  - Unicode defines a space of ~1.1 million potential codepoints. Currently, only ~15% are allocated, covering all currently supported human languages.
  - In UTF-8, codepoints are encoded as 1-4 bytes. The leading bits of the first byte determine how long the encoded codepoint is.
  - The Latin alphabet all fits in the bottom 7 bits of a single byte, i.e. valid ASCII is also valid UTF-8, so Latin-only strings can be read directly as ASCII.
  - From a random location inside the string, you can always find the first byte of the next codepoint by looking at the top bits (see the sketch after this section):
    - Single-byte codepoint → first byte is `0xxxxxxx`
    - Two-byte codepoint → first byte is `110xxxxx`
    - Three-byte codepoint → first byte is `1110xxxx`
    - Four-byte codepoint → first byte is `11110xxx`
    - Subsequent (continuation) bytes always begin with `10xxxxxx`
  - Humans care about “grapheme clusters”: é is a single grapheme. So is 🚀. But both are multiple codepoints, and each language’s standard library will measure their lengths differently.
    - Solution: use a Unicode library for your language. The author provides several examples.
  - **Normalization**: many graphemes can be expressed two ways, a multi-codepoint (decomposed) form and a “composed form”
    - NFD → normalize all graphemes to their decomposed, multi-codepoint form
    - NFC → normalize using the composed form wherever possible
    - “**Before comparing strings or searching for a substring, normalize!**”
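A quick sketch of decoding those leading-byte patterns (my own illustration, not code from the article):

```rust
/// Classify a byte by its leading bits: Some(len) if it's the first byte of
/// a 1-4 byte sequence, None if it's a continuation (or invalid) byte.
fn utf8_seq_len(first: u8) -> Option<usize> {
    match first {
        0b0000_0000..=0b0111_1111 => Some(1), // 0xxxxxxx: ASCII
        0b1100_0000..=0b1101_1111 => Some(2), // 110xxxxx
        0b1110_0000..=0b1110_1111 => Some(3), // 1110xxxx
        0b1111_0000..=0b1111_0111 => Some(4), // 11110xxx
        _ => None, // 10xxxxxx continuation, or an invalid leading byte
    }
}

fn main() {
    // "é" (NFC) encodes as two bytes; "🚀" as four.
    for &b in "é🚀".as_bytes() {
        println!("{b:08b} -> {:?}", utf8_seq_len(b));
    }
}
```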
## Feb 17

- [Thread from Neon founder about the economics of free tiers for DBaaS](https://x.com/nikitabase/status/1758639571414446415?s=20)
  - His thesis: the only profitable way to do this is to run serverless and decouple compute from storage. Otherwise you’re stuck paying AWS some fixed minimum per free account.
  - He's a bit biased: Neon does exactly this. There are other providers doing the same thing, though, such as [Xata](https://xata.io/)
  - They're unveiling a new pricing model. The common theme is you pay less but get considerably less storage. E.g. in the current pricing model, $39/mo gets you 100GB of storage; now $19/mo gets you 0.5GB. That's steep!
- [RAG is Dead! Long Live RAG!](https://vectorize.io/2024/02/16/rag-is-dead-long-live-rag/)
  - The author is responding to many comments that Gemini Ultra’s 1M-token context window might kill RAG
  - Gemini’s experimental setup: a needle-in-a-haystack test. An LLM has its context window packed with many, many tokens, and somewhere randomly inside the “haystack” is a “needle” (fact) that the model is asked to retrieve.
  - Gemini is able to retrieve a single needle 99% of the time in their tests
  - However, it can **retrieve multiple needles only 60% of the time**
  - Latency is extremely high (transformers still scale quadratically with context length) vs. RAG using a smaller model
  - Cost balloons as well b/c compute cost is higher
- [Video generation models as world simulators](https://openai.com/research/video-generation-models-as-world-simulators) (OpenAI Sora Technical Report)
  - The model is a diffusion transformer (DiT). Diffusion = start from noisy “_patches_” and with every iteration make the picture clearer and more fluid
  - The model supports arbitrary resolutions and video lengths
  - Their training pipeline involved text and visual steps:
    - They took training videos and ran them through a compression network to create lower-dimensional “patch” representations, which live in an (X, Y, Time) 3-D space
    - They reused the image-to-text captioning model from DALL-E to generate verbose captions for every video. They claimed this was critical to the detail in generated video
  - The model can sample backwards or forwards in time
  - “Emergent simulation capabilities”
    - The model was not trained to have a 3D representation, but purely from training scale Sora is able to support dynamic camera movements in 3D space, with realistic portrayals of people and objects in the scene
    - The model achieves object permanence. From experience, this is something incredibly hard to do with 2D video tracking models like the ones you might pair with an object detector. It’s clear that Sora is learning something about how objects persist and survive in 3D space by seeing enough video.
    - Sora can simulate video games and their mechanics from simple prompting. E.g. you can generate a clip of Minecraft and get a first-person view of a character acting with a basic, valid policy in the game
  - _These capabilities suggest that continued scaling of video models is a promising path towards the development of **highly-capable simulators of the physical and digital world**, and the objects, animals and people that live within them._
  - More convo on Twitter: [https://twitter.com/sainingxie/status/1758433682086330496](https://twitter.com/sainingxie/status/1758433682086330496). The researcher estimates that Sora may have ~3B parameters. That’s something that could easily fit on-device and let you generate videos from your phone! You could extend any photo from your library into a video, or extend any video forward/backward in time.

## Feb 16

- [Zero Trust Networking definition · Tailscale Docs](https://tailscale.com/kb/1123/zero-trust#incremental-zero-trust)

## Feb 4

- [Prompting Is Programming: A Query Language for Large Language Models](https://arxiv.org/pdf/2212.06094.pdf) (PDF)
- [https://github.com/denoland/rusty_v8](https://github.com/denoland/rusty_v8)
  - Rust bindings to the V8 JS runtime. Provides a usable abstraction in front of V8 Isolates, which makes it easy to embed untrusted JavaScript execution directly into your Rust apps.

## Feb 1

- [Meta Velox (CMU Advanced Databases / Spring 2023) - YouTube](https://www.youtube.com/watch?v=Zx4caucPF7s) // ([Slides Link](https://15721.courses.cs.cmu.edu/spring2023/slides/23-velox.pdf))
- [Tyler Cowen - State Capacity Libertarianism](https://marginalrevolution.com/marginalrevolution/2020/01/what-libertarianism-has-become-and-will-become-state-capacity-libertarianism.html)

## Jan 31

- [https://blog.a10y.dev/posts/building-with-rust/](https://blog.a10y.dev/posts/building-with-rust/)
  - My own personal journey with Rust. I’m very convinced it’s the future of building excellent, maintainable software, and excited to see how the next few months of my journey with it go!

## Jan 15

- [How we switched to Java 21 virtual threads and got a deadlock in TPC-C](https://blog.ydb.tech/how-we-switched-to-java-21-virtual-threads-and-got-deadlock-in-tpc-c-for-postgresql-cca2fe08d70b)
  - Java 21 introduced Virtual Threads, similar to goroutines in Go.
  - The implementation maps a large number of virtual threads onto a much smaller number of carrier threads, i.e. native threads.
  - The authors used a Postgres connection pool library that relies on `Object.wait` and the JVM’s builtin monitor locks (i.e. the `synchronized` keyword). Monitor locks pin the carrier thread, making it impossible for the JVM to unmount the virtual thread from its carrier and return the carrier to the scheduler.
  - The solution: use a Semaphore in front of their DB connection pool. Semaphores are virtual-thread-aware, so they manage the concurrency and remove the opportunity for deadlock.
  - The upshot: Virtual Threads are cool and shiny, but many popular ecosystem libraries are not aware of them, and any of them can introduce an opportunity for deadlock if you’re not careful.
## Jan 14

- [Towards Modern Development of Cloud Applications](https://dl.acm.org/doi/pdf/10.1145/3593856.3595909)

## Jan 10

- [The Case for Learned Index Structures](https://ar5iv.labs.arxiv.org/html/1712.01208) (2017)
  - The paper, from Google, examines the application of ML models to one of data management’s most fundamental problems: indexing. B-Tree indexes are the dominant strategy for building range indexes (think your normal primary key index in Postgres or MySQL) and provide logarithmic lookup times. The authors’ key insight: a B-Tree’s complexity scales with the data size, and it has no mechanism to adapt to the data distribution. ML models, on the other hand, are by definition learners of data distributions.
  - By recasting indexes as prediction functions from keys → positions in the database file, they open the door to plugging in new indexes based on simple and semi-complex neural network architectures.
  - A trained ML model takes constant space/time to execute, which makes it a great fit for large indexes, since the whole index can then easily fit in memory.
  - The authors define the **Recursive-Model Index (RMI)** as a series of models that operate over recursively smaller ranges of the keyspace. This is derived from Mixture-of-Experts research and starts with a top-level (Stage 1) model, which takes a **key** and predicts an **offset**. Based on where the Stage 1 prediction falls in the keyspace, the relevant Stage 2 model is selected, prediction occurs, and you select a Stage 3 model. This continues until you reach a bottom-level model (a minimal sketch follows at the end of this section).
  - Models can be neural networks, linear regression models, or B-Trees. This means that in the worst case, the retrieval performance is lower-bounded by the retrieval performance of a binary search over a B-Tree.
  - The authors trained several of these models across 3 datasets:
    - Weblogs: 200M rows of HTTP request logs, predicting timestamp → row
    - Map data: 200M map pins indexed by longitude
    - A synthetic dataset of numbers
  - They found that the learned index was **orders of magnitude faster and cheaper** than a B-Tree covering the full keyspace, accounting for various page sizes.
  - They didn’t share a table of training speeds. They said that indexes could be learned in “seconds to minutes” depending on the complexity of the model, but gave no specifics. I have not searched thoroughly, but I don’t know of any data processing systems that make use of learned indexes, despite their apparent superiority.
    - One idea: this feels mostly applicable to large, centralized databases that change infrequently. It’s a lot less useful for something like a Spark dataset, which already shreds records into many Parquet files that have simple embedded statistics and are meant to be processed in massively parallel fashion.
  - **Pro-tip**: when [Jeff Dean](https://scholar.google.com/citations?user=NMS69lQAAAAJ&hl=en) is a co-author on your systems paper, you know it’s going to be a top-1%
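A minimal two-stage RMI lookup sketch, to make the mechanism concrete (illustrative Rust, not the paper’s code; the hand-picked “trained” parameters in `main` are hypothetical):

```rust
#[derive(Clone)]
struct Linear { slope: f64, intercept: f64 }

impl Linear {
    fn predict(&self, key: f64) -> f64 { self.slope * key + self.intercept }
}

struct Rmi {
    stage1: Linear,       // routes a key to one stage-2 "expert"
    stage2: Vec<Linear>,  // each predicts a position for its key range
    max_err: usize,       // worst prediction error observed at build time
}

impl Rmi {
    fn lookup(&self, keys: &[f64], key: f64) -> Option<usize> {
        // Stage 1: pick the submodel responsible for this part of the keyspace.
        let m = (self.stage1.predict(key) as usize).min(self.stage2.len() - 1);
        // Stage 2: predict the record position.
        let guess = (self.stage2[m].predict(key) as usize).min(keys.len() - 1);
        // "Last mile": search only within the model's known error bound.
        let lo = guess.saturating_sub(self.max_err);
        let hi = (guess + self.max_err + 1).min(keys.len());
        keys[lo..hi].iter().position(|&k| k == key).map(|i| lo + i)
    }
}

fn main() {
    // 100 evenly spaced keys: 0.0, 2.0, ..., 198.0.
    let keys: Vec<f64> = (0..100).map(|i| i as f64 * 2.0).collect();
    let rmi = Rmi {
        stage1: Linear { slope: 4.0 / 200.0, intercept: 0.0 },
        stage2: vec![Linear { slope: 0.5, intercept: 0.0 }; 4],
        max_err: 2,
    };
    assert_eq!(rmi.lookup(&keys, 84.0), Some(42));
}
```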
## Jan 8

- [Convex: Life Without a Backend Team (James Cowling) - YouTube](https://www.youtube.com/watch?v=iizcidmSwJ4&list=PLSE8ODhjZXjbDOFN4U4-Uv95-N8sgzs5D)
  - Walkthrough of [Convex](https://www.convex.dev/) from the CTO. Convex is a serverless company that has a few really exciting things going for it:
    - It’s designed for full-stack devs, and the usable interface is TypeScript. Define tables, indexes, and migrations with full support for type inference. This makes fluid APIs really easy.
    - The code you write is exported to their centralized cloud server. The server fronts a Postgres database, which is fronted by [an embedded V8 instance](https://github.com/denoland/rusty_v8) that executes each user-defined DB query/mutation in a [V8 Isolate](https://dzx.cz/2023-03-08/how_do_cloudflare_workers_work/)
    - The Isolate is required to be deterministic, so any attempt to access `Date.now` or RNG is substituted with a value that is fixed from the start of the execution. This determinism allows their runtime to retry in the case of failures, which fits their optimistic concurrency model.

## Jan 3

- [The I in LLM stands for intelligence | daniel.haxx.se](https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-for-intelligence/)
  - The cURL project has a bug bounty program, and every bug bounty report requires a maintainer’s time to review. They’ve been inundated recently with seemingly good but actually gibberish reports generated by AI.
  - The author likens this to the early days of spam, and argues there need to be tools to help people receiving potentially AI-generated output sift through it.
  - The author notes that there are various legitimate uses of AI for bug bounty reports, e.g. for non-native English speakers trying to get help writing the report. However, in the medium term he expects to just reject most reports that smell AI-generated, which is unfortunate.

## Jan 2

- [GitHub - a10y/ddotcli](https://github.com/a10y/ddotcli)
  - CLI using Go’s [Bubbletea](https://github.com/charmbracelet/bubbletea) toolkit for making pretty CLI apps. This one connects to Washington DC’s open MQTT server to get a list of traffic cameras, and streams/records them using `ffmpeg`
- [Sandboxing and Workload Isolation · The Fly Blog](https://fly.io/blog/sandboxing-and-workload-isolation/)
  - Overview of isolation techniques used for serverless workloads, by Thomas Ptacek of security engineering fame.
  - **Problem**: for multi-tenant code workloads, Docker on its own is a good but not wholly sufficient isolation mechanism, as there’s nothing between you and the kernel. He thinks many complaints about Docker are overstated given the way most people run containers.
  - **Option 1**: Use a language runtime such as V8 Isolates.
    - Pros: The best option, works very well (assuming you trust the V8 developers).
    - Cons: Inflexible for users who want the flexibility to run any POSIX-compatible program
  - **Option 2**: Emulation / gVisor. Google reimplemented the entire kernel ABI in userland, and exposes it as a container runtime, `runsc`.
    - Pros: Fantastic for security, because all syscalls are implemented on top of ~20 auditable Go functions
    - Cons: 10%+ perf hit, especially bad for IO-heavy workloads
  - **Option 3**: Firecracker
    - Pros: Container runtime. Succinct Rust implementation. Completely segments off the kernel
    - Cons: Extra setup overhead, e.g. need to learn Kata Containers or K8s to use it
  - **Option 4**: Tuning [nsjail](https://github.com/google/nsjail); the author recommends this as a great entrypoint before going all-in on containers, especially if you’re running non-malicious code on your own hardware.

# 2023

## December 30

- [37C3 - Back in the Drivers Seat: Recovering Critical Data from Tesla Autopilot](https://www.youtube.com/watch?v=AgC9OiFrIPk)
  - Researchers used a voltage glitch to get the Turbo chip into recovery mode, which exposes a passwordless root SSH console.
  - This provided access to the encryption keys used to connect to Tesla’s Autopilot API, which is generally locked down with mutual TLS.
  - Overall, Tesla seems to have a really advanced security baseline:
    - Encrypted boot for all computers
    - mTLS for connections to Tesla cloud services
    - Full telemetry logging. SQLite files define a custom query format that is used to determine when sensor/video events are worth uploading to the server, which makes it really easy to tweak these settings on the fly via very compact OTA updates

## December 28

- [Intentional Design for AI Applications | AI Talks for DevOps](https://www.youtube.com/watch?v=VCTkgWZFX-0)
  - First deep dive on Pinecone’s architecture that I’ve seen recorded. Not extensive, but sheds some light.
  - Pinecone Graph Algorithm (PGA) is the underpinning of their P2 index type
    - Based on [FreshDiskANN](https://arxiv.org/abs/2105.09613) from Microsoft. Uses a dense graph (i.e. no levels or hierarchy; the entire connected graph is one unit)
  - They use integer quantization to reduce memory size and fit all vectors into memory, cf. quantization in things like llama.cpp (sketched just below)
  - Indexing is built on top of Kafka, with backpressure built in to avoid any stage of the indexer being overwhelmed
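To make the quantization step concrete, here is a sketch of garden-variety symmetric int8 quantization (the general technique; Pinecone hasn’t published their exact scheme, and all names here are illustrative):

```rust
/// An embedding vector stored as one f32 scale plus 1 byte per dimension,
/// roughly a 4x memory saving over f32 components.
struct QuantizedVec {
    scale: f32,
    data: Vec<i8>,
}

fn quantize(v: &[f32]) -> QuantizedVec {
    // Map the largest-magnitude component to +/-127.
    let max = v.iter().fold(0f32, |m, x| m.max(x.abs()));
    let scale = if max == 0.0 { 1.0 } else { max / 127.0 };
    let data = v.iter().map(|x| (x / scale).round() as i8).collect();
    QuantizedVec { scale, data }
}

/// Approximate dot product: accumulate in integer space, rescale once.
fn dot(a: &QuantizedVec, b: &QuantizedVec) -> f32 {
    let acc: i32 = a.data.iter().zip(b.data.iter())
        .map(|(&x, &y)| x as i32 * y as i32)
        .sum();
    acc as f32 * a.scale * b.scale
}

fn main() {
    let a = quantize(&[0.1, -0.3, 0.8]);
    let b = quantize(&[0.2, 0.1, 0.7]);
    println!("approx dot = {:.3}", dot(&a, &b)); // exact f32 answer: 0.550
}
```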
## December 22

- [GraphCast: Learning skillful medium-range global weather forecasting](https://www.science.org/doi/epdf/10.1126/science.adi2336)
  - _“Here, we introduce an MLWP approach for global medium-range weather forecasting called GraphCast, which produces an accurate **10-day forecast in under a minute on a single Google Cloud TPU** (Tensor Processing Unit) v4 device”_
  - This is a huge advancement! The US global forecasting system is lower resolution than Europe’s, only covers a 5-day (120-hour) period, and is produced every 6 hours.
  - Some caveats to call out:
    - Friendly reminder that 0.25° is ~27km. This is great for the kinds of weather prediction they call out (“tropical cyclone tracks, atmospheric rivers”), but there are a lot of applications where this is still not really that useful. People planning overhead collection from space-based platforms (read: commercial and spy satellites) care a lot more about cloud cover, and knowing “it’s gonna be fairly cloudy in this 27km x 27km box” is not helpful for that, so this is really only applicable to things like civilian aviation. Which is still important! But what you can do with this is fairly limited. There are various regional weather providers that produce short-range, ultra-high-resolution forecasts (e.g. the German Weather Service has one that covers all of Europe down to 1km resolution).
    - There is likely still a place for NWP

## December 18

- [React Query as a State Manager | TkDodos blog](https://tkdodo.eu/blog/react-query-as-a-state-manager)
- [Redux Essentials, Part 1: Redux Overview and Concepts | Redux](https://redux.js.org/tutorials/essentials/part-1-overview-concepts)
- [We stand to save $7m over five years from our cloud exit](https://world.hey.com/dhh/we-stand-to-save-7m-over-five-years-from-our-cloud-exit-53996caa)
  - From DHH; they successfully completed the exit earlier this year.
- [Behind the scenes scaling ChatGPT](https://www.youtube.com/watch?v=PeKMEXUrlq4)
  - Overview from an OpenAI senior engineer on the scaling challenges that came with deploying ChatGPT
  - H100 characteristics: in one second, you can read ~3.35TB of HBM memory and execute 1.98 PFLOPs of compute. Dividing the two (1.98×10¹⁵ FLOP/s ÷ 3.35×10¹² B/s) gives the **“591:1” ops:bytes ratio**: you only saturate the GPUs when you’re hitting at least that many FLOPs per byte read. Anything more and you end up being bottlenecked by compute.
  - GPU global availability continues to be the biggest bottleneck to user growth
  - Latency of token-at-a-time generation is such a bottleneck that they don’t even worry about routing users to the nearest datacenter.
  - Skews buy over build; references things like security and auth

## December 17

- [Issue when auto commit is false. · Issue #1039 · jdbi/jdbi · GitHub](https://github.com/jdbi/jdbi/issues/1039)
  - Cost me a good bit of time today. BLAB: just leave autoCommit as true when using JDBI.

## December 14

- [Open and portable Postgres-as-a-service. Also available on Hetzner](https://www.ubicloud.com/blog/open-and-portable-managed-postgresql-avail-hetzner)

## December 12

- [2023 LLVM Dev Mtg - Mojo 🔥: A system programming language for heteroge...](https://www.youtube.com/watch?v=SEwTjZvy8vw&t=203s)
- [In the Pipeline: AlphaFold’s Place in the World](https://www.science.org/content/blog-post/alphafold-s-place-world)
  - Another stunning example of a Google (or in this case, DeepMind) state-of-the-art AI algorithm failing to gain traction. This leaves me wondering: how much research revolves around discovery of entirely new structures (think of the crazy move in the AlphaGo game that let the computer beat Lee Sedol, a move that was counterintuitive to a person)?

## December 11

- [Developer Documentation MLX 0.0.4 documentation](https://ml-explore.github.io/mlx/build/html/dev/extensions.html)
- [https://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/slides/lec10.pdf](https://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/slides/lec10.pdf)
- [Law Firm Billing: Ultimate Guide and Best Practices for 2021 | Clio](https://www.clio.com/blog/law-firm-billing/)
- [Did Google fake their Gemini Video? - YouTube](https://www.youtube.com/watch?v=zut38E-BHH0)
  - Survey says: yes
  - Yannic digs through the evaluation sections and calls out that they used a questionable technique (“CoT @ 32”) to compare Gemini with GPT-4 on the MMLU benchmark. Critically, they don’t use the reported results from the GPT-4 technical report, but instead compare against their own usage of the API. So the headline number that lets them claim “better performing than GPT-4” is based on a questionable test they ran themselves.
  - Goes on to provide a bunch of other arguments about the sales-y nature of the technical report, and laments the end of AI giants publishing real academic papers.

## December 9

- [AI for Hypothesis Testing](https://www.science.org/content/blog-post/ai-hypotheses)
  - _The idea is that you look over a lot of experimental results and come up with a "What If" idea that might explain them...what if there were another regulatory pathway, hooked up to this thing over here, that also affects the pathway that I'm looking at? What if the reaction that I've found is actually being catalyzed by trace impurities in the stuff that I've been thinking is the real catalyst?...What if those little fuzzballs in the telescope view are actually huge galaxies all their own, incredibly distant, and not nearby nebulae? What if doctors in the maternity ward are spreading childbirth fever from their uncleaned hands...These proposals had great and immediate explanatory power, and they could all be put to make-it-or-break it tests, some of them very quickly...The question is whether AI systems can look over data sets and come up with such questions.
I don't think it's fair to ask them to invent quantum mechanics, but we don't have to be at that level._
- [Cloudflare Gen 12 Server: Bigger, Better, Cooler in a 2U1N form factor](https://blog.cloudflare.com/cloudflare-gen-12-server-bigger-better-cooler-in-a-2u1n-form-factor/)
  - 2U means the server form factor is 2 rack units, which is 3.5" in height; 1N means there is one server node per chassis
  - Cloudflare is upgrading its chassis, using AMD EPYC rather than Intel (as they've done since their Gen 10 build). Most of the article touches on the constraints and their hardware team's evals.
  - For the first time, the new chassis must [support GPUs](https://blog.cloudflare.com/workers-ai/).
  - The biggest challenge they face with servers is increasing temperatures: every new generation has a higher TDP, which means more heat to dissipate, which means better cooling systems are needed.
  - Their workload is seemingly CPU-bound enough to warrant the more powerful hardware: “_**We found that Cloudflare services continue to scale with cores effectively up to 128 cores or 256 hardware threads, resulting in significant performance gain**_”
  - Their solution was to increase the chassis size from 1U → 2U to allow for larger fans and heat sinks. This was okay because (A) they were power-bound on each rack (a limitation imposed by their colo hosts) and (B) they needed the extra space for NVIDIA cards.

## December 8

- [Tensor Considered Harmful](https://nlp.seas.harvard.edu/NamedTensor)
  - Tensors are a leaky abstraction in code: whoever loads one has to understand dimension order, batching, channels, etc. The author proposes a new NamedTensor with named dimensions, and more literate broadcasting rules that rely on matching names in code paths using tensors.

## December 4

- [Augmenting Long-term Memory](https://augmentingcognition.com/ltm.html)
  - Strategy for reading and understanding papers in new domains via spaced repetition and taking multiple passes over the paper. One thing to clarify: the author is clear he’s not reading for mastery, but rather for deep abstract understanding.
  - Critical excerpt (emphasis **mine**): _This entire process took a few days of my time, spread over a few weeks. That's a lot of work. However, the payoff was that **I got a pretty good basic grounding** in modern deep reinforcement learning. This is an immensely important field, of great use in robotics, and many researchers believe it will play an important role in achieving general artificial intelligence. With a few days work I'd gone from knowing nothing about deep reinforcement learning to a durable understanding of a key paper in the field, a paper that made use of many techniques that were used across the entire field. Of course, I was still a long way from being an expert. There were many important details about AlphaGo I hadn't understood, and I would have had to do far more work to build my own system in the area. But **this foundational kind of understanding is a good basis on which to build deeper expertise**. It's notable that I was reading the AlphaGo paper in support of a creative project of my own, namely, writing an article for Quanta Magazine. This is important: **I find Anki works much better when used in service to some personal creative project**._

## November 29
- [Rust std fs slower than Python!? No, it's hardware!](https://xuanwo.io/2023/04-rust-std-fs-slower-than-python/)
  - Gripping tale of debugging from application code down to microcode, and discovering an issue with AMD CPUs in the process. Concrete takeaway: “_Rust developers might consider switching to `jemallocator` for improved performance_”.

## November 4

#### [The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes (PDF)](https://arxiv.org/pdf/2012.14210.pdf)

- This paper is the first to formalize a scaling law for embedding search, similar to what Chinchilla did for LLMs. They examine sparse term search (Elasticsearch’s BM25 implementation) against BERT-based dense models for retrieval, using MS MARCO as an example dataset
- The authors (one of whom is now the lead scientist at Cohere) give a funky proof, which I only half followed, that as the embedding dimension increases, the likelihood of false positives in an embedding search decreases quadratically
- Using MS MARCO (a benchmark dataset of Bing queries and corresponding relevant documents, based on user clicks), they demonstrate that DistilRoBERTa actually scales more poorly than would theoretically be expected. Elasticsearch beats it for large indexes over 1MM documents
- Big takeaways:
  - Dense models will work best with smaller indexes
  - Models that are SOTA against a small benchmark dataset may be completely useless for large real-world indexes
  - Empirical error rates exceed theoretical lower bounds, meaning the models we have today are far from perfect

## November 2

### [Cohere: Introducing Embed v3](https://txt.cohere.com/introducing-embed-v3/)

- Cohere releases an improved embeddings API
- They seem to have 4 supported embedding models trained:
  - Search (document + query): one model for documents and one for queries. Note: they use different models because indexed documents and user-entered queries look very different, but need to land in the same vector space. Discounts things like sentiment, which are less useful for search than for the other two tasks
  - Classification: an embedding space that is easy to build a classifier on top of, e.g. through logistic regression or an FC net
  - Clustering: similar to the classification model, but gives more weight to sentiment
- Their big headline is that the search_document model now considers document quality as part of the embedding model:
  - _But in many real-world applications, you have redundant information with varying content quality. Some documents provide little insight into topics, while others are very detailed. **Sadly, models that measure topic similarity only tend to retrieve the least informative content**, leading to a bad user experience._
- They seem to introduce this into the modeling objective somehow: _We achieve this capability by measuring the topic match and content quality in the vector space. At query time, we can look for content that matches the topic (COVID-19 symptoms) and provides the most information._
- [BERT for Information Retrieval](https://www.sbert.net/examples/applications/information-retrieval/README.html) (and [this](https://www.sbert.net/examples/applications/semantic-search/README.html))
- [[Big Data is Dead]] (Video)

## October 30

- [https://www.typescriptlang.org/docs/handbook/modules/theory.html#scripts-and-modules-in-javascript](https://www.typescriptlang.org/docs/handbook/modules/theory.html#scripts-and-modules-in-javascript)
  - A recent [PR check failure](https://github.com/IntrinsicLabsAI/gbnfgen/actions/runs/6693879062/job/18185997576) made me dig into something I’ve tried to avoid learning for a while: how TypeScript/JS modules work
  - It’s actually a lot less complicated than I feared. Fundamentally, TypeScript’s approach is:
    - There are **bundlers** (like Webpack and esbuild) and **runtimes** (like Node, Bun, Deno, etc.)
    - `tsc` transpilation requires knowing the host that will be running the JS, to know what module formats are supported, so e.g. it can translate `import * as X from "X"` statements to CommonJS `require` statements as necessary.
      - Node.js has supported ES-style `import` statements directly for a long time, so setting the tsconfig value of `module` to `nodenext` is the option that should work in the most cases, because most bundlers also support these kinds of import statements
    - `tsc` type-checking requires understanding how the host executing the transpiled JS will resolve modules.
      - Explicitly: `tsc` will never rewrite the module specifier (e.g. `utils.js` or `react-dom`) in an import. To enforce soundness of the compiled output, it needs to simulate how the runtime/bundler will resolve the module specifiers.
      - Once again, `moduleResolution: "nodenext"` is the ideal configuration choice, because most environments support Node.js-style module resolution, so your code will work both in Node.js backends and in bundler-packaged frontend code.

## October 27

- [Java HNSWLib bindings](https://github.com/jelmerk/hnswlib)
  - There’s a section that makes use of the JEP 414 Vector API in Java. You need to run your code with `--enable-preview --add-modules jdk.incubator.vector`. The author hasn’t attached any benchmarks to illustrate speedups, so I’d recommend doing your own before deciding on an implementation to use.
- [https://doc.rust-lang.org/std/boxed/#memory-layout](https://doc.rust-lang.org/std/boxed/#memory-layout)
  - `Box` layout semantics when using the C ABI for exchange with Java, relevant to some JNA things I’m doing today.
- [Has Humane Created the Next iPhone—or the Next Google Glass?](https://www.theinformation.com/articles/has-humane-created-the-next-iphone-or-the-next-google-glass?rc=56yqmz)
  - After raising a Series C and having never released a consumer product, Humane is going to release their AI Pin in early November. They’ve also submitted an FCC filing to be licensed as an MVNO reseller, so they will sell the mobile plans that buyers of the Pin will need to purchase from them as well.
## October 26

- [A remote sensing-based investigation of fortifications on Rome’s eastern frontier](https://www.cambridge.org/core/journals/antiquity/article/wall-or-a-road-a-remote-sensingbased-investigation-of-fortifications-on-romes-eastern-frontier/8FE59FB0D5476EA329614EEC6DC414FD)
  - Archaeologists have looked through 300,000 km² of newly declassified imagery from the CIA’s CORONA satellite (operating 1960-1972) over Syria and Iraq, documenting nearly 400 previously unknown Roman forts
  - Notably, it doesn’t appear they used any kind of AI/ML assistance: they just took the entire dataset, broke it into 25 km² chunks, and looked at all 12,000 images by hand. I guess there are enough grad students/undergraduate RAs to make this feasible.
- [Databases in 2022 year in review (Andy Pavlo)](https://ottertune.com/blog/2022-databases-retrospective)
  - Most DB fundraising was focused on early stage (up to Series A). Not as many large fundraising rounds: “_The only path to continue forward for these companies with billion-dollar valuations is going IPO or bankrupt. They are too expensive for acquisition…Furthermore, the major tech companies (e.g., Amazon, Google, Microsoft) that do large M&A’s already have their own cloud database offerings.”_
  - I’ve personally never understood the business model of companies like CockroachDB, who spun out of Google’s Spanner team to build a commercially available version of Spanner…which is [already commercially available](https://cloud.google.com/spanner?hl=en). They raised an additional $278mm in the 2021 boom days. I’m not sure how this company realistically survives.
  - In general, selling cloud-hosted DBs with no “secret sauce” seems like a recipe for being outcompeted by any of the Big 3 cloud providers. You need something like [Supabase](https://supabase.com/) or [Convex](https://www.convex.dev/) (who raised a $26mm Series A last year) to rise above the chaff.
  - Additional tidbit of wisdom: _“if there is no compelling use case for something by the time IBM starts advertising about it, then there will never be one”_ (in reference to blockchain databases)

## October 25

- [[Hayek "The Use of Knowledge in Society"]]

## October 23

- [Climate Science Special Report 2017 - Executive Summary (PDF)](https://science2017.globalchange.gov/downloads/CSSR2017_PRINT_Executive_Summary.pdf)
  - This is part of National Climate Assessment 4 (NCA4), published in 2017. NCA5 is set to be published later this year. The front matter includes a breakdown of how they define likelihood, with “very high confidence” as the strongest rating, reserved for facts backed by consensus across several studies. It was notable how few things garnered “very high confidence” ratings; the notable ones were:
    - Global surface temperatures have risen 1ºC since 1901, and natural causes (solar and atmospheric cycles) account for only a minor share; the rest is man-made
    - Global average sea levels are set to **rise between 1-4ft by the end of the century**, with particular **concentration on the East Coast and Gulf Coast** of the US
    - Nearly 40% of recorded global sea level rise (3 inches) has occurred since 1993
- [OAuth 2.0 Simplified](https://www.oauth.com/)
  - Good refresher doc for the intricacies of the ubiquitous protocol

## October 22

- [Java Native Access](https://github.com/java-native-access/jna)
  - Simpler FFI than JNI; does not require bridging headers to use
- [Rust <> JNA examples](https://github.com/drrb/java-rust-example)
  - Notable takeaway: for Rust to return an owned heap pointer to Java, you return `Box<T>` rather than `T`. Then you use [this drop trick](https://github.com/drrb/java-rust-example/blob/master/src/main/rust/com/github/drrb/javarust/lib/greetings.rs#L171-L174) to free the memory from Java (sketched below).
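A minimal sketch of that drop trick (hypothetical type names; the linked `greetings.rs` applies the same pattern to its own types): `Box<T>` crosses the C ABI as a non-null pointer, and an `extern "C"` function that takes the `Box` by value hands the pointer back to Rust’s drop glue.

```rust
pub struct Greeting {
    pub text: String,
}

/// Java calls this (via JNA) and holds the returned pointer.
#[no_mangle]
pub extern "C" fn greeting_new() -> Box<Greeting> {
    Box::new(Greeting { text: "hello from Rust".into() })
}

/// Java passes the pointer back when done; taking Box<Greeting> by value
/// transfers ownership to Rust, and dropping it frees the allocation.
#[no_mangle]
pub extern "C" fn greeting_free(_greeting: Box<Greeting>) {}
```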
## October 20

- [How Microsoft is Trying to Lessen Its Addiction to OpenAI as AI Costs Soar](https://www.theinformation.com/articles/how-microsoft-is-trying-to-lessen-its-addiction-to-openai-as-ai-costs-soar?rc=56yqmz)
  - As Microsoft pushes AI into its core products like Bing and Office, it’s trying to cut costs by reducing its dependence on expensive OpenAI models.
  - Part of its licensing deal with OpenAI means that it alone has the ability to do Alpaca-style training for commercial use, i.e. train a smaller model on the outputs of the larger GPT-4.
  - They’ve published papers on Orca and Phi, two models they’ve trained using GPT-4 outputs
  - [Orca: Progressive Learning from Complex Explanation Traces of GPT-4 (PDF)](https://arxiv.org/pdf/2306.02707.pdf)
    - Rather than train directly on simple input/output pairs, which give good benchmark results but don’t generalize well, Orca is trained on CoT reasoning traces from GPT-4
    - “_Orca surpasses conventional state-of-the-art instruction-tuned models such as Vicuna-13B by more than 100% in complex zero-shot reasoning benchmarks like BigBench Hard (BBH) and 42% on AGIEval. Moreover, Orca reaches parity with ChatGPT on the BBH benchmark and shows competitive performance…in professional and academic examinations like the SAT, LSAT, GRE, and GMAT, both in zero-shot settings without CoT; while trailing behind GPT-4”_

## October 17

- [OpenAI Dropped Work on New ‘Arrakis’ AI Model in Rare Setback](https://www.theinformation.com/articles/openai-dropped-work-on-new-arrakis-ai-model-in-rare-setback?rc=56yqmz)
  - OpenAI was betting on a new model architecture, “Arrakis”, that they’d spent over a year developing. It was going to rely on sparsity techniques to reduce the hardware cost of running GPT-4.
  - It seems like there was a weird disconnect between research and management, as after a year the final product did not work as well as the original tests.
  - Parts of the project will be incorporated into their next model architecture, “Gobi”.

## October 6

- [Thread-per-core in Rust](https://without.boats/blog/thread-per-core/)

## October 2

https://x.com/ggerganov/status/1708805121721700788