Normally I just fire off a tweet when I spot a nice performance PR landing in Ruby. Lately I’ve been catching up on a backlog of Ruby performance work I’d bookmarked and never gotten around to – so some of what’s below isn’t brand new, with a few PRs dating back to 2025. There were so many of them – some headline-grabbing, some small but delightfully clever – that a thread won’t cut it. So here’s a roundup instead, both the recent landings and the ones I’m late to.
A few ground rules: every PR below ships a concrete benchmark number, so when I say “Nx faster” it’s the author’s own measurement, not vibes. Numbers come from different machines and workloads, so treat them as “here’s the win on the benchmark that motivated the change,” not cross-comparable lab results. Click through to any PR for the full picture – most authors document their methodology beautifully.
Let’s go.
Strings & text
- String#scrub skips ASCII runs – Instead of decoding a string character-by-character, scrub now jumps over ASCII runs using the same search_nonascii trick valid_encoding? uses. On English HTML it’s up to 45.55x faster, on Japanese HTML 22.71x, and ~3.5x on the general case – with no regression on the worst case. Beautiful work by FletcherDares, who’s been on a string-performance tear.
- String#codepoints ASCII hot path – Same author, same instinct: add a local fast path for ASCII bytes inside mostly-ASCII UTF-8 strings. Result: ~1.9x faster on mixed ASCII content, neutral on pure multibyte.
- String#gsub! stops copying on no-match – gsub! was eagerly copying shared backing storage even when nothing matched. Defer that copy until the first real match (like sub! already does) and you get 2.33x faster no-match calls – and the allocation on a 100k-char shared string drops from 100,041 bytes to 40 bytes.
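To make the "jump over ASCII runs" idea concrete, here is a minimal pure-Ruby sketch. It is not the C code from the PR: scrub_sketch is a hypothetical helper, and it leans on the built-in String#scrub for the non-ASCII chunks. The point is only the shape of the loop: copy each ASCII run wholesale, and do careful decoding only where a high byte shows up. Splitting at ASCII bytes is safe in UTF-8 because an ASCII byte is never part of a multibyte sequence.

```ruby
# Hypothetical illustration of ASCII-run skipping; assumes UTF-8 input.
# Not expected to beat the built-in scrub, it only shows the control flow.
def scrub_sketch(str, replacement = "\uFFFD")
  bytes = str.b # binary copy, so character offsets equal byte offsets
  out = String.new(encoding: Encoding::UTF_8)
  pos = 0

  while pos < bytes.bytesize
    # Fast path: find the next non-ASCII byte and copy everything before it.
    high = bytes.index(/[\x80-\xFF]/n, pos) || bytes.bytesize
    out << bytes.byteslice(pos, high - pos).force_encoding(Encoding::UTF_8)

    # Slow path: only the run of non-ASCII bytes gets real validation.
    stop = bytes.index(/[\x00-\x7F]/n, high) || bytes.bytesize
    chunk = bytes.byteslice(high, stop - high).force_encoding(Encoding::UTF_8)
    out << chunk.scrub(replacement)

    pos = stop
  end

  out
end
```

On mostly-ASCII input such as English HTML, nearly all bytes go through the first branch, which is where the real change gets its win.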
Files & directories (the byroot file-IO spree)
byroot (Jean Boussier) went on a tear through Ruby’s file primitives, and the numbers are spicy:
- File.join common case – Optimistically handle the common “two UTF-8 strings” case and scan backwards for the separator. Up to 18.81x faster for many-string joins, 7.80x for two strings.
- File.extname for common encodings – Skip multibyte handling for known-safe encodings. Up to 6.17x faster on long paths.
- File.expand_path single-byte fast path – A single-byte-encoding fast path nets 2.67x faster.
- Dir.scan yields entry type – Yield each child’s type straight from struct dirent’s d_type, avoiding a separate stat per child. Recursive directory walks come out 2.12x faster (“twice as fast”).
- dir.c caches the working directory – Cache and cheaply revalidate pwd with a stack buffer instead of always heap-allocating. Up to 1.33x faster Dir.pwd on Linux.
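The back-to-front, byte-level scanning behind several of these can be sketched in a few lines of Ruby. This is an illustration under assumptions, not byroot's C code: extname_sketch is a hypothetical name, it assumes a UTF-8 or single-byte path with "/" as the only separator, and it ignores edge cases the real File.extname handles (trailing slashes, Windows rules). Scanning raw bytes is safe here because in those encodings the bytes for "." and "/" never appear inside a multibyte character.

```ruby
SEPARATOR_BYTE = "/".ord
DOT_BYTE = ".".ord

# Hypothetical sketch: walk the path backwards byte by byte and stop at the
# first dot or separator, instead of decoding characters from the front.
def extname_sketch(path)
  i = path.bytesize - 1
  while i >= 0
    byte = path.getbyte(i)
    return "" if byte == SEPARATOR_BYTE # hit a directory boundary first

    if byte == DOT_BYTE
      # A dot that starts the basename (".bashrc") is not an extension.
      starts_basename = i.zero? || path.getbyte(i - 1) == SEPARATOR_BYTE
      return starts_basename ? "" : path.byteslice(i, path.bytesize - i)
    end

    i -= 1
  end
  ""
end
```

For a long path the loop touches only the last few bytes, which fits the observation that the win grows with path length.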
GC & object allocation
- Clear page bits in one shot – jhawthorn (John Hawthorn) turned age bits into a bit plane so age + wb_unprotected bits clear for a whole 64-slot page at once during sweep. ~14% off object-new.
- Move rb_class_allocate_instance into gc.c – Also jhawthorn: relocating the function lets allocation helpers inline with newobj. ~10–15% faster Object.allocate (1.15x).
- Remove the class alloc check – jhawthorn again, demoting a runtime allocation-class check to a debug-only assert and unlocking tail-call optimization. ~10% faster Object.new (1.12x).
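If "bit plane" is unfamiliar, a toy model may help. The sketch below is not the GC's data layout: PageBitsSketch and its method names are made up, and it assumes a 2-bit age per slot. The idea is to store bit 0 of every slot's age in one 64-bit word and bit 1 in another, so that resetting all 64 slots is a handful of word writes rather than 64 masked updates.

```ruby
# Hypothetical model of per-page bit planes; assumes a 2-bit age per slot.
class PageBitsSketch
  SLOTS_PER_PAGE = 64

  def initialize
    @age_bit0 = 0        # plane holding bit 0 of each slot's age
    @age_bit1 = 0        # plane holding bit 1 of each slot's age
    @wb_unprotected = 0  # one flag bit per slot
  end

  def set_age(slot, age)
    mask = 1 << slot
    @age_bit0 = age.anybits?(1) ? (@age_bit0 | mask) : (@age_bit0 & ~mask)
    @age_bit1 = age.anybits?(2) ? (@age_bit1 | mask) : (@age_bit1 & ~mask)
  end

  def age(slot)
    ((@age_bit0 >> slot) & 1) | (((@age_bit1 >> slot) & 1) << 1)
  end

  def mark_wb_unprotected(slot)
    @wb_unprotected |= 1 << slot
  end

  def wb_unprotected?(slot)
    @wb_unprotected.anybits?(1 << slot)
  end

  # Three assignments reset all 64 slots at once.
  def clear_page
    @age_bit0 = @age_bit1 = @wb_unprotected = 0
  end
end
```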
Concurrency & core classes
- Speed up TypedData_Get_Struct – byroot added an inlinable fast path to rb_check_typeddata, which makes Mutex#synchronize and Monitor#synchronize ~1.54x / ~1.55x faster respectively.
- Thread::Queue uses a ring buffer – Swapping the backing array for a ring buffer removes array-function overhead: ~23% faster (1.24x). byroot.
- Give the hot thread scheduler priority – jpl-coconut reworked thread switching to avoid an intermediate monitor-thread hop. On a 2-core setup the motivating benchmark went from 1.455s to 0.231s (and a heavier scenario from 36.7s to 4.1s).
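For anyone who hasn't written a ring buffer lately, here is roughly what the data structure looks like. This is a plain-Ruby sketch, not the Thread::Queue internals: RingBufferSketch is a hypothetical class, it does no locking, and the doubling growth policy is my assumption. Push and shift each touch one slot and adjust an index, with no element shuffling.

```ruby
# Hypothetical single-threaded ring buffer; Thread::Queue's real code is C
# and handles synchronization, which this sketch deliberately leaves out.
class RingBufferSketch
  attr_reader :size

  def initialize(capacity = 8)
    @buf = Array.new(capacity)
    @head = 0 # index of the oldest element
    @size = 0
  end

  def push(item)
    grow if @size == @buf.length
    @buf[(@head + @size) % @buf.length] = item
    @size += 1
    self
  end

  def shift
    return nil if @size.zero?

    item = @buf[@head]
    @buf[@head] = nil # drop the reference so it can be collected
    @head = (@head + 1) % @buf.length
    @size -= 1
    item
  end

  private

  # Double the storage and unwrap the elements so head is back at index 0.
  def grow
    new_buf = Array.new(@buf.length * 2)
    @size.times { |i| new_buf[i] = @buf[(@head + i) % @buf.length] }
    @buf = new_buf
    @head = 0
  end
end
```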
Parser & build
- Parallelize bundled gem tests – Not a runtime win, but st0012 (Stan Lo) made CI run gem tests through a thread pool tied into the make jobserver, shaving ~40% off that CI step across platforms.
- Prism parser optimizations – kddnewton (Kevin Newton) packed in fast/slow path splitting, scope bloom filters, SIMD/SWAR strpbrk, a wyhash word-at-a-time constant pool, and a parser arena. ~22% faster parsing at roughly the same memory. (The matching ruby/ruby side is #16418.)
- Optimize the Prism Ruby visitor – Replace the array-allocating compact_child_nodes with an each_child_node that yields directly. Visiting the Rails codebase came out ~21% faster on the interpreter and roughly 2.3x faster under YJIT.
- Lazily deserialize DefNode – Defer DefNode deserialization in the Java loader so JRuby/TruffleRuby don’t pay for method bodies up front: ~1.5x faster on the parsing-core metric.
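The visitor change is the easiest one to picture in Ruby. The sketch below uses a made-up NodeSketch class rather than Prism's generated node classes, so the field names are assumptions. It puts the two traversal styles side by side: one builds a fresh array per node visited, the other yields each present child directly.

```ruby
# Hypothetical node with three optional children; Prism's real nodes are
# generated and have node-specific fields.
class NodeSketch
  attr_reader :left, :right, :body

  def initialize(left: nil, right: nil, body: nil)
    @left = left
    @right = right
    @body = body
  end

  # Allocating style: a new array (plus a compacted copy) on every call.
  def compact_child_nodes
    [left, right, body].compact
  end

  # Yielding style: no intermediate array when a block is given.
  def each_child_node
    return to_enum(:each_child_node) unless block_given?

    yield left if left
    yield right if right
    yield body if body
  end
end

# A tiny visitor that walks the tree through the yielding API.
def count_nodes(node)
  count = 1
  node.each_child_node { |child| count += count_nodes(child) }
  count
end
```

Across a codebase the size of Rails, the allocating version creates one short-lived array per node, which is the overhead the yielding version avoids.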
BigDecimal goes brrr
tompng (Tomoya Ishida) has been quietly doing extraordinary things to BigDecimal:
- NTT multiplication + Newton-Raphson division – O(n log n) multiplication via a three-prime Number Theoretic Transform. The headline is almost comical: up to 800,000x faster multiplication. A squaring that was estimated at 270 days now runs in 29 seconds. This is the kind of PR you frame on a wall.
- Increase VpMult batch size – Bumping the divmod batch from 8 to 16 makes mid-size multiplications ~1.8x faster. tompng.
- Optimize BigDecimal#to_s – byroot replaced two snprintf calls with a lean integer-to-ASCII routine: ~2.6x faster for small numbers, ~3.8x for large ones.
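If you've never seen NTT multiplication, here is a deliberately simplified sketch of the idea. It is not tompng's implementation: the PR uses three primes and BigDecimal's internal limbs, while this toy uses a single prime (998244353, primitive root 3) and plain decimal digits. The names ntt_sketch and multiply_sketch are mine. The digits of each number are treated as polynomial coefficients, both are transformed, multiplied pointwise, transformed back, and the carries are resolved at the end.

```ruby
NTT_MOD = 998_244_353 # 119 * 2**23 + 1, a common NTT-friendly prime
NTT_ROOT = 3          # primitive root modulo NTT_MOD

# In-place iterative number theoretic transform; a.length must be a power of 2.
def ntt_sketch(a, invert)
  n = a.length
  j = 0
  (1...n).each do |i| # bit-reversal permutation
    bit = n >> 1
    while (j & bit) != 0
      j ^= bit
      bit >>= 1
    end
    j ^= bit
    a[i], a[j] = a[j], a[i] if i < j
  end

  len = 2
  while len <= n
    w_len = NTT_ROOT.pow((NTT_MOD - 1) / len, NTT_MOD)
    w_len = w_len.pow(NTT_MOD - 2, NTT_MOD) if invert # modular inverse
    half = len / 2
    (0...n).step(len) do |start|
      w = 1
      half.times do |k|
        u = a[start + k]
        v = a[start + k + half] * w % NTT_MOD
        a[start + k] = (u + v) % NTT_MOD
        a[start + k + half] = (u - v) % NTT_MOD
        w = w * w_len % NTT_MOD
      end
    end
    len <<= 1
  end

  if invert
    n_inv = n.pow(NTT_MOD - 2, NTT_MOD)
    a.map! { |x| x * n_inv % NTT_MOD }
  end
  a
end

# Multiply two non-negative Integers by convolving their decimal digits.
def multiply_sketch(x, y)
  a = x.digits # least significant digit first
  b = y.digits
  size = 1
  size <<= 1 while size < a.length + b.length
  fa = ntt_sketch(a + [0] * (size - a.length), false)
  fb = ntt_sketch(b + [0] * (size - b.length), false)
  product = ntt_sketch(fa.zip(fb).map { |p, q| p * q % NTT_MOD }, true)
  # Horner evaluation in base 10 folds the carries back in.
  product.reverse.inject(0) { |acc, coeff| acc * 10 + coeff }
end
```

This will not outrun Ruby's built-in Integer multiplication. It only shows why the cost scales as O(n log n) in the number of digits instead of the quadratic cost of schoolbook multiplication.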
JIT corner
- Fix RCLASS_EXT_WRITABLE perf – luke-gruber swapped FL_TEST/FL_SET for their _RAW variants, dropping a YJIT getivar benchmark from 60ms to 40ms (~1.5x).
- Rewrite Array#find in Ruby – swebb reimplemented Array#find in Ruby so the JIT can chew on it: ~1.96x faster under YJIT, neutral on the interpreter.
- ZJIT recompiles getivar on shape-guard failure – k0kubun (Takashi Kokubun) cut guard_shape_failure side exits on the lobsters benchmark from 22.5% down to 3.0%, keeping more code in ZJIT.
- Annotate Float predicates – Teaching ZJIT about Float#nan?/finite?/infinite? lets it emit the fast C-call path: ~21–27% faster on those predicates in a tight loop.
- ZJIT also keeps growing its instruction set – specializing Method#call and adding an ArrayAset HIR instruction for array element assignment – each shaving a few percent off the relevant wall-clock benchmarks.
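Why would rewriting a C method in Ruby make it faster? Because the JIT can see through a Ruby loop and inline the block, whereas a call into C is opaque to it. A Ruby-level find looks roughly like the sketch below. This is my own approximation, not the code that landed in ruby/ruby, and find_sketch is a hypothetical name; core methods written in Ruby typically use internal primitives that ordinary Ruby code cannot call.

```ruby
# Hypothetical stand-in for a Ruby-implemented Array#find.
# A plain while loop plus yield is the shape a JIT can optimize well.
def find_sketch(array)
  i = 0
  # Re-read the length each iteration so mutation inside the block is tolerated.
  while i < array.length
    element = array[i]
    return element if yield(element)

    i += 1
  end
  nil
end
```

Called as find_sketch([1, 2, 3, 4]) { |x| x > 2 }, it returns the first element for which the block is truthy, or nil if there is none.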
Quick hits
A few more that are smaller in scope but very much worth a click – and a thank-you to each author:
- Integer#to_s two-digit lookup table – emit two digits per loop iteration; up to ~33% faster on large Fixnums.
- NilClass methods moved to Ruby – Hartley McGuire made nil.to_c/to_r JIT-friendly: up to 3.5x faster.
- OPTIMIZED_CMP in r_less – speeds up Range#cover?/Range#overlap? by up to ~3x.
- Declaring weak references – a new rb_gc_declare_weak_references API trims WeakMap overhead: ~60% faster WeakMap#[]=.
- Remove a wasted allocation in BER integer packing – khasinski (Chris Hasiński) shaved ~50% off Array#pack with the 'w' format.
- Optimize Lrama – the parser generator gets faster, cutting Ruby’s own parse.y processing from 2.84s to 1.60s (~1.78x).
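The two-digit lookup table is a classic trick that fits in a few lines. The sketch below shows it in Ruby under my own names (DIGIT_PAIRS, to_s_sketch); the real change lives in C, where the saving comes from halving the number of divisions. Each iteration divides by 100 and emits a precomputed two-character string, rather than dividing by 10 and emitting one digit.

```ruby
# Precomputed "00".."99" so each division by 100 yields two digits at once.
DIGIT_PAIRS = (0..99).map { |n| format("%02d", n) }.freeze

# Hypothetical base-10 formatter illustrating the lookup-table loop.
def to_s_sketch(n)
  return "0" if n.zero?

  negative = n.negative?
  n = n.abs
  parts = []

  while n >= 100
    n, pair = n.divmod(100)
    parts << DIGIT_PAIRS[pair]
  end

  # The leading chunk is 1..99; drop the padding zero for a single digit.
  parts << (n < 10 ? DIGIT_PAIRS[n][1] : DIGIT_PAIRS[n])

  (negative ? "-" : "") + parts.reverse.join
end
```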
Closing
If you like performance magic, go read these. And if you maintain a gem, read them twice – a lot of what’s here (back-to-front scanning, single-byte fast paths, deferring copies, avoiding stat) is worth learning from.
Thanks to everyone credited here for the work.
The post Small PRs, big speedups: The Ruby performance work you almost missed appeared first on Closer to Code.
