Every developer who maintains Ruby gems knows that sinking feeling when a user reports an error that shouldn’t be possible. Not “difficult to reproduce”, but truly impossible according to everything you know about how your code works.
That’s exactly what hit me when a Karafka user’s error tracker logged 2,700 identical errors in a single incident:
NoMethodError: undefined method 'default' for an instance of String
vendor/bundle/ruby/3.4.0/gems/karafka-rdkafka-0.22.2-x86_64-linux-musl/lib/rdkafka/consumer/topic_partition_list.rb:112 FFI::Struct#[]
The error occurred because something was calling #default on a String. I had never used a #default method anywhere in Karafka or rdkafka-ruby. Suddenly, there were 2,700 reports in rapid succession until the process restarted and everything went back to normal.
The user added casually: “No worry, no harm done since this hasn’t occurred on prod yet.”
Yet. That word stuck with me.
Something had to change. Fast.
TL;DR: FFI
I opened the rdkafka-ruby code at line 112:
native_tpl[:cnt].times do |i|
ptr = native_tpl[:elems] + (i * Rdkafka::Bindings::TopicPartition.size)
elem = Rdkafka::Bindings::TopicPartition.new(ptr)
# Line 112 - Where everything exploded
if elem[:partition] == -1
The crash happened when accessing elem[:partition]. But elem is an FFI::Struct, a foreign function interface structure that bridges Ruby and C code, and partition was declared as an integer:
class TopicPartition
I dove into FFI's internals to understand what was happening. FFI doesn't use many Hashes, neither in Ruby nor in its C extension - there are only a few critical data structures. The most important one is rbFieldMap, an internal Hash that every struct layout maintains to store field definitions. When you access elem[:partition], FFI looks up :partition in this Hash to find the field's type, offset, and size.
This Hash is the heart of FFI's struct system. Without it, FFI can't map field names to their C memory locations.
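To make the role of that field map concrete, here is a minimal pure-Ruby sketch of the idea. It is not FFI's implementation: ToyStruct, Field, and the two-field layout are hypothetical names invented for illustration, and the offsets simply mirror the diagnostics quoted later in this post.

```ruby
# Toy model of a struct field map. This is NOT FFI's implementation;
# ToyStruct, Field and the two-field layout are hypothetical.
Field = Struct.new(:offset, :size, :directive)

class ToyStruct
  # Field name => where the value lives and how to decode it.
  FIELD_MAP = {
    partition: Field.new(8, 4, "l"),  # native-endian signed 32-bit
    err:       Field.new(48, 4, "l")
  }.freeze

  def initialize(bytes)
    @bytes = bytes
  end

  # Every read starts with a Hash lookup by field name.
  def [](name)
    field = FIELD_MAP.fetch(name)
    @bytes.byteslice(field.offset, field.size).unpack1(field.directive)
  end
end

memory = "\x00".b * 64         # 64 zeroed bytes standing in for C memory
memory[8, 4] = [-1].pack("l")  # write -1 where :partition lives
elem = ToyStruct.new(memory)

puts elem[:partition]  # prints -1
puts elem[:err]        # prints 0
```

If the object behind FIELD_MAP were anything other than a Hash, every field read would fail before touching the struct's memory at all, which is the shape of the production error.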
Why would it be calling default on a String?
I searched the entire codebase. No calls to #default anywhere in my code. I checked FFI's Ruby code. No calls to #default there either.
But #default is a Hash method. Ruby's Hash implementation calls hash#default when you access a key that might not exist.
I stared at the backtrace. After billions of messages processed successfully, something in FFI's internals had fundamentally broken. An internal Hash that should contain field definitions was somehow... a String.
The gem was precompiled: karafka-rdkafka-0.22.2-x86_64-linux-musl. That suffix made me immediately suspicious. The user was running ruby:3.4.5-alpine in Docker, which uses musl libc instead of glibc.
I've debugged enough production issues to know that precompiled gems and Alpine Linux make a notorious combination. Different libc versions, different struct alignment assumptions, different CPU architecture quirks.
"This has to be musl," I thought. I spent some time building diagnostic scripts:
require 'ffi'
# Check FFI integer type sizes
module Test
extend FFI::Library
class IntTest
The response came back:
FFI :int size: 4 bytes
FFI :int32 size: 4 bytes
Match: Yes
The sizes matched. That ruled out basic type mismatches. But maybe alignment?
I sent another diagnostic to check struct padding:
# Check actual struct field offsets
module AlignTest
extend FFI::Library
class WithInt
Response:
Struct alignment: :err offset 48 vs 48
Perfect alignment. Now let's check the actual compiled struct from the gem:
actual_size = Rdkafka::Bindings::TopicPartition.size
actual_err_offset = Rdkafka::Bindings::TopicPartition.offset_of(:err)
puts "Actual gem struct: size=#{actual_size}, err_offset=#{actual_err_offset}"
expected_size = 64
expected_err_offset = 48
puts "Expected: size=#{expected_size}, err_offset=#{expected_err_offset}"
Response:
Actual gem struct: size=64, err_offset=48
Expected: size=64, err_offset=48
Everything matched.
Every "obvious" explanation had failed. The struct definitions were perfect. The memory layout was correct. There was no ABI mismatch, no musl-specific quirk, no CPU architecture issue.
And yet the undefined method 'default' for an instance of String error occurred.
I went back to that error message with fresh eyes. Why default specifically?
In Ruby, when you access a Hash with hash[key], the implementation can call hash.default to check for a default value if the key doesn't exist. So if FFI is trying to call #default on a String, this would mean that rbFieldMap - the internal Hash that stores field definitions - is actually a String.
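The link between hash[key] and #default is easy to see in plain Ruby. The sketch below uses a hypothetical LoudHash subclass purely to make the call visible. It says nothing about FFI's internals, only about how a failed lookup falls back to #default and what happens when the receiver is a String.

```ruby
# Hypothetical subclass used only to make the call to #default visible.
class LoudHash < Hash
  def default(key = nil)
    "default called for #{key.inspect}"
  end
end

h = LoudHash.new
h[:present] = 1

puts h[:present]  # prints 1
puts h[:missing]  # prints: default called for :missing

# A String has no #default, so the same fallback cannot succeed on it.
begin
  "not a hash".default
rescue NoMethodError => e
  puts e.class  # prints NoMethodError
end
```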
Sounds crazy, but wait! What if there was a case where Ruby could replace a Hash with a String at runtime? Not corrupt the Hash's data, but literally free the Hash and allocate a String in the same memory location?
That would explain everything. The C code would still have a pointer to memory address 0x000078358a3dfd28, thinking it points to a Hash. But Ruby's GC would have freed that Hash, and the memory allocator could create a String at the exact same address. The pointer would be valid. The memory would contain valid data. Just... the wrong type of data.
An object changing type at runtime. That shouldn't be possible unless... I searched FFI's GitHub issues and found #1079: "Crash with [BUG] try to mark T_NONE object" - about segfaults, not this specific error. But buried in the comments, KJ mentioned "missing write barriers" in FFI's C extension.
A write barrier is a mechanism that tells Ruby's garbage collector about references between objects. When C code stores a Ruby object pointer without using RB_OBJ_WRITE, the GC doesn't know that reference exists. The GC can then free the object, thinking nothing needs it anymore.
That's when it clicked. If FFI's rbFieldMap Hash was being freed by the GC, then Ruby could allocate a String in that exact memory location.
But first, I needed to understand the #1079 issue better. I wrote a simple reproduction:
require 'ffi'
puts "Ruby: #{RUBY_VERSION} | FFI: #{FFI::VERSION}"
# Enable aggressive GC to trigger the bug faster
GC.stress = 0x01 | 0x04
i = 0
loop do
i += 1
# Create transient struct class that immediately goes out of scope
struct_class = Class.new(FFI::Struct) do
layout :field1, :int32,
:field2, :int64,
:field3, :pointer,
:field4, :string,
:field5, :double,
:field6, :uint8,
:field7, :uint32,
:field8, :pointer
end
instance = struct_class.new
instance[:field1] = rand
instance[:field2]
# ... access various fields
field = struct_class.layout[:field5]
field.offset
field.size
print "." if i % 1000 == 0
end
This reproduced the #1079 segfaults beautifully - the "T_NONE object" errors where the GC frees objects so aggressively that Ruby tries to access null pointers.
rb_obj_info_dump:
/3.4.0/gems/ffi-1.16.3/lib/ffi/struct_layout_builder.rb:171: [BUG] try to mark T_NONE object
ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [x86_64-linux]
-- Control frame information -----------------------------------------------
c:0044 p:---- s:0246 e:000245 CFUNC :initialize
c:0043 p:---- s:0243 e:000242 CFUNC :new
c:0042 p:0033 s:0236 e:000235 METHOD /gems/3.4.0/gems/ffi-1.16.3/lib/ffi/struct_layout_builder.rb:171
But my production bug wasn't a segfault. It was a magical transformation. The timing had to be different.
With GC.stress = true, the GC runs after every possible allocation. That causes immediate segfaults because objects get freed before Ruby can even allocate new objects in their memory slots.
But for a Hash to become a String, you need:
I couldn't use GC.stress. I needed natural GC timing with precise memory pressure.
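To get a feel for how different the two regimes are, here is a small sketch that counts GC runs for the same allocation workload with and without GC.stress. The helper name gc_runs_during and the workload size are my own choices for illustration; exact counts will vary by Ruby version and heap state.

```ruby
# Count how many GC runs happen while the block executes.
def gc_runs_during
  before = GC.count
  yield
  GC.count - before
end

# A small allocation workload: 200 short-lived Strings.
allocate = -> { 200.times { "x" * 64 } }

natural = gc_runs_during(&allocate)

GC.stress = true
begin
  stressed = gc_runs_during(&allocate)
ensure
  GC.stress = false  # always switch stress mode back off
end

puts "natural: #{natural} GC runs, stressed: #{stressed} GC runs"
```

Under stress the collector runs at nearly every allocation, so a freed slot is usually hit again before anything new can settle into it. Under natural timing the same workload triggers few or no collections, which leaves room for a slot to be freed and later reused.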
I dove deeper into FFI's C extension code. In ext/ffi_c/StructLayout.c, I found the vulnerable code:
static VALUE
struct_layout_initialize(VALUE self, VALUE fields, VALUE size, VALUE align)
{
StructLayout* layout;
// ... initialization code ...
layout->rbFieldMap = rb_hash_new(); // ← NO WRITE BARRIER
layout->rbFields = rb_ary_new();
layout->rbFieldNames = rb_ary_new();
// Without RB_OBJ_WRITE, the GC doesn't know about these references!
// ...
}
When FFI creates a struct layout, it allocates three Ruby objects: a Hash (rbFieldMap) and two Arrays (rbFields and rbFieldNames).
It stores pointers to these objects in a C struct.
But in FFI 1.16.3 it didn't use RB_OBJ_WRITE to register these references with Ruby's garbage collector.
From the GC's perspective, the following is happening:
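The Hash is allocated at 0x000078358a3dfd28.
The GC sees no reference to it and frees it.
A String is later allocated at 0x000078358a3dfd28.
FFI follows its old pointer, expecting a Hash: undefined method 'default' for String.
The fix in FFI 1.17.0 added proper write barriers: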
static VALUE
struct_layout_initialize(VALUE self, VALUE fields, VALUE size, VALUE align)
{
StructLayout* layout;
// ... initialization code ...
RB_OBJ_WRITE(self, &layout->rbFieldMap, rb_hash_new()); // ← FIXED!
RB_OBJ_WRITE(self, &layout->rbFields, rb_ary_new());
RB_OBJ_WRITE(self, &layout->rbFieldNames, rb_ary_new());
// Now the GC knows: "self owns these objects, don't free them"
// ...
}
This single macro call, RB_OBJ_WRITE, tells Ruby's garbage collector: "This C struct holds a reference to this Ruby object. Don't free it while the struct is alive."
Without it, you have a use-after-free vulnerability where C thinks that it has a valid pointer, but Ruby has freed the memory and reused it for something else entirely.
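The mechanism can be modeled in a few lines of Ruby. This is a deliberately crude toy, not how Ruby's collector works: ToyHeap is hypothetical, and it flattens marking, generations, and write barriers into a single question of whether the collector was told about the reference.

```ruby
# Toy heap: slots are "addresses", sweep frees whatever the collector
# was not told about. ToyHeap is hypothetical and far simpler than Ruby's GC.
class ToyHeap
  def initialize(size)
    @slots = Array.new(size)
  end

  # Put obj in the first free slot and return its "address".
  def allocate(obj)
    addr = @slots.index(nil) or raise "heap full"
    @slots[addr] = obj
    addr
  end

  # Free every slot that is not among the known references.
  def sweep(known_references)
    @slots.each_index do |addr|
      @slots[addr] = nil unless known_references.include?(addr)
    end
  end

  def deref(addr)
    @slots[addr]
  end
end

def field_map_after_gc(barrier:)
  heap = ToyHeap.new(4)
  field_map_addr = heap.allocate({ partition: :int32 })  # the C side keeps this address
  heap.sweep(barrier ? [field_map_addr] : [])            # barrier: the GC knows the reference
  heap.allocate("an unrelated String")                   # takes the first free slot
  heap.deref(field_map_addr)                             # what the old pointer sees now
end

puts field_map_after_gc(barrier: true).class   # prints Hash
puts field_map_after_gc(barrier: false).class  # prints String
```

In the second call the stored address never changes and never becomes invalid. Only the object living there does.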
Understanding the bug wasn't enough. I needed to reproduce it. Not the #1079 segfaults - the specific case where a Hash becomes something else.
The requirements were precise:
Natural GC timing (not GC.stress, which causes segfaults).
Here's what I built:
#!/usr/bin/env ruby
require 'ffi'
# Unbuffer stdout so we see output immediately
$stdout.sync = true
$stderr.sync = true
2.times do
Thread.new do
loop do
# Create an array to hold references temporarily
# This creates more allocation pressure
arr = []
# Allocate many strings rapidly
5000.times do
arr e
puts "\n" + "=" * 60
puts "🐛 BUG REPRODUCED! 🐛"
puts "=" * 60
puts "Error: #{e.message}"
puts "\nBacktrace:"
puts e.backtrace[0..10]
exit 1
end
end
# Clear old strings to increase memory churn
if garbage_strings.size > 50_000
garbage_strings.shift(25_000)
end
end
end
ars.each(&:join)
Key differences from typical FFI tests:
I wrapped it in a Docker container with memory constraints:
FROM ruby:3.4.5-alpine
RUN apk add --no-cache build-base
RUN gem install ffi -v 1.16.3
WORKDIR /app
COPY poc.rb .
CMD ["ruby", "poc.rb"]
Then I created a bash script to run it in a loop, filtering for the specific error:
#!/bin/bash
run_count=0
log_dir="./logs"
mkdir -p "$log_dir"
echo "Building Docker image..."
docker build -t ffi-bug-poc .
echo "Running POC in a loop until bug is reproduced..."
echo "Looking for exit code 1 with 'undefined' in output"
echo
while true; do
run_count=$((run_count + 1))
timestamp=$(date +%Y%m%d_%H%M%S)
log_file="${log_dir}/run_${run_count}_${timestamp}.log"
echo -n "Run #${run_count} at $(date +%H:%M:%S)... "
# Run with memory constraints to increase GC pressure
docker run --rm \
--memory=512m \
--memory-swap=0m \
ffi-bug-poc > "$log_file" 2>&1
exit_code=$?
# Filter: only care about exit code 1 with "undefined" in output
# Ignore segfaults (exit 139) - those are from #1079
if [ $exit_code -eq 1 ] && grep -qi "undefined" "$log_file"; then
echo ""
echo "🐛 BUG REPRODUCED on run #${run_count}! 🐛"
cat "$log_file"
exit 0
elif [ $exit_code -eq 0 ]; then
echo "completed successfully (no bug)"
rm "$log_file"
else
echo "exit code $exit_code (segfault) - continuing..."
fi
sleep 0.1
done
I hit Enter and watched the terminal:
Building Docker image...
Running POC in a loop until bug is reproduced...
Looking for exit code 1 with 'undefined' in output
Run #1 at 14:32:15... completed successfully (no bug)
Run #2 at 14:32:18... completed successfully (no bug)
Run #3 at 14:32:21... exit code 139 (segfault) - continuing...
Run #4 at 14:32:24... completed successfully (no bug)
Lots of segfaults - those were the #1079 issue. I was hunting for the specific undefined method error.
After realizing I needed even more memory churn, I opened multiple terminals and ran the loop script several times in parallel. Within minutes:
Run #23 at 15:18:42... exit code 139 (segfault) - continuing...
Run #24 at 15:18:45... completed successfully (no bug)
Run #25 at 15:18:48...
============================================================
🐛 BUG REPRODUCED! 🐛
============================================================
Error: undefined method 'default' for an instance of String
Backtrace:
poc.rb:82:in `[]'
poc.rb:82:in `block (2 levels) in '
poc.rb:80:in `each'
poc.rb:80:in `each_with_index'
poc.rb:80:in `block in '
:237:in `times'
poc.rb:50:in `'
============================================================
There!
Not a segfault. Not the T_NONE error from #1079. There it is, the exact error from production: undefined method 'default' for an instance of String
An FFI internal Hash had been freed by the GC and replaced by a String object in the same memory location!
Here's what happens in those microseconds when the bug triggers:
The Hash didn't get corrupted. It ceased to exist. A String was born in its place, wearing the Hash's memory address like a stolen identity.
This bug reveals something fundamental about how Ruby manages memory at the lowest level.
Objects don't have permanent identities. They're data structures at memory addresses. When the garbage collector frees memory, Ruby will reuse it. If you're holding a C pointer to that address without proper write barriers, you're now pointing at whatever Ruby decided to create there next.
No warning. No error. Just different methods that make no sense.
It's like coming home to find a stranger living in your house, wearing your clothes, answering to your name. And when you say "but you're supposed to respond to default," they look at you confused: "I'm a String. I don't have that method."
This is why write barriers exist. They're not optional extras for C extension authors. They're how you tell the garbage collector: "I'm holding a reference. Don't free this." Without them, you have use-after-free bugs that can manifest as objects changing identity at runtime.
If you're using FFI
# Gemfile
gem 'ffi', '~> 1.17.0'
That's it. Upgrade and the bug goes from million-to-one to zero.
The fix made by KJ adds proper write barriers throughout FFI's C codebase. The garbage collector now knows not to free rbFieldMap while it's still needed. Your Hashes stay Hashes. Your Strings stay Strings. Reality remains consistent.
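If you want a process to tell you when it is running an affected version, a boot-time check along these lines is one option. The helper name ffi_needs_upgrade? is hypothetical. The 1.17.0 threshold comes from this post, and the comparison itself uses only RubyGems' Gem::Version.

```ruby
require "rubygems"

# First FFI release with the write-barrier fix, per this post.
FFI_FIXED_IN = Gem::Version.new("1.17.0")

# Hypothetical helper: true when the given FFI version predates the fix.
def ffi_needs_upgrade?(version)
  Gem::Version.new(version) < FFI_FIXED_IN
end

puts ffi_needs_upgrade?("1.16.3")  # prints true
puts ffi_needs_upgrade?("1.17.0")  # prints false

# Check whatever is actually loaded in this process, if anything.
spec = Gem.loaded_specs["ffi"]
if spec && ffi_needs_upgrade?(spec.version.to_s)
  warn "ffi #{spec.version} predates #{FFI_FIXED_IN}, consider upgrading"
end
```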
But here's why this matters beyond just FFI users.
This bug requires perfect timing:
Create transient struct CLASS definitions (not just instances):
Precise GC timing (not too aggressive, not too passive)
Multi-threaded execution creating memory churn:
The exact microsecond window where:
rbFieldMap.
In typical production Karafka processing:
~1 in 1,000,000 process restarts
But in high-churn environments:
That's how a million-to-one bug causes 2,700 errors in production. Not because it's common, but because when timing finally aligns, it stays aligned. The corrupted state persists until a restart, causing every subsequent operation to fail.
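The arithmetic behind "rare becomes routine" is worth a quick sketch. If each restart independently has probability p of landing in the bad window, the chance of at least one hit in N restarts is 1 - (1 - p)^N. The per-restart figure below is the rough estimate from this post. The restart totals are hypothetical, and independence between restarts is an assumption.

```ruby
# Chance of at least one hit in `restarts` independent restarts,
# each with probability `p` of landing in the bad window.
def chance_of_at_least_one(p, restarts)
  1 - (1 - p)**restarts
end

p_per_restart = 1.0 / 1_000_000  # the post's rough estimate

# Hypothetical cumulative restart counts across a fleet.
[1_000, 100_000, 1_000_000].each do |restarts|
  chance = chance_of_at_least_one(p_per_restart, restarts)
  puts format("%9d restarts -> %.4f", restarts, chance)
end
```

This only covers the odds of the first hit. The 2,700 errors come from the second half of the story: once the stale pointer exists, every later lookup in that process fails until it restarts.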
After spending days debugging what seemed impossible, here's what stayed with me:
Sometimes the obvious answer is wrong. I spent a lot of time convinced this was a musl issue. Every diagnostic came back green. Sizes matched. Offsets matched. Alignment matched. Everything. Matched. But the bug wasn't in the data layout - it was in object identity.
The timing of GC matters as much as whether GC happens. GC.stress finds immediate use-after-free (segfaults). Natural GC timing finds delayed reuse (object transformations). They're different symptoms of the same root cause, requiring different reproduction strategies.
Million-to-one bugs are real, not theoretical. They happen during initialization and restart, not runtime. When they trigger, they cascade - 2,700 errors from one root cause. In high-restart environments, rare becomes routine.
Diagnostic scripts can test the wrong layer. My scripts verified static struct layouts perfectly. But they couldn't detect that FFI's internal Hash could be freed by GC and replaced by a String at runtime. The tests passed because they checked structure, not behavior.
Patience and persistence matter. Going from the impossible theory to a reproducible bug took days of work. Building the elegant #1079 reproduction helped me understand the timing requirements. Manual write barrier removal showed me the difference between immediate and delayed failures. Multi-threaded stress tests with process farms compressed million-to-one odds into minutes.
Initially, I blamed myself. I guess that's what maintainership feels like sometimes. You own the stack, even when the bug is deeper than your code.
The fix was already in FFI 1.17.0 when this incident happened. The user just hadn't upgraded yet. Sometimes the "impossible" error you're debugging has already been solved. You just don't know it yet.
The root cause, missing write barriers in FFI, was identified in FFI issue #1079 by KJ, who has been my invaluable rubber duck throughout this debugging journey.
If you're running FFI
They're inevitabilities waiting to happen.
And somewhere, right now, a developer is staring at an error log showing many identical crashes, running every diagnostic script they have, seeing all green checkmarks, and thinking: "That's impossible. Everything matches."
But it's not impossible.
It's just waiting for the right microsecond.
The post When Your Hash Becomes a String: Hunting Ruby’s Million-to-One Memory Bug appeared first on Closer to Code.