The actual algorithm is largely unchanged, just allowed to use
singlebyte checks for common encodings.
It could certainly be optimized much further, as here again it often
scans from the front of the string when we're interested in the back of
it. But the algorithm as many Windows only corner cases so I'd rather
ship a good improvement now and eventually come back to it later.
Most of improvement here is from the reduced setup cost (avodi double
null checks, avoid duping the argument, etc), and skipping the multi-byte
checks.
```
compare-ruby: ruby 4.1.0dev (2026-01-19T03:51:30Z master 631bf19b37) +PRISM [arm64-darwin25]
built-ruby: ruby 4.1.0dev (2026-01-21T08:21:05Z opt-basename 7eb11745b2) +PRISM [arm64-darwin25]
```
| |compare-ruby|built-ruby|
|:----------|-----------:|---------:|
|long | 3.412M| 18.158M|
| | -| 5.32x|
|long_name | 1.981M| 8.580M|
| | -| 4.33x|
|withext | 3.200M| 12.986M|
| | -| 4.06x|
Similar optimizations to the ones performed in GH-15907.
- Skip the expensive multi-byte encoding handling for the common
encodings that are known to be safe.
- Use `CheckPath` to save on copying the argument and only scan it for
NULL bytes once.
- Create the return string with rb_enc_str_new instead of rb_str_subseq
as it's going to be a very small string anyway.
This could be optimized a little bit further by searching for both `.` and `dirsep`
in one pass,
```
compare-ruby: ruby 4.1.0dev (2026-01-19T03:51:30Z master 631bf19b37) +PRISM [arm64-darwin25]
built-ruby: ruby 4.1.0dev (2026-01-20T07:33:42Z master 6fb50434e3) +PRISM [arm64-darwin25]
```
| |compare-ruby|built-ruby|
|:----------|-----------:|---------:|
|long | 3.606M| 22.229M|
| | -| 6.17x|
|long_name | 2.254M| 13.416M|
| | -| 5.95x|
|short | 16.488M| 29.969M|
| | -| 1.82x|
- `str_null_check` was performed twice, once by `FilePathStringValue`
and a second time by `StringValueCStr`.
- `StringValueCStr` was checking for the terminator presence, but we
don't care about that.
- `FilePathStringValue` calls `rb_str_new_frozen` to ensure `fname`
isn't mutated, but that's costly for such a check. Instead we
can do it in debug mode only.
- `rb_enc_get` is slow because it accepts arbitrary objects, even immediates,
so it has to do numerous type checks. Add a much faster `rb_str_enc_get`
when we know we're dealing with a string.
- `rb_enc_copy` is slow for the same reasons, since we already have the
encoding, we can use `rb_enc_str_new` instead.
`File.join` is a hotspot for common libraries such as Zeitwerk
and Bootsnap. It has a fairly flexible signature, but 99% of
the time it's called with just two (or a small number of) UTF-8 strings.
If we optimistically optimize for that use case we can cut down a large
number of type and encoding checks, significantly speeding up the method.
The one remaining expensive check we could try to optimize is `str_null_check`.
Given it's common to use the same base string for joining, we could memoize it.
Also we could precompute it for literal strings.
```
compare-ruby: ruby 4.1.0dev (2026-01-17T14:40:03Z master 00a3b71eaf) +PRISM [arm64-darwin25]
built-ruby: ruby 4.1.0dev (2026-01-18T12:10:38Z spedup-file-join 069bab58d4) +PRISM [arm64-darwin25]
warming up....
| |compare-ruby|built-ruby|
|:-------------|-----------:|---------:|
|two_strings | 2.475M| 9.444M|
| | -| 3.82x|
|many_strings | 551.975k| 2.346M|
| | -| 4.25x|
|array | 514.946k| 522.034k|
| | -| 1.01x|
|mixed | 621.236k| 633.189k|
| | -| 1.02x|
```
Since `BUILTIN_TYPE` and `RCLASS_SINGLETON_P` are both stored in
`RBasic.flags`, we can combine these two checks in a single bitmask.
This rely on `T_ICLASS` and `T_CLASS` not overlapping, and assume
`klass` is always either of these types.
Just combining the masks brings a small but consistent 1.08x speedup on the simple case benchmark.
```
compare-ruby: ruby 3.5.0dev (2025-08-30T01:45:42Z obj-class 01a57bd6cd) +YJIT +PRISM [arm64-darwin24]
built-ruby: ruby 3.5.0dev (2025-08-30T09:56:24Z obj-class 2685f8dbb4) +YJIT +PRISM [arm64-darwin24]
| |compare-ruby|built-ruby|
|:----------|-----------:|---------:|
|obj | 444.410| 478.895|
| | -| 1.08x|
|extended | 135.139| 140.206|
| | -| 1.04x|
|singleton | 165.155| 155.832|
| | 1.06x| -|
|immediate | 380.103| 432.090|
| | -| 1.14x|
```
But with the RB_UNLIKELY compiler hint, it's much more significant, however
the singleton and enxtended cases are slowed down.
However we can assume the simple case is way more common than the other two.
```
compare-ruby: ruby 3.5.0dev (2025-08-30T01:45:42Z obj-class 01a57bd6cd) +YJIT +PRISM [arm64-darwin24]
built-ruby: ruby 3.5.0dev (2025-08-30T09:51:01Z obj-class 12d01a1b02) +YJIT +PRISM [arm64-darwin24]
| |compare-ruby|built-ruby|
|:----------|-----------:|---------:|
|obj | 444.951| 556.191|
| | -| 1.25x|
|extended | 136.836| 113.871|
| | 1.20x| -|
|singleton | 166.335| 167.747|
| | -| 1.01x|
|immediate | 379.642| 509.515|
| | -| 1.34x|
```
While accessing the ivars of other types is too complicated to
realistically generate the ASM for it, we can at least provide
the ivar index as to not have to lookup the shape tree every
time.
```
compare-ruby: ruby 3.5.0dev (2025-08-27T14:58:58Z merge-vm-setivar-d.. 5b749d8e53) +YJIT +PRISM [arm64-darwin24]
built-ruby: ruby 3.5.0dev (2025-08-28T17:58:32Z yjit-get-exivar efaa8c9b09) +YJIT +PRISM [arm64-darwin24]
| |compare-ruby|built-ruby|
|:--------------------------|-----------:|---------:|
|vm_ivar_get_on_obj | 930.458| 936.865|
| | -| 1.01x|
|vm_ivar_get_on_class | 134.471| 431.622|
| | -| 3.21x|
|vm_ivar_get_on_generic | 146.679| 284.408|
| | -| 1.94x|
```
Co-Authored-By: Aaron Patterson <tenderlove@ruby-lang.org>
`vm_setinstancevariable` had a codepath to try to match the inline
cache for types other than T_OBJECT, but the cache population path
in `vm_setivar_slowpath` was exclusive to T_OBJECT, so `vm_setivar_default`
would never match anything.
This commit improves `vm_setivar_slowpath` so that it is capable of
filling the cache for all types, and adds a `vm_setivar_class` codepath
for `T_CLASS` and `T_MODULE`.
`vm_setivar`, `vm_setivar_default` and `vm_setivar_class` could be unified,
but based on the very explicit `NOINLINE` I assume they were split to minimize
codesize.
```
compare-ruby: ruby 3.5.0dev (2025-08-27T14:58:58Z merge-vm-setivar-d.. 5b749d8e53) +PRISM [arm64-darwin24]
built-ruby: ruby 3.5.0dev (2025-08-27T16:30:31Z setivar-cache-gene.. 4fe78ff296) +PRISM [arm64-darwin24]
| |compare-ruby|built-ruby|
|:------------------------|-----------:|---------:|
|vm_ivar_set_on_instance | 161.809| 164.688|
| | -| 1.02x|
|vm_ivar_set_on_generic | 58.769| 115.638|
| | -| 1.97x|
|vm_ivar_set_on_class | 70.034| 141.042|
| | -| 2.01x|
```
It's not rare for structs to have additional ivars, hence are one
of the most common, if not the most common type in the `gen_fields_tbl`.
This can cause Ractor contention, but even in single ractor mode
means having to do a hash lookup to access the ivars, and increase
GC work.
Instead, unless the struct is perfectly right sized, we can store
a reference to the associated IMEMO/fields object right after the
last struct member.
```
compare-ruby: ruby 3.5.0dev (2025-08-06T12:50:36Z struct-ivar-fields-2 9a30d141a1) +PRISM [arm64-darwin24]
built-ruby: ruby 3.5.0dev (2025-08-06T12:57:59Z struct-ivar-fields-2 2ff3ec237f) +PRISM [arm64-darwin24]
warming up.....
| |compare-ruby|built-ruby|
|:---------------------|-----------:|---------:|
|member_reader | 590.317k| 579.246k|
| | 1.02x| -|
|member_writer | 543.963k| 527.104k|
| | 1.03x| -|
|member_reader_method | 213.540k| 213.004k|
| | 1.00x| -|
|member_writer_method | 192.657k| 191.491k|
| | 1.01x| -|
|ivar_reader | 403.993k| 569.915k|
| | -| 1.41x|
```
Co-Authored-By: Étienne Barrié <etienne.barrie@gmail.com>
Set has been an autoloaded standard library since Ruby 3.2.
The standard library Set is less efficient than it could be, as it
uses Hash for storage, which stores unnecessary values for each key.
Implementation details:
* Core Set uses a modified version of `st_table`, named `set_table`.
than `s/st_/set_/`, the main difference is that the stored records
do not have values, making them 1/3 smaller. `st_table_entry` stores
`hash`, `key`, and `record` (value), while `set_table_entry` only
stores `hash` and `key`. This results in large sets using ~33% less
memory compared to stdlib Set. For small sets, core Set uses 12% more
memory (160 byte object slot and 64 malloc bytes, while stdlib set
uses 40 for Set and 160 for Hash). More memory is used because
the set_table is embedded and 72 bytes in the object slot are
currently wasted. Hopefully we can make this more efficient and have
it stored in an 80 byte object slot in the future.
* All methods are implemented as cfuncs, except the pretty_print
methods, which were moved to `lib/pp.rb` (which is where the
pretty_print methods for other core classes are defined). As is
typical for core classes, internal calls call C functions and
not Ruby methods. For example, to check if something is a Set,
`rb_obj_is_kind_of` is used, instead of calling `is_a?(Set)` on the
related object.
* Almost all methods use the same algorithm that the pure-Ruby
implementation used. The exception is when calling `Set#divide` with a
block with 2-arity. The pure-Ruby method used tsort to implement this.
I developed an algorithm that only allocates a single intermediate
hash and does not need tsort.
* The `flatten_merge` protected method is no longer necessary, so it
is not implemented (it could be).
* Similar to Hash/Array, subclasses of Set are no longer reflected in
`inspect` output.
* RDoc from stdlib Set was moved to core Set, with minor updates.
This includes a comprehensive benchmark suite for all public Set
methods. As you would expect, the native version is faster in the
vast majority of cases, and multiple times faster in many cases.
There are a few cases where it is significantly slower:
* Set.new with no arguments (~1.6x)
* Set#compare_by_identity for small sets (~1.3x)
* Set#clone for small sets (~1.5x)
* Set#dup for small sets (~1.7x)
These are slower as Set does not currently use the AR table
optimization that Hash does, so a new set_table is initialized for
each call. I'm not sure it's worth the complexity to have an AR
table-like optimization for small sets (for hashes it makes sense,
as small hashes are used everywhere in Ruby).
The rbs and repl_type_completor bundled gems will need updates to
support core Set. The pull request marks them as allowed failures.
This passes all set tests with no changes. The following specs
needed modification:
* Modifying frozen set error message (changed for the better)
* `Set#divide` when passed a 2-arity block no longer yields the same
object as both the first and second argument (this seems like an issue
with the previous implementation).
* Set-like objects that override `is_a?` such that `is_a?(Set)` return
`true` are no longer treated as Set instances.
* `Set.allocate.hash` is no longer the same as `nil.hash`
* `Set#join` no longer calls `Set#to_a` (it calls the underlying C
function).
* `Set#flatten_merge` protected method is not implemented.
Previously, `set.rb` added a `SortedSet` autoload, which loads
`set/sorted_set.rb`. This replaces the `Set` autoload in `prelude.rb`
with a `SortedSet` autoload, but I recommend removing it and
`set/sorted_set.rb`.
This moves `test/set/test_set.rb` to `test/ruby/test_set.rb`,
reflecting that switch to a core class. This does not move the spec
files, as I'm not sure how they should be handled.
Internally, this uses the st_* types and functions as much as
possible, and only adds set_* types and functions as needed.
The underlying set_table implementation is stored in st.c, but
there is no public C-API for it, nor is there one planned, in
order to keep the ability to change the internals going forward.
For internal uses of st_table with Qtrue values, those can
probably be replaced with set_table. To do that, include
internal/set_table.h. To handle symbol visibility (rb_ prefix),
internal/set_table.h uses the same macro approach that
include/ruby/st.h uses.
The Set class (rb_cSet) and all methods are defined in set.c.
There isn't currently a C-API for the Set class, though C-API
functions can be added as needed going forward.
Implements [Feature #21216]
Co-authored-by: Jean Boussier <jean.boussier@gmail.com>
Co-authored-by: Oliver Nutter <mrnoname1000@riseup.net>
This inverse table is only useful if `ObjectSpace._id2ref` is used,
which is extremely rare. The only notable exception is the `drb` gem
and even then it has an option not to rely on `_id2ref`.
So if we assume this table will never be looked up, we can just
not maintain it, and if it turns out `_id2ref` is called, we
can lock the VM and re-build it.
```
compare-ruby: ruby 3.5.0dev (2025-04-10T09:44:40Z master 684cfa42d7) +YJIT +PRISM [arm64-darwin24]
built-ruby: ruby 3.5.0dev (2025-04-10T10:13:43Z lazy-id-to-obj d3aa9626cc) +YJIT +PRISM [arm64-darwin24]
warming up..
| |compare-ruby|built-ruby|
|:----------|-----------:|---------:|
|baseline | 26.364M| 25.974M|
| | 1.01x| -|
|object_id | 10.293M| 14.202M|
| | -| 1.38x|
```
The following method call:
```ruby
a(*nil)
```
A method call such as `a(*nil)` previously allocated an array, because
it calls `nil.to_a`, but I have determined this array allocation is
unnecessary. The instructions in this case are:
```
0000 putself ( 1)[Li]
0001 putnil
0002 splatarray false
0004 opt_send_without_block <calldata!mid:a, argc:1, ARGS_SPLAT|FCALL>
0006 leave
```
The method call uses `ARGS_SPLAT` without `ARGS_SPLAT_MUT`, so the
returned array doesn't need to be mutable. I believe all cases where
`splatarray false` are used allow the returned object to be frozen,
since the `false` means to not duplicate the array. The optimization
in this case is to have `splatarray false` push a shared empty frozen
array, instead of calling `nil.to_a` to return a newly allocated array.
There is a slightly backwards incompatibility with this optimization,
in that `nil.to_a` is not called. However, I believe the new behavior
of `*nil` not calling `nil.to_a` is more consistent with how `**nil`
does not call `nil.to_hash`. Also, so much Ruby code would break if
`nil.to_a` returned something different from the empty hash, that it's
difficult to imagine anyone actually doing that in real code, though
we have a few tests/specs for that.
I think it would be bad for consistency if `*nil` called `nil.to_a`
in some cases and not others, so this changes other cases to not
call `nil.to_a`:
For `[*nil]`, this uses `splatarray true`, which now allocates a
new array for a `nil` argument without calling `nil.to_a`.
For `[1, *nil]`, this uses `concattoarray`, which now returns
the first array if the second array is `nil`.
This updates the allocation tests to check that the array allocations
are avoided where possible.
Implements [Feature #21047]
If the provided Hash doesn't have a default proc, we know for
sure that we'll never call into user provided code, hence the
string we allocate to access the Hash can't possibly escape.
So we don't actually have to allocate it, we can use a fake_str,
AKA a stack allocated string.
```
compare-ruby: ruby 3.5.0dev (2025-02-10T13:47:44Z master 3fb455adab) +PRISM [arm64-darwin23]
built-ruby: ruby 3.5.0dev (2025-02-10T17:09:52Z opt-gsub-alloc ea5c28958f) +PRISM [arm64-darwin23]
warming up....
| |compare-ruby|built-ruby|
|:----------------|-----------:|---------:|
|escape | 3.374k| 3.722k|
| | -| 1.10x|
|escape_bin | 5.469k| 6.587k|
| | -| 1.20x|
|escape_utf8 | 3.465k| 3.734k|
| | -| 1.08x|
|escape_utf8_bin | 5.752k| 7.283k|
| | -| 1.27x|
```
* Fix WASM bullet/code indentation
* Use `console` code highlighting where appropriate
… which handles the prefix `$` correctly.
* Migrate feature proposal template to MarkDown
* Set language on code blocks
Profiling revealed that we were spending lots of time growing the buffer.
Buffer operations is definitely something we want to optimize, but for
this specific benchmark what we're interested in is UTF-8 scanning performance.
Each iteration of the two scaning benchmark were producing 20MB of JSON,
now they only produce 5MB.
Now:
```
== Encoding mostly utf8 (5001001 bytes)
ruby 3.4.0dev (2024-10-18T19:01:45Z master https://github.com/ruby/json/commit/7be9a333ca) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
json 35.000 i/100ms
oj 36.000 i/100ms
rapidjson 10.000 i/100ms
Calculating -------------------------------------
json 359.161 (± 1.4%) i/s (2.78 ms/i) - 1.820k in 5.068542s
oj 359.699 (± 0.6%) i/s (2.78 ms/i) - 1.800k in 5.004291s
rapidjson 99.687 (± 2.0%) i/s (10.03 ms/i) - 500.000 in 5.017321s
Comparison:
json: 359.2 i/s
oj: 359.7 i/s - same-ish: difference falls within error
rapidjson: 99.7 i/s - 3.60x slower
```
https://github.com/ruby/json/commit/1a338532d2
Now that we've inlined the eden_heap into the size_pool, we should
rename the size_pool to heap. So that Ruby contains multiple heaps, with
different sized objects.
The term heap as a collection of memory pages is more in memory
management nomenclature, whereas size_pool was a name chosen out of
necessity during the development of the Variable Width Allocation
features of Ruby.
The concept of size pools was introduced in order to facilitate
different sized objects (other than the default 40 bytes). They wrapped
the eden heap and the tomb heap, and some related state, and provided a
reasonably simple way of duplicating all related concerns, to provide
multiple pools that all shared the same structure but held different
objects.
Since then various changes have happend in Ruby's memory layout:
* The concept of tomb heaps has been replaced by a global free pages list,
with each page having it's slot size reconfigured at the point when it
is resurrected
* the eden heap has been inlined into the size pool itself, so that now
the size pool directly controls the free_pages list, the sweeping
page, the compaction cursor and the other state that was previously
being managed by the eden heap.
Now that there is no need for a heap wrapper, we should refer to the
collection of pages containing Ruby objects as a heap again rather than
a size pool