This affects string formatting as well as bytes and bytearray formatting.
* For errors in the format string, always include the position of the
start of the format unit.
* For errors related to the formatted arguments, always include the number
or the name of the formatted argument.
* Suggest more probable causes of errors in the format string (stray %,
unsupported format, unexpected character).
* Provide more information when the number of arguments does not match
the number of format units.
* Raise more specific errors when access to arguments by name is mixed with
sequential access, and when * is used with a mapping (see the examples below).
* Add tests for some uncovered cases.
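A few of the affected cases, as a rough sketch (the exact wording of the
new messages is not reproduced here):
```python
# Format-string mistakes whose errors were improved; exact messages may vary.
cases = [
    lambda: "%d %" % (1,),          # stray '%' in the format string
    lambda: "%d %d" % (1,),         # fewer arguments than format units
    lambda: "%(a)s %s" % {"a": 1},  # access by name mixed with sequential access
    lambda: "%(a)*d" % {"a": 1},    # '*' used with a mapping
]
for case in cases:
    try:
        case()
    except Exception as exc:
        print(f"{type(exc).__name__}: {exc}")
```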
Optimize bytes.translate() by deferring change detection
Move the equality check out of the hot loop to allow better compiler
optimization. Instead of checking each byte during translation, perform
a single memcmp at the end to determine if the input can be returned
unchanged.
This allows compilers to unroll and pipeline the loops, resulting in a ~2x
throughput improvement for medium-to-large inputs (tested on an AMD Zen 2).
No change was observed on small inputs.
It will also be faster for bytes subclasses: since the input can never be
returned unchanged for a subclass, change detection is skipped entirely.
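A rough Python analogy of the new structure (the real implementation is C):
```python
# The hot loop only translates; one bulk comparison at the end (memcmp in C)
# decides whether the original object can be returned unchanged.
def translate(data: bytes, table: bytes) -> bytes:
    out = bytes(table[b] for b in data)   # no per-byte "did it change?" check
    return data if out == data else out   # single comparison after the loop
```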
When specializing to `LOAD_GLOBAL_MODULE` or `LOAD_ATTR_MODULE`, try to
enable deferred reference counting for the value if the object is owned by
a different thread. This applies to the free-threaded build only and should
improve the scaling of multi-threaded programs.
Speed up conversion from bytes-like objects such as `bytearray`, while
keeping conversion from `bytes` stable.
Co-authored-by: Sergey B Kirpichev <skirpichev@gmail.com>
Co-authored-by: Victor Stinner <vstinner@python.org>
When comparing a negative non-integer float with an int that has the same
number of bits in its integer part, an int subclass whose __neg__() returned
a non-int caused an assertion error.
Now the integer is no longer negated. Also reduced the number of temporary
Python objects created.
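A minimal sketch of the scenario (the class name is illustrative):
```python
# An int subclass whose __neg__ returns a non-int. Comparing it with a
# negative non-integer float that has the same number of integer-part bits
# used to reach the negation path and trip the assertion (debug builds).
class BadNeg(int):
    def __neg__(self):
        return 1.0   # deliberately not an int

print(-2.5 < BadNeg(-2))   # -> True, computed without negating the integer
```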
The PyObject header reference count fields must be initialized using
atomic operations because they may be concurrently read by another
thread (e.g., from `_Py_TryIncref`).
`pycore_optimizer.h` was included redundantly in
Objects/frameobject.c and Python/instrumentation.c.
Both includes are unnecessary and can be safely removed.
No functional change.
Signed-off-by: Yongtao Huang <yongtaoh2022@gmail.com>
Co-authored-by: Ken Jin <28750310+Fidget-Spinner@users.noreply.github.com>
Co-authored-by: Brandt Bucher <brandt@python.org>
Co-authored-by: Hugo van Kemenade <1324225+hugovk@users.noreply.github.com>
The uses of memmove and _Py_memory_repeat were not thread-safe in the
free-threaded build in some cases. In theory, memmove and _Py_memory_repeat
may copy byte-by-byte instead of pointer-by-pointer, so concurrent readers
could see uninitialized data or tearing.
Additionally, we should be using "release" (or stronger) ordering to be
compliant with the C11 memory model when copying objects within a list.
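A rough sketch of the kind of workload this hardens (free-threaded build;
that these operations hit the affected code paths is my assumption):
```python
import threading

# One thread keeps rewriting a shared list (list repeat and slice assignment
# copy item pointers internally) while others iterate it; readers must never
# observe torn or uninitialized item pointers.
lst = list(range(16))

def writer():
    for _ in range(10_000):
        lst[:] = lst[:8] * 2

def reader():
    for _ in range(10_000):
        for item in lst:
            assert isinstance(item, int)

threads = [threading.Thread(target=writer)]
threads += [threading.Thread(target=reader) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```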
This makes generator frame state transitions atomic in the free-threaded
build, which avoids segfaults when trying to execute a generator from
multiple threads concurrently (a sketch follows the list below).
There are still a few operations that aren't thread-safe and may crash
if performed concurrently on the same generator/coroutine:
* Accessing gi_yieldfrom/cr_await/ag_await
* Accessing gi_frame/cr_frame/ag_frame
* Async generator operations
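A minimal sketch of the now-crash-free pattern: each thread either resumes
the shared generator or gets the usual "generator already executing" error.
```python
import threading

def gen():
    for i in range(100_000):
        yield i

g = gen()

def consume():
    while True:
        try:
            next(g)
        except ValueError:      # generator already executing in another thread
            pass
        except StopIteration:
            return

threads = [threading.Thread(target=consume) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```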
Now that we specialize range iteration in the interpreter for the common
case where the iterator has only one reference, there's not a
significant performance cost to making the iteration thread-safe.
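For example, one range iterator can now be shared across threads (a sketch;
with thread-safe iteration every item is yielded exactly once overall):
```python
import threading

it = iter(range(100_000))
counts = []

def worker():
    n = 0
    for _ in it:        # all threads drain the same range iterator
        n += 1
    counts.append(n)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(counts))      # 100000 in total across the threads
```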
This combines most _PyStackRef functions and macros between the
free-threaded and default builds.
- Remove Py_TAG_DEFERRED (same as Py_TAG_REFCNT)
- Remove PyStackRef_IsDeferred (same as !PyStackRef_RefcountOnObject)
When a `str` is encoded in `bytearray.__init__`, the encoder typically
creates a new, unique bytes object. Rather than allocating new memory and
copying the bytes, use the already created bytes object as the bytearray's
backing storage. The bigger the `str`, the bigger the saving.
Mean +- std dev: [main_encoding] 497 us +- 9 us -> [encoding] 14.2 us +- 0.3 us: 34.97x faster
```python
import pyperf

runner = pyperf.Runner()
runner.timeit(
    name="encode",
    setup="a = 'a' * 1_000_000",
    stmt="bytearray(a, encoding='utf8')",
)
```
We need to use release/acquire ordering for the 'mask' member of the set
structure. Without this, `set_lookkey_threadsafe()` could be looking at
the old value of `table` but the new value of `mask`.
This roughly follows what was done for dictobject to make a lock-free
lookup operation. With this change, the set contains operation scales much
better when used from multiple threads. The frozenset contains performance
seems unchanged (it was already lock-free).
Summary of changes:
* refactor set_lookkey() into set_do_lookup(), which now takes a function
pointer that does the entry comparison. This is similar to dictobject's
do_lookup(). In an optimized build, the comparison function is inlined and
there should be no performance cost to this.
* change set_do_lookup() to return a status separately from the entry value
* add set_compare_frozenset() and use it if the object is a frozenset. For the
free-threaded build, this avoids some overhead (locking, atomic operations,
incref/decref on key)
* use FT_ATOMIC_* macros as needed for atomic loads and stores
* use a deferred free on the set table array, if shared (only on free-threaded
build, normal build always does an immediate free)
* for the free-threaded build, use an explicit for loop to zero the table,
rather than memset()
* when mutating the set, assign so->table to NULL while the change is
happening. Assign the real table array after the change is done.
There are places we use "relaxed" loads where C11 requires "consume" or
stronger. Unfortunately, compilers don't really implement "consume", so we
fake it in a way that avoids upsetting TSan.
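A rough sketch of the workload that benefits (free-threaded build; set size
and thread count are illustrative):
```python
import threading

s = set(range(10_000))

def probe():
    hits = 0
    for i in range(1_000_000):
        if (i % 20_000) in s:   # membership test: now a lock-free read
            hits += 1

threads = [threading.Thread(target=probe) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```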