The symptom was:
> [..]/expat/tests/alloc_tests.c:326:26: error: narrowing conversion from 'unsigned int' to signed type 'int' is implementation-defined [bugprone-narrowing-conversions,-warnings-as-errors]
> 326 | g_allocation_count = i;
> | ^
> [..]/expat/tests/alloc_tests.c:437:26: error: narrowing conversion from 'unsigned int' to signed type 'int' is implementation-defined [bugprone-narrowing-conversions,-warnings-as-errors]
> 437 | g_allocation_count = i;
> | ^
> [..]/expat/tests/basic_tests.c:415:47: error: narrowing conversion from 'unsigned int' to signed type 'int' is implementation-defined [bugprone-narrowing-conversions,-warnings-as-errors]
> 415 | if (_XML_Parse_SINGLE_BYTES(g_parser, text, first_chunk_bytes, XML_FALSE)
> | ^
> [..]/expat/tests/basic_tests.c:421:34: error: narrowing conversion from 'unsigned long' to signed type 'int' is implementation-defined [bugprone-narrowing-conversions,-warnings-as-errors]
> 421 | sizeof(text) - first_chunk_bytes - 1,
> | ^
> [..]/expat/tests/handlers.c:92:37: error: narrowing conversion from 'XML_Size' (aka 'unsigned long') to signed type 'int' is implementation-defined [bugprone-narrowing-conversions,-warnings-as-errors]
> 92 | StructData_AddItem(storage, name, XML_GetCurrentColumnNumber(g_parser),
> | ^
> [..]/expat/tests/handlers.c:93:22: error: narrowing conversion from 'XML_Size' (aka 'unsigned long') to signed type 'int' is implementation-defined [bugprone-narrowing-conversions,-warnings-as-errors]
> 93 | XML_GetCurrentLineNumber(g_parser), STRUCT_START_TAG);
> | ^
> [..]/expat/tests/handlers.c:99:37: error: narrowing conversion from 'XML_Size' (aka 'unsigned long') to signed type 'int' is implementation-defined [bugprone-narrowing-conversions,-warnings-as-errors]
> 99 | StructData_AddItem(storage, name, XML_GetCurrentColumnNumber(g_parser),
> | ^
> [..]/expat/tests/handlers.c:100:22: error: narrowing conversion from 'XML_Size' (aka 'unsigned long') to signed type 'int' is implementation-defined [bugprone-narrowing-conversions,-warnings-as-errors]
> 100 | XML_GetCurrentLineNumber(g_parser), STRUCT_END_TAG);
> | ^
> [..]/expat/tests/handlers.c:1279:26: error: narrowing conversion from 'unsigned int' to signed type 'int' is implementation-defined [bugprone-narrowing-conversions,-warnings-as-errors]
> 1279 | g_allocation_count = i;
> | ^
> [..]/expat/tests/misc_tests.c:73:26: error: narrowing conversion from 'unsigned int' to signed type 'int' is implementation-defined [bugprone-narrowing-conversions,-warnings-as-errors]
> 73 | g_allocation_count = i;
> | ^
> [..]/expat/tests/misc_tests.c:93:26: error: narrowing conversion from 'unsigned int' to signed type 'int' is implementation-defined [bugprone-narrowing-conversions,-warnings-as-errors]
> 93 | g_allocation_count = i;
> | ^
> [..]/expat/tests/nsalloc_tests.c:86:26: error: narrowing conversion from 'unsigned int' to signed type 'int' is implementation-defined [bugprone-narrowing-conversions,-warnings-as-errors]
> 86 | g_allocation_count = i;
> | ^
> [..]/expat/tests/nsalloc_tests.c:526:28: error: narrowing conversion from 'unsigned int' to signed type 'int' is implementation-defined [bugprone-narrowing-conversions,-warnings-as-errors]
> 526 | g_reallocation_count = i;
> | ^
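For reference, findings of this kind are typically addressed by making the
conversion explicit; a sketch of the pattern (not the exact change that was
applied):
    /* Explicit casts satisfy bugprone-narrowing-conversions, assuming the
       values are known to fit into an int. */
    g_allocation_count = (int)i;

    StructData_AddItem(storage, name,
                       (int)XML_GetCurrentColumnNumber(g_parser),
                       (int)XML_GetCurrentLineNumber(g_parser),
                       STRUCT_START_TAG);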
Please see commit 60dffa148c3ce26799cb933afdb0dc3581ad2098
("tests: Use normal XML_Parse in test_suspend_resume_internal_entity")
for more details on the related issue.
Related tests are:
- test_repeated_stop_parser_between_char_data_calls
- test_reset_in_entity
- test_resume_entity_with_syntax_error
- test_suspend_parser_between_cdata_calls
- test_suspend_parser_between_char_data_calls
- test_suspend_xdecl
In reaction to a finding by Berkay Eren Ürün.
Use of g_parser carries a risk of cross-test interference
and hence a risk of hard-to-catch bugs in the test suite,
so we want to get rid of g_parser altogether in the mid-term.
This removes the dependency on CLOCKS_PER_SEC that prevented this test
from running properly on some platforms, as well as the inherent
flakiness of time measurements.
Since later commits have introduced g_bytesScanned (and before that,
g_parseAttempts), we can use that value as a proxy for parse time
instead of clock().
The bypass works on the assumption that the application uses a
consistent fill size. Let's make some assertions about what should
happen when the application doesn't do that -- most importantly,
that parsing does happen eventually, and that the number of scanned
bytes doesn't explode.
The key is to have __attribute__((noreturn)) somewhere that clang-tidy
can see it. In this case, this is the _fail() function, which is
conditionally called from the assert_true() macro.
This will ensure that clang-tidy doesn't complain about NULL values
that we've asserted against in tests.
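A minimal sketch of the pattern (simplified; the signature of _fail() is
assumed for illustration and the real test harness differs in detail):
    /* With _fail() declared noreturn, clang-tidy knows that code after a
       failed assert_true() is unreachable, so values that were asserted
       non-NULL are not flagged as possibly NULL later on. */
    #if defined(__GNUC__) || defined(__clang__)
    __attribute__((noreturn))
    #endif
    void _fail(const char *file, int line, const char *msg);

    #define assert_true(cond)                                        \
      do {                                                            \
        if (! (cond)) {                                               \
          _fail(__FILE__, __LINE__, "expected true: " #cond);         \
        }                                                             \
      } while (0)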
...instead of only when approaching the maximum buffer size INT_MAX/2+1.
We'd like to give applications a chance to finish parsing a large token
before buffer reallocation, in case the reallocation fails.
By bypassing the reparse deferral heuristic when getting close to
filling the buffer, we give them this chance -- if the whole token is
present in the buffer, it will be parsed at that time.
This may come at the cost of some extra reparse attempts. For a token
of n bytes, these extra parses cause us to scan over a maximum of
2n bytes (... + n/8 + n/4 + n/2 + n). Therefore, parsing of big tokens
remains O(n) with regard to how many bytes we scan in attempts to parse. The
cost in reality is lower than that, since the reparses that happen due
to the bypass will affect m_partialTokenBytesBefore, delaying the next
ratio-based reparse. Furthermore, only the first token that "breaks
through" a buffer ceiling takes that extra reparse attempt; subsequent
large tokens will only bypass the heuristic if they manage to hit the
new buffer ceiling.
Note that this cost analysis depends on the assumption that Expat grows
its buffer by doubling it (or, more generally, grows it exponentially).
If this changes, the cost of this bypass may increase. Hopefully, this
would be caught by test_big_tokens_take_linear_time or the new test.
The bypass logic assumes that the application uses a consistent fill.
If the app increases its fill size, it may miss the bypass (and the
normal heuristic will apply). If the app decreases its fill size, the
bypass may be hit multiple times for the same buffer size. The very
worst case would be to always fill half of the remaining buffer space,
in which case parsing of a large n-byte token becomes O(n log n).
As an added bonus, the new test case should be faster than the old one,
since it doesn't have to go all the way to 1GiB to check the behavior.
Finally, this change necessitated a small modification to two existing
tests related to reparse deferral. These tests are testing the deferral
enabled setting, and assume that reparsing will not happen for any other
reason. By pre-growing the buffer, we make sure that this new bypass
does not affect those test cases.
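Pre-growing can be done with a single large XML_GetBuffer() request at the
start of the test; a sketch (the size is illustrative, not what the tests
actually use):
    /* Growing Expat's internal buffer once up front keeps the later, small
       fills in the test far away from any buffer ceiling, so the bypass
       cannot trigger and only the deferral setting itself is exercised. */
    void *buf = XML_GetBuffer(parser, 1024 * 1024); /* illustrative size */
    assert_true(buf != NULL);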
For huge tokens, we may end up in a situation where the partial token
parse deferral heuristic demands more bytes than Expat's maximum buffer
size (currently ~half of INT_MAX) could fit.
INT_MAX/2 is 1024 MiB on most systems. Clearly, a token of 950 MiB could
fit in that buffer, but the reparse threshold might be such that
callProcessor() will defer it, allowing the app to keep filling the
buffer until XML_GetBuffer() eventually returns a memory error.
By bypassing the heuristic when we're getting close to the maximum
buffer size, it will once again be possible to parse tokens in the size
range INT_MAX/2/ratio < size < INT_MAX/2 reliably.
We subtract the last buffer fill size as a way to detect that the next
XML_GetBuffer() call has a risk of returning a memory error -- assuming
that the application is likely to keep using the same (or smaller) fill.
We subtract XML_CONTEXT_BYTES because that's the maximum number of bytes
that could remain at the start of the buffer, preceding the partial
token. Technically, it could be fewer bytes, but XML_CONTEXT_BYTES is
normally small relative to INT_MAX, and is much simpler to use.
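Put together, the condition sketched below captures the idea; the names are
illustrative and the actual xmlparse.c logic differs in detail:
    #include <limits.h>
    #include <stddef.h>

    #ifndef XML_CONTEXT_BYTES
    #  define XML_CONTEXT_BYTES 1024 /* Expat's default context size */
    #endif

    /* Parse now (instead of deferring) if one more fill of the size the
       application last used, plus the context bytes that may precede the
       partial token, could push us past the maximum buffer size. */
    static int
    near_maximum_buffer_size(size_t bytes_in_buffer, size_t last_fill_size) {
      const size_t maximum_buffer_size = (size_t)INT_MAX / 2 + 1;
      const size_t margin = last_fill_size + XML_CONTEXT_BYTES;
      if (margin >= maximum_buffer_size)
        return 1;
      return bytes_in_buffer >= maximum_buffer_size - margin;
    }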
Co-authored-by: Sebastian Pipping <sebastian@pipping.org>
The test is essentially a copy of the existing test for the setter,
adapted to run on the external parser instead of the original one.
Suggested-by: Sebastian Pipping <sebastian@pipping.org>
CI-fighting-assistance-by: Sebastian Pipping <sebastian@pipping.org>
len=0 was previously only OK if a non-zero-length call had been made before.
It makes sense to allow an application to work the same way on a
newly-created parser, and not have to care if its incoming buffer
happens to be 0.
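A hedged sketch of the scenario this enables (the exact status codes involved
are not spelled out here):
    #include <expat.h>

    /* An application whose first read happens to return 0 bytes can now
       pass that straight to a fresh parser without special-casing the
       empty chunk. */
    static void
    zero_length_first_chunk(void) {
      XML_Parser parser = XML_ParserCreate(NULL);
      (void)XML_Parse(parser, "", 0, /*isFinal=*/XML_FALSE);
      XML_ParserFree(parser);
    }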
If we always run with the heuristic enabled, it may hide some bugs by
grouping up input into bigger parse attempts.
CI-fighting-assistance-by: Sebastian Pipping <sebastian@pipping.org>
When the parse buffer contains the starting bytes of a token but not
all of them, we cannot parse the token to completion. We call this a
partial token. When this happens, the parse position is reset to the
start of the token, and the parse() call returns. The client is then
expected to provide more data and call parse() again.
In extreme cases, this means that the bytes of a token may be parsed
many times: once for every buffer refill required before the full token
is present in the buffer.
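For context, a typical refill loop over the public API looks roughly like
this (FILL_SIZE and the use of a FILE stream are illustrative):
    #include <stdio.h>
    #include <expat.h>

    #define FILL_SIZE 4096 /* illustrative chunk size */

    /* With an incomplete token in the buffer, every XML_ParseBuffer() call
       below rescans that token from its first byte. */
    static enum XML_Status
    parse_stream(XML_Parser parser, FILE *stream) {
      for (;;) {
        void *buf = XML_GetBuffer(parser, FILL_SIZE);
        if (buf == NULL)
          return XML_STATUS_ERROR; /* out of memory */
        const size_t len = fread(buf, 1, FILL_SIZE, stream);
        const enum XML_Status status
            = XML_ParseBuffer(parser, (int)len, /*isFinal=*/len == 0);
        if (status != XML_STATUS_OK || len == 0)
          return status;
      }
    }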
Math:
Assume there's a token of T bytes
Assume the client fills the buffer in chunks of X bytes
We'll try to parse X, 2X, 3X, 4X ... until mX == T (technically >=)
That's (m²+m)X/2 = (T²/X+T)/2 bytes parsed (arithmetic progression)
While it is alleviated by larger refills, this amounts to O(T²)
Expat grows its internal buffer by doubling it when necessary, but has
no way to inform the client about how much space is available. Instead,
we add a heuristic that skips parsing when we've repeatedly stopped on
an incomplete token. Specifically:
* Only try to parse if we have a certain amount of data buffered
* Every time we stop on an incomplete token, double the threshold
* As soon as any token completes, the threshold is reset
This means that when we get stuck on an incomplete token, the threshold
grows exponentially, effectively making the client perform larger buffer
fills, limiting how many times we can end up re-parsing the same bytes.
Math:
Assume there's a token of T bytes
Assume the client fills the buffer in chunks of X bytes
We'll try to parse X, 2X, 4X, 8X ... until (2^k)X == T (or larger)
That's (2^(k+1)-1)X bytes parsed -- e.g. 15X if T = 8X
This is equal to 2T-X, which amounts to O(T)
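In pseudo-C, the heuristic amounts to something like this (illustrative
names, not the actual xmlparse.c fields):
    #include <stddef.h>

    static size_t threshold = 0; /* minimum buffered bytes before parsing */

    static int
    should_attempt_parse(size_t bytes_buffered) {
      return bytes_buffered >= threshold;
    }

    static void
    stopped_on_partial_token(size_t bytes_buffered) {
      /* Stopped on an incomplete token: demand twice as much data before
         the next attempt. */
      threshold = bytes_buffered * 2;
    }

    static void
    completed_a_token(void) {
      threshold = 0; /* any completed token resets the threshold */
    }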
We could've chosen a faster growth rate, e.g. 4 or 8. Those seem to
increase performance further, at the cost of further increasing the
risk of growing the buffer more than necessary. This can easily be
adjusted in the future, if desired.
This is all completely transparent to the client, except for:
1. possible delay of some callbacks (when our heuristic overshoots)
2. apps that never do isFinal=XML_TRUE could miss data at the end
For the affected testdata, this change shows a 100-400x speedup.
The recset.xml benchmark shows no clear change either way.
Before:
benchmark -n ../testdata/largefiles/recset.xml 65535 3
3 loops, with buffer size 65535. Average time per loop: 0.270223
benchmark -n ../testdata/largefiles/aaaaaa_attr.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 15.033048
benchmark -n ../testdata/largefiles/aaaaaa_cdata.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 0.018027
benchmark -n ../testdata/largefiles/aaaaaa_comment.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 11.775362
benchmark -n ../testdata/largefiles/aaaaaa_tag.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 11.711414
benchmark -n ../testdata/largefiles/aaaaaa_text.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 0.019362
After:
./run.sh benchmark -n ../testdata/largefiles/recset.xml 65535 3
3 loops, with buffer size 65535. Average time per loop: 0.269030
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_attr.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 0.044794
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_cdata.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 0.016377
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_comment.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 0.027022
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_tag.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 0.099360
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_text.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 0.017956
When the parser is suspended, _XML_Parse_SINGLE_BYTES() will return
early. At that point, there could be some amount of bytes that haven't
been fed into Expat at all yet. This leaves us with an incomplete
document.
Furthermore, the last internal XML_Parse() call with isFinal=XML_TRUE
will not have happened, so the parser will not know that no more input
is to be expected. This is what allowed the test to pass when it was
originally changed to use SINGLE_BYTES.
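Roughly, the helper behaves like the following sketch (simplified; the real
helper in the test suite differs in detail):
    /* Feed one byte at a time; if any call does not return XML_STATUS_OK
       (e.g. XML_STATUS_SUSPENDED), return immediately. The remaining bytes
       and the final isFinal=XML_TRUE call are then never issued. */
    static enum XML_Status
    parse_single_bytes(XML_Parser parser, const char *s, int len) {
      for (int i = 0; i < len; i++) {
        const enum XML_Status res = XML_Parse(parser, s + i, 1, XML_FALSE);
        if (res != XML_STATUS_OK)
          return res;
      }
      return XML_Parse(parser, s + len, 0, XML_TRUE);
    }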
With the new partial token heuristic, the lack of a final parse call
means that we don't even reach the "Ho" text, and fail the test.
The simplest solution is to go back to using XML_Parse() in this test.
Another option would be to let SINGLE_BYTES expose how far it got in
its loop, allowing for later continuation, but it doesn't seem worth the
extra complexity.
Until now, the buffer size to grow to has been calculated based on the
distance from the current parse position to the end of the buffer. This
means that the size of any already-parsed data was not considered,
leading to inconsistent buffer growth.
There was also a special case in XML_Parse() when XML_CONTEXT_BYTES was
zero, where the buffer size would be set to twice the incoming string
length. This patch replaces that special case with an XML_GetBuffer() call.
Growing the buffer based on its total size makes its growth consistent.
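An illustrative contrast of the two calculations (made-up names, not the
actual xmlparse.c code):
    #include <stddef.h>

    /* Both versions double until the request fits, but they start from a
       different base. */
    static size_t
    grow_from(size_t base, size_t needed) {
      size_t size = base > 0 ? base : 1;
      while (size < needed)
        size *= 2;
      return size;
    }

    /* Before: grow_from(bufferEnd - parsePosition, needed)
         -- depends on how much data has already been parsed.
       After:  grow_from(totalBufferSize, needed)
         -- depends only on the buffer's total size, so growth is
            consistent regardless of previously parsed content. */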
The commit includes a test that checks that we can reach the max buffer
size (usually INT_MAX/2 + 1) regardless of previously parsed content.
GitHub CI couldn't allocate the full 1GiB with MinGW/wine32, though it
works locally with the same compiler and wine version. As a workaround,
the test tries to malloc 1GiB, and reduces `maxbuf` to 512MiB in case
of failure.
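A sketch of that workaround, as it might appear inside the test function
(variable names other than maxbuf are illustrative):
    /* Probe whether a full 1 GiB allocation works at all; if not, run the
       test with a 512 MiB ceiling instead. */
    int maxbuf = INT_MAX / 2 + 1; /* 1 GiB with a 32-bit int */
    void *probe = malloc((size_t)maxbuf);
    if (probe == NULL)
      maxbuf = 512 * 1024 * 1024; /* 512 MiB */
    free(probe); /* free(NULL) is a no-op */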
All tests now run one instance where SINGLE_BYTES is equivalent to a
single XML_Parse call. Using SINGLE_BYTES therefore gives more coverage,
as evidenced by the new failure we now have to avoid in the test, until
it can be fixed.
All tests now run one instance where SINGLE_BYTES is equivalent to a
single XML_Parse call. There is no longer a need for individual tests
to switch between them.
- Start treating -DXML_CONTEXT_BYTES=0 as "no context"
rather than "context of size 0". Was documented as
"must be set to a positive integer", previously.
- Enforce that macro XML_CONTEXT_BYTES is defined at build time to
avoid accidental misbuilds lacking context in environments that
bypass both of Expat's official build systems.
- Detect and reject use of negative context size at compile time.
Before a parse call with isFinal=XML_TRUE, there is no guarantee that
all supplied data has been parsed. Removing the first comment count
check removes the test's assumption of such a guarantee.
...instead of a full-string match.
These tests were depending on getting handler callbacks with exactly
one character of data at a time. For example, if test_abort_epilog got
"\n\r\n" in one callback, it would fail to match on the '\r', and would
not abort parsing as expected.
By searching the callback arg for the magic character rather than
expecting a full match, the test no longer depends on exact callback
timing.
`userData` is never NULL in these tests, so that check was left out of
the new version.
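The handler pattern now looks roughly like this (a sketch with an
illustrative name; the real handlers in the test suite differ in detail):
    #include <expat.h>

    extern XML_Parser g_parser; /* the test suite's shared parser */

    static void XMLCALL
    aborting_character_handler(void *userData, const XML_Char *s, int len) {
      (void)userData; /* never NULL in these tests, but not needed here */
      /* Abort as soon as the magic character appears anywhere in the
         callback data, instead of requiring an exact one-character match. */
      for (int i = 0; i < len; i++) {
        if (s[i] == '\r') {
          XML_StopParser(g_parser, /*resumable=*/XML_FALSE);
          return;
        }
      }
    }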
Instead of testing the exact number and sequence of callbacks, we now
test that we get the exact data lengths and sequence of callbacks. The
checks become much more verbose, but will now accept any buffer fill
strategy -- single bytes, multiple bytes, or any combination thereof.