Followup to #2318, which accidentally made zlib required.
Tested locally by increasing the required version in CMakeLists.txt to 1.4.1
(which does not exist yet), and confirming that the build reports that a
suitable version of zlib was not found while the build continued.
Before writing a zip entry, its pathname might be modified for two
reasons:
1. A path using Windows path separators will be converted to POSIX style.
2. A path using the local encoding will be transcoded if a target charset is
set.
We must make sure these two mechanisms can coexist without overwriting each
other.
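As a rough illustration of the ordering concern, here is a minimal sketch using hypothetical helpers (`to_posix_separators` and `transcode` are not libarchive APIs); the point is simply that the transcoding step has to operate on the already fixed-up path rather than on a fresh copy of the original:
```c
#include <stddef.h>

/* Hypothetical helpers for illustration only; not libarchive APIs. */
char *to_posix_separators(const char *path);             /* '\\' -> '/' */
char *transcode(const char *path, const char *charset);  /* local -> target charset */

/* Sketch: apply the separator fix-up first, then transcode the already
 * fixed-up path, so that neither step silently discards the other's result. */
static char *
prepare_zip_pathname(const char *entry_path, const char *charset)
{
	char *fixed = to_posix_separators(entry_path);
	return (charset != NULL) ? transcode(fixed, charset) : fixed;
}
```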
zlib 1.2.0 added this improvement for inflate:
"Raw inflate no longer needs an extra dummy byte at end"
libarchive does not feed zlib extra data beyond end of stream, so it
does not work with zlib < 1.2.0.
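As a sketch of the corresponding guard, assuming `ZLIB_VERNUM` is available (only very old zlib releases lack it, and those would be rejected anyway); the actual change raises the minimum version in the build system rather than in a header:
```c
#include <zlib.h>

/* Sketch of a compile-time guard: raw inflate in zlib < 1.2.0 needs a
 * dummy byte past the end of the stream, which libarchive never
 * supplies, so those versions cannot be used. */
#if !defined(ZLIB_VERNUM) || ZLIB_VERNUM < 0x1200
#error "zlib >= 1.2.0 is required"
#endif
```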
With 26 and 27, the sub-test pushes 2G and 4G of memory respectively.
There is no particular reason why we need to push for higher limits
here, so let's pick 23, which weighs in at around 0.25G. The test suite overall
is in the 0.25 - 0.5G range, and this fits perfectly.
Closes: https://github.com/libarchive/libarchive/issues/2080
Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com>
Fix `test_write_format_zip_stream` failure when `HAVE_ZLIB_H` is not
defined.
If `libz` is present, `zip` archives are compressed by default,
which requires `zip_version=20`. Otherwise, the archive is not
compressed and only requires `zip_version=10`. I'm building libarchive
on a machine not intended for development, so basically there are no optional
dependencies like `libz` available; I guess that's why nobody else has
reported this issue.
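A minimal sketch of the expectation, assuming a hypothetical `le16` helper and the standard ZIP local-file-header layout ("version needed to extract" at offset 4); this is not the actual test code:
```c
#include <stdint.h>

/* Hypothetical helper: read a 16-bit little-endian value. */
static uint16_t
le16(const unsigned char *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* Sketch: the "version needed to extract" we should expect depends on
 * whether zlib was available at build time (deflate vs. stored). */
static uint16_t
expected_zip_version(void)
{
#ifdef HAVE_ZLIB_H
	return 20;	/* 2.0: deflate compression */
#else
	return 10;	/* 1.0: stored (uncompressed) entries */
#endif
}

/* In a ZIP local file header, "version needed to extract" is at offset 4. */
static int
local_header_version_ok(const unsigned char *p)
{
	return le16(p + 4) == expected_zip_version();
}
```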
Pax introduced new headers that appear _before_ the legacy
headers. So pax archives require earlier properties to
override later ones.
Originally, libarchive handled this by storing the early
headers in memory so that it could do the actual parsing
from back to front. With this scheme, properties from
early headers were parsed last and simply overwrote
properties from later headers.
PR #2127 reduced memory usage by parsing headers in the
order they appear in the file, which requires later headers
to avoid overwriting already-set properties. Apparently,
when I made this change, I did not fully consider how charset
translations are handled on Windows, and so failed to consistently
recognize when the path or linkname properties were actually
set. As a result, the legacy path/link values (which have
no charset information) overwrote the pax path/link values (which
are known to be UTF-8), leading to the behavior observed in
#2248. This PR corrects the bug by adding additional
tests to see whether the wide-character path or linkname properties
are set.
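A minimal sketch of the idea, using public `archive_entry` accessors to stand in for the internal checks the PR actually adds:
```c
#include <archive_entry.h>

/* Sketch: only take the pathname from a legacy ustar header if no
 * earlier pax header already set it, in either narrow or wide form.
 * (The real change inspects the entry's internal string state.) */
static void
maybe_set_legacy_pathname(struct archive_entry *entry, const char *ustar_name)
{
	if (archive_entry_pathname(entry) == NULL &&
	    archive_entry_pathname_w(entry) == NULL)
		archive_entry_copy_pathname(entry, ustar_name);
}
```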
Related: This bug was exposed by a new test added in #2228
which does a write/read validation to ensure round-trip filename
handling. This was modified in #2248 to avoid tickling the bug above.
I've reverted the change from #2248 since it's no longer necessary.
I have also added some additional validation to this test to
help ensure that the intermediate archive actually is a pax
format that includes the expected path and linkname properties
in the expected places.
It would seem as though #2127 conflicted with my change #2228.
I previously thought that the writer was recording in the archive
that strings were encoded in UTF-8, but I'm not so sure of that
anymore... In any case, explicitly setting `hdrcharset` on the reader as
well is a reasonable alternative and something we do already.
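For illustration, a minimal sketch of setting `hdrcharset` on the reader (the option name is real; the surrounding setup is generic):
```c
#include <archive.h>

/* Sketch: tell the tar reader to treat header strings as UTF-8,
 * mirroring what the writer side is already configured to do. */
static struct archive *
open_reader_with_utf8_headers(void)
{
	struct archive *a = archive_read_new();
	archive_read_support_format_tar(a);
	archive_read_set_options(a, "hdrcharset=UTF-8");
	return a;
}
```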
The RAR5 reader uses a small stack of cached pointers to submit the
rendered data to the caller. In malformed files it's possible for this
pointer cache to become desynchronized from the memory buffer those pointers
point into, making libarchive crash on an invalid memory access.
OSS-Fuzz Issue: 70024
In particular, this ensures that we cannot overflow rounding-up
calculations. Recent tar changes put in a lot of sanity limits on the
sizes of particular kinds of data, but the usual behavior in most cases
was to skip over-large values. The skipping behavior required
rounding-up and accumulating values that could potentially overflow
64-bit integers. This adds some coarser checks that fail more directly
when an entry claims to be more than 1 exbibyte (2^60 bytes), avoiding
any possibility of numeric overflow along these paths.
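A minimal sketch of such a coarse check; the constant name is illustrative and the errno value stands in for libarchive's internal error codes:
```c
#include <archive.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative limit: anything above 1 EiB (2^60 bytes) is rejected
 * outright, well before any round-up arithmetic could overflow. */
#define MAX_SANE_ENTRY_SIZE ((int64_t)1 << 60)

static int
check_entry_size(struct archive *a, int64_t claimed_size)
{
	if (claimed_size < 0 || claimed_size > MAX_SANE_ENTRY_SIZE) {
		archive_set_error(a, EINVAL,
		    "Tar entry claims an unreasonably large size");
		return (ARCHIVE_FATAL);
	}
	return (ARCHIVE_OK);
}
```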
OSS-Fuzz Issue: 70062
This is somewhat academic, since we don't actually expose any of the
ISO9660 header information that is stored in 17-byte date format, but
inspection revealed an off-by-one error in the parsing here.
This also proved a nice motivation to fill in some verification in our
most basic ISO9660 test case.
Currently updating archivemount which does
```c
pwd = getpwuid(st.st_uid);
if (pwd)
	archive_entry_set_uname(node->entry, strdup(pwd->pw_name));
grp = getgrgid(st.st_gid);
if (grp)
	archive_entry_set_gname(node->entry, strdup(grp->gr_name));
```
and I'm assuming the strdups are actually leaks? The manual is silent on
this.
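For reference, a hedged sketch of the leak-free call, assuming (as the question does) that `archive_entry_set_uname`/`archive_entry_set_gname` copy the string they are given:
```c
#include <archive_entry.h>
#include <grp.h>
#include <pwd.h>
#include <sys/stat.h>

/* Sketch: if the setters copy their argument, the strdup() calls are
 * unnecessary and the duplicated strings leak; pass the fields directly. */
static void
set_owner_names(struct archive_entry *entry, const struct stat *st)
{
	struct passwd *pwd = getpwuid(st->st_uid);
	if (pwd != NULL)
		archive_entry_set_uname(entry, pwd->pw_name);

	struct group *grp = getgrgid(st->st_gid);
	if (grp != NULL)
		archive_entry_set_gname(entry, grp->gr_name);
}
```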
Previous code added `.XXXXXX` to the end of the filename to write the
mac metadata. This is a problem if the filename is at or near the
filesystem max path length. This reuses the same code used by
create_tempdatafork to ensure that the filename is not too long.
The fuzzer constructed an AFIO (CPIO variant) archive that had a
ridiculously large ino value, which caused an overflow of a signed
64-bit intermediate.
There are really three issues here:
* The CPIO parser was using a signed int64 as an intermediate type for
parsing numbers in all cases. I've addressed the overflow here by using
a uint64_t in the parser core, but left the resulting values as int64_t
(see the sketch after this list).
* The AFIO header parsing had no guards against ridiculously large
values; it now rejects an archive when the ino or size fields (which are
allowed to be up to 16 hex digits long) overflow int64_t and produce a
negative value.
* The archive_entry would accept negative values for gid/uid/size/ino.
I've altered those fields to treat any negative value as zero.
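A minimal sketch of the parser-core idea; the function name and exact digit handling are illustrative, not libarchive's:
```c
#include <stdint.h>

/* Sketch: accumulate hex digits into an unsigned 64-bit value so the
 * intermediate arithmetic cannot overflow a signed type, then reject
 * values that do not fit in int64_t (16 hex digits can exceed it). */
static int64_t
parse_hex16(const char *p, unsigned length)
{
	uint64_t v = 0;

	while (length-- > 0) {
		char c = *p++;
		if (c >= '0' && c <= '9')
			v = (v << 4) | (uint64_t)(c - '0');
		else if (c >= 'a' && c <= 'f')
			v = (v << 4) | (uint64_t)(c - 'a' + 10);
		else if (c >= 'A' && c <= 'F')
			v = (v << 4) | (uint64_t)(c - 'A' + 10);
		else
			break;
	}
	if (v > (uint64_t)INT64_MAX)
		return (-1);	/* caller treats this as a malformed header */
	return ((int64_t)v);
}
```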
There was one test that actually verified that we could read a field
with size = -1. I've updated that to verify that the resulting size is
zero instead.
OSS-Fuzz Issue: 70019
The Rar5 reader would read the name size, then read the name, and only then
check whether the name size was beyond the maximum allowed. This can
result in a very large memory allocation just to read a name. Instead, check
the name size before trying to read the name in order to avoid excessive
allocation.
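A minimal sketch of the reordered check; the limit constant is illustrative and the errno value stands in for libarchive's internal error codes:
```c
#include <archive.h>
#include <errno.h>
#include <stdint.h>

#define MAX_NAME_SIZE (64 * 1024)	/* illustrative limit */

/* Sketch: validate the declared name length before allocating or
 * reading anything, so a bogus header cannot force a huge allocation. */
static int
check_name_size(struct archive *a, uint64_t name_size)
{
	if (name_size > MAX_NAME_SIZE) {
		archive_set_error(a, EINVAL, "Filename is too long");
		return (ARCHIVE_FATAL);
	}
	return (ARCHIVE_OK);
}
```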
OSS-Fuzz Issue: 70017
I went through ~50 findings of SAST reports and identified a few of them
as true positives. I might still have missed some intended uses or some
magic in the code so please provide feedback if you think some of these
shouldn't be applied and why.
I explained the changes in the separate comments.
Microsoft's static analysis tool found some vulnerabilities from
unguarded null references that I changed in
[microsoft/cmake](https://github.com/microsoft/cmake). Pushing these
changes upstream so they can be added to
[kitware/cmake](https://github.com/Kitware/CMake).
The code currently uses `archive_entry_hardlink` to determine whether an
entry is a hardlink; however, on Windows this call will fail if the path
cannot be represented in the current locale. This instead checks whether
any entry in the `archive_mstring` is set.
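Roughly, the idea is the following, using public accessors for illustration (the actual change inspects the `archive_mstring` directly):
```c
#include <archive_entry.h>

/* Sketch: treat the entry as a hardlink if *any* form of the hardlink
 * string is present, not only the locale-dependent multibyte form. */
static int
entry_is_hardlink(struct archive_entry *entry)
{
	return archive_entry_hardlink(entry) != NULL ||
	    archive_entry_hardlink_w(entry) != NULL;
}
```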
All three parts of this change effectively stem from the same
assumption: most of the code in `archive_string.c` assumes that MBS <->
UTF-8 string conversion can be done directly and efficiently. This is
not quite true on Windows, where conversion looks more like MBS <-> WCS
<-> UTF-8. This results in a few inefficiencies currently present in the
code.
First, if the caller is asking for either the MBS or UTF-8 string, but
it's not currently set on the `archive_mstring`, then on Windows it's
more efficient to first check whether the WCS is set and convert
from that. Otherwise, we end up wastefully converting the MBS or UTF-8
string to a WCS intermediate that we already have.
Second, in the `archive_mstring_update_utf8` function, it's more
efficient on Windows to first convert to WCS and use that result to
convert to MBS, as opposed to the fallback I introduced in a previous
change, which converts UTF-8 to MBS first and disposes of the
intermediate WCS, only to re-calculate it.
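A rough sketch of the preferred order on Windows, with placeholder conversion helpers (`utf8_to_wcs` and `wcs_to_mbs` are not libarchive's internal functions):
```c
#include <stddef.h>
#include <wchar.h>

/* Placeholder conversion helpers, for illustration only. */
wchar_t *utf8_to_wcs(const char *utf8);
char    *wcs_to_mbs(const wchar_t *wcs);

/* Sketch: when the MBS form is requested but absent, prefer an
 * already-present WCS (one conversion) over deriving a new WCS from
 * the UTF-8 form (two conversions). */
static char *
get_mbs(char *mbs, const wchar_t *wcs, const char *utf8)
{
	if (mbs != NULL)
		return mbs;                           /* already available */
	if (wcs != NULL)
		return wcs_to_mbs(wcs);               /* direct WCS -> MBS */
	if (utf8 != NULL)
		return wcs_to_mbs(utf8_to_wcs(utf8)); /* unavoidable two-step */
	return NULL;
}
```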
We noticed an issue with an archive where, if you skipped the
first entry and tried to extract the second, you'd get a failure saying
`Truncated 7-Zip file body`. It turns out this is because the first
file in the archive is a multiple of 65,536 bytes (the size of the
uncompressed buffer): after `read_stream` skipped all of
the first file, `uncompressed_buffer_bytes_remaining` was set to zero
(because all data was consumed), and `get_uncompressed_data` was then
called with `minimum` set to zero. There,
`minimum > zip->uncompressed_buffer_bytes_remaining` evaluated to false,
so we read zero bytes, which got interpreted as a truncated
archive.
The fix here is simple: we now always call `extract_pack_stream` when
`uncompressed_buffer_bytes_remaining` is zero before exiting the
skipping loop.
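A minimal sketch of the fix's shape, with names taken from the description above rather than from the exact code:
```c
struct archive_read;			/* opaque here; for the sketch only */
struct sevenzip {			/* illustrative subset of the real context */
	unsigned long long uncompressed_buffer_bytes_remaining;
};
int extract_pack_stream(struct archive_read *a, unsigned long long minimum);

/* Sketch: at the end of the skipping loop, refill the uncompressed
 * buffer if it was fully consumed, so that a later call to
 * get_uncompressed_data() with minimum == 0 cannot read zero bytes
 * and report a truncated archive. */
static int
refill_after_skip(struct archive_read *a, struct sevenzip *zip)
{
	if (zip->uncompressed_buffer_bytes_remaining == 0)
		return extract_pack_stream(a, 0);
	return 0;
}
```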
In order to match cpio output, format the reference date with _at least_
12 bytes instead of _exactly_ 12 bytes. This should fix a gratuitous
test failure on certain systems that default to multi-byte locales.
The CSRG ISOs have a non-standard PVD layout with a 68-byte root
directory record (rather than the 34-byte record required by
ECMA119/ISO9660). I built a test image with this change and modified the
ISO9660 reader to accept it.
While I was working on the bid logic to recognize PVDs, I added a number
of additional correctness checks that should make our bidding a bit more
accurate. In particular, this should more than compensate for the
weakened check of the root directory record size.
Resolves #2232
This rebuilds the tar reader to parse all header data incrementally as
it appears in the stream.
This definitively fixes a longstanding issue with unsupported pax
attributes. Libarchive must limit the amount of data that it reads into
memory, and this has caused problems with large unknown attributes. By
scanning iteratively, we can instead identify an attribute by name and
then decide whether to read it into memory or whether to skip it without
reading.
This design also allows us to vary our sanity limits for different pax
attributes (e.g., an attribute that is a single number can be limited to
a few dozen bytes while an attribute holding an ACL is allowed to be a
few hundred kilobytes). This allows us to be a little more resistant to
malicious archives that might try to force allocation of very large
amounts of memory, though there is still work to be done here.
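For illustration, a sketch of what per-attribute limits could look like; the names and numbers are illustrative, not the actual values used:
```c
#include <stddef.h>

/* Illustrative per-attribute sanity limits: attributes that hold a
 * single number get a tiny cap, while genuinely large attributes such
 * as ACLs are allowed much more before being skipped unread. */
struct pax_attr_limit {
	const char *name;
	size_t max_value_len;
};

static const struct pax_attr_limit pax_limits[] = {
	{ "size",               64 },
	{ "mtime",              64 },
	{ "path",               1024 * 1024 },
	{ "SCHILY.acl.access",  512 * 1024 },
};
```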
This includes a number of changes to archive_entry processing that allow
us to consistently keep the _first_ appearance of any given value,
replacing the original architecture that recursively cached data in
memory in order to effectively process all the data from back to front.
Resolves #1855. Resolves #1939.
Note: this is a partial cherry-pick from
https://github.com/libarchive/libarchive/pull/2095, which I'm going to
go through and break into smaller pieces in hopes of getting some things
in while discussion of other things can continue.
There are basically two fixes here:
The first is to check for the presence of the WCS pathname on Windows
before failing, since the conversion from WCS -> MBS might fail. Later
execution already handles such paths correctly.
The second is to set the converted link name on the target entry where
relevant. Note that there has been prior discussion on this here:
https://github.com/libarchive/libarchive/pull/2095/files#r1531599325
certain rar files seem to have the lowest possible address here, so flip
the argument order to correctly evaluate this instead of invoking UB
(caught via sanitize=undefined)
---
the backtrace looks something like:
```
* frame #0: 0x00007a1e3898727b libarchive.so.13`execute_filter [inlined] execute_filter_e8(filter=<unavailable>, vm=<unavailable>, pos=<unavailable>, e9also=<unavailable>) at archive_read_support_format_rar.c:3640:47
frame #1: 0x00007a1e3898727b libarchive.so.13`execute_filter(a=<unavailable>, filter=0x00007a1e39e2f090, vm=0x00007a1e31b1efd0, pos=<unavailable>) at archive_read_support_format_rar.c:0
frame #2: 0x00007a1e38983ac3 libarchive.so.13`read_data_compressed [inlined] run_filters(a=0x00007a1e34209700) at archive_read_support_format_rar.c:3395:8
frame #3: 0x00007a1e38983a9e libarchive.so.13`read_data_compressed(a=0x00007a1e34209700, buff=0x00007a1e31a01fd8, size=0x00007a1e31a01fd0, offset=0x00007a1e31a01fc0, looper=1) at archive_read_support_format_rar.c:2083:12
frame #4: 0x00007a1e38981b10 libarchive.so.13`archive_read_format_rar_read_data(a=0x00007a1e34209700, buff=0x00007a1e31a01fd8, size=0x00007a1e31a01fd0, offset=0x00007a1e31a01fc0) at archive_read_support_format_rar.c:1130:11
frame #5: 0x00006158bc5d30d3 file-roller`extract_archive_thread(result=0x00007a1e3711e2b0, object=<unavailable>, cancellable=0x00007a1e3870bf20) at fr-archive-libarchive.c:999:17
frame #6: 0x00007a1e39928d6d libgio-2.0.so.0`run_in_thread(job=<unavailable>, c=<unavailable>, _data=0x00007a1e326e9740) at gsimpleasyncresult.c:899:5
frame #7: 0x00007a1e3990614e libgio-2.0.so.0`io_job_thread(task=<unavailable>, source_object=<unavailable>, task_data=0x00007a1e2307fc20, cancellable=<unavailable>) at gioscheduler.c:75:16
frame #8: 0x00007a1e399433bf libgio-2.0.so.0`g_task_thread_pool_thread(thread_data=0x00007a1e35c18ab0, pool_data=<unavailable>) at gtask.c:1583:3
frame #9: 0x00007a1e39db77e8 libglib-2.0.so.0`g_thread_pool_thread_proxy(data=<unavailable>) at gthreadpool.c:336:15
frame #10: 0x00007a1e39db5bfb libglib-2.0.so.0`g_thread_proxy(data=0x00007a1e378147d0) at gthread.c:835:20
frame #11: 0x00007a1e3a0b5c7b ld-musl-x86_64.so.1`start(p=0x00007a1e31a02170) at pthread_create.c:208:17
frame #12: 0x00007a1e3a0b8a8b ld-musl-x86_64.so.1`__clone + 47
```
note the 0xd, i.e. 13, which is NegateOverflow in ubsan:
```
(lldb) x/1i $pc
-> 0x7a1e3898727b: 67 0f b9 40 0d other ud1l 0xd(%eax), %eax
```
for reference, the totally legal rar file is
https://img.ayaya.dev/05WYGFOcRPN9 , and this seems to only crash when
extracted via file-roller (or inside nautilus)
It appears that there are xar archives (in the form of Apple .pkg files)
that contain TOCs with duplicated name elements:
```xml
<file id="25">
<data> ... </data>
<type>file</type>
<name>PackageInfo</name>
<name>PackageInfo</name>
<name>PackageInfo</name>
</file>
```
When libarchive encounters one such file, it will produce an
archive_entry named PackageInfoPackageInfoPackageInfo.
To produce a test archive, the XAR writer was modified to emit two name
elements.
On Windows, the MBS pathname might be null if the string was set with a
WCS that can't be represented by the current locale. This is handled
properly by the rest of the code, but there's a sanity check that does
not make the proper distinction.
Note: this is a partial cherry-pick from
https://github.com/libarchive/libarchive/pull/2095, which I'm going to
go through and break into smaller pieces in hopes of getting some things
in while discussion of other things can continue.
There's no bug fix here - this just adds a test to verify that zip
creation when using the _w functions works as expected on Windows.
Note: this is a partial cherry-pick from
https://github.com/libarchive/libarchive/pull/2095, which I'm going to
go through and break into smaller pieces in hopes of getting some things
in while discussion of other things can continue.
Hey,
the fuzzing infrastructure over at OSS-Fuzz builds libarchive with the
CMake option `-DDONT_FAIL_ON_CRC_ERROR=1`.
e4643b64b3/projects/libarchive/build.sh (L35)
This, unfortunately, does not do anything, since it has never been defined
as an option.
Building the fuzzers with CRC checks disabled should improve fuzzing
efficacy a bunch.
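For context, a sketch of the kind of check the macro is meant to gate inside the readers (placement is illustrative; the errno value stands in for libarchive's internal error codes):
```c
#include <archive.h>
#include <errno.h>

/* Sketch: when DONT_FAIL_ON_CRC_ERROR is defined at build time, CRC
 * mismatches are not treated as fatal, letting fuzzers reach deeper
 * code paths without having to produce valid checksums. */
static int
verify_crc(struct archive *a, unsigned long computed, unsigned long stored)
{
#ifndef DONT_FAIL_ON_CRC_ERROR
	if (computed != stored) {
		archive_set_error(a, EINVAL, "CRC error");
		return (ARCHIVE_FATAL);
	}
#else
	(void)a; (void)computed; (void)stored;
#endif
	return (ARCHIVE_OK);
}
```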
Thanks!
This ensures that the buffer is properly initialized and does not
contain any leftover data from previous operations. The buffer is used later
in the `archive_entry_copy_hardlink_l` function call and could otherwise be
uninitialized.
On Windows, if you are using `archive_entry_link_resolver` and give it
an entry that links to a past entry whose pathname was set using a "wide"
string that cannot be represented in the current locale (i.e. the WCS -> MBS
conversion fails), this code will crash due to a null pointer read. This
updates the code to use the `_w` function instead on Windows.
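A minimal sketch of the Windows branch (illustrative; the exact call site differs):
```c
#include <archive_entry.h>

/* Sketch: copy the hardlink target using the wide-character pathname
 * on Windows, so entries whose names cannot be represented in the
 * current locale still resolve correctly. */
static void
set_hardlink_target(struct archive_entry *entry, struct archive_entry *target)
{
#if defined(_WIN32) && !defined(__CYGWIN__)
	archive_entry_copy_hardlink_w(entry, archive_entry_pathname_w(target));
#else
	archive_entry_copy_hardlink(entry, archive_entry_pathname(target));
#endif
}
```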
Note: this is a partial cherry-pick from
https://github.com/libarchive/libarchive/pull/2095, which I'm going to
go through and break into smaller pieces in hopes of getting some things
in while discussion of other things can continue.
On legacy systems the OS-supplied `sys/queue.h` may lack the required
macros, so to avoid having to verify whether that version of queue.h is
usable, opt to always use `la_queue.h`, which will match expectations.
This allows libarchive to build on legacy Darwin where `STAILQ_FOREACH` would
be missing from `sys/queue.h`.
Resolves #2220
When using Clang in "MSVC mode" (i.e. clang-cl), command line arguments
are interpreted as MSVC would interpret them, at least when there are
conflicts. This means that `-Wall` - potentially among other switches -
is interpreted _dramatically_ differently by clang-cl compared to
"normal" Clang.
In CMake, this can be detected by testing for `if (MSVC)` in addition to
the compiler id test, which is what I do here.
Note: this is a partial cherry-pick from #2095, which I'm going to go
through and break into smaller pieces in hopes of getting some things in
while discussion of other things can continue.
The tar utility reads from stderr to receive user input even when stdin
is a pipe. That is unfortunately unsupported on Windows.
The nearest equivalent is to reopen and read from the console input
handle.
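A minimal sketch of the idea, reading from the console input device (the actual change may use Win32 handles directly):
```c
#include <stdio.h>

/* Sketch: on Windows there is no stderr-based fallback for console
 * input, so open the console input device directly and read the
 * user's response from it, even when stdin is redirected to a pipe. */
static int
read_console_answer(char *buf, size_t bufsize)
{
	FILE *console = fopen("CONIN$", "r");
	if (console == NULL)
		return -1;
	if (fgets(buf, (int)bufsize, console) == NULL) {
		fclose(console);
		return -1;
	}
	fclose(console);
	return 0;
}
```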
Closes #2215