| Age | Commit message (Collapse) | Author |
|
When moving extents, mext_check_validity() performs some basic file
system and file checks. However, some essential checks need to be
performed after acquiring the i_rwsem are still scattered in
mext_check_arguments(). Move those checks into mext_check_validity() and
make it executes entirely under the i_rwsem to make the checks clearer.
Furthermore, rename mext_check_arguments() to mext_check_adjust_range(),
as it only performs checks and length adjustments on the move extent
range. Finally, also change the print message for the non-existent file
check to be consistent with other unsupported checks.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20251013015128.499308-8-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Currently, the basic validation checks during the move extent operation
are scattered across __ext4_ioctl() and ext4_move_extents(), which makes
the code somewhat disorganized. Introduce a new helper,
mext_check_validity(), to handle these checks. This change involves only
code relocation without any logical modifications.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20251013015128.499308-7-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Switch to using EXT4_B_TO_LBLK() to calculate the EOF position of the
origin and donor inodes, instead of using open-coded calculations.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20251013015128.499308-6-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When creating or querying mapping blocks using the ext4_map_blocks() and
ext4_map_{query|create}_blocks() helpers, also pass out the extent
sequence number of the block mapping info through the ext4_map_blocks
structure. This sequence number can later serve as a valid cookie within
iomap infrastructure and the move extents procedure.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20251013015128.499308-5-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When querying extents in the extent status tree, we should hold the
data_sem if we want to obtain the sequence number as a valid cookie
simultaneously. However, currently, ext4_map_blocks() calls
ext4_es_lookup_extent() without holding data_sem. Therefore, we should
acquire i_es_lock instead, which also ensures that the sequence cookie
and the extent remain consistent. Consequently, make
ext4_es_lookup_extent() to pass out the sequence number when necessary.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20251013015128.499308-4-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In the iomap_write_iter(), the iomap buffered write frame does not hold
any locks between querying the inode extent mapping info and performing
page cache writes. As a result, the extent mapping can be changed due to
concurrent I/O in flight. Similarly, in the iomap_writepage_map(), the
write-back process faces a similar problem: concurrent changes can
invalidate the extent mapping before the I/O is submitted.
Therefore, both of these processes must recheck the mapping info after
acquiring the folio lock. To address this, similar to XFS, we propose
introducing an extent sequence number to serve as a validity cookie for
the extent. After commit 24b7a2331fcd ("ext4: clairfy the rules for
modifying extents"), we can ensure the extent information should always
be processed through the extent status tree, and the extent status tree
is always uptodate under i_rwsem or invalidate_lock or folio lock, so
it's safe to introduce this sequence number. The sequence number will be
increased whenever the extent status tree changes, preparing for the
buffered write iomap conversion.
Besides, this mechanism is also applicable for the moving extents case.
In move_extent_per_page(), it also needs to reacquire data_sem and check
the mapping info again under the folio lock.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20251013015128.499308-3-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The move extent operation should return -EOPNOTSUPP if any of the inodes
is a quota inode, rather than requiring both to be quota inodes.
Fixes: 02749a4c2082 ("ext4: add ext4_is_quota_file()")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20251013015128.499308-2-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Fix the issue detected by the smatch tool.
fs/ext4/inode.c:3583 ext4_map_blocks_atomic_write_slow() error: uninitialized symbol 'next_pblk'.
fs/ext4/namei.c:1776 ext4_lookup() error: uninitialized symbol 'de'.
fs/ext4/namei.c:1829 ext4_get_parent() error: uninitialized symbol 'de'.
fs/ext4/namei.c:3162 ext4_rmdir() error: uninitialized symbol 'de'.
fs/ext4/namei.c:3242 __ext4_unlink() error: uninitialized symbol 'de'.
fs/ext4/namei.c:3697 ext4_find_delete_entry() error: uninitialized symbol 'de'.
These changes enhance code clarity, address static analysis tool errors.
Signed-off-by: Ranganath V N <vnranganath.20@gmail.com>
Message-ID: <20251011063830.47485-1-vnranganath.20@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Currently __ext4fs_dirhash() returns -1 (-EPERM) if fscrypt doesn't
have encryption key, which may confuse users. Make the error code here
consistent with existing error code.
Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
Message-ID: <20251010095257.3008275-1-sunjunchao@bytedance.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Link: https://patch.msgid.link/20251104-work-guards-v1-5-5108ac78a171@kernel.org
Acked-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This is one instruction more efficient than open-coding folio_pos() +
folio_size(). It's the equivalent of (x + y) << z rather than
x << z + y << z.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-5-willy@infradead.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The logic in wbc_to_tag() is widely used in file systems, so modify this
function to be inline and use it in file systems.
This patch has only passed compilation tests, but it should be fine.
Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
coccinelle
Nothing to look at apart from iput_final().
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bug fixes from Ted Ts'o:
- Fix regression caused by removing CONFIG_EXT3_FS when testing some
very old defconfigs
- Avoid a BUG_ON when opening a file on a maliciously corrupted file
system
- Avoid mm warnings when freeing a very large orphan file metadata
- Avoid a theoretical races between metadata writeback and checkpoints
(it's very hard to hit in practice, since the race requires that the
writeback take a very long time)
* tag 'ext4_for_linus-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
Use CONFIG_EXT4_FS instead of CONFIG_EXT3_FS in all of the defconfigs
ext4: free orphan info with kvfree
ext4: detect invalid INLINE_DATA + EXTENTS flag combination
ext4, doc: fix and improve directory hash tree description
ext4: wait for ongoing I/O to complete before freeing blocks
jbd2: ensure that all ongoing I/O complete before freeing blocks
|
|
Orphan info is now getting allocated with kvmalloc_array(). Free it with
kvfree() instead of kfree() to avoid complaints from mm.
Reported-by: Chris Mason <clm@meta.com>
Fixes: 0a6ce20c1564 ("ext4: verify orphan file size is not too big")
Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Message-ID: <20251007134936.7291-2-jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
syzbot reported a BUG_ON in ext4_es_cache_extent() when opening a verity
file on a corrupted ext4 filesystem mounted without a journal.
The issue is that the filesystem has an inode with both the INLINE_DATA
and EXTENTS flags set:
EXT4-fs error (device loop0): ext4_cache_extents:545: inode #15:
comm syz.0.17: corrupted extent tree: lblk 0 < prev 66
Investigation revealed that the inode has both flags set:
DEBUG: inode 15 - flag=1, i_inline_off=164, has_inline=1, extents_flag=1
This is an invalid combination since an inode should have either:
- INLINE_DATA: data stored directly in the inode
- EXTENTS: data stored in extent-mapped blocks
Having both flags causes ext4_has_inline_data() to return true, skipping
extent tree validation in __ext4_iget(). The unvalidated out-of-order
extents then trigger a BUG_ON in ext4_es_cache_extent() due to integer
underflow when calculating hole sizes.
Fix this by detecting this invalid flag combination early in ext4_iget()
and rejecting the corrupted inode.
Cc: stable@kernel.org
Reported-and-tested-by: syzbot+038b7bf43423e132b308@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=038b7bf43423e132b308
Suggested-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Message-ID: <20250930112810.315095-1-kartikey406@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When freeing metadata blocks in nojournal mode, ext4_forget() calls
bforget() to clear the dirty flag on the buffer_head and remvoe
associated mappings. This is acceptable if the metadata has not yet
begun to be written back. However, if the write-back has already started
but is not yet completed, ext4_forget() will have no effect.
Subsequently, ext4_mb_clear_bb() will immediately return the block to
the mb allocator. This block can then be reallocated immediately,
potentially causing an data corruption issue.
Fix this by clearing the buffer's dirty flag and waiting for the ongoing
I/O to complete, ensuring that no further writes to stale data will
occur.
Fixes: 16e08b14a455 ("ext4: cleanup clean_bdev_aliases() calls")
Cc: stable@kernel.org
Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Closes: https://lore.kernel.org/linux-ext4/a9417096-9549-4441-9878-b1955b899b4e@huaweicloud.com/
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20250916093337.3161016-3-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"New ext4 features:
- Add support so tune2fs can modify/update the superblock using an
ioctl, without needing write access to the block device
- Add support for 32-bit reserved uid's and gid's
Bug fixes:
- Fix potential warnings and other failures caused by corrupted /
fuzzed file systems
- Fail unaligned direct I/O write with EINVAL instead of silently
falling back to buffered I/O
- Correectly handle fsmap queries for metadata mappings
- Avoid journal stalls caused by writeback throttling
- Add some missing GFP_NOFAIL flags to avoid potential deadlocks
under extremem memory pressure
Cleanups:
- Remove obsolete EXT3 Kconfigs"
* tag 'ext4_for_linus-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix checks for orphan inodes
ext4: validate ea_ino and size in check_xattrs
ext4: guard against EA inode refcount underflow in xattr update
ext4: implemet new ioctls to set and get superblock parameters
ext4: add support for 32-bit default reserved uid and gid values
ext4: avoid potential buffer over-read in parse_apply_sb_mount_options()
ext4: fix an off-by-one issue during moving extents
ext4: increase i_disksize to offset + len in ext4_update_disksize_before_punch()
ext4: verify orphan file size is not too big
ext4: fail unaligned direct IO write with EINVAL
ext4: correctly handle queries for metadata mappings
ext4: increase IO priority of fastcommit
ext4: remove obsolete EXT3 config options
jbd2: increase IO priority of checkpoint
ext4: fix potential null deref in ext4_mb_init()
ext4: add ext4_sb_bread_nofail() helper function for ext4_free_branches()
ext4: replace min/max nesting with clamp()
fs: ext4: change GFP_KERNEL to GFP_NOFS to avoid deadlock
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs workqueue updates from Christian Brauner:
"This contains various workqueue changes affecting the filesystem
layer.
Currently if a user enqueue a work item using schedule_delayed_work()
the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies
to schedule_work() that is using system_wq and queue_work(), that
makes use again of WORK_CPU_UNBOUND.
This replaces the use of system_wq and system_unbound_wq. system_wq is
a per-CPU workqueue which isn't very obvious from the name and
system_unbound_wq is to be used when locality is not required.
So this renames system_wq to system_percpu_wq, and system_unbound_wq
to system_dfl_wq.
This also adds a new WQ_PERCPU flag to allow the fs subsystem users to
explicitly request the use of per-CPU behavior. Both WQ_UNBOUND and
WQ_PERCPU flags coexist for one release cycle to allow callers to
transition their calls. WQ_UNBOUND will be removed in a next release
cycle"
* tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: WQ_PERCPU added to alloc_workqueue users
fs: replace use of system_wq with system_percpu_wq
fs: replace use of system_unbound_wq with system_dfl_wq
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode updates from Christian Brauner:
"This contains a series I originally wrote and that Eric brought over
the finish line. It moves out the i_crypt_info and i_verity_info
pointers out of 'struct inode' and into the fs-specific part of the
inode.
So now the few filesytems that actually make use of this pay the price
in their own private inode storage instead of forcing it upon every
user of struct inode.
The pointer for the crypt and verity info is simply found by storing
an offset to its address in struct fsverity_operations and struct
fscrypt_operations. This shrinks struct inode by 16 bytes.
I hope to move a lot more out of it in the future so that struct inode
becomes really just about very core stuff that we need, much like
struct dentry and struct file, instead of the dumping ground it has
become over the years.
On top of this are a various changes associated with the ongoing inode
lifetime handling rework that multiple people are pushing forward:
- Stop accessing inode->i_count directly in f2fs and gfs2. They
simply should use the __iget() and iput() helpers
- Make the i_state flags an enum
- Rework the iput() logic
Currently, if we are the last iput, and we have the I_DIRTY_TIME
bit set, we will grab a reference on the inode again and then mark
it dirty and then redo the put. This is to make sure we delay the
time update for as long as possible
We can rework this logic to simply dec i_count if it is not 1, and
if it is do the time update while still holding the i_count
reference
Then we can replace the atomic_dec_and_lock with locking the
->i_lock and doing atomic_dec_and_test, since we did the
atomic_add_unless above
- Add an icount_read() helper and convert everyone that accesses
inode->i_count directly for this purpose to use the helper
- Expand dump_inode() to dump more information about an inode helping
in debugging
- Add some might_sleep() annotations to iput() and associated
helpers"
* tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: add might_sleep() annotation to iput() and more
fs: expand dump_inode()
inode: fix whitespace issues
fs: add an icount_read helper
fs: rework iput logic
fs: make the i_state flags an enum
fs: stop accessing ->i_count directly in f2fs and gfs2
fsverity: check IS_VERITY() in fsverity_cleanup_inode()
fs: remove inode::i_verity_info
btrfs: move verity info pointer to fs-specific part of inode
f2fs: move verity info pointer to fs-specific part of inode
ext4: move verity info pointer to fs-specific part of inode
fsverity: add support for info in fs-specific part of inode
fs: remove inode::i_crypt_info
ceph: move crypt info pointer to fs-specific part of inode
ubifs: move crypt info pointer to fs-specific part of inode
f2fs: move crypt info pointer to fs-specific part of inode
ext4: move crypt info pointer to fs-specific part of inode
fscrypt: add support for info in fs-specific part of inode
fscrypt: replace raw loads of info pointer with helper function
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual selections of misc updates for this cycle.
Features:
- Add "initramfs_options" parameter to set initramfs mount options.
This allows to add specific mount options to the rootfs to e.g.,
limit the memory size
- Add RWF_NOSIGNAL flag for pwritev2()
Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE
signal from being raised when writing on disconnected pipes or
sockets. The flag is handled directly by the pipe filesystem and
converted to the existing MSG_NOSIGNAL flag for sockets
- Allow to pass pid namespace as procfs mount option
Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs
mount is auto-selected based on the mounting process's active
pidns, and the pidns itself is basically hidden once the mount has
been constructed)
This implicit behaviour has historically meant that userspace was
required to do some special dances in order to configure the pidns
of a procfs mount as desired. Examples include:
* In order to bypass the mnt_too_revealing() check, Kubernetes
creates a procfs mount from an empty pidns so that user
namespaced containers can be nested (without this, the nested
containers would fail to mount procfs)
But this requires forking off a helper process because you cannot
just one-shot this using mount(2)
* Container runtimes in general need to fork into a container
before configuring its mounts, which can lead to security issues
in the case of shared-pidns containers (a privileged process in
the pidns can interact with your container runtime process)
While SUID_DUMP_DISABLE and user namespaces make this less of an
issue, the strict need for this due to a minor uAPI wart is kind
of unfortunate
Things would be much easier if there was a way for userspace to
just specify the pidns they want. So this pull request contains
changes to implement a new "pidns" argument which can be set
using fsconfig(2):
fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
or classic mount(2) / mount(8):
// mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
Cleanups:
- Remove the last references to EXPORT_OP_ASYNC_LOCK
- Make file_remove_privs_flags() static
- Remove redundant __GFP_NOWARN when GFP_NOWAIT is used
- Use try_cmpxchg() in start_dir_add()
- Use try_cmpxchg() in sb_init_done_wq()
- Replace offsetof() with struct_size() in ioctl_file_dedupe_range()
- Remove vfs_ioctl() export
- Replace rwlock() with spinlock in epoll code as rwlock causes
priority inversion on preempt rt kernels
- Make ns_entries in fs/proc/namespaces const
- Use a switch() statement() in init_special_inode() just like we do
in may_open()
- Use struct_size() in dir_add() in the initramfs code
- Use str_plural() in rd_load_image()
- Replace strcpy() with strscpy() in find_link()
- Rename generic_delete_inode() to inode_just_drop() and
generic_drop_inode() to inode_generic_drop()
- Remove unused arguments from fcntl_{g,s}et_rw_hint()
Fixes:
- Document @name parameter for name_contains_dotdot() helper
- Fix spelling mistake
- Always return zero from replace_fd() instead of the file descriptor
number
- Limit the size for copy_file_range() in compat mode to prevent a
signed overflow
- Fix debugfs mount options not being applied
- Verify the inode mode when loading it from disk in minixfs
- Verify the inode mode when loading it from disk in cramfs
- Don't trigger automounts with RESOLVE_NO_XDEV
If openat2() was called with RESOLVE_NO_XDEV it didn't traverse
through automounts, but could still trigger them
- Add FL_RECLAIM flag to show_fl_flags() macro so it appears in
tracepoints
- Fix unused variable warning in rd_load_image() on s390
- Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD
- Use ns_capable_noaudit() when determining net sysctl permissions
- Don't call path_put() under namespace semaphore in listmount() and
statmount()"
* tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits)
fcntl: trim arguments
listmount: don't call path_put() under namespace semaphore
statmount: don't call path_put() under namespace semaphore
pid: use ns_capable_noaudit() when determining net sysctl permissions
fs: rename generic_delete_inode() and generic_drop_inode()
init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD
initramfs: Replace strcpy() with strscpy() in find_link()
initrd: Use str_plural() in rd_load_image()
initramfs: Use struct_size() helper to improve dir_add()
initrd: Fix unused variable warning in rd_load_image() on s390
fs: use the switch statement in init_special_inode()
fs/proc/namespaces: make ns_entries const
filelock: add FL_RECLAIM to show_fl_flags() macro
eventpoll: Replace rwlock with spinlock
selftests/proc: add tests for new pidns APIs
procfs: add "pidns" mount option
pidns: move is-ancestor logic to helper
openat2: don't trigger automounts with RESOLVE_NO_XDEV
namei: move cross-device check to __traverse_mounts
namei: remove LOOKUP_NO_XDEV check from handle_mounts
...
|
|
When orphan file feature is enabled, inode can be tracked as orphan
either in the standard orphan list or in the orphan file. The first can
be tested by checking ei->i_orphan list head, the second is recorded by
EXT4_STATE_ORPHAN_FILE inode state flag. There are several places where
we want to check whether inode is tracked as orphan and only some of
them properly check for both possibilities. Luckily the consequences are
mostly minor, the worst that can happen is that we track an inode as
orphan although we don't need to and e2fsck then complains (resulting in
occasional ext4/307 xfstest failures). Fix the problem by introducing a
helper for checking whether an inode is tracked as orphan and use it in
appropriate places.
Fixes: 4a79a98c7b19 ("ext4: Improve scalability of ext4 orphan file handling")
Cc: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Message-ID: <20250925123038.20264-2-jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
During xattr block validation, check_xattrs() processes xattr entries
without validating that entries claiming to use EA inodes have non-zero
sizes. Corrupted filesystems may contain xattr entries where e_value_size
is zero but e_value_inum is non-zero, indicating invalid xattr data.
Add validation in check_xattrs() to detect this corruption pattern early
and return -EFSCORRUPTED, preventing invalid xattr entries from causing
issues throughout the ext4 codebase.
Cc: stable@kernel.org
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: syzbot+4c9d23743a2409b80293@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=4c9d23743a2409b80293
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Message-ID: <20250923133245.1091761-1-kartikey406@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
syzkaller found a path where ext4_xattr_inode_update_ref() reads an EA
inode refcount that is already <= 0 and then applies ref_change (often
-1). That lets the refcount underflow and we proceed with a bogus value,
triggering errors like:
EXT4-fs error: EA inode <n> ref underflow: ref_count=-1 ref_change=-1
EXT4-fs warning: ea_inode dec ref err=-117
Make the invariant explicit: if the current refcount is non-positive,
treat this as on-disk corruption, emit ext4_error_inode(), and fail the
operation with -EFSCORRUPTED instead of updating the refcount. Delete the
WARN_ONCE() as negative refcounts are now impossible; keep error reporting
in ext4_error_inode().
This prevents the underflow and the follow-on orphan/cleanup churn.
Reported-by: syzbot+0be4f339a8218d2a5bb1@syzkaller.appspotmail.com
Fixes: https://syzbot.org/bug?extid=0be4f339a8218d2a5bb1
Cc: stable@kernel.org
Co-developed-by: Albin Babu Varghese <albinbabuvarghese20@gmail.com>
Signed-off-by: Albin Babu Varghese <albinbabuvarghese20@gmail.com>
Signed-off-by: Ahmet Eray Karadag <eraykrdg1@gmail.com>
Message-ID: <20250920021342.45575-1-eraykrdg1@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Implement the EXT4_IOC_GET_TUNE_SB_PARAM and
EXT4_IOC_SET_TUNE_SB_PARAM ioctls, which allow certains superblock
parameters to be set while the file system is mounted, without needing
write access to the block device.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Message-ID: <20250916-tune2fs-v2-3-d594dc7486f0@mit.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Support for specifying the default user id and group id that is
allowed to use the reserved block space was added way back when Linux
only supported 16-bit uid's and gid's. (Yeah, that long ago.) It's
not a commonly used feature, but let's add support for 32-bit user and
group id's.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Message-ID: <20250916-tune2fs-v2-2-d594dc7486f0@mit.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Unlike other strings in the ext4 superblock, we rely on tune2fs to
make sure s_mount_opts is NUL terminated. Harden
parse_apply_sb_mount_options() by treating s_mount_opts as a potential
__nonstring.
Cc: stable@vger.kernel.org
Fixes: 8b67f04ab9de ("ext4: Add mount options in superblock")
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Message-ID: <20250916-tune2fs-v2-1-d594dc7486f0@mit.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
During the movement of a written extent, mext_page_mkuptodate() is
called to read data in the range [from, to) into the page cache and to
update the corresponding buffers. Therefore, we should not wait on any
buffer whose start offset is >= 'to'. Otherwise, it will return -EIO and
fail the extents movement.
$ for i in `seq 3 -1 0`; \
do xfs_io -fs -c "pwrite -b 1024 $((i * 1024)) 1024" /mnt/foo; \
done
$ umount /mnt && mount /dev/pmem1s /mnt # drop cache
$ e4defrag /mnt/foo
e4defrag 1.47.0 (5-Feb-2023)
ext4 defragmentation for /mnt/foo
[1/1]/mnt/foo: 0% [ NG ]
Success: [0/1]
Cc: stable@kernel.org
Fixes: a40759fb16ae ("ext4: remove array of buffer_heads from mext_page_mkuptodate()")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20250912105841.1886799-1-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
After running a stress test combined with fault injection,
we performed fsck -a followed by fsck -fn on the filesystem
image. During the second pass, fsck -fn reported:
Inode 131512, end of extent exceeds allowed value
(logical block 405, physical block 1180540, len 2)
This inode was not in the orphan list. Analysis revealed the
following call chain that leads to the inconsistency:
ext4_da_write_end()
//does not update i_disksize
ext4_punch_hole()
//truncate folio, keep size
ext4_page_mkwrite()
ext4_block_page_mkwrite()
ext4_block_write_begin()
ext4_get_block()
//insert written extent without update i_disksize
journal commit
echo 1 > /sys/block/xxx/device/delete
da-write path updates i_size but does not update i_disksize. Then
ext4_punch_hole truncates the da-folio yet still leaves i_disksize
unchanged(in the ext4_update_disksize_before_punch function, the
condition offset + len < size is met). Then ext4_page_mkwrite sees
ext4_nonda_switch return 1 and takes the nodioread_nolock path, the
folio about to be written has just been punched out, and it’s offset
sits beyond the current i_disksize. This may result in a written
extent being inserted, but again does not update i_disksize. If the
journal gets committed and then the block device is yanked, we might
run into this. It should be noted that replacing ext4_punch_hole with
ext4_zero_range in the call sequence may also trigger this issue, as
neither will update i_disksize under these circumstances.
To fix this, we can modify ext4_update_disksize_before_punch to
increase i_disksize to min(i_size, offset + len) when both i_size and
(offset + len) are greater than i_disksize.
Cc: stable@kernel.org
Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Message-ID: <20250911133024.1841027-1-sunyongjian@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In principle orphan file can be arbitrarily large. However orphan replay
needs to traverse it all and we also pin all its buffers in memory. Thus
filesystems with absurdly large orphan files can lead to big amounts of
memory consumed. Limit orphan file size to a sane value and also use
kvmalloc() for allocating array of block descriptor structures to avoid
large order allocations for sane but large orphan files.
Reported-by: syzbot+0b92850d68d9b12934f5@syzkaller.appspotmail.com
Fixes: 02f310fcf47f ("ext4: Speedup ext4 orphan inode handling")
Cc: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Message-ID: <20250909112206.10459-2-jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Commit bc264fea0f6f ("iomap: support incremental iomap_iter advances")
changed the error handling logic in iomap_iter(). Previously any error
from iomap_dio_bio_iter() got propagated to userspace, after this commit
if ->iomap_end returns error, it gets propagated to userspace instead of
an error from iomap_dio_bio_iter(). This results in unaligned writes to
ext4 to silently fallback to buffered IO instead of erroring out.
Now returning ENOTBLK for DIO writes from ext4_iomap_end() seems
unnecessary these days. It is enough to return ENOTBLK from
ext4_iomap_begin() when we don't support DIO write for that particular
file offset (due to hole).
Fixes: bc264fea0f6f ("iomap: support incremental iomap_iter advances")
Cc: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Message-ID: <20250901112739.32484-2-jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Currently, our handling of metadata is _ambiguous_ in some scenarios,
that is, we end up returning unknown if the range only covers the
mapping partially.
For example, in the following case:
$ xfs_io -c fsmap -d
0: 254:16 [0..7]: static fs metadata 8
1: 254:16 [8..15]: special 102:1 8
2: 254:16 [16..5127]: special 102:2 5112
3: 254:16 [5128..5255]: special 102:3 128
4: 254:16 [5256..5383]: special 102:4 128
5: 254:16 [5384..70919]: inodes 65536
6: 254:16 [70920..70967]: unknown 48
...
$ xfs_io -c fsmap -d 24 33
0: 254:16 [24..39]: unknown 16 <--- incomplete reporting
$ xfs_io -c fsmap -d 24 33 (With patch)
0: 254:16 [16..5127]: special 102:2 5112
This is because earlier in ext4_getfsmap_meta_helper, we end up ignoring
any extent that starts before our queried range, but overlaps it. While
the man page [1] is a bit ambiguous on this, this fix makes the output
make more sense since we are anyways returning an "unknown" extent. This
is also consistent to how XFS does it:
$ xfs_io -c fsmap -d
...
6: 254:16 [104..127]: free space 24
7: 254:16 [128..191]: inodes 64
...
$ xfs_io -c fsmap -d 137 150
0: 254:16 [128..191]: inodes 64 <-- full extent returned
[1] https://man7.org/linux/man-pages/man2/ioctl_getfsmap.2.html
Reported-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <023f37e35ee280cd9baac0296cbadcbe10995cab.1757058211.git.ojaswin@linux.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The following code paths may result in high latency or even task hangs:
1. fastcommit io is throttled by wbt.
2. jbd2_fc_wait_bufs() might wait for a long time while
JBD2_FAST_COMMIT_ONGOING is set in journal->flags, and then
jbd2_journal_commit_transaction() waits for the
JBD2_FAST_COMMIT_ONGOING bit for a long time while holding the write
lock of j_state_lock.
3. start_this_handle() waits for read lock of j_state_lock which
results in high latency or task hang.
Given the fact that ext4_fc_commit() already modifies the current
process' IO priority to match that of the jbd2 thread, it should be
reasonable to match jbd2's IO submission flags as well.
Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20250827121812.1477634-1-sunjunchao@bytedance.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In June 2015, commit c290ea01abb7 ("fs: Remove ext3 filesystem driver")
removed the historic ext3 filesystem support as ext3 partitions are fully
supported with the ext4 filesystem support. To simplify updating the kernel
build configuration, which had only EXT3 support but not EXT4 support
enabled, the three config options EXT3_{FS,FS_POSIX_ACL,FS_SECURITY} were
kept, instead of immediately removing them. The three options just enable
the corresponding EXT4 counterparts when configs from older kernel versions
are used to build on later kernel versions. This ensures that the kernels
from those kernel build configurations would then continue to have EXT4
enabled for supporting booting from ext3 and ext4 file systems, to avoid
potential unexpected surprises.
Given that the kernel build configuration has no backwards-compatibility
guarantee and this transition phase for such build configurations has been
in place for a decade, we can reasonably expect all such users to have
transitioned to use the EXT4 config options in their config files at this
point in time. With that in mind, the three EXT3 config options are
obsolete by now.
Remove the obsolete EXT3 config options.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In ext4_mb_init(), ext4_mb_avg_fragment_size_destroy() may be called
when sbi->s_mb_avg_fragment_size remains uninitialized (e.g., if groupinfo
slab cache allocation fails). Since ext4_mb_avg_fragment_size_destroy()
lacks null pointer checking, this leads to a null pointer dereference.
==================================================================
EXT4-fs: no memory for groupinfo slab cache
BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD 0 P4D 0
Oops: Oops: 0002 [#1] SMP PTI
CPU:2 UID: 0 PID: 87 Comm:mount Not tainted 6.17.0-rc2 #1134 PREEMPT(none)
RIP: 0010:_raw_spin_lock_irqsave+0x1b/0x40
Call Trace:
<TASK>
xa_destroy+0x61/0x130
ext4_mb_init+0x483/0x540
__ext4_fill_super+0x116d/0x17b0
ext4_fill_super+0xd3/0x280
get_tree_bdev_flags+0x132/0x1d0
vfs_get_tree+0x29/0xd0
do_new_mount+0x197/0x300
__x64_sys_mount+0x116/0x150
do_syscall_64+0x50/0x1c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
==================================================================
Therefore, add necessary null check to ext4_mb_avg_fragment_size_destroy()
to prevent this issue. The same fix is also applied to
ext4_mb_largest_free_orders_destroy().
Reported-by: syzbot+1713b1aa266195b916c2@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=1713b1aa266195b916c2
Cc: stable@kernel.org
Fixes: f7eaacbb4e54 ("ext4: convert free groups order lists to xarrays")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The implicit __GFP_NOFAIL flag in ext4_sb_bread() was removed in commit
8a83ac54940d ("ext4: call bdev_getblk() from sb_getblk_gfp()"), meaning
the function can now fail under memory pressure.
Most callers of ext4_sb_bread() propagate the error to userspace and do not
remount the filesystem read-only. However, ext4_free_branches() handles
ext4_sb_bread() failure by remounting the filesystem read-only.
This implies that an ext3 filesystem (mounted via the ext4 driver) could be
forcibly remounted read-only due to a transient page allocation failure,
which is unacceptable.
To mitigate this, introduce a new helper function, ext4_sb_bread_nofail(),
which explicitly uses __GFP_NOFAIL, and use it in ext4_free_branches().
Fixes: 8a83ac54940d ("ext4: call bdev_getblk() from sb_getblk_gfp()")
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The clamp() macro explicitly expresses the intent of constraining a value
within bounds.Therefore, replacing max(min(a,b),c) with clamp(val, lo, hi)
can improve code readability.
Signed-off-by: Xichao Zhao <zhao.xichao@vivo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The parent function ext4_xattr_inode_lookup_create already uses GFP_NOFS for memory alloction, so the function ext4_xattr_inode_cache_find should use same gfp_flag.
Signed-off-by: chuguangqing <chuguangqing@inspur.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.
system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.
Adding system_dfl_wq to encourage its use when unbound work should be used.
The old system_unbound_wq will be kept for a few release cycles.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://lore.kernel.org/20250916082906.77439-2-marco.crivellari@suse.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
generic_delete_inode() is rather misleading for what the routine is
doing. inode_just_drop() should be much clearer.
The new naming is inconsistent with generic_drop_inode(), so rename that
one as well with inode_ as the suffix.
No functional changes.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Instead of doing direct access to ->i_count, add a helper to handle
this. This will make it easier to convert i_count to a refcount later.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/9bc62a84c6b9d6337781203f60837bd98fbc4a96.1756222464.git.josef@toxicpanda.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Move the fsverity_info pointer into the filesystem-specific part of the
inode by adding the field ext4_inode_info::i_verity_info and configuring
fsverity_operations::inode_info_offs accordingly.
This is a prerequisite for a later commit that removes
inode::i_verity_info, saving memory and improving cache efficiency on
filesystems that don't support fsverity.
Co-developed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://lore.kernel.org/20250810075706.172910-10-ebiggers@kernel.org
Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Move the fscrypt_inode_info pointer into the filesystem-specific part of
the inode by adding the field ext4_inode_info::i_crypt_info and
configuring fscrypt_operations::inode_info_offs accordingly.
This is a prerequisite for a later commit that removes
inode::i_crypt_info, saving memory and improving cache efficiency with
filesystems that don't support fscrypt.
Co-developed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://lore.kernel.org/20250810075706.172910-4-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
- Fix fast commit checks for file systems with ea_inode enabled
- Don't drop the i_version mount option on a remount
- Fix FIEMAP reporting when there are holes in a bigalloc file system
- Don't fail when mounting read-only when there are inodes in the
orphan file
- Fix hole length overflow for indirect mapped files on file systems
with an 8k or 16k block file system
* tag 'ext4_for_linus-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
jbd2: prevent softlockup in jbd2_log_do_checkpoint()
ext4: fix incorrect function name in comment
ext4: use kmalloc_array() for array space allocation
ext4: fix hole length calculation overflow in non-extent inodes
ext4: don't try to clear the orphan_present feature block device is r/o
ext4: fix reserved gdt blocks handling in fsmap
ext4: fix fsmap end of range reporting with bigalloc
ext4: remove redundant __GFP_NOWARN
ext4: fix unused variable warning in ext4_init_new_dir
ext4: remove useless if check
ext4: check fast symlink for ea_inode correctly
ext4: preserve SB_I_VERSION on remount
ext4: show the default enabled i_version option
|
|
Since commit 6b730a405037 “ext4: hoist ext4_block_write_begin and
replace the __block_write_begin”, the comment should be updated
accordingly from "__block_write_begin" to "ext4_block_write_begin".
Fixes: 6b730a405037 (“ext4: hoist ext4_block_write_begin and replace...")
Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://patch.msgid.link/20250812021709.1120716-1-liubaolin12138@163.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Replace kmalloc(size * sizeof) with kmalloc_array() for safer memory
allocation and overflow prevention.
Cc: stable@kernel.org
Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com>
Link: https://patch.msgid.link/20250811125816.570142-1-liaoyuanhong@vivo.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In a filesystem with a block size larger than 4KB, the hole length
calculation for a non-extent inode in ext4_ind_map_blocks() can easily
exceed INT_MAX. Then it could return a zero length hole and trigger the
following waring and infinite in the iomap infrastructure.
------------[ cut here ]------------
WARNING: CPU: 3 PID: 434101 at fs/iomap/iter.c:34 iomap_iter_done+0x148/0x190
CPU: 3 UID: 0 PID: 434101 Comm: fsstress Not tainted 6.16.0-rc7+ #128 PREEMPT(voluntary)
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : iomap_iter_done+0x148/0x190
lr : iomap_iter+0x174/0x230
sp : ffff8000880af740
x29: ffff8000880af740 x28: ffff0000db8e6840 x27: 0000000000000000
x26: 0000000000000000 x25: ffff8000880af830 x24: 0000004000000000
x23: 0000000000000002 x22: 000001bfdbfa8000 x21: ffffa6a41c002e48
x20: 0000000000000001 x19: ffff8000880af808 x18: 0000000000000000
x17: 0000000000000000 x16: ffffa6a495ee6cd0 x15: 0000000000000000
x14: 00000000000003d4 x13: 00000000fa83b2da x12: 0000b236fc95f18c
x11: ffffa6a4978b9c08 x10: 0000000000001da0 x9 : ffffa6a41c1a2a44
x8 : ffff8000880af5c8 x7 : 0000000001000000 x6 : 0000000000000000
x5 : 0000000000000004 x4 : 000001bfdbfa8000 x3 : 0000000000000000
x2 : 0000000000000000 x1 : 0000004004030000 x0 : 0000000000000000
Call trace:
iomap_iter_done+0x148/0x190 (P)
iomap_iter+0x174/0x230
iomap_fiemap+0x154/0x1d8
ext4_fiemap+0x110/0x140 [ext4]
do_vfs_ioctl+0x4b8/0xbc0
__arm64_sys_ioctl+0x8c/0x120
invoke_syscall+0x6c/0x100
el0_svc_common.constprop.0+0x48/0xf0
do_el0_svc+0x24/0x38
el0_svc+0x38/0x120
el0t_64_sync_handler+0x10c/0x138
el0t_64_sync+0x198/0x1a0
---[ end trace 0000000000000000 ]---
Cc: stable@kernel.org
Fixes: facab4d9711e ("ext4: return hole from ext4_map_blocks()")
Reported-by: Qu Wenruo <wqu@suse.com>
Closes: https://lore.kernel.org/linux-ext4/9b650a52-9672-4604-a765-bb6be55d1e4a@gmx.com/
Tested-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250811064532.1788289-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When the file system is frozen in preparation for taking an LVM
snapshot, the journal is checkpointed and if the orphan_file feature
is enabled, and the orphan file is empty, we clear the orphan_present
feature flag. But if there are pending inodes that need to be removed
the orphan_present feature flag can't be cleared.
The problem comes if the block device is read-only. In that case, we
can't process the orphan inode list, so it is skipped in
ext4_orphan_cleanup(). But then in ext4_mark_recovery_complete(),
this results in the ext4 error "Orphan file not empty on read-only fs"
firing and the file system mount is aborted.
Fix this by clearing the needs_recovery flag in the block device is
read-only. We do this after the call to ext4_load_and_init-journal()
since there are some error checks need to be done in case the journal
needs to be replayed and the block device is read-only, or if the
block device containing the externa journal is read-only, etc.
Cc: stable@kernel.org
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1108271
Cc: stable@vger.kernel.org
Fixes: 02f310fcf47f ("ext4: Speedup ext4 orphan inode handling")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In some cases like small FSes with no meta_bg and where the resize
doesn't need extra gdt blocks as it can fit in the current one,
s_reserved_gdt_blocks is set as 0, which causes fsmap to emit a 0
length entry, which is incorrect.
$ mkfs.ext4 -b 65536 -O bigalloc /dev/sda 5G
$ mount /dev/sda /mnt/scratch
$ xfs_io -c "fsmap -d" /mnt/scartch
0: 253:48 [0..127]: static fs metadata 128
1: 253:48 [128..255]: special 102:1 128
2: 253:48 [256..255]: special 102:2 0 <---- 0 len entry
3: 253:48 [256..383]: special 102:3 128
Fix this by adding a check for this case.
Cc: stable@kernel.org
Fixes: 0c9ec4beecac ("ext4: support GETFSMAP ioctls")
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://patch.msgid.link/08781b796453a5770112aa96ad14c864fbf31935.1754377641.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
With bigalloc enabled, the logic to report last extent has a bug since
we try to use cluster units instead of block units. This can cause an
issue where extra incorrect entries might be returned back to the
user. This was flagged by generic/365 with 64k bs and -O bigalloc.
** Details of issue **
The issue was noticed on 5G 64k blocksize FS with -O bigalloc which has
only 1 bg.
$ xfs_io -c "fsmap -d" /mnt/scratch
0: 253:48 [0..127]: static fs metadata 128 /* sb */
1: 253:48 [128..255]: special 102:1 128 /* gdt */
3: 253:48 [256..383]: special 102:3 128 /* block bitmap */
4: 253:48 [384..2303]: unknown 1920 /* flex bg empty space */
5: 253:48 [2304..2431]: special 102:4 128 /* inode bitmap */
6: 253:48 [2432..4351]: unknown 1920 /* flex bg empty space */
7: 253:48 [4352..6911]: inodes 2560
8: 253:48 [6912..538623]: unknown 531712
9: 253:48 [538624..10485759]: free space 9947136
The issue can be seen with:
$ xfs_io -c "fsmap -d 0 3" /mnt/scratch
0: 253:48 [0..127]: static fs metadata 128
1: 253:48 [384..2047]: unknown 1664
Only the first entry was expected to be returned but we get 2. This is
because:
ext4_getfsmap_datadev()
first_cluster, last_cluster = 0
...
info->gfi_last = true;
ext4_getfsmap_datadev_helper(sb, end_ag, last_cluster + 1, 0, info);
fsb = C2B(1) = 16
fslen = 0
...
/* Merge in any relevant extents from the meta_list */
list_for_each_entry_safe(p, tmp, &info->gfi_meta_list, fmr_list) {
...
// since fsb = 16, considers all metadata which starts before 16 blockno
iter 1: error = ext4_getfsmap_helper(sb, info, p); // p = sb (0,1), nop
info->gfi_next_fsblk = 1
iter 2: error = ext4_getfsmap_helper(sb, info, p); // p = gdt (1,2), nop
info->gfi_next_fsblk = 2
iter 3: error = ext4_getfsmap_helper(sb, info, p); // p = blk bitmap (2,3), nop
info->gfi_next_fsblk = 3
iter 4: error = ext4_getfsmap_helper(sb, info, p); // p = ino bitmap (18,19)
if (rec_blk > info->gfi_next_fsblk) { // (18 > 3)
// emits an extra entry ** BUG **
}
}
Fix this by directly calling ext4_getfsmap_datadev() with a dummy
record that has fmr_physical set to (end_fsb + 1) instead of
last_cluster + 1. By using the block instead of cluster we get the
correct behavior.
Replacing ext4_getfsmap_datadev_helper() with ext4_getfsmap_helper()
is okay since the gfi_lastfree and metadata checks in
ext4_getfsmap_datadev_helper() are anyways redundant when we only want
to emit the last allocated block of the range, as we have already
taken care of emitting metadata and any last free blocks.
Cc: stable@kernel.org
Reported-by: Disha Goel <disgoel@linux.ibm.com>
Fixes: 4a622e4d477b ("ext4: fix FS_IOC_GETFSMAP handling")
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://patch.msgid.link/e7472c8535c9c5ec10f425f495366864ea12c9da.1754377641.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|