summaryrefslogtreecommitdiff
path: root/fs/ext4
AgeCommit message (Collapse)Author
2025-11-06ext4: refactor mext_check_arguments()Zhang Yi
When moving extents, mext_check_validity() performs some basic file system and file checks. However, some essential checks need to be performed after acquiring the i_rwsem are still scattered in mext_check_arguments(). Move those checks into mext_check_validity() and make it executes entirely under the i_rwsem to make the checks clearer. Furthermore, rename mext_check_arguments() to mext_check_adjust_range(), as it only performs checks and length adjustments on the move extent range. Finally, also change the print message for the non-existent file check to be consistent with other unsupported checks. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251013015128.499308-8-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-06ext4: add mext_check_validity() to do basic checkZhang Yi
Currently, the basic validation checks during the move extent operation are scattered across __ext4_ioctl() and ext4_move_extents(), which makes the code somewhat disorganized. Introduce a new helper, mext_check_validity(), to handle these checks. This change involves only code relocation without any logical modifications. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251013015128.499308-7-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-06ext4: use EXT4_B_TO_LBLK() in mext_check_arguments()Zhang Yi
Switch to using EXT4_B_TO_LBLK() to calculate the EOF position of the origin and donor inodes, instead of using open-coded calculations. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251013015128.499308-6-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-06ext4: pass out extent seq counter when mapping blocksZhang Yi
When creating or querying mapping blocks using the ext4_map_blocks() and ext4_map_{query|create}_blocks() helpers, also pass out the extent sequence number of the block mapping info through the ext4_map_blocks structure. This sequence number can later serve as a valid cookie within iomap infrastructure and the move extents procedure. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251013015128.499308-5-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-06ext4: make ext4_es_lookup_extent() pass out the extent seq counterZhang Yi
When querying extents in the extent status tree, we should hold the data_sem if we want to obtain the sequence number as a valid cookie simultaneously. However, currently, ext4_map_blocks() calls ext4_es_lookup_extent() without holding data_sem. Therefore, we should acquire i_es_lock instead, which also ensures that the sequence cookie and the extent remain consistent. Consequently, make ext4_es_lookup_extent() to pass out the sequence number when necessary. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251013015128.499308-4-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-06ext4: introduce seq counter for the extent status entryZhang Yi
In the iomap_write_iter(), the iomap buffered write frame does not hold any locks between querying the inode extent mapping info and performing page cache writes. As a result, the extent mapping can be changed due to concurrent I/O in flight. Similarly, in the iomap_writepage_map(), the write-back process faces a similar problem: concurrent changes can invalidate the extent mapping before the I/O is submitted. Therefore, both of these processes must recheck the mapping info after acquiring the folio lock. To address this, similar to XFS, we propose introducing an extent sequence number to serve as a validity cookie for the extent. After commit 24b7a2331fcd ("ext4: clairfy the rules for modifying extents"), we can ensure the extent information should always be processed through the extent status tree, and the extent status tree is always uptodate under i_rwsem or invalidate_lock or folio lock, so it's safe to introduce this sequence number. The sequence number will be increased whenever the extent status tree changes, preparing for the buffered write iomap conversion. Besides, this mechanism is also applicable for the moving extents case. In move_extent_per_page(), it also needs to reacquire data_sem and check the mapping info again under the folio lock. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251013015128.499308-3-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-06ext4: correct the checking of quota files before moving extentsZhang Yi
The move extent operation should return -EOPNOTSUPP if any of the inodes is a quota inode, rather than requiring both to be quota inodes. Fixes: 02749a4c2082 ("ext4: add ext4_is_quota_file()") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251013015128.499308-2-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-06fs: ext4: fix uninitialized symbolsRanganath V N
Fix the issue detected by the smatch tool. fs/ext4/inode.c:3583 ext4_map_blocks_atomic_write_slow() error: uninitialized symbol 'next_pblk'. fs/ext4/namei.c:1776 ext4_lookup() error: uninitialized symbol 'de'. fs/ext4/namei.c:1829 ext4_get_parent() error: uninitialized symbol 'de'. fs/ext4/namei.c:3162 ext4_rmdir() error: uninitialized symbol 'de'. fs/ext4/namei.c:3242 __ext4_unlink() error: uninitialized symbol 'de'. fs/ext4/namei.c:3697 ext4_find_delete_entry() error: uninitialized symbol 'de'. These changes enhance code clarity, address static analysis tool errors. Signed-off-by: Ranganath V N <vnranganath.20@gmail.com> Message-ID: <20251011063830.47485-1-vnranganath.20@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-06ext4: make error code in __ext4fs_dirhash() consistent.Julian Sun
Currently __ext4fs_dirhash() returns -1 (-EPERM) if fscrypt doesn't have encryption key, which may confuse users. Make the error code here consistent with existing error code. Signed-off-by: Julian Sun <sunjunchao@bytedance.com> Message-ID: <20251010095257.3008275-1-sunjunchao@bytedance.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-05ext4: use super write guard in write_mmp_block()Christian Brauner
Link: https://patch.msgid.link/20251104-work-guards-v1-5-5108ac78a171@kernel.org Acked-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31ext4: Use folio_next_pos()Matthew Wilcox (Oracle)
This is one instruction more efficient than open-coding folio_pos() + folio_size(). It's the equivalent of (x + y) << z rather than x << z + y << z. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20251024170822.1427218-5-willy@infradead.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: linux-ext4@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29fs: Make wbc_to_tag() inline and use it in fs.Julian Sun
The logic in wbc_to_tag() is widely used in file systems, so modify this function to be inline and use it in file systems. This patch has only passed compilation tests, but it should be fine. Signed-off-by: Julian Sun <sunjunchao@bytedance.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20Manual conversion to use ->i_state accessors of all places not covered by ↵Mateusz Guzik
coccinelle Nothing to look at apart from iput_final(). Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-15Merge tag 'ext4_for_linus-6.18-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bug fixes from Ted Ts'o: - Fix regression caused by removing CONFIG_EXT3_FS when testing some very old defconfigs - Avoid a BUG_ON when opening a file on a maliciously corrupted file system - Avoid mm warnings when freeing a very large orphan file metadata - Avoid a theoretical races between metadata writeback and checkpoints (it's very hard to hit in practice, since the race requires that the writeback take a very long time) * tag 'ext4_for_linus-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: Use CONFIG_EXT4_FS instead of CONFIG_EXT3_FS in all of the defconfigs ext4: free orphan info with kvfree ext4: detect invalid INLINE_DATA + EXTENTS flag combination ext4, doc: fix and improve directory hash tree description ext4: wait for ongoing I/O to complete before freeing blocks jbd2: ensure that all ongoing I/O complete before freeing blocks
2025-10-10ext4: free orphan info with kvfreeJan Kara
Orphan info is now getting allocated with kvmalloc_array(). Free it with kvfree() instead of kfree() to avoid complaints from mm. Reported-by: Chris Mason <clm@meta.com> Fixes: 0a6ce20c1564 ("ext4: verify orphan file size is not too big") Cc: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Message-ID: <20251007134936.7291-2-jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-10-10ext4: detect invalid INLINE_DATA + EXTENTS flag combinationDeepanshu Kartikey
syzbot reported a BUG_ON in ext4_es_cache_extent() when opening a verity file on a corrupted ext4 filesystem mounted without a journal. The issue is that the filesystem has an inode with both the INLINE_DATA and EXTENTS flags set: EXT4-fs error (device loop0): ext4_cache_extents:545: inode #15: comm syz.0.17: corrupted extent tree: lblk 0 < prev 66 Investigation revealed that the inode has both flags set: DEBUG: inode 15 - flag=1, i_inline_off=164, has_inline=1, extents_flag=1 This is an invalid combination since an inode should have either: - INLINE_DATA: data stored directly in the inode - EXTENTS: data stored in extent-mapped blocks Having both flags causes ext4_has_inline_data() to return true, skipping extent tree validation in __ext4_iget(). The unvalidated out-of-order extents then trigger a BUG_ON in ext4_es_cache_extent() due to integer underflow when calculating hole sizes. Fix this by detecting this invalid flag combination early in ext4_iget() and rejecting the corrupted inode. Cc: stable@kernel.org Reported-and-tested-by: syzbot+038b7bf43423e132b308@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=038b7bf43423e132b308 Suggested-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Message-ID: <20250930112810.315095-1-kartikey406@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-10-10ext4: wait for ongoing I/O to complete before freeing blocksZhang Yi
When freeing metadata blocks in nojournal mode, ext4_forget() calls bforget() to clear the dirty flag on the buffer_head and remvoe associated mappings. This is acceptable if the metadata has not yet begun to be written back. However, if the write-back has already started but is not yet completed, ext4_forget() will have no effect. Subsequently, ext4_mb_clear_bb() will immediately return the block to the mb allocator. This block can then be reallocated immediately, potentially causing an data corruption issue. Fix this by clearing the buffer's dirty flag and waiting for the ongoing I/O to complete, ensuring that no further writes to stale data will occur. Fixes: 16e08b14a455 ("ext4: cleanup clean_bdev_aliases() calls") Cc: stable@kernel.org Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com> Closes: https://lore.kernel.org/linux-ext4/a9417096-9549-4441-9878-b1955b899b4e@huaweicloud.com/ Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20250916093337.3161016-3-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-10-03Merge tag 'ext4_for_linus-6.18-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "New ext4 features: - Add support so tune2fs can modify/update the superblock using an ioctl, without needing write access to the block device - Add support for 32-bit reserved uid's and gid's Bug fixes: - Fix potential warnings and other failures caused by corrupted / fuzzed file systems - Fail unaligned direct I/O write with EINVAL instead of silently falling back to buffered I/O - Correectly handle fsmap queries for metadata mappings - Avoid journal stalls caused by writeback throttling - Add some missing GFP_NOFAIL flags to avoid potential deadlocks under extremem memory pressure Cleanups: - Remove obsolete EXT3 Kconfigs" * tag 'ext4_for_linus-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix checks for orphan inodes ext4: validate ea_ino and size in check_xattrs ext4: guard against EA inode refcount underflow in xattr update ext4: implemet new ioctls to set and get superblock parameters ext4: add support for 32-bit default reserved uid and gid values ext4: avoid potential buffer over-read in parse_apply_sb_mount_options() ext4: fix an off-by-one issue during moving extents ext4: increase i_disksize to offset + len in ext4_update_disksize_before_punch() ext4: verify orphan file size is not too big ext4: fail unaligned direct IO write with EINVAL ext4: correctly handle queries for metadata mappings ext4: increase IO priority of fastcommit ext4: remove obsolete EXT3 config options jbd2: increase IO priority of checkpoint ext4: fix potential null deref in ext4_mb_init() ext4: add ext4_sb_bread_nofail() helper function for ext4_free_branches() ext4: replace min/max nesting with clamp() fs: ext4: change GFP_KERNEL to GFP_NOFS to avoid deadlock
2025-09-29Merge tag 'vfs-6.18-rc1.workqueue' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs workqueue updates from Christian Brauner: "This contains various workqueue changes affecting the filesystem layer. Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This replaces the use of system_wq and system_unbound_wq. system_wq is a per-CPU workqueue which isn't very obvious from the name and system_unbound_wq is to be used when locality is not required. So this renames system_wq to system_percpu_wq, and system_unbound_wq to system_dfl_wq. This also adds a new WQ_PERCPU flag to allow the fs subsystem users to explicitly request the use of per-CPU behavior. Both WQ_UNBOUND and WQ_PERCPU flags coexist for one release cycle to allow callers to transition their calls. WQ_UNBOUND will be removed in a next release cycle" * tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: WQ_PERCPU added to alloc_workqueue users fs: replace use of system_wq with system_percpu_wq fs: replace use of system_unbound_wq with system_dfl_wq
2025-09-29Merge tag 'vfs-6.18-rc1.inode' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs inode updates from Christian Brauner: "This contains a series I originally wrote and that Eric brought over the finish line. It moves out the i_crypt_info and i_verity_info pointers out of 'struct inode' and into the fs-specific part of the inode. So now the few filesytems that actually make use of this pay the price in their own private inode storage instead of forcing it upon every user of struct inode. The pointer for the crypt and verity info is simply found by storing an offset to its address in struct fsverity_operations and struct fscrypt_operations. This shrinks struct inode by 16 bytes. I hope to move a lot more out of it in the future so that struct inode becomes really just about very core stuff that we need, much like struct dentry and struct file, instead of the dumping ground it has become over the years. On top of this are a various changes associated with the ongoing inode lifetime handling rework that multiple people are pushing forward: - Stop accessing inode->i_count directly in f2fs and gfs2. They simply should use the __iget() and iput() helpers - Make the i_state flags an enum - Rework the iput() logic Currently, if we are the last iput, and we have the I_DIRTY_TIME bit set, we will grab a reference on the inode again and then mark it dirty and then redo the put. This is to make sure we delay the time update for as long as possible We can rework this logic to simply dec i_count if it is not 1, and if it is do the time update while still holding the i_count reference Then we can replace the atomic_dec_and_lock with locking the ->i_lock and doing atomic_dec_and_test, since we did the atomic_add_unless above - Add an icount_read() helper and convert everyone that accesses inode->i_count directly for this purpose to use the helper - Expand dump_inode() to dump more information about an inode helping in debugging - Add some might_sleep() annotations to iput() and associated helpers" * tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: add might_sleep() annotation to iput() and more fs: expand dump_inode() inode: fix whitespace issues fs: add an icount_read helper fs: rework iput logic fs: make the i_state flags an enum fs: stop accessing ->i_count directly in f2fs and gfs2 fsverity: check IS_VERITY() in fsverity_cleanup_inode() fs: remove inode::i_verity_info btrfs: move verity info pointer to fs-specific part of inode f2fs: move verity info pointer to fs-specific part of inode ext4: move verity info pointer to fs-specific part of inode fsverity: add support for info in fs-specific part of inode fs: remove inode::i_crypt_info ceph: move crypt info pointer to fs-specific part of inode ubifs: move crypt info pointer to fs-specific part of inode f2fs: move crypt info pointer to fs-specific part of inode ext4: move crypt info pointer to fs-specific part of inode fscrypt: add support for info in fs-specific part of inode fscrypt: replace raw loads of info pointer with helper function
2025-09-29Merge tag 'vfs-6.18-rc1.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Add "initramfs_options" parameter to set initramfs mount options. This allows to add specific mount options to the rootfs to e.g., limit the memory size - Add RWF_NOSIGNAL flag for pwritev2() Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE signal from being raised when writing on disconnected pipes or sockets. The flag is handled directly by the pipe filesystem and converted to the existing MSG_NOSIGNAL flag for sockets - Allow to pass pid namespace as procfs mount option Ever since the introduction of pid namespaces, procfs has had very implicit behaviour surrounding them (the pidns used by a procfs mount is auto-selected based on the mounting process's active pidns, and the pidns itself is basically hidden once the mount has been constructed) This implicit behaviour has historically meant that userspace was required to do some special dances in order to configure the pidns of a procfs mount as desired. Examples include: * In order to bypass the mnt_too_revealing() check, Kubernetes creates a procfs mount from an empty pidns so that user namespaced containers can be nested (without this, the nested containers would fail to mount procfs) But this requires forking off a helper process because you cannot just one-shot this using mount(2) * Container runtimes in general need to fork into a container before configuring its mounts, which can lead to security issues in the case of shared-pidns containers (a privileged process in the pidns can interact with your container runtime process) While SUID_DUMP_DISABLE and user namespaces make this less of an issue, the strict need for this due to a minor uAPI wart is kind of unfortunate Things would be much easier if there was a way for userspace to just specify the pidns they want. So this pull request contains changes to implement a new "pidns" argument which can be set using fsconfig(2): fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd); fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0); or classic mount(2) / mount(8): // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid"); Cleanups: - Remove the last references to EXPORT_OP_ASYNC_LOCK - Make file_remove_privs_flags() static - Remove redundant __GFP_NOWARN when GFP_NOWAIT is used - Use try_cmpxchg() in start_dir_add() - Use try_cmpxchg() in sb_init_done_wq() - Replace offsetof() with struct_size() in ioctl_file_dedupe_range() - Remove vfs_ioctl() export - Replace rwlock() with spinlock in epoll code as rwlock causes priority inversion on preempt rt kernels - Make ns_entries in fs/proc/namespaces const - Use a switch() statement() in init_special_inode() just like we do in may_open() - Use struct_size() in dir_add() in the initramfs code - Use str_plural() in rd_load_image() - Replace strcpy() with strscpy() in find_link() - Rename generic_delete_inode() to inode_just_drop() and generic_drop_inode() to inode_generic_drop() - Remove unused arguments from fcntl_{g,s}et_rw_hint() Fixes: - Document @name parameter for name_contains_dotdot() helper - Fix spelling mistake - Always return zero from replace_fd() instead of the file descriptor number - Limit the size for copy_file_range() in compat mode to prevent a signed overflow - Fix debugfs mount options not being applied - Verify the inode mode when loading it from disk in minixfs - Verify the inode mode when loading it from disk in cramfs - Don't trigger automounts with RESOLVE_NO_XDEV If openat2() was called with RESOLVE_NO_XDEV it didn't traverse through automounts, but could still trigger them - Add FL_RECLAIM flag to show_fl_flags() macro so it appears in tracepoints - Fix unused variable warning in rd_load_image() on s390 - Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD - Use ns_capable_noaudit() when determining net sysctl permissions - Don't call path_put() under namespace semaphore in listmount() and statmount()" * tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits) fcntl: trim arguments listmount: don't call path_put() under namespace semaphore statmount: don't call path_put() under namespace semaphore pid: use ns_capable_noaudit() when determining net sysctl permissions fs: rename generic_delete_inode() and generic_drop_inode() init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD initramfs: Replace strcpy() with strscpy() in find_link() initrd: Use str_plural() in rd_load_image() initramfs: Use struct_size() helper to improve dir_add() initrd: Fix unused variable warning in rd_load_image() on s390 fs: use the switch statement in init_special_inode() fs/proc/namespaces: make ns_entries const filelock: add FL_RECLAIM to show_fl_flags() macro eventpoll: Replace rwlock with spinlock selftests/proc: add tests for new pidns APIs procfs: add "pidns" mount option pidns: move is-ancestor logic to helper openat2: don't trigger automounts with RESOLVE_NO_XDEV namei: move cross-device check to __traverse_mounts namei: remove LOOKUP_NO_XDEV check from handle_mounts ...
2025-09-26ext4: fix checks for orphan inodesJan Kara
When orphan file feature is enabled, inode can be tracked as orphan either in the standard orphan list or in the orphan file. The first can be tested by checking ei->i_orphan list head, the second is recorded by EXT4_STATE_ORPHAN_FILE inode state flag. There are several places where we want to check whether inode is tracked as orphan and only some of them properly check for both possibilities. Luckily the consequences are mostly minor, the worst that can happen is that we track an inode as orphan although we don't need to and e2fsck then complains (resulting in occasional ext4/307 xfstest failures). Fix the problem by introducing a helper for checking whether an inode is tracked as orphan and use it in appropriate places. Fixes: 4a79a98c7b19 ("ext4: Improve scalability of ext4 orphan file handling") Cc: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Message-ID: <20250925123038.20264-2-jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-26ext4: validate ea_ino and size in check_xattrsDeepanshu Kartikey
During xattr block validation, check_xattrs() processes xattr entries without validating that entries claiming to use EA inodes have non-zero sizes. Corrupted filesystems may contain xattr entries where e_value_size is zero but e_value_inum is non-zero, indicating invalid xattr data. Add validation in check_xattrs() to detect this corruption pattern early and return -EFSCORRUPTED, preventing invalid xattr entries from causing issues throughout the ext4 codebase. Cc: stable@kernel.org Suggested-by: Theodore Ts'o <tytso@mit.edu> Reported-by: syzbot+4c9d23743a2409b80293@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=4c9d23743a2409b80293 Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Message-ID: <20250923133245.1091761-1-kartikey406@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-26ext4: guard against EA inode refcount underflow in xattr updateAhmet Eray Karadag
syzkaller found a path where ext4_xattr_inode_update_ref() reads an EA inode refcount that is already <= 0 and then applies ref_change (often -1). That lets the refcount underflow and we proceed with a bogus value, triggering errors like: EXT4-fs error: EA inode <n> ref underflow: ref_count=-1 ref_change=-1 EXT4-fs warning: ea_inode dec ref err=-117 Make the invariant explicit: if the current refcount is non-positive, treat this as on-disk corruption, emit ext4_error_inode(), and fail the operation with -EFSCORRUPTED instead of updating the refcount. Delete the WARN_ONCE() as negative refcounts are now impossible; keep error reporting in ext4_error_inode(). This prevents the underflow and the follow-on orphan/cleanup churn. Reported-by: syzbot+0be4f339a8218d2a5bb1@syzkaller.appspotmail.com Fixes: https://syzbot.org/bug?extid=0be4f339a8218d2a5bb1 Cc: stable@kernel.org Co-developed-by: Albin Babu Varghese <albinbabuvarghese20@gmail.com> Signed-off-by: Albin Babu Varghese <albinbabuvarghese20@gmail.com> Signed-off-by: Ahmet Eray Karadag <eraykrdg1@gmail.com> Message-ID: <20250920021342.45575-1-eraykrdg1@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-26ext4: implemet new ioctls to set and get superblock parametersTheodore Ts'o
Implement the EXT4_IOC_GET_TUNE_SB_PARAM and EXT4_IOC_SET_TUNE_SB_PARAM ioctls, which allow certains superblock parameters to be set while the file system is mounted, without needing write access to the block device. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Message-ID: <20250916-tune2fs-v2-3-d594dc7486f0@mit.edu> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-26ext4: add support for 32-bit default reserved uid and gid valuesTheodore Ts'o
Support for specifying the default user id and group id that is allowed to use the reserved block space was added way back when Linux only supported 16-bit uid's and gid's. (Yeah, that long ago.) It's not a commonly used feature, but let's add support for 32-bit user and group id's. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Message-ID: <20250916-tune2fs-v2-2-d594dc7486f0@mit.edu> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-26ext4: avoid potential buffer over-read in parse_apply_sb_mount_options()Theodore Ts'o
Unlike other strings in the ext4 superblock, we rely on tune2fs to make sure s_mount_opts is NUL terminated. Harden parse_apply_sb_mount_options() by treating s_mount_opts as a potential __nonstring. Cc: stable@vger.kernel.org Fixes: 8b67f04ab9de ("ext4: Add mount options in superblock") Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Message-ID: <20250916-tune2fs-v2-1-d594dc7486f0@mit.edu> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-26ext4: fix an off-by-one issue during moving extentsZhang Yi
During the movement of a written extent, mext_page_mkuptodate() is called to read data in the range [from, to) into the page cache and to update the corresponding buffers. Therefore, we should not wait on any buffer whose start offset is >= 'to'. Otherwise, it will return -EIO and fail the extents movement. $ for i in `seq 3 -1 0`; \ do xfs_io -fs -c "pwrite -b 1024 $((i * 1024)) 1024" /mnt/foo; \ done $ umount /mnt && mount /dev/pmem1s /mnt # drop cache $ e4defrag /mnt/foo e4defrag 1.47.0 (5-Feb-2023) ext4 defragmentation for /mnt/foo [1/1]/mnt/foo: 0% [ NG ] Success: [0/1] Cc: stable@kernel.org Fixes: a40759fb16ae ("ext4: remove array of buffer_heads from mext_page_mkuptodate()") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20250912105841.1886799-1-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-26ext4: increase i_disksize to offset + len in ext4_update_disksize_before_punch()Yongjian Sun
After running a stress test combined with fault injection, we performed fsck -a followed by fsck -fn on the filesystem image. During the second pass, fsck -fn reported: Inode 131512, end of extent exceeds allowed value (logical block 405, physical block 1180540, len 2) This inode was not in the orphan list. Analysis revealed the following call chain that leads to the inconsistency: ext4_da_write_end() //does not update i_disksize ext4_punch_hole() //truncate folio, keep size ext4_page_mkwrite() ext4_block_page_mkwrite() ext4_block_write_begin() ext4_get_block() //insert written extent without update i_disksize journal commit echo 1 > /sys/block/xxx/device/delete da-write path updates i_size but does not update i_disksize. Then ext4_punch_hole truncates the da-folio yet still leaves i_disksize unchanged(in the ext4_update_disksize_before_punch function, the condition offset + len < size is met). Then ext4_page_mkwrite sees ext4_nonda_switch return 1 and takes the nodioread_nolock path, the folio about to be written has just been punched out, and it’s offset sits beyond the current i_disksize. This may result in a written extent being inserted, but again does not update i_disksize. If the journal gets committed and then the block device is yanked, we might run into this. It should be noted that replacing ext4_punch_hole with ext4_zero_range in the call sequence may also trigger this issue, as neither will update i_disksize under these circumstances. To fix this, we can modify ext4_update_disksize_before_punch to increase i_disksize to min(i_size, offset + len) when both i_size and (offset + len) are greater than i_disksize. Cc: stable@kernel.org Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Message-ID: <20250911133024.1841027-1-sunyongjian@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-26ext4: verify orphan file size is not too bigJan Kara
In principle orphan file can be arbitrarily large. However orphan replay needs to traverse it all and we also pin all its buffers in memory. Thus filesystems with absurdly large orphan files can lead to big amounts of memory consumed. Limit orphan file size to a sane value and also use kvmalloc() for allocating array of block descriptor structures to avoid large order allocations for sane but large orphan files. Reported-by: syzbot+0b92850d68d9b12934f5@syzkaller.appspotmail.com Fixes: 02f310fcf47f ("ext4: Speedup ext4 orphan inode handling") Cc: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Message-ID: <20250909112206.10459-2-jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-26ext4: fail unaligned direct IO write with EINVALJan Kara
Commit bc264fea0f6f ("iomap: support incremental iomap_iter advances") changed the error handling logic in iomap_iter(). Previously any error from iomap_dio_bio_iter() got propagated to userspace, after this commit if ->iomap_end returns error, it gets propagated to userspace instead of an error from iomap_dio_bio_iter(). This results in unaligned writes to ext4 to silently fallback to buffered IO instead of erroring out. Now returning ENOTBLK for DIO writes from ext4_iomap_end() seems unnecessary these days. It is enough to return ENOTBLK from ext4_iomap_begin() when we don't support DIO write for that particular file offset (due to hole). Fixes: bc264fea0f6f ("iomap: support incremental iomap_iter advances") Cc: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Message-ID: <20250901112739.32484-2-jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-25ext4: correctly handle queries for metadata mappingsOjaswin Mujoo
Currently, our handling of metadata is _ambiguous_ in some scenarios, that is, we end up returning unknown if the range only covers the mapping partially. For example, in the following case: $ xfs_io -c fsmap -d 0: 254:16 [0..7]: static fs metadata 8 1: 254:16 [8..15]: special 102:1 8 2: 254:16 [16..5127]: special 102:2 5112 3: 254:16 [5128..5255]: special 102:3 128 4: 254:16 [5256..5383]: special 102:4 128 5: 254:16 [5384..70919]: inodes 65536 6: 254:16 [70920..70967]: unknown 48 ... $ xfs_io -c fsmap -d 24 33 0: 254:16 [24..39]: unknown 16 <--- incomplete reporting $ xfs_io -c fsmap -d 24 33 (With patch) 0: 254:16 [16..5127]: special 102:2 5112 This is because earlier in ext4_getfsmap_meta_helper, we end up ignoring any extent that starts before our queried range, but overlaps it. While the man page [1] is a bit ambiguous on this, this fix makes the output make more sense since we are anyways returning an "unknown" extent. This is also consistent to how XFS does it: $ xfs_io -c fsmap -d ... 6: 254:16 [104..127]: free space 24 7: 254:16 [128..191]: inodes 64 ... $ xfs_io -c fsmap -d 137 150 0: 254:16 [128..191]: inodes 64 <-- full extent returned [1] https://man7.org/linux/man-pages/man2/ioctl_getfsmap.2.html Reported-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: stable@kernel.org Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <023f37e35ee280cd9baac0296cbadcbe10995cab.1757058211.git.ojaswin@linux.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-25ext4: increase IO priority of fastcommitJulian Sun
The following code paths may result in high latency or even task hangs: 1. fastcommit io is throttled by wbt. 2. jbd2_fc_wait_bufs() might wait for a long time while JBD2_FAST_COMMIT_ONGOING is set in journal->flags, and then jbd2_journal_commit_transaction() waits for the JBD2_FAST_COMMIT_ONGOING bit for a long time while holding the write lock of j_state_lock. 3. start_this_handle() waits for read lock of j_state_lock which results in high latency or task hang. Given the fact that ext4_fc_commit() already modifies the current process' IO priority to match that of the jbd2 thread, it should be reasonable to match jbd2's IO submission flags as well. Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Julian Sun <sunjunchao@bytedance.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20250827121812.1477634-1-sunjunchao@bytedance.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-25ext4: remove obsolete EXT3 config optionsLukas Bulwahn
In June 2015, commit c290ea01abb7 ("fs: Remove ext3 filesystem driver") removed the historic ext3 filesystem support as ext3 partitions are fully supported with the ext4 filesystem support. To simplify updating the kernel build configuration, which had only EXT3 support but not EXT4 support enabled, the three config options EXT3_{FS,FS_POSIX_ACL,FS_SECURITY} were kept, instead of immediately removing them. The three options just enable the corresponding EXT4 counterparts when configs from older kernel versions are used to build on later kernel versions. This ensures that the kernels from those kernel build configurations would then continue to have EXT4 enabled for supporting booting from ext3 and ext4 file systems, to avoid potential unexpected surprises. Given that the kernel build configuration has no backwards-compatibility guarantee and this transition phase for such build configurations has been in place for a decade, we can reasonably expect all such users to have transitioned to use the EXT4 config options in their config files at this point in time. With that in mind, the three EXT3 config options are obsolete by now. Remove the obsolete EXT3 config options. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-25ext4: fix potential null deref in ext4_mb_init()Baokun Li
In ext4_mb_init(), ext4_mb_avg_fragment_size_destroy() may be called when sbi->s_mb_avg_fragment_size remains uninitialized (e.g., if groupinfo slab cache allocation fails). Since ext4_mb_avg_fragment_size_destroy() lacks null pointer checking, this leads to a null pointer dereference. ================================================================== EXT4-fs: no memory for groupinfo slab cache BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 0 P4D 0 Oops: Oops: 0002 [#1] SMP PTI CPU:2 UID: 0 PID: 87 Comm:mount Not tainted 6.17.0-rc2 #1134 PREEMPT(none) RIP: 0010:_raw_spin_lock_irqsave+0x1b/0x40 Call Trace: <TASK> xa_destroy+0x61/0x130 ext4_mb_init+0x483/0x540 __ext4_fill_super+0x116d/0x17b0 ext4_fill_super+0xd3/0x280 get_tree_bdev_flags+0x132/0x1d0 vfs_get_tree+0x29/0xd0 do_new_mount+0x197/0x300 __x64_sys_mount+0x116/0x150 do_syscall_64+0x50/0x1c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e ================================================================== Therefore, add necessary null check to ext4_mb_avg_fragment_size_destroy() to prevent this issue. The same fix is also applied to ext4_mb_largest_free_orders_destroy(). Reported-by: syzbot+1713b1aa266195b916c2@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=1713b1aa266195b916c2 Cc: stable@kernel.org Fixes: f7eaacbb4e54 ("ext4: convert free groups order lists to xarrays") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-25ext4: add ext4_sb_bread_nofail() helper function for ext4_free_branches()Baokun Li
The implicit __GFP_NOFAIL flag in ext4_sb_bread() was removed in commit 8a83ac54940d ("ext4: call bdev_getblk() from sb_getblk_gfp()"), meaning the function can now fail under memory pressure. Most callers of ext4_sb_bread() propagate the error to userspace and do not remount the filesystem read-only. However, ext4_free_branches() handles ext4_sb_bread() failure by remounting the filesystem read-only. This implies that an ext3 filesystem (mounted via the ext4 driver) could be forcibly remounted read-only due to a transient page allocation failure, which is unacceptable. To mitigate this, introduce a new helper function, ext4_sb_bread_nofail(), which explicitly uses __GFP_NOFAIL, and use it in ext4_free_branches(). Fixes: 8a83ac54940d ("ext4: call bdev_getblk() from sb_getblk_gfp()") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-25ext4: replace min/max nesting with clamp()Xichao Zhao
The clamp() macro explicitly expresses the intent of constraining a value within bounds.Therefore, replacing max(min(a,b),c) with clamp(val, lo, hi) can improve code readability. Signed-off-by: Xichao Zhao <zhao.xichao@vivo.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-25fs: ext4: change GFP_KERNEL to GFP_NOFS to avoid deadlockchuguangqing
The parent function ext4_xattr_inode_lookup_create already uses GFP_NOFS for memory alloction, so the function ext4_xattr_inode_cache_find should use same gfp_flag. Signed-off-by: chuguangqing <chuguangqing@inspur.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-19fs: replace use of system_unbound_wq with system_dfl_wqMarco Crivellari
Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. system_unbound_wq should be the default workqueue so as not to enforce locality constraints for random work whenever it's not required. Adding system_dfl_wq to encourage its use when unbound work should be used. The old system_unbound_wq will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://lore.kernel.org/20250916082906.77439-2-marco.crivellari@suse.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-15fs: rename generic_delete_inode() and generic_drop_inode()Mateusz Guzik
generic_delete_inode() is rather misleading for what the routine is doing. inode_just_drop() should be much clearer. The new naming is inconsistent with generic_drop_inode(), so rename that one as well with inode_ as the suffix. No functional changes. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-01fs: add an icount_read helperJosef Bacik
Instead of doing direct access to ->i_count, add a helper to handle this. This will make it easier to convert i_count to a refcount later. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/9bc62a84c6b9d6337781203f60837bd98fbc4a96.1756222464.git.josef@toxicpanda.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-21ext4: move verity info pointer to fs-specific part of inodeEric Biggers
Move the fsverity_info pointer into the filesystem-specific part of the inode by adding the field ext4_inode_info::i_verity_info and configuring fsverity_operations::inode_info_offs accordingly. This is a prerequisite for a later commit that removes inode::i_verity_info, saving memory and improving cache efficiency on filesystems that don't support fsverity. Co-developed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Link: https://lore.kernel.org/20250810075706.172910-10-ebiggers@kernel.org Acked-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-21ext4: move crypt info pointer to fs-specific part of inodeEric Biggers
Move the fscrypt_inode_info pointer into the filesystem-specific part of the inode by adding the field ext4_inode_info::i_crypt_info and configuring fscrypt_operations::inode_info_offs accordingly. This is a prerequisite for a later commit that removes inode::i_crypt_info, saving memory and improving cache efficiency with filesystems that don't support fscrypt. Co-developed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Link: https://lore.kernel.org/20250810075706.172910-4-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-18Merge tag 'ext4_for_linus-6.17-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: - Fix fast commit checks for file systems with ea_inode enabled - Don't drop the i_version mount option on a remount - Fix FIEMAP reporting when there are holes in a bigalloc file system - Don't fail when mounting read-only when there are inodes in the orphan file - Fix hole length overflow for indirect mapped files on file systems with an 8k or 16k block file system * tag 'ext4_for_linus-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: jbd2: prevent softlockup in jbd2_log_do_checkpoint() ext4: fix incorrect function name in comment ext4: use kmalloc_array() for array space allocation ext4: fix hole length calculation overflow in non-extent inodes ext4: don't try to clear the orphan_present feature block device is r/o ext4: fix reserved gdt blocks handling in fsmap ext4: fix fsmap end of range reporting with bigalloc ext4: remove redundant __GFP_NOWARN ext4: fix unused variable warning in ext4_init_new_dir ext4: remove useless if check ext4: check fast symlink for ea_inode correctly ext4: preserve SB_I_VERSION on remount ext4: show the default enabled i_version option
2025-08-13ext4: fix incorrect function name in commentBaolin Liu
Since commit 6b730a405037 “ext4: hoist ext4_block_write_begin and replace the __block_write_begin”, the comment should be updated accordingly from "__block_write_begin" to "ext4_block_write_begin". Fixes: 6b730a405037 (“ext4: hoist ext4_block_write_begin and replace...") Signed-off-by: Baolin Liu <liubaolin@kylinos.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/20250812021709.1120716-1-liubaolin12138@163.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-08-12ext4: use kmalloc_array() for array space allocationLiao Yuanhong
Replace kmalloc(size * sizeof) with kmalloc_array() for safer memory allocation and overflow prevention. Cc: stable@kernel.org Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com> Link: https://patch.msgid.link/20250811125816.570142-1-liaoyuanhong@vivo.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-08-12ext4: fix hole length calculation overflow in non-extent inodesZhang Yi
In a filesystem with a block size larger than 4KB, the hole length calculation for a non-extent inode in ext4_ind_map_blocks() can easily exceed INT_MAX. Then it could return a zero length hole and trigger the following waring and infinite in the iomap infrastructure. ------------[ cut here ]------------ WARNING: CPU: 3 PID: 434101 at fs/iomap/iter.c:34 iomap_iter_done+0x148/0x190 CPU: 3 UID: 0 PID: 434101 Comm: fsstress Not tainted 6.16.0-rc7+ #128 PREEMPT(voluntary) Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : iomap_iter_done+0x148/0x190 lr : iomap_iter+0x174/0x230 sp : ffff8000880af740 x29: ffff8000880af740 x28: ffff0000db8e6840 x27: 0000000000000000 x26: 0000000000000000 x25: ffff8000880af830 x24: 0000004000000000 x23: 0000000000000002 x22: 000001bfdbfa8000 x21: ffffa6a41c002e48 x20: 0000000000000001 x19: ffff8000880af808 x18: 0000000000000000 x17: 0000000000000000 x16: ffffa6a495ee6cd0 x15: 0000000000000000 x14: 00000000000003d4 x13: 00000000fa83b2da x12: 0000b236fc95f18c x11: ffffa6a4978b9c08 x10: 0000000000001da0 x9 : ffffa6a41c1a2a44 x8 : ffff8000880af5c8 x7 : 0000000001000000 x6 : 0000000000000000 x5 : 0000000000000004 x4 : 000001bfdbfa8000 x3 : 0000000000000000 x2 : 0000000000000000 x1 : 0000004004030000 x0 : 0000000000000000 Call trace: iomap_iter_done+0x148/0x190 (P) iomap_iter+0x174/0x230 iomap_fiemap+0x154/0x1d8 ext4_fiemap+0x110/0x140 [ext4] do_vfs_ioctl+0x4b8/0xbc0 __arm64_sys_ioctl+0x8c/0x120 invoke_syscall+0x6c/0x100 el0_svc_common.constprop.0+0x48/0xf0 do_el0_svc+0x24/0x38 el0_svc+0x38/0x120 el0t_64_sync_handler+0x10c/0x138 el0t_64_sync+0x198/0x1a0 ---[ end trace 0000000000000000 ]--- Cc: stable@kernel.org Fixes: facab4d9711e ("ext4: return hole from ext4_map_blocks()") Reported-by: Qu Wenruo <wqu@suse.com> Closes: https://lore.kernel.org/linux-ext4/9b650a52-9672-4604-a765-bb6be55d1e4a@gmx.com/ Tested-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250811064532.1788289-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-08-12ext4: don't try to clear the orphan_present feature block device is r/oTheodore Ts'o
When the file system is frozen in preparation for taking an LVM snapshot, the journal is checkpointed and if the orphan_file feature is enabled, and the orphan file is empty, we clear the orphan_present feature flag. But if there are pending inodes that need to be removed the orphan_present feature flag can't be cleared. The problem comes if the block device is read-only. In that case, we can't process the orphan inode list, so it is skipped in ext4_orphan_cleanup(). But then in ext4_mark_recovery_complete(), this results in the ext4 error "Orphan file not empty on read-only fs" firing and the file system mount is aborted. Fix this by clearing the needs_recovery flag in the block device is read-only. We do this after the call to ext4_load_and_init-journal() since there are some error checks need to be done in case the journal needs to be replayed and the block device is read-only, or if the block device containing the externa journal is read-only, etc. Cc: stable@kernel.org Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1108271 Cc: stable@vger.kernel.org Fixes: 02f310fcf47f ("ext4: Speedup ext4 orphan inode handling") Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-08-12ext4: fix reserved gdt blocks handling in fsmapOjaswin Mujoo
In some cases like small FSes with no meta_bg and where the resize doesn't need extra gdt blocks as it can fit in the current one, s_reserved_gdt_blocks is set as 0, which causes fsmap to emit a 0 length entry, which is incorrect. $ mkfs.ext4 -b 65536 -O bigalloc /dev/sda 5G $ mount /dev/sda /mnt/scratch $ xfs_io -c "fsmap -d" /mnt/scartch 0: 253:48 [0..127]: static fs metadata 128 1: 253:48 [128..255]: special 102:1 128 2: 253:48 [256..255]: special 102:2 0 <---- 0 len entry 3: 253:48 [256..383]: special 102:3 128 Fix this by adding a check for this case. Cc: stable@kernel.org Fixes: 0c9ec4beecac ("ext4: support GETFSMAP ioctls") Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/08781b796453a5770112aa96ad14c864fbf31935.1754377641.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-08-12ext4: fix fsmap end of range reporting with bigallocOjaswin Mujoo
With bigalloc enabled, the logic to report last extent has a bug since we try to use cluster units instead of block units. This can cause an issue where extra incorrect entries might be returned back to the user. This was flagged by generic/365 with 64k bs and -O bigalloc. ** Details of issue ** The issue was noticed on 5G 64k blocksize FS with -O bigalloc which has only 1 bg. $ xfs_io -c "fsmap -d" /mnt/scratch 0: 253:48 [0..127]: static fs metadata 128 /* sb */ 1: 253:48 [128..255]: special 102:1 128 /* gdt */ 3: 253:48 [256..383]: special 102:3 128 /* block bitmap */ 4: 253:48 [384..2303]: unknown 1920 /* flex bg empty space */ 5: 253:48 [2304..2431]: special 102:4 128 /* inode bitmap */ 6: 253:48 [2432..4351]: unknown 1920 /* flex bg empty space */ 7: 253:48 [4352..6911]: inodes 2560 8: 253:48 [6912..538623]: unknown 531712 9: 253:48 [538624..10485759]: free space 9947136 The issue can be seen with: $ xfs_io -c "fsmap -d 0 3" /mnt/scratch 0: 253:48 [0..127]: static fs metadata 128 1: 253:48 [384..2047]: unknown 1664 Only the first entry was expected to be returned but we get 2. This is because: ext4_getfsmap_datadev() first_cluster, last_cluster = 0 ... info->gfi_last = true; ext4_getfsmap_datadev_helper(sb, end_ag, last_cluster + 1, 0, info); fsb = C2B(1) = 16 fslen = 0 ... /* Merge in any relevant extents from the meta_list */ list_for_each_entry_safe(p, tmp, &info->gfi_meta_list, fmr_list) { ... // since fsb = 16, considers all metadata which starts before 16 blockno iter 1: error = ext4_getfsmap_helper(sb, info, p); // p = sb (0,1), nop info->gfi_next_fsblk = 1 iter 2: error = ext4_getfsmap_helper(sb, info, p); // p = gdt (1,2), nop info->gfi_next_fsblk = 2 iter 3: error = ext4_getfsmap_helper(sb, info, p); // p = blk bitmap (2,3), nop info->gfi_next_fsblk = 3 iter 4: error = ext4_getfsmap_helper(sb, info, p); // p = ino bitmap (18,19) if (rec_blk > info->gfi_next_fsblk) { // (18 > 3) // emits an extra entry ** BUG ** } } Fix this by directly calling ext4_getfsmap_datadev() with a dummy record that has fmr_physical set to (end_fsb + 1) instead of last_cluster + 1. By using the block instead of cluster we get the correct behavior. Replacing ext4_getfsmap_datadev_helper() with ext4_getfsmap_helper() is okay since the gfi_lastfree and metadata checks in ext4_getfsmap_datadev_helper() are anyways redundant when we only want to emit the last allocated block of the range, as we have already taken care of emitting metadata and any last free blocks. Cc: stable@kernel.org Reported-by: Disha Goel <disgoel@linux.ibm.com> Fixes: 4a622e4d477b ("ext4: fix FS_IOC_GETFSMAP handling") Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/e7472c8535c9c5ec10f425f495366864ea12c9da.1754377641.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>