summaryrefslogtreecommitdiff
path: root/fs/libfs.c
AgeCommit message (Collapse)Author
6 daysshmem: fix recovery on rename failuresAl Viro
maple_tree insertions can fail if we are seriously short on memory; simple_offset_rename() does not recover well if it runs into that. The same goes for simple_offset_rename_exchange(). Moreover, shmem_whiteout() expects that if it succeeds, the caller will progress to d_move(), i.e. that shmem_rename2() won't fail past the successful call of shmem_whiteout(). Not hard to fix, fortunately - mtree_store() can't fail if the index we are trying to store into is already present in the tree as a singleton. For simple_offset_rename_exchange() that's enough - we just need to be careful about the order of operations. For simple_offset_rename() solution is to preinsert the target into the tree for new_dir; the rest can be done without any potentially failing operations. That preinsertion has to be done in shmem_rename2() rather than in simple_offset_rename() itself - otherwise we'd need to deal with the possibility of failure after successful shmem_whiteout(). Fixes: a2e459555c5f ("shmem: stable directory offsets") Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-12-05Merge tag 'pull-persistency' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull persistent dentry infrastructure and conversion from Al Viro: "Some filesystems use a kinda-sorta controlled dentry refcount leak to pin dentries of created objects in dcache (and undo it when removing those). A reference is grabbed and not released, but it's not actually _stored_ anywhere. That works, but it's hard to follow and verify; among other things, we have no way to tell _which_ of the increments is intended to be an unpaired one. Worse, on removal we need to decide whether the reference had already been dropped, which can be non-trivial if that removal is on umount and we need to figure out if this dentry is pinned due to e.g. unlink() not done. Usually that is handled by using kill_litter_super() as ->kill_sb(), but there are open-coded special cases of the same (consider e.g. /proc/self). Things get simpler if we introduce a new dentry flag (DCACHE_PERSISTENT) marking those "leaked" dentries. Having it set claims responsibility for +1 in refcount. The end result this series is aiming for: - get these unbalanced dget() and dput() replaced with new primitives that would, in addition to adjusting refcount, set and clear persistency flag. - instead of having kill_litter_super() mess with removing the remaining "leaked" references (e.g. for all tmpfs files that hadn't been removed prior to umount), have the regular shrink_dcache_for_umount() strip DCACHE_PERSISTENT of all dentries, dropping the corresponding reference if it had been set. After that kill_litter_super() becomes an equivalent of kill_anon_super(). Doing that in a single step is not feasible - it would affect too many places in too many filesystems. It has to be split into a series. This work has really started early in 2024; quite a few preliminary pieces have already gone into mainline. This chunk is finally getting to the meat of that stuff - infrastructure and most of the conversions to it. Some pieces are still sitting in the local branches, but the bulk of that stuff is here" * tag 'pull-persistency' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits) d_make_discardable(): warn if given a non-persistent dentry kill securityfs_recursive_remove() convert securityfs get rid of kill_litter_super() convert rust_binderfs convert nfsctl convert rpc_pipefs convert hypfs hypfs: swich hypfs_create_u64() to returning int hypfs: switch hypfs_create_str() to returning int hypfs: don't pin dentries twice convert gadgetfs gadgetfs: switch to simple_remove_by_name() convert functionfs functionfs: switch to simple_remove_by_name() functionfs: fix the open/removal races functionfs: need to cancel ->reset_work in ->kill_sb() functionfs: don't bother with ffs->ref in ffs_data_{opened,closed}() functionfs: don't abuse ffs_data_closed() on fs shutdown convert selinuxfs ...
2025-12-01Merge tag 'vfs-6.19-rc1.directory.locking' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull directory locking updates from Christian Brauner: "This contains the work to add centralized APIs for directory locking operations. This series is part of a larger effort to change directory operation locking to allow multiple concurrent operations in a directory. The ultimate goal is to lock the target dentry(s) rather than the whole parent directory. To help with changing the locking protocol, this series centralizes locking and lookup in new helper functions. The helpers establish a pattern where it is the dentry that is being locked and unlocked (currently the lock is held on dentry->d_parent->d_inode, but that can change in the future). This also changes vfs_mkdir() to unlock the parent on failure, as well as dput()ing the dentry. This allows end_creating() to only require the target dentry (which may be IS_ERR() after vfs_mkdir()), not the parent" * tag 'vfs-6.19-rc1.directory.locking' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nfsd: fix end_creating() conversion VFS: introduce end_creating_keep() VFS: change vfs_mkdir() to unlock on failure. ecryptfs: use new start_creating/start_removing APIs Add start_renaming_two_dentries() VFS/ovl/smb: introduce start_renaming_dentry() VFS/nfsd/ovl: introduce start_renaming() and end_renaming() VFS: add start_creating_killable() and start_removing_killable() VFS: introduce start_removing_dentry() smb/server: use end_removing_noperm for for target of smb2_create_link() VFS: introduce start_creating_noperm() and start_removing_noperm() VFS/nfsd/cachefiles/ovl: introduce start_removing() and end_removing() VFS/nfsd/cachefiles/ovl: add start_creating() and end_creating() VFS: tidy up do_unlinkat() VFS: introduce start_dirop() and end_dirop() debugfs: rename end_creating() to debugfs_end_creating()
2025-12-01Merge tag 'namespace-6.19-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull namespace updates from Christian Brauner: "This contains substantial namespace infrastructure changes including a new system call, active reference counting, and extensive header cleanups. The branch depends on the shared kbuild branch for -fms-extensions support. Features: - listns() system call Add a new listns() system call that allows userspace to iterate through namespaces in the system. This provides a programmatic interface to discover and inspect namespaces, addressing longstanding limitations: Currently, there is no direct way for userspace to enumerate namespaces. Applications must resort to scanning /proc/*/ns/ across all processes, which is: - Inefficient - requires iterating over all processes - Incomplete - misses namespaces not attached to any running process but kept alive by file descriptors, bind mounts, or parent references - Permission-heavy - requires access to /proc for many processes - No ordering or ownership information - No filtering per namespace type The listns() system call solves these problems: ssize_t listns(const struct ns_id_req *req, u64 *ns_ids, size_t nr_ns_ids, unsigned int flags); struct ns_id_req { __u32 size; __u32 spare; __u64 ns_id; struct /* listns */ { __u32 ns_type; __u32 spare2; __u64 user_ns_id; }; }; Features include: - Pagination support for large namespace sets - Filtering by namespace type (MNT_NS, NET_NS, USER_NS, etc.) - Filtering by owning user namespace - Permission checks respecting namespace isolation - Active Reference Counting Introduce an active reference count that tracks namespace visibility to userspace. A namespace is visible in the following cases: - The namespace is in use by a task - The namespace is persisted through a VFS object (namespace file descriptor or bind-mount) - The namespace is a hierarchical type and is the parent of child namespaces The active reference count does not regulate lifetime (that's still done by the normal reference count) - it only regulates visibility to namespace file handles and listns(). This prevents resurrection of namespaces that are pinned only for internal kernel reasons (e.g., user namespaces held by file->f_cred, lazy TLB references on idle CPUs, etc.) which should not be accessible via (1)-(3). - Unified Namespace Tree Introduce a unified tree structure for all namespaces with: - Fixed IDs assigned to initial namespaces - Lookup based solely on inode number - Maintained list of owned namespaces per user namespace - Simplified rbtree comparison helpers Cleanups - Header Reorganization: - Move namespace types into separate header (ns_common_types.h) - Decouple nstree from ns_common header - Move nstree types into separate header - Switch to new ns_tree_{node,root} structures with helper functions - Use guards for ns_tree_lock - Initial Namespace Reference Count Optimization - Make all reference counts on initial namespaces a nop to avoid pointless cacheline ping-pong for namespaces that can never go away - Drop custom reference count initialization for initial namespaces - Add NS_COMMON_INIT() macro and use it for all namespaces - pid: rely on common reference count behavior - Miscellaneous Cleanups - Rename exit_task_namespaces() to exit_nsproxy_namespaces() - Rename is_initial_namespace() and make argument const - Use boolean to indicate anonymous mount namespace - Simplify owner list iteration in nstree - nsfs: raise SB_I_NODEV, SB_I_NOEXEC, and DCACHE_DONTCACHE explicitly - nsfs: use inode_just_drop() - pidfs: raise DCACHE_DONTCACHE explicitly - pidfs: simplify PIDFD_GET__NAMESPACE ioctls - libfs: allow to specify s_d_flags - cgroup: add cgroup namespace to tree after owner is set - nsproxy: fix free_nsproxy() and simplify create_new_namespaces() Fixes: - setns(pidfd, ...) race condition Fix a subtle race when using pidfds with setns(). When the target task exits after prepare_nsset() but before commit_nsset(), the namespace's active reference count might have been dropped. If setns() then installs the namespaces, it would bump the active reference count from zero without taking the required reference on the owner namespace, leading to underflow when later decremented. The fix resurrects the ownership chain if necessary - if the caller succeeded in grabbing passive references, the setns() should succeed even if the target task exits or gets reaped. - Return EFAULT on put_user() error instead of success - Make sure references are dropped outside of RCU lock (some namespaces like mount namespace sleep when putting the last reference) - Don't skip active reference count initialization for network namespace - Add asserts for active refcount underflow - Add asserts for initial namespace reference counts (both passive and active) - ipc: enable is_ns_init_id() assertions - Fix kernel-doc comments for internal nstree functions - Selftests - 15 active reference count tests - 9 listns() functionality tests - 7 listns() permission tests - 12 inactive namespace resurrection tests - 3 threaded active reference count tests - commit_creds() active reference tests - Pagination and stress tests - EFAULT handling test - nsid tests fixes" * tag 'namespace-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (103 commits) pidfs: simplify PIDFD_GET_<type>_NAMESPACE ioctls nstree: fix kernel-doc comments for internal functions nsproxy: fix free_nsproxy() and simplify create_new_namespaces() selftests/namespaces: fix nsid tests ns: drop custom reference count initialization for initial namespaces pid: rely on common reference count behavior ns: add asserts for initial namespace active reference counts ns: add asserts for initial namespace reference counts ns: make all reference counts on initial namespace a nop ipc: enable is_ns_init_id() assertions fs: use boolean to indicate anonymous mount namespace ns: rename is_initial_namespace() ns: make is_initial_namespace() argument const nstree: use guards for ns_tree_lock nstree: simplify owner list iteration nstree: switch to new structures nstree: add helper to operate on struct ns_tree_{node,root} nstree: move nstree types into separate header nstree: decouple from ns_common header ns: move namespace types into separate header ...
2025-11-17d_make_discardable(): warn if given a non-persistent dentryAl Viro
At this point there are very few call chains that might lead to d_make_discardable() on a dentry that hadn't been made persistent: calls of simple_unlink() and simple_rmdir() in configfs and apparmorfs. Both filesystems do pin (part of) their contents in dcache, but they are currently playing very unusual games with that. Converting them to more usual patterns might be possible, but it's definitely going to be a long series of changes in both cases. For now the easiest solution is to have both stop using simple_unlink() and simple_rmdir() - that allows to make d_make_discardable() warn when given a non-persistent dentry. Rather than giving them full-blown private copies (with calls of d_make_discardable() replaced with dput()), let's pull the parts of simple_unlink() and simple_rmdir() that deal with timestamps and link counts into separate helpers (__simple_unlink() and __simple_rmdir() resp.) and have those used by configfs and apparmorfs. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-11-16convert simple_{link,unlink,rmdir,rename,fill_super}() to new primitivesAl Viro
Note that simple_unlink() et.al. are used by many filesystems; for now they can not assume that persistency mark will have been set back when the object got created. Once all conversions are done we'll have them complain if called for something that had not been marked persistent. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-11-16new helper: simple_done_creating()Al Viro
should be paired with simple_start_creating() - unlocks parent and drops dentry reference. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-11-16new helper: simple_remove_by_name()Al Viro
simple_recursive_removal(), but instead of victim dentry it takes parent + name. Used to be open-coded in fs/fuse/control.c, but there's no need to expose the guts of that thing there and there are other potential users, so let's lift it into libfs... Acked-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-11-14VFS: introduce start_dirop() and end_dirop()NeilBrown
The fact that directory operations (create,remove,rename) are protected by a lock on the parent is known widely throughout the kernel. In order to change this - to instead lock the target dentry - it is best to centralise this knowledge so it can be changed in one place. This patch introduces start_dirop() which is local to VFS code. It performs the required locking for create and remove. Rename will be handled separately. Various functions with names like start_creating() or start_removing_path(), some of which already exist, will export this functionality beyond the VFS. end_dirop() is the partner of start_dirop(). It drops the lock and releases the reference on the dentry. It *is* exported so that various end_creating etc functions can be inline. As vfs_mkdir() drops the dentry on error we cannot use end_dirop() as that won't unlock when the dentry IS_ERR(). For now we need an explicit unlock when dentry IS_ERR(). I hope to change vfs_mkdir() to unlock when it drops a dentry so that explicit unlock can go away. end_dirop() can always be called on the result of start_dirop(), but not after vfs_mkdir(). After a vfs_mkdir() we still may need the explicit unlock as seen in end_creating_path(). As well as adding start_dirop() and end_dirop() this patch uses them in: - simple_start_creating (which requires sharing lookup_noperm_common() with libfs.c) - start_removing_path / start_removing_user_path_at - filename_create / end_creating_path() - do_rmdir(), do_unlinkat() Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neil@brown.name> Link: https://patch.msgid.link/20251113002050.676694-3-neilb@ownmail.net Tested-by: syzbot@syzkaller.appspotmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31libfs: allow to specify s_d_flagsChristian Brauner
Make it possible for pseudo filesystems to specify default dentry flags. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-1-2e6f823ebdc0@kernel.org Tested-by: syzbot@syzkaller.appspotmail.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20Coccinelle-based conversion to use ->i_state accessorsMateusz Guzik
All places were patched by coccinelle with the default expecting that ->i_lock is held, afterwards entries got fixed up by hand to use unlocked variants as needed. The script: @@ expression inode, flags; @@ - inode->i_state & flags + inode_state_read(inode) & flags @@ expression inode, flags; @@ - inode->i_state &= ~flags + inode_state_clear(inode, flags) @@ expression inode, flag1, flag2; @@ - inode->i_state &= ~flag1 & ~flag2 + inode_state_clear(inode, flag1 | flag2) @@ expression inode, flags; @@ - inode->i_state |= flags + inode_state_set(inode, flags) @@ expression inode, flags; @@ - inode->i_state = flags + inode_state_assign(inode, flags) @@ expression inode, flags; @@ - flags = inode->i_state + flags = inode_state_read(inode) @@ expression inode, flags; @@ - READ_ONCE(inode->i_state) & flags + inode_state_read(inode) & flags Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-28Merge tag 'vfs-6.17-rc1.pidfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull pidfs updates from Christian Brauner: - persistent info Persist exit and coredump information independent of whether anyone currently holds a pidfd for the struct pid. The current scheme allocated pidfs dentries on-demand repeatedly. This scheme is reaching it's limits as it makes it impossible to pin information that needs to be available after the task has exited or coredumped and that should not be lost simply because the pidfd got closed temporarily. The next opener should still see the stashed information. This is also a prerequisite for supporting extended attributes on pidfds to allow attaching meta information to them. If someone opens a pidfd for a struct pid a pidfs dentry is allocated and stashed in pid->stashed. Once the last pidfd for the struct pid is closed the pidfs dentry is released and removed from pid->stashed. So if 10 callers create a pidfs dentry for the same struct pid sequentially, i.e., each closing the pidfd before the other creates a new one then a new pidfs dentry is allocated every time. Because multiple tasks acquiring and releasing a pidfd for the same struct pid can race with each another a task may still find a valid pidfs entry from the previous task in pid->stashed and reuse it. Or it might find a dead dentry in there and fail to reuse it and so stashes a new pidfs dentry. Multiple tasks may race to stash a new pidfs dentry but only one will succeed, the other ones will put their dentry. The current scheme aims to ensure that a pidfs dentry for a struct pid can only be created if the task is still alive or if a pidfs dentry already existed before the task was reaped and so exit information has been was stashed in the pidfs inode. That's great except that it's buggy. If a pidfs dentry is stashed in pid->stashed after pidfs_exit() but before __unhash_process() is called we will return a pidfd for a reaped task without exit information being available. The pidfds_pid_valid() check does not guard against this race as it doens't sync at all with pidfs_exit(). The pid_has_task() check might be successful simply because we're before __unhash_process() but after pidfs_exit(). Introduce a new scheme where the lifetime of information associated with a pidfs entry (coredump and exit information) isn't bound to the lifetime of the pidfs inode but the struct pid itself. The first time a pidfs dentry is allocated for a struct pid a struct pidfs_attr will be allocated which will be used to store exit and coredump information. If all pidfs for the pidfs dentry are closed the dentry and inode can be cleaned up but the struct pidfs_attr will stick until the struct pid itself is freed. This will ensure minimal memory usage while persisting relevant information. The new scheme has various advantages. First, it allows to close the race where we end up handing out a pidfd for a reaped task for which no exit information is available. Second, it minimizes memory usage. Third, it allows to remove complex lifetime tracking via dentries when registering a struct pid with pidfs. There's no need to get or put a reference. Instead, the lifetime of exit and coredump information associated with a struct pid is bound to the lifetime of struct pid itself. - extended attributes Now that we have a way to persist information for pidfs dentries we can start supporting extended attributes on pidfds. This will allow userspace to attach meta information to tasks. One natural extension would be to introduce a custom pidfs.* extended attribute space and allow for the inheritance of extended attributes across fork() and exec(). The first simple scheme will allow privileged userspace to set trusted extended attributes on pidfs inodes. - Allow autonomous pidfs file handles Various filesystems such as pidfs and drm support opening file handles without having to require a file descriptor to identify the filesystem. The filesystem are global single instances and can be trivially identified solely on the information encoded in the file handle. This makes it possible to not have to keep or acquire a sentinal file descriptor just to pass it to open_by_handle_at() to identify the filesystem. That's especially useful when such sentinel file descriptor cannot or should not be acquired. For pidfs this means a file handle can function as full replacement for storing a pid in a file. Instead a file handle can be stored and reopened purely based on the file handle. Such autonomous file handles can be opened with or without specifying a a file descriptor. If no proper file descriptor is used the FD_PIDFS_ROOT sentinel must be passed. This allows us to define further special negative fd sentinels in the future. Userspace can trivially test for support by trying to open the file handle with an invalid file descriptor. - Allow pidfds for reaped tasks with SCM_PIDFD messages This is a logical continuation of the earlier work to create pidfds for reaped tasks through the SO_PEERPIDFD socket option merged in 923ea4d4482b ("Merge patch series "net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid""). - Two minor fixes: * Fold fs_struct->{lock,seq} into a seqlock * Don't bother with path_{get,put}() in unix_open_file() * tag 'vfs-6.17-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (37 commits) don't bother with path_get()/path_put() in unix_open_file() fold fs_struct->{lock,seq} into a seqlock selftests: net: extend SCM_PIDFD test to cover stale pidfds af_unix: enable handing out pidfds for reaped tasks in SCM_PIDFD af_unix: stash pidfs dentry when needed af_unix/scm: fix whitespace errors af_unix: introduce and use scm_replace_pid() helper af_unix: introduce unix_skb_to_scm helper af_unix: rework unix_maybe_add_creds() to allow sleep selftests/pidfd: decode pidfd file handles withou having to specify an fd fhandle, pidfs: support open_by_handle_at() purely based on file handle uapi/fcntl: add FD_PIDFS_ROOT uapi/fcntl: add FD_INVALID fcntl/pidfd: redefine PIDFD_SELF_THREAD_GROUP uapi/fcntl: mark range as reserved fhandle: reflow get_path_anchor() pidfs: add pidfs_root_path() helper fhandle: rename to get_path_anchor() fhandle: hoist copy_from_user() above get_path_from_fd() fhandle: raise FILEID_IS_DIR in handle_type ...
2025-07-28Merge tag 'vfs-6.17-rc1.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc VFS updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Add ext4 IOCB_DONTCACHE support This refactors the address_space_operations write_begin() and write_end() callbacks to take const struct kiocb * as their first argument, allowing IOCB flags such as IOCB_DONTCACHE to propagate to the filesystem's buffered I/O path. Ext4 is updated to implement handling of the IOCB_DONTCACHE flag and advertises support via the FOP_DONTCACHE file operation flag. Additionally, the i915 driver's shmem write paths are updated to bypass the legacy write_begin/write_end interface in favor of directly calling write_iter() with a constructed synchronous kiocb. Another i915 change replaces a manual write loop with kernel_write() during GEM shmem object creation. Cleanups: - don't duplicate vfs_open() in kernel_file_open() - proc_fd_getattr(): don't bother with S_ISDIR() check - fs/ecryptfs: replace snprintf with sysfs_emit in show function - vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes() - filelock: add new locks_wake_up_waiter() helper - fs: Remove three arguments from block_write_end() - VFS: change old_dir and new_dir in struct renamedata to dentrys - netfs: Remove unused declaration netfs_queue_write_request() Fixes: - eventpoll: Fix semi-unbounded recursion - eventpoll: fix sphinx documentation build warning - fs/read_write: Fix spelling typo - fs: annotate data race between poll_schedule_timeout() and pollwake() - fs/pipe: set FMODE_NOWAIT in create_pipe_files() - docs/vfs: update references to i_mutex to i_rwsem - fs/buffer: remove comment about hard sectorsize - fs/buffer: remove the min and max limit checks in __getblk_slow() - fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressable - fs_context: fix parameter name in infofc() macro - fs: Prevent file descriptor table allocations exceeding INT_MAX" * tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (24 commits) netfs: Remove unused declaration netfs_queue_write_request() eventpoll: fix sphinx documentation build warning ext4: support uncached buffered I/O mm/pagemap: add write_begin_get_folio() helper function fs: change write_begin/write_end interface to take struct kiocb * drm/i915: Refactor shmem_pwrite() to use kiocb and write_iter drm/i915: Use kernel_write() in shmem object create eventpoll: Fix semi-unbounded recursion vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes() fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressable fs/buffer: remove the min and max limit checks in __getblk_slow() fs: Prevent file descriptor table allocations exceeding INT_MAX fs: Remove three arguments from block_write_end() fs/ecryptfs: replace snprintf with sysfs_emit in show function fs: annotate suspected data race between poll_schedule_timeout() and pollwake() docs/vfs: update references to i_mutex to i_rwsem fs/buffer: remove comment about hard sectorsize fs_context: fix parameter name in infofc() macro VFS: change old_dir and new_dir in struct renamedata to dentrys proc_fd_getattr(): don't bother with S_ISDIR() check ...
2025-07-28Merge tag 'pull-rpc_pipefs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull rpc_pipefs updates from Al Viro: "Massage rpc_pipefs to use saner primitives and clean up the APIs provided to the rest of the kernel" * tag 'pull-rpc_pipefs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: rpc_create_client_dir(): return 0 or -E... rpc_create_client_dir(): don't bother with rpc_populate() rpc_new_dir(): the last argument is always NULL rpc_pipe: expand the calls of rpc_mkdir_populate() rpc_gssd_dummy_populate(): don't bother with rpc_populate() rpc_mkpipe_dentry(): switch to simple_start_creating() rpc_pipe: saner primitive for creating regular files rpc_pipe: saner primitive for creating subdirectories rpc_pipe: don't overdo directory locking rpc_mkpipe_dentry(): saner calling conventions rpc_unlink(): saner calling conventions rpc_populate(): lift cleanup into callers rpc_unlink(): use simple_recursive_removal() rpc_{rmdir_,}depopulate(): use simple_recursive_removal() instead rpc_pipe: clean failure exits in fill_super new helper: simple_start_creating()
2025-07-28Merge tag 'pull-simple_recursive_removal' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull simple_recursive_removal() update from Al Viro: "Removing subtrees of kernel filesystems is done in quite a few places; unfortunately, it's easy to get wrong. A number of open-coded attempts are out there, with varying amount of bogosities. simple_recursive_removal() had been introduced for doing that with all precautions needed; it does an equivalent of rm -rf, with sufficient locking, eviction of anything mounted on top of the subtree, etc. This series converts a bunch of open-coded instances to using that" * tag 'pull-simple_recursive_removal' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: functionfs, gadgetfs: use simple_recursive_removal() kill binderfs_remove_file() fuse_ctl: use simple_recursive_removal() pstore: switch to locked_recursive_removal() binfmt_misc: switch to locked_recursive_removal() spufs: switch to locked_recursive_removal() add locked_recursive_removal() better lockdep annotations for simple_recursive_removal() simple_recursive_removal(): saner interaction with fsnotify
2025-07-28Merge tag 'pull-dcache' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull dentry d_flags updates from Al Viro: "The current exclusion rules for dentry->d_flags stores are rather unpleasant. The basic rules are simple: - stores to dentry->d_flags are OK under dentry->d_lock - stores to dentry->d_flags are OK in the dentry constructor, before becomes potentially visible to other threads Unfortunately, there's a couple of exceptions to that, and that's where the headache comes from. The main PITA comes from d_set_d_op(); that primitive sets ->d_op of dentry and adjusts the flags that correspond to presence of individual methods. It's very easy to misuse; existing uses _are_ safe, but proof of correctness is brittle. Use in __d_alloc() is safe (we are within a constructor), but we might as well precalculate the initial value of 'd_flags' when we set the default ->d_op for given superblock and set 'd_flags' directly instead of messing with that helper. The reasons why other uses are safe are bloody convoluted; I'm not going to reproduce it here. See [1] for gory details, if you care. The critical part is using d_set_d_op() only just prior to d_splice_alias(), which makes a combination of d_splice_alias() with setting ->d_op, etc a natural replacement primitive. Better yet, if we go that way, it's easy to take setting ->d_op and modifying 'd_flags' under ->d_lock, which eliminates the headache as far as 'd_flags' exclusion rules are concerned. Other exceptions are minor and easy to deal with. What this series does: - d_set_d_op() is no longer available; instead a new primitive (d_splice_alias_ops()) is provided, equivalent to combination of d_set_d_op() and d_splice_alias(). - new field of struct super_block - 's_d_flags'. This sets the default value of 'd_flags' to be used when allocating dentries on this filesystem. - new primitive for setting 's_d_op': set_default_d_op(). This replaces stores to 's_d_op' at mount time. All in-tree filesystems converted; out-of-tree ones will get caught by the compiler ('s_d_op' is renamed, so stores to it will be caught). 's_d_flags' is set by the same primitive to match the 's_d_op'. - a lot of filesystems had sb->s_d_op->d_delete equal to always_delete_dentry; that is equivalent to setting DCACHE_DONTCACHE in 'd_flags', so such filesystems can bloody well set that bit in 's_d_flags' and drop 'd_delete()' from dentry_operations. In quite a few cases that results in empty dentry_operations, which means that we can get rid of those. - kill simple_dentry_operations - not needed anymore - massage d_alloc_parallel() to get rid of the other exception wrt 'd_flags' stores - we can set DCACHE_PAR_LOOKUP as soon as we allocate the new dentry; no need to delay that until we commit to using the sucker. As the result, 'd_flags' stores are all either under ->d_lock or done before the dentry becomes visible in any shared data structures" Link: https://lore.kernel.org/all/20250224010624.GT1977892@ZenIV/ [1] * tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (21 commits) configfs: use DCACHE_DONTCACHE debugfs: use DCACHE_DONTCACHE efivarfs: use DCACHE_DONTCACHE instead of always_delete_dentry() 9p: don't bother with always_delete_dentry ramfs, hugetlbfs, mqueue: set DCACHE_DONTCACHE kill simple_dentry_operations devpts, sunrpc, hostfs: don't bother with ->d_op shmem: no dentry retention past the refcount reaching zero d_alloc_parallel(): set DCACHE_PAR_LOOKUP earlier make d_set_d_op() static simple_lookup(): just set DCACHE_DONTCACHE tracefs: Add d_delete to remove negative dentries set_default_d_op(): calculate the matching value for ->d_flags correct the set of flags forbidden at d_set_d_op() time split d_flags calculation out of d_set_d_op() new helper: set_default_d_op() fuse: no need for special dentry_operations for root dentry switch procfs from d_set_d_op() to d_splice_alias_ops() new helper: d_splice_alias_ops() procfs: kill ->proc_dops ...
2025-07-16fs: change write_begin/write_end interface to take struct kiocb *Taotao Chen
Change the address_space_operations callbacks write_begin() and write_end() to take struct kiocb * as the first argument instead of struct file *. Update all affected function prototypes, implementations, call sites, and related documentation across VFS, filesystems, and block layer. Part of a series refactoring address_space_operations write_begin and write_end callbacks to use struct kiocb for passing write context and flags. Signed-off-by: Taotao Chen <chentaotao@didiglobal.com> Link: https://lore.kernel.org/20250716093559.217344-4-chentaotao@didiglobal.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-08fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressablePankaj Raghav
Since [1], it is possible for filesystems to have blocksize > PAGE_SIZE of the system. Remove the assumption and make the check generic for all blocksizes in generic_check_addressable(). [1] https://lore.kernel.org/linux-xfs/20240822135018.1931258-1-kernel@pankajraghav.com/ Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/20250630104018.213985-1-p.raghav@samsung.com Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-02new helper: simple_start_creating()Al Viro
Set the things up for kernel-initiated creation of object in a tree-in-dcache filesystem. With respect to locking it's an equivalent of filename_create() - we either get a negative dentry with locked parent, or ERR_PTR() and no locks taken. tracefs and debugfs had that open-coded as part of their object creation machinery; switched to calling new helper. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-07-02add locked_recursive_removal()Al Viro
simple_recursive_removal() assumes that parent is not locked and locks it when it finally gets to removing the victim itself. Usually that's what we want, but there are places where the parent is *already* locked and we need it to stay that way. In those cases simple_recursive_removal() would, of course, deadlock, so we have to play racy games with unlocking/relocking the parent around the call or open-code the entire thing. A better solution is to provide a variant that expects to be called with the parent already locked by the caller. Parent should be locked with I_MUTEX_PARENT, to avoid false positives from lockdep. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-07-02better lockdep annotations for simple_recursive_removal()Al Viro
We want a class that nests outside of I_MUTEX_NORMAL (for the sake of callbacks that might want to lock the victim) and inside I_MUTEX_PARENT (so that a variant of that could be used with parent of the victim held locked by the caller). In reality, simple_recursive_removal() * never holds two locks at once * holds the lock on parent of dentry passed to callback * is used only on the trees with fixed topology, so the depths are not changing. So the locking order is actually fine. AFAICS, the best solution is to assign I_MUTEX_CHILD to the locks grabbed by that thing. Reported-by: syzbot+169de184e9defe7fe709@syzkaller.appspotmail.com Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-07-02anon_inode: rework assertionsChristian Brauner
Making anonymous inodes regular files comes with a lot of risk and regression potential as evidenced by a recent hickup in io_uring. We're better of continuing to not have them be regular files. Since we have S_ANON_INODE we can port all of our assertions easily. Link: https://lore.kernel.org/20250702-work-fixes-v1-1-ff76ea589e33@kernel.org Fixes: cfd86ef7e8e7 ("anon_inode: use a proper mode internally") Acked-by: Jens Axboe <axboe@kernel.dk> Cc: stable@kernel.org Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23docs/vfs: update references to i_mutex to i_rwsemJunxuan Liao
VFS has switched to i_rwsem for ten years now (9902af79c01a: parallel lookups actual switch to rwsem), but the VFS documentation and comments still has references to i_mutex. Signed-off-by: Junxuan Liao <ljx@cs.wisc.edu> Link: https://lore.kernel.org/72223729-5471-474a-af3c-f366691fba82@cs.wisc.edu Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-19libfs: prepare to allow for non-immutable pidfd inodesChristian Brauner
Allow for S_IMMUTABLE to be stripped so that we can support xattrs. Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-10-98f3456fd552@kernel.org Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-19libfs: massage path_from_stashed()Christian Brauner
Make it a littler easier to follow. Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-3-98f3456fd552@kernel.org Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-19libfs: massage path_from_stashed() to allow custom stashing behaviorChristian Brauner
* Add a callback to struct stashed_operations so it's possible to implement custom behavior for pidfs and allow for it to return errors. * Teach stashed_dentry_get() to handle error pointers. Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-2-98f3456fd552@kernel.org Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-16simple_recursive_removal(): saner interaction with fsnotifyAl Viro
Make it match the real unlink(2)/rmdir(2) - notify *after* the operation. And use fsnotify_delete() instead of messing with fsnotify_unlink()/fsnotify_rmdir(). Currently the only caller that cares is the one in debugfs, and there the order matching the normal syscalls makes more sense; it'll get more serious for users introduced later in the series. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-11kill simple_dentry_operationsAl Viro
No users left and anything that wants it would be better off just setting DCACHE_DONTCACHE in their ->s_d_flags. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-11simple_lookup(): just set DCACHE_DONTCACHEAl Viro
No need to mess with ->d_op at all. Note that ->d_delete that always returns 1 is equivalent to having DCACHE_DONTCACHE in ->d_flags. Later the same thing will be placed into ->s_d_flags of the filesystems where we want that behaviour for all dentries; then the check in simple_lookup() will at least get unlikely() slapped on it. NOTE: there are only two filesystems where * simple_lookup() might be called * default ->d_op is non-NULL * its ->d_delete() doesn't always return 1 If not for those, we could have simple_lookup() just set DCACHE_DONTCACHE without even looking at ->d_op. Filesystems in question are btrfs and tracefs; both have ->d_delete() returning 1 on anything fed to simple_lookup(), so both would be fine with simple_lookup() setting DCACHE_DONTCACHE regardless of ->d_op. IOW, we might want to drop the check for ->d_op in simple_lookup(); it's definitely a separate story, though. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-10new helper: set_default_d_op()Al Viro
... to be used instead of manually assigning to ->s_d_op. All in-tree filesystem converted (and field itself is renamed, so any out-of-tree ones in need of conversion will be caught by compiler). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-26Merge tag 'vfs-6.16-rc1.super' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs freezing updates from Christian Brauner: "This contains various filesystem freezing related work for this cycle: - Allow the power subsystem to support filesystem freeze for suspend and hibernate. Now all the pieces are in place to actually allow the power subsystem to freeze/thaw filesystems during suspend/resume. Filesystems are only frozen and thawed if the power subsystem does actually own the freeze. If the filesystem is already frozen by the time we've frozen all userspace processes we don't care to freeze it again. That's userspace's job once the process resumes. We only actually freeze filesystems if we absolutely have to and we ignore other failures to freeze. We could bubble up errors and fail suspend/resume if the error isn't EBUSY (aka it's already frozen) but I don't think that this is worth it. Filesystem freezing during suspend/resume is best-effort. If the user has 500 ext4 filesystems mounted and 4 fail to freeze for whatever reason then we simply skip them. What we have now is already a big improvement and let's see how we fare with it before making our lives even harder (and uglier) than we have to. - Allow efivars to support freeze and thaw Allow efivarfs to partake to resync variable state during system hibernation and suspend. Add freeze/thaw support. This is a pretty straightforward implementation. We simply add regular freeze/thaw support for both userspace and the kernel. efivars is the first pseudofilesystem that adds support for filesystem freezing and thawing. The simplicity comes from the fact that we simply always resync variable state after efivarfs has been frozen. It doesn't matter whether that's because of suspend, userspace initiated freeze or hibernation. Efivars is simple enough that it doesn't matter that we walk all dentries. There are no directories and there aren't insane amounts of entries and both freeze/thaw are already heavy-handed operations. If userspace initiated a freeze/thaw cycle they would need CAP_SYS_ADMIN in the initial user namespace (as that's where efivarfs is mounted) so it can't be triggered by random userspace. IOW, we really really don't care" * tag 'vfs-6.16-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: f2fs: fix freezing filesystem during resize kernfs: add warning about implementing freeze/thaw efivarfs: support freeze/thaw power: freeze filesystems during suspend/resume libfs: export find_next_child() super: add filesystem freezing helpers for suspend and hibernate gfs2: pass through holder from the VFS for freeze/thaw super: use common iterator (Part 2) super: use a common iterator (Part 1) super: skip dying superblocks early super: simplify user_get_super() super: remove pointless s_root checks fs: allow all writers to be frozen locking/percpu-rwsem: add freezable alternative to down_read
2025-05-09libfs: export find_next_child()Christian Brauner
Export find_next_child() so it can be used by efivarfs. Keep it internal for now. There's no reason to advertise this kernel-wide. Link: https://lore.kernel.org/r/20250331-work-freeze-v1-1-6dfbe8253b9f@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-21fs: add S_ANON_INODEChristian Brauner
This makes it easy to detect proper anonymous inodes and to ensure that we can detect them in codepaths such as readahead(). Readahead on anonymous inodes didn't work because they didn't have a proper mode. Now that they have we need to retain EINVAL being returned otherwise LTP will fail. We also need to ensure that ioctls aren't simply fired like they are for regular files so things like inotify inodes continue to correctly call their own ioctl handlers as in [1]. Reported-by: Xilin Wu <sophon@radxa.com> Link: https://lore.kernel.org/3A9139D5CD543962+89831381-31b9-4392-87ec-a84a5b3507d8@radxa.com [1] Link: https://lore.kernel.org/7a1a7076-ff6b-4cb0-94e7-7218a0a44028@sirena.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07anon_inode: use a proper mode internallyChristian Brauner
This allows the VFS to not trip over anonymous inodes and we can add asserts based on the mode into the vfs. When we report it to userspace we can simply hide the mode to avoid regressions. I've audited all direct callers of alloc_anon_inode() and only secretmen overrides i_mode and i_op inode operations but it already uses a regular file. Link: https://lore.kernel.org/20250407-work-anon_inode-v1-1-53a44c20d44e@kernel.org Fixes: af153bb63a336 ("vfs: catch invalid modes in may_open()") Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@vger.kernel.org # all LTS kernels Reported-by: syzbot+5d8e79d323a13aa0b248@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67ed3fb3.050a0220.14623d.0009.GAE@google.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-24Merge tag 'vfs-6.15-rc1.pidfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs pidfs updates from Christian Brauner: - Allow retrieving exit information after a process has been reaped through pidfds via the new PIDFD_INTO_EXIT extension for the PIDFD_GET_INFO ioctl. Various tools need access to information about a process/task even after it has already been reaped. Pidfd polling allows waiting on either task exit or for a task to have been reaped. The contract for PIDFD_INFO_EXIT is simply that EPOLLHUP must be observed before exit information can be retrieved, i.e., exit information is only provided once the task has been reaped and then can be retrieved as long as the pidfd is open. - Add PIDFD_SELF_{THREAD,THREAD_GROUP} sentinels allowing userspace to forgo allocating a file descriptor for their own process. This is useful in scenarios where users want to act on their own process through pidfds and is akin to AT_FDCWD. - Improve premature thread-group leader and subthread exec behavior when polling on pidfds: (1) During a multi-threaded exec by a subthread, i.e., non-thread-group leader thread, all other threads in the thread-group including the thread-group leader are killed and the struct pid of the thread-group leader will be taken over by the subthread that called exec. IOW, two tasks change their TIDs. (2) A premature thread-group leader exit means that the thread-group leader exited before all of the other subthreads in the thread-group have exited. Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD. Any caller that holds a PIDFD_THREAD pidfd to the current thread-group leader may or may not see an exit notification on the file descriptor depending on when poll is performed. If the poll is performed before the exec of the subthread has concluded an exit notification is generated for the old thread-group leader. If the poll is performed after the exec of the subthread has concluded no exit notification is generated for the old thread-group leader. The correct behavior is to simply not generate an exit notification on the struct pid of a subhthread exec because the struct pid is taken over by the subthread and thus remains alive. But this is difficult to handle because a thread-group may exit premature as mentioned in (2). In that case an exit notification is reliably generated but the subthreads may continue to run for an indeterminate amount of time and thus also may exec at some point. After this pull no exit notifications will be generated for a PIDFD_THREAD pidfd for a thread-group leader until all subthreads have been reaped. If a subthread should exec before no exit notification will be generated until that task exits or it creates subthreads and repeates the cycle. This means an exit notification indicates the ability for the father to reap the child. * tag 'vfs-6.15-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits) selftests/pidfd: third test for multi-threaded exec polling selftests/pidfd: second test for multi-threaded exec polling selftests/pidfd: first test for multi-threaded exec polling pidfs: improve multi-threaded exec and premature thread-group leader exit polling pidfs: ensure that PIDFS_INFO_EXIT is available selftests/pidfd: add seventh PIDFD_INFO_EXIT selftest selftests/pidfd: add sixth PIDFD_INFO_EXIT selftest selftests/pidfd: add fifth PIDFD_INFO_EXIT selftest selftests/pidfd: add fourth PIDFD_INFO_EXIT selftest selftests/pidfd: add third PIDFD_INFO_EXIT selftest selftests/pidfd: add second PIDFD_INFO_EXIT selftest selftests/pidfd: add first PIDFD_INFO_EXIT selftest selftests/pidfd: expand common pidfd header pidfs/selftests: ensure correct headers for ioctl handling selftests/pidfd: fix header inclusion pidfs: allow to retrieve exit information pidfs: record exit code and cgroupid at exit pidfs: use private inode slab cache pidfs: move setting flags into pidfs_alloc_file() pidfd: rely on automatic cleanup in __pidfd_prepare() ...
2025-03-20libfs: Fix duplicate directory entry in offset_dir_lookupYongjian Sun
There is an issue in the kernel: In tmpfs, when using the "ls" command to list the contents of a directory with a large number of files, glibc performs the getdents call in multiple rounds. If a concurrent unlink occurs between these getdents calls, it may lead to duplicate directory entries in the ls output. One possible reproduction scenario is as follows: Create 1026 files and execute ls and rm concurrently: for i in {1..1026}; do echo "This is file $i" > /tmp/dir/file$i done ls /tmp/dir rm /tmp/dir/file4 ->getdents(file1026-file5) ->unlink(file4) ->getdents(file5,file3,file2,file1) It is expected that the second getdents call to return file3 through file1, but instead it returns an extra file5. The root cause of this problem is in the offset_dir_lookup function. It uses mas_find to determine the starting position for the current getdents call. Since mas_find locates the first position that is greater than or equal to mas->index, when file4 is deleted, it ends up returning file5. It can be fixed by replacing mas_find with mas_find_rev, which finds the first position that is less than or equal to mas->index. Fixes: b9b588f22a0c ("libfs: Use d_children list to iterate simple_offset directories") Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com> Link: https://lore.kernel.org/r/20250320034417.555810-1-sunyongjian@huaweicloud.com Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05pidfs: record exit code and cgroupid at exitChristian Brauner
Record the exit code and cgroupid in release_task() and stash in struct pidfs_exit_info so it can be retrieved even after the task has been reaped. Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-5-c8c3d8361705@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-30Merge tag 'pull-revalidate' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs d_revalidate updates from Al Viro: "Provide stable parent and name to ->d_revalidate() instances Most of the filesystem methods where we care about dentry name and parent have their stability guaranteed by the callers; ->d_revalidate() is the major exception. It's easy enough for callers to supply stable values for expected name and expected parent of the dentry being validated. That kills quite a bit of boilerplate in ->d_revalidate() instances, along with a bunch of races where they used to access ->d_name without sufficient precautions" * tag 'pull-revalidate' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: 9p: fix ->rename_sem exclusion orangefs_d_revalidate(): use stable parent inode and name passed by caller ocfs2_dentry_revalidate(): use stable parent inode and name passed by caller nfs: fix ->d_revalidate() UAF on ->d_name accesses nfs{,4}_lookup_validate(): use stable parent inode passed by caller gfs2_drevalidate(): use stable parent inode and name passed by caller fuse_dentry_revalidate(): use stable parent inode and name passed by caller vfat_revalidate{,_ci}(): use stable parent inode passed by caller exfat_d_revalidate(): use stable parent inode passed by caller fscrypt_d_revalidate(): use stable parent inode passed by caller ceph_d_revalidate(): propagate stable name down into request encoding ceph_d_revalidate(): use stable parent inode passed by caller afs_d_revalidate(): use stable name and parent inode passed by caller Pass parent directory inode and expected name to ->d_revalidate() generic_ci_d_compare(): use shortname_storage ext4 fast_commit: make use of name_snapshot primitives dissolve external_name.u into separate members make take_dentry_name_snapshot() lockless dcache: back inline names with a struct-wrapped array of unsigned long make sure that DNAME_INLINE_LEN is a multiple of word size
2025-01-27generic_ci_d_compare(): use shortname_storageAl Viro
... and check the "name might be unstable" predicate the right way. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-20Merge tag 'vfs-6.14-rc1.libfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs libfs updates from Christian Brauner: "This improves the stable directory offset behavior in various ways. Stable offsets are needed so that NFS can reliably read directories on filesystems such as tmpfs: - Improve the end-of-directory detection According to getdents(3), the d_off field in each returned directory entry points to the next entry in the directory. The d_off field in the last returned entry in the readdir buffer must contain a valid offset value, but if it points to an actual directory entry, then readdir/getdents can loop. Introduce a specific fixed offset value that is placed in the d_off field of the last entry in a directory. Some user space applications assume that the EOD offset value is larger than the offsets of real directory entries, so the largest valid offset value is reserved for this purpose. This new value is never allocated by simple_offset_add(). When ->iterate_dir() returns, getdents{64} inserts the ctx->pos value into the d_off field of the last valid entry in the readdir buffer. When it hits EOD, offset_readdir() sets ctx->pos to the EOD offset value so the last entry is updated to point to the EOD marker. When trying to read the entry at the EOD offset, offset_readdir() terminates immediately. - Rely on d_children to iterate stable offset directories Instead of using the mtree to emit entries in the order of their offset values, use it only to map incoming ctx->pos to a starting entry. Then use the directory's d_children list, which is already maintained properly by the dcache, to find the next child to emit. - Narrow the range of directory offset values returned by simple_offset_add() to 3 .. (S32_MAX - 1) on all platforms. This means the allocation behavior is identical on 32-bit systems, 64-bit systems, and 32-bit user space on 64-bit kernels. The new range still permits over 2 billion concurrent entries per directory. - Return ENOSPC when the directory offset range is exhausted. Hitting this error is almost impossible though. - Remove the simple_offset_empty() helper" * tag 'vfs-6.14-rc1.libfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: libfs: Use d_children list to iterate simple_offset directories libfs: Replace simple_offset end-of-directory detection Revert "libfs: fix infinite directory reads for offset dir" Revert "libfs: Add simple_offset_empty()" libfs: Return ENOSPC when the directory offset range is exhausted
2025-01-04libfs: Use d_children list to iterate simple_offset directoriesChuck Lever
The mtree mechanism has been effective at creating directory offsets that are stable over multiple opendir instances. However, it has not been able to handle the subtleties of renames that are concurrent with readdir. Instead of using the mtree to emit entries in the order of their offset values, use it only to map incoming ctx->pos to a starting entry. Then use the directory's d_children list, which is already maintained properly by the dcache, to find the next child to emit. One of the sneaky things about this is that when the mtree-allocated offset value wraps (which is very rare), looking up ctx->pos++ is not going to find the next entry; it will return NULL. Instead, by following the d_children list, the offset values can appear in any order but all of the entries in the directory will be visited eventually. Note also that the readdir() is guaranteed to reach the tail of this list. Entries are added only at the head of d_children, and readdir walks from its current position in that list towards its tail. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20241228175522.1854234-6-cel@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-04libfs: Replace simple_offset end-of-directory detectionChuck Lever
According to getdents(3), the d_off field in each returned directory entry points to the next entry in the directory. The d_off field in the last returned entry in the readdir buffer must contain a valid offset value, but if it points to an actual directory entry, then readdir/getdents can loop. This patch introduces a specific fixed offset value that is placed in the d_off field of the last entry in a directory. Some user space applications assume that the EOD offset value is larger than the offsets of real directory entries, so the largest valid offset value is reserved for this purpose. This new value is never allocated by simple_offset_add(). When ->iterate_dir() returns, getdents{64} inserts the ctx->pos value into the d_off field of the last valid entry in the readdir buffer. When it hits EOD, offset_readdir() sets ctx->pos to the EOD offset value so the last entry is updated to point to the EOD marker. When trying to read the entry at the EOD offset, offset_readdir() terminates immediately. It is worth noting that using a Maple tree for directory offset value allocation does not guarantee a 63-bit range of values -- on platforms where "long" is a 32-bit type, the directory offset value range is still 0..(2^31 - 1). For broad compatibility with 32-bit user space, the largest tmpfs directory cookie value is now S32_MAX. Fixes: 796432efab1e ("libfs: getdents() should return 0 after reaching EOD") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20241228175522.1854234-5-cel@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-04Revert "libfs: fix infinite directory reads for offset dir"Chuck Lever
The current directory offset allocator (based on mtree_alloc_cyclic) stores the next offset value to return in octx->next_offset. This mechanism typically returns values that increase monotonically over time. Eventually, though, the newly allocated offset value wraps back to a low number (say, 2) which is smaller than other already- allocated offset values. Yu Kuai <yukuai3@huawei.com> reports that, after commit 64a7ce76fb90 ("libfs: fix infinite directory reads for offset dir"), if a directory's offset allocator wraps, existing entries are no longer visible via readdir/getdents because offset_readdir() stops listing entries once an entry's offset is larger than octx->next_offset. These entries vanish persistently -- they can be looked up, but will never again appear in readdir(3) output. The reason for this is that the commit treats directory offsets as monotonically increasing integer values rather than opaque cookies, and introduces this comparison: if (dentry2offset(dentry) >= last_index) { On 64-bit platforms, the directory offset value upper bound is 2^63 - 1. Directory offsets will monotonically increase for millions of years without wrapping. On 32-bit platforms, however, LONG_MAX is 2^31 - 1. The allocator can wrap after only a few weeks (at worst). Revert commit 64a7ce76fb90 ("libfs: fix infinite directory reads for offset dir") to prepare for a fix that can work properly on 32-bit systems and might apply to recent LTS kernels where shmem employs the simple_offset mechanism. Reported-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20241228175522.1854234-4-cel@kernel.org Reviewed-by: Yang Erkun <yangerkun@huawei.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-04Revert "libfs: Add simple_offset_empty()"Chuck Lever
simple_empty() and simple_offset_empty() perform the same task. The latter's use as a canary to find bugs has not found any new issues. A subsequent patch will remove the use of the mtree for iterating directory contents, so revert back to using a similar mechanism for determining whether a directory is indeed empty. Only one such mechanism is ever needed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20241228175522.1854234-3-cel@kernel.org Reviewed-by: Yang Erkun <yangerkun@huawei.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-04libfs: Return ENOSPC when the directory offset range is exhaustedChuck Lever
Testing shows that the EBUSY error return from mtree_alloc_cyclic() leaks into user space. The ERRORS section of "man creat(2)" says: > EBUSY O_EXCL was specified in flags and pathname refers > to a block device that is in use by the system > (e.g., it is mounted). ENOSPC is closer to what applications expect in this situation. Note that the normal range of simple directory offset values is 2..2^63, so hitting this error is going to be rare to impossible. Fixes: 6faddda69f62 ("libfs: Add directory operations for stable offsets") Cc: stable@vger.kernel.org # v6.9+ Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Yang Erkun <yangerkun@huawei.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20241228175522.1854234-2-cel@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-14pseudofs: add support for export_opsErin Shepherd
Pseudo-filesystems might reasonably wish to implement the export ops (particularly for name_to_handle_at/open_by_handle_at); plumb this through pseudo_fs_context Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu> Link: https://lore.kernel.org/r/20241113-pidfs_fh-v2-1-9a4d28155a37@e43.eu Link: https://lore.kernel.org/r/20241129-work-pidfs-file_handle-v1-1-87d803a42495@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-18Merge tag 'pull-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull statx updates from Al Viro: "Sanitize struct filename and lookup flags handling in statx and friends" * tag 'pull-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: libfs: kill empty_dir_getattr() fs: Simplify getattr interface function checking AT_GETATTR_NOSEC flag fs/stat.c: switch to CLASS(fd_raw) kill getname_statx_lookup_flags() io_statx_prep(): use getname_uflags()
2024-11-13libfs: kill empty_dir_getattr()Al Viro
It's used only to initialize ->getattr in one inode_operations instance (empty_dir_inode_operations) and its behaviour had always been equivalent to what we get with NULL ->getattr. Just remove that initializer, along with empty_dir_getattr() itself. While we are at it, the same instance has ->permission initialized to generic_permission, which is what NULL ->permission ends up doing. Again, no point keeping it. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-28tmpfs: Add casefold lookup supportAndré Almeida
Enable casefold lookup in tmpfs, based on the encoding defined by userspace. That means that instead of comparing byte per byte a file name, it compares to a case-insensitive equivalent of the Unicode string. * Dcache handling There's a special need when dealing with case-insensitive dentries. First of all, we currently invalidated every negative casefold dentries. That happens because currently VFS code has no proper support to deal with that, giving that it could incorrectly reuse a previous filename for a new file that has a casefold match. For instance, this could happen: $ mkdir DIR $ rm -r DIR $ mkdir dir $ ls DIR/ And would be perceived as inconsistency from userspace point of view, because even that we match files in a case-insensitive manner, we still honor whatever is the initial filename. Along with that, tmpfs stores only the first equivalent name dentry used in the dcache, preventing duplications of dentries in the dcache. The d_compare() version for casefold files uses a normalized string, so the filename under lookup will be compared to another normalized string for the existing file, achieving a casefolded lookup. * Enabling casefold via mount options Most filesystems have their data stored in disk, so casefold option need to be enabled when building a filesystem on a device (via mkfs). However, as tmpfs is a RAM backed filesystem, there's no disk information and thus no mkfs to store information about casefold. For tmpfs, create casefold options for mounting. Userspace can then enable casefold support for a mount point using: $ mount -t tmpfs -o casefold=utf8-12.1.0 fs_name mount_dir/ Userspace must set what Unicode standard is aiming to. The available options depends on what the kernel Unicode subsystem supports. And for strict encoding: $ mount -t tmpfs -o casefold=utf8-12.1.0,strict_encoding fs_name mount_dir/ Strict encoding means that tmpfs will refuse to create invalid UTF-8 sequences. When this option is not enabled, any invalid sequence will be treated as an opaque byte sequence, ignoring the encoding thus not being able to be looked up in a case-insensitive way. * Check for casefold dirs on simple_lookup() On simple_lookup(), do not create dentries for casefold directories. Currently, VFS does not support case-insensitive negative dentries and can create inconsistencies in the filesystem. Prevent such dentries to being created in the first place. Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20241021-tonyk-tmpfs-v8-6-f443d5814194@igalia.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-28libfs: Export generic_ci_ dentry functionsAndré Almeida
Export generic_ci_ dentry functions so they can be used by case-insensitive filesystems that need something more custom than the default one set by `struct generic_ci_dentry_ops`. Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be> Signed-off-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20241021-tonyk-tmpfs-v8-5-f443d5814194@igalia.com Signed-off-by: Christian Brauner <brauner@kernel.org>