summaryrefslogtreecommitdiff
path: root/Documentation/filesystems
AgeCommit message (Collapse)Author
2025-09-30Merge tag 'x86_cache_for_v6.18_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 resource control updates from Borislav Petkov: "Add support on AMD for assigning QoS bandwidth counters to resources (RMIDs) with the ability for those resources to be tracked by the counters as long as they're assigned to them. Previously, due to hw limitations, bandwidth counts from untracked resources would get lost when those resources are not tracked. Refactor the code and user interfaces to be able to also support other, similar features on ARM, for example" * tag 'x86_cache_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits) fs/resctrl: Fix counter auto-assignment on mkdir with mbm_event enabled MAINTAINERS: resctrl: Add myself as reviewer x86/resctrl: Configure mbm_event mode if supported fs/resctrl: Introduce the interface to switch between monitor modes fs/resctrl: Disable BMEC event configuration when mbm_event mode is enabled fs/resctrl: Introduce the interface to modify assignments in a group fs/resctrl: Introduce mbm_L3_assignments to list assignments in a group fs/resctrl: Auto assign counters on mkdir and clean up on group removal fs/resctrl: Introduce mbm_assign_on_mkdir to enable assignments on mkdir fs/resctrl: Provide interface to update the event configurations fs/resctrl: Add event configuration directory under info/L3_MON/ fs/resctrl: Support counter read/reset with mbm_event assignment mode x86/resctrl: Implement resctrl_arch_reset_cntr() and resctrl_arch_cntr_read() x86/resctrl: Refactor resctrl_arch_rmid_read() fs/resctrl: Introduce counter ID read, reset calls in mbm_event mode fs/resctrl: Pass struct rdtgroup instead of individual members fs/resctrl: Add the functionality to unassign MBM events fs/resctrl: Add the functionality to assign MBM events x86,fs/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC fs/resctrl: Introduce event configuration field in struct mon_evt ...
2025-09-29Remove bcachefs core codeLinus Torvalds
bcachefs was marked 'externally maintained' in 6.17 but the code remained to make the transition smoother. It's now a DKMS module, making the in-kernel code stale, so remove it to avoid any version confusion. Link: https://lore.kernel.org/linux-bcachefs/yokpt2d2g2lluyomtqrdvmkl3amv3kgnipmenobkpgx537kay7@xgcgjviv3n7x/T/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-09-29Merge tag 'vfs-6.18-rc1.async' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs async directory updates from Christian Brauner: "This contains further preparatory changes for the asynchronous directory locking scheme: - Add lookup_one_positive_killable() which allows overlayfs to perform lookup that won't block on a fatal signal - Unify the mount idmap handling in struct renamedata as a rename can only happen within a single mount - Introduce kern_path_parent() for audit which sets the path to the parent and returns a dentry for the target without holding any locks on return - Rename kern_path_locked() as it is only used to prepare for the removal of an object from the filesystem: kern_path_locked() => start_removing_path() kern_path_create() => start_creating_path() user_path_create() => start_creating_user_path() user_path_locked_at() => start_removing_user_path_at() done_path_create() => end_creating_path() NA => end_removing_path()" * tag 'vfs-6.18-rc1.async' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: debugfs: rename start_creating() to debugfs_start_creating() VFS: rename kern_path_locked() and related functions. VFS/audit: introduce kern_path_parent() for audit VFS: unify old_mnt_idmap and new_mnt_idmap in renamedata VFS: discard err2 in filename_create() VFS/ovl: add lookup_one_positive_killable()
2025-09-29Merge tag 'vfs-6.18-rc1.mount' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: "This contains some work around mount api handling: - Output the warning message for mnt_too_revealing() triggered during fsmount() to the fscontext log. This makes it possible for the mount tool to output appropriate warnings on the command line. For example, with the newest fsopen()-based mount(8) from util-linux, the error messages now look like: # mount -t proc proc /tmp mount: /tmp: fsmount() failed: VFS: Mount too revealing. dmesg(1) may have more information after failed mount system call. - Do not consume fscontext log entries when returning -EMSGSIZE Userspace generally expects APIs that return -EMSGSIZE to allow for them to adjust their buffer size and retry the operation. However, the fscontext log would previously clear the message even in the -EMSGSIZE case. Given that it is very cheap for us to check whether the buffer is too small before we remove the message from the ring buffer, let's just do that instead. - Drop an unused argument from do_remount()" * tag 'vfs-6.18-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: vfs: fs/namespace.c: remove ms_flags argument from do_remount selftests/filesystems: add basic fscontext log tests fscontext: do not consume log entries when returning -EMSGSIZE vfs: output mount_too_revealing() errors to fscontext docs/vfs: Remove mentions to the old mount API helpers fscontext: add custom-prefix log helpers fs: Remove mount_bdev fs: Remove mount_nodev
2025-09-23VFS: rename kern_path_locked() and related functions.NeilBrown
kern_path_locked() is now only used to prepare for removing an object from the filesystem (and that is the only credible reason for wanting a positive locked dentry). Thus it corresponds to kern_path_create() and so should have a corresponding name. Unfortunately the name "kern_path_create" is somewhat misleading as it doesn't actually create anything. The recently added simple_start_creating() provides a better pattern I believe. The "start" can be matched with "end" to bracket the creating or removing. So this patch changes names: kern_path_locked -> start_removing_path kern_path_create -> start_creating_path user_path_create -> start_creating_user_path user_path_locked_at -> start_removing_user_path_at done_path_create -> end_creating_path and also introduces end_removing_path() which is identical to end_creating_path(). __start_removing_path (which was __kern_path_locked) is enhanced to call mnt_want_write() for consistency with the start_creating_path(). Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-21alloc_tag: mark inaccurate allocation counters in /proc/allocinfo outputSuren Baghdasaryan
While rare, memory allocation profiling can contain inaccurate counters if slab object extension vector allocation fails. That allocation might succeed later but prior to that, slab allocations that would have used that object extension vector will not be accounted for. To indicate incorrect counters, "accurate:no" marker is appended to the call site line in the /proc/allocinfo output. Bump up /proc/allocinfo version to reflect the change in the file format and update documentation. Example output with invalid counters: allocinfo - version: 2.0 0 0 arch/x86/kernel/kdebugfs.c:105 func:create_setup_data_nodes 0 0 arch/x86/kernel/alternative.c:2090 func:alternatives_smp_module_add 0 0 arch/x86/kernel/alternative.c:127 func:__its_alloc accurate:no 0 0 arch/x86/kernel/fpu/regset.c:160 func:xstateregs_set 0 0 arch/x86/kernel/fpu/xstate.c:1590 func:fpstate_realloc 0 0 arch/x86/kernel/cpu/aperfmperf.c:379 func:arch_enable_hybrid_capacity_scale 0 0 arch/x86/kernel/cpu/amd_cache_disable.c:258 func:init_amd_l3_attrs 49152 48 arch/x86/kernel/cpu/mce/core.c:2709 func:mce_device_create accurate:no 32768 1 arch/x86/kernel/cpu/mce/genpool.c:132 func:mce_gen_pool_create 0 0 arch/x86/kernel/cpu/mce/amd.c:1341 func:mce_threshold_create_device [surenb@google.com: document new "accurate:no" marker] Fixes: 39d117e04d15 ("alloc_tag: mark inaccurate allocation counters in /proc/allocinfo output") [akpm@linux-foundation.org: simplification per Usama, reflow text] [akpm@linux-foundation.org: add newline to prevent docs warning, per Randy] Link: https://lkml.kernel.org/r/20250915230224.4115531-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Usama Arif <usamaarif642@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: David Wang <00107082@163.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-18docs: filesystems: sysfs: add remaining top level sysfs directory descriptionsAlex Tran
Finish top level sysfs directory descriptions for block, class, firmware, hypervisor, kernel, and power. Did not write one for net directory. See commit bc3a88431672 ("docs: filesystems: sysfs: remove top level sysfs net directory") Signed-off-by: Alex Tran <alex.t.tran@gmail.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Message-ID: <20250902023039.1351270-1-alex.t.tran@gmail.com>
2025-09-18docs: filesystems: sysfs: clarify symlink destinations in dev and ↵Alex Tran
bus/devices descriptions Change sysfs bus/devices and dev directory descriptions to provide more verbose information about the specific symlink destination the devices point to. Signed-off-by: Alex Tran <alex.t.tran@gmail.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Message-ID: <20250902023039.1351270-2-alex.t.tran@gmail.com>
2025-09-18docs: filesystems: sysfs: remove top level sysfs net directoryAlex Tran
The net/ directory is not present as a top level sysfs directory in standard Linux systems. Network interfaces can be accessible via /sys/class/net instead. Signed-off-by: Alex Tran <alex.t.tran@gmail.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Message-ID: <20250902023039.1351270-3-alex.t.tran@gmail.com>
2025-09-15fs: rename generic_delete_inode() and generic_drop_inode()Mateusz Guzik
generic_delete_inode() is rather misleading for what the routine is doing. inode_just_drop() should be much clearer. The new naming is inconsistent with generic_drop_inode(), so rename that one as well with inode_ as the suffix. No functional changes. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-15fs/resctrl: Introduce the interface to switch between monitor modesBabu Moger
Resctrl subsystem can support two monitoring modes, "mbm_event" or "default". In mbm_event mode, monitoring event can only accumulate data while it is backed by a hardware counter. In "default" mode, resctrl assumes there is a hardware counter for each event within every CTRL_MON and MON group. Introduce mbm_assign_mode resctrl file to switch between mbm_event and default modes. Example: To list the MBM monitor modes supported: $ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode [mbm_event] default To enable the "mbm_event" counter assignment mode: $ echo "mbm_event" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode To enable the "default" monitoring mode: $ echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode Reset MBM event counters automatically as part of changing the mode. Clear both architectural and non-architectural event states to prevent overflow conditions during the next event read. Clear assignable counter configuration on all the domains. Also, enable auto assignment when switching to "mbm_event" mode. Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15fs/resctrl: Introduce the interface to modify assignments in a groupBabu Moger
Enable the mbm_l3_assignments resctrl file to be used to modify counter assignments of CTRL_MON and MON groups when the "mbm_event" counter assignment mode is enabled. Process the assignment modifications in the following format: <Event>:<Domain id>=<Assignment state>;<Domain id>=<Assignment state> Event: A valid MBM event in the /sys/fs/resctrl/info/L3_MON/event_configs directory. Domain ID: A valid domain ID. When writing, '*' applies the changes to all domains. Assignment states: _ : Unassign a counter. e : Assign a counter exclusively. Examples: $ cd /sys/fs/resctrl $ cat /sys/fs/resctrl/mbm_L3_assignments mbm_total_bytes:0=e;1=e mbm_local_bytes:0=e;1=e To unassign the counter associated with the mbm_total_bytes event on domain 0: $ echo "mbm_total_bytes:0=_" > mbm_L3_assignments $ cat /sys/fs/resctrl/mbm_L3_assignments mbm_total_bytes:0=_;1=e mbm_local_bytes:0=e;1=e To unassign the counter associated with the mbm_total_bytes event on all the domains: $ echo "mbm_total_bytes:*=_" > mbm_L3_assignments $ cat /sys/fs/resctrl/mbm_L3_assignments mbm_total_bytes:0=_;1=_ mbm_local_bytes:0=e;1=e Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15fs/resctrl: Introduce mbm_L3_assignments to list assignments in a groupBabu Moger
Introduce the mbm_L3_assignments resctrl file associated with CTRL_MON and MON resource groups to display the counter assignment states of the resource group when "mbm_event" counter assignment mode is enabled. Display the list in the following format: <Event>:<Domain id>=<Assignment state>;<Domain id>=<Assignment state> Event: A valid MBM event listed in /sys/fs/resctrl/info/L3_MON/event_configs directory. Domain ID: A valid domain ID. The assignment state can be one of the following: _ : No counter assigned. e : Counter assigned exclusively. Example: To list the assignment states for the default group $ cd /sys/fs/resctrl $ cat /sys/fs/resctrl/mbm_L3_assignments mbm_total_bytes:0=e;1=e mbm_local_bytes:0=e;1=e Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15fs/resctrl: Introduce mbm_assign_on_mkdir to enable assignments on mkdirBabu Moger
The "mbm_event" counter assignment mode allows users to assign a hardware counter to an RMID, event pair and monitor the bandwidth as long as it is assigned. Introduce a user-configurable option that determines if a counter will automatically be assigned to an RMID, event pair when its associated monitor group is created via mkdir. Accessible when "mbm_event" counter assignment mode is enabled. Suggested-by: Peter Newman <peternewman@google.com> Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15fs/resctrl: Provide interface to update the event configurationsBabu Moger
When "mbm_event" counter assignment mode is enabled, users can modify the event configuration by writing to the 'event_filter' resctrl file. The event configurations for mbm_event mode are located in /sys/fs/resctrl/info/L3_MON/event_configs/. Update the assignments of all CTRL_MON and MON resource groups when the event configuration is modified. Example: $ mount -t resctrl resctrl /sys/fs/resctrl $ cd /sys/fs/resctrl/ $ cat info/L3_MON/event_configs/mbm_local_bytes/event_filter local_reads,local_non_temporal_writes,local_reads_slow_memory $ echo "local_reads,local_non_temporal_writes" > info/L3_MON/event_configs/mbm_total_bytes/event_filter $ cat info/L3_MON/event_configs/mbm_total_bytes/event_filter local_reads,local_non_temporal_writes Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15fs/resctrl: Add event configuration directory under info/L3_MON/Babu Moger
The "mbm_event" counter assignment mode allows the user to assign a hardware counter to an RMID, event pair and monitor the bandwidth as long as it is assigned. The user can specify the memory transaction(s) for the counter to track. When this mode is supported, the /sys/fs/resctrl/info/L3_MON/event_configs directory contains a sub-directory for each MBM event that can be assigned to a counter. The MBM event sub-directory contains a file named "event_filter" that is used to view and modify which memory transactions the MBM event is configured with. Create /sys/fs/resctrl/info/L3_MON/event_configs directory on resctrl mount and pre-populate it with directories for the two existing MBM events: mbm_total_bytes and mbm_local_bytes. Create the "event_filter" file within each MBM event directory with the needed *show() that displays the memory transactions with which the MBM event is configured. Example: $ mount -t resctrl resctrl /sys/fs/resctrl $ cd /sys/fs/resctrl/ $ cat info/L3_MON/event_configs/mbm_total_bytes/event_filter local_reads,remote_reads,local_non_temporal_writes, remote_non_temporal_writes,local_reads_slow_memory, remote_reads_slow_memory,dirty_victim_writes_all $ cat info/L3_MON/event_configs/mbm_local_bytes/event_filter local_reads,local_non_temporal_writes,local_reads_slow_memory Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15fs/resctrl: Support counter read/reset with mbm_event assignment modeBabu Moger
When "mbm_event" counter assignment mode is enabled, the architecture requires a counter ID to read the event data. Introduce an is_mbm_cntr field in struct rmid_read to indicate whether counter assignment mode is in use. Update the logic to call resctrl_arch_cntr_read() and resctrl_arch_reset_cntr() when the assignment mode is active. Report 'Unassigned' in case the user attempts to read an event without assigning a hardware counter. Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15fs/resctrl: Introduce interface to display number of free MBM countersBabu Moger
Introduce the "available_mbm_cntrs" resctrl file to display the number of counters available for assignment in each domain when "mbm_event" mode is enabled. Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15fs/resctrl: Add resctrl file to display number of assignable countersBabu Moger
The "mbm_event" counter assignment mode allows users to assign a hardware counter to an RMID, event pair and monitor bandwidth usage as long as it is assigned. The hardware continues to track the assigned counter until it is explicitly unassigned by the user. Create 'num_mbm_cntrs' resctrl file that displays the number of counters supported in each domain. 'num_mbm_cntrs' is only visible to user space when the system supports "mbm_event" mode. Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15fs/resctrl: Introduce the interface to display monitoring modesBabu Moger
Introduce the resctrl file "mbm_assign_mode" to list the supported counter assignment modes. The "mbm_event" counter assignment mode allows users to assign a hardware counter to an RMID, event pair and monitor bandwidth usage as long as it is assigned. The hardware continues to track the assigned counter until it is explicitly unassigned by the user. Each event within a resctrl group can be assigned independently in this mode. On AMD systems "mbm_event" mode is backed by the ABMC (Assignable Bandwidth Monitoring Counters) hardware feature and is enabled by default. The "default" mode is the existing mode that works without the explicit counter assignment, instead relying on dynamic counter assignment by hardware that may result in hardware not dedicating a counter resulting in monitoring data reads returning "Unavailable". Provide an interface to display the monitor modes on the system. $ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode [mbm_event] default Add IS_ENABLED(CONFIG_RESCTRL_ASSIGN_FIXED) check to support Arm64. On x86, CONFIG_RESCTRL_ASSIGN_FIXED is not defined. On Arm64, it will be defined when the "mbm_event" mode is supported. Add IS_ENABLED(CONFIG_RESCTRL_ASSIGN_FIXED) check early to ensure the user interface remains compatible with upcoming Arm64 support. IS_ENABLED() safely evaluates to 0 when the configuration is not defined. As a result, for MPAM, the display would be either: [default] or [mbm_event] Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15x86/resctrl: Add ABMC feature in the command line optionsBabu Moger
Add a kernel command-line parameter to enable or disable the exposure of the ABMC (Assignable Bandwidth Monitoring Counters) hardware feature to resctrl. Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-13prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGEDavid Hildenbrand
Patch series "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised", v5. This will allow individual processes to opt-out of THP = "always" into THP = "madvise", without affecting other workloads on the system. This has been extensively discussed on the mailing list and has been summarized very well by David in the first patch which also includes the links to alternatives, please refer to the first patch commit message for the motivation for this series. Patch 1 adds the PR_THP_DISABLE_EXCEPT_ADVISED flag to implement this, along with the MMF changes. Patch 2 is a cleanup patch for tva_flags that will allow the forced collapse case to be transmitted to vma_thp_disabled (which is done in patch 3). Patch 4 adds documentation for PR_SET_THP_DISABLE/PR_GET_THP_DISABLE. Patches 6-7 implement the selftests for PR_SET_THP_DISABLE for completely disabling THPs (old behaviour) and only enabling it at advise (PR_THP_DISABLE_EXCEPT_ADVISED). This patch (of 7): People want to make use of more THPs, for example, moving from the "never" system policy to "madvise", or from "madvise" to "always". While this is great news for every THP desperately waiting to get allocated out there, apparently there are some workloads that require a bit of care during that transition: individual processes may need to opt-out from this behavior for various reasons, and this should be permitted without needing to make all other workloads on the system similarly opt-out. The following scenarios are imaginable: (1) Switch from "none" system policy to "madvise"/"always", but keep THPs disabled for selected workloads. (2) Stay at "none" system policy, but enable THPs for selected workloads, making only these workloads use the "madvise" or "always" policy. (3) Switch from "madvise" system policy to "always", but keep the "madvise" policy for selected workloads: allocate THPs only when advised. (4) Stay at "madvise" system policy, but enable THPs even when not advised for selected workloads -- "always" policy. Once can emulate (2) through (1), by setting the system policy to "madvise"/"always" while disabling THPs for all processes that don't want THPs. It requires configuring all workloads, but that is a user-space problem to sort out. (4) can be emulated through (3) in a similar way. Back when (1) was relevant in the past, as people started enabling THPs, we added PR_SET_THP_DISABLE, so relevant workloads that were not ready yet (i.e., used by Redis) were able to just disable THPs completely. Redis still implements the option to use this interface to disable THPs completely. With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a workload -- a process, including fork+exec'ed process hierarchy. That essentially made us support (1): simply disable THPs for all workloads that are not ready for THPs yet, while still enabling THPs system-wide. The quest for handling (3) and (4) started, but current approaches (completely new prctl, options to set other policies per process, alternatives to prctl -- mctrl, cgroup handling) don't look particularly promising. Likely, the future will use bpf or something similar to implement better policies, in particular to also make better decisions about THP sizes to use, but this will certainly take a while as that work just started. Long story short: a simple enable/disable is not really suitable for the future, so we're not willing to add completely new toggles. While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs completely for these processes, this is a step backwards, because these processes can no longer allocate THPs in regions where THPs were explicitly advised: regions flagged as VM_HUGEPAGE. Apparently, that imposes a problem for relevant workloads, because "not THPs" is certainly worse than "THPs only when advised". Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless not explicitly advised by the app through MAD_HUGEPAGE"? *maybe*, but this would change the documented semantics quite a bit, and the versatility to use it for debugging purposes, so I am not 100% sure that is what we want -- although it would certainly be much easier. So instead, as an easy way forward for (3) and (4), add an option to make PR_SET_THP_DISABLE disable *less* THPs for a process. In essence, this patch: (A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3 of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0). prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED). (B) Makes prctl(PR_GET_THP_DISABLE) return 3 if PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling. Previously, it would return 1 if THPs were disabled completely. Now it returns the set flags as well: 3 if PR_THP_DISABLE_EXCEPT_ADVISED was set. (C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express the semantics clearly. Fortunately, there are only two instances outside of prctl() code. (D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs with VM_HUGEPAGE" -- essentially "thp=madvise" behavior Fortunately, we only have to extend vma_thp_disabled(). (E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are disabled completely Only indicating that THPs are disabled when they are really disabled completely, not only partially. For now, we don't add another interface to obtained whether THPs are disabled partially (PR_THP_DISABLE_EXCEPT_ADVISED was set). If ever required, we could add a new entry. The documented semantics in the man page for PR_SET_THP_DISABLE "is inherited by a child created via fork(2) and is preserved across execve(2)" is maintained. This behavior, for example, allows for disabling THPs for a workload through the launching process (e.g., systemd where we fork() a helper process to then exec()). For now, MADV_COLLAPSE will *fail* in regions without VM_HUGEPAGE and VM_NOHUGEPAGE. As MADV_COLLAPSE is a clear advise that user space thinks a THP is a good idea, we'll enable that separately next (requiring a bit of cleanup first). There is currently not way to prevent that a process will not issue PR_SET_THP_DISABLE itself to re-enable THP. There are not really known users for re-enabling it, and it's against the purpose of the original interface. So if ever required, we could investigate just forbidding to re-enable them, or make this somehow configurable. Link: https://lkml.kernel.org/r/20250815135549.130506-1-usamaarif642@gmail.com Link: https://lkml.kernel.org/r/20250815135549.130506-2-usamaarif642@gmail.com Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Tested-by: Usama Arif <usamaarif642@gmail.com> Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Usama Arif <usamaarif642@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yafang <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-04change the calling conventions for vfs_parse_fs_string()Al Viro
Absolute majority of callers are passing the 4th argument equal to strlen() of the 3rd one. Drop the v_size argument, add vfs_parse_fs_qstr() for the cases that want independent length. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-03doc: filesystems: proc: remove stale information from introBaruch Siach
Most of the information in the first paragraph of the Introduction/Credits section is outdated. Documentation update suggestions should go to documentation maintainers listed in MAINTAINERS. Remove misleading contact information. Signed-off-by: Baruch Siach <baruch@tkos.co.il> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/cb4987a16ed96ee86841aec921d914bd44249d0b.1756294647.git.baruch@tkos.co.il
2025-09-03Documentation: Fix spelling mistakesRanganath V N
Corrected a few spelling mistakes to improve the readability. Signed-off-by: Ranganath V N <vnranganath.20@gmail.com> Acked-by: Rob Herring (Arm) <robh@kernel.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20250902193822.6349-1-vnranganath.20@gmail.com
2025-09-02procfs: add "pidns" mount optionAleksa Sarai
Since the introduction of pid namespaces, their interaction with procfs has been entirely implicit in ways that require a lot of dancing around by programs that need to construct sandboxes with different PID namespaces. Being able to explicitly specify the pid namespace to use when constructing a procfs super block will allow programs to no longer need to fork off a process which does then does unshare(2) / setns(2) and forks again in order to construct a procfs in a pidns. So, provide a "pidns" mount option which allows such users to just explicitly state which pid namespace they want that procfs instance to use. This interface can be used with fsconfig(2) either with a file descriptor or a path: fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd); fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0); or with classic mount(2) / mount(8): // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid"); As this new API is effectively shorthand for setns(2) followed by mount(2), the permission model for this mirrors pidns_install() to avoid opening up new attack surfaces by loosening the existing permission model. In order to avoid having to RCU-protect all users of proc_pid_ns() (to avoid UAFs), attempting to reconfigure an existing procfs instance's pid namespace will error out with -EBUSY. Creating new procfs instances is quite cheap, so this should not be an impediment to most users, and lets us avoid a lot of churn in fs/proc/* for a feature that it seems unlikely userspace would use. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/20250805-procfs-pidns-api-v4-2-705f984940e7@cyphar.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-02fuse: fix references to fuse.rst -> fuse/fuse.rstMiklos Szeredi
Commit 6be0ddb20200 ("Documentation: fuse: Consolidate FUSE docs into its own subdirectory") moved fuse docs to a subdirectory but didn't update references inside the kernel tree. Fixes: 6be0ddb20200 ("Documentation: fuse: Consolidate FUSE docs into its own subdirectory") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202508261621.EaNMWVjm-lkp@intel.com/ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-29Documentation: sharedsubtree: Convert notes to note directiveBagas Sanjaya
While a few of the notes are already in reST syntax, others are left intact (inconsistent). Convert them to reST syntax too. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20250819061254.31220-6-bagasdotme@gmail.com
2025-08-29Documentation: sharedsubtree: Align textBagas Sanjaya
The docs make heavy use of lists. As it is currently written, these generate a lot of unnecessary hanging indents since these are not semantically meant to be definition lists by accident. Align text to trim these indents. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20250819061254.31220-5-bagasdotme@gmail.com
2025-08-29Documentation: sharedsubtree: Don't repeat lists with explanationBagas Sanjaya
Don't repeat lists only mentioning the items when a corresponding list with item's explanations suffices. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20250819061254.31220-4-bagasdotme@gmail.com
2025-08-29Documentation: sharedsubtree: Use proper enumerator sequence for enumerated ↵Bagas Sanjaya
lists Sphinx does not recognize mixed-letter sequences (e.g. 2a) as enumerator for enumerated lists. As such, lists that use such sequences end up as definition lists instead. Use proper enumeration sequences for this purpose. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20250819061254.31220-3-bagasdotme@gmail.com
2025-08-29Documentation: sharedsubtree: Format remaining of shell snippets as literal ↵Bagas Sanjaya
code blcoks Fix formatting inconsistency of shell snippets by wrapping the remaining of them in literal code blocks. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20250819061254.31220-2-bagasdotme@gmail.com
2025-08-29docs: fix spelling and grammar in atomic_writesMallikarjun Thammanavar
Fix minor spelling and grammatical issues in the ext4 atomic_writes documentation. Signed-off-by: Mallikarjun Thammanavar <mallikarjunst09@gmail.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20250819124604.8995-1-mallikarjunst09@gmail.com
2025-08-29Documentation/filesystems/xfs: Fix typo errorAlperen Aksu
Fixed typo error in referring to the section's headline Fixed to correct spelling of "mapping" Signed-off-by: Alperen Aksu <aksulperen@gmail.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20250821131404.25461-1-aksulperen@gmail.com
2025-08-29Documentation: ocfs2: Properly reindent filecheck operations listBagas Sanjaya
Some of texts in filecheck operations list are indented out of the list. In particular, the third operation is shown not as the third list item but rather as a separate paragraph. Reindent the list so that gets properly rendered as such. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20250826024756.16073-1-bagasdotme@gmail.com
2025-08-29docs: proc.rst: Fix VFIO Device title formattingAlex Williamson
Title underline is one character too short. Cc: Alex Mastro <amastro@fb.com> Cc: Jonathan Corbet <corbet@lwn.net> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/all/20250828123035.2f0c74e7@canb.auug.org.au Fixes: 1e736f148956 ("vfio/pci: print vfio-device syspath to fdinfo") Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Link: https://lore.kernel.org/r/20250828203629.283418-1-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-08-28Documentation: f2fs: Reword titleBagas Sanjaya
"What is F2FS" is rather a mistitle for the whole f2fs docs, as it implies the overview section (before "Background and design issues" section) and the docs covers beyond that: from mount options to filesystem implementation details. Retitle and add explicit overview section. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-08-28Documentation: f2fs: Indent compression_mode option listBagas Sanjaya
Indent description text so that compression_mode numbered list gets rendered as such in htmldocs output. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-08-28Documentation: f2fs: Wrap snippets in literal code blocksBagas Sanjaya
Compression mode code and device aliasing shell snippets are shown in htmldocs output as long-running paragraph instead. Wrap them. Fixes: 602a16d58e9a ("f2fs: add compress_mode mount option") Fixes: 128d333f0dff ("f2fs: introduce device aliasing file") Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-08-28Documentation: f2fs: Span write hint table section rowsBagas Sanjaya
Write hint policy table has two rows which act as section rows: buffered io and direct io, yet these rows are written as normal rows instead. Column-span them. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-08-28Documentation: f2fs: Format compression level subtableBagas Sanjaya
Format compression_algorithm subtable as reST table as it does the semantic job rather than normal paragraph. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-08-28Documentation: f2fs: Separate errors mode subtableBagas Sanjaya
errors=%s subtable is shown in htmldocs output as long-running paragraph instead due to missing separator from its previous paragraph. Add it. Fixes: b62e71be2110 ("f2fs: support errors=remount-ro|continue|panic mountoption") Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-08-27Documentation: fuse: Consolidate FUSE docs into its own subdirectoryBagas Sanjaya
All four FUSE docs are currently in upper-level Documentation/filesystems/ directory, but these are distinct as a group of its own. Move them into Documentation/filesystems/fuse/ subdirectory. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-27doc: fuse: Add max_background and congestion_thresholdChen Linxuan
As I preparing patches adding selftests for fusectl, I notice that documentation of max_background and congestion_threshold is missing. This patch add some descriptions about these two files. Cc: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Chen Linxuan <chenlinxuan@uniontech.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-25vfio/pci: print vfio-device syspath to fdinfoAlex Mastro
Print the PCI device syspath to a vfio device's fdinfo. This enables tools to query which device is associated with a given vfio device fd. This results in output like below: $ cat /proc/"$SOME_PID"/fdinfo/"$VFIO_FD" | grep vfio vfio-device-syspath: /sys/devices/pci0000:e0/0000:e0:01.1/0000:e1:00.0/0000:e2:05.0/0000:e8:00.0 Signed-off-by: Alex Mastro <amastro@fb.com> Reviewed-by: Amit Machhiwal <amachhiw@linux.ibm.com> Tested-by: Amit Machhiwal <amachhiw@linux.ibm.com> Link: https://lore.kernel.org/r/20250804-show-fdinfo-v4-1-96b14c5691b3@fb.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-08-21docs: fix trailing whitespace error and remove repeated words in ↵Raphael Pinsonneault-Thibeault
propagate_umount.txt in Documentation/filesystems/propagate_umount.txt: line 289: remove whitespace on blank line line 315: remove duplicate "that" line 364: remove duplicate "in" Signed-off-by: Raphael Pinsonneault-Thibeault <rpthibeault@gmail.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20250818181934.55491-2-rpthibeault@gmail.com
2025-08-20f2fs: add reserved nodes for privileged usersChunhai Guo
This patch allows privileged users to reserve nodes via the 'reserve_node' mount option, which is similar to the existing 'reserve_root' option. "-o reserve_node=<N>" means <N> nodes are reserved for privileged users only. Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-08-18Merge branch 'bjorn' into docs-mwJonathan Corbet
A big set of typo fixes from Bjorn Helgaas
2025-08-18Documentation: Fix filesystems typosBjorn Helgaas
Fix typos. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20250813200526.290420-6-helgaas@kernel.org
2025-08-13block: switch ->getgeo() to struct gendiskAl Viro
Instances are happier that way and it makes more sense anyway - the only part of the result that is related to partition we are given is the start sector, and that has been filled in by the caller. Everything else is a function of the disk. Only one instance (DASD) is ever looking at anything other than bdev->bd_disk and that one is trivial to adjust. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>