| Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix missing btrfs_path release after printing a relocation error
message
- fix extent changeset leak on mmap write after failure to reserve
metadata
- fix fs devices list structure freeing, it could be potentially leaked
under some circumstances
- tree log fixes:
- fix incremental directory logging where inodes for new dentries
were incorrectly skipped
- don't log conflicting inode if it's a directory moved in the
current transaction
- regression fixes:
- fix incorrect btrfs_path freeing when it's auto-cleaned
- revert commit simplifying preallocation of temporary structures
in qgroup functions, some cases were not handled properly
* tag 'for-6.19-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix changeset leak on mmap write after failure to reserve metadata
btrfs: fix memory leak of fs_devices in degraded seed device path
btrfs: fix a potential path leak in print_data_reloc_error()
Revert "btrfs: add ASSERTs on prealloc in qgroup functions"
btrfs: do not skip logging new dentries when logging a new name
btrfs: don't log conflicting inode if it's a dir moved in the current transaction
btrfs: tests: fix double btrfs_path free in remove_extent_ref()
|
|
If the call to btrfs_delalloc_reserve_metadata() fails we jump to the
'out_noreserve' label and there we never free the extent_changeset
allocated by the previous call to btrfs_check_data_free_space() (if
qgroups are enabled). Fix this by calling extent_changeset_free() under
the 'out_noreserve' label.
Fixes: 6599716de2d6 ("btrfs: fix -ENOSPC mmap write failure on NOCOW files/extents")
Reported-by: syzbot+2f8aa76e6acc9fce6638@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/693a635a.a70a0220.33cd7b.0029.GAE@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In open_seed_devices(), when find_fsid() fails and we're in DEGRADED
mode, a new fs_devices is allocated via alloc_fs_devices() but is never
added to the seed_list before returning. This contrasts with the normal
path where fs_devices is properly added via list_add().
If any error occurs later in read_one_dev() or btrfs_read_chunk_tree(),
the cleanup code iterates seed_list to free seed devices, but this
orphaned fs_devices is never found and never freed, causing a memory
leak. Any devices allocated via add_missing_dev() and attached to this
fs_devices are also leaked.
Fix this by adding the newly allocated fs_devices to seed_list in the
degraded path, consistent with the normal path.
Fixes: 5f37583569442 ("Btrfs: move the missing device to its own fs device list")
Reported-by: syzbot+eadd98df8bceb15d7fed@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=eadd98df8bceb15d7fed
Tested-by: syzbot+eadd98df8bceb15d7fed@syzkaller.appspotmail.com
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Inside print_data_reloc_error(), if extent_from_logical() failed we
return immediately.
However there are the following cases where extent_from_logical() can
return error but still holds a path:
- btrfs_search_slot() returned 0
- No backref item found in extent tree
- No flags_ret provided
This is not possible in this call site though.
So for the above two cases, we can return without releasing the path,
causing extent buffer leaks.
Fixes: b9a9a85059cd ("btrfs: output affected files when relocation fails")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This reverts commit 252877a8701530fde861a4f27710c1e718e97caa.
Commit 252877a87015 ("btrfs: add ASSERTs on prealloc in qgroup
functions") tries to remove the kfree() on preallocated qgroup during
several call sites, but this cannot work as intended:
- btrfs_quota_enable()
- btrfs_create_qgroup()
If add_qgroup_item() failed, we go out_free_path() and at that time
prealloc is not yet utilized and will trigger the new ASSERT().
- btrfs_qgroup_inherit()
If qgroup_auto_inherit() failed, prealloc is not yet utilized and
will trigger the new ASSERT()
Reported-by: syzbot+b44d4a4885bc82af2a06@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/69369331.a70a0220.38f243.009e.GAE@google.com/
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When we are logging a directory and the log context indicates that we
are logging a new name for some other file (that is or was inside that
directory), we skip logging the inodes for new dentries in the directory.
This is ok most of the time, but if after the rename or link operation
that triggered the logging of that directory, we have an explicit fsync
of that directory without the directory inode being evicted and reloaded,
we end up never logging the inodes for the new dentries that we found
during the new name logging, as the next directory fsync will only process
dentries that were added after the last time we logged the directory (we
are doing an incremental directory logging).
So make sure we always log new dentries for a directory even if we are
in a context of logging a new name.
We started skipping logging inodes for new dentries as of commit
c48792c6ee7a ("btrfs: do not log new dentries when logging that a new name
exists") and it was fine back then, because when logging a directory we
always iterated over all the directory entries (for leaves changed in the
current transaction) so a subsequent fsync would always log anything that
was previously skipped while logging a directory when logging a new name
(with btrfs_log_new_name()). But later support for incrementally logging
a directory was added in commit dc2872247ec0 ("btrfs: keep track of the
last logged keys when logging a directory"), to avoid checking all dir
items every time we log a directory, so the check to skip dentry logging
added in the first commit should have been removed when the incremental
support for logging a directory was added.
A test case for fstests will follow soon.
Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/84c4e713-85d6-42b9-8dcf-0722ed26cb05@gmail.com/
Fixes: dc2872247ec0 ("btrfs: keep track of the last logged keys when logging a directory")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
transaction
We can't log a conflicting inode if it's a directory and it was moved
from one parent directory to another parent directory in the current
transaction, as this can result an attempt to have a directory with
two hard links during log replay, one for the old parent directory and
another for the new parent directory.
The following scenario triggers that issue:
1) We have directories "dir1" and "dir2" created in a past transaction.
Directory "dir1" has inode A as its parent directory;
2) We move "dir1" to some other directory;
3) We create a file with the name "dir1" in directory inode A;
4) We fsync the new file. This results in logging the inode of the new file
and the inode for the directory "dir1" that was previously moved in the
current transaction. So the log tree has the INODE_REF item for the
new location of "dir1";
5) We move the new file to some other directory. This results in updating
the log tree to included the new INODE_REF for the new location of the
file and removes the INODE_REF for the old location. This happens
during the rename when we call btrfs_log_new_name();
6) We fsync the file, and that persists the log tree changes done in the
previous step (btrfs_log_new_name() only updates the log tree in
memory);
7) We have a power failure;
8) Next time the fs is mounted, log replay happens and when processing
the inode for directory "dir1" we find a new INODE_REF and add that
link, but we don't remove the old link of the inode since we have
not logged the old parent directory of the directory inode "dir1".
As a result after log replay finishes when we trigger writeback of the
subvolume tree's extent buffers, the tree check will detect that we have
a directory a hard link count of 2 and we get a mount failure.
The errors and stack traces reported in dmesg/syslog are like this:
[ 3845.729764] BTRFS info (device dm-0): start tree-log replay
[ 3845.730304] page: refcount:3 mapcount:0 mapping:000000005c8a3027 index:0x1d00 pfn:0x11510c
[ 3845.731236] memcg:ffff9264c02f4e00
[ 3845.731751] aops:btree_aops [btrfs] ino:1
[ 3845.732300] flags: 0x17fffc00000400a(uptodate|private|writeback|node=0|zone=2|lastcpupid=0x1ffff)
[ 3845.733346] raw: 017fffc00000400a 0000000000000000 dead000000000122 ffff9264d978aea8
[ 3845.734265] raw: 0000000000001d00 ffff92650e6d4738 00000003ffffffff ffff9264c02f4e00
[ 3845.735305] page dumped because: eb page dump
[ 3845.735981] BTRFS critical (device dm-0): corrupt leaf: root=5 block=30408704 slot=6 ino=257, invalid nlink: has 2 expect no more than 1 for dir
[ 3845.737786] BTRFS info (device dm-0): leaf 30408704 gen 10 total ptrs 17 free space 14881 owner 5
[ 3845.737789] BTRFS info (device dm-0): refs 4 lock_owner 0 current 30701
[ 3845.737792] item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
[ 3845.737794] inode generation 3 transid 9 size 16 nbytes 16384
[ 3845.737795] block group 0 mode 40755 links 1 uid 0 gid 0
[ 3845.737797] rdev 0 sequence 2 flags 0x0
[ 3845.737798] atime 1764259517.0
[ 3845.737800] ctime 1764259517.572889464
[ 3845.737801] mtime 1764259517.572889464
[ 3845.737802] otime 1764259517.0
[ 3845.737803] item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
[ 3845.737805] index 0 name_len 2
[ 3845.737807] item 2 key (256 DIR_ITEM 2363071922) itemoff 16077 itemsize 34
[ 3845.737808] location key (257 1 0) type 2
[ 3845.737810] transid 9 data_len 0 name_len 4
[ 3845.737811] item 3 key (256 DIR_ITEM 2676584006) itemoff 16043 itemsize 34
[ 3845.737813] location key (258 1 0) type 2
[ 3845.737814] transid 9 data_len 0 name_len 4
[ 3845.737815] item 4 key (256 DIR_INDEX 2) itemoff 16009 itemsize 34
[ 3845.737816] location key (257 1 0) type 2
[ 3845.737818] transid 9 data_len 0 name_len 4
[ 3845.737819] item 5 key (256 DIR_INDEX 3) itemoff 15975 itemsize 34
[ 3845.737820] location key (258 1 0) type 2
[ 3845.737821] transid 9 data_len 0 name_len 4
[ 3845.737822] item 6 key (257 INODE_ITEM 0) itemoff 15815 itemsize 160
[ 3845.737824] inode generation 9 transid 10 size 6 nbytes 0
[ 3845.737825] block group 0 mode 40755 links 2 uid 0 gid 0
[ 3845.737826] rdev 0 sequence 1 flags 0x0
[ 3845.737827] atime 1764259517.572889464
[ 3845.737828] ctime 1764259517.572889464
[ 3845.737830] mtime 1764259517.572889464
[ 3845.737831] otime 1764259517.572889464
[ 3845.737832] item 7 key (257 INODE_REF 256) itemoff 15801 itemsize 14
[ 3845.737833] index 2 name_len 4
[ 3845.737834] item 8 key (257 INODE_REF 258) itemoff 15787 itemsize 14
[ 3845.737836] index 2 name_len 4
[ 3845.737837] item 9 key (257 DIR_ITEM 2507850652) itemoff 15754 itemsize 33
[ 3845.737838] location key (259 1 0) type 1
[ 3845.737839] transid 10 data_len 0 name_len 3
[ 3845.737840] item 10 key (257 DIR_INDEX 2) itemoff 15721 itemsize 33
[ 3845.737842] location key (259 1 0) type 1
[ 3845.737843] transid 10 data_len 0 name_len 3
[ 3845.737844] item 11 key (258 INODE_ITEM 0) itemoff 15561 itemsize 160
[ 3845.737846] inode generation 9 transid 10 size 8 nbytes 0
[ 3845.737847] block group 0 mode 40755 links 1 uid 0 gid 0
[ 3845.737848] rdev 0 sequence 1 flags 0x0
[ 3845.737849] atime 1764259517.572889464
[ 3845.737850] ctime 1764259517.572889464
[ 3845.737851] mtime 1764259517.572889464
[ 3845.737852] otime 1764259517.572889464
[ 3845.737853] item 12 key (258 INODE_REF 256) itemoff 15547 itemsize 14
[ 3845.737855] index 3 name_len 4
[ 3845.737856] item 13 key (258 DIR_ITEM 1843588421) itemoff 15513 itemsize 34
[ 3845.737857] location key (257 1 0) type 2
[ 3845.737858] transid 10 data_len 0 name_len 4
[ 3845.737860] item 14 key (258 DIR_INDEX 2) itemoff 15479 itemsize 34
[ 3845.737861] location key (257 1 0) type 2
[ 3845.737862] transid 10 data_len 0 name_len 4
[ 3845.737863] item 15 key (259 INODE_ITEM 0) itemoff 15319 itemsize 160
[ 3845.737865] inode generation 10 transid 10 size 0 nbytes 0
[ 3845.737866] block group 0 mode 100600 links 1 uid 0 gid 0
[ 3845.737867] rdev 0 sequence 2 flags 0x0
[ 3845.737868] atime 1764259517.580874966
[ 3845.737869] ctime 1764259517.586121869
[ 3845.737870] mtime 1764259517.580874966
[ 3845.737872] otime 1764259517.580874966
[ 3845.737873] item 16 key (259 INODE_REF 257) itemoff 15306 itemsize 13
[ 3845.737874] index 2 name_len 3
[ 3845.737875] BTRFS error (device dm-0): block=30408704 write time tree block corruption detected
[ 3845.739448] ------------[ cut here ]------------
[ 3845.740092] WARNING: CPU: 5 PID: 30701 at fs/btrfs/disk-io.c:335 btree_csum_one_bio+0x25a/0x270 [btrfs]
[ 3845.741439] Modules linked in: btrfs dm_flakey crc32c_cryptoapi (...)
[ 3845.750626] CPU: 5 UID: 0 PID: 30701 Comm: mount Tainted: G W 6.18.0-rc6-btrfs-next-218+ #1 PREEMPT(full)
[ 3845.752414] Tainted: [W]=WARN
[ 3845.752828] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[ 3845.754499] RIP: 0010:btree_csum_one_bio+0x25a/0x270 [btrfs]
[ 3845.755460] Code: 31 f6 48 89 (...)
[ 3845.758685] RSP: 0018:ffffa8d9c5677678 EFLAGS: 00010246
[ 3845.759450] RAX: 0000000000000000 RBX: ffff92650e6d4738 RCX: 0000000000000000
[ 3845.760309] RDX: 0000000000000000 RSI: ffffffff9aab45b9 RDI: ffff9264c4748000
[ 3845.761239] RBP: ffff9264d4324000 R08: 0000000000000000 R09: ffffa8d9c5677468
[ 3845.762607] R10: ffff926bdc1fffa8 R11: 0000000000000003 R12: ffffa8d9c5677680
[ 3845.764099] R13: 0000000000004000 R14: ffff9264dd624000 R15: ffff9264d978aba8
[ 3845.765094] FS: 00007f751fa5a840(0000) GS:ffff926c42a82000(0000) knlGS:0000000000000000
[ 3845.766226] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3845.766970] CR2: 0000558df1815380 CR3: 000000010ed88003 CR4: 0000000000370ef0
[ 3845.768009] Call Trace:
[ 3845.768392] <TASK>
[ 3845.768714] btrfs_submit_bbio+0x6ee/0x7f0 [btrfs]
[ 3845.769640] ? write_one_eb+0x28e/0x340 [btrfs]
[ 3845.770588] btree_write_cache_pages+0x2f0/0x550 [btrfs]
[ 3845.771286] ? alloc_extent_state+0x19/0x100 [btrfs]
[ 3845.771967] ? merge_next_state+0x1a/0x90 [btrfs]
[ 3845.772586] ? set_extent_bit+0x233/0x8b0 [btrfs]
[ 3845.773198] ? xas_load+0x9/0xc0
[ 3845.773589] ? xas_find+0x14d/0x1a0
[ 3845.773969] do_writepages+0xc6/0x160
[ 3845.774367] filemap_fdatawrite_wbc+0x48/0x60
[ 3845.775003] __filemap_fdatawrite_range+0x5b/0x80
[ 3845.775902] btrfs_write_marked_extents+0x61/0x170 [btrfs]
[ 3845.776707] btrfs_write_and_wait_transaction+0x4e/0xc0 [btrfs]
[ 3845.777379] ? _raw_spin_unlock_irqrestore+0x23/0x40
[ 3845.777923] btrfs_commit_transaction+0x5ea/0xd20 [btrfs]
[ 3845.778551] ? _raw_spin_unlock+0x15/0x30
[ 3845.778986] ? release_extent_buffer+0x34/0x160 [btrfs]
[ 3845.779659] btrfs_recover_log_trees+0x7a3/0x7c0 [btrfs]
[ 3845.780416] ? __pfx_replay_one_buffer+0x10/0x10 [btrfs]
[ 3845.781499] open_ctree+0x10bb/0x15f0 [btrfs]
[ 3845.782194] btrfs_get_tree.cold+0xb/0x16c [btrfs]
[ 3845.782764] ? fscontext_read+0x15c/0x180
[ 3845.783202] ? rw_verify_area+0x50/0x180
[ 3845.783667] vfs_get_tree+0x25/0xd0
[ 3845.784047] vfs_cmd_create+0x59/0xe0
[ 3845.784458] __do_sys_fsconfig+0x4f6/0x6b0
[ 3845.784914] do_syscall_64+0x50/0x1220
[ 3845.785340] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 3845.785980] RIP: 0033:0x7f751fc7f4aa
[ 3845.786759] Code: 73 01 c3 48 (...)
[ 3845.789951] RSP: 002b:00007ffcdba45dc8 EFLAGS: 00000246 ORIG_RAX: 00000000000001af
[ 3845.791402] RAX: ffffffffffffffda RBX: 000055ccc8291c20 RCX: 00007f751fc7f4aa
[ 3845.792688] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
[ 3845.794308] RBP: 000055ccc8292120 R08: 0000000000000000 R09: 0000000000000000
[ 3845.795829] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 3845.797183] R13: 00007f751fe11580 R14: 00007f751fe1326c R15: 00007f751fdf8a23
[ 3845.798633] </TASK>
[ 3845.799067] ---[ end trace 0000000000000000 ]---
[ 3845.800215] BTRFS: error (device dm-0) in btrfs_commit_transaction:2553: errno=-5 IO failure (Error while writing out transaction)
[ 3845.801860] BTRFS warning (device dm-0 state E): Skipping commit of aborted transaction.
[ 3845.802815] BTRFS error (device dm-0 state EA): Transaction aborted (error -5)
[ 3845.803728] BTRFS: error (device dm-0 state EA) in cleanup_transaction:2036: errno=-5 IO failure
[ 3845.805374] BTRFS: error (device dm-0 state EA) in btrfs_replay_log:2083: errno=-5 IO failure (Failed to recover log tree)
[ 3845.807919] BTRFS error (device dm-0 state EA): open_ctree failed: -5
Fix this by never logging a conflicting inode that is a directory and was
moved in the current transaction (its last_unlink_trans equals the current
transaction) and instead fallback to a transaction commit.
A test case for fstests will follow soon.
Reported-by: Vyacheslav Kovalevsky <slva.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/7bbc9419-5c56-450a-b5a0-efeae7457113@gmail.com/
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We converted this code to use auto free cleanup.h magic but one
remaining free was accidentally left behind which leads to a double free
bug.
Fixes: a320476ca8a3 ("btrfs: tests: do trivial BTRFS_PATH_AUTO_FREE conversions")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Pull KVM updates from Paolo Bonzini:
"ARM:
- Support for userspace handling of synchronous external aborts
(SEAs), allowing the VMM to potentially handle the abort in a
non-fatal manner
- Large rework of the VGIC's list register handling with the goal of
supporting more active/pending IRQs than available list registers
in hardware. In addition, the VGIC now supports EOImode==1 style
deactivations for IRQs which may occur on a separate vCPU than the
one that acked the IRQ
- Support for FEAT_XNX (user / privileged execute permissions) and
FEAT_HAF (hardware update to the Access Flag) in the software page
table walkers and shadow MMU
- Allow page table destruction to reschedule, fixing long
need_resched latencies observed when destroying a large VM
- Minor fixes to KVM and selftests
Loongarch:
- Get VM PMU capability from HW GCFG register
- Add AVEC basic support
- Use 64-bit register definition for EIOINTC
- Add KVM timer test cases for tools/selftests
RISC/V:
- SBI message passing (MPXY) support for KVM guest
- Give a new, more specific error subcode for the case when in-kernel
AIA virtualization fails to allocate IMSIC VS-file
- Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually
in small chunks
- Fix guest page fault within HLV* instructions
- Flush VS-stage TLB after VCPU migration for Andes cores
s390:
- Always allocate ESCA (Extended System Control Area), instead of
starting with the basic SCA and converting to ESCA with the
addition of the 65th vCPU. The price is increased number of exits
(and worse performance) on z10 and earlier processor; ESCA was
introduced by z114/z196 in 2010
- VIRT_XFER_TO_GUEST_WORK support
- Operation exception forwarding support
- Cleanups
x86:
- Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO
SPTE caching is disabled, as there can't be any relevant SPTEs to
zap
- Relocate a misplaced export
- Fix an async #PF bug where KVM would clear the completion queue
when the guest transitioned in and out of paging mode, e.g. when
handling an SMI and then returning to paged mode via RSM
- Leave KVM's user-return notifier registered even when disabling
virtualization, as long as kvm.ko is loaded. On reboot/shutdown,
keeping the notifier registered is ok; the kernel does not use the
MSRs and the callback will run cleanly and restore host MSRs if the
CPU manages to return to userspace before the system goes down
- Use the checked version of {get,put}_user()
- Fix a long-lurking bug where KVM's lack of catch-up logic for
periodic APIC timers can result in a hard lockup in the host
- Revert the periodic kvmclock sync logic now that KVM doesn't use a
clocksource that's subject to NTP corrections
- Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the
latter behind CONFIG_CPU_MITIGATIONS
- Context switch XCR0, XSS, and PKRU outside of the entry/exit fast
path; the only reason they were handled in the fast path was to
paper of a bug in the core #MC code, and that has long since been
fixed
- Add emulator support for AVX MOV instructions, to play nice with
emulated devices whose guest drivers like to access PCI BARs with
large multi-byte instructions
x86 (AMD):
- Fix a few missing "VMCB dirty" bugs
- Fix the worst of KVM's lack of EFER.LMSLE emulation
- Add AVIC support for addressing 4k vCPUs in x2AVIC mode
- Fix incorrect handling of selective CR0 writes when checking
intercepts during emulation of L2 instructions
- Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32]
on VMRUN and #VMEXIT
- Fix a bug where KVM corrupt the guest code stream when re-injecting
a soft interrupt if the guest patched the underlying code after the
VM-Exit, e.g. when Linux patches code with a temporary INT3
- Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits
to userspace, and extend KVM "support" to all policy bits that
don't require any actual support from KVM
x86 (Intel):
- Use the root role from kvm_mmu_page to construct EPTPs instead of
the current vCPU state, partly as worthwhile cleanup, but mostly to
pave the way for tracking per-root TLB flushes, and elide EPT
flushes on pCPU migration if the root is clean from a previous
flush
- Add a few missing nested consistency checks
- Rip out support for doing "early" consistency checks via hardware
as the functionality hasn't been used in years and is no longer
useful in general; replace it with an off-by-default module param
to WARN if hardware fails a check that KVM does not perform
- Fix a currently-benign bug where KVM would drop the guest's
SPEC_CTRL[63:32] on VM-Enter
- Misc cleanups
- Overhaul the TDX code to address systemic races where KVM (acting
on behalf of userspace) could inadvertantly trigger lock contention
in the TDX-Module; KVM was either working around these in weird,
ugly ways, or was simply oblivious to them (though even Yan's
devilish selftests could only break individual VMs, not the host
kernel)
- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a
TDX vCPU, if creating said vCPU failed partway through
- Fix a few sparse warnings (bad annotation, 0 != NULL)
- Use struct_size() to simplify copying TDX capabilities to userspace
- Fix a bug where TDX would effectively corrupt user-return MSR
values if the TDX Module rejects VP.ENTER and thus doesn't clobber
host MSRs as expected
Selftests:
- Fix a math goof in mmu_stress_test when running on a single-CPU
system/VM
- Forcefully override ARCH from x86_64 to x86 to play nice with
specifying ARCH=x86_64 on the command line
- Extend a bunch of nested VMX to validate nested SVM as well
- Add support for LA57 in the core VM_MODE_xxx macro, and add a test
to verify KVM can save/restore nested VMX state when L1 is using
5-level paging, but L2 is not
- Clean up the guest paging code in anticipation of sharing the core
logic for nested EPT and nested NPT
guest_memfd:
- Add NUMA mempolicy support for guest_memfd, and clean up a variety
of rough edges in guest_memfd along the way
- Define a CLASS to automatically handle get+put when grabbing a
guest_memfd from a memslot to make it harder to leak references
- Enhance KVM selftests to make it easer to develop and debug
selftests like those added for guest_memfd NUMA support, e.g. where
test and/or KVM bugs often result in hard-to-debug SIGBUS errors
- Misc cleanups
Generic:
- Use the recently-added WQ_PERCPU when creating the per-CPU
workqueue for irqfd cleanup
- Fix a goof in the dirty ring documentation
- Fix choice of target for directed yield across different calls to
kvm_vcpu_on_spin(); the function was always starting from the first
vCPU instead of continuing the round-robin search"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (260 commits)
KVM: arm64: at: Update AF on software walk only if VM has FEAT_HAFDBS
KVM: arm64: at: Use correct HA bit in TCR_EL2 when regime is EL2
KVM: arm64: Document KVM_PGTABLE_PROT_{UX,PX}
KVM: arm64: Fix spelling mistake "Unexpeced" -> "Unexpected"
KVM: arm64: Add break to default case in kvm_pgtable_stage2_pte_prot()
KVM: arm64: Add endian casting to kvm_swap_s[12]_desc()
KVM: arm64: Fix compilation when CONFIG_ARM64_USE_LSE_ATOMICS=n
KVM: arm64: selftests: Add test for AT emulation
KVM: arm64: nv: Expose hardware access flag management to NV guests
KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW
KVM: arm64: Implement HW access flag management in stage-1 SW PTW
KVM: arm64: Propagate PTW errors up to AT emulation
KVM: arm64: Add helper for swapping guest descriptor
KVM: arm64: nv: Use pgtable definitions in stage-2 walk
KVM: arm64: Handle endianness in read helper for emulated PTW
KVM: arm64: nv: Stop passing vCPU through void ptr in S2 PTW
KVM: arm64: Call helper for reading descriptors directly
KVM: arm64: nv: Advertise support for FEAT_XNX
KVM: arm64: Teach ptdump about FEAT_XNX permissions
KVM: s390: Use generic VIRT_XFER_TO_GUEST_WORK functions
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Features:
- shutdown ioctl support (needs CONFIG_BTRFS_EXPERIMENTAL for now):
- set filesystem state as being shut down (also named going down
in other filesystems), where all active operations return EIO
and this cannot be changed until unmount
- pending operations are attempted to be finished but error
messages may still show up depending on where exactly the
shutdown happened
- scrub (and device replace) vs suspend/hibernate:
- a running scrub will prevent suspend, which can be annoying as
suspend is an immediate request and scrub is not critical
- filesystem freezing before suspend was not sufficient as the
problem was in process freezing
- behaviour change: on suspend scrub and device replace are
cancelled, where scrub can record the last state and continue
from there; the device replace has to be restarted from the
beginning
- zone stats exported in sysfs, from the perspective of the
filesystem this includes active, reclaimable, relocation etc zones
Performance:
- improvements when processing space reservation tickets by
optimizing locking and shrinking critical sections, cumulative
improvements in lockstat numbers show +15%
Notable fixes:
- use vmalloc fallback when allocating bios as high order allocations
can happen with wide checksums (like sha256)
- scrub will always track the last position of progress so it's not
starting from zero after an error
Core:
- under experimental config, checksum calculations are offloaded to
process context, simplifies locking and allows to remove
compression write worker kthread(s):
- speed improvement in direct IO throughput with buffered IO
fallback is +15% when not offloaded but this is more related to
internal crypto subsystem improvements
- this will be probably default in the future removing the sysfs
tunable
- (experimental) block size > page size updates:
- support more operations when not using large folios (encoded
read/write and send)
- raid56
- more preparations for fscrypt support
Other:
- more conversions to auto-cleaned variables
- parameter cleanups and removals
- extended warning fixes
- improved printing of structured values like keys
- lots of other cleanups and refactoring"
* tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (147 commits)
btrfs: remove unnecessary inode key in btrfs_log_all_parents()
btrfs: remove redundant zero/NULL initializations in btrfs_alloc_root()
btrfs: remaining BTRFS_PATH_AUTO_FREE conversions
btrfs: send: do not allocate memory for xattr data when checking it exists
btrfs: send: add unlikely to all unexpected overflow checks
btrfs: reduce arguments to btrfs_del_inode_ref_in_log()
btrfs: remove root argument from btrfs_del_dir_entries_in_log()
btrfs: use test_and_set_bit() in btrfs_delayed_delete_inode_ref()
btrfs: don't search back for dir inode item in INO_LOOKUP_USER
btrfs: don't rewrite ret from inode_permission
btrfs: add orig_logical to btrfs_bio for encryption
btrfs: disable verity on encrypted inodes
btrfs: disable various operations on encrypted inodes
btrfs: remove redundant level reset in btrfs_del_items()
btrfs: simplify leaf traversal after path release in btrfs_next_old_leaf()
btrfs: optimize balance_level() path reference handling
btrfs: factor out root promotion logic into promote_child_to_root()
btrfs: raid56: remove the "_step" infix
btrfs: raid56: enable bs > ps support
btrfs: raid56: prepare finish_parity_scrub() to support bs > ps cases
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:
- Fix head insertion for mq-deadline, a regression from when priority
support was added
- Series simplifying and improving the ublk user copy code
- Various ublk related cleanups
- Fixup REQ_NOWAIT handling in loop/zloop, clearing NOWAIT when the
request is punted to a thread for handling
- Merge and then later revert loop dio nowait support, as it ended up
causing excessive stack usage for when the inline issue code needs to
dip back into the full file system code
- Improve auto integrity code, making it less deadlock prone
- Speedup polled IO handling, but manually managing the hctx lookups
- Fixes for blk-throttle for SSD devices
- Small series with fixes for the S390 dasd driver
- Add support for caching zones, avoiding unnecessary report zone
queries
- MD pull requests via Yu:
- fix null-ptr-dereference regression for dm-raid0
- fix IO hang for raid5 when array is broken with IO inflight
- remove legacy 1s delay to speed up system shutdown
- change maintainer's email address
- data can be lost if array is created with different lbs devices,
fix this problem and record lbs of the array in metadata
- fix rcu protection for md_thread
- fix mddev kobject lifetime regression
- enable atomic writes for md-linear
- some cleanups
- bcache updates via Coly
- remove useless discard and cache device code
- improve usage of per-cpu workqueues
- Reorganize the IO scheduler switching code, fixing some lockdep
reports as well
- Improve the block layer P2P DMA support
- Add support to the block tracing code for zoned devices
- Segment calculation improves, and memory alignment flexibility
improvements
- Set of prep and cleanups patches for ublk batching support. The
actual batching hasn't been added yet, but helps shrink down the
workload of getting that patchset ready for 6.20
- Fix for how the ps3 block driver handles segments offsets
- Improve how block plugging handles batch tag allocations
- nbd fixes for use-after-free of the configuration on device clear/put
- Set of improvements and fixes for zloop
- Add Damien as maintainer of the block zoned device code handling
- Various other fixes and cleanups
* tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
block/rnbd: correct all kernel-doc complaints
blk-mq: use queue_hctx in blk_mq_map_queue_type
md: remove legacy 1s delay in md_notify_reboot
md/raid5: fix IO hang when array is broken with IO inflight
md: warn about updating super block failure
md/raid0: fix NULL pointer dereference in create_strip_zones() for dm-raid
sbitmap: fix all kernel-doc warnings
ublk: add helper of __ublk_fetch()
ublk: pass const pointer to ublk_queue_is_zoned()
ublk: refactor auto buffer register in ublk_dispatch_req()
ublk: add `union ublk_io_buf` with improved naming
ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg()
kfifo: add kfifo_alloc_node() helper for NUMA awareness
blk-mq: fix potential uaf for 'queue_hw_ctx'
blk-mq: use array manage hctx map instead of xarray
ublk: prevent invalid access with DEBUG
s390/dasd: Use scnprintf() instead of sprintf()
s390/dasd: Move device name formatting into separate function
s390/dasd: Remove unnecessary debugfs_create() return checks
s390/dasd: Fix gendisk parent after copy pair swap
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring updates from Jens Axboe:
- Unify how task_work cancelations are detected, placing it in the
task_work running state rather than needing to check the task state
- Series cleaning up and moving the cancelation code to where it
belongs, in cancel.c
- Cleanup of waitid and futex argument handling
- Add support for mixed sized SQEs. 6.18 added support for mixed sized
CQEs, improving flexibility and efficiency of workloads that need big
CQEs. This adds similar support for SQEs, where the occasional need
for a 128b SQE doesn't necessitate having all SQEs be 128b in size
- Introduce zcrx and SQ/CQ layout queries. The former returns what zcrx
features are available. And both return the ring size information to
help with allocation size calculation for user provided rings like
IORING_SETUP_NO_MMAP and IORING_MEM_REGION_TYPE_USER
- Zcrx updates for 6.19. It includes a bunch of small patches,
IORING_REGISTER_ZCRX_CTRL and RQ flushing and David's work on sharing
zcrx b/w multiple io_uring instances
- Series cleaning up ring initializations, notable deduplicating ring
size and offset calculations. It also moves most of the checking
before doing any allocations, making the code simpler
- Add support for getsockname and getpeername, which is mostly a
trivial hookup after a bit of refactoring on the networking side
- Various fixes and cleanups
* tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits)
io_uring: Introduce getsockname io_uring cmd
socket: Split out a getsockname helper for io_uring
socket: Unify getsockname and getpeername implementation
io_uring/query: drop unused io_handle_query_entry() ctx arg
io_uring/kbuf: remove obsolete buf_nr_pages and update comments
io_uring/register: use correct location for io_rings_layout
io_uring/zcrx: share an ifq between rings
io_uring/zcrx: add io_fill_zcrx_offsets()
io_uring/zcrx: export zcrx via a file
io_uring/zcrx: move io_zcrx_scrub() and dependencies up
io_uring/zcrx: count zcrx users
io_uring/zcrx: add sync refill queue flushing
io_uring/zcrx: introduce IORING_REGISTER_ZCRX_CTRL
io_uring/zcrx: elide passing msg flags
io_uring/zcrx: use folio_nr_pages() instead of shift operation
io_uring/zcrx: convert to use netmem_desc
io_uring/query: introduce rings info query
io_uring/query: introduce zcrx query
io_uring: move cq/sq user offset init around
io_uring: pre-calculate scq layout
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux
Pull Kbuild updates from Nicolas Schier:
- Enable -fms-extensions, allowing anonymous use of tagged struct or
union in struct/union (tag kbuild-ms-extensions-6.19). An exemplary
conversion patch is added here, too (btrfs).
[ Editor's note: the core of this actually came in early through a
shared branch and a few other trees - Linus ]
- Introduce architecture-specific CC_CAN_LINK and flags for userprogs
- Add new packaging target 'modules-cpio-pkg' for building a initramfs
cpio w/ kmods
- Handle included .c files in gen_compile_commands
- Minor kbuild changes:
- Use objtree for module signing key path, fixing oot kmod signing
- Improve documentation of KBUILD_BUILD_TIMESTAMP
- Reuse KBUILD_USERCFLAGS for UAPI, instead of defining twice
- Rename scripts/Makefile.extrawarn to Makefile.warn
- Drop obsolete types.h check from headers_check.pl
- Remove outdated config leak ignore entries
* tag 'kbuild-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux:
kbuild: add target to build a cpio containing modules
initramfs: add gen_init_cpio to hostprogs unconditionally
kbuild: allow architectures to override CC_CAN_LINK
init: deduplicate cc-can-link.sh invocations
kbuild: don't enable CC_CAN_LINK if the dummy program generates warnings
scripts: headers_install.sh: Remove two outdated config leak ignore entries
scripts/clang-tools: Handle included .c files in gen_compile_commands
kbuild: uapi: Drop types.h check from headers_check.pl
kbuild: Rename Makefile.extrawarn to Makefile.warn
MAINTAINERS, .mailmap: Update mail address for Nicolas Schier
kbuild: uapi: reuse KBUILD_USERCFLAGS
kbuild: doc: improve KBUILD_BUILD_TIMESTAMP documentation
kbuild: Use objtree for module signing key path
btrfs: send: make use of -fms-extensions for defining struct fs_path
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull directory locking updates from Christian Brauner:
"This contains the work to add centralized APIs for directory locking
operations.
This series is part of a larger effort to change directory operation
locking to allow multiple concurrent operations in a directory. The
ultimate goal is to lock the target dentry(s) rather than the whole
parent directory.
To help with changing the locking protocol, this series centralizes
locking and lookup in new helper functions. The helpers establish a
pattern where it is the dentry that is being locked and unlocked
(currently the lock is held on dentry->d_parent->d_inode, but that can
change in the future).
This also changes vfs_mkdir() to unlock the parent on failure, as well
as dput()ing the dentry. This allows end_creating() to only require
the target dentry (which may be IS_ERR() after vfs_mkdir()), not the
parent"
* tag 'vfs-6.19-rc1.directory.locking' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
nfsd: fix end_creating() conversion
VFS: introduce end_creating_keep()
VFS: change vfs_mkdir() to unlock on failure.
ecryptfs: use new start_creating/start_removing APIs
Add start_renaming_two_dentries()
VFS/ovl/smb: introduce start_renaming_dentry()
VFS/nfsd/ovl: introduce start_renaming() and end_renaming()
VFS: add start_creating_killable() and start_removing_killable()
VFS: introduce start_removing_dentry()
smb/server: use end_removing_noperm for for target of smb2_create_link()
VFS: introduce start_creating_noperm() and start_removing_noperm()
VFS/nfsd/cachefiles/ovl: introduce start_removing() and end_removing()
VFS/nfsd/cachefiles/ovl: add start_creating() and end_creating()
VFS: tidy up do_unlinkat()
VFS: introduce start_dirop() and end_dirop()
debugfs: rename end_creating() to debugfs_end_creating()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull superblock lock guard updates from Christian Brauner:
"This starts the work of introducing guards for superblock related
locks.
Introduce super_write_guard for scoped superblock write protection.
This provides a guard-based alternative to the manual sb_start_write()
and sb_end_write() pattern, allowing the compiler to automatically
handle the cleanup"
* tag 'vfs-6.19-rc1.guards' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
xfs: use super write guard in xfs_file_ioctl()
open: use super write guard in do_ftruncate()
btrfs: use super write guard in relocating_repair_kthread()
ext4: use super write guard in write_mmp_block()
btrfs: use super write guard in sb_start_write()
btrfs: use super write guard btrfs_run_defrag_inode()
btrfs: use super write guard in btrfs_reclaim_bgs_work()
fs: add super_write_guard
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fs header updates from Christian Brauner:
"This contains initial work to start splitting up fs.h.
Begin the long-overdue work of splitting up the monolithic fs.h
header. The header has grown to over 3000 lines and includes types and
functions for many different subsystems, making it difficult to
navigate and causing excessive compilation dependencies.
This series introduces new focused headers for superblock-related
code:
- Rename fs_types.h to fs_dirent.h to better reflect its actual
content (directory entry types)
- Add fs/super_types.h containing superblock type definitions
- Add fs/super.h containing superblock function declarations
This is the first step in a longer effort to modularize the VFS
headers.
Cleanups:
- Inode Field Layout Optimization (Mateusz Guzik)
Move inode fields used during fast path lookup closer together to
improve cache locality during path resolution.
- current_umask() Optimization (Mateusz Guzik)
Inline current_umask() and move it to fs_struct.h. This improves
performance by avoiding function call overhead for this
frequently-used function, and places it in a more appropriate
header since it operates on fs_struct"
* tag 'vfs-6.19-rc1.fs_header' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: move inode fields used during fast path lookup closer together
fs: inline current_umask() and move it to fs_struct.h
fs: add fs/super.h header
fs: add fs/super_types.h header
fs: rename fs_types.h to fs_dirent.h
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull folio updates from Christian Brauner:
"Add a new folio_next_pos() helper function that returns the file
position of the first byte after the current folio. This is a common
operation in filesystems when needing to know the end of the current
folio.
The helper is lifted from btrfs which already had its own version, and
is now used across multiple filesystems and subsystems:
- btrfs
- buffer
- ext4
- f2fs
- gfs2
- iomap
- netfs
- xfs
- mm
This fixes a long-standing bug in ocfs2 on 32-bit systems with files
larger than 2GiB. Presumably this is not a common configuration, but
the fix is backported anyway. The other filesystems did not have bugs,
they were just mildly inefficient.
This also introduce uoff_t as the unsigned version of loff_t. A recent
commit inadvertently changed a comparison from being unsigned (on
64-bit systems) to being signed (which it had always been on 32-bit
systems), leading to sporadic fstests failures.
Generally file sizes are restricted to being a signed integer, but in
places where -1 is passed to indicate "up to the end of the file", it
is convenient to have an unsigned type to ensure comparisons are
always unsigned regardless of architecture"
* tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Add uoff_t
mm: Use folio_next_pos()
xfs: Use folio_next_pos()
netfs: Use folio_next_pos()
iomap: Use folio_next_pos()
gfs2: Use folio_next_pos()
f2fs: Use folio_next_pos()
ext4: Use folio_next_pos()
buffer: Use folio_next_pos()
btrfs: Use folio_next_pos()
filemap: Add folio_next_pos()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull writeback updates from Christian Brauner:
"Features:
- Allow file systems to increase the minimum writeback chunk size.
The relatively low minimal writeback size of 4MiB means that
written back inodes on rotational media are switched a lot. Besides
introducing additional seeks, this also can lead to extreme file
fragmentation on zoned devices when a lot of files are cached
relative to the available writeback bandwidth.
This adds a superblock field that allows the file system to
override the default size, and sets it to the zone size for zoned
XFS.
- Add logging for slow writeback when it exceeds
sysctl_hung_task_timeout_secs. This helps identify tasks waiting
for a long time and pinpoint potential issues. Recording the
starting jiffies is also useful when debugging a crashed vmcore.
- Wake up waiting tasks when finishing the writeback of a chunk
Cleanups:
- filemap_* writeback interface cleanups.
Adding filemap_fdatawrite_wbc ended up being a mistake, as all but
the original btrfs caller should be using better high level
interfaces instead.
This series removes all these low-level interfaces, switches btrfs
to a more specific interface, and cleans up other too low-level
interfaces. With this the writeback_control that is passed to the
writeback code is only initialized in three places.
- Remove __filemap_fdatawrite, __filemap_fdatawrite_range, and
filemap_fdatawrite_wbc
- Add filemap_flush_nr helper for btrfs
- Push struct writeback_control into start_delalloc_inodes in btrfs
- Rename filemap_fdatawrite_range_kick to filemap_flush_range
- Stop opencoding filemap_fdatawrite_range in 9p, ocfs2, and mm
- Make wbc_to_tag() inline and use it in fs"
* tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Make wbc_to_tag() inline and use it in fs.
xfs: set s_min_writeback_pages for zoned file systems
writeback: allow the file system to override MIN_WRITEBACK_PAGES
writeback: cleanup writeback_chunk_size
mm: rename filemap_fdatawrite_range_kick to filemap_flush_range
mm: remove __filemap_fdatawrite_range
mm: remove filemap_fdatawrite_wbc
mm: remove __filemap_fdatawrite
mm,btrfs: add a filemap_flush_nr helper
btrfs: push struct writeback_control into start_delalloc_inodes
btrfs: use the local tmp_inode variable in start_delalloc_inodes
ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers
9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close
mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode
writeback: Add logging for slow writeback (exceeds sysctl_hung_task_timeout_secs)
writeback: Wake up waiting tasks when finishing the writeback of a chunk.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode updates from Christian Brauner:
"Features:
- Hide inode->i_state behind accessors. Open-coded accesses prevent
asserting they are done correctly. One obvious aspect is locking,
but significantly more can be checked. For example it can be
detected when the code is clearing flags which are already missing,
or is setting flags when it is illegal (e.g., I_FREEING when
->i_count > 0)
- Provide accessors for ->i_state, converts all filesystems using
coccinelle and manual conversions (btrfs, ceph, smb, f2fs, gfs2,
overlayfs, nilfs2, xfs), and makes plain ->i_state access fail to
compile
- Rework I_NEW handling to operate without fences, simplifying the
code after the accessor infrastructure is in place
Cleanups:
- Move wait_on_inode() from writeback.h to fs.h
- Spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
for clarity
- Cosmetic fixes to LRU handling
- Push list presence check into inode_io_list_del()
- Touch up predicts in __d_lookup_rcu()
- ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
- Assert on ->i_count in iput_final()
- Assert ->i_lock held in __iget()
Fixes:
- Add missing fences to I_NEW handling"
* tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
dcache: touch up predicts in __d_lookup_rcu()
fs: push list presence check into inode_io_list_del()
fs: cosmetic fixes to lru handling
fs: rework I_NEW handling to operate without fences
fs: make plain ->i_state access fail to compile
xfs: use the new ->i_state accessors
nilfs2: use the new ->i_state accessors
overlayfs: use the new ->i_state accessors
gfs2: use the new ->i_state accessors
f2fs: use the new ->i_state accessors
smb: use the new ->i_state accessors
ceph: use the new ->i_state accessors
btrfs: use the new ->i_state accessors
Manual conversion to use ->i_state accessors of all places not covered by coccinelle
Coccinelle-based conversion to use ->i_state accessors
fs: provide accessors for ->i_state
fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
fs: move wait_on_inode() from writeback.h to fs.h
fs: add missing fences to I_NEW handling
ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Cheaper MAY_EXEC handling for path lookup. This elides MAY_WRITE
permission checks during path lookup and adds the
IOP_FASTPERM_MAY_EXEC flag so filesystems like btrfs can avoid
expensive permission work.
- Hide dentry_cache behind runtime const machinery.
- Add German Maglione as virtiofs co-maintainer.
Cleanups:
- Tidy up and inline step_into() and walk_component() for improved
code generation.
- Re-enable IOCB_NOWAIT writes to files. This refactors file
timestamp update logic, fixing a layering bypass in btrfs when
updating timestamps on device files and improving FMODE_NOCMTIME
handling in VFS now that nfsd started using it.
- Path lookup optimizations extracting slowpaths into dedicated
routines and adding branch prediction hints for mntput_no_expire(),
fd_install(), lookup_slow(), and various other hot paths.
- Enable clang's -fms-extensions flag, requiring a JFS rename to
avoid conflicts.
- Remove spurious exports in fs/file_attr.c.
- Stop duplicating union pipe_index declaration. This depends on the
shared kbuild branch that brings in -fms-extensions support which
is merged into this branch.
- Use MD5 library instead of crypto_shash in ecryptfs.
- Use largest_zero_folio() in iomap_dio_zero().
- Replace simple_strtol/strtoul with kstrtoint/kstrtouint in init and
initrd code.
- Various typo fixes.
Fixes:
- Fix emergency sync for btrfs. Btrfs requires an explicit sync_fs()
call with wait == 1 to commit super blocks. The emergency sync path
never passed this, leaving btrfs data uncommitted during emergency
sync.
- Use local kmap in watch_queue's post_one_notification().
- Add hint prints in sb_set_blocksize() for LBS dependency on THP"
* tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
MAINTAINERS: add German Maglione as virtiofs co-maintainer
fs: inline step_into() and walk_component()
fs: tidy up step_into() & friends before inlining
orangefs: use inode_update_timestamps directly
btrfs: fix the comment on btrfs_update_time
btrfs: use vfs_utimes to update file timestamps
fs: export vfs_utimes
fs: lift the FMODE_NOCMTIME check into file_update_time_flags
fs: refactor file timestamp update logic
include/linux/fs.h: trivial fix: regualr -> regular
fs/splice.c: trivial fix: pipes -> pipe's
fs: mark lookup_slow() as noinline
fs: add predicts based on nd->depth
fs: move mntput_no_expire() slowpath into a dedicated routine
fs: remove spurious exports in fs/file_attr.c
watch_queue: Use local kmap in post_one_notification()
fs: touch up predicts in path lookup
fs: move fd_install() slowpath into a dedicated routine and provide commentary
fs: hide dentry_cache behind runtime const machinery
fs: touch predicts in do_dentry_open()
...
|
|
Since commit e41f941a2311 ("Btrfs: move over to use ->update_time") this
is not a copy of the high-level file_update_time helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-6-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Btrfs updates the device node timestamps for block device special files
when it stop using the device.
Commit 8f96a5bfa150 ("btrfs: update the bdev time directly when closing")
switch that update from the correct layering to directly call the
low-level helper on the bdev inode. This is wrong and got fixed in
commit 54fde91f52f5 ("btrfs: update device path inode time instead of
bd_inode") by updating the file system inode instead of the bdev inode,
but this kept the incorrect bypassing of the VFS interfaces and file
system ->update_times method. Fix this by using the propet vfs_utimes
interface.
Fixes: 8f96a5bfa150 ("btrfs: update the bdev time directly when closing")
Fixes: 54fde91f52f5 ("btrfs: update device path inode time instead of bd_inode")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-5-hch@lst.de
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
KVM guest_memfd changes for 6.19:
- Add NUMA mempolicy support for guest_memfd, and clean up a variety of
rough edges in guest_memfd along the way.
- Define a CLASS to automatically handle get+put when grabbing a guest_memfd
from a memslot to make it harder to leak references.
- Enhance KVM selftests to make it easer to develop and debug selftests like
those added for guest_memfd NUMA support, e.g. where test and/or KVM bugs
often result in hard-to-debug SIGBUS errors.
- Misc cleanups.
|
|
We are setting up an inode key to lookup parent directory inode but all we
need is the inode's objectid. The use of the key was necessary in the past
but since commit 0202e83fdab0 ("btrfs: simplify iget helpers") we only
need the objectid.
So remove the key variable in the stack and use instead a simple u64 for
the inode's objectid.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have allocated the root with kzalloc() so all the memory is already
zero initialized, therefore it's redundant to assign 0 and NULL to several
of the root members. Remove all of them except the atomic initializations
since atomic_t is an opaque type and it's not a good practice to assume
its internals.
This slightly reduces the binary size.
With gcc 14.2.0-19 from Debian on x86_64, before this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1939404 162963 15592 2117959 205147 fs/btrfs/btrfs.ko
After this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1939212 162963 15592 2117767 205087 fs/btrfs/btrfs.ko
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Do the remaining btrfs_path conversion to the auto cleaning, this seems
to be the last one. Most of the conversions are trivial, only adding the
declaration and removing the freeing, or changing the goto patterns to
return.
There are some functions with many changes, like __btrfs_free_extent(),
btrfs_remove_from_free_space_tree() or btrfs_add_to_free_space_tree()
but it still follows the same pattern.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When checking if xattrs were deleted we don't care about their data, but
we are allocating memory for the data and copying it, which only wastes
time and can result in an unnecessary error in case the allocation fails.
So stop allocating memory and copying data by making find_xattr() and
__find_xattr() skip those steps if the given data buffer is NULL.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There are several checks for unexpected overflows of buffers and path
lengths that makes us fail the send operation with an error if for some
highly unexpected reason they happen. So add the unlikely tag to those
checks to hint the compiler to generate better code, while also making
it more explicit in the source that it's highly unexpected.
With gcc 14.2.0-19 from Debian on x86_64, I also got a small reduction
the text size of the btrfs module.
Before:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1936917 162723 15592 2115232 2046a0 fs/btrfs/btrfs.ko
After:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1936789 162723 15592 2115104 204620 fs/btrfs/btrfs.ko
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of passing a root and the objectid of the parent directory, just
pass the directory inode, as like that we can extract both the root and
the objectid, reducing the number of arguments by one. It also makes the
function more consistent with other log tree functions in the sense that
we pass the inode and not only its objectid.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's no need to pass the root as we can extract it from the directory
inode, so remove it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of testing and setting the BTRFS_DELAYED_NODE_DEL_IREF bit in the
delayed node's flags, use test_and_set_bit() which makes the code shorter
without compromising readability and getting rid of the label and goto.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We don't need to search back to the inode item, the directory inode
number is in key.offset, so simply use that. If we can't find the
directory we'll get an ENOENT at the iget().
Note: The patch was taken from v5 of fscrypt patchset
(https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/)
which was handled over time by various people: Omar Sandoval, Sweet Tea
Dorminy, Josef Bacik.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In our user safe ino resolve ioctl we'll just turn any ret into -EACCES
from inode_permission(). This is redundant, and could potentially be
wrong if we had an ENOMEM in the security layer or some such other
error, so simply return the actual return value.
Note: The patch was taken from v5 of fscrypt patchset
(https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/)
which was handled over time by various people: Omar Sandoval, Sweet Tea
Dorminy, Josef Bacik.
Fixes: 23d0b79dfaed ("btrfs: Add unprivileged version of ino_lookup ioctl")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When checksumming the encrypted bio on writes we need to know which
logical address this checksum is for. At the point where we get the
encrypted bio the bi_sector is the physical location on the target disk,
so we need to save the original logical offset in the btrfs_bio. Then
we can use this when checksumming the bio instead of the
bio->iter.bi_sector.
Note: The patch was taken from v5 of fscrypt patchset
(https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/)
which was handled over time by various people: Omar Sandoval, Sweet Tea
Dorminy, Josef Bacik.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Right now there isn't a way to encrypt things that aren't either
filenames in directories or data on blocks on disk with extent
encryption, so for now, disable verity usage with encryption on btrfs.
fscrypt with fsverity should be possible and it can be implemented
in the future.
Note: The patch was taken from v5 of fscrypt patchset
(https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/)
which was handled over time by various people: Omar Sandoval, Sweet Tea
Dorminy, Josef Bacik.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Initially, only normal data extents will be encrypted. This change
forbids various other bits:
- allows reflinking only if both inodes have the same encryption status
- disable inline data on encrypted inodes
Note: The patch was taken from v5 of fscrypt patchset
(https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/)
which was handled over time by various people: Omar Sandoval, Sweet Tea
Dorminy, Josef Bacik.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When btrfs_del_items() empties a leaf, it deletes the leaf unless it's
the root node. For the root leaf case, the code used to reset its level
to 0 via btrfs_set_header_level(). This is redundant as leaf nodes
always have level == 0.
Remove the unnecessary level assignment and invert the conditional to
handle only the non-root leaf deletion. The root leaf is correctly left
as-is.
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
After releasing the path in btrfs_next_old_leaf(), we need to re-check
the leaf because a balance operation may have added items or removed the
last item. The original code handled this with two separate conditional
blocks, the second marked with a lengthy comment explaining a "missed
case".
Merge these two blocks into a single logical structure that handles both
scenarios more clearly.
Also update the comment to be more concise and accurate, incorporating the
explanation directly into the main block rather than a separate annotation.
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of incrementing refcount on 'left' node when it's referenced by
path, simply transfer ownership to path and set left to NULL. This
eliminates:
- Unnecessary refcount increment/decrement operations
- Redundant conditional checks for left node cleanup
The path now consistently owns the left node reference when used.
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The balance_level() function is overly long and contains a cold code path
that handles promoting a child node to root when the root has only one item.
This code has distinct logic that is clearer and more maintainable when
isolated in its own function.
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The following functions are introduced as a middle step for bs > ps
support:
- rbio_streip_step_paddr()
- rbio_pstripe_step_paddr()
- rbio_qstripe_step_paddr()
- sector_step_paddr_in_rbio()
As there is already an existing function without the infix, and has a
different parameter list.
But the existing functions have been cleaned up, there is no need to
keep the "_step" infix, just remove it completely.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The support code for bs > ps is complete, enable it and update
assertions.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function finish_parity_scrub() assume each fs block can be mapped by
one page, blocking bs > ps support for raid56.
Prepare it for bs > ps cases by:
- Introduce a helper, verify_one_parity_step()
Since the P/Q generation is always done in a vertical stripe, we have
to handle the range step by step.
- Only clear the rbio->dbitmap if all steps of an fs block match
- Remove rbio_stripe_paddr() and sector_paddr_in_rbio() helpers
Now we either use the paddrs version for checksum, or the step version
for P/Q generation/recovery.
- Make alloc_rbio_essential_pages() to handle bs > ps cases
Since for bs > ps cases, one fs block needs multiple pages, the
existing simple check against rbio->stripe_pages[] is not enough.
Extract a dedicated helper, alloc_rbio_sector_pages(), for the
existing alloc_rbio_essential_pages(), which is still based on sector
number.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function rbio_bio_add_io_paddr() assume each fs block can be mapped by
one page, blocking bs > ps support for raid56.
Prepare it for bs > ps cases by:
- Introduce a helper bio_add_paddrs()
Previously we only need to add a single page to a bio for a fs block,
but now we need to add multiple pages, this means we can fail halfway.
In that case we need to properly revert the bio (only for its size
though) for halfway failed cases.
- Rename rbio_add_io_paddr() to rbio_add_io_paddrs()
And change the @paddr parameter to @paddrs[].
- Change all callers to use the updated rbio_add_io_paddrs()
For the @paddrs pointer used for the new function, it can be grabbed
using sector_paddrs_in_rbio() and rbio_stripe_paddrs() helpers.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function steal_rbio() assume each fs block can be mapped by
one page, blocking bs > ps support for raid56.
Prepare it for bs > ps cases by:
- Introduce two helpers to calculate the sector number
Previously we assume one page will contain at least one fs block, thus
can use something like "sectors_per_page = PAGE_SIZE / sectorsize;",
but with bs > ps support that above number will be 0.
Instead introduce two helpers:
* page_nr_to_sector_nr()
Returns the sector number of the first sector covered by the page.
* page_nr_to_num_sectors()
Return how many sectors are covered by the page.
And use the returned values for bitmap operations other than
open-coded "PAGE_SIZE / sectorsize".
Those helpers also have extra ASSERT()s to catch weird numbers.
- Use above helpers
The involved functions are:
* steal_rbio_page()
* is_data_stripe_page()
* full_page_sectors_uptodate()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function set_bio_pages_uptodate() assume each fs block can be mapped by
one page, blocking bs > ps support for raid56.
Prepare it for bs > ps cases by:
- Update find_stripe_sector_nr() to check only the first step paddr
We don't need to check each paddr, as the bios are still aligned to fs
block size, thus checking the first step is enough.
- Use step size to iterate the bio
This means we only need to find the sector number for the first step
of each fs block, and skip the remaining part.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function verify_bio_data_sectors() assume each fs block can be mapped by
one page, blocking bs > ps support for raid56.
Prepare it for bs > ps cases by:
- Make get_bio_sector_nr() to consider bs > ps cases
The function is utilized to calculate the sector number of a device
bio submitted by btrfs raid56 layer.
- Assemble a local paddrs[] for checksum calculation
- Open code btrfs_check_block_csum()
btrfs_check_block_csum() only supports fs blocks backed by large
folios.
But for raid56 we can have fs blocks backed by multiple non-contiguous
pages, e.g. direct IO, encoded read/write/send.
So instead of using btrfs_check_block_csum(), open code it to use
btrfs_calculate_block_csum_pages().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function verify_one_sector() assume each fs block can be mapped by
one page, blocking bs > ps support for raid56.
Prepare it for bs > ps cases by:
- Introduce helpers to get a paddrs pointer
Thankfully all the higher layer bio should still be aligned to fs
block size, thus a fs block should still be fully covered by the bio.
Introduce sector_paddrs_in_rbio() and rbio_stripe_paddrs(), which will
return a paddrs pointer inside btrfs_raid_bio::bio_paddrs[] or
stripe_paddrs[].
The pointer can be directly passed to
btrfs_calculate_block_csum_pages() to verify the checksum.
- Open code btrfs_check_block_csum()
btrfs_check_block_csum() only supports fs blocks backed by large
folios.
But for raid56 we can have fs blocks backed by multiple non-contiguous
pages, e.g. direct IO, encoded read/write/send.
So instead of using btrfs_check_block_csum(), open code it to use
btrfs_calculate_block_csum_pages().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently recover_vertical() assumes that every fs block can be mapped
by one page, this is blocking bs > ps support for raid56.
Prepare recover_vertical() to support bs > ps cases by:
- Introduce recover_vertical_step() helper
Which will recover a full step (min(PAGE_SIZE, sectorsize)).
Now recover_vertical() will do the error check for the specified
sector, do the recover step by step, then do the sector verification.
- Fix a spelling error of get_rbio_vertical_errors()
The old name has a typo: "veritical".
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Unlike btrfs_calculate_block_csum_pages(), we cannot handle multiple
pages at the same time for P/Q generation.
So here we introduce a new @step_nr, and various helpers to grab the
sub-block page from the rbio, and generate the P/Q stripe page by page.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|