path: root/fs/btrfs
2025-11-25  btrfs: raid56: introduce a new parameter to locate a sector  (Qu Wenruo)

Since we cannot ensure that all bios from the higher layer are backed by
large folios (e.g. direct IO, encoded read/write/send), we need the
ability to locate a sub-block (aka, a page) inside a full stripe. The
existing @stripe_nr + @sector_nr combination is not enough to locate
such a page for bs > ps cases.

Introduce a new parameter, @step_nr, to locate the page of a larger fs
block. The naming follows the conventions used elsewhere inside btrfs,
where one step is min(sectorsize, PAGE_SIZE).

This is still preparation work, only touching the following aspects:

- btrfs_dump_rbio()
  To show the new @sector_nsteps member.

- btrfs_raid_bio::sector_nsteps
  Records how many steps there are inside a fs block.

- Enlarge the btrfs_raid_bio::*_paddrs[] size
  To take @sector_nsteps into consideration.

- index_one_bio()
- index_stripe_sectors()
- memcpy_from_bio_to_stripe()
- cache_rbio_pages()
- need_read_stripe_sectors()
  These functions iterate *_paddrs[], which needs to take sector_nsteps
  into consideration.

- Rename rbio_stripe_sector_index() to rbio_sector_index()
  The "stripe" part is not that helpful. Also add an extra ASSERT()
  before returning the result.

- Add a new rbio_paddr_index() helper
  This takes the extra @step_nr into consideration (see the sketch
  below).

- The comments of btrfs_raid_bio

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
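As a reference for the indexing described above, here is a minimal
sketch of the math, assuming the paddr arrays are laid out as
[stripe][sector][step]; the actual layout in the patch may differ:

    /*
     * One step is min(sectorsize, PAGE_SIZE), so a fs block spans
     * sector_nsteps steps when bs > ps (and exactly one otherwise).
     */
    static inline unsigned int rbio_paddr_index(unsigned int stripe_nsectors,
                                                unsigned int sector_nsteps,
                                                unsigned int stripe_nr,
                                                unsigned int sector_nr,
                                                unsigned int step_nr)
    {
            return (stripe_nr * stripe_nsectors + sector_nr) *
                   sector_nsteps + step_nr;
    }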
2025-11-25  btrfs: raid56: add an overview for the btrfs_raid_bio structure  (Qu Wenruo)

The structure needs to track both the pages from the higher layer bio
and internal pages, thus it can be a little complex to grasp. Add an
overview of the structure, especially how we track the different pages
from higher layer bios and internal ones, to save some time for future
developers.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: scrub: always update btrfs_scrub_progress::last_physical  (Qu Wenruo)

[BUG]
When a scrub fails immediately without any byte scrubbed, the returned
btrfs_scrub_progress::last_physical will always be 0, even if a
non-zero @start was passed into btrfs_scrub_dev() for resume cases.
This resets the progress and makes a later scrub resume start from the
beginning.

[CAUSE]
The function btrfs_scrub_dev() accepts a @progress parameter to copy
its updated progress to the caller. There are cases where we either
don't touch progress::last_physical at all or copy 0 into it:

- last_physical not updated at all
  If some error happened before scrubbing any super block or chunk, we
  will not copy the progress, leaving @last_physical untouched.
  E.g. failing to allocate @sctx, scrubbing a missing device, or there
  is already a running scrub, and so on. All those cases won't touch
  @progress at all, leaving last_physical as 0 for most cases.

- Error out before scrubbing any bytes
  In those cases we allocated @sctx, and sctx->stat.last_physical is
  all zero (initialized by kvzalloc()). Unfortunately some critical
  error happened during scrub_enumerate_chunks() or scrub_supers()
  before any stripe was really scrubbed. In that case, although we copy
  sctx->stat back to @progress, since no byte was really scrubbed,
  last_physical will be overwritten to 0.

[FIX]
Make sure the @progress parameter always has its @last_physical member
updated to the @start parameter inside btrfs_scrub_dev().

At the very beginning of the function, set @progress->last_physical to
@start, so that even if we error out without doing the progress
copying, last_physical is still at @start.

Then, after @sctx is allocated, set sctx->stat.last_physical to @start.
This makes sure that even if we didn't get any byte scrubbed, at the
progress copying stage @last_physical is not left as zero.

This resolves the resume progress reset problem.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: place all boolean fields together in struct find_free_extent_ctl  (Filipe Manana)

Move the 'retry_uncached' and 'hint' fields close to the other boolean
fields so that we remove a hole from the structure and reduce its size
from 136 bytes down to 128 bytes. Currently this structure is only
allocated on the stack of btrfs_reserve_extent().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: use booleans for delalloc arguments and struct find_free_extent_ctl  (Filipe Manana)

The struct find_free_extent_ctl uses an int for the 'delalloc' field
but it's always used as a boolean, and its value is passed to several
functions to signal if we are dealing with delalloc. The same goes for
the 'is_data' argument of btrfs_reserve_extent(). So change the type
from int to bool and move the field definition in the
find_free_extent_ctl structure so that it's close to the other bool
fields, reducing the size of the structure from 144 down to 136 bytes
(at the moment it's only declared on the stack of
btrfs_reserve_extent(), never allocated otherwise).

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: use bool type for btrfs_path members used as booleans  (Filipe Manana)

Many fields of struct btrfs_path are used as booleans but their type is
an unsigned int (of 1 bit width, to save space). Change the type to
bool, keeping the :1 suffix so that they combine with the previous u8
fields in order to save space. This makes the code clearer by using
explicit true/false, more in line with the preferred style, and
preserves the size of the structure.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
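For illustration, the pattern looks like this (an excerpt with field
names taken from btrfs_path, layout abridged); a bool bitfield still
packs with the neighboring u8 member, so the structure size is
preserved:

    struct btrfs_path_excerpt {
            u8 lowest_level;
            /* was: unsigned int search_for_split:1; and so on */
            bool search_for_split:1;
            bool keep_locks:1;
            bool skip_locking:1;
    };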
2025-11-24  btrfs: update check_skip variable after unlocking current node  (Filipe Manana)

There's no need to update the local variable 'check_skip' to false
inside the critical section delimited by the lock of the current node,
so do it after unlocking the node.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: abort transaction on item count overflow in __push_leaf_left()  (Filipe Manana)

If we try to push an item count from the right leaf that is greater
than the number of items in the leaf, we just emit a warning. This
should never happen, but if it does we get an underflow in the new
number of items in the right leaf and chaos follows from it. So replace
the warning with proper error handling, by aborting the transaction and
returning -EUCLEAN, and with proper logging by using btrfs_crit()
instead of WARN(), which gives us proper formatting and information
about the filesystem. The resulting pattern is sketched below.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
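A sketch of the resulting pattern; the message text and variable names
here are illustrative, not the patch verbatim:

    if (unlikely(push_items > right_nritems)) {
            btrfs_crit(fs_info,
                       "invalid item count %d to push, right leaf has only %d items",
                       push_items, right_nritems);
            btrfs_abort_transaction(trans, -EUCLEAN);
            return -EUCLEAN;
    }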
2025-11-24  btrfs: always use right leaf variable in __push_leaf_left()  (Filipe Manana)

The 'right' variable points to path->nodes[0] and path->nodes[0] is
never changed, but some places use 'right' while others refer to
path->nodes[0]. Update all sites to use 'right', as it's not only
shorter but also easier to reason about, since it means the right leaf
and avoids any confusion with the sibling left leaf.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: remove duplicated leaf dirty status clearing in __push_leaf_right()  (Filipe Manana)

We have already called btrfs_clear_buffer_dirty() against the left leaf
in the code above:

    btrfs_set_header_nritems(left, left_nritems);
    if (left_nritems)
            btrfs_mark_buffer_dirty(trans, left);
    else
            btrfs_clear_buffer_dirty(trans, left);

So remove the second check for a 0 number of items in the left leaf and
the second call to btrfs_clear_buffer_dirty() against the left leaf.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: always use left leaf variable in __push_leaf_right()  (Filipe Manana)

The 'left' variable points to path->nodes[0] and path->nodes[0] is
never changed, but some places use 'left' while others refer to
path->nodes[0]. Update all sites to use 'left', as it's not only
shorter but also easier to reason about, since it means the left leaf
and avoids any confusion with the sibling right leaf.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: add unlikely to critical error in btrfs_extend_item()  (Filipe Manana)

It's not expected to get a data size greater than the leaf's free
space, which would lead to a leaf dump and BUG(), so tag the if
statement's expression as unlikely, hinting the compiler to potentially
generate better code.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: remove pointless return value update in btrfs_del_items()  (Filipe Manana)

The call to btrfs_del_leaf() can only return an error (negative value)
or zero (success). If we didn't get an error then 'ret' is zero, so
it's pointless to set it to zero again.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: fix leaf leak in an error path in btrfs_del_items()  (Filipe Manana)

If the call to btrfs_del_leaf() fails we return without decrementing
the extra ref we took on the leaf, therefore leaking it. Fix this by
ensuring we drop the ref count before returning the error, as sketched
below.

Fixes: 751a27615dda ("btrfs: do not BUG_ON() on tree mod log failures at btrfs_del_ptr()")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
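A sketch of the fixed error path, with the shape assumed from the
description:

    ret = btrfs_del_leaf(trans, root, path, leaf);
    if (ret < 0) {
            /* Drop the extra reference taken earlier; previously leaked. */
            free_extent_buffer(leaf);
            return ret;
    }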
2025-11-24  btrfs: fix incomplete parameter rename in btrfs_decompress()  (Zhen Ni)

Commit 2c25716dcc25 ("btrfs: zlib: fix and simplify the inline extent
decompression") renamed the 'start_byte' parameter to 'dest_pgoff' in
btrfs_decompress(). The remaining 'start_byte' references are
inconsistent with the actual implementation and may cause confusion for
developers. Ensure consistency between the function declaration and
implementation.

Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: make a few more ASSERTs verbose  (David Sterba)

We have support for an optional string to be printed in ASSERT() (added
in 19468a623a9109 ("btrfs: enhance ASSERT() to take optional format
string")), but it's not yet used everywhere it could be, so add it in a
few more files.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: enable encoded read/write/send for bs > ps cases  (Qu Wenruo)

Since read verification and read repair now all support bs > ps without
large folios, we can enable encoded read/write/send.

Now we can relax the alignment in assert_bbio_alignment() to
min(blocksize, PAGE_SIZE), but also add an extra blocksize based
alignment check for the logical address and length of the bbio.

There is a pitfall in btrfs_add_compress_bio_folios(), which relies on
the folios passed in meeting the minimal folio order. But now we can
pass regular page sized folios in, so update it to check each folio's
size instead of using the minimal folio size. This allows
btrfs_add_compress_bio_folios() to even handle folio arrays with mixed
sizes; thankfully we don't yet need to handle such a crazy situation.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: make read verification handle bs > ps cases without large folios  (Qu Wenruo)

The current read verification also relies on large folios to support
bs > ps cases, but that introduces quite some limits. To enhance
read-repair to support bs > ps without large folios:

- Make btrfs_data_csum_ok() accept an array of paddrs
  This can pass the paddrs[] directly into
  btrfs_calculate_block_csum_pages().

- Make repair_one_sector() accept an array of paddrs
  So that it can submit a repair bio backed by regular pages, not only
  large folios. This requires us to allocate more slots at bio
  allocation time though.
  Also, since the caller may have only partially advanced the
  saved_iter for bs > ps cases, we cannot directly trust the logical
  bytenr from saved_iter (it can be unaligned), thus a manual round
  down is necessary for the logical bytenr.

- Make btrfs_check_read_bio() build an array of paddrs
  The tricky part is that we can only call btrfs_data_csum_ok() after
  all involved pages are assembled. This means that at the call time of
  btrfs_check_read_bio(), our offset inside the bio is already at the
  end of the fs block. Thus we must re-calculate @bio_offset for
  btrfs_data_csum_ok() and repair_one_sector().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: make btrfs_repair_io_failure() handle bs > ps cases without large folios  (Qu Wenruo)

Currently btrfs_repair_io_failure() only accepts a single @paddr
parameter, and for bs > ps cases it's required that @paddr is backed by
a large folio. That assumption has quite some limitations, preventing
us from utilizing true zero-copy direct IO and encoded read/writes.

To address the problem, enhance btrfs_repair_io_failure() by:

- Accept an array of paddrs, up to 64K / PAGE_SIZE entries
  This acts a little like a bio_vec, but with very limited entries, as
  the function is only utilized to repair one fs data block or one tree
  block. Both have an upper size limit (BTRFS_MAX_BLOCK_SIZE, i.e.
  64K), so we don't need the full bio_vec machinery to handle it.

- Allocate a bio with multiple slots
  Previously, even for bs > ps cases we only passed in a contiguous
  physical address range, so a single slot was enough. Not anymore, so
  we have to allocate a bio structure rather than using the on-stack
  one.

- Use on-stack memory for the @paddrs array
  It's at most 16 pages (4K page size, 64K block size), taking up at
  most 128 bytes. The on-stack cost is still acceptable (see the
  declaration sketched below).

- Add one extra check to make sure the repair bio is exactly one block

- Utilize btrfs_repair_io_failure() to submit a single bio for metadata
  This should improve the read-repair performance for metadata, as now
  we submit a node sized bio and then wait, rather than submitting each
  block of the metadata and waiting for each submitted block.

- Add one extra parameter indicating the step
  This is due to the fact that a metadata step can be as large as
  nodesize, instead of sectorsize, so we need a way to distinguish
  metadata from data repair.

- Reduce the width of the @length parameter of btrfs_repair_io_failure()
  Since we only call btrfs_repair_io_failure() on a single data or
  metadata block, u64 is overkill. Use u32 instead and add an extra
  ASSERT() to make sure the length never exceeds BTRFS_MAX_BLOCK_SIZE.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
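The on-stack array mentioned above could look like this (a sketch; with
4K pages and the 64K maximum block size that's 16 entries of 8 bytes
each, matching the 128 bytes quoted in the message):

    /* One entry per page of a single data or metadata block. */
    phys_addr_t paddrs[BTRFS_MAX_BLOCK_SIZE / PAGE_SIZE];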
2025-11-24  btrfs: make btrfs_csum_one_bio() handle bs > ps without large folios  (Qu Wenruo)

For bs > ps cases, all folios passed into btrfs_csum_one_bio() are
ensured to be backed by large folios. But that requirement excludes
features like direct IO and encoded writes.

To support bs > ps without large folios, enhance btrfs_csum_one_bio()
by:

- Split btrfs_calculate_block_csum() into two versions

  * btrfs_calculate_block_csum_folio()
    For call sites where a fs block is always backed by a large folio.
    This does extra checks on the folio size, builds a paddrs[] array,
    and passes it into the newer btrfs_calculate_block_csum_pages()
    helper. For now btrfs_check_block_csum() is still using this
    version.

  * btrfs_calculate_block_csum_pages()
    For call sites that may hit a fs block backed by noncontiguous
    pages. The pages are represented by a paddrs[] array, which
    includes the offset inside each page. This function does the proper
    sub-block handling.

- Make btrfs_csum_one_bio() use btrfs_calculate_block_csum_pages()
  This means we need to build a local paddrs[] array and, after filling
  a fs block, do the checksum calculation.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: move struct reserve_ticket definition to space-info.c  (Filipe Manana)

It's not used anywhere outside space-info.c, so move it from
space-info.h into space-info.c.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: move and rename CSUM_FMT definition  (David Sterba)

Move the CSUM_FMT* definitions to fs.h, where BTRFS_KEY_FMT is, and add
the prefix for consistency.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: tests: do trivial BTRFS_PATH_AUTO_FREE conversions  (Sun YangKai)

Trivial pattern for the auto freeing where there are no operations
between btrfs_free_path() and the function return.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
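The trivial pattern being converted looks roughly like this (a sketch):

    /* Before */
    struct btrfs_path *path;

    path = btrfs_alloc_path();
    if (!path)
            return -ENOMEM;
    /* ... search and use the path ... */
    btrfs_free_path(path);
    return ret;

    /* After: the path is freed automatically when it goes out of scope */
    BTRFS_PATH_AUTO_FREE(path);

    path = btrfs_alloc_path();
    if (!path)
            return -ENOMEM;
    /* ... search and use the path ... */
    return ret;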
2025-11-24  btrfs: raid56: remove sector_ptr structure  (Qu Wenruo)

Since the sector_ptr structure now only contains a single paddr, there
is no need to use that structure. Instead use a phys_addr_t array for
the bio and stripe pointers. This means several helpers also need to
accept a paddr instead of a sector_ptr pointer.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: raid56: move sector_ptr::uptodate into a dedicated bitmap  (Qu Wenruo)

The uptodate boolean member can be extracted into a bitmap, which saves
us some space (1 bit per sector instead of a full byte). Furthermore we
do not need to record an uptodate bitmap for bio sectors: if
bio_sectors[].paddr is valid it means there is a bio and the sector
will be uptodate.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: raid56: remove sector_ptr::has_paddr member  (Qu Wenruo)

We can use paddr -1 as an indicator for an unset/uninitialized paddr.
We cannot use paddr 0: unlike virtual address 0, which is never mapped
and will always trigger a page fault, physical address 0 may be a valid
page. So here we follow swiotlb and use (paddr)-1 as a special
indicator for an invalid/unset physical address. Even if that PFN may
still be valid, our usage of the physical address should always be
aligned to the fs block size (or the page size for bs > ps cases), thus
such a -1 paddr should never be a valid one.

With this special -1 paddr, we can get rid of the has_paddr member and
save 1 byte in the sector_ptr structure.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
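A plausible shape for the sentinel, following the swiotlb convention
mentioned above (the names here are illustrative):

    #define INVALID_PADDR   ((phys_addr_t)-1)

    static inline bool paddr_is_set(phys_addr_t paddr)
    {
            /* Valid paddrs are block (or page) aligned, so -1 can't collide. */
            return paddr != INVALID_PADDR;
    }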
2025-11-24  btrfs: simplify list initialization in btrfs_compr_pool_scan()  (Baolin Liu)

In btrfs_compr_pool_scan(), use LIST_HEAD() to declare and initialize
the 'remove' list_head in one step instead of using INIT_LIST_HEAD()
separately.

Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
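The conversion is the classic one-liner:

    /* Before */
    struct list_head remove;

    INIT_LIST_HEAD(&remove);

    /* After: declares and initializes in one step */
    LIST_HEAD(remove);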
2025-11-24  btrfs: scrub: factor out parity scrub code into a helper  (Qu Wenruo)

The function scrub_raid56_parity_stripe() handles the parity stripe in
the following steps:

- Scrub each data stripe
  And make sure everything is fine in each data stripe.

- Cache the data stripes into the raid bio

- Use the cached raid bio to scrub the target parity stripe

Extract the last two steps into a new helper,
scrub_raid56_cached_parity(), as a cleanup and to make the error
handling more straightforward, with the following minor cleanups:

- Use an on-stack bio structure
  The bio is always empty, thus we do not need any bio vector nor the
  block device. So there is no need to allocate a bio, the on-stack one
  is more than enough.

- Remove the unnecessary btrfs_put_bioc() call if btrfs_map_block() failed
  If btrfs_map_block() failed, @bioc_ret will not be touched, thus
  there is no need to call btrfs_put_bioc() in this case.

- Use a proper out: tag to do the cleanup
  Now the error cleanup is much shorter and simpler, just
  btrfs_bio_counter_dec() and bio_uninit().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: make sure extent and csum paths are always released in scrub_raid56_parity_stripe()  (Qu Wenruo)

Unlike queue_scrub_stripe(), which uses the global sctx->extent_path
and sctx->csum_path that are always released at the end of
scrub_stripe(), scrub_raid56_parity_stripe() uses a local extent_path
and csum_path, as that function is going to handle the full stripe,
whose bytenr may be smaller than the bytenr in the global sctx paths.

However the cleanup of the local extent/csum paths only happens after
we have successfully submitted an rbio. There are several error routes
where we don't release those two paths:

- scrub_find_fill_first_stripe() errored out at the csum tree search
  In that case extent_path is still valid, and that function itself
  will not release the extent_path passed in.

- The full stripe is empty

- Some blocks failed to be recovered

- btrfs_map_block() failed

- raid56_parity_alloc_scrub_rbio() failed

In all those cases the function returns directly without releasing both
paths. Fix it by moving the btrfs_release_path() calls under the out:
tag.

This is just a hot fix; in the long run we will switch to scope-based
auto freeing for both local paths.

Fixes: 1dc4888e725d ("btrfs: scrub: avoid unnecessary extent tree search preparing stripes")
Fixes: 3c771c194402 ("btrfs: scrub: avoid unnecessary csum tree search preparing stripes")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: use kvcalloc for btrfs_bio::csum allocation  (Qu Wenruo)

[BUG]
There is a report that the memory allocation for btrfs_bio::csum failed
during a large read:

    b2sum: page allocation failure: order:4, mode:0x40c40(GFP_NOFS|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
    CPU: 0 UID: 0 PID: 416120 Comm: b2sum Tainted: G W 6.17.0 #1 NONE
    Tainted: [W]=WARN
    Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT)
    Call trace:
     show_stack+0x18/0x30 (C)
     dump_stack_lvl+0x5c/0x7c
     dump_stack+0x18/0x24
     warn_alloc+0xec/0x184
     __alloc_pages_slowpath.constprop.0+0x21c/0x730
     __alloc_frozen_pages_noprof+0x230/0x260
     ___kmalloc_large_node+0xd4/0xf0
     __kmalloc_noprof+0x1c8/0x260
     btrfs_lookup_bio_sums+0x214/0x278
     btrfs_submit_chunk+0xf0/0x3c0
     btrfs_submit_bbio+0x2c/0x4c
     submit_one_bio+0x50/0xac
     submit_extent_folio+0x13c/0x340
     btrfs_do_readpage+0x4b0/0x7a0
     btrfs_readahead+0x184/0x254
     read_pages+0x58/0x260
     page_cache_ra_unbounded+0x170/0x24c
     page_cache_ra_order+0x360/0x3bc
     page_cache_async_ra+0x1a4/0x1d4
     filemap_readahead.isra.0+0x44/0x74
     filemap_get_pages+0x2b4/0x3b4
     filemap_read+0xc4/0x3bc
     btrfs_file_read_iter+0x70/0x7c
     vfs_read+0x1ec/0x2c0
     ksys_read+0x4c/0xe0
     __arm64_sys_read+0x18/0x24
     el0_svc_common.constprop.0+0x5c/0x130
     do_el0_svc+0x1c/0x30
     el0_svc+0x30/0xa0
     el0t_64_sync_handler+0xa0/0xe4
     el0t_64_sync+0x198/0x19c

[CAUSE]
Btrfs needs to allocate memory for btrfs_bio::csum for large reads, so
that we can later verify the contents of the read. However, nowadays a
read bio can easily go beyond BIO_MAX_VECS * PAGE_SIZE (which is 1M for
4K page size), due to multi-page bvecs: one bvec can cover more than
one page, as long as the pages are physically adjacent. This will
become more common when large folio support is moved out of the
experimental features.

In the above case, a read larger than 4MiB with SHA256 checksums (32
bytes for each 4K block) is able to trigger an order 4 allocation.
Order 4 is larger than PAGE_ALLOC_COSTLY_ORDER (3), thus without extra
flags such an allocation will not retry. And if the system has a very
small amount of memory (e.g. an RPI4 with the low memory spec, or a VM
with little vRAM), or the memory is heavily fragmented, such an
allocation will fail and cause the above warning.

[FIX]
Although btrfs handles the memory allocation failure correctly, we do
not really need physically contiguous memory just to store our
checksums. In fact btrfs_csum_one_bio() is already using kvzalloc() to
reduce the memory pressure, so follow that and use kvcalloc() for
btrfs_bio::csum.

Reported-by: Calvin Owens <calvin@wbinvd.org>
Link: https://lore.kernel.org/linux-btrfs/20251105180054.511528-1-calvin@wbinvd.org/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
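The fix boils down to swapping the allocator; a sketch, since the exact
call site in btrfs_lookup_bio_sums() may differ slightly:

    /* Before: requires physically contiguous memory; order-4 for reads
     * beyond ~4MiB with a 32-byte csum such as SHA256. */
    bbio->csum = kmalloc_array(nblocks, fs_info->csum_size, GFP_NOFS);

    /* After: transparently falls back to vmalloc-backed memory when
     * contiguous pages are unavailable (the matching free becomes
     * kvfree()). */
    bbio->csum = kvcalloc(nblocks, fs_info->csum_size, GFP_NOFS);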
2025-11-24  btrfs: don't generate any code from ASSERT() in release builds  (Gladyshev Ilya)

The current definition of ASSERT(cond) as (void)(cond) is redundant,
since these checks have no side effects and don't affect code logic.
However, some checks contain READ_ONCE() or other compiler-unfriendly
constructs. For example, the ASSERT(list_empty()) in
btrfs_add_delalloc_inode() was compiled to a redundant mov instruction
due to this issue.

Define ASSERT as BUILD_BUG_ON_INVALID for !CONFIG_BTRFS_ASSERT builds,
which uses the sizeof(cond) trick. Also mark
full_page_sectors_uptodate() as __maybe_unused to suppress an "unneeded
declaration" warning (it's still needed at compile time).

Signed-off-by: Gladyshev Ilya <foxido@foxido.dev>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
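The release-build definition described above looks roughly like this
(the variadic form is assumed, to match the format-string-capable
ASSERT()):

    #ifdef CONFIG_BTRFS_ASSERT
    /* ... runtime-checking ASSERT() as before ... */
    #else
    /*
     * Generates no code but keeps 'cond' compile-checked:
     * BUILD_BUG_ON_INVALID(e) only evaluates sizeof() of the expression.
     */
    #define ASSERT(cond, args...)   BUILD_BUG_ON_INVALID(cond)
    #endif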
2025-11-24  btrfs: introduce btrfs_bio::async_csum  (Qu Wenruo)

[ENHANCEMENT]
Btrfs currently calculates data checksums and then submits the bio. But
after commit 968f19c5b1b7 ("btrfs: always fallback to buffered write if
the inode requires checksum"), any write with data checksums falls back
to buffered IO, meaning the content will not change during writeback.

This means we're safe to calculate the data checksum and submit the bio
in parallel, and only need the following new behavior:

- Wait for the csum generation to finish before calling
  btrfs_bio::end_io()
  Otherwise this can lead to a use-after-free in the csum generation
  worker.

- Save the current bi_iter for csum_one_bio()
  The submission part can advance btrfs_bio::bio.bi_iter; if not saved,
  csum_one_bio() may get an empty bi_iter and not generate any
  checksums.

Unfortunately this means we have to increase the size of btrfs_bio by
16 bytes, but this is still acceptable. As usual, the new feature is
hidden behind the experimental flag.

[THEORETICAL ANALYSIS]
Consider the following theoretical hardware performance figures, which
should be more or less close to modern mainstream hardware:

    Memory bandwidth:  50 GiB/s
    CRC32C bandwidth:  45 GiB/s
    SSD bandwidth:      8 GiB/s

Then the write bandwidth with data checksums before the patch is:

    1 / (1/50 + 1/45 + 1/8) = 5.98 GiB/s

After the patch, the bandwidth is:

    1 / (1/50 + max(1/45, 1/8)) = 6.90 GiB/s

The difference is a 15.32% improvement.

[REAL WORLD BENCHMARK]
I'm using a Zen5 (HX 370) as the host; the VM has 4GiB memory, 10
vCPUs, and the storage is backed by a PCIe gen3 x4 NVMe. The test is a
direct IO write with 1MiB block size, writing 7GiB of data into a btrfs
mount with data checksums, so the direct write falls back to a buffered
one:

    Vanilla datasum: 1619.97 MiB/s
    Patched datasum: 1792.26 MiB/s
    Diff:            +10.6 %

In my case the bottleneck is the storage, thus the improvement does not
reach the theoretical one, but it is still observable.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: relax btrfs_inode::ordered_tree_lock IRQ locking context  (Qu Wenruo)

We used the IRQ version of the spinlock for ordered_tree_lock, as
btrfs_finish_ordered_extent() can be called in end_bbio_data_write(),
which used to run in IRQ context. However, since we're moving all
btrfs_bio::end_io() calls into task context, there is no longer a need
to support IRQ context, thus we can relax to regular
spin_lock()/spin_unlock() for btrfs_inode::ordered_tree_lock.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: remove btrfs_fs_info::compressed_write_workers  (Qu Wenruo)

The reason why end_bbio_compressed_write() queues a work item onto the
compressed_write_workers wq is the end_compressed_writeback() call, as
it grabs all the involved folios and clears the writeback flags, which
may sleep. However, now that we always run btrfs_bio::end_io() in task
context, there is no need to queue the work anymore. Just remove
btrfs_fs_info::compressed_write_workers and
compressed_bio::write_end_work.

There was a comment about the works queued onto
compressed_write_workers; change it to refer to flushing the endio wq
instead, which is responsible for handling all data endio functions.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: make sure all btrfs_bio::end_io are called in task context  (Qu Wenruo)

[BACKGROUND]
Btrfs has a lot of different bi_end_io functions to handle different
raid profiles, but they introduce a lot of different contexts for the
btrfs_bio::end_io() calls:

- Simple read bios
  Run in task context, backed by either endio_meta_workers or
  endio_workers.

- Simple write bios
  Run in IRQ context.

- RAID56 write or rebuild bios
  Run in task context, backed by rmw_workers.

- Mirrored write bios
  Run in IRQ context.

This is inconsistent, and contributes to the number of workqueues used
in btrfs.

[ENHANCEMENT]
Make all the above bios call their btrfs_bio::end_io() in task context,
backed by either endio_meta_workers for metadata or endio_workers for
data.

For simple write bios, merge the handling into simple_end_io_work()
(sketched below).

For mirrored write bios it's a little more complex, since either the
original or the cloned bio can run the final btrfs_bio::end_io(). Here
we make sure the cloned bios are using btrfs_bioset, to reuse
end_io_work, and run both the original and cloned work inside the
workqueue.

Add extra ASSERT()s to make sure btrfs_bio_end_io() runs in task
context.

This not only unifies the context of the btrfs_bio::end_io() functions,
but also opens a new door for further btrfs_bio::end_io() related
cleanups.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
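A sketch of the deferral pattern for the simple write case; the names
follow the commit text where given, the rest (fs_info lookup, exact
completion call) is assumed:

    static void simple_end_io_work(struct work_struct *work)
    {
            struct btrfs_bio *bbio = container_of(work, struct btrfs_bio,
                                                  end_io_work);

            /* Task context: safe to run the real completion. */
            btrfs_bio_end_io(bbio, bbio->bio.bi_status);
    }

    static void simple_end_io(struct bio *bio)
    {
            struct btrfs_bio *bbio = btrfs_bio(bio);
            struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;

            /* May run in IRQ context: only hand off to a workqueue here. */
            INIT_WORK(&bbio->end_io_work, simple_end_io_work);
            queue_work(fs_info->endio_workers, &bbio->end_io_work);
    }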
2025-11-24  btrfs: remove btrfs_bio::fs_info by extracting it from btrfs_bio::inode  (Qu Wenruo)

Currently there is only one caller which doesn't populate
btrfs_bio::inode, and that's scrub. The idea is that scrub doesn't want
any automatic csum verification nor read-repair, as everything will be
handled by scrub itself. However, that behavior is really no different
from the metadata inode, thus we can reuse btree_inode as
btrfs_bio::inode for scrub.

The only exception is in btrfs_submit_chunk(), where if a bbio is from
scrub or the data reloc inode, we set rst_search_commit_root to true.
This means we still need a way to distinguish scrub from metadata, but
that can be done by a new flag inside btrfs_bio.

Now that btrfs_bio::inode is a mandatory parameter, we can extract
fs_info from that inode and thus remove btrfs_bio::fs_info, saving 8
bytes in the btrfs_bio structure.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: headers cleanup to remove unnecessary local includes  (Qu Wenruo)

[BUG]
When I tried to remove btrfs_bio::fs_info and use btrfs_bio::inode to
grab the fs_info, the header "btrfs_inode.h" was needed to access the
full btrfs_inode structure. Then btrfs failed to compile.

[CAUSE]
There is a recursive include chain:

    "bio.h" -> "btrfs_inode.h" -> "extent_map.h" -> "compression.h" -> "bio.h"

That recursive inclusion is causing problems for btrfs.

[ENHANCEMENT]
To reduce the risk of recursive includes:

- Remove unnecessary local includes from btrfs headers
  Either the included header is pulled in by other headers, or it is
  completely unnecessary.

- Remove btrfs local includes if the header only requires a pointer
  In that case let the implementing C file pull in the required header.
  This is especially important for headers like "btrfs_inode.h", which
  pulls in a lot of other btrfs headers and is thus a minefield for
  recursive includes.

- Remove unnecessary temporary structure definitions
  Either we have already included the header defining the structure, or
  the definition is completely unused.

Now including "btrfs_inode.h" inside "bio.h" is completely fine.
Although "btrfs_inode.h" still includes "extent_map.h", that header
only includes "fs.h", with no chain back to "bio.h".

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: replace BTRFS_MAX_BIO_SECTORS with BIO_MAX_VECS  (Qu Wenruo)

It's impossible to have a btrfs bio with more than BIO_MAX_VECS vectors
anyway. And there is only one location utilizing that macro, so just
replace it with BIO_MAX_VECS. Both have the same value.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: replace const_ilog2() with ilog2()  (Andy Shevchenko)

const_ilog2() was a workaround for a sparse issue, which has never
appeared in the C functions. Replace it with ilog2().

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: zoned: show statistics for zoned filesystems  (Johannes Thumshirn)

Provide statistics for zoned filesystems. These statistics include the
number of active block-groups, how many of them are reclaimable or
unused, whether the filesystem needs to be reclaimed, the currently
assigned relocation and treelog block-groups if they're present, and a
list of active zones.

Example:

    active block-groups: 4
      reclaimable: 0
      unused: 2
      need reclaim: false
    data relocation block-group: 4294967296
    active zones:
      start: 1610612736, wp: 344064 used: 16384, reserved: 0, unusable: 327680
      start: 1879048192, wp: 34963456 used: 131072, reserved: 0, unusable: 34832384
      start: 4026531840, wp: 0 used: 0, reserved: 0, unusable: 0
      start: 4294967296, wp: 0 used: 0, reserved: 0, unusable: 0

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: add ASSERTs on prealloc in qgroup functions  (Miquel Sabaté Solà)

The prealloc variable in these functions is always initialized to NULL.
Whenever we allocate memory for it, if it fails then NULL is preserved,
otherwise we delegate the ownership of the pointer to add_qgroup_rb()
and set it right after to NULL. Since in any case the pointer ends up
being NULL at the end of its usage, we can safely remove calls to
kfree() for it, while adding an ASSERT as an extra check.

Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: apply the AUTO_K(V)FREE macros throughout the code  (Miquel Sabaté Solà)

Apply the AUTO_KFREE and AUTO_KVFREE macros wherever it makes sense.
Since these macros are expected to improve code readability, they have
been avoided in places where the lifetime of objects wasn't easy to
follow and a cleanup attribute would've made things worse, or where the
cleanup section of a function involved many other things and thus there
was no readability impact anyway. The change has also not been applied
in extremely short functions where readability was clearly not an
issue.

Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: define the AUTO_KFREE/AUTO_KVFREE helper macros  (Miquel Sabaté Solà)

These are two simple macros which ensure that a pointer is initialized
to NULL and carries the proper cleanup attribute (see the sketch
below).

Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
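A minimal sketch of what such a macro boils down to, built on the
__free() helper from <linux/cleanup.h>; the exact macro spelling in the
patch is an assumption:

    /* Hypothetical form: declare a pointer that is kfree()d on scope exit. */
    #define AUTO_KFREE(type, name)  type name __free(kfree) = NULL

    /* usage */
    AUTO_KFREE(char *, buf);

    buf = kmalloc(SZ_4K, GFP_KERNEL);
    if (!buf)
            return -ENOMEM;
    /* no explicit kfree(buf) needed on any return path */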
2025-11-24  btrfs: declare free_ipath() via DEFINE_FREE()  (Miquel Sabaté Solà)

The free_ipath() function was being used as a cleanup function
everywhere. Declare it via DEFINE_FREE() so we can use this function
with the __free() helper. The name has also been adjusted so it's
closer to the type's name.

Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
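A plausible shape of the declaration and its usage; the chosen cleanup
name is assumed to mirror the inode_fs_paths type, per the message
above:

    DEFINE_FREE(inode_fs_paths, struct inode_fs_paths *,
                if (!IS_ERR_OR_NULL(_T)) free_ipath(_T))

    /* usage: ipath is freed automatically when it goes out of scope */
    struct inode_fs_paths *ipath __free(inode_fs_paths) =
                    init_ipath(4096, fs_root, path);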
2025-11-24  btrfs: scrub: cancel the run if there is a pending signal  (Qu Wenruo)

Unlike relocation, scrub never checks for pending signals, and even
relocation only explicitly checks for a fatal signal (SIGKILL), not for
regular ones. Thankfully relocation can still be interrupted by regular
signals through its usage of wait_on_bit(), which is called with
TASK_INTERRUPTIBLE.

Do the same for scrub/dev-replace, so that regular signals can also
cancel the scrub/replace run, and more importantly so that we handle
v2 cgroup freezing, which is based on the signal handling code inside
the kernel; the freezing() function will not return true for v2 cgroup
freezing.

This addresses the problem of systemd slice freezing timing out on
long-running scrub/dev-replace.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: scrub: cancel the run if the process or fs is being frozen  (Qu Wenruo)

It's a known bug that btrfs scrub/dev-replace can prevent the system
from suspending. There are at least two factors involved:

- Holding super_block::s_writers for the whole scrub/dev-replace duration
  We hold that percpu rw semaphore through mnt_want_write_file() for
  the whole scrub/dev-replace duration. That prevents the fs from being
  frozen, which can be initiated either by the user (e.g. fsfreeze) or
  by power management suspend/hibernate.

- Being stuck in kernel space for a long time
  During suspend, all user processes (and some kernel threads) are
  frozen. But if a user space process has fallen into the kernel (the
  scrub ioctl) and does not return for a long time, process freezing
  will time out. Unfortunately scrub/dev-replace is a long running
  ioctl, and it prevents the btrfs process from returning to user
  space, thus making PM suspend/hibernate time out.

Address them in one go:

- Introduce a new helper, should_cancel_scrub() (sketched below)
  It includes the existing cancel request check plus new fs/process
  freezing checks. Here we have to check both fs and process freezing
  for PM suspend/hibernate: PM can be configured to freeze filesystems
  before processes (the current default is not to freeze filesystems,
  but the plan is to make freezing the filesystems the new default).
  Checking only fs freezing will fail PM without fs freezing, as
  process freezing will time out. Checking only process freezing will
  fail PM with fs freezing, since fs freezing happens before process
  freezing.
  The return value indicates the reason: -ECANCELED for explicitly
  canceled runs, and -EINTR for fs freeze or PM reasons.

- Cancel the run if should_cancel_scrub() is true
  Unfortunately canceling is the only feasible solution here. Pausing
  is not possible, as we would still stay in kernel space and thus
  still prevent the process from being frozen.

This causes a user-visible behavior change: dev-replace can be
interrupted by PM, and there is no way to resume it other than starting
from the beginning again. This means dev-replace may fail on newer
kernels, and end users will need extra steps, like using
systemd-inhibit to prevent suspend/hibernate, to get back the old
uninterrupted behavior.

This behavior change will need extra documentation updates and
communication with projects involving scrub/dev-replace, including
btrfs-progs.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Link: https://lore.kernel.org/linux-btrfs/d93b2a2d-6ad9-4c49-809f-11d769a6f30a@app.fastmail.com/
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
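A hypothetical sketch of the helper combining the checks described
above; the exact fs-freeze test and field names are assumptions:

    static bool should_cancel_scrub(struct scrub_ctx *sctx, int *ret)
    {
            /* Explicit cancel request (the pre-existing check). */
            if (atomic_read(&sctx->cancel_req)) {
                    *ret = -ECANCELED;
                    return true;
            }
            /* Process freezing: PM suspend/hibernate or cgroup v2 freezer. */
            if (freezing(current)) {
                    *ret = -EINTR;
                    return true;
            }
            /* fs freezing check elided here; it must also yield -EINTR. */
            return false;
    }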
2025-11-24  btrfs: scrub: add cancel/pause/removed bg checks for raid56 parity stripes  (Qu Wenruo)

For raid56, data and parity stripes are handled differently. Data
stripes are handled just like regular RAID1/RAID10 stripes, going
through the regular scrub_simple_mirror(). But for parity stripes we
have to read out all involved data stripes, do any needed verification
and repair, then scrub the parity stripe.

This process takes much longer than a regular stripe, but unlike
scrub_simple_mirror(), we do not check whether we should cancel/pause
or whether the block group has already been removed.

Align the behavior of scrub_raid56_parity_stripe() with
scrub_simple_mirror() by adding:

- A cancel check
- A pause check
- A removed block group check

Since those checks are the same as in scrub_simple_mirror(), also
update the comments of scrub_simple_mirror() by:

- Removing too-obvious comments
  We do not need extra comments on what we're checking, it's really too
  obvious.

- Removing a stale comment about pausing
  Now that scrub always queues all involved stripes and submits them in
  one go, there is no more submission work during pausing.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: annotate as unlikely fs aborted checks in space flushing code  (Filipe Manana)

It's not expected to have the fs in an aborted state, so surround the
abort checks with unlikely to make it clear they're unexpected and to
hint the compiler to generate better code. Also, at
maybe_fail_all_tickets(), untangle all the repeated checks for the
aborted state into a single if-then-else. This makes things more
readable and makes the compiler generate less code.

On x86_64 with gcc 14.2.0-19 from Debian I got the following object
size differences.

Before this change:

    $ size fs/btrfs/btrfs.ko
       text     data     bss      dec     hex  filename
    2021606   179704   25088  2226398  21f8de  fs/btrfs/btrfs.ko

After this change:

    $ size fs/btrfs/btrfs.ko
       text     data     bss      dec     hex  filename
    2021458   179704   25088  2226250  21f84a  fs/btrfs/btrfs.ko

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24  btrfs: avoid space_info locking when checking if tickets are served  (Filipe Manana)

When checking if a ticket was served, we take the space_info's
spinlock. If the ticket was served (its ->bytes is 0) or had an error
(its ->error is not 0), then we just unlock the space_info and return.
This however causes contention on the space_info's spinlock, which is
heavily used (space reservation, space flushing, allocating and
deallocating an extent from a block group (btrfs_update_block_group()),
etc.).

Instead of using the space_info's spinlock to check if a ticket was
served, use a per-ticket spinlock which isn't used by anyone other than
the task that created the ticket (stack allocated) and the task that
serves the ticket (a reclaim task, or any task deallocating space that
ends up at btrfs_try_granting_tickets()). The check is sketched below.

After applying this patch and all previous patches from the same
patchset (many attempt to reduce space_info critical sections),
lockstat showed some improvements for a fs_mark test regarding the
space_info's spinlock 'lock'. The lockstat results:

Before patchset:

    con-bounces:    13733858
    contentions:    15902322
    waittime-total: 264902529.72
    acq-bounces:    28161791
    acquisitions:   38679282

After patchset:

    con-bounces:    12032220
    contentions:    13598034
    waittime-total: 221806127.28
    acq-bounces:    24717947
    acquisitions:   34103281

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
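A sketch of the lock shift; the per-ticket lock is new in this patch,
so its name here is assumed:

    static bool ticket_served(struct reserve_ticket *ticket)
    {
            bool served;

            /* Only the creator and the server touch this lock, so there
             * is no contention with the heavily used space_info->lock. */
            spin_lock(&ticket->lock);
            served = (ticket->bytes == 0 || ticket->error != 0);
            spin_unlock(&ticket->lock);

            return served;
    }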
2025-11-24  btrfs: move ticket wakeup and finalization to remove_ticket()  (Filipe Manana)

Instead of repeating the wakeup and setup of the ->bytes or ->error
field, move those steps to remove_ticket() to avoid duplication. This
is also needed for the next patch in the series, so that we avoid
duplicating more logic.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>