path: root/drivers/vfio
Age  Commit message  Author
2025-12-04  Merge tag 'drm-next-2025-12-05' of https://gitlab.freedesktop.org/drm/kernel  (Linus Torvalds)
Pull more drm updates from Dave Airlie:
 "There was some additional intel code for color operations we wanted to land. However I discovered I missed a pull for the xe vfio driver which I had sorted into 6.20 in my brain, until Thomas mentioned it.

  This contains the xe vfio code, a bunch of xe fixes that were waiting, and the i915 color management support. I'd like to include it as part of keeping the two main vendors on the same page and giving a good cross-driver experience for userspace when it starts using it.

  vfio:
   - add a vfio_pci variant driver for Intel xe/i915

  display:
   - add plane color management support

  xe:
   - Add scope-based cleanup helper for runtime PM
   - vfio xe driver prerequisites and exports
   - fix vfio link error
   - Fix a memory leak
   - Fix a 64-bit division
   - vf migration fix
   - LRC pause fix"

* tag 'drm-next-2025-12-05' of https://gitlab.freedesktop.org/drm/kernel: (25 commits)
  drm/i915/color: Enable Plane Color Pipelines
  drm/i915/color: Add 3D LUT to color pipeline
  drm/i915/color: Add registers for 3D LUT
  drm/i915/color: Program Plane Post CSC Registers
  drm/i915/color: Program Pre-CSC registers
  drm/i915/color: Add framework to program PRE/POST CSC LUT
  drm/i915: Add register definitions for Plane Post CSC
  drm/i915: Add register definitions for Plane Degamma
  drm/i915/color: Add plane CTM callback for D12 and beyond
  drm/i915/color: Preserve sign bit when int_bits is Zero
  drm/i915/color: Add framework to program CSC
  drm/i915/color: Create a transfer function color pipeline
  drm/i915/color: Add helper to create intel colorop
  drm/i915: Add intel_color_op
  drm/i915/display: Add identifiers for driver specific blocks
  drm/xe/pf: fix VFIO link error
  drm/xe: Protect against unset LRC when pausing submissions
  drm/xe/vf: Start re-emission from first unsignaled job during VF migration
  drm/xe/pf: Use div_u64 when calculating GGTT profile
  drm/xe: Fix memory leak when handling pagefault vma
  ...
2025-12-04  Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd  (Linus Torvalds)
Pull iommufd updates from Jason Gunthorpe:
 "This is a pretty consequential cycle for iommufd, though this pull is not too big. It is based on a shared branch with VFIO that introduces VFIO_DEVICE_FEATURE_DMA_BUF, a DMABUF exporter for a VFIO device's MMIO PCI BARs. This was a large multi-series journey over the last year and a half. Based on that work, IOMMUFD gains support for VFIO DMABUFs in its existing IOMMU_IOAS_MAP_FILE, which closes the last major gap to supporting PCI peer-to-peer transfers within VMs.

  In Joerg's iommu tree we have the "generic page table" work, which aims to consolidate all the duplicated page table code in every iommu driver into a single algorithm. This will be used by iommufd to implement unique page table operations to start adding new features and improve performance.

  In here:
   - Expand IOMMU_IOAS_MAP_FILE to accept a DMABUF exported from VFIO. This is the first step to broader DMABUF support in iommufd; right now it only works with VFIO. This closes the last functional gap with classic VFIO type 1 to safely support PCI peer-to-peer DMA by mapping the VFIO device's MMIO into the IOMMU.
   - Relax SMMUv3 restrictions on nesting domains to better support qemu's sequence to have an identity mapping before the vSID is established"

* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd:
  iommu/arm-smmu-v3-iommufd: Allow attaching nested domain for GBPA cases
  iommufd/selftest: Add some tests for the dmabuf flow
  iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
  iommufd: Have iopt_map_file_pages convert the fd to a file
  iommufd: Have pfn_reader process DMABUF iopt_pages
  iommufd: Allow MMIO pages in a batch
  iommufd: Allow a DMABUF to be revoked
  iommufd: Do not map/unmap revoked DMABUFs
  iommufd: Add DMABUF to iopt_pages
  vfio/pci: Add vfio_pci_dma_buf_iommufd_map()
2025-12-04  Merge tag 'vfio-v6.19-rc1' of https://github.com/awilliam/linux-vfio  (Linus Torvalds)
Pull VFIO updates from Alex Williamson:

 - Move libvfio selftest artifacts in preparation of more tightly coupled integration with KVM selftests (David Matlack)

 - Fix comment typo in mtty driver (Chu Guangqing)

 - Support for a new hardware revision in the hisi_acc vfio-pci variant driver where the migration registers can now be accessed via the PF. When enabled for this support, the full BAR can be exposed to the user (Longfang Liu)

 - Fix vfio cdev support for VF token passing, using the correct size for the kernel structure, thereby actually allowing userspace to provide a non-zero UUID token. Also set the match token callback for hisi_acc, fixing VF token support for this vfio-pci variant driver (Raghavendra Rao Ananta)

 - Introduce internal callbacks on vfio devices to simplify and consolidate duplicate code for generating VFIO_DEVICE_GET_REGION_INFO data, removing various ioctl intercepts with a more structured solution (Jason Gunthorpe)

 - Introduce dma-buf support for vfio-pci devices, allowing MMIO regions to be exposed through dma-buf objects with lifecycle managed through move operations. This enables low-level interactions, such as vfio-pci based SPDK drivers interacting directly with dma-buf capable RDMA devices, to enable peer-to-peer operations. IOMMUFD is also now able to build upon this support to fill a long-standing feature gap versus the legacy vfio type1 IOMMU backend with an implementation of P2P support for VM use cases that better manages the lifecycle of the P2P mapping (Leon Romanovsky, Jason Gunthorpe, Vivek Kasireddy)

 - Convert eventfd triggering for error and request signals to use RCU mechanisms in order to avoid a 3-way lockdep reported deadlock issue (Alex Williamson)

 - Fix a 32-bit overflow introduced via dma-buf support manifesting with large DMA buffers (Alex Mastro)

 - Convert the nvgrace-gpu vfio-pci variant driver to insert mappings on fault rather than at mmap time. This conversion serves to make use of huge PFNMAPs, to avoid corrected RAS events during reset by now being subject to vfio-pci-core's use of unmap_mapping_range(), and to enable a device readiness test after reset (Ankit Agrawal)

 - Refactoring of vfio selftests to support multi-device tests and split code to provide better separation between IOMMU and device objects. This work also enables a new test suite addition to measure parallel device initialization latency (David Matlack)

* tag 'vfio-v6.19-rc1' of https://github.com/awilliam/linux-vfio: (65 commits)
  vfio: selftests: Add vfio_pci_device_init_perf_test
  vfio: selftests: Eliminate INVALID_IOVA
  vfio: selftests: Split libvfio.h into separate header files
  vfio: selftests: Move vfio_selftests_*() helpers into libvfio.c
  vfio: selftests: Rename vfio_util.h to libvfio.h
  vfio: selftests: Stop passing device for IOMMU operations
  vfio: selftests: Move IOVA allocator into iova_allocator.c
  vfio: selftests: Move IOMMU library code into iommu.c
  vfio: selftests: Rename struct vfio_dma_region to dma_region
  vfio: selftests: Upgrade driver logging to dev_err()
  vfio: selftests: Prefix logs with device BDF where relevant
  vfio: selftests: Eliminate overly chatty logging
  vfio: selftests: Support multiple devices in the same container/iommufd
  vfio: selftests: Introduce struct iommu
  vfio: selftests: Rename struct vfio_iommu_mode to iommu_mode
  vfio: selftests: Allow passing multiple BDFs on the command line
  vfio: selftests: Split run.sh into separate scripts
  vfio: selftests: Move run.sh into scripts directory
  vfio/nvgrace-gpu: wait for the GPU mem to be ready
  vfio/nvgrace-gpu: Inform devmem unmapped after reset
  ...
2025-12-01  Merge tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs  (Linus Torvalds)
Pull fd prepare updates from Christian Brauner:
 "This adds the FD_ADD() and FD_PREPARE() primitives. They simplify the common pattern of get_unused_fd_flags() + create file + fd_install() that is used extensively throughout the kernel and currently requires cumbersome cleanup paths.

  FD_ADD() - For simple cases where a file is installed immediately:

      fd = FD_ADD(O_CLOEXEC, vfio_device_open_file(device));
      if (fd < 0)
              vfio_device_put_registration(device);
      return fd;

  FD_PREPARE() - For cases requiring access to the fd or file, or additional work before publishing:

      FD_PREPARE(fdf, O_CLOEXEC, sync_file->file);
      if (fdf.err) {
              fput(sync_file->file);
              return fdf.err;
      }
      data.fence = fd_prepare_fd(fdf);
      if (copy_to_user((void __user *)arg, &data, sizeof(data)))
              return -EFAULT;
      return fd_publish(fdf);

  The primitives are centered around struct fd_prepare. FD_PREPARE() encapsulates all allocation and cleanup logic and must be followed by a call to fd_publish(), which associates the fd with the file and installs it into the caller's fdtable. If fd_publish() isn't called, both are deallocated automatically. FD_ADD() is a shorthand that does fd_publish() immediately and never exposes the struct to the caller.

  I've implemented this in a way that it's compatible with the cleanup infrastructure while also being usable separately. IOW, it's centered around struct fd_prepare which is aliased to class_fd_prepare_t and so we can make use of all the basic guard infrastructure"

* tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits)
  io_uring: convert io_create_mock_file() to FD_PREPARE()
  file: convert replace_fd() to FD_PREPARE()
  vfio: convert vfio_group_ioctl_get_device_fd() to FD_ADD()
  tty: convert ptm_open_peer() to FD_ADD()
  ntsync: convert ntsync_obj_get_fd() to FD_PREPARE()
  media: convert media_request_alloc() to FD_PREPARE()
  hv: convert mshv_ioctl_create_partition() to FD_ADD()
  gpio: convert linehandle_create() to FD_PREPARE()
  pseries: port papr_rtas_setup_file_interface() to FD_ADD()
  pseries: convert papr_platform_dump_create_handle() to FD_ADD()
  spufs: convert spufs_gang_open() to FD_PREPARE()
  papr-hvpipe: convert papr_hvpipe_dev_create_handle() to FD_PREPARE()
  spufs: convert spufs_context_open() to FD_PREPARE()
  net/socket: convert __sys_accept4_file() to FD_ADD()
  net/socket: convert sock_map_fd() to FD_ADD()
  net/kcm: convert kcm_ioctl() to FD_PREPARE()
  net/handshake: convert handshake_nl_accept_doit() to FD_PREPARE()
  secretmem: convert memfd_secret() to FD_ADD()
  memfd: convert memfd_create() to FD_ADD()
  bpf: convert bpf_token_create() to FD_PREPARE()
  ...
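To make the saved boilerplate concrete, here is a minimal C sketch contrasting the manual pattern with FD_ADD(), based only on the semantics quoted above; the example_open_*() wrappers are hypothetical:

    /* Old pattern: reserve an fd, create the file, publish, and unwind
     * each step by hand on failure. */
    static int example_open_old(struct vfio_device *device)
    {
            struct file *filep;
            int fd;

            fd = get_unused_fd_flags(O_CLOEXEC);
            if (fd < 0)
                    return fd;

            filep = vfio_device_open_file(device);
            if (IS_ERR(filep)) {
                    put_unused_fd(fd);      /* manual cleanup path */
                    return PTR_ERR(filep);
            }

            fd_install(fd, filep);          /* fd becomes visible here */
            return fd;
    }

    /* New pattern: FD_ADD() folds the reservation, install, and
     * error unwind into one expression. */
    static int example_open_new(struct vfio_device *device)
    {
            return FD_ADD(O_CLOEXEC, vfio_device_open_file(device));
    }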
2025-12-01  vfio/xe: Add device specific vfio_pci driver variant for Intel graphics  (Michał Winiarski)
In addition to generic VFIO PCI functionality, the driver implements the VFIO migration uAPI, allowing userspace to enable migration for Intel Graphics SR-IOV Virtual Functions. The driver binds to the VF device and uses the API exposed by the Xe driver to transfer the VF migration data under the control of the PF device. Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Alex Williamson <alex@shazbot.org> Link: https://patch.msgid.link/20251127093934.1462188-5-michal.winiarski@intel.com Link: https://lore.kernel.org/all/20251128125322.34edbeaf.alex@shazbot.org/ Signed-off-by: Michał Winiarski <michal.winiarski@intel.com> (cherry picked from commit 2e38c50ae4929f0b954fee69d428db7121452867) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-11-28  vfio/nvgrace-gpu: wait for the GPU mem to be ready  (Ankit Agrawal)
Speculative prefetches from CPU to GPU memory before the GPU is ready after reset can cause harmless corrected RAS events to be logged on Grace systems. It is thus preferred that the mapping not be re-established until the GPU is ready post reset. The GPU readiness can be checked through BAR0 registers, similar to the check done at device probe time. It can take several seconds for the GPU to be ready, so it is desirable that this time overlap as much of the VM startup as possible to reduce the impact on VM boot time. The GPU readiness state is therefore checked on the first fault/huge_fault request or read/write access, which amortizes the GPU readiness time. The first fault and read/write access check the GPU state when the reset_done flag is set, which denotes that the GPU has just been reset. The memory_lock is taken across map/access to avoid races with GPU reset. Also check whether memory is enabled before waiting for the GPU to be ready; otherwise the readiness check would block for 30s. Lastly, add PM handling around read/write access. Cc: Shameer Kolothum <skolothumtho@nvidia.com> Cc: Alex Williamson <alex@shazbot.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Vikram Sethi <vsethi@nvidia.com> Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com> Suggested-by: Alex Williamson <alex@shazbot.org> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251127170632.3477-7-ankita@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
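A rough sketch of the ordering described above; memory_lock and reset_done come from the commit, while the device struct layout and the *_sketch helpers are assumptions:

    static vm_fault_t nvgrace_fault_sketch(struct vm_fault *vmf)
    {
            struct nvgrace_dev_sketch *nvdev = vmf->vma->vm_private_data;
            vm_fault_t ret = VM_FAULT_SIGBUS;

            down_read(&nvdev->core_device.memory_lock);

            /* Don't poll readiness with memory disabled, or the check
             * would block for the full ~30s timeout. */
            if (!nvgrace_memory_enabled_sketch(nvdev))
                    goto out;

            /* Only the first access after a reset pays the wait;
             * clearing the flag is idempotent, real locking may differ. */
            if (nvdev->reset_done) {
                    if (nvgrace_wait_device_ready_sketch(nvdev))    /* BAR0 poll */
                            goto out;
                    nvdev->reset_done = false;
            }

            /* ... insert the PFN mapping here ... */
            ret = VM_FAULT_NOPAGE;
    out:
            up_read(&nvdev->core_device.memory_lock);
            return ret;
    }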
2025-11-28  vfio/nvgrace-gpu: Inform devmem unmapped after reset  (Ankit Agrawal)
Introduce a new flag, reset_done, to signal that the GPU has just been reset and the mapping to the GPU memory has been zapped. Implement the reset_done handler to set this new variable. It will be used in later patches to wait for the GPU memory to be ready before doing any mapping or access. Cc: Jason Gunthorpe <jgg@ziepe.ca> Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com> Suggested-by: Alex Williamson <alex@shazbot.org> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251127170632.3477-6-ankita@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28  vfio/nvgrace-gpu: split the code to wait for GPU ready  (Ankit Agrawal)
Split the function that checks for the GPU device being ready at probe time: move the code that waits for the GPU to be ready, via BAR0 register reads, into a separate function so it can be reused. This also fixes a bug where the return status in the timeout case was overridden by the return from pci_enable_device. With the fix, a timeout generates an error as originally intended. Fixes: d85f69d520e6 ("vfio/nvgrace-gpu: Check the HBM training and C2C link status") Reviewed-by: Zhi Wang <zhiw@nvidia.com> Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251127170632.3477-5-ankita@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28  vfio: use vfio_pci_core_setup_barmap to map bar in mmap  (Ankit Agrawal)
Remove code duplication in vfio_pci_core_mmap by calling vfio_pci_core_setup_barmap to perform the bar mapping. No functional change is intended. Cc: Donald Dutile <ddutile@redhat.com> Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com> Reviewed-by: Zhi Wang <zhiw@nvidia.com> Suggested-by: Alex Williamson <alex@shazbot.org> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251127170632.3477-4-ankita@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28  vfio/nvgrace-gpu: Add support for huge pfnmap  (Ankit Agrawal)
NVIDIA's Grace based systems have large device memory. The device memory is mapped as VM_PFNMAP in the VMM's VMA. The nvgrace-gpu module can make use of the huge PFNMAP support added in mm [1]. To use it, a fault/huge_fault ops based mapping mechanism needs to be implemented; currently the nvgrace-gpu module relies on remap_pfn_range to do the mapping during VM bootup. Replace it to rely instead on fault handling, using vfio_pci_vmf_insert_pfn to set up the mapping. Moreover, to enable huge pfnmap, the nvgrace-gpu module is updated with a huge_fault ops implementation, which establishes the mapping according to the requested order. Note that if the PFN or the VMA address is unaligned to the order, the mapping falls back to the PTE level. Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1] Cc: Shameer Kolothum <skolothumtho@nvidia.com> Cc: Alex Williamson <alex@shazbot.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Vikram Sethi <vsethi@nvidia.com> Reviewed-by: Zhi Wang <zhiw@nvidia.com> Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251127170632.3477-3-ankita@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
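A sketch of what a huge_fault op with order fallback can look like, assuming vfio_pci_vmf_insert_pfn() takes the device private data, the vm_fault, and the mapping order (the exact exported signature is defined by the series; sketch_base_pfn() is hypothetical):

    static vm_fault_t nvgrace_huge_fault_sketch(struct vm_fault *vmf,
                                                unsigned int order)
    {
            struct vm_area_struct *vma = vmf->vma;
            unsigned long pfn = sketch_base_pfn(vma, vmf->address);

            /* Unaligned PFN or VA: retry the fault at PTE level. */
            if (order && (!IS_ALIGNED(vmf->address, PAGE_SIZE << order) ||
                          !IS_ALIGNED(pfn, 1UL << order)))
                    return VM_FAULT_FALLBACK;

            return vfio_pci_vmf_insert_pfn(vma->vm_private_data, vmf, order);
    }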
2025-11-28  vfio: refactor vfio_pci_mmap_huge_fault function  (Ankit Agrawal)
Refactor vfio_pci_mmap_huge_fault to split out the implementation that maps the VMA at the PTE/PMD/PUD level into a separate function. Export the new function for use by the nvgrace-gpu module. Move the alignment check verifying that the pfn and the VMA VA are aligned to the page order into the header file and make it inline. No functional change is intended. Cc: Shameer Kolothum <skolothumtho@nvidia.com> Cc: Alex Williamson <alex@shazbot.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251127170632.3477-2-ankita@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
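The inlined alignment check might look like this sketch (name hypothetical); it verifies that both the faulting address and the PFN can carry a mapping of the given order within the VMA:

    static inline bool vfio_order_aligned_sketch(struct vm_fault *vmf,
                                                 unsigned long pfn,
                                                 unsigned int order)
    {
            if (!order)
                    return true;    /* PTE level always fits */
            return IS_ALIGNED(vmf->address, PAGE_SIZE << order) &&
                   IS_ALIGNED(pfn, 1UL << order) &&
                   vmf->address + (PAGE_SIZE << order) <= vmf->vma->vm_end;
    }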
2025-11-28  vfio/pci: Use RCU for error/request triggers to avoid circular locking  (Alex Williamson)
Thanks to a device generating an ACS violation during bus reset, lockdep reported the following circular locking issue:

  CPU0: SET_IRQS (MSI/X): holds igate, acquires memory_lock
  CPU1: HOT_RESET:        holds memory_lock, acquires pci_bus_sem
  CPU2: AER:              holds pci_bus_sem, acquires igate

This results in a potential 3-way deadlock. Remove the pci_bus_sem->igate leg of the triangle by using RCU to peek at the eventfd rather than locking it with igate. Fixes: 3be3a074cf5b ("vfio-pci: Don't use device_lock around AER interrupt setup") Signed-off-by: Alex Williamson <alex.williamson@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20251124223623.2770706-1-alex@shazbot.org Signed-off-by: Alex Williamson <alex@shazbot.org>
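The conversion amounts to the classic RCU read-side pattern sketched below: the signaling path no longer takes igate, which removes the pci_bus_sem -> igate lock dependency (err_trigger is the existing trigger field; treat the exact plumbing as illustrative):

    static void vfio_pci_signal_eventfd_sketch(struct vfio_pci_core_device *vdev)
    {
            struct eventfd_ctx *ctx;

            rcu_read_lock();
            ctx = rcu_dereference(vdev->err_trigger);
            if (ctx)
                    eventfd_signal(ctx);    /* no mutex on this path */
            rcu_read_unlock();
    }

Writers still serialize on igate, publish a new context with rcu_assign_pointer(), and wait a grace period (synchronize_rcu()) before freeing the old one.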
2025-11-28  vfio: convert vfio_group_ioctl_get_device_fd() to FD_ADD()  (Christian Brauner)
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-43-b6efa1706cfd@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25  vfio/pci: Add vfio_pci_dma_buf_iommufd_map()  (Jason Gunthorpe)
This function is used to establish the "private interconnect" between the VFIO DMABUF exporter and the iommufd DMABUF importer. This is intended to be a temporary API until the core DMABUF interface is improved to natively support a private interconnect and revocable negotiation. This function should only be called by iommufd when trying to map a DMABUF. For now iommufd will only support VFIO DMABUFs.

The following improvements are needed in the DMABUF API to generically support more exporters with iommufd/kvm type importers that cannot use the DMA API:

 1) Revoke semantics. VFIO needs to be able to prevent access to the MMIO during FLR, and so it will use dma_buf_move_notify() to prevent access. iommufd does not support fault handling so it cannot implement the full move_notify. Instead, if revoke is negotiated, the exporter promises not to use move_notify() unless the importer can experience failures. iommufd will unmap the dmabuf from the iommu page tables while it is revoked.

 2) Private interconnect negotiation. iommufd will only be able to map a "private interconnect" that provides a phys_addr_t and a struct p2pdma_provider * to describe the memory. It cannot use a DMA mapped scatterlist since it is directly calling iommu_map().

 3) NULL device during dma_buf_dynamic_attach(). Since iommufd doesn't use the DMA API it doesn't have a DMAable struct device to pass here.

Link: https://patch.msgid.link/r/1-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shuai Xue <xueshuai@linux.alibaba.com> Acked-by: Alex Williamson <alex@shazbot.org> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-11-20  Merge tag 'vfio-v6.19-dma-buf-v9+' into v6.19/vfio/next  (Alex Williamson)
[v9] vfio/pci: Allow MMIO regions to be exported through dma-buf https://lore.kernel.org/all/20251120-dmabuf-vfio-v9-0-d7f71607f371@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20  vfio/nvgrace: Support get_dmabuf_phys  (Jason Gunthorpe)
Call vfio_pci_core_fill_phys_vec() with the proper physical ranges for the synthetic BAR 2 and BAR 4 regions. Otherwise use the normal flow based on the PCI BAR. This demonstrates a DMABUF that follows the region info report to only allow mapping parts of the region that are mmapable. Since the BAR is power-of-two sized and the "CXL" region is just page aligned, there can be a padding region at the end that is not mmaped or passed into the DMABUF. The "CXL" ranges that are remapped into the BAR 2 and BAR 4 areas are not PCI MMIO; they actually run over the CXL-like coherent interconnect and, for the purposes of DMA, behave identically to DRAM. We don't try to model this distinction between true PCI BAR memory that takes a real PCI path and the "CXL" memory that takes a different path in the p2p framework for now. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Alex Mastro <amastro@fb.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Ankit Agrawal <ankita@nvidia.com> Reviewed-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-11-d7f71607f371@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20  vfio/pci: Add dma-buf export support for MMIO regions  (Leon Romanovsky)
Add support for exporting PCI device MMIO regions through dma-buf, enabling safe sharing of non-struct page memory with controlled lifetime management. This allows RDMA and other subsystems to import dma-buf FDs and build them into memory regions for PCI P2P operations.

The implementation provides a revocable attachment mechanism using dma-buf move operations. MMIO regions are normally pinned as BARs don't change physical addresses, but access is revoked when the VFIO device is closed or a PCI reset is issued. This ensures kernel self-defense against potentially hostile userspace.

Currently VFIO can take MMIO regions from the device's BAR and map them into a PFNMAP VMA with special PTEs. This mapping type ensures the memory cannot be used with things like pin_user_pages(), hmm, and so on. In practice only the user process CPU and KVM can safely make use of these VMAs. When VFIO shuts down, these VMAs are cleaned by unmap_mapping_range() to prevent any UAF of the MMIO beyond driver unbind. However, VFIO type 1 has an insecure behavior where it uses follow_pfnmap_*() to fish an MMIO PFN out of a VMA and program it back into the IOMMU. This has a long history of enabling P2P DMA inside VMs, but has serious lifetime problems by allowing a UAF of the MMIO after the VFIO driver has been unbound.

Introduce DMABUF as a new safe way to export an FD based handle for the MMIO regions. This can be consumed by existing DMABUF importers like RDMA or DRM without opening a UAF. A following series will add an importer to iommufd to obsolete the type 1 code and allow safe UAF-free MMIO P2P in VM cases.

DMABUF has a built-in synchronous invalidation mechanism called move_notify. VFIO keeps track of all drivers importing its MMIO and can invoke a synchronous invalidation callback to tell the importing drivers to DMA unmap and forget about the MMIO pfns. This process is called revoke. This synchronous invalidation fully prevents any lifecycle problems. VFIO will do this before unbinding its driver, ensuring there is no UAF of the MMIO beyond the driver lifecycle.

Further, VFIO has additional behavior to block access to the MMIO during things like Function Level Reset. This is because some poor platforms may experience an MCE type crash when touching MMIO of a PCI device that is undergoing a reset. Today this is done by using unmap_mapping_range() on the VMAs. Extend that into the DMABUF world and temporarily revoke the MMIO from the DMABUF importers during FLR as well. This will more robustly prevent an errant P2P from possibly upsetting the platform.

A DMABUF FD is a preferred handle for MMIO compared to something like a pgmap because:
 - VFIO is supported, including its P2P feature, on archs that don't support pgmap
 - PCI devices have all sorts of BAR sizes, including ones smaller than a section, so a pgmap cannot always be created
 - It is undesirable to waste a lot of memory on struct pages, especially for a case like a GPU with ~100GB of BAR size
 - We want a synchronous revoke semantic to support FLR with light hardware requirements

Use the P2P subsystem to help generate the DMA mapping. This is a significant upgrade over the abuse of dma_map_resource() that has historically been used by DMABUF exporters. Experience with an OOT version of this patch shows that real systems do need this. This approach deals with all the P2P scenarios:
 - Non-zero PCI bus_offset
 - ACS flags routing traffic to the IOMMU
 - ACS flags that bypass the IOMMU - though vfio noiommu is required to hit this

There will be further work to formalize the revoke semantic in DMABUF. For now this acts like a move_notify dynamic exporter where an importer's fault handling will get a failure when it attempts to map. This means that only fully restartable, fault-capable importers can import the VFIO DMABUFs. A future revoke semantic should open this up to more HW, as the HW only needs to invalidate, not handle restartable faults.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-10-d7f71607f371@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
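Conceptually, the revoke path is a move_notify broadcast under the reservation lock, as in this sketch; dma_buf_move_notify() and dma_resv_lock()/dma_resv_unlock() are the stock dma-buf APIs, while the priv layout is hypothetical:

    static void vfio_dma_buf_revoke_sketch(struct dma_buf *dmabuf)
    {
            struct vfio_dma_buf_priv_sketch *priv = dmabuf->priv;

            dma_resv_lock(dmabuf->resv, NULL);
            priv->revoked = true;           /* later map attempts fail */
            dma_buf_move_notify(dmabuf);    /* importers unmap synchronously */
            dma_resv_unlock(dmabuf->resv);
    }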
2025-11-20  vfio/pci: Enable peer-to-peer DMA transactions by default  (Leon Romanovsky)
Make sure that all VFIO PCI devices have peer-to-peer capabilities enabled, so that we are able to export their MMIO memory through DMABUF. VFIO has always supported P2P mappings with itself: VFIO type 1 insecurely reads PFNs directly out of a VMA's PTEs and programs them into the IOMMU, allowing any two VFIO devices to perform P2P to each other. All existing VMMs use this capability to export P2P into a VM, where the VM can set up any kind of DMA it likes. Projects like DPDK/SPDK are also known to make use of this, though less frequently. As a first step to more properly integrating VFIO with the P2P subsystem, unconditionally enable P2P support for VFIO PCI devices. The struct p2pdma_provider will act as a handle to the P2P subsystem to do things like DMA mapping. While real PCI devices have to support P2P (they can't even tell if an IOVA is P2P or not), there may be fake PCI devices that could trigger some kind of catastrophic system failure. To date VFIO has never tripped up on such a case, but if one is discovered the plan is to add a PCI quirk and have pcim_p2pdma_init() fail. This will fully block the broken device throughout any users of the P2P subsystem in the kernel. Thus P2P through DMABUF will follow the historical VFIO model and be unconditionally enabled by vfio-pci. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Alex Mastro <amastro@fb.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-9-d7f71607f371@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20  vfio/pci: Share the core device pointer while invoking feature functions  (Vivek Kasireddy)
There is no need to share the main device pointer (struct vfio_device *) with all the feature functions as they only need the core device pointer. Therefore, extract the core device pointer once in the caller (vfio_pci_core_ioctl_feature) and share it instead. Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Alex Mastro <amastro@fb.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-8-d7f71607f371@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20  vfio: Export vfio device get and put registration helpers  (Vivek Kasireddy)
These helpers are useful for managing additional references taken on the device from other associated VFIO modules. Original-patch-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Alex Mastro <amastro@fb.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-7-d7f71607f371@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12  vfio: Remove the get_region_info op  (Jason Gunthorpe)
No driver uses it now, all are using get_region_info_caps(). Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/22-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12  vfio: Move the remaining drivers to get_region_info_caps  (Jason Gunthorpe)
Remove the duplicate code and change info to a pointer. caps are not used. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Pranjal Shrivastava <praan@google.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/21-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12  vfio/platform: Convert to get_region_info_caps  (Jason Gunthorpe)
Remove the duplicate code and change info to a pointer. caps are not used. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Mostafa Saleh <smostafa@google.com> Reviewed-by: Pranjal Shrivastava <praan@google.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12  vfio/pci: Convert all PCI drivers to get_region_info_caps  (Jason Gunthorpe)
Since the core function signature changes it has to flow up to all drivers. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Pranjal Shrivastava <praan@google.com> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/19-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12  vfio: Add get_region_info_caps op  (Jason Gunthorpe)
This op does the copy to/from user for the info and can return back a cap chain through a vfio_info_cap * result. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Pranjal Shrivastava <praan@google.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/15-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12  vfio: Require drivers to implement get_region_info  (Jason Gunthorpe)
Remove the fallback through the ioctl callback, no drivers use this now. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Pranjal Shrivastava <praan@google.com> Reviewed-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/14-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12  vfio/cdx: Provide a get_region_info op  (Jason Gunthorpe)
Change the signature of vfio_cdx_ioctl_get_region_info() and hook it to the op. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Pranjal Shrivastava <praan@google.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/11-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12  vfio/fsl: Provide a get_region_info op  (Jason Gunthorpe)
Move it out of vfio_fsl_mc_ioctl() and re-indent it. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Pranjal Shrivastava <praan@google.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/10-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12  vfio/platform: Provide a get_region_info op  (Jason Gunthorpe)
Move it out of vfio_platform_ioctl() and re-indent it. Add it to all platform drivers. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Pranjal Shrivastava <praan@google.com> Reviewed-by: Mostafa Saleh <smostafa@google.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/9-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12  vfio/pci: Fill in the missing get_region_info ops  (Jason Gunthorpe)
Now that every variant driver provides a get_region_info op, remove the ioctl based dispatch from vfio_pci_core_ioctl(). Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Pranjal Shrivastava <praan@google.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/5-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12  vfio/nvgrace: Convert to the get_region_info op  (Jason Gunthorpe)
Change the signature of nvgrace_gpu_ioctl_get_region_info(). Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Ankit Agrawal <ankita@nvidia.com> Reviewed-by: Pranjal Shrivastava <praan@google.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/4-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12  vfio/virtio: Convert to the get_region_info op  (Jason Gunthorpe)
Remove virtiovf_vfio_pci_core_ioctl() and change the signature of virtiovf_pci_ioctl_get_region_info(). Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Pranjal Shrivastava <praan@google.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/3-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12  vfio/hisi: Convert to the get_region_info op  (Jason Gunthorpe)
Change the function signature of hisi_acc_vfio_pci_ioctl() and re-indent it. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Pranjal Shrivastava <praan@google.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/2-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12  vfio: Provide a get_region_info op  (Jason Gunthorpe)
Instead of hooking the general ioctl op, have the core code directly decode VFIO_DEVICE_GET_REGION_INFO and call an op just for it. This is intended to allow mechanical changes to the drivers to pull their VFIO_DEVICE_GET_REGION_INFO handling into a function. Later patches will improve the function signature to consolidate more code. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Pranjal Shrivastava <praan@google.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
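Under the new scheme, a driver's handler reduces to filling in a kernel copy of the info struct, roughly as below; the exact op signature evolves over the series (later gaining a caps argument), so treat this as a sketch:

    static int my_get_region_info_sketch(struct vfio_device *vdev,
                                         struct vfio_region_info *info)
    {
            switch (info->index) {
            case 0:
                    info->offset = 0;
                    info->size = SZ_64K;    /* illustrative region size */
                    info->flags = VFIO_REGION_INFO_FLAG_READ |
                                  VFIO_REGION_INFO_FLAG_WRITE;
                    return 0;
            default:
                    return -EINVAL;
            }
    }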
2025-11-06  hisi_acc_vfio_pci: Add .match_token_uuid callback in hisi_acc_vfio_pci_migrn_ops  (Raghavendra Rao Ananta)
Commit 86624ba3b522 ("vfio/pci: Do vf_token checks for VFIO_DEVICE_BIND_IOMMUFD") accidentally omitted the .match_token_uuid callback from the hisi_acc_vfio_pci_migrn_ops struct. Introduce the missing callback here. Fixes: 86624ba3b522 ("vfio/pci: Do vf_token checks for VFIO_DEVICE_BIND_IOMMUFD") Cc: stable@vger.kernel.org Suggested-by: Longfang Liu <liulongfang@huawei.com> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Longfang Liu <liulongfang@huawei.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20251031170603.2260022-3-rananta@google.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-06  vfio: Fix ksize arg while copying user struct in vfio_df_ioctl_bind_iommufd()  (Raghavendra Rao Ananta)
For cases where the user includes a non-zero value in the 'token_uuid_ptr' field of 'struct vfio_device_bind_iommufd', the copy_struct_from_user() in vfio_df_ioctl_bind_iommufd() fails with -E2BIG. For the 'minsz' passed, copy_struct_from_user() expects the newly introduced field to be zeroed, which is incorrect in this case. Fix this by passing the actual size of the kernel struct. When working with a newer userspace, copy_struct_from_user() will copy the 'token_uuid_ptr' field, and when working with an old userspace it will zero out this field, thus retaining backward compatibility. Fixes: 86624ba3b522 ("vfio/pci: Do vf_token checks for VFIO_DEVICE_BIND_IOMMUFD") Cc: stable@vger.kernel.org Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20251031170603.2260022-2-rananta@google.com Signed-off-by: Alex Williamson <alex@shazbot.org>
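The semantics of copy_struct_from_user() make the one-line nature of the fix clear; a sketch, with usize being the size userspace claimed via argsz:

    /* ksize must be the full kernel struct size. With an old userspace
     * (small usize) the new token_uuid_ptr tail is zeroed for us; with
     * a new userspace the field is copied instead of tripping -E2BIG,
     * which is what passing minsz as ksize caused. */
    static int bind_copy_sketch(struct vfio_device_bind_iommufd *bind,
                                const void __user *arg, size_t usize)
    {
            return copy_struct_from_user(bind, sizeof(*bind), arg, usize);
    }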
2025-11-05  hisi_acc_vfio_pci: adapt to new migration configuration  (Longfang Liu)
On new platforms greater than QM_HW_V3, the migration region has been relocated from the VF to the PF. The VF's own configuration space is restored to the complete 64KB, and there is no need to divide the size of the BAR configuration space equally. The driver should be modified accordingly to adapt to the new hardware device. On the older hardware platform QM_HW_V3, the live migration configuration region is placed in the latter 32K portion of the VF's BAR2 configuration space. On the new hardware platform QM_HW_V4, the live migration configuration region also exists in the same 32K area immediately following the VF's BAR2, just like on QM_HW_V3. However, access to this region is now controlled by hardware. Additionally, a copy of the live migration configuration region is present in the PF's BAR2 configuration space. On the new hardware platform QM_HW_V4, when an older version of the driver is loaded, it behaves like QM_HW_V3 and uses the configuration region in the VF, ensuring that the live migration function continues to work normally. When the new version of the driver is loaded, it directly uses the configuration region in the PF. Meanwhile, hardware configuration disables the live migration configuration region in the VF's BAR2: reads return all 0xF values, and writes are silently ignored. Signed-off-by: Longfang Liu <liulongfang@huawei.com> Reviewed-by: Shameer Kolothum <shameerkolothum@gmail.com> Link: https://lore.kernel.org/r/20251030015744.131771-3-liulongfang@huawei.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-10-28  vfio/type1: handle DMA map/unmap up to the addressable limit  (Alex Mastro)
Before this commit, it was possible to create end-of-address-space mappings, but unmapping them via VFIO_IOMMU_UNMAP_DMA, replaying them for newly added iommu domains, and querying their dirty pages via VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP were broken due to bugs caused by comparisons against (iova + size) expressions, which overflow to zero. Additionally, there appears to be a page pinning leak in the vfio_iommu_type1_release() path, since the loop body in vfio_unmap_unpin() where unmap_unpin_*() are called will never be entered due to overflow of (iova + size) to zero. This commit handles DMA map/unmap operations up to the addressable limit by comparing against inclusive end-of-range limits, and by changing iteration to perform relative traversals across range sizes rather than absolute traversals across addresses. vfio_link_dma() inserts a zero-sized vfio_dma into the rb-tree and is only used for that purpose, so discard the size from consideration for the insertion point. Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation") Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Signed-off-by: Alex Mastro <amastro@fb.com> Link: https://lore.kernel.org/r/20251028-fix-unmap-v6-3-2542b96bcc8e@fb.com Signed-off-by: Alex Williamson <alex@shazbot.org>
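The core idiom of the fix is comparing against an inclusive last address so a range ending at the top of the address space never computes the overflowing (iova + size); a minimal sketch:

    /* size != 0 is assumed to have been validated already */
    static bool range_contains_sketch(u64 iova, u64 size, u64 addr)
    {
            u64 last = iova + size - 1;     /* inclusive end of range */

            return addr >= iova && addr <= last;    /* never iova + size */
    }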
2025-10-28  vfio/type1: move iova increment to unmap_unpin_*() caller  (Alex Mastro)
Move incrementing iova to the caller of these functions as part of preparing to handle end of address space map/unmap. Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation") Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Signed-off-by: Alex Mastro <amastro@fb.com> Link: https://lore.kernel.org/r/20251028-fix-unmap-v6-2-2542b96bcc8e@fb.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-10-28  vfio/type1: sanitize for overflow using check_*_overflow()  (Alex Mastro)
Adopt check_*_overflow() functions to clearly express overflow check intent. Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation") Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Signed-off-by: Alex Mastro <amastro@fb.com> Link: https://lore.kernel.org/r/20251028-fix-unmap-v6-1-2542b96bcc8e@fb.com Signed-off-by: Alex Williamson <alex@shazbot.org>
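For example, validating a map request's end address with an explicit overflow check instead of an easy-to-miss wraparound comparison (a sketch):

    #include <linux/overflow.h>

    static int validate_map_sketch(u64 iova, u64 size)
    {
            u64 last;

            if (!size || check_add_overflow(iova, size - 1, &last))
                    return -EOVERFLOW;      /* wraps past the limit */
            return 0;       /* [iova, last] is a representable range */
    }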
2025-10-06  vfio: Dump migration features under debugfs  (Cédric Le Goater)
A debugfs directory was recently added for VFIO devices. Add a new "features" file under the migration sub-directory to expose which features the device supports. Signed-off-by: Cédric Le Goater <clg@redhat.com> Link: https://lore.kernel.org/r/20250918121928.1921871-1-clg@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-10-06  vfio/type1: optimize vfio_unpin_pages_remote()  (Li Zhe)
When vfio_unpin_pages_remote() is called with a range of addresses that includes large folios, the function currently performs individual put_pfn() operations for each page. This can lead to significant performance overhead, especially when dealing with large ranges of pages. It would be very rare for reserved and non-reserved PFNs to be mixed within the same range, so this patch uses the has_rsvd variable introduced in the previous patch to determine whether batch put_pfn() operations can be performed. Moreover, compared to put_pfn(), unpin_user_page_range_dirty_lock() is capable of handling large folio scenarios more efficiently.

The performance test results for completing a 16G VFIO IOMMU DMA unmap are as follows.

Base (v6.16):
  MADV_HUGEPAGE: VFIO UNMAP DMA in 0.141 s (113.7 GB/s)
  MAP_POPULATE:  VFIO UNMAP DMA in 0.307 s (52.2 GB/s)
  HUGETLBFS:     VFIO UNMAP DMA in 0.135 s (118.6 GB/s)

With this patchset:
  MADV_HUGEPAGE: VFIO UNMAP DMA in 0.044 s (363.2 GB/s)
  MAP_POPULATE:  VFIO UNMAP DMA in 0.289 s (55.3 GB/s)
  HUGETLBFS:     VFIO UNMAP DMA in 0.044 s (361.3 GB/s)

For large folios, we achieve an over 67% performance improvement in the VFIO UNMAP DMA item. For small folios, the results show a slight improvement. Suggested-by: Jason Gunthorpe <jgg@ziepe.ca> Signed-off-by: Li Zhe <lizhe.67@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20250814064714.56485-6-lizhe.67@bytedance.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
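The batching reduces to one call over the whole contiguous run when has_rsvd is clear, as in this sketch; unpin_user_page_range_dirty_lock() is the real mm helper, the surrounding logic is illustrative:

    static long unpin_range_sketch(unsigned long pfn, long npage,
                                   bool dirty, bool has_rsvd)
    {
            if (has_rsvd)
                    return -1;      /* fall back to per-page put_pfn() */

            /* GUP-pinned, non-reserved: release the whole run at once */
            unpin_user_page_range_dirty_lock(pfn_to_page(pfn), npage, dirty);
            return npage;
    }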
2025-10-06  vfio/type1: introduce a new member has_rsvd for struct vfio_dma  (Li Zhe)
Introduce a new member has_rsvd for struct vfio_dma. This member is used to indicate whether there are any reserved or invalid pfns in the region represented by this vfio_dma. If it is true, it indicates that there is at least one pfn in this region that is either reserved or invalid. Signed-off-by: Li Zhe <lizhe.67@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20250814064714.56485-5-lizhe.67@bytedance.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-10-06  vfio/type1: batch vfio_find_vpfn() in function vfio_unpin_pages_remote()  (Li Zhe)
The function vpfn_pages() can help us determine the number of vpfn nodes on the vpfn rb tree within a specified range. This allows us to avoid searching for each vpfn individually in the function vfio_unpin_pages_remote(). This patch batches the vfio_find_vpfn() calls in function vfio_unpin_pages_remote(). Signed-off-by: Li Zhe <lizhe.67@bytedance.com> Link: https://lore.kernel.org/r/20250814064714.56485-4-lizhe.67@bytedance.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-10-06  vfio/type1: optimize vfio_pin_pages_remote()  (Li Zhe)
When vfio_pin_pages_remote() is called with a range of addresses that includes large folios, the function currently performs individual statistics counting operations for each page. This can lead to significant performance overhead, especially when dealing with large ranges of pages. Batch processing of statistical counting operations can effectively enhance performance. In addition, the pages obtained through longterm GUP are neither invalid nor reserved, so we can reduce the overhead associated with some calls to is_invalid_reserved_pfn().

The performance test results for completing a 16G VFIO IOMMU DMA mapping are as follows.

Base (v6.16):
  MADV_HUGEPAGE: VFIO MAP DMA in 0.049 s (328.5 GB/s)
  MAP_POPULATE:  VFIO MAP DMA in 0.268 s (59.6 GB/s)
  HUGETLBFS:     VFIO MAP DMA in 0.051 s (310.9 GB/s)

With this patch:
  MADV_HUGEPAGE: VFIO MAP DMA in 0.025 s (629.8 GB/s)
  MAP_POPULATE:  VFIO MAP DMA in 0.253 s (63.1 GB/s)
  HUGETLBFS:     VFIO MAP DMA in 0.030 s (530.5 GB/s)

For large folios, we achieve an over 40% performance improvement. For small folios, the performance test results indicate a slight improvement. Signed-off-by: Li Zhe <lizhe.67@bytedance.com> Co-developed-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Eric Farman <farman@linux.ibm.com> Link: https://lore.kernel.org/r/20250814064714.56485-3-lizhe.67@bytedance.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-10-04  Merge tag 'vfio-v6.18-rc1' of https://github.com/awilliam/linux-vfio  (Linus Torvalds)
Pull VFIO updates from Alex Williamson:

 - Use fdinfo to expose the sysfs path of a device represented by a vfio device file (Alex Mastro)

 - Mark vfio-fsl-mc, vfio-amba, and the reset functions for vfio-platform for removal as these are either orphaned or believed to be unused (Alex Williamson)

 - Add reviewers for vfio-platform to save it from also being marked for removal (Mostafa Saleh, Pranjal Shrivastava)

 - VFIO selftests, including basic sanity testing and minimal userspace drivers for testing against real hardware. This is also expected to provide integration with KVM selftests for KVM-VFIO interfaces (David Matlack, Josh Hilke)

 - Fix drivers/cdx and vfio/cdx to build without CONFIG_GENERIC_MSI_IRQ (Nipun Gupta)

 - Fix reference leak in hisi_acc (Miaoqian Lin)

 - Use consistent return for unsupported device feature (Alex Mastro)

 - Unwind using the correct memory free callback in vfio/pds (Zilin Guan)

 - Use the IRQ_DISABLE_UNLAZY flag to improve handling of pre-PCI 2.3 INTx and resolve a stalled interrupt on ppc64 (Timothy Pearson)

 - Enable GB300 in the nvgrace-gpu vfio-pci variant driver (Tushar Dave)

 - Misc:
    - Drop unnecessary ternary conversion in vfio/pci (Xichao Zhao)
    - Grammatical fix in nvgrace-gpu (Morduan Zang)
    - Update Shameer's email address (Shameer Kolothum)
    - Fix document build warning (Alex Williamson)

* tag 'vfio-v6.18-rc1' of https://github.com/awilliam/linux-vfio: (48 commits)
  vfio/nvgrace-gpu: Add GB300 SKU to the devid table
  vfio/pci: Fix INTx handling on legacy non-PCI 2.3 devices
  vfio/pds: replace bitmap_free with vfree
  vfio: return -ENOTTY for unsupported device feature
  hisi_acc_vfio_pci: Fix reference leak in hisi_acc_vfio_debug_init
  vfio/platform: Mark reset drivers for removal
  vfio/amba: Mark for removal
  MAINTAINERS: Add myself as VFIO-platform reviewer
  MAINTAINERS: Add myself as VFIO-platform reviewer
  docs: proc.rst: Fix VFIO Device title formatting
  vfio: selftests: Fix .gitignore for already tracked files
  vfio/cdx: update driver to build without CONFIG_GENERIC_MSI_IRQ
  cdx: don't select CONFIG_GENERIC_MSI_IRQ
  MAINTAINERS: Update Shameer Kolothum's email address
  vfio: selftests: Add a script to help with running VFIO selftests
  vfio: selftests: Make iommufd the default iommu_mode
  vfio: selftests: Add iommufd mode
  vfio: selftests: Add iommufd_compat_type1{,v2} modes
  vfio: selftests: Add vfio_type1v2_mode
  vfio: selftests: Replicate tests across all iommu_modes
  ...
2025-09-26  vfio/nvgrace-gpu: Add GB300 SKU to the devid table  (Tushar Dave)
GB300 is NVIDIA's Grace Blackwell Ultra Superchip. Add the GB300 SKU device-id to nvgrace_gpu_vfio_pci_table. Signed-off-by: Tushar Dave <tdave@nvidia.com> Reviewed-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20250925170935.121587-1-tdave@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-09-26  vfio/pci: Fix INTx handling on legacy non-PCI 2.3 devices  (Timothy Pearson)
PCI devices prior to PCI 2.3 both use level interrupts and do not support interrupt masking, leading to a failure when passed through to a KVM guest on at least the ppc64 platform. This failure manifests as receiving and acknowledging a single interrupt in the guest, while the device continues to assert the level interrupt, indicating a need for further servicing. When lazy IRQ masking is used on DisINTx- (non-PCI 2.3) hardware, the following sequence occurs:
 * Level IRQ assertion on device
 * IRQ marked disabled in kernel
 * Host interrupt handler exits without clearing the interrupt on the device
 * Eventfd is delivered to userspace
 * Guest processes IRQ and clears device interrupt
 * Device de-asserts INTx, then re-asserts INTx while the interrupt is masked
 * Newly asserted interrupt acknowledged by kernel VMM without being handled
 * Software mask removed by VFIO driver
 * Device INTx still asserted, host controller does not see new edge after EOI

The behavior is now platform-dependent. Some platforms (amd64) will continue to spew IRQs for as long as the INTx line remains asserted, so the IRQ will be handled by the host as soon as the mask is dropped. Others (ppc64) will only send the one request, and if it is not handled no further interrupts will be sent. The former behavior theoretically leaves the system vulnerable to interrupt storm, and the latter will result in the device stalling after receiving exactly one interrupt in the guest. Work around this by disabling lazy IRQ masking for DisINTx- INTx devices. Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com> Link: https://lore.kernel.org/r/333803015.1744464.1758647073336.JavaMail.zimbra@raptorengineeringinc.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
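The workaround boils down to marking the IRQ for eager disabling before it is requested; a sketch under assumptions (vdev->pci_2_3 and IRQ_DISABLE_UNLAZY exist in the kernel, the handler name is illustrative):

    static irqreturn_t vfio_intx_handler_sketch(int irq, void *data);

    static int vfio_intx_setup_sketch(struct vfio_pci_core_device *vdev)
    {
            struct pci_dev *pdev = vdev->pdev;

            /* DisINTx- hardware: mask at the irqchip, never lazily */
            if (!vdev->pci_2_3)
                    irq_set_status_flags(pdev->irq, IRQ_DISABLE_UNLAZY);

            return request_irq(pdev->irq, vfio_intx_handler_sketch,
                               IRQF_SHARED, "vfio-intx", vdev);
    }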
2025-09-21  vfio/pci: drop nth_page() usage within SG entry  (David Hildenbrand)
It's no longer required to use nth_page() when iterating pages within a single SG entry, so let's drop the nth_page() usage. Link: https://lkml.kernel.org/r/20250901150359.867252-33-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Yishai Hadas <yishaih@nvidia.com> Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Cc: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
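Within one SG entry the pages are physically contiguous, so plain pointer arithmetic suffices; a sketch of the simplification:

    static unsigned long sg_pages_sketch(struct scatterlist *sg)
    {
            struct page *first = sg_page(sg);
            unsigned long i, n = sg->length / PAGE_SIZE;

            for (i = 0; i < n; i++) {
                    struct page *page = first + i;  /* was nth_page(first, i) */
                    /* ... operate on page ... */
                    (void)page;
            }
            return n;
    }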
2025-09-19  vfio/pds: replace bitmap_free with vfree  (Zilin Guan)
host_seq_bmp is allocated with vzalloc but is currently freed with bitmap_free, which uses kfree internally. This mismatch prevents the resource from being released properly and may result in memory leaks or other issues. Fix this by freeing host_seq_bmp with vfree to match the vzalloc allocation. Fixes: f232836a9152 ("vfio/pds: Add support for dirty page tracking") Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Link: https://lore.kernel.org/r/20250913153154.1028835-1-zilin@seu.edu.cn Signed-off-by: Alex Williamson <alex@shazbot.org>
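The rule illustrated: vmalloc-family allocations must be released with vfree(), while bitmap_free() is kfree()-based; a sketch:

    static int dirty_bitmap_sketch(unsigned long nbits)
    {
            unsigned long *bmp = vzalloc(BITS_TO_LONGS(nbits) *
                                         sizeof(unsigned long));

            if (!bmp)
                    return -ENOMEM;
            /* ... track dirty pages ... */
            vfree(bmp);             /* matches vzalloc() */
            /* bitmap_free(bmp);       wrong: kfree() on vmalloc memory */
            return 0;
    }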