path: root/arch/x86/kvm/x86.c
Age  Commit message  Author
4 days  Merge tag 'kvm-x86-fixes-6.19-rc1' of https://github.com/kvm-x86/linux into HEAD  Paolo Bonzini
KVM fixes for 6.19-rc1

- Add a missing "break" to fix param parsing in the rseq selftest.
- Apply runtime updates to the _current_ CPUID when userspace is setting CPUID, e.g. as part of vCPU hotplug, to fix a false positive and to avoid dropping the pending update.
- Disallow toggling KVM_MEM_GUEST_MEMFD on an existing memslot, as it's not supported by KVM and leads to a use-after-free due to KVM failing to unbind the memslot from the previously-associated guest_memfd instance.
- Harden against similar KVM_MEM_GUEST_MEMFD goofs, and prepare for supporting flags-only changes on KVM_MEM_GUEST_MEMFD memslots, e.g. for dirty logging.
- Set exit_code[63:32] to -1 (all 0xffs) when synthesizing a nested SVM_EXIT_ERR (a.k.a. VMEXIT_INVALID) #VMEXIT, as VMEXIT_INVALID is defined as -1ull (a 64-bit value).
- Update SVI when activating APICv to fix a bug where a post-activation EOI for an in-service IRQ would effectively be lost due to SVI being stale.
- Immediately refresh APICv controls (if necessary) on a nested VM-Exit instead of deferring the update via KVM_REQ_APICV_UPDATE, as the request is effectively ignored because KVM thinks the vCPU already has the correct APICv settings.
2025-12-08  KVM: VMX: Update SVI during runtime APICv activation  Dongli Zhang
The APICv (apic->apicv_active) can be activated or deactivated at runtime, for instance, because of APICv inhibit reasons. Intel VMX employs different mechanisms to virtualize the LAPIC based on whether APICv is active.

When APICv is activated at runtime, GUEST_INTR_STATUS is used to configure and report the current pending IRR and ISR states. Unless a specific vector is explicitly included in EOI_EXIT_BITMAP, its EOI will not be trapped to KVM. Intel VMX automatically clears the corresponding ISR bit based on the GUEST_INTR_STATUS.SVI field.

When APICv is deactivated at runtime, the VM_ENTRY_INTR_INFO_FIELD is used to specify the next interrupt vector to invoke upon VM-entry. The VMX IDT_VECTORING_INFO_FIELD is used to report un-invoked vectors on VM-exit. EOIs are always trapped to KVM, so the software can manually clear pending ISR bits.

There are scenarios where, with APICv activated at runtime, a guest-issued EOI may not be able to clear the pending ISR bit. Taking vector 236 as an example, here is one scenario.

1. Suppose APICv is inactive. Vector 236 is pending in the IRR.
2. To handle KVM_REQ_EVENT, KVM moves vector 236 from the IRR to the ISR, and configures the VM_ENTRY_INTR_INFO_FIELD via vmx_inject_irq().
3. After VM-entry, vector 236 is invoked through the guest IDT. At this point, the data in VM_ENTRY_INTR_INFO_FIELD is no longer valid. The guest interrupt handler for vector 236 is invoked.
4. Suppose a VM exit occurs very early in the guest interrupt handler, before the EOI is issued.
5. Nothing is reported through the IDT_VECTORING_INFO_FIELD because vector 236 has already been invoked in the guest.
6. Now, suppose APICv is activated. Before the next VM-entry, KVM calls kvm_vcpu_update_apicv() to activate APICv.
7. Unfortunately, GUEST_INTR_STATUS.SVI is not configured, although vector 236 is still pending in the ISR.
8. After VM-entry, the guest finally issues the EOI for vector 236. However, because SVI is not configured, vector 236 is not cleared.
9. ISR is stalled forever on vector 236.

Here is another scenario.

1. Suppose APICv is inactive. Vector 236 is pending in the IRR.
2. To handle KVM_REQ_EVENT, KVM moves vector 236 from the IRR to the ISR, and configures the VM_ENTRY_INTR_INFO_FIELD via vmx_inject_irq().
3. A VM-exit occurs immediately after the next VM-entry. Vector 236 is not invoked through the guest IDT. Instead, it is saved to the IDT_VECTORING_INFO_FIELD during the VM-exit.
4. KVM calls kvm_queue_interrupt() to re-queue the un-invoked vector 236 into vcpu->arch.interrupt. A KVM_REQ_EVENT is requested.
5. Now, suppose APICv is activated. Before the next VM-entry, KVM calls kvm_vcpu_update_apicv() to activate APICv.
6. Although APICv is now active, KVM still uses the legacy VM_ENTRY_INTR_INFO_FIELD to re-inject vector 236. GUEST_INTR_STATUS.SVI is not configured.
7. After the next VM-entry, vector 236 is invoked through the guest IDT. Finally, an EOI occurs. However, due to the lack of GUEST_INTR_STATUS.SVI configuration, vector 236 is not cleared from the ISR.
8. ISR is stalled forever on vector 236.
Using QEMU as an example, vector 236 is stuck in the ISR forever:

  (qemu) info lapic 1
  dumping local APIC state for CPU 1
  LVT0     0x00010700 active-hi edge  masked        ExtINT (vec 0)
  LVT1     0x00010400 active-hi edge  masked        NMI
  LVTPC    0x00000400 active-hi edge                NMI
  LVTERR   0x000000fe active-hi edge                Fixed (vec 254)
  LVTTHMR  0x00010000 active-hi edge  masked        Fixed (vec 0)
  LVTT     0x000400ec active-hi edge  tsc-deadline  Fixed (vec 236)
  Timer    DCR=0x0 (divide by 2) initial_count = 0 current_count = 0
  SPIV     0x000001ff APIC enabled, focus=off, spurious vec 255
  ICR      0x000000fd physical edge de-assert no-shorthand
  ICR2     0x00000000 cpu 0 (X2APIC ID)
  ESR      0x00000000
  ISR      236
  IRR      37(level) 236

The issue isn't applicable to AMD SVM as KVM simply writes vmcb01 directly irrespective of whether L1 (vmcb01) or L2 (vmcb02) is active (unlike VMX, there is no need/cost to switch between VMCBs). In addition, APICV_INHIBIT_REASON_IRQWIN ensures AMD SVM AVIC is not activated until the last interrupt is EOI'd.

Fix the bug by configuring Intel VMX GUEST_INTR_STATUS.SVI if APICv is activated at runtime. Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://patch.msgid.link/20251110063212.34902-1-dongli.zhang@oracle.com [sean: call out that SVM writes vmcb01 directly, tweak comment] Link: https://patch.msgid.link/20251205231913.441872-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
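To make the fix concrete, below is a minimal sketch of the idea described above: when APICv is activated at runtime, reprogram GUEST_INTR_STATUS.SVI (bits 15:8 of the field) from the highest in-service vector in the vLAPIC's ISR. Helper names such as kvm_apic_highest_isr() are illustrative stand-ins for KVM internals, not the exact functions used by the patch.

  /* Sketch only: refresh SVI when APICv is (re)activated at runtime. */
  static void vmx_refresh_svi_on_apicv_activation(struct kvm_vcpu *vcpu)
  {
          int max_isr = kvm_apic_highest_isr(vcpu);   /* hypothetical accessor */
          u16 status = vmcs_read16(GUEST_INTR_STATUS);
          u8 svi = max_isr < 0 ? 0 : max_isr;

          /* SVI lives in bits 15:8; RVI (highest pending IRR) in bits 7:0. */
          if ((status >> 8) != svi)
                  vmcs_write16(GUEST_INTR_STATUS,
                               (status & 0xff) | ((u16)svi << 8));
  }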
2025-11-26  Merge tag 'kvm-x86-svm-6.19' of https://github.com/kvm-x86/linux into HEAD  Paolo Bonzini
KVM SVM changes for 6.19:

- Fix a few missing "VMCB dirty" bugs.
- Fix the worst of KVM's lack of EFER.LMSLE emulation.
- Add AVIC support for addressing 4k vCPUs in x2AVIC mode.
- Fix incorrect handling of selective CR0 writes when checking intercepts during emulation of L2 instructions.
- Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on VMRUN and #VMEXIT.
- Fix a bug where KVM could corrupt the guest code stream when re-injecting a soft interrupt if the guest patched the underlying code after the VM-Exit, e.g. when Linux patches code with a temporary INT3.
- Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to userspace, and extend KVM "support" to all policy bits that don't require any actual support from KVM.
2025-11-26  Merge tag 'kvm-x86-tdx-6.19' of https://github.com/kvm-x86/linux into HEAD  Paolo Bonzini
KVM TDX changes for 6.19:

- Overhaul the TDX code to address systemic races where KVM (acting on behalf of userspace) could inadvertently trigger lock contention in the TDX-Module, which KVM was either working around in weird, ugly ways, or was simply oblivious to (as proven by Yan tripping several KVM_BUG_ON()s with clever selftests).
- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a vCPU if creating said vCPU failed partway through.
- Fix a few sparse warnings (bad annotation, 0 != NULL).
- Use struct_size() to simplify copying capabilities to userspace.
2025-11-26  Merge tag 'kvm-x86-misc-6.19' of https://github.com/kvm-x86/linux into HEAD  Paolo Bonzini
KVM x86 misc changes for 6.19:

- Fix an async #PF bug where KVM would clear the completion queue when the guest transitioned in and out of paging mode, e.g. when handling an SMI and then returning to paged mode via RSM.
- Fix a bug where TDX would effectively corrupt user-return MSR values if the TDX Module rejects VP.ENTER and thus doesn't clobber host MSRs as expected.
- Leave the user-return notifier used to restore MSRs registered when disabling virtualization, and instead pin kvm.ko. Restoring host MSRs via IPI callback is either pointless (clean reboot) or dangerous (forced reboot) since KVM has no idea what code it's interrupting.
- Use the checked version of {get,put}_user(), as Linus wants to kill them off, and they're measurably faster on modern CPUs due to the unchecked versions containing an LFENCE.
- Fix a long-lurking bug where KVM's lack of catch-up logic for periodic APIC timers can result in a hard lockup in the host.
- Revert the periodic kvmclock sync logic now that KVM doesn't use a clocksource that's subject to NTP corrections.
- Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the latter behind CONFIG_CPU_MITIGATIONS.
- Context switch XCR0, XSS, and PKRU outside of the entry/exit fastpath, as the only reason they were handled in the fastpath was to paper over a bug in the core #MC code that has long since been fixed.
- Add emulator support for AVX MOV instructions to play nice with emulated devices whose PCI BARs guest drivers like to access with large multi-byte instructions.
2025-11-19  KVM: x86: Add x86_emulate_ops.get_xcr() callback  Paolo Bonzini
This will be necessary in order to check whether AVX is enabled. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com> Link: https://patch.msgid.link/20251114003633.60689-7-pbonzini@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-19  KVM: x86: Add a helper to dedup loading guest/host XCR0 and XSS  Binbin Wu
Add and use a helper, kvm_load_xfeatures(), to dedup the code that loads guest/host xfeatures. Opportunistically return early if X86_CR4_OSXSAVE is not set to reduce indentations. No functional change intended. Suggested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://patch.msgid.link/20251110050539.3398759-1-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
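A rough sketch of what such a dedup'd helper can look like is below; the host_*/loaded_* variables stand in for KVM's actual bookkeeping and the exact MSR-write helper name varies by kernel version, so treat this as illustrative rather than the patch itself.

  static void kvm_load_xfeatures(struct kvm_vcpu *vcpu, bool entering_guest)
  {
          u64 want_xcr0 = entering_guest ? vcpu->arch.xcr0 : host_xcr0;   /* assumed globals */
          u64 want_xss  = entering_guest ? vcpu->arch.ia32_xss : host_xss;

          /* Early return keeps the common !OSXSAVE case flat. */
          if (!kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE))
                  return;

          if (want_xcr0 != loaded_xcr0) {
                  xsetbv(XCR_XFEATURE_ENABLED_MASK, want_xcr0);
                  loaded_xcr0 = want_xcr0;
          }
          if (cpu_feature_enabled(X86_FEATURE_XSAVES) && want_xss != loaded_xss) {
                  wrmsrl(MSR_IA32_XSS, want_xss);
                  loaded_xss = want_xss;
          }
  }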
2025-11-19  KVM: x86: Load guest/host PKRU outside of the fastpath run loop  Sean Christopherson
Move KVM's swapping of PKRU outside of the fastpath loop, as there is no KVM code anywhere in the fastpath that accesses guest/userspace memory, i.e. that can consume protection keys. As documented by commit 1be0e61c1f25 ("KVM, pkeys: save/restore PKRU when guest/host switches"), KVM just needs to ensure the host's PKRU is loaded when KVM (or the kernel at-large) may access userspace memory. And at the time of commit 1be0e61c1f25, KVM didn't have a fastpath, and PKU was strictly contained to VMX, i.e. there was no reason to swap PKRU outside of vmx_vcpu_run(). Over time, the "need" to swap PKRU close to VM-Enter was likely falsely solidified by the association with XFEATUREs in commit 37486135d3a7 ("KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c"), and XFEATURE swapping was in turn moved close to VM-Enter/VM-Exit as a KVM hack-a-fix solution for an #MC handler bug by commit 1811d979c716 ("x86/kvm: move kvm_load/put_guest_xcr0 into atomic context"). Deferring the PKRU loads shaves ~40 cycles off the fastpath for Intel, and ~60 cycles for AMD. E.g. using INVD in KVM-Unit-Test's vmexit.c, with extra hacks to enable CR4.PKE and PKRU=(-1u & ~0x3), latency numbers for AMD Turin go from ~1560 => ~1500, and for Intel Emerald Rapids, go from ~810 => ~770. Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Jon Kohler <jon@nutanix.com> Link: https://patch.msgid.link/20251118222328.2265758-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
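For reference, the PKRU swap being relocated looks roughly like the snippet below (approximating the existing logic in kvm_load_guest_xsave_state()); the change is purely about where this runs, i.e. outside the fastpath loop rather than inside it.

  /* Load the guest's PKRU only if protection keys can actually be consumed. */
  if (cpu_feature_enabled(X86_FEATURE_PKU) &&
      vcpu->arch.pkru != vcpu->arch.host_pkru &&
      ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
       kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE)))
          write_pkru(vcpu->arch.pkru);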
2025-11-19  KVM: x86: Load guest/host XCR0 and XSS outside of the fastpath run loop  Sean Christopherson
Move KVM's swapping of XFEATURE masks, i.e. XCR0 and XSS, out of the fastpath loop now that the guts of the #MC handler runs in task context, i.e. won't invoke schedule() with preemption disabled and clobber state (or crash the kernel) due to trying to context switch XSTATE with a mix of host and guest state.

For all intents and purposes, this reverts commit 1811d979c716 ("x86/kvm: move kvm_load/put_guest_xcr0 into atomic context"), which papered over an egregious bug/flaw in the #MC handler where it would do schedule() even though IRQs are disabled. E.g. the call stack from the commit:

  kvm_load_guest_xcr0
  ...
  kvm_x86_ops->run(vcpu)
    vmx_vcpu_run
    vmx_complete_atomic_exit
    kvm_machine_check
    do_machine_check
    do_memory_failure
    memory_failure
    lock_page

Commit 1811d979c716 "fixed" the immediate issue of XRSTORS exploding, but completely ignored that scheduling out a vCPU task while IRQs and preemption are disabled is wildly broken. Thankfully, commit 5567d11c21a1 ("x86/mce: Send #MC singal from task work") (somewhat incidentally?) fixed that flaw by pushing the meat of the work to the user-return path, i.e. to task context. KVM has also hardened itself against #MC goofs by moving #MC forwarding to kvm_x86_ops.handle_exit_irqoff(), i.e. out of the fastpath. While that's by no means a robust fix, restoring as much state as possible before handling the #MC will hopefully provide some measure of protection in the event that #MC handling goes off the rails again.

Note, KVM always intercepts XCR0 writes for vCPUs without protected state, e.g. there's no risk of consuming a stale XCR0 when determining if a PKRU update is needed; kvm_load_host_xfeatures() only reads, and never writes, vcpu->arch.xcr0. Deferring the XCR0 and XSS loads shaves ~300 cycles off the fastpath for Intel, and ~500 cycles for AMD. E.g. using INVD in KVM-Unit-Test's vmexit.c, with an extra hack to enable CR4.OSXSAVE, latency numbers for AMD Turin go from ~2000 => ~1500, and for Intel Emerald Rapids, go from ~1300 => ~1000. Cc: Jon Kohler <jon@nutanix.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Jon Kohler <jon@nutanix.com> Link: https://patch.msgid.link/20251118222328.2265758-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-18  KVM: x86: Unify L1TF flushing under per-CPU variable  Brendan Jackman
Currently the need to flush L1D for L1TF is tracked by two bits: one per-CPU and one per-vCPU. The per-vCPU bit is always set when the vCPU shows up on a core, so there is no interesting state that's truly per-vCPU. Indeed, this is a requirement, since L1D is a part of the physical CPU. So simplify this by combining the two bits. The vCPU bit was being written from preemption-enabled regions. To play nice with those cases, wrap all calls from KVM and use a raw write so that requesting a flush with preemption enabled doesn't trigger what would effectively be DEBUG_PREEMPT false positives. Preemption doesn't need to be disabled, as kvm_arch_vcpu_load() will mark the new CPU as needing a flush if the vCPU task is migrated, or if userspace runs the vCPU on a different task. Signed-off-by: Brendan Jackman <jackmanb@google.com> [sean: put raw write in KVM instead of in a hardirq.h variant] Link: https://patch.msgid.link/20251113233746.1703361-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-18  KVM: x86: Allocate/free user_return_msrs at kvm.ko (un)loading time  Chao Gao
Move user_return_msrs allocation/free from vendor modules (kvm-intel.ko and kvm-amd.ko) (un)loading time to kvm.ko's to make it less risky to access user_return_msrs in kvm.ko. Tying the lifetime of user_return_msrs to vendor modules makes every access to user_return_msrs prone to use-after-free issues as vendor modules may be unloaded at any time. Opportunistically turn the per-CPU variable into full structs, as there's no practical difference between statically allocating the memory and allocating it unconditionally during module_init(). Zero out kvm_nr_uret_msrs on vendor module exit to further minimize the chances of consuming stale data, and WARN on vendor module load if KVM thinks there are existing user-return MSRs. Note! The user-return MSRs also need to be "destroyed" if ops->hardware_setup() fails, as both SVM and VMX expect common KVM to clean up (because common code, not vendor code, is responsible for kvm_nr_uret_msrs). Signed-off-by: Chao Gao <chao.gao@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20251108013601.902918-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-17  KVM: x86: remove comment about ntp correction sync for  Lei Chen
Since the vCPU's local clock is no longer affected by NTP, remove the comment about NTP correction sync for kvm_gen_kvmclock_update(). Signed-off-by: Lei Chen <lei.chen@smartx.com> Link: https://patch.msgid.link/20250819152027.1687487-4-lei.chen@smartx.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-17  Revert "x86: kvm: rate-limit global clock updates"  Lei Chen
This reverts commit 7e44e4495a398eb553ce561f29f9148f40a3448f. Commit 7e44e4495a39 ("x86: kvm: rate-limit global clock updates") intends to use kvmclock_update_work to sync NTP correction across all vCPUs' kvmclocks, and is based on commit 0061d53daf26f ("KVM: x86: limit difference between kvmclock updates"). Since kvmclock has been switched to the monotonic raw clock, this commit can be reverted. Signed-off-by: Lei Chen <lei.chen@smartx.com> Link: https://patch.msgid.link/20250819152027.1687487-3-lei.chen@smartx.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-17  Revert "x86: kvm: introduce periodic global clock updates"  Lei Chen
This reverts commit 332967a3eac06f6379283cf155c84fe7cd0537c2. Commit 332967a3eac0 ("x86: kvm: introduce periodic global clock updates") introduced a 300s interval work to sync NTP corrections across all vCPUs. Since commit 53fafdbb8b21 ("KVM: x86: switch KVMCLOCK base to monotonic raw clock") switched kvmclock to the monotonic raw clock, NTP no longer needs to be taken into consideration. Signed-off-by: Lei Chen <lei.chen@smartx.com> Link: https://patch.msgid.link/20250819152027.1687487-2-lei.chen@smartx.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-13  KVM: SVM: Don't skip unrelated instruction if INT3/INTO is replaced  Omar Sandoval
When re-injecting a soft interrupt from an INT3, INTO, or (select) INTn instruction, discard the exception and retry the instruction if the code stream is changed (e.g. by a different vCPU) between when the CPU executes the instruction and when KVM decodes the instruction to get the next RIP. As effectively predicted by commit 6ef88d6e36c2 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction"), failure to verify that the correct INTn instruction was decoded can effectively clobber guest state due to decoding the wrong instruction and thus specifying the wrong next RIP. The bug most often manifests as "Oops: int3" panics on static branch checks in Linux guests. Enabling or disabling a static branch in Linux uses the kernel's "text poke" code patching mechanism. To modify code while other CPUs may be executing that code, Linux (temporarily) replaces the first byte of the original instruction with an int3 (opcode 0xcc), then patches in the new code stream except for the first byte, and finally replaces the int3 with the first byte of the new code stream. If a CPU hits the int3, i.e. executes the code while it's being modified, then the guest kernel must look up the RIP to determine how to handle the #BP, e.g. by emulating the new instruction. If the RIP is incorrect, then this lookup fails and the guest kernel panics. The bug reproduces almost instantly by hacking the guest kernel to repeatedly check a static branch[1] while running a drgn script[2] on the host to constantly swap out the memory containing the guest's TSS. [1]: https://gist.github.com/osandov/44d17c51c28c0ac998ea0334edf90b5a [2]: https://gist.github.com/osandov/10e45e45afa29b11e0c7209247afc00b Fixes: 6ef88d6e36c2 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction") Cc: stable@vger.kernel.org Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Link: https://patch.msgid.link/1cc6dcdf36e3add7ee7c8d90ad58414eeb6c3d34.1762278762.git.osandov@fb.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-07  KVM: x86: Don't disable IRQs when unregistering user-return notifier  Hou Wenlong
Remove the code to disable IRQs when unregistering KVM's user-return notifier now that KVM doesn't invoke kvm_on_user_return() when disabling virtualization via IPI function call, i.e. now that there's no need to guard against re-entrancy via IPI callback. Note, disabling IRQs has largely been unnecessary since commit a377ac1cd9d7b ("x86/entry: Move user return notifier out of loop") moved fire_user_return_notifiers() into the section with IRQs disabled. In doing so, the commit somewhat inadvertently fixed the underlying issue that was papered over by commit 1650b4ebc99d ("KVM: Disable irq while unregistering user notifier"). I.e. in practice, the code and comment have been stale since commit a377ac1cd9d7b. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> [sean: rewrite changelog after rebasing, drop lockdep assert] Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030191528.3380553-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-07  KVM: x86: Leave user-return notifier registered on reboot/shutdown  Sean Christopherson
Leave KVM's user-return notifier registered in the unlikely case that the notifier is registered when disabling virtualization via IPI callback in response to reboot/shutdown. On reboot/shutdown, keeping the notifier registered is ok as far as MSR state is concerned (arguably better than restoring MSRs at an unknown point in time), as the callback will run cleanly and restore host MSRs if the CPU manages to return to userspace before the system goes down. The only wrinkle is that if kvm.ko module unload manages to race with reboot/shutdown, then leaving the notifier registered could lead to use-after-free due to calling into unloaded kvm.ko module code. But such a race is only possible on --forced reboot/shutdown, because otherwise userspace tasks would be frozen before kvm_shutdown() is called, i.e. on a "normal" reboot/shutdown, it should be impossible for the CPU to return to userspace after kvm_shutdown(). Furthermore, on a --forced reboot/shutdown, unregistering the user-return hook from IRQ context doesn't fully guard against use-after-free, because KVM could immediately re-register the hook, e.g. if the IRQ arrives before kvm_user_return_register_notifier() is called. Rather than trying to guard against the IPI in the "normal" user-return code, which is difficult and noisy, simply leave the user-return notifier registered on a reboot, and bump the kvm.ko module refcount to defend against a use-after-free due to kvm.ko unload racing against reboot. Alternatively, KVM could allow unloading kvm.ko and try to drop the notifiers during kvm_x86_exit(), but that's also a can of worms as registration is per-CPU, and so KVM would need to blast an IPI, and doing so while a reboot/shutdown is in-progress is far riskier than preventing userspace from unloading KVM. Link: https://patch.msgid.link/20251030191528.3380553-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-07  KVM: x86: WARN if user-return MSR notifier is registered on exit  Sean Christopherson
When freeing the per-CPU user-return MSRs structures, WARN if any CPU has a registered notifier to help detect and/or debug potential use-after-free issues. The lifecycle of the notifiers is rather convoluted, and has several non-obvious paths where notifiers are unregistered, i.e. isn't exactly the most robust code possible. The notifiers are registered on-demand in KVM, on the first WRMSR to a tracked register. _Usually_ the notifier is unregistered whenever the CPU returns to userspace. But because any given CPU isn't guaranteed to return to userspace, e.g. the CPU could be offlined before doing so, KVM also "drops", a.k.a. unregisters, the notifiers when virtualization is disabled on the CPU. Further complicating the unregister path is the fact that the calls to disable virtualization come from common KVM, and the per-CPU calls are guarded by a per-CPU flag (to harden _that_ code against bugs, e.g. due to mishandling reboot). Reboot/shutdown in particular is problematic, as KVM disables virtualization via IPI function call, i.e. from IRQ context, instead of using the cpuhp framework, which runs in task context. I.e. on reboot/shutdown, drop_user_return_notifiers() is called asynchronously. Forced reboot/shutdown is the most problematic scenario, as userspace tasks are not frozen before kvm_shutdown() is invoked, i.e. KVM could be actively manipulating the user-return MSR lists and/or notifiers when the IPI arrives. To a certain extent, all bets are off when userspace forces a reboot/shutdown, but KVM should at least avoid a use-after-free, e.g. to avoid crashing the kernel when trying to reboot. Link: https://patch.msgid.link/20251030191528.3380553-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-07  KVM: TDX: Explicitly set user-return MSRs that *may* be clobbered by the TDX-Module  Sean Christopherson
Set all user-return MSRs to their post-TD-exit value when preparing to run a TDX vCPU to ensure the value that KVM expects to be loaded after running the vCPU is indeed the value that's loaded in hardware. If the TDX-Module doesn't actually enter the guest, i.e. doesn't do VM-Enter, then it won't "restore" VMM state, i.e. won't clobber user-return MSRs to their expected post-run values, in which case simply updating KVM's "cached" value will effectively corrupt the cache due to hardware still holding the original value. In theory, KVM could conditionally update the current user-return value if and only if tdh_vp_enter() succeeds, but in practice "success" doesn't guarantee the TDX-Module actually entered the guest, e.g. if the TDX-Module synthesizes an EPT Violation because it suspects a zero-step attack. Force-load the expected values instead of trying to decipher whether or not the TDX-Module restored/clobbered MSRs, as the risk doesn't justify the benefits. Effectively avoiding four WRMSRs once per run loop (even if the vCPU is scheduled out, user-return MSRs only need to be reloaded if the CPU exits to userspace or runs a non-TDX vCPU) is likely in the noise when amortized over all entries, given the cost of running a TDX vCPU. E.g. the cost of the WRMSRs is somewhere between ~300 and ~500 cycles, whereas the cost of a _single_ roundtrip to/from a TDX guest is thousands of cycles. Fixes: e0b4f31a3c65 ("KVM: TDX: restore user ret MSRs") Cc: stable@vger.kernel.org Cc: Yan Zhao <yan.y.zhao@intel.com> Cc: Xiaoyao Li <xiaoyao.li@intel.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://patch.msgid.link/20251030191528.3380553-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
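A sketch of the "force-load" approach described above, under the assumption that the TDX code keeps a small table of user-return MSRs with their expected post-TD-exit values; the table layout and function name are illustrative, only kvm_set_user_return_msr() is an existing KVM helper.

  struct tdx_uret_msr {
          u32 msr;
          int slot;       /* slot returned by kvm_add_user_return_msr() */
          u64 defval;     /* architectural post-TD-exit value */
  };

  static struct tdx_uret_msr tdx_uret_msrs[4];    /* assumed table, populated at init */

  static void tdx_prepare_user_return_msrs(void)  /* hypothetical name */
  {
          int i;

          /*
           * Write the hardware MSRs (and register the user-return notifier)
           * before VP.ENTER, so KVM's cached values are correct even if the
           * TDX-Module never actually enters the guest.
           */
          for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
                  kvm_set_user_return_msr(tdx_uret_msrs[i].slot,
                                          tdx_uret_msrs[i].defval, -1ull);
  }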
2025-11-07  KVM: x86: Don't clear async #PF queue when CR0.PG is disabled (e.g. on #SMI)  Maxim Levitsky
Fix an interaction between SMM and PV asynchronous #PFs where an #SMI can cause KVM to drop an async #PF ready event, and thus result in guest tasks becoming permanently stuck due to the task that encountered the #PF never being resumed. Specifically, don't clear the completion queue when paging is disabled, and re-check for completed async #PFs if/when paging is enabled. Prior to commit 2635b5c4a0e4 ("KVM: x86: interrupt based APF 'page ready' event delivery"), flushing the APF queue without notifying the guest of completed APF requests when paging is disabled was "necessary", in that delivering a #PF to the guest when paging is disabled would likely confuse and/or crash the guest. And presumably the original async #PF development assumed that a guest would only disable paging when there was no intent to ever re-enable paging. That assumption fails in several scenarios, most visibly on an emulated SMI, as entering SMM always disables CR0.PG (i.e. initially runs with paging disabled). When the SMM handler eventually executes RSM, the interrupted paging-enabled state is restored, and the async #PF event is lost. Similarly, invoking firmware, e.g. via EFI runtime calls, might require a transition through paging modes and thus also disable paging with valid entries in the completion queue. To avoid dropping completion events, drop the "clear" entirely, and handle paging-enable transitions in the same way KVM already handles APIC enable/disable events: if a vCPU's APIC is disabled, APF completion events are kept pending but not injected while the APIC is disabled. Once a vCPU's APIC is re-enabled, KVM raises KVM_REQ_APF_READY so that the vCPU recognizes any pending #APF ready events. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20251015033258.50974-4-mlevitsk@redhat.com [sean: rework changelog to call out #PF injection, drop "real mode" references, expand the code comment] Signed-off-by: Sean Christopherson <seanjc@google.com>
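The shape of the fix, as described, is roughly the following: keep the completion queue intact when CR0.PG is cleared and simply re-evaluate "page ready" delivery when paging comes back. The hook name and placement below are illustrative; KVM_REQ_APF_READY and kvm_make_request() are the existing mechanisms.

  /* Sketch: called on CR0 writes (and RSM) instead of flushing the queue. */
  static void kvm_async_pf_paging_transition(struct kvm_vcpu *vcpu,
                                             unsigned long old_cr0,
                                             unsigned long new_cr0)
  {
          /* Paging turned back on: re-check for completed async #PFs. */
          if (!(old_cr0 & X86_CR0_PG) && (new_cr0 & X86_CR0_PG))
                  kvm_make_request(KVM_REQ_APF_READY, vcpu);

          /* Paging turned off: do nothing, completed events stay queued. */
  }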
2025-11-07  KVM: x86: Fix a semi theoretical bug in kvm_arch_async_page_present_queued()  Maxim Levitsky
Fix a semi-theoretical race condition related to a lack of memory barriers when dealing with vcpu->arch.apf.pageready_pending. In theory, the "ready" side could see a stale pageready_pending and neglect to kick the vCPU, and thus allow the vCPU to enter the guest with a pending KVM_REQ_APF_READY and no kick/IPI on the way, in which case KVM would fail to deliver a completed async #PF event to the guest in a timely manner as the request would be recognized only on the next (coincidental) VM-Exit.

kvm_arch_async_page_present_queued() running in workqueue context:

  kvm_make_request(KVM_REQ_APF_READY, vcpu);
  /* memory barrier is missing here */
  if (!vcpu->arch.apf.pageready_pending)
          kvm_vcpu_kick(vcpu);

kvm_set_msr_common() running in task context:

  vcpu->arch.apf.pageready_pending = false;
  /* memory barrier is missing here */

And later, vcpu_enter_guest() running in task context:

  if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
          kvm_check_async_pf_completion(vcpu)

Add missing full memory barriers in both cases to avoid the theoretical case of not kicking the vCPU thread. Note that the bug is mostly theoretical because kvm_make_request() uses an atomic operation, which is always serializing on x86, requiring only for documentation purposes the smp_mb__after_atomic() after it (smp_mb__after_atomic() is a NOP on x86). The second missing barrier, between kvm_set_msr_common() and vcpu_enter_guest(), isn't strictly needed because KVM executes several barriers in between calling these functions, however it still makes sense to have an explicit barrier to be on the safe side and to document the ordering dependencies. Finally, also use READ_ONCE/WRITE_ONCE. Thanks a lot to Paolo for the help with this patch. Link: https://lore.kernel.org/all/7c7a5a75-a786-4a05-a836-4368582ca4c2@redhat.com Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://patch.msgid.link/20251015033258.50974-3-mlevitsk@redhat.com [sean: explain the race and its impact in more detail] Signed-off-by: Sean Christopherson <seanjc@google.com>
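Put together, the fixed pairing looks roughly like the following (a sketch that mirrors the skeleton quoted above, with the two barriers and READ_ONCE/WRITE_ONCE added):

  /* Workqueue ("ready") side: */
  kvm_make_request(KVM_REQ_APF_READY, vcpu);
  /* Pairs with the barrier on the vCPU side; a NOP after an atomic RMW on x86. */
  smp_mb__after_atomic();
  if (!READ_ONCE(vcpu->arch.apf.pageready_pending))
          kvm_vcpu_kick(vcpu);

  /* vCPU side, in kvm_set_msr_common(), when the guest acks "page ready": */
  WRITE_ONCE(vcpu->arch.apf.pageready_pending, false);
  /* Ensure the cleared flag is visible before KVM_REQ_APF_READY is
   * re-checked in vcpu_enter_guest(). */
  smp_mb();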
2025-11-05  KVM: TDX: Convert INIT_MEM_REGION and INIT_VCPU to "unlocked" vCPU ioctl  Sean Christopherson
Handle the KVM_TDX_INIT_MEM_REGION and KVM_TDX_INIT_VCPU vCPU sub-ioctls in the unlocked variant, i.e. outside of vcpu->mutex, in anticipation of taking kvm->lock along with all other vCPU mutexes, at which point the sub-ioctls _must_ start without vcpu->mutex held. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-24-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: Rename kvm_arch_vcpu_async_ioctl() to kvm_arch_vcpu_unlocked_ioctl()  Sean Christopherson
Rename the "async" ioctl API to "unlocked" so that upcoming usage in x86's TDX code doesn't result in a massive misnomer. To avoid having to retry SEAMCALLs, TDX needs to acquire kvm->lock *and* all vcpu->mutex locks, and acquiring all of those locks after/inside the current vCPU's mutex is a non-starter. However, TDX also needs to acquire the vCPU's mutex and load the vCPU, i.e. the handling is very much not async to the vCPU. No functional change intended. Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: Make support for kvm_arch_vcpu_async_ioctl() mandatory  Sean Christopherson
Implement kvm_arch_vcpu_async_ioctl() "natively" in x86 and arm64 instead of relying on an #ifdef'd stub, and drop HAVE_KVM_VCPU_ASYNC_IOCTL in anticipation of using the API on x86. Once x86 uses the API, providing a stub for one architecture and having all other architectures opt-in requires more code than simply implementing the API in the lone holdout. Eliminating the Kconfig will also reduce churn if the API is renamed in the future (spoiler alert). No functional change intended. Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-04  KVM: x86: Add a helper to dedup reporting of unhandled VM-Exits  Sean Christopherson
Add and use a helper, kvm_prepare_unexpected_reason_exit(), to dedup the code that fills the exit reason and CPU when KVM encounters a VM-Exit that KVM doesn't know how to handle. Reviewed-by: yaoyuan@linux.alibaba.com Reviewed-by: Yao Yuan <yaoyuan@linux.alibaba.com> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030185004.3372256-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
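A sketch of what such a helper plausibly looks like, assuming the usual KVM_EXIT_INTERNAL_ERROR layout KVM already uses for unexpected exit reasons (exit reason in data[0], last VM-Entry CPU in data[1]); the exact signature is illustrative.

  static void kvm_prepare_unexpected_reason_exit(struct kvm_vcpu *vcpu,
                                                 u64 exit_reason)
  {
          struct kvm_run *run = vcpu->run;

          run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
          run->internal.suberror = KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON;
          run->internal.ndata = 2;
          run->internal.data[0] = exit_reason;
          run->internal.data[1] = vcpu->arch.last_vmentry_cpu;
  }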
2025-11-04  KVM: x86: Call out MSR_IA32_S_CET is not handled by XSAVES  Chao Gao
Update the comment above is_xstate_managed_msr() to note that MSR_IA32_S_CET isn't saved/restored by XSAVES/XRSTORS. MSR_IA32_S_CET isn't part of CET_U/S state as the SDM states: The register state used by Control-Flow Enforcement Technology (CET) comprises the two 64-bit MSRs (IA32_U_CET and IA32_PL3_SSP) that manage CET when CPL = 3 (CET_U state); and the three 64-bit MSRs (IA32_PL0_SSP–IA32_PL2_SSP) that manage CET when CPL < 3 (CET_S state). Opportunistically shift the snippet about the safety of loading certain MSRs to the function comment for kvm_access_xstate_msr(), which is where the MSRs are actually loaded into hardware. Fixes: e44eb58334bb ("KVM: x86: Load guest FPU state when access XSAVE-managed MSRs") Signed-off-by: Chao Gao <chao.gao@intel.com> Link: https://patch.msgid.link/20251028060142.29830-1-chao.gao@intel.com [sean: shift snippet about safety to kvm_access_xstate_msr()] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-04  KVM: x86: Harden KVM against imbalanced load/put of guest FPU state  Sean Christopherson
Assert, via KVM_BUG_ON(), that guest FPU state isn't/is in use when loading/putting the FPU to help detect KVM bugs without needing an assist from KASAN. If an imbalanced load/put is detected, skip the redundant load/put to avoid clobbering guest state and/or crashing the host. Note, kvm_access_xstate_msr() already provides a similar assertion. Reviewed-by: Yao Yuan <yaoyuan@linux.alibaba.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://patch.msgid.link/20251030185802.3375059-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-04  KVM: x86: Unload "FPU" state on INIT if and only if it's currently in-use  Sean Christopherson
Replace the hack added by commit f958bd2314d1 ("KVM: x86: Fix potential put_fpu() w/o load_fpu() on MPX platform") with a more robust approach of unloading+reloading guest FPU state based on whether or not the vCPU's FPU is currently in-use, i.e. currently loaded. This fixes a bug on hosts that support CET but not MPX, where kvm_arch_vcpu_ioctl_get_mpstate() neglects to load FPU state (it only checks for MPX support) and leads to KVM attempting to put FPU state due to kvm_apic_accept_events() triggering INIT emulation. E.g. on a host with CET but not MPX, syzkaller+KASAN generates:

  Oops: general protection fault, probably for non-canonical address 0xdffffc0000000004: 0000 [#1] SMP KASAN NOPTI
  KASAN: null-ptr-deref in range [0x0000000000000020-0x0000000000000027]
  CPU: 211 UID: 0 PID: 20451 Comm: syz.9.26 Tainted: G S 6.18.0-smp-DEV #7 NONE
  Tainted: [S]=CPU_OUT_OF_SPEC
  Hardware name: Google Izumi/izumi, BIOS 0.20250729.1-0 07/29/2025
  RIP: 0010:fpu_swap_kvm_fpstate+0x3ce/0x610 ../arch/x86/kernel/fpu/core.c:377
  RSP: 0018:ff1100410c167cc0 EFLAGS: 00010202
  RAX: 0000000000000004 RBX: 0000000000000020 RCX: 00000000000001aa
  RDX: 00000000000001ab RSI: ffffffff817bb960 RDI: 0000000022600000
  RBP: dffffc0000000000 R08: ff110040d23c8007 R09: 1fe220081a479000
  R10: dffffc0000000000 R11: ffe21c081a479001 R12: ff110040d23c8d98
  R13: 00000000fffdc578 R14: 0000000000000000 R15: ff110040d23c8d90
  FS:  00007f86dd1876c0(0000) GS:ff11007fc969b000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f86dd186fa8 CR3: 00000040d1dfa003 CR4: 0000000000f73ef0
  PKRU: 80000000
  Call Trace:
   <TASK>
   kvm_vcpu_reset+0x80d/0x12c0 ../arch/x86/kvm/x86.c:11818
   kvm_apic_accept_events+0x1cb/0x500 ../arch/x86/kvm/lapic.c:3489
   kvm_arch_vcpu_ioctl_get_mpstate+0xd0/0x4e0 ../arch/x86/kvm/x86.c:12145
   kvm_vcpu_ioctl+0x5e2/0xed0 ../virt/kvm/kvm_main.c:4539
   __se_sys_ioctl+0x11d/0x1b0 ../fs/ioctl.c:51
   do_syscall_x64 ../arch/x86/entry/syscall_64.c:63 [inline]
   do_syscall_64+0x6e/0x940 ../arch/x86/entry/syscall_64.c:94
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f86de71d9c9
   </TASK>

with a very simple reproducer:

  r0 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000000), 0x80b00, 0x0)
  r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
  ioctl$KVM_CREATE_IRQCHIP(r1, 0xae60)
  r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0)
  ioctl$KVM_SET_IRQCHIP(r1, 0x8208ae63, ...)
  ioctl$KVM_GET_MP_STATE(r2, 0x8004ae98, &(0x7f00000000c0))

Alternatively, the MPX hack in GET_MP_STATE could be extended to cover CET, but from a "don't break existing functionality" perspective, that isn't any less risky than peeking at the state of in_use, and it's far less robust for a long term solution (as evidenced by this bug). Reported-by: Alexander Potapenko <glider@google.com> Fixes: 69cc3e886582 ("KVM: x86: Add XSS support for CET_KERNEL and CET_USER") Reviewed-by: Yao Yuan <yaoyuan@linux.alibaba.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://patch.msgid.link/20251030185802.3375059-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-18  Merge tag 'kvm-x86-fixes-6.18-rc2' of https://github.com/kvm-x86/linux into HEAD  Paolo Bonzini
KVM x86 fixes for 6.18:

- Expand the KVM_PRE_FAULT_MEMORY selftest to add a regression test for the bug fixed by commit 3ccbf6f47098 ("KVM: x86/mmu: Return -EAGAIN if userspace deletes/moves memslot during prefault").
- Don't try to get PMU capabilities from perf when running on a CPU with hybrid CPUs/PMUs, as perf will rightly WARN.
- Rework KVM_CAP_GUEST_MEMFD_MMAP (newly introduced in 6.18) into a more generic KVM_CAP_GUEST_MEMFD_FLAGS.
- Add a guest_memfd INIT_SHARED flag and require userspace to explicitly set said flag to initialize memory as SHARED, irrespective of MMAP. The behavior merged in 6.18 is that enabling mmap() implicitly initializes memory as SHARED, which would result in an ABI collision for x86 CoCo VMs as their memory is currently always initialized PRIVATE.
- Allow mmap() on guest_memfd for x86 CoCo VMs, i.e. on VMs with private memory, to enable testing such setups, i.e. to hopefully flush out any other lurking ABI issues before 6.18 is officially released.
- Add testcases to the guest_memfd selftest to cover guest_memfd without MMAP, and host userspace accesses to mmap()'d private memory.
2025-10-10  KVM: guest_memfd: Allow mmap() on guest_memfd for x86 VMs with private memory  Sean Christopherson
Allow mmap() on guest_memfd instances for x86 VMs with private memory as the need to track private vs. shared state in the guest_memfd instance is only pertinent to INIT_SHARED. Doing mmap() on private memory isn't terribly useful (yet!), but it's now possible, and will be desirable when guest_memfd gains support for other VMA-based syscalls, e.g. mbind() to set NUMA policy. Lift the restriction now, before MMAP support is officially released, so that KVM doesn't need to add another capability to enumerate support for mmap() on private memory. Fixes: 3d3a04fad25a ("KVM: Allow and advertise support for host mmap() on guest_memfd files") Reviewed-by: Ackerley Tng <ackerleytng@google.com> Tested-by: Ackerley Tng <ackerleytng@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20251003232606.4070510-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-07  Merge tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux  Linus Torvalds
Pull hyperv updates from Wei Liu:

- Unify guest entry code for KVM and MSHV (Sean Christopherson)
- Switch Hyper-V MSI domain to use msi_create_parent_irq_domain() (Nam Cao)
- Add CONFIG_HYPERV_VMBUS and limit the semantics of CONFIG_HYPERV (Mukesh Rathor)
- Add kexec/kdump support on Azure CVMs (Vitaly Kuznetsov)
- Deprecate hyperv_fb in favor of Hyper-V DRM driver (Prasanna Kumar T S M)
- Miscellaneous enhancements, fixes and cleanups (Abhishek Tiwari, Alok Tiwari, Nuno Das Neves, Wei Liu, Roman Kisel, Michael Kelley)

* tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  hyperv: Remove the spurious null directive line
  MAINTAINERS: Mark hyperv_fb driver Obsolete
  fbdev/hyperv_fb: deprecate this in favor of Hyper-V DRM driver
  Drivers: hv: Make CONFIG_HYPERV bool
  Drivers: hv: Add CONFIG_HYPERV_VMBUS option
  Drivers: hv: vmbus: Fix typos in vmbus_drv.c
  Drivers: hv: vmbus: Fix sysfs output format for ring buffer index
  Drivers: hv: vmbus: Clean up sscanf format specifier in target_cpu_store()
  x86/hyperv: Switch to msi_create_parent_irq_domain()
  mshv: Use common "entry virt" APIs to do work in root before running guest
  entry: Rename "kvm" entry code assets to "virt" to genericize APIs
  entry/kvm: KVM: Move KVM details related to signal/-EINTR into KVM proper
  mshv: Handle NEED_RESCHED_LAZY before transferring to guest
  x86/hyperv: Add kexec/kdump support on Azure CVMs
  Drivers: hv: Simplify data structures for VMBus channel close message
  Drivers: hv: util: Cosmetic changes for hv_utils_transport.c
  mshv: Add support for a new parent partition configuration
  clocksource: hyper-v: Skip unnecessary checks for the root partition
  hyperv: Add missing field to hv_output_map_device_interrupt
2025-09-30  entry/kvm: KVM: Move KVM details related to signal/-EINTR into KVM proper  Sean Christopherson
Move KVM's morphing of pending signals into userspace exits into KVM proper, and drop the @vcpu param from xfer_to_guest_mode_handle_work(). How KVM responds to -EINTR is a detail that really belongs in KVM itself, and invoking kvm_handle_signal_exit() from kernel code creates an inverted module dependency. E.g. attempting to move kvm_handle_signal_exit() into kvm_main.c would generate a linker error when building kvm.ko as a module. Dropping KVM details will also ease converting the KVM "entry" code into a more generic virtualization framework so that it can be used when running as a Hyper-V root partition. Lastly, eliminating usage of "struct kvm_vcpu" outside of KVM is also nice to have for KVM x86 developers, as keeping the details of kvm_vcpu purely within KVM allows changing the layout of the structure without having to boot into a new kernel, e.g. allows rebuilding and reloading kvm.ko with a modified kvm_vcpu structure as part of debug/development. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-30  KVM: x86: Export KVM-internal symbols for sub-modules only  Sean Christopherson
Rework almost all of KVM x86's exports to expose symbols only to KVM's vendor modules, i.e. to kvm-{amd,intel}.ko. Keep the generic exports that are guarded by CONFIG_KVM_EXTERNAL_WRITE_TRACKING=y, as they're explicitly designed/intended for external usage. Link: https://lore.kernel.org/r/20250919003303.1355064-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-09-30  KVM: x86: Drop pointless exports of kvm_arch_xxx() hooks  Sean Christopherson
Drop the exporting of several kvm_arch_xxx() hooks that are only called from arch-neutral code, i.e. that are only called from kvm.ko. Link: https://lore.kernel.org/r/20250919003303.1355064-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-09-30  Merge tag 'kvm-x86-cet-6.18' of https://github.com/kvm-x86/linux into HEAD  Paolo Bonzini
KVM x86 CET virtualization support for 6.18

Add support for virtualizing Control-flow Enforcement Technology (CET) on Intel (Shadow Stacks and Indirect Branch Tracking) and AMD (Shadow Stacks).

CET is comprised of two distinct features, Shadow Stacks (SHSTK) and Indirect Branch Tracking (IBT), that can be utilized by software to help provide Control-flow integrity (CFI). SHSTK defends against backward-edge attacks (a.k.a. Return-oriented programming (ROP)), while IBT defends against forward-edge attacks (a.k.a. CALL/JMP-oriented programming (COP/JOP)). Attackers commonly use ROP and COP/JOP methodologies to redirect the control-flow to unauthorized targets in order to execute small snippets of code, a.k.a. gadgets, of the attacker's choice. By chaining together several gadgets, an attacker can perform arbitrary operations and circumvent the system's defenses.

SHSTK defends against backward-edge attacks, which execute gadgets by modifying the stack to branch to the attacker's target via RET, by providing a second stack that is used exclusively to track control transfer operations. The shadow stack is separate from the data/normal stack, and can be enabled independently in user and kernel mode. When SHSTK is enabled, CALL instructions push the return address on both the data and shadow stack. RET then pops the return address from both stacks and compares the addresses. If the return addresses from the two stacks do not match, the CPU generates a Control Protection (#CP) exception.

IBT defends against forward-edge attacks, which branch to gadgets by executing indirect CALL and JMP instructions with attacker controlled register or memory state, by requiring the target of indirect branches to start with a special marker instruction, ENDBRANCH. If an indirect branch is executed and the next instruction is not an ENDBRANCH, the CPU generates a #CP. Note, ENDBRANCH behaves as a NOP if IBT is disabled or unsupported.

From a virtualization perspective, CET presents several problems. While SHSTK and IBT have two layers of enabling, a global control in the form of a CR4 bit, and a per-feature control in user and kernel (supervisor) MSRs (U_CET and S_CET respectively), the {S,U}_CET MSRs can be context switched via XSAVES/XRSTORS. Practically speaking, intercepting and emulating XSAVES/XRSTORS is not a viable option due to complexity, and outright disallowing use of XSTATE to context switch SHSTK/IBT state would render the features unusable to most guests. To limit the overall complexity without sacrificing performance or usability, simply ignore the potential virtualization hole, but ensure that all paths in KVM treat SHSTK/IBT as usable by the guest if the feature is supported in hardware, and the guest has access to at least one of SHSTK or IBT. I.e. allow userspace to advertise one of SHSTK or IBT if both are supported in hardware, even though doing so would allow a misbehaving guest to use the unadvertised feature.

Fully emulating SHSTK and IBT would also require significant complexity, e.g. to track and update branch state for IBT, and shadow stack state for SHSTK. Given that emulating large swaths of the guest code stream isn't necessary on modern CPUs, punt on emulating instructions that meaningfully impact or consume SHSTK or IBT. However, instead of doing nothing, explicitly reject emulation of such instructions so that KVM's emulator can't be abused to circumvent CET.
Disable support for SHSTK and IBT if KVM is configured such that emulation of arbitrary guest instructions may be required, specifically if Unrestricted Guest (Intel only) is disabled, or if KVM will emulate a guest.MAXPHYADDR that is smaller than host.MAXPHYADDR. Lastly, disable SHSTK support if shadow paging is enabled, as the protections for the shadow stack are novel (shadow stacks require Writable=0,Dirty=1, so that they can't be directly modified by software), i.e. would require non-trivial support in the Shadow MMU. Note, AMD CPUs currently only support SHSTK. Explicitly disable IBT support so that KVM doesn't over-advertise if AMD CPUs add IBT, as virtualizing IBT in SVM would require KVM modifications.
2025-09-30  Merge tag 'kvm-x86-misc-6.18' of https://github.com/kvm-x86/linux into HEAD  Paolo Bonzini
KVM x86 changes for 6.18

- Don't (re)check L1 intercepts when completing userspace I/O to fix a flaw where a misbehaving userspace (a.k.a. syzkaller) could swizzle L1's intercepts and trigger a variety of WARNs in KVM.
- Emulate PERF_CNTR_GLOBAL_STATUS_SET for PerfMonV2 guests, as the MSR is supposed to exist for v2 PMUs.
- Allow Centaur CPU leaves (base 0xC000_0000) for Zhaoxin CPUs.
- Clean up KVM's vector hashing code for delivering lowest priority IRQs.
- Clean up the fastpath handler code to only handle IPIs and WRMSRs that are actually "fast", as opposed to handling those that KVM _hopes_ are fast, and in the process of doing so add fastpath support for TSC_DEADLINE writes on AMD CPUs.
- Clean up a pile of PMU code in anticipation of adding support for mediated vPMUs.
- Add support for the immediate forms of RDMSR and WRMSRNS, sans full emulator support (KVM should never need to emulate the MSRs outside of forced emulation and other contrived testing scenarios).
- Clean up the MSR APIs in preparation for CET and FRED virtualization, as well as mediated vPMU support.
- Reject a fully in-kernel IRQCHIP if EOIs are protected, i.e. for TDX VMs, as KVM can't faithfully emulate an I/O APIC for such guests.
- Rework KVM_REQ_MSR_FILTER_CHANGED into a generic RECALC_INTERCEPTS request in preparation for mediated vPMU support, as KVM will need to recalculate MSR intercepts in response to PMU refreshes for guests with mediated vPMUs.
- Misc cleanups and minor fixes.
2025-09-30  Merge tag 'kvm-x86-svm-6.18' of https://github.com/kvm-x86/linux into HEAD  Paolo Bonzini
KVM SVM changes for 6.18

- Require a minimum GHCB version of 2 when starting SEV-SNP guests via KVM_SEV_INIT2 so that invalid GHCB versions result in immediate errors instead of latent guest failures.
- Add support for Secure TSC for SEV-SNP guests, which prevents the untrusted host from tampering with the guest's TSC frequency, while still allowing the VMM to configure the guest's TSC frequency prior to launch.
- Mitigate the potential for TOCTOU bugs when accessing GHCB fields by wrapping all accesses via READ_ONCE().
- Validate the XCR0 provided by the guest (via the GHCB) to avoid tracking a bogus XCR0 value in KVM's software model.
- Save an SEV guest's policy if and only if LAUNCH_START fully succeeds to avoid leaving behind stale state (thankfully not consumed in KVM).
- Explicitly reject non-positive effective lengths during SNP's LAUNCH_UPDATE instead of subtly relying on guest_memfd to do the "heavy" lifting.
- Reload the pre-VMRUN TSC_AUX on #VMEXIT for SEV-ES guests, not the host's desired TSC_AUX, to fix a bug where KVM could clobber a different vCPU's TSC_AUX due to hardware not matching the value cached in the user-return MSR infrastructure.
- Enable AVIC by default for Zen4+ if x2AVIC (and other prereqs) is supported, and clean up the AVIC initialization code along the way.
2025-09-30  Merge tag 'loongarch-kvm-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD  Paolo Bonzini
LoongArch KVM changes for v6.18

1. Add PTW feature detection on new hardware.
2. Add sign extension with kernel MMIO/IOCSR emulation.
3. Improve in-kernel IPI emulation.
4. Improve in-kernel PCH-PIC emulation.
5. Move kvm_iocsr tracepoint out of generic code.
2025-09-23  KVM: x86: Add XSS support for CET_KERNEL and CET_USER  Yang Weijiang
Add CET_KERNEL and CET_USER to KVM's set of supported XSS bits when IBT *or* SHSTK is supported. Like CR4.CET, XFEATURE support for IBT and SHSTK is bundled together under the CET umbrella, and thus prone to virtualization holes if KVM or the guest supports only one of IBT or SHSTK, but hardware supports both. However, again like CR4.CET, such virtualization holes are benign from the host's perspective so long as KVM takes care to always honor the "or" logic. Require CET_KERNEL and CET_USER to come as a pair, and refuse to support IBT or SHSTK if one (or both) features is missing, as the (host) kernel expects them to come as a pair, i.e. may get confused and corrupt state if only one of CET_KERNEL or CET_USER is supported. Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Signed-off-by: Mathias Krause <minipli@grsecurity.net> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> [sean: split to separate patch, write changelog, add XFEATURE_MASK_CET_ALL] Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-26-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
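The "pair or nothing" rule can be expressed as a couple of checks along the lines of the sketch below (XFEATURE_MASK_CET_USER/KERNEL are existing defines; the placement in KVM's capability setup is illustrative):

  #define XFEATURE_MASK_CET_ALL  (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)

  /* The host kernel expects CET_USER and CET_KERNEL as a pair; refuse both
   * if either is missing from the host's XSS support. */
  if ((kvm_caps.supported_xss & XFEATURE_MASK_CET_ALL) != XFEATURE_MASK_CET_ALL)
          kvm_caps.supported_xss &= ~XFEATURE_MASK_CET_ALL;

  /* Don't advertise the CET xfeatures if neither SHSTK nor IBT is exposed. */
  if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK) && !kvm_cpu_cap_has(X86_FEATURE_IBT))
          kvm_caps.supported_xss &= ~XFEATURE_MASK_CET_ALL;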
2025-09-23  KVM: x86: Emulate SSP[63:32]!=0 #GP(0) for FAR JMP to 32-bit mode  Sean Christopherson
Emulate the Shadow Stack restriction that the current SSP must be a 32-bit value on a FAR JMP from 64-bit mode to compatibility mode. From the SDM's pseudocode for FAR JMP:

  IF ShadowStackEnabled(CPL)
      IF (IA32_EFER.LMA and DEST(segment selector).L) = 0
          (* If target is legacy or compatibility mode then the SSP must be in low 4GB *)
          IF (SSP & 0xFFFFFFFF00000000 != 0); THEN #GP(0); FI;
      FI;
  FI;

Note, only the current CPL needs to be considered, as FAR JMP can't be used for inter-privilege level transfers, and KVM rejects emulation of all other far branch instructions when Shadow Stacks are enabled. To give the emulator access to GUEST_SSP, special case handling MSR_KVM_INTERNAL_GUEST_SSP in emulator_get_msr() to treat the access as a host access (KVM doesn't allow guest accesses to internal "MSRs"). The ->get_msr() API is only used for implicit accesses from the emulator, i.e. is only used with hardcoded MSR indices, and so any access to MSR_KVM_INTERNAL_GUEST_SSP is guaranteed to be from KVM, i.e. not from the guest via RDMSR. Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-21-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
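Translated into emulator-style C, the check is roughly the sketch below; shadow_stack_enabled_at_cpl() and the dest_l parameter (the destination code segment's L bit) are illustrative stand-ins, while MSR_KVM_INTERNAL_GUEST_SSP and ->get_msr() are the pieces named in the changelog.

  static int em_far_jmp_check_ssp(struct x86_emulate_ctxt *ctxt, bool dest_l)
  {
          u64 ssp;

          if (!shadow_stack_enabled_at_cpl(ctxt))         /* hypothetical helper */
                  return X86EMUL_CONTINUE;

          /* Target is legacy/compatibility mode: SSP must be a 32-bit value. */
          if (!dest_l) {
                  if (ctxt->ops->get_msr(ctxt, MSR_KVM_INTERNAL_GUEST_SSP, &ssp))
                          return X86EMUL_UNHANDLEABLE;
                  if (ssp & ~0xffffffffULL)
                          return emulate_gp(ctxt, 0);
          }
          return X86EMUL_CONTINUE;
  }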
2025-09-23  KVM: x86: Don't emulate task switches when IBT or SHSTK is enabled  Sean Christopherson
Exit to userspace with KVM_INTERNAL_ERROR_EMULATION if the guest triggers task switch emulation with Indirect Branch Tracking or Shadow Stacks enabled, as attempting to do the right thing would require non-trivial effort and complexity, KVM doesn't support emulating CET generally, and it's extremely unlikely that any guest will do task switches while also utilizing CET. Defer taking on the complexity until someone cares enough to put in the time and effort to add support. Per the SDM: If shadow stack is enabled, then the SSP of the task is located at the 4 bytes at offset 104 in the 32-bit TSS and is used by the processor to establish the SSP when a task switch occurs from a task associated with this TSS. Note that the processor does not write the SSP of the task initiating the task switch to the TSS of that task, and instead the SSP of the previous task is pushed onto the shadow stack of the new task. Note, per the SDM's pseudocode on TASK SWITCHING, IBT state for the new privilege level is updated. To keep things simple, check both S_CET and U_CET (again, anyone that wants more precise checking can have the honor of implementing support). Reported-by: Binbin Wu <binbin.wu@linux.intel.com> Closes: https://lore.kernel.org/all/819bd98b-2a60-4107-8e13-41f1e4c706b1@linux.intel.com Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-20-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: VMX: Set host constant supervisor states to VMCS fieldsYang Weijiang
Save constant values to the HOST_{S_CET,SSP,INTR_SSP_TABLE} fields explicitly.
Kernel IBT is supported and the setting in MSR_IA32_S_CET is static after
boot (the exception is the BIOS call case, but the vCPU thread never crosses
it), so KVM doesn't need to refresh the HOST_S_CET field before every
VM-Enter/VM-Exit sequence.

Host supervisor shadow stack is not enabled now and SSP is not accessible to
kernel mode, thus it's safe to set the host IA32_INT_SSP_TAB/SSP VMCS fields
to 0. When shadow stack is enabled for CPL3, SSP is reloaded from PL3_SSP
before the CPU exits to userspace; see SDM Vol 2A/B Chapters 3/4 for
SYSCALL/SYSRET/SYSENTER/SYSEXIT/RDSSP/CALL, etc.

Prevent KVM module loading if host supervisor shadow stack (SHSTK_EN) is set
in MSR_IA32_S_CET, as KVM cannot co-exist with it correctly.

Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: snapshot host S_CET if SHSTK *or* IBT is supported]
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
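Conceptually, the setup reduces to a one-time write of the constant host
values, sketched below with stubbed, illustrative names (the real code uses
KVM's VMCS accessors and field encodings):

  #include <stdint.h>

  /* Illustrative field identifiers and a stub for KVM's vmcs_writel(). */
  enum { HOST_S_CET, HOST_SSP, HOST_INTR_SSP_TABLE };
  static void vmcs_writel(int field, uint64_t value) { (void)field; (void)value; }

  /* Written once at VMCS setup time; the host values are constant post-boot,
   * and host supervisor shadow stacks are unused, hence the zeros. */
  static void set_constant_host_cet_state(uint64_t host_s_cet)
  {
      vmcs_writel(HOST_S_CET, host_s_cet);   /* snapshot of MSR_IA32_S_CET */
      vmcs_writel(HOST_SSP, 0);
      vmcs_writel(HOST_INTR_SSP_TABLE, 0);
  }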
2025-09-23KVM: VMX: Emulate read and write to CET MSRsYang Weijiang
Add an emulation interface for CET MSR access. The emulation code is split
into a common part and a vendor-specific part: the former performs common
checks for the MSRs, e.g. accessibility, data validity, etc., then routes the
operation either to the XSAVE-managed MSRs via the helpers or to the CET VMCS
fields.

SSP can only be read via RDSSP, and writing it requires destructive and
potentially faulting operations such as SAVEPREVSSP/RSTORSSP or
SETSSBSY/CLRSSBSY. Let the host use a pseudo-MSR that is just a wrapper for
the GUEST_SSP field of the VMCS.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: drop call to kvm_set_xstate_msr() for S_CET, consolidate code]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
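For flavor, a "common part" data-validity check for SSP-class values could
look like the sketch below. The constraints (4-byte alignment plus a
canonical address) and the helper name are assumptions for illustration; the
authoritative checks live in x86.c:

  #include <stdbool.h>
  #include <stdint.h>

  /* Returns true if @data is a plausible SSP value: 4-byte aligned and
   * canonical for @va_bits-wide virtual addresses (e.g. 48 or 57). */
  static bool ssp_value_valid(uint64_t data, unsigned int va_bits)
  {
      int64_t sext = (int64_t)(data << (64 - va_bits)) >> (64 - va_bits);

      return (data & 0x3) == 0 && (uint64_t)sext == data;
  }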
2025-09-23KVM: x86: Enable guest SSP read/write interface with new uAPIsYang Weijiang
Add a KVM-defined ONE_REG register, KVM_REG_GUEST_SSP, to let userspace save
and restore the guest's Shadow Stack Pointer (SSP). On both Intel and AMD,
SSP is a hardware register that can only be accessed by software via
dedicated ISA (e.g. RDSSP) or via VMCS/VMCB fields (used by hardware to
context switch SSP at entry/exit). As a result, SSP doesn't fit in any of
KVM's existing interfaces for saving/restoring state.

Internally, treat SSP as a fake/synthetic MSR, as the semantics of writes to
SSP follow those of several other Shadow Stack MSRs, e.g. the PLx_SSP MSRs.
Use a translation layer to hide the KVM-internal MSR index so that the
arbitrary index doesn't become ABI, e.g. so that KVM can rework its
implementation as needed, so long as the ONE_REG ABI is maintained.

Explicitly reject accesses to SSP if the vCPU doesn't have Shadow Stack
support to avoid running afoul of ignore_msrs, which unfortunately applies to
host-initiated accesses (which is a discussion for another day). I.e. ensure
consistent behavior for KVM-defined registers irrespective of ignore_msrs.

Link: https://lore.kernel.org/all/aca9d389-f11e-4811-90cf-d98e345a5cc2@intel.com
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-14-seanjc@google.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
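From userspace, the new register is driven through the generic ONE_REG
ioctls, roughly as in the sketch below. The register-id macro here is a
placeholder; the real encoding for KVM_REG_GUEST_SSP comes from the exported
uAPI headers:

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Placeholder value; substitute the real KVM_REG_GUEST_SSP encoding. */
  #define REG_ID_GUEST_SSP 0

  static int set_guest_ssp(int vcpu_fd, uint64_t ssp)
  {
      struct kvm_one_reg reg = {
          .id   = REG_ID_GUEST_SSP,
          .addr = (uint64_t)(uintptr_t)&ssp,
      };

      return ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg);
  }

The symmetric KVM_GET_ONE_REG call retrieves the value the same way, which
keeps the synthetic MSR index out of the ABI entirely.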
2025-09-23KVM: x86: Report KVM supported CET MSRs as to-be-savedYang Weijiang
Add CET MSRs to the list of MSRs reported to userspace if the feature, i.e.
IBT or SHSTK, associated with the MSRs is supported by KVM.

Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
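The filtering amounts to keeping only the MSRs whose backing feature is
supported, roughly as below (a generic sketch with made-up types; the real
list handling is KVM's msrs_to_save machinery):

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Each candidate MSR is tagged with the feature that backs it. */
  enum cet_need { NEEDS_SHSTK, NEEDS_SHSTK_OR_IBT };

  struct cet_msr { uint32_t index; enum cet_need need; };

  static size_t filter_cet_msrs(const struct cet_msr *in, size_t n,
                                uint32_t *out, bool has_shstk, bool has_ibt)
  {
      size_t count = 0;

      for (size_t i = 0; i < n; i++) {
          bool ok = (in[i].need == NEEDS_SHSTK) ? has_shstk
                                                : (has_shstk || has_ibt);
          if (ok)
              out[count++] = in[i].index;
      }
      return count;
  }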
2025-09-23KVM: x86: Add fault checks for guest CR4.CET settingYang Weijiang
Check potential faults for CR4.CET setting per Intel SDM requirements. CET
can be enabled if and only if CR0.WP == 1, i.e. setting CR4.CET == 1 faults
if CR0.WP == 0, and setting CR0.WP == 0 faults if CR4.CET == 1.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
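The two cross-checks amount to the following predicates (a standalone sketch
using the architectural bit positions, not KVM's kvm_set_cr{0,4}() code):

  #include <stdbool.h>
  #include <stdint.h>

  #define X86_CR0_WP  (1ull << 16)
  #define X86_CR4_CET (1ull << 23)

  /* Setting CR4.CET requires CR0.WP to already be 1. */
  static bool cr4_write_ok(uint64_t cr0, uint64_t new_cr4)
  {
      return !(new_cr4 & X86_CR4_CET) || (cr0 & X86_CR0_WP);
  }

  /* Clearing CR0.WP is only allowed while CR4.CET is 0. */
  static bool cr0_write_ok(uint64_t new_cr0, uint64_t cr4)
  {
      return (new_cr0 & X86_CR0_WP) || !(cr4 & X86_CR4_CET);
  }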
2025-09-23KVM: x86: Load guest FPU state when accessing XSAVE-managed MSRsSean Christopherson
Load the guest's FPU state if userspace is accessing MSRs whose values are
managed by XSAVES. Introduce two helpers, kvm_{get,set}_xstate_msr(), to
facilitate access to such MSRs. If MSRs covered by kvm_caps.supported_xss
are passed through to the guest, the guest's MSR values are swapped with the
host's before the vCPU exits to userspace, and swapped back after it
re-enters the kernel, before the next VM-entry.

Because the modified code is also used for the KVM_GET_MSRS device ioctl(),
explicitly check that @vcpu is non-null before attempting to load guest
state. The XSAVE-managed MSRs cannot be retrieved via the device ioctl()
without loading guest FPU state (which doesn't exist).

Note that guest_cpuid_has() is not queried, as host userspace is allowed to
access MSRs that have not been exposed to the guest, e.g. it might do
KVM_SET_MSRS prior to KVM_SET_CPUID2.

The two helpers are placed here to make it explicit that accessing
XSAVE-managed MSRs requires special checks and handling to guarantee the
correctness of reads and writes to the MSRs.

Co-developed-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: drop S_CET, add big comment, move accessors to x86.c]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
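For illustration, the class of MSRs that needs the guest FPU loaded can be
identified with a predicate along these lines (MSR indices per the SDM; the
actual helper and the load/put of guest FPU state are in x86.c):

  #include <stdbool.h>
  #include <stdint.h>

  /* Architectural MSR indices for the XSAVE-managed CET state. */
  #define MSR_IA32_U_CET   0x6a0
  #define MSR_IA32_PL0_SSP 0x6a4
  #define MSR_IA32_PL3_SSP 0x6a7

  /* True if accessing @msr requires the guest's xstate to be resident, i.e.
   * the value lives in the XSAVE image rather than in a VMCS/VMCB field. */
  static bool is_xstate_managed_msr(uint32_t msr)
  {
      return msr == MSR_IA32_U_CET ||
             (msr >= MSR_IA32_PL0_SSP && msr <= MSR_IA32_PL3_SSP);
  }

Accesses to such MSRs are then bracketed by loading and putting the guest
FPU state, skipping the load when @vcpu is NULL (the device ioctl() case).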
2025-09-23KVM: x86: Initialize kvm_caps.supported_xssYang Weijiang
Initialize kvm_caps.supported_xss to (host_xss & KVM_SUPPORTED_XSS) if XSAVES
is supported. host_xss contains the host-supported xstate feature bits used
for thread FPU context switching, and KVM_SUPPORTED_XSS includes all
KVM-enabled XSS feature bits; the resulting value represents the supervisor
xstates that are available to the guest and are backed by the host FPU
framework for swapping the {guest,host} XSAVE-managed registers/MSRs.

[sean: relocate and enhance comment about PT / XSS[8] ]

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
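The computation itself is a single masking step, sketched here with a
placeholder mask (at this point in the series KVM_SUPPORTED_XSS enables no
features yet, per the following commits):

  #include <stdint.h>

  /* Placeholder; the real mask enumerates the XSS features KVM virtualizes. */
  #define KVM_SUPPORTED_XSS 0

  static uint64_t init_supported_xss(uint64_t host_xss, int xsaves_supported)
  {
      return xsaves_supported ? (host_xss & KVM_SUPPORTED_XSS) : 0;
  }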
2025-09-23KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSSYang Weijiang
Update CPUID.(EAX=0DH,ECX=1).EBX to reflect the current required xstate size
when the XSS MSR is modified. CPUID.(EAX=0DH,ECX=1).EBX reports the required
storage size of all enabled xstate features in (XCR0 | IA32_XSS), and the
value can be used by the guest to allocate a sufficiently sized xsave buffer.

Note, KVM does not yet support any XSS-based features, i.e. supported_xss is
guaranteed to be zero at this time.

Opportunistically skip CPUID updates if the XSS value doesn't change.

Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
Signed-off-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
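The write path boils down to the sketch below (stub types and names for
illustration only): refresh the reported size only when the value actually
changes.

  #include <stdint.h>

  struct xss_state { uint64_t ia32_xss; };   /* stand-in for vCPU state */

  static void refresh_cpuid_xstate_sizes(struct xss_state *s)
  {
      (void)s;   /* recompute CPUID.(EAX=0DH,ECX=1).EBX from XCR0 | IA32_XSS */
  }

  static void write_ia32_xss(struct xss_state *s, uint64_t data)
  {
      if (s->ia32_xss == data)
          return;                      /* unchanged, skip the CPUID update */

      s->ia32_xss = data;
      refresh_cpuid_xstate_sizes(s);
  }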
2025-09-23KVM: x86: Check XSS validity against guest CPUIDsChao Gao
Maintain per-guest valid XSS bits and check XSS validity against them rather
than against KVM capabilities. This is to prevent bits that are supported by
KVM but not supported for a guest from being set.

Opportunistically return KVM_MSR_RET_UNSUPPORTED on IA32_XSS MSR accesses if
guest CPUID doesn't enumerate X86_FEATURE_XSAVES. Since
KVM_MSR_RET_UNSUPPORTED takes care of host_initiated cases, drop the
host_initiated check.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
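The per-guest check itself is a simple mask test, sketched here with
illustrative names:

  #include <stdbool.h>
  #include <stdint.h>

  /* Reject the write if any bit outside the guest's permitted XSS set is set;
   * @guest_supported_xss is derived from guest CPUID, not raw KVM capability. */
  static bool ia32_xss_write_valid(uint64_t data, uint64_t guest_supported_xss)
  {
      return (data & ~guest_supported_xss) == 0;
  }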