path: root/arch/x86/kvm/svm
Age | Commit message | Author
4 days | Merge tag 'kvm-x86-fixes-6.19-rc1' of https://github.com/kvm-x86/linux into HEAD | Paolo Bonzini
KVM fixes for 6.19-rc1

 - Add a missing "break" to fix param parsing in the rseq selftest.
 - Apply runtime updates to the _current_ CPUID when userspace is setting CPUID, e.g. as part of vCPU hotplug, to fix a false positive and to avoid dropping the pending update.
 - Disallow toggling KVM_MEM_GUEST_MEMFD on an existing memslot, as it's not supported by KVM and leads to a use-after-free due to KVM failing to unbind the memslot from the previously-associated guest_memfd instance.
 - Harden against similar KVM_MEM_GUEST_MEMFD goofs, and prepare for supporting flags-only changes on KVM_MEM_GUEST_MEMFD memslots, e.g. for dirty logging.
 - Set exit_code[63:32] to -1 (all 0xffs) when synthesizing a nested SVM_EXIT_ERR (a.k.a. VMEXIT_INVALID) #VMEXIT, as VMEXIT_INVALID is defined as -1ull (a 64-bit value).
 - Update SVI when activating APICv to fix a bug where a post-activation EOI for an in-service IRQ would effectively be lost due to SVI being stale.
 - Immediately refresh APICv controls (if necessary) on a nested VM-Exit instead of deferring the update via KVM_REQ_APICV_UPDATE, as the request is effectively ignored because KVM thinks the vCPU already has the correct APICv settings.
2025-12-04 | KVM: nSVM: Set exit_code_hi to -1 when synthesizing SVM_EXIT_ERR (failed VMRUN) | Sean Christopherson
Set exit_code_hi to -1u as a temporary band-aid to fix a long-standing (effectively since KVM's inception) bug where KVM treats the exit code as a 32-bit value, when in reality it's a 64-bit value.

Per the APM, offset 0x70 is a single 64-bit value:

  070h 63:0 EXITCODE

And a sane reading of the error values defined in "Table C-1. SVM Intercept Codes" is that negative values use the full 64 bits:

  -1 VMEXIT_INVALID        Invalid guest state in VMCB.
  -2 VMEXIT_BUSY           BUSY bit was set in the VMSA
  -3 VMEXIT_IDLE_REQUIRED  The sibling thread is not in an idle state
  -4 VMEXIT_INVALID_PMC    Invalid PMC state

And that interpretation is confirmed by testing on Milan and Turin (by setting bits in CR0[63:32] to generate VMEXIT_INVALID on VMRUN).

Furthermore, Xen has treated exitcode as a 64-bit value since HVM support was added in 2006 (see Xen commit d1bd157fbc ("Big merge the HVM full-virtualisation abstractions.")).

Cc: Jim Mattson <jmattson@google.com>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: stable@vger.kernel.org
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251113225621.1688428-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-12-04 | KVM: nSVM: Clear exit_code_hi in VMCB when synthesizing nested VM-Exits | Sean Christopherson
Explicitly clear exit_code_hi in the VMCB when synthesizing "normal" nested VM-Exits, as the full exit code is a 64-bit value (spoiler alert), and all exit codes for non-failing VMRUN use only bits 31:0. Cc: Jim Mattson <jmattson@google.com> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: stable@vger.kernel.org Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251113225621.1688428-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
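The two exit-code commits above can be illustrated with a small userspace sketch. It models the 64-bit EXITCODE at VMCB offset 0x70 as two 32-bit halves; the struct and helper names are hypothetical stand-ins rather than KVM's actual definitions, and only the field names exit_code and exit_code_hi come from the changelogs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the two 32-bit halves of EXITCODE (offset 0x70). */
struct exit_code_fields {
	uint32_t exit_code;	/* bits 31:0 */
	uint32_t exit_code_hi;	/* bits 63:32 */
};

/* VMEXIT_INVALID is -1 interpreted as a 64-bit value. */
#define DEMO_VMEXIT_INVALID ((uint64_t)-1)

/* Reassemble the full 64-bit exit code as hardware defines it. */
static uint64_t full_exit_code(const struct exit_code_fields *f)
{
	return ((uint64_t)f->exit_code_hi << 32) | f->exit_code;
}

/* Failed VMRUN: both halves must be all ones to encode -1ull. */
static void set_exit_err(struct exit_code_fields *f)
{
	f->exit_code = (uint32_t)-1;
	f->exit_code_hi = (uint32_t)-1;
}

/* "Normal" exit: the code fits in bits 31:0, so clear the upper half. */
static void set_normal_exit(struct exit_code_fields *f, uint32_t code)
{
	f->exit_code = code;
	f->exit_code_hi = 0;
}
```

Writing only the low half with -1 yields 0x00000000ffffffff, which a reader that treats the field as 64 bits would not recognize as VMEXIT_INVALID.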
2025-11-26 | Merge tag 'kvm-x86-svm-6.19' of https://github.com/kvm-x86/linux into HEAD | Paolo Bonzini
KVM SVM changes for 6.19:

 - Fix a few missing "VMCB dirty" bugs.
 - Fix the worst of KVM's lack of EFER.LMSLE emulation.
 - Add AVIC support for addressing 4k vCPUs in x2AVIC mode.
 - Fix incorrect handling of selective CR0 writes when checking intercepts during emulation of L2 instructions.
 - Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on VMRUN and #VMEXIT.
 - Fix a bug where KVM corrupts the guest code stream when re-injecting a soft interrupt if the guest patched the underlying code after the VM-Exit, e.g. when Linux patches code with a temporary INT3.
 - Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to userspace, and extend KVM "support" to all policy bits that don't require any actual support from KVM.
2025-11-26 | Merge tag 'kvm-x86-misc-6.19' of https://github.com/kvm-x86/linux into HEAD | Paolo Bonzini
KVM x86 misc changes for 6.19:

 - Fix an async #PF bug where KVM would clear the completion queue when the guest transitioned in and out of paging mode, e.g. when handling an SMI and then returning to paged mode via RSM.
 - Fix a bug where TDX would effectively corrupt user-return MSR values if the TDX Module rejects VP.ENTER and thus doesn't clobber host MSRs as expected.
 - Leave the user-return notifier used to restore MSRs registered when disabling virtualization, and instead pin kvm.ko. Restoring host MSRs via IPI callback is either pointless (clean reboot) or dangerous (forced reboot) since KVM has no idea what code it's interrupting.
 - Use the checked version of {get,put}_user(), as Linus wants to kill off the unchecked versions, and the checked versions are measurably faster on modern CPUs due to the unchecked versions containing an LFENCE.
 - Fix a long-lurking bug where KVM's lack of catch-up logic for periodic APIC timers can result in a hard lockup in the host.
 - Revert the periodic kvmclock sync logic now that KVM doesn't use a clocksource that's subject to NTP corrections.
 - Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the latter behind CONFIG_CPU_MITIGATIONS.
 - Context switch XCR0, XSS, and PKRU outside of the entry/exit fastpath, as the only reason they were handled in the fastpath was to paper over a bug in the core #MC code that has long since been fixed.
 - Add emulator support for AVX MOV instructions to play nice with emulated devices whose PCI BARs guest drivers like to access with large multi-byte instructions.
2025-11-19 | KVM: x86: Load guest/host PKRU outside of the fastpath run loop | Sean Christopherson
Move KVM's swapping of PKRU outside of the fastpath loop, as there is no KVM code anywhere in the fastpath that accesses guest/userspace memory, i.e. that can consume protection keys.

As documented by commit 1be0e61c1f25 ("KVM, pkeys: save/restore PKRU when guest/host switches"), KVM just needs to ensure the host's PKRU is loaded when KVM (or the kernel at-large) may access userspace memory. And at the time of commit 1be0e61c1f25, KVM didn't have a fastpath, and PKU was strictly contained to VMX, i.e. there was no reason to swap PKRU outside of vmx_vcpu_run().

Over time, the "need" to swap PKRU close to VM-Enter was likely falsely solidified by the association with XFEATUREs in commit 37486135d3a7 ("KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c"), and XFEATURE swapping was in turn moved close to VM-Enter/VM-Exit as a KVM hack-a-fix solution for an #MC handler bug by commit 1811d979c716 ("x86/kvm: move kvm_load/put_guest_xcr0 into atomic context").

Deferring the PKRU loads shaves ~40 cycles off the fastpath for Intel, and ~60 cycles for AMD. E.g. using INVD in KVM-Unit-Test's vmexit.c, with extra hacks to enable CR4.PKE and PKRU=(-1u & ~0x3), latency numbers for AMD Turin go from ~1560 => ~1500, and for Intel Emerald Rapids, go from ~810 => ~770.

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Jon Kohler <jon@nutanix.com>
Link: https://patch.msgid.link/20251118222328.2265758-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-18 | KVM: SVM: Handle #MCs in guest outside of fastpath | Sean Christopherson
Handle Machine Checks (#MC) that happen in the guest (by forwarding them to the host) outside of KVM's fastpath so that as much host state as possible is re-loaded before invoking the kernel's #MC handler. The only requirement is that KVM invokes the #MC handler before enabling IRQs (and even that could _probably_ be relaxed to handling #MCs before enabling preemption). Waiting to handle #MCs until "more" host state is loaded hardens KVM against flaws in the #MC handler, which has historically been quite brittle. E.g. prior to commit 5567d11c21a1 ("x86/mce: Send #MC singal from task work"), the #MC code could trigger a schedule() with IRQs and preemption disabled. That led to a KVM hack-a-fix in commit 1811d979c716 ("x86/kvm: move kvm_load/put_guest_xcr0 into atomic context"). Note, except for #MCs on VM-Enter, VMX already handles #MCs outside of the fastpath. Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Jon Kohler <jon@nutanix.com> Link: https://patch.msgid.link/20251118222328.2265758-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-18 | x86/bugs: KVM: Move VM_CLEAR_CPU_BUFFERS into SVM as SVM_CLEAR_CPU_BUFFERS | Sean Christopherson
Now that VMX encodes its own sequence for clearing CPU buffers, move VM_CLEAR_CPU_BUFFERS into SVM to minimize the chances of KVM botching a mitigation in the future, e.g. using VM_CLEAR_CPU_BUFFERS instead of checking multiple mitigation flags. No functional change intended. Reviewed-by: Brendan Jackman <jackmanb@google.com> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://patch.msgid.link/20251113233746.1703361-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-18 | KVM: SVM: Fix redundant updates of LBR MSR intercepts | Yosry Ahmed
Don't update the LBR MSR intercept bitmaps if they're already up-to-date, as unconditionally updating the intercepts forces KVM to recalculate the MSR bitmaps for vmcb02 on every nested VMRUN. The redundant updates are functionally okay; however, they neuter an optimization in Hyper-V nested virtualization enlightenments and this manifests as a self-test failure.

In particular, Hyper-V lets L1 mark "nested enlightenments" as clean, i.e. tell KVM that no changes were made to the MSR bitmap since the last VMRUN. The hyperv_svm_test KVM selftest intentionally changes the MSR bitmap "without telling KVM about it" to verify that KVM honors the clean hint, and correctly fails because KVM notices the changed bitmap anyway:

  ==== Test Assertion Failure ====
    x86/hyperv_svm_test.c:120: vmcb->control.exit_code == 0x081
    pid=193558 tid=193558 errno=4 - Interrupted system call
       1 0x0000000000411361: assert_on_unhandled_exception at processor.c:659
       2 0x0000000000406186: _vcpu_run at kvm_util.c:1699
       3  (inlined by) vcpu_run at kvm_util.c:1710
       4 0x0000000000401f2a: main at hyperv_svm_test.c:175
       5 0x000000000041d0d3: __libc_start_call_main at libc-start.o:?
       6 0x000000000041f27c: __libc_start_main_impl at ??:?
       7 0x00000000004021a0: _start at ??:?
    vmcb->control.exit_code == SVM_EXIT_VMMCALL

Do *not* fix this by skipping svm_hv_vmcb_dirty_nested_enlightenments() when svm_set_intercept_for_msr() performs a no-op change. Changes to the L0 MSR interception bitmap are only triggered by full CPUID updates and MSR filter updates, both of which should be rare. Changing svm_set_intercept_for_msr() risks hiding unintended pessimizations like this one, and is actually more complex than this change.

Fixes: fbe5e5f030c2 ("KVM: nSVM: Always recalculate LBR MSR intercepts in svm_update_lbrv()")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251112013017.1836863-1-yosry.ahmed@linux.dev
[Rewritten commit message based on mailing list discussion. - Paolo]
Reviewed-by: Sean Christopherson <seanjc@google.com>
Tested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
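The idea of skipping an update that would not change anything can be sketched in userspace as follows. This is an illustrative model only: the struct, its fields, and the recalculation counter are hypothetical and do not mirror KVM's data structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-vCPU state; names are illustrative, not KVM's. */
struct lbr_intercept_state {
	bool msrs_intercepted;		/* cached: are LBR MSRs intercepted? */
	unsigned int bitmap_recalcs;	/* counts expensive bitmap rewrites */
};

/*
 * Apply the desired intercept setting, but only touch the (modeled) MSR
 * bitmap when the setting actually changes.
 */
static void set_lbr_intercepts(struct lbr_intercept_state *s, bool intercept)
{
	if (s->msrs_intercepted == intercept)
		return;			/* already up to date, nothing to do */

	s->msrs_intercepted = intercept;
	s->bitmap_recalcs++;		/* stands in for the bitmap rewrite */
}
```

In this model, repeated calls with an unchanged setting leave the counter alone, which is the property the commit relies on to keep the "clean" hint meaningful.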
2025-11-14 | KVM: SEV: Add known supported SEV-SNP policy bits | Tom Lendacky
Add to the known supported SEV-SNP policy bits that don't require any implementation support from KVM in order to successfully use them. At this time, this includes: - CXL_ALLOW - MEM_AES_256_XTS - RAPL_DIS - CIPHERTEXT_HIDING_DRAM - PAGE_SWAP_DISABLE Arguably, RAPL_DIS and CIPHERTEXT_HIDING_DRAM require KVM and the CCP driver to enable these features in order for the setting of the policy bits to be successfully handled. But, a guest owner may not wish their guest to run on a system that doesn't provide support for those features, so allowing the specification of these bits accomplishes that. Whether or not the bit is supported by SEV firmware, a system that doesn't support these features will either fail during the KVM validation of supported policy bits before issuing the LAUNCH_START or fail during the LAUNCH_START. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://patch.msgid.link/ec040de9864099cf592a97c201dc4cc110b2b0cf.1761593632.git.thomas.lendacky@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
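The "validation of supported policy bits" mentioned above amounts to a mask check. The sketch below assumes hypothetical bit positions (the real layout lives in the SEV-SNP firmware ABI and include/linux/psp-sev.h) and a hypothetical helper name; it is not KVM's implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for illustration. */
#define DEMO_POLICY_CXL_ALLOW		(1ull << 0)
#define DEMO_POLICY_MEM_AES_256_XTS	(1ull << 1)
#define DEMO_POLICY_RAPL_DIS		(1ull << 2)

/*
 * Reject a guest policy that requests any bit outside the supported set.
 * Returns 0 on success, -EINVAL if an unsupported bit is set.
 */
static int check_snp_policy(uint64_t requested, uint64_t supported)
{
	if (requested & ~supported)
		return -EINVAL;

	return 0;
}
```

A VMM could read the supported mask once and run the same check itself before attempting a launch.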
2025-11-14 | KVM: SEV: Publish supported SEV-SNP policy bits | Tom Lendacky
Define the set of policy bits that KVM currently knows as not requiring any implementation support within KVM. Provide this value to userspace via the KVM_GET_DEVICE_ATTR ioctl. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://patch.msgid.link/c596f7529518f3f826a57970029451d9385949e5.1761593632.git.thomas.lendacky@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-14 | KVM: SEV: Consolidate the SEV policy bits in a single header file | Tom Lendacky
Consolidate SEV policy bit definitions into a single file. Use include/linux/psp-sev.h to hold the definitions and remove the current definitions from the arch/x86/kvm/svm/sev.c and arch/x86/include/svm.h files. No functional change intended. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://patch.msgid.link/d9639f88a0b521a1a67aeac77cc609fdea1f90bd.1761593632.git.thomas.lendacky@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-13 | KVM: SVM: Don't skip unrelated instruction if INT3/INTO is replaced | Omar Sandoval
When re-injecting a soft interrupt from an INT3, INTO, or (select) INTn instruction, discard the exception and retry the instruction if the code stream is changed (e.g. by a different vCPU) between when the CPU executes the instruction and when KVM decodes the instruction to get the next RIP.

As effectively predicted by commit 6ef88d6e36c2 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction"), failure to verify that the correct INTn instruction was decoded can effectively clobber guest state due to decoding the wrong instruction and thus specifying the wrong next RIP.

The bug most often manifests as "Oops: int3" panics on static branch checks in Linux guests. Enabling or disabling a static branch in Linux uses the kernel's "text poke" code patching mechanism. To modify code while other CPUs may be executing that code, Linux (temporarily) replaces the first byte of the original instruction with an int3 (opcode 0xcc), then patches in the new code stream except for the first byte, and finally replaces the int3 with the first byte of the new code stream. If a CPU hits the int3, i.e. executes the code while it's being modified, then the guest kernel must look up the RIP to determine how to handle the #BP, e.g. by emulating the new instruction. If the RIP is incorrect, then this lookup fails and the guest kernel panics.

The bug reproduces almost instantly by hacking the guest kernel to repeatedly check a static branch[1] while running a drgn script[2] on the host to constantly swap out the memory containing the guest's TSS.

[1]: https://gist.github.com/osandov/44d17c51c28c0ac998ea0334edf90b5a
[2]: https://gist.github.com/osandov/10e45e45afa29b11e0c7209247afc00b

Fixes: 6ef88d6e36c2 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction")
Cc: stable@vger.kernel.org
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Link: https://patch.msgid.link/1cc6dcdf36e3add7ee7c8d90ad58414eeb6c3d34.1762278762.git.osandov@fb.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
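The "verify before skipping" rule described above can be modeled in a few lines. This sketch handles only the one-byte INT3 case and treats guest code as a plain byte array; the enum and function names are hypothetical and the real fix works on KVM's instruction decoder output instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INT3_OPCODE 0xcc

enum soft_int_action {
	REINJECT_AND_SKIP,	/* re-inject #BP with next RIP past the INT3 */
	RETRY_INSTRUCTION,	/* code changed underneath us: drop and retry */
};

/*
 * Decide what to do for an INT3 that the CPU executed at code[rip], given
 * the bytes that are in guest memory *now*. If the byte is no longer an
 * INT3 (e.g. text poke finished in the meantime), computing "next RIP"
 * from it would skip an unrelated instruction, so ask for a retry.
 */
static enum soft_int_action check_int3(const uint8_t *code, size_t len,
				       size_t rip, size_t *next_rip)
{
	if (rip >= len || code[rip] != INT3_OPCODE)
		return RETRY_INSTRUCTION;

	*next_rip = rip + 1;	/* INT3 is a single-byte instruction */
	return REINJECT_AND_SKIP;
}
```

Retrying is safe here because the guest simply re-executes whatever instruction is now at that RIP.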
2025-11-09 | KVM: nSVM: Fix and simplify LBR virtualization handling with nested | Yosry Ahmed
The current scheme for handling LBRV when nested is used is very complicated, especially when L1 does not enable LBRV (i.e. does not set LBR_CTL_ENABLE_MASK). To avoid copying LBRs between VMCB01 and VMCB02 on every nested transition, the current implementation switches between using VMCB01 or VMCB02 as the source of truth for the LBRs while L2 is running. If L2 enables LBR, VMCB02 is used as the source of truth. When L2 disables LBR, the LBRs are copied to VMCB01 and VMCB01 is used as the source of truth. This introduces significant complexity, and incorrect behavior in some cases. For example, on a nested #VMEXIT, the LBRs are only copied from VMCB02 to VMCB01 if LBRV is enabled in VMCB01. This is because L2's writes to MSR_IA32_DEBUGCTLMSR to enable LBR are intercepted and propagated to VMCB01 instead of VMCB02. However, LBRV is only enabled in VMCB02 when L2 is running. This means that if L2 enables LBR and exits to L1, the LBRs will not be propagated from VMCB02 to VMCB01, because LBRV is disabled in VMCB01. There is no meaningful difference in CPUID rate in L2 when copying LBRs on every nested transition vs. the current approach, so do the simple and correct thing and always copy LBRs between VMCB01 and VMCB02 on nested transitions (when LBRV is disabled by L1). Drop the conditional LBRs copying in __svm_{enable/disable}_lbrv() as it is now unnecessary. VMCB02 becomes the only source of truth for LBRs when L2 is running, regardless of LBRV being enabled by L1, drop svm_get_lbr_vmcb() and use svm->vmcb directly in its place. Fixes: 1d5a1b5860ed ("KVM: x86: nSVM: correctly virtualize LBR msrs when L2 is running") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251108004524.1600006-4-yosry.ahmed@linux.dev Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-11-09 | KVM: nSVM: Always recalculate LBR MSR intercepts in svm_update_lbrv() | Yosry Ahmed
svm_update_lbrv() is called when MSR_IA32_DEBUGCTLMSR is updated, and on nested transitions where LBRV is used. It checks whether LBRV enablement needs to be changed in the current VMCB, and if it does, it also recalculates intercepts to LBR MSRs.

However, there are cases where intercepts need to be updated even when LBRV enablement doesn't. Example scenario:

 - L1 has MSR_IA32_DEBUGCTLMSR cleared.
 - L1 runs L2 without LBR_CTL_ENABLE (no LBRV).
 - L2 sets DEBUGCTLMSR_LBR in MSR_IA32_DEBUGCTLMSR, svm_update_lbrv() sets LBR_CTL_ENABLE in VMCB02 and disables intercepts to LBR MSRs.
 - L2 exits to L1, svm_update_lbrv() is not called on this transition.
 - L1 clears MSR_IA32_DEBUGCTLMSR, svm_update_lbrv() finds that LBR_CTL_ENABLE is already cleared in VMCB01 and does nothing.
 - Intercepts remain disabled, L1 reads to LBR MSRs read the host MSRs.

Fix it by always recalculating intercepts in svm_update_lbrv().

Fixes: 1d5a1b5860ed ("KVM: x86: nSVM: correctly virtualize LBR msrs when L2 is running")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251108004524.1600006-3-yosry.ahmed@linux.dev
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
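The example scenario above can be replayed in a deliberately simplified model: one LBRV-enable flag per VMCB and a single intercept state for the vCPU. All names are hypothetical, and the always_recalc parameter exists only to contrast the old and fixed behavior side by side.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model: per-VMCB LBRV enable, one intercept state per vCPU. */
struct lbr_model {
	bool lbrv_enabled[2];	/* [0] = VMCB01, [1] = VMCB02 */
	int cur;		/* index of the VMCB currently in use */
	bool msrs_intercepted;	/* are LBR MSR accesses intercepted? */
};

/*
 * Model of the update: the old behavior recalculates intercepts only when
 * the enable bit of the current VMCB changes; the fixed behavior always
 * recalculates them.
 */
static void update_lbrv(struct lbr_model *m, bool want_lbrv, bool always_recalc)
{
	bool changed = m->lbrv_enabled[m->cur] != want_lbrv;

	m->lbrv_enabled[m->cur] = want_lbrv;
	if (changed || always_recalc)
		m->msrs_intercepted = !want_lbrv;
}

/* Replay the scenario; returns whether L1's LBR MSR reads end up intercepted. */
static bool replay_scenario(bool always_recalc)
{
	struct lbr_model m = { .lbrv_enabled = { false, false }, .cur = 1,
			       .msrs_intercepted = true };

	update_lbrv(&m, true, always_recalc);	/* L2 sets DEBUGCTLMSR_LBR */
	m.cur = 0;				/* L2 exits to L1, no update */
	update_lbrv(&m, false, always_recalc);	/* L1 clears DEBUGCTL */
	return m.msrs_intercepted;
}
```

With the old behavior the replay ends with intercepts still disabled; with unconditional recalculation they are re-enabled.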
2025-11-09 | KVM: SVM: Mark VMCB_LBR dirty when MSR_IA32_DEBUGCTLMSR is updated | Yosry Ahmed
The APM lists the DbgCtlMsr field as being tracked by the VMCB_LBR clean bit. Always clear the bit when MSR_IA32_DEBUGCTLMSR is updated. The history is complicated, it was correctly cleared for L1 before commit 1d5a1b5860ed ("KVM: x86: nSVM: correctly virtualize LBR msrs when L2 is running"). At that point svm_set_msr() started to rely on svm_update_lbrv() to clear the bit, but when nested virtualization is enabled the latter does not always clear it even if MSR_IA32_DEBUGCTLMSR changed. Go back to clearing it directly in svm_set_msr(). Fixes: 1d5a1b5860ed ("KVM: x86: nSVM: correctly virtualize LBR msrs when L2 is running") Reported-by: Matteo Rizzo <matteorizzo@google.com> Reported-by: evn@google.com Co-developed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251108004524.1600006-2-yosry.ahmed@linux.dev Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-11-06 | KVM: SVM: Ensure SPEC_CTRL[63:32] is context switched between guest and host | Uros Bizjak
SPEC_CTRL is an MSR, i.e. a 64-bit value, but the VMRUN assembly code assumes bits 63:32 are always zero. The bug is _currently_ benign because neither KVM nor the kernel support setting any of bits 63:32, but it's still a bug that needs to be fixed. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Suggested-by: Sean Christopherson <seanjc@google.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20251106191230.182393-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
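Why the bug is benign today but still a bug can be seen by modeling how a 64-bit MSR value is split across EDX:EAX for a write. The sketch below is plain userspace C with hypothetical helper names; it does not reproduce the actual VMRUN assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Register pair used to write a 64-bit MSR: EDX holds 63:32, EAX holds 31:0. */
struct edx_eax {
	uint32_t eax;
	uint32_t edx;
};

/* Models code that assumes bits 63:32 are always zero. */
static struct edx_eax load_low_only(uint64_t val)
{
	return (struct edx_eax){ .eax = (uint32_t)val, .edx = 0 };
}

/* Models code that context switches the full 64-bit value. */
static struct edx_eax load_full(uint64_t val)
{
	return (struct edx_eax){ .eax = (uint32_t)val,
				 .edx = (uint32_t)(val >> 32) };
}

/* The value the MSR would end up holding after the write. */
static uint64_t msr_value(struct edx_eax r)
{
	return ((uint64_t)r.edx << 32) | r.eax;
}
```

As long as no upper bit is ever set, both variants write the same value, which matches the changelog's "currently benign" description.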
2025-11-05 | KVM: nSVM: Avoid incorrect injection of SVM_EXIT_CR0_SEL_WRITE | Yosry Ahmed
When emulating L2 instructions, svm_check_intercept() checks whether a write to CR0 should trigger a synthesized #VMEXIT with SVM_EXIT_CR0_SEL_WRITE. However, it does not check whether L1 enabled the intercept for SVM_EXIT_WRITE_CR0, which has higher priority according to the APM (24593—Rev. 3.42—March 2024, Table 15-7): When both selective and non-selective CR0-write intercepts are active at the same time, the non-selective intercept takes priority. With respect to exceptions, the priority of this intercept is the same as the generic CR0-write intercept. Make sure L1 does NOT intercept SVM_EXIT_WRITE_CR0 before checking if SVM_EXIT_CR0_SEL_WRITE needs to be injected. Opportunistically tweak the "not CR0" logic to explicitly bail early so that it's more obvious that only CR0 has a selective intercept, and that modifying icpt_info.exit_code is functionally necessary so that the call to nested_svm_exit_handled() checks the correct exit code. Fixes: cfec82cb7d31 ("KVM: SVM: Add intercept check for emulated cr accesses") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251024192918.3191141-4-yosry.ahmed@linux.dev [sean: isolate non-CR0 write logic, tweak comments accordingly] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05 | KVM: nSVM: Propagate SVM_EXIT_CR0_SEL_WRITE correctly for LMSW emulation | Yosry Ahmed
When emulating L2 instructions, svm_check_intercept() checks whether a write to CR0 should trigger a synthesized #VMEXIT with SVM_EXIT_CR0_SEL_WRITE. For MOV-to-CR0, SVM_EXIT_CR0_SEL_WRITE is only triggered if any bit other than CR0.MP and CR0.TS is updated. However, according to the APM (24593—Rev. 3.42—March 2024, Table 15-7): The LMSW instruction treats the selective CR0-write intercept as a non-selective intercept (i.e., it intercepts regardless of the value being written). Skip checking the changed bits for x86_intercept_lmsw and always inject SVM_EXIT_CR0_SEL_WRITE. Fixes: cfec82cb7d31 ("KVM: SVM: Add intercept check for emulated cr accesses") Cc: stable@vger.kernel.org Reported-by: Matteo Rizzo <matteorizzo@google.com> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251024192918.3191141-3-yosry.ahmed@linux.dev Signed-off-by: Sean Christopherson <seanjc@google.com>
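The rules quoted from the APM in this commit and the preceding CR0_SEL_WRITE commit combine into a small decision function. The sketch below is a standalone model with hypothetical enum and function names; it is not the code in svm_check_intercept().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR0_MP (1ull << 1)
#define CR0_TS (1ull << 3)

enum cr0_write_insn { CR0_WRITE_MOV, CR0_WRITE_LMSW };
enum cr0_exit { CR0_NO_EXIT, CR0_EXIT_WRITE, CR0_EXIT_SEL_WRITE };

/* Which #VMEXIT, if any, should an emulated L2 write to CR0 produce? */
static enum cr0_exit cr0_write_exit(bool icpt_write, bool icpt_sel_write,
				    enum cr0_write_insn insn,
				    uint64_t old_cr0, uint64_t new_cr0)
{
	/* The non-selective intercept takes priority when both are active. */
	if (icpt_write)
		return CR0_EXIT_WRITE;

	if (!icpt_sel_write)
		return CR0_NO_EXIT;

	/* LMSW treats the selective intercept as non-selective. */
	if (insn == CR0_WRITE_LMSW)
		return CR0_EXIT_SEL_WRITE;

	/* MOV-to-CR0: only changes outside of MP and TS trigger the exit. */
	if ((old_cr0 ^ new_cr0) & ~(CR0_MP | CR0_TS))
		return CR0_EXIT_SEL_WRITE;

	return CR0_NO_EXIT;
}
```

Checking the non-selective intercept first keeps the priority rule in one place instead of spreading it across the per-instruction cases.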
2025-11-05 | KVM: nSVM: Remove redundant cases in nested_svm_intercept() | Yosry Ahmed
Both the CRx and DRx cases are doing exactly what the default case is doing, remove them. No functional change intended. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251024192918.3191141-2-yosry.ahmed@linux.dev Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-04 | KVM: x86: Add a helper to dedup reporting of unhandled VM-Exits | Sean Christopherson
Add and use a helper, kvm_prepare_unexpected_reason_exit(), to dedup the code that fills the exit reason and CPU when KVM encounters a VM-Exit that KVM doesn't know how to handle. Reviewed-by: yaoyuan@linux.alibaba.com Reviewed-by: Yao Yuan <yaoyuan@linux.alibaba.com> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030185004.3372256-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-04 | KVM: SVM: switch to raw spinlock for svm->ir_list_lock | Maxim Levitsky
Use a raw spinlock for vcpu_svm.ir_list_lock as the lock can be taken during schedule() via kvm_sched_out() => __avic_vcpu_put(), and "normal" spinlocks are sleepable locks when PREEMPT_RT=y. This fixes the following lockdep warning: ============================= [ BUG: Invalid wait context ] 6.12.0-146.1640_2124176644.el10.x86_64+debug #1 Not tainted ----------------------------- qemu-kvm/38299 is trying to lock: ff11000239725600 (&svm->ir_list_lock){....}-{3:3}, at: __avic_vcpu_put+0xfd/0x300 [kvm_amd] other info that might help us debug this: context-{5:5} 2 locks held by qemu-kvm/38299: #0: ff11000239723ba8 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x240/0xe00 [kvm] #1: ff11000b906056d8 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x2e/0x130 stack backtrace: CPU: 1 UID: 0 PID: 38299 Comm: qemu-kvm Kdump: loaded Not tainted 6.12.0-146.1640_2124176644.el10.x86_64+debug #1 PREEMPT(voluntary) Hardware name: AMD Corporation QUARTZ/QUARTZ, BIOS RQZ100AB 09/14/2023 Call Trace: <TASK> dump_stack_lvl+0x6f/0xb0 __lock_acquire+0x921/0xb80 lock_acquire.part.0+0xbe/0x270 _raw_spin_lock_irqsave+0x46/0x90 __avic_vcpu_put+0xfd/0x300 [kvm_amd] svm_vcpu_put+0xfa/0x130 [kvm_amd] kvm_arch_vcpu_put+0x48c/0x790 [kvm] kvm_sched_out+0x161/0x1c0 [kvm] prepare_task_switch+0x36b/0xf60 __schedule+0x4f7/0x1890 schedule+0xd4/0x260 xfer_to_guest_mode_handle_work+0x54/0xc0 vcpu_run+0x69a/0xa70 [kvm] kvm_arch_vcpu_ioctl_run+0xdc0/0x17e0 [kvm] kvm_vcpu_ioctl+0x39f/0xe00 [kvm] Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://patch.msgid.link/20251030194130.307900-1-mlevitsk@redhat.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-04 | KVM: SVM: Make avic_ga_log_notifier() local to avic.c | Sean Christopherson
Make avic_ga_log_notifier() a local symbol now that it's defined and used purely within avic.c. No functional change intended.

Fixes: 4bdec12aa8d6 ("KVM: SVM: Detect X2APIC virtualization (x2AVIC) support")
Link: https://patch.msgid.link/20251016190643.80529-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-04 | KVM: SVM: Unregister KVM's GALog notifier on kvm-amd.ko exit | Sean Christopherson
Unregister the GALog notifier (used to get notified of wake events for blocking vCPUs) on kvm-amd.ko exit so that a KVM or IOMMU driver bug that results in a spurious GALog event "only" results in a spurious IRQ, and doesn't trigger a use-after-free due to executing unloaded module code. Fixes: 5881f73757cc ("svm: Introduce AMD IOMMU avic_ga_log_notifier") Reported-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Closes: https://lore.kernel.org/all/20250918130320.GA119526@k08j02272.eu95sqa Link: https://patch.msgid.link/20251016190643.80529-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-04 | KVM: SVM: Initialize per-CPU svm_data at the end of hardware setup | Sean Christopherson
Setup the per-CPU SVM data structures at the very end of hardware setup so that svm_hardware_unsetup() can be used in svm_hardware_setup() to unwind AVIC setup (for the GALog notifier). Alternatively, the error path could do an explicit, manual unwind, e.g. by adding a helper to free the per-CPU structures. But the per-CPU allocations have no interactions or dependencies, i.e. can comfortably live at the end, and so converting to a manual unwind would introduce churn and code without providing any immediate advantage. Link: https://patch.msgid.link/20251016190643.80529-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-17 | KVM: SVM: Add AVIC support for 4k vCPUs in x2AVIC mode | Naveen N Rao
With AVIC support for 4k vCPUs, the maximum supported physical ID in x2AVIC mode is 4095. Since this is no longer fixed, introduce a variable (x2avic_max_physical_id) to capture the maximum supported physical ID on the current platform and use that in place of the existing macro (X2AVIC_MAX_PHYSICAL_ID). With AVIC support for 4k vCPUs, the AVIC Physical ID table is no longer a single page and can occupy up to 8 contiguous 4k pages. Since AVIC hardware accesses of the physical ID table are limited by the physical max index programmed in the VMCB, it is sufficient to allocate only as many pages as are required to have a physical table entry for the max guest APIC ID. Since the guest APIC mode is not available at this point, provision for the maximum possible x2AVIC ID. For this purpose, add a variant of avic_get_max_physical_id() that works with a NULL vCPU pointer and returns the max x2AVIC ID. Wrap this in a new helper for obtaining the allocation order. To make it easy to identify support for 4k vCPUs in x2AVIC mode, update the message printed to the kernel log to print the maximum number of vCPUs supported. Do this on all platforms supporting x2AVIC since it is useful to know what is supported on a specific platform. Co-developed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/7fc5962f6da028f7dd3c79dbbd5c574fa02c99dd.1757009416.git.naveen@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
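The "up to 8 contiguous 4k pages" figure follows from simple arithmetic, sketched below. The 8-byte entry size is an assumption inferred from the earlier layout (IDs 0..511 fitting in a single 4k page), and the helper names are hypothetical rather than the ones added by the commit.

```c
#include <assert.h>

#define DEMO_PAGE_SIZE	4096u
#define DEMO_ENTRY_SIZE	8u	/* assumed: 512 entries per 4k page */

/* Pages needed to hold one table entry for every ID in 0..max_id. */
static unsigned int physid_table_pages(unsigned int max_id)
{
	unsigned int bytes = (max_id + 1) * DEMO_ENTRY_SIZE;

	return (bytes + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE;
}

/* Smallest order such that 2^order pages cover the requested page count. */
static unsigned int physid_table_order(unsigned int pages)
{
	unsigned int order = 0;

	assert(pages > 0);
	while ((1u << order) < pages)
		order++;

	return order;
}
```

For a maximum ID of 4095 this gives 4096 * 8 = 32768 bytes, i.e. 8 pages or an order-3 allocation, while a maximum ID of 511 still fits in a single page.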
2025-10-17 | KVM: SVM: Move AVIC Physical ID table allocation to vcpu_precreate() | Naveen N Rao
With support for 4k vCPUs in x2AVIC, the size of the AVIC Physical ID table is expanded from a single 4k page to a maximum of 8 contiguous 4k pages. The actual number of pages allocated depends on the maximum possible APIC ID in the guest, which is only known by the time the first vCPU is created. In preparation for supporting a dynamic AVIC Physical ID table size, move its allocation to vcpu_precreate(). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/7dc764e0af7f01440bbac3d9215ed174027c2384.1757009416.git.naveen@kernel.org [sean: drop enable_apicv check from svm_vcpu_precreate()] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-17 | KVM: SVM: Replace hard-coded value 0x1FF with the corresponding macro | Naveen N Rao
The lower 9-bit field in EXITINFO2 represents an index into the AVIC Physical/Logical APIC ID table for an AVIC_INCOMPLETE_IPI #VMEXIT. Since the index into the Logical APIC ID table is just 8 bits, this field is actually bound by the bit-width of the index into the AVIC Physical ID table which is represented by AVIC_PHYSICAL_MAX_INDEX_MASK. So, use that macro to mask EXITINFO2.Index instead of hard coding 0x1FF in avic_incomplete_ipi_interception().

Co-developed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/95795f449c68bffcb3e1789ee2b0b7393711d37d.1757009416.git.naveen@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-17 | KVM: SVM: Add a helper to look up the max physical ID for AVIC | Naveen N Rao
To help with a future change, add a helper to look up the maximum physical ID depending on the vCPU AVIC mode. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/0ab9bf5e20a3463a4aa3a5ea9bbbac66beedf1d1.1757009416.git.naveen@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-17 | KVM: SVM: Limit AVIC physical max index based on configured max_vcpu_ids | Naveen N Rao
KVM allows VMMs to specify the maximum possible APIC ID for a virtual machine through KVM_CAP_MAX_VCPU_ID capability so as to limit data structures related to APIC/x2APIC. Utilize the same to set the AVIC physical max index in the VMCB, similar to VMX. This helps hardware limit the number of entries to be scanned in the physical APIC ID table speeding up IPI broadcasts for virtual machines with smaller number of vCPUs. Unlike VMX, SVM AVIC requires a single page to be allocated for the Physical APIC ID table and the Logical APIC ID table, so retain the existing approach of allocating those during VM init. Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/adb07ccdb3394cd79cb372ba6bcc69a4e4d4ef54.1757009416.git.naveen@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-15KVM: SVM: Disallow EFER.LMSLE when not supported by hardwareJim Mattson
Modern AMD CPUs do not support segment limit checks in 64-bit mode (i.e. EFER.LMSLE must be zero). Do not allow a guest to set EFER.LMSLE on a CPU that requires the bit to be zero. For backwards compatibility, allow EFER.LMSLE to be set on CPUs that support segment limit checks in 64-bit mode, even though KVM's implementation of the feature is incomplete (e.g. KVM's emulator does not enforce segment limits in 64-bit mode). Fixes: eec4b140c924 ("KVM: SVM: Allow EFER.LMSLE to be set with nested svm") Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Nikunj A Dadhania <nikunj@amd.com> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://lore.kernel.org/r/20251001001529.1119031-3-jmattson@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
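The rule above can be sketched as a single predicate. EFER bit 13 is LMSLE; the function name and the hw_supports_lmsle flag are hypothetical stand-ins for however KVM learns about hardware support.

```c
#include <stdbool.h>
#include <stdint.h>

#define EFER_LMSLE (1ull << 13) /* Long Mode Segment Limit Enable */

/*
 * Hypothetical check: a guest write to EFER is acceptable with respect to
 * LMSLE if the bit is clear, or if the host CPU supports segment limit
 * checks in 64-bit mode.
 */
static inline bool toy_efer_lmsle_ok(uint64_t new_efer, bool hw_supports_lmsle)
{
	return !(new_efer & EFER_LMSLE) || hw_supports_lmsle;
}
```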
2025-10-14KVM: SVM: Mark VMCB_NPT as dirty on nested VMRUNJim Mattson
Mark the VMCB_NPT bit as dirty in nested_vmcb02_prepare_save() on every nested VMRUN. If L1 changes the PAT MSR between two VMRUN instructions on the same L1 vCPU, the g_pat field in the associated vmcb02 will change, and the VMCB_NPT clean bit should be cleared. Fixes: 4bb170a5430b ("KVM: nSVM: do not mark all VMCB02 fields dirty on nested vmexit") Cc: stable@vger.kernel.org Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250922162935.621409-3-jmattson@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-14KVM: SVM: Mark VMCB_PERM_MAP as dirty on nested VMRUNJim Mattson
Mark the VMCB_PERM_MAP bit as dirty in nested_vmcb02_prepare_control() on every nested VMRUN. If L1 changes MSR interception (INTERCEPT_MSR_PROT) between two VMRUN instructions on the same L1 vCPU, the msrpm_base_pa in the associated vmcb02 will change, and the VMCB_PERM_MAP clean bit should be cleared. Fixes: 4bb170a5430b ("KVM: nSVM: do not mark all VMCB02 fields dirty on nested vmexit") Reported-by: Matteo Rizzo <matteorizzo@google.com> Cc: stable@vger.kernel.org Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250922162935.621409-2-jmattson@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
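Both VMRUN fixes above rely on the VMCB clean-bit convention: a set bit tells hardware it may reuse cached state, so marking a field group dirty means clearing its bit. The toy model below shows that convention only. The struct, the function names, and the bit positions are illustrative assumptions, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions; the authoritative values are in the kernel headers. */
enum toy_vmcb_clean_bit {
	TOY_VMCB_PERM_MAP = 1,	/* IOPM/MSRPM base addresses */
	TOY_VMCB_NPT = 4,	/* nested paging state, including g_pat */
};

struct toy_vmcb {
	uint32_t clean;
};

/* After a successful run, software may declare every field group clean. */
static inline void toy_vmcb_mark_all_clean(struct toy_vmcb *vmcb)
{
	vmcb->clean = ~0u;
}

/* Marking a group dirty clears its clean bit, forcing hardware to reload it. */
static inline void toy_vmcb_mark_dirty(struct toy_vmcb *vmcb, int bit)
{
	vmcb->clean &= ~(1u << bit);
}

static inline bool toy_vmcb_is_dirty(const struct toy_vmcb *vmcb, int bit)
{
	return !(vmcb->clean & (1u << bit));
}
```

Marking the bit dirty unconditionally on every nested VMRUN trades a small amount of reload work for not having to track whether L1 changed the PAT or the MSR permission map in between.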
2025-09-30Merge tag 'kvm-x86-cet-6.18' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 CET virtualization support for 6.18 Add support for virtualizing Control-flow Enforcement Technology (CET) on Intel (Shadow Stacks and Indirect Branch Tracking) and AMD (Shadow Stacks). CET is comprised of two distinct features, Shadow Stacks (SHSTK) and Indirect Branch Tracking (IBT), that can be utilized by software to help provide Control-flow integrity (CFI). SHSTK defends against backward-edge attacks (a.k.a. Return-oriented programming (ROP)), while IBT defends against forward-edge attacks (a.k.a. CALL/JMP-oriented programming (COP/JOP)). Attackers commonly use ROP and COP/JOP methodologies to redirect the control-flow to unauthorized targets in order to execute small snippets of code, a.k.a. gadgets, of the attacker's choice. By chaining together several gadgets, an attacker can perform arbitrary operations and circumvent the system's defenses. SHSTK defends against backward-edge attacks, which execute gadgets by modifying the stack to branch to the attacker's target via RET, by providing a second stack that is used exclusively to track control transfer operations. The shadow stack is separate from the data/normal stack, and can be enabled independently in user and kernel mode. When SHSTK is enabled, CALL instructions push the return address on both the data and shadow stack. RET then pops the return address from both stacks and compares the addresses. If the return addresses from the two stacks do not match, the CPU generates a Control Protection (#CP) exception. IBT defends against forward-edge attacks, which branch to gadgets by executing indirect CALL and JMP instructions with attacker controlled register or memory state, by requiring the target of indirect branches to start with a special marker instruction, ENDBRANCH. If an indirect branch is executed and the next instruction is not an ENDBRANCH, the CPU generates a #CP. Note, ENDBRANCH behaves as a NOP if IBT is disabled or unsupported.
From a virtualization perspective, CET presents several problems. While SHSTK and IBT have two layers of enabling, a global control in the form of a CR4 bit, and a per-feature control in user and kernel (supervisor) MSRs (U_CET and S_CET respectively), the {S,U}_CET MSRs can be context switched via XSAVES/XRSTORS. Practically speaking, intercepting and emulating XSAVES/XRSTORS is not a viable option due to complexity, and outright disallowing use of XSTATE to context switch SHSTK/IBT state would render the features unusable to most guests. To limit the overall complexity without sacrificing performance or usability, simply ignore the potential virtualization hole, but ensure that all paths in KVM treat SHSTK/IBT as usable by the guest if the feature is supported in hardware, and the guest has access to at least one of SHSTK or IBT. I.e. allow userspace to advertise one of SHSTK or IBT if both are supported in hardware, even though doing so would allow a misbehaving guest to use the unadvertised feature. Fully emulating SHSTK and IBT would also require significant complexity, e.g. to track and update branch state for IBT, and shadow stack state for SHSTK. Given that emulating large swaths of the guest code stream isn't necessary on modern CPUs, punt on emulating instructions that meaningfully impact or consume SHSTK or IBT. However, instead of doing nothing, explicitly reject emulation of such instructions so that KVM's emulator can't be abused to circumvent CET. Disable support for SHSTK and IBT if KVM is configured such that emulation of arbitrary guest instructions may be required, specifically if Unrestricted Guest (Intel only) is disabled, or if KVM will emulate a guest.MAXPHYADDR that is smaller than host.MAXPHYADDR. Lastly, disable SHSTK support if shadow paging is enabled, as the protections for the shadow stack are novel (shadow stacks require Writable=0,Dirty=1, so that they can't be directly modified by software), i.e. would require non-trivial support in the Shadow MMU.
Note, AMD CPUs currently only support SHSTK. Explicitly disable IBT support so that KVM doesn't over-advertise if AMD CPUs add IBT, and virtualizing IBT in SVM requires KVM modifications.
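The CALL/RET behavior described for SHSTK in this merge message can be modeled with two small stacks. This is a toy software model for illustration only: the struct, the function names, and the fixed depth are all hypothetical, and a false return from toy_ret() stands in for either a #CP or a stack underflow.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_STACK_DEPTH 16

struct toy_cpu {
	uint64_t data_stack[TOY_STACK_DEPTH];	/* writable by normal stores */
	uint64_t shadow_stack[TOY_STACK_DEPTH];	/* only CALL/RET touch this */
	size_t data_top;
	size_t shadow_top;
};

/* CALL: push the return address on both stacks. Returns false on overflow. */
static bool toy_call(struct toy_cpu *cpu, uint64_t ret_addr)
{
	if (cpu->data_top == TOY_STACK_DEPTH || cpu->shadow_top == TOY_STACK_DEPTH)
		return false;

	cpu->data_stack[cpu->data_top++] = ret_addr;
	cpu->shadow_stack[cpu->shadow_top++] = ret_addr;
	return true;
}

/*
 * RET: pop both stacks and compare. Returns true and stores the address on
 * a match; returns false (modeling #CP, or underflow in this toy) otherwise.
 */
static bool toy_ret(struct toy_cpu *cpu, uint64_t *ret_addr)
{
	uint64_t from_data, from_shadow;

	if (cpu->data_top == 0 || cpu->shadow_top == 0)
		return false;

	from_data = cpu->data_stack[--cpu->data_top];
	from_shadow = cpu->shadow_stack[--cpu->shadow_top];
	if (from_data != from_shadow)
		return false;

	*ret_addr = from_data;
	return true;
}
```

Overwriting an entry in data_stack between toy_call() and toy_ret() makes the comparison fail, which is the backward-edge protection in miniature.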
2025-09-30Merge tag 'kvm-x86-misc-6.18' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 changes for 6.18 - Don't (re)check L1 intercepts when completing userspace I/O to fix a flaw where a misbehaving userspace (a.k.a. syzkaller) could swizzle L1's intercepts and trigger a variety of WARNs in KVM. - Emulate PERF_CNTR_GLOBAL_STATUS_SET for PerfMonV2 guests, as the MSR is supposed to exist for v2 PMUs. - Allow Centaur CPU leaves (base 0xC000_0000) for Zhaoxin CPUs. - Clean up KVM's vector hashing code for delivering lowest priority IRQs. - Clean up the fastpath handler code to only handle IPIs and WRMSRs that are actually "fast", as opposed to handling those that KVM _hopes_ are fast, and in the process of doing so add fastpath support for TSC_DEADLINE writes on AMD CPUs. - Clean up a pile of PMU code in anticipation of adding support for mediated vPMUs. - Add support for the immediate forms of RDMSR and WRMSRNS, sans full emulator support (KVM should never need to emulate the MSRs outside of forced emulation and other contrived testing scenarios). - Clean up the MSR APIs in preparation for CET and FRED virtualization, as well as mediated vPMU support. - Reject a fully in-kernel IRQCHIP if EOIs are protected, i.e. for TDX VMs, as KVM can't faithfully emulate an I/O APIC for such guests. - Rework KVM_REQ_MSR_FILTER_CHANGED into a generic RECALC_INTERCEPTS in preparation for mediated vPMU support, as KVM will need to recalculate MSR intercepts in response to PMU refreshes for guests with mediated vPMUs. - Misc cleanups and minor fixes.
2025-09-30Merge tag 'kvm-x86-ciphertext-6.18' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM SEV-SNP CipherText Hiding support for 6.18 Add support for SEV-SNP's CipherText Hiding, an opt-in feature that prevents unauthorized CPU accesses from reading the ciphertext of SNP guest private memory, e.g. to attempt an offline attack. Instead of ciphertext, the CPU will always read back all FFs when CipherText Hiding is enabled. Add new module parameter to the KVM module to enable CipherText Hiding and control the number of ASIDs that can be used for VMs with CipherText Hiding, which is in effect the number of SNP VMs. When CipherText Hiding is enabled, the shared SEV-ES/SEV-SNP ASID space is split into separate ranges for SEV-ES and SEV-SNP guests, i.e. ASIDs that can be used for CipherText Hiding cannot be used to run SEV-ES guests.
2025-09-30Merge tag 'kvm-x86-svm-6.18' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM SVM changes for 6.18 - Require a minimum GHCB version of 2 when starting SEV-SNP guests via KVM_SEV_INIT2 so that invalid GHCB versions result in immediate errors instead of latent guest failures. - Add support for Secure TSC for SEV-SNP guests, which prevents the untrusted host from tampering with the guest's TSC frequency, while still allowing the VMM to configure the guest's TSC frequency prior to launch. - Mitigate the potential for TOCTOU bugs when accessing GHCB fields by wrapping all accesses via READ_ONCE(). - Validate the XCR0 provided by the guest (via the GHCB) to avoid tracking a bogus XCR0 value in KVM's software model. - Save an SEV guest's policy if and only if LAUNCH_START fully succeeds to avoid leaving behind stale state (thankfully not consumed in KVM). - Explicitly reject non-positive effective lengths during SNP's LAUNCH_UPDATE instead of subtly relying on guest_memfd to do the "heavy" lifting. - Reload the pre-VMRUN TSC_AUX on #VMEXIT for SEV-ES guests, not the host's desired TSC_AUX, to fix a bug where KVM could clobber a different vCPU's TSC_AUX due to hardware not matching the value cached in the user-return MSR infrastructure. - Enable AVIC by default for Zen4+ if x2AVIC (and other prereqs) is supported, and clean up the AVIC initialization code along the way.
2025-09-30Merge tag 'loongarch-kvm-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEADPaolo Bonzini
LoongArch KVM changes for v6.18 1. Add PTW feature detection on new hardware. 2. Add sign extension with kernel MMIO/IOCSR emulation. 3. Improve in-kernel IPI emulation. 4. Improve in-kernel PCH-PIC emulation. 5. Move kvm_iocsr tracepoint out of generic code.
2025-09-23KVM: SVM: Enable shadow stack virtualization for SVMJohn Allen
Remove the explicit clearing of shadow stack CPU capabilities. Reviewed-by: Chao Gao <chao.gao@intel.com> Signed-off-by: John Allen <john.allen@amd.com> Link: https://lore.kernel.org/r/20250919223258.1604852-41-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SEV: Synchronize MSR_IA32_XSS from the GHCB when it's validSean Christopherson
Synchronize XSS from the GHCB to KVM's internal tracking if the guest marks XSS as valid on a #VMGEXIT. Like XCR0, KVM needs an up-to-date copy of XSS in order to compute the required XSTATE size when emulating CPUID.0xD.0x1 for the guest. Treat the incoming XSS change as an emulated write, i.e. validate the guest-provided value, to avoid letting the guest load garbage into KVM's tracking. Simply ignore bad values, as either the guest managed to get an unsupported value into hardware, or the guest is misbehaving and providing pure garbage. In either case, KVM can't fix the broken guest. Explicitly allow access to XSS at all times, as KVM needs to ensure its copy of XSS stays up-to-date. E.g. KVM supports migration of SEV-ES guests and so needs to allow the host to save/restore XSS, otherwise a guest that *knows* its XSS hasn't changed could get stale/bad CPUID emulation if the guest doesn't provide XSS in the GHCB on every exit. This creates a hypothetical problem where a guest could request emulation of RDMSR or WRMSR on XSS, but arguably that's not even a problem, e.g. it would be entirely reasonable for a guest to request "emulation" as a way to inform the hypervisor that its XSS value has been modified. Note, emulating the change as an MSR write also takes care of side effects, e.g. marking dynamic CPUID bits as dirty. Suggested-by: John Allen <john.allen@amd.com> base-commit: 14298d819d5a6b7180a4089e7d2121ca3551dc6c Link: https://lore.kernel.org/r/20250919223258.1604852-40-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
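The "validate, and ignore bad values" policy above might look roughly like the sketch below. The function name and the supported_xss mask are hypothetical; the real change routes the value through KVM's MSR write emulation, which also handles side effects that this toy omits.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: update the tracked XSS only if the guest-provided
 * value sets no unsupported bits. A bad value is silently ignored and the
 * previously tracked value is kept. Returns true if tracking was updated.
 */
static inline bool toy_sync_xss(uint64_t *tracked_xss, uint64_t guest_xss,
				uint64_t supported_xss)
{
	if (guest_xss & ~supported_xss)
		return false;

	*tracked_xss = guest_xss;
	return true;
}
```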
2025-09-23KVM: SVM: Pass through shadow stack MSRs as appropriateJohn Allen
Pass through XSAVE managed CET MSRs on SVM when KVM supports shadow stack. These cannot be intercepted without also intercepting XSAVE which would likely cause unacceptable performance overhead. MSR_IA32_INT_SSP_TAB is not managed by XSAVE, so it is intercepted. Reviewed-by: Chao Gao <chao.gao@intel.com> Signed-off-by: John Allen <john.allen@amd.com> Link: https://lore.kernel.org/r/20250919223258.1604852-39-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SVM: Update dump_vmcb with shadow stack save area additionsJohn Allen
Add shadow stack VMCB fields to dump_vmcb. PL0_SSP, PL1_SSP, PL2_SSP, PL3_SSP, and U_CET are part of the SEV-ES save area and are encrypted, but can be decrypted and dumped if the guest policy allows debugging. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: John Allen <john.allen@amd.com> Link: https://lore.kernel.org/r/20250919223258.1604852-38-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: nSVM: Save/load CET Shadow Stack state to/from vmcb12/vmcb02Sean Christopherson
Transfer the three CET Shadow Stack VMCB fields (S_CET, ISST_ADDR, and SSP) on VMRUN, #VMEXIT, and loading nested state (saving nested state simply copies the entire save area). SVM doesn't provide a way to disallow L1 from enabling Shadow Stacks for L2, i.e. KVM *must* provide nested support before advertising SHSTK to userspace. Link: https://lore.kernel.org/r/20250919223258.1604852-37-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SVM: Emulate reads and writes to shadow stack MSRsJohn Allen
Emulate shadow stack MSR access by reading and writing to the corresponding fields in the VMCB. Signed-off-by: John Allen <john.allen@amd.com> [sean: mark VMCB_CET dirty/clean as appropriate] Link: https://lore.kernel.org/r/20250919223258.1604852-36-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: x86: Enable CET virtualization for VMX and advertise to userspaceYang Weijiang
Add support for the LOAD_CET_STATE VM-Enter and VM-Exit controls, the CET XFEATURE bits in XSS, and advertise support for IBT and SHSTK to userspace. Explicitly clear IBT and SHSTK on SVM, as additional work is needed to enable CET on SVM, e.g. to context switch S_CET and other state. Disable KVM CET feature if unrestricted_guest is unsupported/disabled as KVM does not support emulating CET, as running without Unrestricted Guest can result in KVM emulating large swaths of guest code. While it's highly unlikely any guest will trigger emulation while also utilizing IBT or SHSTK, there's zero reason to allow CET without Unrestricted Guest as that combination should only be possible when explicitly disabling unrestricted_guest for testing purposes. Disable CET if VMX_BASIC[bit56] == 0, i.e. if hardware strictly enforces the presence of an Error Code based on exception vector, as attempting to inject a #CP with an Error Code (#CP architecturally has an Error Code) will fail due to the #CP vector historically not having an Error Code. Clear S_CET and SSP-related VMCS on "reset" to emulate the architectural behavior of CET MSRs and SSP being reset to 0 after RESET, power-up and INIT. Note, KVM already clears guest CET state that is managed via XSTATE in kvm_xstate_reset(). Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Signed-off-by: Mathias Krause <minipli@grsecurity.net> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> [sean: move some bits to separate patches, massage changelog] Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-29-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: x86: Initialize allow_smaller_maxphyaddr earlier in setupSean Christopherson
Initialize allow_smaller_maxphyaddr during hardware setup as soon as KVM knows whether or not TDP will be utilized. To avoid having to teach KVM's emulator all about CET, KVM's upcoming CET virtualization support will be mutually exclusive with allow_smaller_maxphyaddr, i.e. will disable SHSTK and IBT if allow_smaller_maxphyaddr is enabled. In general, allow_smaller_maxphyaddr should be initialized as soon as possible since it's globally visible while its only input is whether or not EPT/NPT is enabled. I.e. there's effectively zero risk of setting allow_smaller_maxphyaddr too early, and substantial risk of setting it too late. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250922184743.1745778-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: x86: Merge 'svm' into 'cet' to pick up GHCB dependenciesSean Christopherson
Merge the queue of SVM changes for 6.18 to pick up the KVM-defined GHCB helpers so that kvm_ghcb_get_xss() can be used to virtualize CET for SEV-ES+ guests.
2025-09-23KVM: SVM: Enable AVIC by default for Zen4+ if x2AVIC is supportNaveen N Rao
AVIC and x2AVIC are fully functional since Zen 4, with no known hardware errata. Enable AVIC and x2AVIC by default on Zen4+ so long as x2AVIC is supported (to avoid enabling partial support for APIC virtualization by default). Internally, convert "avic" to an integer so that KVM can identify if the user has asked to explicitly enable or disable AVIC, i.e. so that KVM doesn't override an explicit 'y' from the user. Arbitrarily use -1 to denote auto-mode, and accept the string "auto" for the module param in addition to standard boolean values, i.e. continue to allow the user to configure the "avic" module parameter to explicitly enable/disable AVIC. To again maintain backward compatibility with a standard boolean param, set KERNEL_PARAM_OPS_FL_NOARG, which tells the params infrastructure to allow empty values for %true, i.e. to interpret a bare "avic" as "avic=y". Take care to check for a NULL @val when looking for "auto"! Lastly, always print "avic" as a boolean, since auto-mode is resolved during module initialization, i.e. the user should never see "auto" in sysfs. Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250919215934.1590410-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
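The tri-state parsing described above can be sketched in plain userspace C. This is not the kernel's param_ops callback: toy_avic_param_set() and TOY_AVIC_AUTO are hypothetical names, and the boolean parsing is deliberately simplified to the first character, whereas the kernel's boolean parser accepts more spellings.

```c
#include <stddef.h>
#include <string.h>

#define TOY_AVIC_AUTO (-1)	/* arbitrary sentinel for auto-mode, as in the commit */

/*
 * Hypothetical setter: returns 0 on success, -1 if the value is not
 * recognized. A NULL value models a bare "avic" on the command line and is
 * treated as "avic=y", so it must be checked before any string compare.
 */
static int toy_avic_param_set(const char *val, int *avic)
{
	if (!val) {
		*avic = 1;
		return 0;
	}

	if (strcmp(val, "auto") == 0) {
		*avic = TOY_AVIC_AUTO;
		return 0;
	}

	/* Simplified boolean parse; only the first character is examined. */
	switch (val[0]) {
	case 'y': case 'Y': case '1':
		*avic = 1;
		return 0;
	case 'n': case 'N': case '0':
		*avic = 0;
		return 0;
	default:
		return -1;
	}
}
```

Keeping auto-mode as a distinct value lets later setup code tell "the user said yes" apart from "the user said nothing", which is what allows an explicit choice to win over the new default.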
2025-09-23KVM: SVM: Move global "avic" variable to avic.cSean Christopherson
Move "avic" to avic.c so that it's colocated with the other AVIC specific globals and module params, and so that avic_hardware_setup() is a bit more self-contained, e.g. similar to sev_hardware_setup(). Deliberately set enable_apicv in svm.c as it's already globally visible (defined by kvm.ko, not by kvm-amd.ko), and to clearly capture the dependency on enable_apicv being initialized (svm_hardware_setup() clears several AVIC-specific hooks when enable_apicv is disabled). Alternatively, clearing of the hooks (and enable_ipiv) could be moved to avic_hardware_setup(), but that's not obviously better, e.g. it's helpful to isolate the setting of enable_apicv when reading code from the generic x86 side of the world. No functional change intended. Acked-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250919215934.1590410-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SVM: Don't advise the user to do force_avic=y (when x2AVIC is detected)Sean Christopherson
Don't advise the end user to try to force enable AVIC when x2AVIC is reported as supported in CPUID, as forcefully enabling AVIC isn't something that should be done lightly. E.g. some Zen4 client systems hide AVIC but leave x2AVIC behind, and while such a configuration is indeed due to buggy firmware in the sense that reporting x2AVIC without AVIC is nonsensical, KVM has no idea _why_ firmware disabled AVIC in the first place. Suggesting that the user try to run with force_avic=y is sketchy even if the user explicitly tries to enable AVIC, and will be downright irresponsible once KVM starts enabling AVIC by default. Alternatively, KVM could print the message only when the user explicitly asks for AVIC, but running with force_avic=y isn't something that should be encouraged for random users. force_avic is a useful knob for developers and perhaps even advanced users, but isn't something that KVM should advertise broadly. Opportunistically append a newline to the pr_warn() so that it prints out immediately, and tweak the message to say that AVIC is unsupported instead of disabled (disabled suggests that the kernel/KVM is somehow responsible). Suggested-by: Naveen N Rao (AMD) <naveen@kernel.org> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250919215934.1590410-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>