path: root/arch/x86/kvm/svm/svm.h
4 daysMerge tag 'kvm-x86-fixes-6.19-rc1' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM fixes for 6.19-rc1

 - Add a missing "break" to fix param parsing in the rseq selftest.

 - Apply runtime updates to the _current_ CPUID when userspace is setting CPUID, e.g. as part of vCPU hotplug, to fix a false positive and to avoid dropping the pending update.

 - Disallow toggling KVM_MEM_GUEST_MEMFD on an existing memslot, as it's not supported by KVM and leads to a use-after-free due to KVM failing to unbind the memslot from the previously-associated guest_memfd instance.

 - Harden against similar KVM_MEM_GUEST_MEMFD goofs, and prepare for supporting flags-only changes on KVM_MEM_GUEST_MEMFD memslots, e.g. for dirty logging.

 - Set exit_code[63:32] to -1 (all 0xffs) when synthesizing a nested SVM_EXIT_ERR (a.k.a. VMEXIT_INVALID) #VMEXIT, as VMEXIT_INVALID is defined as -1ull (a 64-bit value).

 - Update SVI when activating APICv to fix a bug where a post-activation EOI for an in-service IRQ would effectively be lost due to SVI being stale.

 - Immediately refresh APICv controls (if necessary) on a nested VM-Exit instead of deferring the update via KVM_REQ_APICV_UPDATE, as the request is effectively ignored because KVM thinks the vCPU already has the correct APICv settings.
2025-12-04KVM: nSVM: Clear exit_code_hi in VMCB when synthesizing nested VM-ExitsSean Christopherson
Explicitly clear exit_code_hi in the VMCB when synthesizing "normal" nested VM-Exits, as the full exit code is a 64-bit value (spoiler alert), and all exit codes for non-failing VMRUN use only bits 31:0. Cc: Jim Mattson <jmattson@google.com> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: stable@vger.kernel.org Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251113225621.1688428-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
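For illustration, a minimal userspace sketch (the struct and names are simplified stand-ins for the VMCB control area, not KVM code) of why synthesizing VMEXIT_INVALID requires touching both 32-bit halves:

  #include <stdint.h>
  #include <stdio.h>

  #define SVM_EXIT_ERR ((uint64_t)-1)    /* VMEXIT_INVALID is -1ull */

  /* Toy model of the two 32-bit VMCB fields that together hold the
   * 64-bit exit code. */
  struct vmcb_control_model {
      uint32_t exit_code;       /* bits 31:0 */
      uint32_t exit_code_hi;    /* bits 63:32 */
  };

  int main(void)
  {
      struct vmcb_control_model c;

      /* Writing only the low word would yield 0x00000000ffffffff,
       * which is not VMEXIT_INVALID; both halves must be set. */
      c.exit_code    = (uint32_t)SVM_EXIT_ERR;
      c.exit_code_hi = (uint32_t)(SVM_EXIT_ERR >> 32);

      printf("exit_code=0x%08x exit_code_hi=0x%08x\n",
             c.exit_code, c.exit_code_hi);
      return 0;
  }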
2025-11-26Merge tag 'kvm-x86-svm-6.19' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM SVM changes for 6.19:

 - Fix a few missing "VMCB dirty" bugs.

 - Fix the worst of KVM's lack of EFER.LMSLE emulation.

 - Add AVIC support for addressing 4k vCPUs in x2AVIC mode.

 - Fix incorrect handling of selective CR0 writes when checking intercepts during emulation of L2 instructions.

 - Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on VMRUN and #VMEXIT.

 - Fix a bug where KVM corrupts the guest code stream when re-injecting a soft interrupt if the guest patched the underlying code after the VM-Exit, e.g. when Linux patches code with a temporary INT3.

 - Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to userspace, and extend KVM "support" to all policy bits that don't require any actual support from KVM.
2025-11-18KVM: SVM: Fix redundant updates of LBR MSR interceptsYosry Ahmed
Don't update the LBR MSR intercept bitmaps if they're already up-to-date, as unconditionally updating the intercepts forces KVM to recalculate the MSR bitmaps for vmcb02 on every nested VMRUN. The redundant updates are functionally okay; however, they neuter an optimization in Hyper-V nested virtualization enlightenments, and this manifests as a self-test failure.

In particular, Hyper-V lets L1 mark "nested enlightenments" as clean, i.e. tell KVM that no changes were made to the MSR bitmap since the last VMRUN. The hyperv_svm_test KVM selftest intentionally changes the MSR bitmap "without telling KVM about it" to verify that KVM honors the clean hint, and correctly fails because KVM notices the changed bitmap anyway:

  ==== Test Assertion Failure ====
  x86/hyperv_svm_test.c:120: vmcb->control.exit_code == 0x081
  pid=193558 tid=193558 errno=4 - Interrupted system call
     1  0x0000000000411361: assert_on_unhandled_exception at processor.c:659
     2  0x0000000000406186: _vcpu_run at kvm_util.c:1699
     3   (inlined by) vcpu_run at kvm_util.c:1710
     4  0x0000000000401f2a: main at hyperv_svm_test.c:175
     5  0x000000000041d0d3: __libc_start_call_main at libc-start.o:?
     6  0x000000000041f27c: __libc_start_main_impl at ??:?
     7  0x00000000004021a0: _start at ??:?
  vmcb->control.exit_code == SVM_EXIT_VMMCALL

Do *not* fix this by skipping svm_hv_vmcb_dirty_nested_enlightenments() when svm_set_intercept_for_msr() performs a no-op change: changes to the L0 MSR interception bitmap are only triggered by full CPUID updates and MSR filter updates, both of which should be rare. Changing svm_set_intercept_for_msr() risks hiding unintended pessimizations like this one, and is actually more complex than this change.

Fixes: fbe5e5f030c2 ("KVM: nSVM: Always recalculate LBR MSR intercepts in svm_update_lbrv()")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251112013017.1836863-1-yosry.ahmed@linux.dev
[Rewritten commit message based on mailing list discussion. - Paolo]
Reviewed-by: Sean Christopherson <seanjc@google.com>
Tested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
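A sketch of the shape of the fix (the field and function names here are hypothetical, not the actual KVM code): remember whether the LBR MSRs are currently intercepted and return early when nothing changes, so the bitmap is never dirtied needlessly:

  #include <stdbool.h>

  struct vcpu_model {
      bool lbr_msrs_intercepted;    /* current bitmap state */
      bool msr_bitmap_dirty;        /* forces a vmcb02 re-merge */
  };

  static void set_lbr_msr_intercepts(struct vcpu_model *vcpu, bool intercept)
  {
      /* Bail if the bitmaps are already up-to-date; unconditionally
       * rewriting them would force KVM to recalculate the vmcb02
       * bitmap on every nested VMRUN. */
      if (vcpu->lbr_msrs_intercepted == intercept)
          return;

      /* ... flip the LBR MSR read/write bits here ... */
      vcpu->lbr_msrs_intercepted = intercept;
      vcpu->msr_bitmap_dirty = true;
  }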
2025-11-14KVM: SEV: Consolidate the SEV policy bits in a single header fileTom Lendacky
Consolidate SEV policy bit definitions into a single file. Use include/linux/psp-sev.h to hold the definitions and remove the current definitions from the arch/x86/kvm/svm/sev.c and arch/x86/include/asm/svm.h files. No functional change intended.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://patch.msgid.link/d9639f88a0b521a1a67aeac77cc609fdea1f90bd.1761593632.git.thomas.lendacky@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-04KVM: SVM: switch to raw spinlock for svm->ir_list_lockMaxim Levitsky
Use a raw spinlock for vcpu_svm.ir_list_lock as the lock can be taken during schedule() via kvm_sched_out() => __avic_vcpu_put(), and "normal" spinlocks are sleepable locks when PREEMPT_RT=y. This fixes the following lockdep warning:

  =============================
  [ BUG: Invalid wait context ]
  6.12.0-146.1640_2124176644.el10.x86_64+debug #1 Not tainted
  -----------------------------
  qemu-kvm/38299 is trying to lock:
  ff11000239725600 (&svm->ir_list_lock){....}-{3:3}, at: __avic_vcpu_put+0xfd/0x300 [kvm_amd]
  other info that might help us debug this:
  context-{5:5}
  2 locks held by qemu-kvm/38299:
   #0: ff11000239723ba8 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x240/0xe00 [kvm]
   #1: ff11000b906056d8 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x2e/0x130
  stack backtrace:
  CPU: 1 UID: 0 PID: 38299 Comm: qemu-kvm Kdump: loaded Not tainted 6.12.0-146.1640_2124176644.el10.x86_64+debug #1 PREEMPT(voluntary)
  Hardware name: AMD Corporation QUARTZ/QUARTZ, BIOS RQZ100AB 09/14/2023
  Call Trace:
   <TASK>
   dump_stack_lvl+0x6f/0xb0
   __lock_acquire+0x921/0xb80
   lock_acquire.part.0+0xbe/0x270
   _raw_spin_lock_irqsave+0x46/0x90
   __avic_vcpu_put+0xfd/0x300 [kvm_amd]
   svm_vcpu_put+0xfa/0x130 [kvm_amd]
   kvm_arch_vcpu_put+0x48c/0x790 [kvm]
   kvm_sched_out+0x161/0x1c0 [kvm]
   prepare_task_switch+0x36b/0xf60
   __schedule+0x4f7/0x1890
   schedule+0xd4/0x260
   xfer_to_guest_mode_handle_work+0x54/0xc0
   vcpu_run+0x69a/0xa70 [kvm]
   kvm_arch_vcpu_ioctl_run+0xdc0/0x17e0 [kvm]
   kvm_vcpu_ioctl+0x39f/0xe00 [kvm]

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://patch.msgid.link/20251030194130.307900-1-mlevitsk@redhat.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-04KVM: SVM: Make avic_ga_log_notifier() local to avic.cSean Christopherson
Make avic_ga_log_notifier() a local symbol now that it's defined and used purely within avic.c. No functional change intended.

Fixes: 4bdec12aa8d6 ("KVM: SVM: Detect X2APIC virtualization (x2AVIC) support")
Link: https://patch.msgid.link/20251016190643.80529-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-04KVM: SVM: Unregister KVM's GALog notifier on kvm-amd.ko exitSean Christopherson
Unregister the GALog notifier (used to get notified of wake events for blocking vCPUs) on kvm-amd.ko exit so that a KVM or IOMMU driver bug that results in a spurious GALog event "only" results in a spurious IRQ, and doesn't trigger a use-after-free due to executing unloaded module code. Fixes: 5881f73757cc ("svm: Introduce AMD IOMMU avic_ga_log_notifier") Reported-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Closes: https://lore.kernel.org/all/20250918130320.GA119526@k08j02272.eu95sqa Link: https://patch.msgid.link/20251016190643.80529-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-17KVM: SVM: Move AVIC Physical ID table allocation to vcpu_precreate()Naveen N Rao
With support for 4k vCPUs in x2AVIC, the size of the AVIC Physical ID table is expanded from a single 4k page to a maximum of 8 contiguous 4k pages. The actual number of pages allocated depends on the maximum possible APIC ID in the guest, which is only known by the time the first vCPU is created. In preparation for supporting a dynamic AVIC Physical ID table size, move its allocation to vcpu_precreate(). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/7dc764e0af7f01440bbac3d9215ed174027c2384.1757009416.git.naveen@kernel.org [sean: drop enable_apicv check from svm_vcpu_precreate()] Signed-off-by: Sean Christopherson <seanjc@google.com>
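Back-of-the-envelope sizing (a standalone sketch; the 8-byte entry size comes from the AVIC Physical ID entry format, the rest is invented for illustration): the full 4096-ID space needs eight 4KiB pages, while a small guest still fits in one:

  #include <stdio.h>

  #define AVIC_ENTRY_SIZE 8       /* bytes per Physical ID entry */
  #define PAGE_SIZE       4096

  static unsigned int avic_table_pages(unsigned int max_apic_id)
  {
      unsigned int bytes = (max_apic_id + 1) * AVIC_ENTRY_SIZE;

      /* DIV_ROUND_UP */
      return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
  }

  int main(void)
  {
      printf("max_id=255  -> %u page(s)\n", avic_table_pages(255));   /* 1 */
      printf("max_id=4095 -> %u page(s)\n", avic_table_pages(4095));  /* 8 */
      return 0;
  }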
2025-09-30Merge tag 'kvm-x86-cet-6.18' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 CET virtualization support for 6.18

Add support for virtualizing Control-flow Enforcement Technology (CET) on Intel (Shadow Stacks and Indirect Branch Tracking) and AMD (Shadow Stacks).

CET is comprised of two distinct features, Shadow Stacks (SHSTK) and Indirect Branch Tracking (IBT), that can be utilized by software to help provide Control-flow integrity (CFI). SHSTK defends against backward-edge attacks (a.k.a. Return-oriented programming (ROP)), while IBT defends against forward-edge attacks (a.k.a. CALL/JMP-oriented programming (COP/JOP)).

Attackers commonly use ROP and COP/JOP methodologies to redirect the control-flow to unauthorized targets in order to execute small snippets of code, a.k.a. gadgets, of the attacker's choice. By chaining together several gadgets, an attacker can perform arbitrary operations and circumvent the system's defenses.

SHSTK defends against backward-edge attacks, which execute gadgets by modifying the stack to branch to the attacker's target via RET, by providing a second stack that is used exclusively to track control transfer operations. The shadow stack is separate from the data/normal stack, and can be enabled independently in user and kernel mode. When SHSTK is enabled, CALL instructions push the return address on both the data and shadow stack. RET then pops the return address from both stacks and compares the addresses. If the return addresses from the two stacks do not match, the CPU generates a Control Protection (#CP) exception.

IBT defends against forward-edge attacks, which branch to gadgets by executing indirect CALL and JMP instructions with attacker-controlled register or memory state, by requiring the target of indirect branches to start with a special marker instruction, ENDBRANCH. If an indirect branch is executed and the next instruction is not an ENDBRANCH, the CPU generates a #CP. Note, ENDBRANCH behaves as a NOP if IBT is disabled or unsupported.

From a virtualization perspective, CET presents several problems. While SHSTK and IBT have two layers of enabling, a global control in the form of a CR4 bit, and per-feature controls in user and kernel (supervisor) MSRs (U_CET and S_CET respectively), the {S,U}_CET MSRs can be context switched via XSAVES/XRSTORS. Practically speaking, intercepting and emulating XSAVES/XRSTORS is not a viable option due to complexity, and outright disallowing use of XSTATE to context switch SHSTK/IBT state would render the features unusable to most guests. To limit the overall complexity without sacrificing performance or usability, simply ignore the potential virtualization hole, but ensure that all paths in KVM treat SHSTK/IBT as usable by the guest if the feature is supported in hardware and the guest has access to at least one of SHSTK or IBT. I.e. allow userspace to advertise one of SHSTK or IBT if both are supported in hardware, even though doing so would allow a misbehaving guest to use the unadvertised feature.

Fully emulating SHSTK and IBT would also require significant complexity, e.g. to track and update branch state for IBT, and shadow stack state for SHSTK. Given that emulating large swaths of the guest code stream isn't necessary on modern CPUs, punt on emulating instructions that meaningfully impact or consume SHSTK or IBT. However, instead of doing nothing, explicitly reject emulation of such instructions so that KVM's emulator can't be abused to circumvent CET.
Disable support for SHSTK and IBT if KVM is configured such that emulation of arbitrary guest instructions may be required, specifically if Unrestricted Guest (Intel only) is disabled, or if KVM will emulate a guest.MAXPHYADDR that is smaller than host.MAXPHYADDR. Lastly, disable SHSTK support if shadow paging is enabled, as the protections for the shadow stack are novel (shadow stacks require Writable=0,Dirty=1, so that they can't be directly modified by software), i.e. would require non-trivial support in the Shadow MMU.

Note, AMD CPUs currently only support SHSTK. Explicitly disable IBT support so that KVM doesn't over-advertise if AMD CPUs add IBT, as virtualizing IBT in SVM would require KVM modifications.
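To make the SHSTK mechanism described above concrete, here is a tiny software model (purely illustrative; real shadow stacks are enforced by the CPU, not by code like this): CALL pushes the return address onto both stacks, RET pops both and raises #CP on a mismatch:

  #include <inttypes.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  static uint64_t data_stack[64], shadow_stack[64];
  static int data_sp, shadow_sp;

  static void model_call(uint64_t ret_addr)
  {
      data_stack[data_sp++]     = ret_addr;  /* normal stack */
      shadow_stack[shadow_sp++] = ret_addr;  /* shadow stack */
  }

  static uint64_t model_ret(void)
  {
      uint64_t d = data_stack[--data_sp];
      uint64_t s = shadow_stack[--shadow_sp];

      /* Mismatch means the data stack was tampered with: #CP. */
      if (d != s) {
          fprintf(stderr, "#CP: 0x%" PRIx64 " != 0x%" PRIx64 "\n", d, s);
          exit(1);
      }
      return d;
  }

  int main(void)
  {
      model_call(0x401000);
      data_stack[0] = 0xbadc0de;  /* ROP-style overwrite */
      model_ret();                /* model raises #CP */
      return 0;
  }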
2025-09-30Merge tag 'kvm-x86-svm-6.18' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM SVM changes for 6.18

 - Require a minimum GHCB version of 2 when starting SEV-SNP guests via KVM_SEV_INIT2 so that invalid GHCB versions result in immediate errors instead of latent guest failures.

 - Add support for Secure TSC for SEV-SNP guests, which prevents the untrusted host from tampering with the guest's TSC frequency, while still allowing the VMM to configure the guest's TSC frequency prior to launch.

 - Mitigate the potential for TOCTOU bugs when accessing GHCB fields by wrapping all accesses via READ_ONCE().

 - Validate the XCR0 provided by the guest (via the GHCB) to avoid tracking a bogus XCR0 value in KVM's software model.

 - Save an SEV guest's policy if and only if LAUNCH_START fully succeeds to avoid leaving behind stale state (thankfully not consumed in KVM).

 - Explicitly reject non-positive effective lengths during SNP's LAUNCH_UPDATE instead of subtly relying on guest_memfd to do the "heavy" lifting.

 - Reload the pre-VMRUN TSC_AUX on #VMEXIT for SEV-ES guests, not the host's desired TSC_AUX, to fix a bug where KVM could clobber a different vCPU's TSC_AUX due to hardware not matching the value cached in the user-return MSR infrastructure.

 - Enable AVIC by default for Zen4+ if x2AVIC (and other prereqs) is supported, and clean up the AVIC initialization code along the way.
2025-09-23KVM: SEV: Synchronize MSR_IA32_XSS from the GHCB when it's validSean Christopherson
Synchronize XSS from the GHCB to KVM's internal tracking if the guest marks XSS as valid on a #VMGEXIT. Like XCR0, KVM needs an up-to-date copy of XSS in order to compute the required XSTATE size when emulating CPUID.0xD.0x1 for the guest.

Treat the incoming XSS change as an emulated write, i.e. validate the guest-provided value, to avoid letting the guest load garbage into KVM's tracking. Simply ignore bad values, as either the guest managed to get an unsupported value into hardware, or the guest is misbehaving and providing pure garbage. In either case, KVM can't fix the broken guest.

Explicitly allow access to XSS at all times, as KVM needs to ensure its copy of XSS stays up-to-date. E.g. KVM supports migration of SEV-ES guests and so needs to allow the host to save/restore XSS, otherwise a guest that *knows* its XSS hasn't changed could get stale/bad CPUID emulation if the guest doesn't provide XSS in the GHCB on every exit. This creates a hypothetical problem where a guest could request emulation of RDMSR or WRMSR on XSS, but arguably that's not even a problem, e.g. it would be entirely reasonable for a guest to request "emulation" as a way to inform the hypervisor that its XSS value has been modified.

Note, emulating the change as an MSR write also takes care of side effects, e.g. marking dynamic CPUID bits as dirty.

Suggested-by: John Allen <john.allen@amd.com>
base-commit: 14298d819d5a6b7180a4089e7d2121ca3551dc6c
Link: https://lore.kernel.org/r/20250919223258.1604852-40-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
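A hedged sketch of the "treat it as an emulated write" flow (simplified, hypothetical names, not KVM's actual structures): validate against the supported mask and silently drop garbage rather than poisoning KVM's software model:

  #include <stdbool.h>
  #include <stdint.h>

  struct vcpu_model {
      uint64_t ia32_xss;            /* KVM's tracked copy */
      bool     xstate_size_dirty;   /* dynamic CPUID side effect */
  };

  static void sync_xss_from_ghcb(struct vcpu_model *vcpu,
                                 uint64_t ghcb_xss, uint64_t supported_xss)
  {
      /* Ignore unsupported bits: either the guest somehow loaded a
       * bad value into hardware or it's sending pure garbage, and
       * KVM can't fix a broken guest either way. */
      if (ghcb_xss & ~supported_xss)
          return;

      if (vcpu->ia32_xss != ghcb_xss) {
          vcpu->ia32_xss = ghcb_xss;
          /* Emulated-write side effect: XSTATE size changed. */
          vcpu->xstate_size_dirty = true;
      }
  }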
2025-09-23KVM: SVM: Emulate reads and writes to shadow stack MSRsJohn Allen
Emulate shadow stack MSR access by reading and writing to the corresponding fields in the VMCB. Signed-off-by: John Allen <john.allen@amd.com> [sean: mark VMCB_CET dirty/clean as appropriate] Link: https://lore.kernel.org/r/20250919223258.1604852-36-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SVM: Update "APICv in x2APIC without x2AVIC" in avic.c, not svm.cSean Christopherson
Set the "allow_apicv_in_x2apic_without_x2apic_virtualization" flag as part of avic_hardware_setup() instead of handling in svm_hardware_setup(), and make x2avic_enabled local to avic.c (setting the flag was the only use in svm.c). Tag avic_hardware_setup() with __init as necessary (it should have been tagged __init long ago). No functional change intended (aside from the side effects of tagging avic_hardware_setup() with __init). Acked-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250919215934.1590410-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SVM: Move x2AVIC MSR interception helper to avic.cSean Christopherson
Move svm_set_x2apic_msr_interception() to avic.c as it's only relevant when x2AVIC is enabled/supported and only called by AVIC code. In addition to scoping AVIC code to avic.c, this will allow burying the global x2avic_enabled variable in avic.c. Opportunistically rename the helper to explicitly scope it to "avic". No functional change intended.

Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Tested-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250919215934.1590410-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SVM: Make svm_x86_ops globally visible, clean up on-HyperV usageSean Christopherson
Make svm_x86_ops globally visible in anticipation of modifying the struct in avic.c, and clean up the KVM-on-HyperV usage, as declaring _and using_ a local variable in a header that's only defined in one specific .c-file is all kinds of ugly. Opportunistically make svm_hv_enable_l2_tlb_flush() local to svm_onhyperv.c, as the only reason it was visible was due to the aforementioned shenanigans in svm_onhyperv.h. Alternatively, svm_x86_ops could be explicitly passed to svm_hv_hardware_setup() as a parameter. While that approach is slightly safer, e.g. avoids "hidden" updates, for better or worse, the Intel side of KVM has already chosen to expose vt_x86_ops (and vt_init_ops). Given that svm_x86_ops is only truly consumed by kvm_ops_update, the odds of a "hidden" update causing problems are extremely low. So, absent a strong reason to rework the VMX/TDX code, make svm_x86_ops visible, as having all updates use exactly "svm_x86_ops." is advantageous in its own right. No functional change intended. Link: https://lore.kernel.org/r/20250919215934.1590410-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SVM: Re-load current, not host, TSC_AUX on #VMEXIT from SEV-ES guestHou Wenlong
Prior to running an SEV-ES guest, set TSC_AUX in the host save area to the current value in hardware, as tracked by the user return infrastructure, instead of always loading the host's desired value for the CPU. If the pCPU is also running a non-SEV-ES vCPU, loading the host's value on #VMEXIT could clobber the other vCPU's value, e.g. if the SEV-ES vCPU preempted the non-SEV-ES vCPU, in which case KVM expects the other vCPU's TSC_AUX value to be resident in hardware.

Note, unlike TDX, which blindly _zeroes_ TSC_AUX on TD-Exit, SEV-ES CPUs can load an arbitrary value. Stuff the current value in the host save area instead of refreshing the user return cache so that KVM doesn't need to track whether or not the vCPU actually entered the guest and thus loaded TSC_AUX from the host save area.

Opportunistically tag tsc_aux_uret_slot as read-only after init to guard against unexpected modifications, and to make it obvious that using the variable in sev_es_prepare_switch_to_guest() is safe.

Fixes: 916e3e5f26ab ("KVM: SVM: Do not use user return MSR support for virtualized TSC_AUX")
Cc: stable@vger.kernel.org
Suggested-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
[sean: handle the SEV-ES case in sev_es_prepare_switch_to_guest()]
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250923153738.1875174-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SEV: Read save fields from GHCB exactly onceSean Christopherson
Wrap all reads of GHCB save fields with READ_ONCE() via a KVM-specific GHCB get() utility to help guard against TOCTOU bugs. Using READ_ONCE() doesn't completely prevent such bugs, e.g. doesn't prevent KVM from redoing get() after checking the initial value, but at least addresses all potential TOCTOU issues in the current KVM code base. To prevent unintentional use of the generic helpers, take only @svm for the kvm_ghcb_get_xxx() helpers and retrieve the ghcb instead of explicitly passing it in. Opportunistically reduce the indentation of the macro-defined helpers and clean up the alignment. Fixes: 4e15a0ddc3ff ("KVM: SEV: snapshot the GHCB before accessing it") Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250919223258.1604852-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
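The helpers are macro-generated; a self-contained sketch of the pattern follows (toy structs, and READ_ONCE reduced to its volatile-read core, so this models the idea rather than reproducing the exact KVM macros):

  #include <stdint.h>

  #define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

  /* Toy stand-ins for the real structures. */
  struct ghcb_save_model { uint64_t rax; uint64_t xcr0; };
  struct vcpu_svm_model  { struct ghcb_save_model *ghcb; };

  /* Take only @svm so callers can't pass in an arbitrary GHCB, and
   * read the guest-writable field exactly once to narrow the TOCTOU
   * window to a single load. */
  #define DEFINE_KVM_GHCB_GET(field)                                       \
  static inline uint64_t kvm_ghcb_get_##field(struct vcpu_svm_model *svm)  \
  {                                                                        \
      return READ_ONCE(svm->ghcb->field);                                  \
  }

  DEFINE_KVM_GHCB_GET(rax)
  DEFINE_KVM_GHCB_GET(xcr0)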
2025-08-27KVM: x86/mmu: Extend guest_memfd's max mapping level to shared mappingsSean Christopherson
Rework kvm_mmu_max_mapping_level() to consult guest_memfd for all mappings, not just private mappings, so that hugepage support plays nice with the upcoming support for backing non-private memory with guest_memfd. In addition to getting the max order from guest_memfd for gmem-only memslots, update TDX's hook to effectively ignore shared mappings, as TDX's restrictions on page size only apply to Secure EPT mappings. Do nothing for SNP, as RMP restrictions apply to both private and shared memory. Suggested-by: Ackerley Tng <ackerleytng@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Fuad Tabba <tabba@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20250729225455.670324-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27KVM: x86/mmu: Rename .private_max_mapping_level() to .gmem_max_mapping_level()Ackerley Tng
Rename kvm_x86_ops.private_max_mapping_level() to .gmem_max_mapping_level() in anticipation of extending guest_memfd support to non-private memory. No functional change intended. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ackerley Tng <ackerleytng@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Message-ID: <20250729225455.670324-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-21KVM: SEV: Fold sev_es_vcpu_reset() into sev_vcpu_create()Sean Christopherson
Fold the remaining line of sev_es_vcpu_reset() into sev_vcpu_create() as there's no need for a dedicated RESET hook just to init a mutex, and the mutex should be initialized as early as possible anyways. No functional change intended. Reviewed-by: Nikunj A Dadhania <nikunj@amd.com> Link: https://lore.kernel.org/r/20250819234833.3080255-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-21KVM: SEV: Move init of SNP guest state into sev_init_vmcb()Sean Christopherson
Move the initialization of SNP guest state from svm_vcpu_reset() into sev_init_vmcb() to reduce the number of paths that deal with INIT/RESET for SEV+ vCPUs from 4+ to 1. Plumb in @init_event as necessary. Opportunistically check for an SNP guest outside of sev_snp_init_protected_guest_state() so that sev_init_vmcb() is consistent with respect to checking for SEV-ES+ and SNP+ guests. No functional change intended. Reviewed-by: Nikunj A Dadhania <nikunj@amd.com> Link: https://lore.kernel.org/r/20250819234833.3080255-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-21KVM: SVM: Move SEV-ES VMSA allocation to a dedicated sev_vcpu_create() helperSean Christopherson
Add a dedicated sev_vcpu_create() helper to allocate the VMSA page for SEV-ES+ vCPUs, and to allow for consolidating a variety of related SEV+ code in the near future. No functional change intended. Reviewed-by: Nikunj A Dadhania <nikunj@amd.com> Link: https://lore.kernel.org/r/20250819234833.3080255-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-29Merge tag 'kvm-x86-sev-6.17' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM SEV cache maintenance changes for 6.17

 - Drop a superfluous WBINVD (on all CPUs!) when destroying a VM.

 - Use WBNOINVD instead of WBINVD when possible, for SEV cache maintenance, e.g. to minimize collateral damage when reclaiming memory from an SEV guest.

 - When reclaiming memory from an SEV guest, only do cache flushes on CPUs that have ever run a vCPU for the guest, i.e. don't flush the caches for CPUs that can't possibly have cache lines with dirty, encrypted data.
2025-07-29Merge tag 'kvm-x86-misc-6.17' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 misc changes for 6.17

 - Preserve the host's DEBUGCTL.FREEZE_IN_SMM (Intel only) when running the guest. Failure to honor FREEZE_IN_SMM can bleed host state into the guest.

 - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter (Intel only) to prevent L1 from running L2 with features that KVM doesn't support, e.g. BTF.

 - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to the vCPU's CPUID model.

 - Rework the MSR interception code so that the SVM and VMX APIs are more or less identical.

 - Recalculate all MSR intercepts from the "source" on MSR filter changes, and drop the dedicated "shadow" bitmaps (and their awful "max" size defines).

 - WARN and reject loading kvm-amd.ko instead of panicking the kernel if the nested SVM MSRPM offsets tracker can't handle an MSR.

 - Advertise support for LKGS (Load Kernel GS base), a new instruction that's loosely related to FRED, but is supported and enumerated independently.

 - Fix a user-triggerable WARN that syzkaller found by stuffing INIT_RECEIVED, a.k.a. WFS, and then putting the vCPU into VMX Root Mode (post-VMXON). Use the same approach KVM uses for dealing with "impossible" emulation when running a !URG guest, and simply wait until KVM_RUN to detect that the vCPU has architecturally impossible state.

 - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling interception of APERF/MPERF reads, so that a "properly" configured VM can "virtualize" APERF/MPERF (with many caveats).

 - Reject KVM_SET_TSC_KHZ if vCPUs have been created, as changing the "default" frequency is unsupported for VMs with a "secure" TSC, and there's no known use case for changing the default frequency for other VM types.
2025-07-14KVM: SVM: Flush cache only on CPUs running SEV guestZheyun Shen
On AMD CPUs that don't ensure cache consistency for SEV guest memory, each memory page reclamation in an SEV guest triggers a call to do WBNOINVD/WBINVD on all CPUs, thereby affecting the performance of other programs on the host. Typically, an AMD server may have 128 cores or more, while the SEV guest might only utilize 8 of these cores. Meanwhile, the host can use qemu-affinity to bind these 8 vCPUs to specific physical CPUs. Therefore, keeping a record of the physical core numbers each time a vCPU runs can help avoid flushing the cache for all CPUs every time.

Take care to allocate the cpumask used to track which CPUs have run a vCPU when copying or moving an "encryption context", as nothing guarantees memory in a mirror VM is a strict subset of the ASID owner, and the destination VM for intrahost migration needs to maintain its own set of CPUs. E.g. for intrahost migration, if a CPU was used for the source VM but not the destination VM, then it can only have cached memory that was accessible to the source VM. And a CPU that was used by the source and is also used by the destination is no different than a CPU that was used by the destination only.

Note, KVM is guaranteed to flush caches prior to sev_vm_destroy(), thanks to kvm_arch_guest_memory_reclaimed() for SEV and SEV-ES, and kvm_arch_gmem_invalidate() for SEV-SNP. I.e. it's safe to free the cpumask prior to unregistering encrypted regions and freeing the ASID.

Opportunistically clean up sev_vm_destroy()'s comment regarding what is (implicitly, what isn't) skipped for mirror VMs.

Cc: Srikanth Aithal <sraithal@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Zheyun Shen <szy0127@sjtu.edu.cn>
Link: https://lore.kernel.org/r/20250522233733.3176144-9-seanjc@google.com
Link: https://lore.kernel.org/all/935a82e3-f7ad-47d7-aaaf-f3d2b62ed768@amd.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
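Modeled in miniature (an illustrative stand-in for the kernel's cpumask API, not KVM code): record each pCPU a vCPU runs on, then flush only those CPUs at reclaim time:

  #include <stdint.h>
  #include <stdio.h>

  #define NR_CPUS 128

  struct sev_model {
      uint64_t have_run_cpus[NR_CPUS / 64];  /* cpumask stand-in */
  };

  static void sev_record_cpu(struct sev_model *sev, int cpu)
  {
      sev->have_run_cpus[cpu / 64] |= 1ULL << (cpu % 64);
  }

  static void sev_flush_guest_caches(struct sev_model *sev)
  {
      for (int cpu = 0; cpu < NR_CPUS; cpu++) {
          if (sev->have_run_cpus[cpu / 64] & (1ULL << (cpu % 64)))
              printf("WBNOINVD on CPU %d\n", cpu);
      }
  }

  int main(void)
  {
      struct sev_model sev = { 0 };

      sev_record_cpu(&sev, 3);        /* vCPU ran on pCPU 3 ... */
      sev_record_cpu(&sev, 7);        /* ... and later on pCPU 7 */
      sev_flush_guest_caches(&sev);   /* 2 flushes instead of 128 */
      return 0;
  }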
2025-06-24KVM: x86: Deduplicate MSR interception enabling and disablingChao Gao
Extract a common function from MSR interception disabling logic and create disabling and enabling functions based on it. This removes most of the duplicated code for MSR interception disabling/enabling. No functional change intended. Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250612081947.94081-2-chao.gao@intel.com [sean: s/enable/set, inline the wrappers] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: x86: Dedup AVIC vs. PI code for identifying target vCPUSean Christopherson
Hoist the logic for identifying the target vCPU for a posted interrupt into common x86. The code is functionally identical between Intel and AMD. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-30-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Add enable_ipiv param, never set IsRunning if disabledMaxim Levitsky
Let userspace "disable" IPI virtualization for AVIC via the enable_ipiv module param, by never setting IsRunning. SVM doesn't provide a way to disable IPI virtualization in hardware, but by ensuring CPUs never see IsRunning=1, every IPI in the guest (except for self-IPIs) will generate a VM-Exit. To avoid setting the real IsRunning bit, while still allowing KVM to use each vCPU's entry to update GA log entries, simply maintain a shadow of the entry, without propagating IsRunning updates to the real table when IPI virtualization is disabled. Providing a way to effectively disable IPI virtualization will allow KVM to safely enable AVIC on hardware that is susceptible to erratum #1235, which causes hardware to sometimes fail to detect that the IsRunning bit has been cleared by software. Note, the table _must_ be fully populated, as broadcast IPIs skip invalid entries, i.e. won't generate VM-Exit if every entry is invalid, and so simply pointing the VMCB at a common dummy table won't work. Alternatively, KVM could allocate a shadow of the entire table, but that'd be a waste of 4KiB since the per-vCPU entry doesn't actually consume an additional 8 bytes of memory (vCPU structures are large enough that they are backed by order-N pages). Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> [sean: keep "entry" variables, reuse enable_ipiv, split from erratum] Link: https://lore.kernel.org/r/20250611224604.313496-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
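A sketch of the shadow-entry approach (simplified types; the IsRunning bit position is believed to match the AVIC Physical ID entry format, everything else is a stand-in): the shadow always gets the update so GA log bookkeeping stays correct, but IsRunning only reaches the hardware-visible table when IPI virtualization is enabled:

  #include <stdbool.h>
  #include <stdint.h>

  #define AVIC_IS_RUNNING (1ULL << 62)

  static bool enable_ipiv;    /* module param stand-in */

  struct avic_vcpu_model {
      uint64_t shadow_entry;  /* KVM's private copy of the entry */
      uint64_t *real_entry;   /* entry in the hw-visible table */
  };

  static void avic_set_is_running(struct avic_vcpu_model *v, bool run)
  {
      if (run)
          v->shadow_entry |= AVIC_IS_RUNNING;
      else
          v->shadow_entry &= ~AVIC_IS_RUNNING;

      /* With IPI virtualization disabled, hardware must never see
       * IsRunning=1, so every guest IPI (except self-IPIs) takes a
       * VM-Exit. */
      if (enable_ipiv)
          *v->real_entry = v->shadow_entry;
      else
          *v->real_entry = v->shadow_entry & ~AVIC_IS_RUNNING;
  }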
2025-06-20KVM: SVM: Drop superfluous "cache" of AVIC Physical ID entry pointerSean Christopherson
Drop the vCPU's pointer to its AVIC Physical ID entry, and simply index the table directly. Caching a pointer address is completely unnecessary for performance, and while the field technically caches the result of the pointer calculation, it's all too easy to misinterpret the name and think that the field somehow caches the _data_ in the table. No functional change intended. Suggested-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Track AVIC tables as natively sized pointers, not "struct pages"Sean Christopherson
Allocate and track AVIC's logical and physical tables as u32 and u64 pointers respectively, as managing the pages as "struct page" pointers adds an almost absurd amount of boilerplate and complexity. E.g. with page_address() out of the way, svm->avic_physical_id_cache becomes completely superfluous, and will be removed in a future cleanup. No functional change intended. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Acked-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creationSean Christopherson
Inhibit AVIC with a new "ID too big" flag if userspace creates a vCPU with an ID that is too big, but otherwise allow vCPU creation to succeed. Rejecting KVM_CREATE_VCPU with EINVAL violates KVM's ABI as KVM advertises that the max vCPU ID is 4095, but disallows creating vCPUs with IDs bigger than 254 (AVIC) or 511 (x2AVIC). Alternatively, KVM could advertise an accurate value depending on which AVIC mode is in use, but that wouldn't really solve the underlying problem, e.g. would be a breaking change if KVM were to ever try and enable AVIC or x2AVIC by default. Cc: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Drop vcpu_svm's pointless avic_backing_page fieldSean Christopherson
Drop vcpu_svm's avic_backing_page pointer and instead grab the physical address of KVM's vAPIC page directly from the source. Getting a physical address from a kernel virtual address is not an expensive operation, and getting the physical address from a struct page is *more* expensive for CONFIG_SPARSEMEM=y kernels. Regardless, none of the paths that consume the address are hot paths, i.e. shaving cycles is not a priority. Eliminating the "cache" means KVM doesn't have to worry about the cache being invalid, which will simplify a future fix when dealing with vCPU IDs that are too big. WARN if KVM attempts to allocate a vCPU's AVIC backing page without an in-kernel local APIC. avic_init_vcpu() bails early if the APIC is not in-kernel, and KVM disallows enabling an in-kernel APIC after vCPUs have been created, i.e. it should be impossible to reach avic_init_backing_page() without the vAPIC being allocated. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structureSean Christopherson
Track the IRTEs that are posting to an SVM vCPU via the associated irqfd structure and GSI routing instead of dynamically allocating a separate data structure. In addition to eliminating an atomic allocation, this will allow hoisting much of the IRTE update logic to common x86. Cc: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: Pass new routing entries and irqfd when updating IRTEsSean Christopherson
When updating IRTEs in response to a GSI routing or IRQ bypass change, pass the new/current routing information along with the associated irqfd. This will allow KVM x86 to harden, simplify, and deduplicate its code. Since adding/removing a bypass producer is now conveniently protected with irqfds.lock, i.e. can't run concurrently with kvm_irq_routing_update(), use the routing information cached in the irqfd instead of looking up the information in the current GSI routing tables. Opportunistically convert an existing printk() to pr_info() and put its string onto a single line (old code that strictly adhered to 80 chars). Link: https://lore.kernel.org/r/20250611224604.313496-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Add a helper to allocate and initialize permissions bitmapsSean Christopherson
Add a helper to allocate and initialize an MSR or I/O permissions map, as the logic is identical between the two map types, the only difference is the size of the bitmap. Opportunistically add a comment to explain why the bitmaps are initialized with 0xff, e.g. instead of the more common zero-initialized behavior, which is the main motivation for deduplicating the code. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-31-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
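The key detail is the 0xff fill; a minimal sketch (helper name hypothetical): in SVM's permission maps a 1 bit means "intercept", so the safe default is all-ones until KVM explicitly opens holes:

  #include <stdlib.h>
  #include <string.h>

  /* Allocate an MSR or I/O permissions map of @size bytes. Unlike
   * most bitmap allocations, initialize to all-ones: a set bit tells
   * hardware to intercept, so the safe default is intercept-all. */
  static void *svm_alloc_permissions_map(size_t size)
  {
      void *map = malloc(size);

      if (map)
          memset(map, 0xff, size);
      return map;
  }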
2025-06-20KVM: SVM: Return -EINVAL instead of MSR_INVALID to signal out-of-range MSRSean Christopherson
Return -EINVAL instead of MSR_INVALID from svm_msrpm_bit_nr() to indicate that the MSR isn't covered by one of the (currently) three MSRPM ranges, and delete the MSR_INVALID macro now that all users are gone. Link: https://lore.kernel.org/r/20250610225737.156318-29-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Store MSRPM pointer as "void *" instead of "u32 *"Sean Christopherson
Store KVM's MSRPM pointers as "void *" instead of "u32 *" to guard against directly accessing the bitmaps outside of code that is explicitly written to access the bitmaps with a specific type. Opportunistically use svm_vcpu_free_msrpm() in svm_vcpu_free() instead of open coding an equivalent. Link: https://lore.kernel.org/r/20250610225737.156318-27-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Move svm_msrpm_offset() to nested.cSean Christopherson
Move svm_msrpm_offset() from svm.c to nested.c now that all usage of the u32-index offsets is nested virtualization specific. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-26-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Manually recalc all MSR intercepts on userspace MSR filter changeSean Christopherson
On a userspace MSR filter change, recalculate all MSR intercepts using the filter-agnostic logic instead of maintaining a "shadow copy" of KVM's desired intercepts. The shadow bitmaps add yet another point of failure, are confusing (e.g. what does "handled specially" mean!?!?), an eyesore, and a maintenance burden. Given that KVM *must* be able to recalculate the correct intercepts at any given time, and that MSR filter updates are not hot paths, there is zero benefit to maintaining the shadow bitmaps. Opportunistically switch from boot_cpu_has() to cpu_feature_enabled() as appropriate. Link: https://lore.kernel.org/all/aCdPbZiYmtni4Bjs@google.com Link: https://lore.kernel.org/all/20241126180253.GAZ0YNTdXH1UGeqsu6@fat_crate.local Cc: Francesco Lavra <francescolavra.fl@gmail.com> Link: https://lore.kernel.org/r/20250610225737.156318-20-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Implement and adopt VMX style MSR intercepts APIsSean Christopherson
Add and use SVM MSR interception APIs (in most paths) to match VMX's APIs and nomenclature. Specifically, add SVM variants of: vmx_disable_intercept_for_msr(vcpu, msr, type) vmx_enable_intercept_for_msr(vcpu, msr, type) vmx_set_intercept_for_msr(vcpu, msr, type, intercept) to eventually replace SVM's single helper: set_msr_interception(vcpu, msrpm, msr, allow_read, allow_write) which is awkward to use (in all cases, KVM either applies the same logic for both reads and writes, or intercepts one of read or write), and is unintuitive due to using '0' to indicate interception should be *set*. Keep the guts of the old API for the moment to avoid churning the MSR filter code, as that mess will be overhauled in the near future. Leave behind a temporary comment to call out that the shadow bitmaps have inverted polarity relative to the bitmaps consumed by hardware. No functional change intended. Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Add helpers for accessing MSR bitmap that don't rely on offsetsSean Christopherson
Add macro-built helpers for testing, setting, and clearing MSRPM entries without relying on precomputed offsets. This sets the stage for eventually removing general KVM use of precomputed offsets, which are quite confusing and rather inefficient for the vast majority of KVM's usage. Outside of merging L0 and L1 bitmaps for nested SVM, using u32-indexed offsets and accesses is at best unnecessary, and at worst introduces extra operations to retrieve the individual bit from within the offset u32 value. And simply calling them "offsets" is very confusing, as the "unit" of the offset isn't immediately obvious. Use the new helpers in set_msr_interception_bitmap() and msr_write_intercepted() to verify the math and operations, but keep the existing offset-based logic in set_msr_interception_bitmap() to sanity check the "clear" and "set" operations. Manipulating MSR interceptions isn't a hot path and no kernel release is ever expected to contain this specific version of set_msr_interception_bitmap() (it will be removed entirely in the near future). Link: https://lore.kernel.org/r/20250610225737.156318-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
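For reference, the underlying MSRPM layout per the APM, sketched standalone (the macro names are invented for the sketch): 2 bits per MSR, read bit then write bit, across three 2KiB chunks covering three architectural MSR ranges, so the bit number is derivable directly from the MSR index:

  #include <stdint.h>

  #define SVM_MSRS_PER_RANGE   0x2000u
  #define SVM_BITS_PER_MSR     2u    /* read bit, then write bit */
  #define SVM_MSRPM_RANGE_BITS (SVM_MSRS_PER_RANGE * SVM_BITS_PER_MSR)

  /* Returns the bit number of the read-intercept bit for @msr, or -1
   * if the MSR isn't covered by the MSRPM; the write bit is +1. */
  static int svm_msrpm_bit_nr(uint32_t msr)
  {
      unsigned int range;

      if (msr < 0x2000)
          range = 0;    /* 0x00000000 - 0x00001fff */
      else if (msr - 0xc0000000u < 0x2000)
          range = 1;    /* 0xc0000000 - 0xc0001fff */
      else if (msr - 0xc0010000u < 0x2000)
          range = 2;    /* 0xc0010000 - 0xc0011fff */
      else
          return -1;

      return range * SVM_MSRPM_RANGE_BITS +
             (msr & (SVM_MSRS_PER_RANGE - 1)) * SVM_BITS_PER_MSR;
  }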
2025-06-20KVM: nSVM: Don't initialize vmcb02 MSRPM with vmcb01's "always passthrough"Sean Christopherson
Don't initialize vmcb02's MSRPM with KVM's set of "always passthrough" MSRs, as KVM always needs to consult L1's intercepts, i.e. needs to merge vmcb01 with vmcb12 and write the result to vmcb02. This will eventually allow for the removal of svm_vcpu_init_msrpm(). Note, the bitmaps are truly initialized by svm_vcpu_alloc_msrpm() (default to intercepting all MSRs), e.g. if there is a bug lurking elsewhere, the worst case scenario from dropping the call to svm_vcpu_init_msrpm() should be that KVM would fail to passthrough MSRs to L2. Link: https://lore.kernel.org/r/20250610225737.156318-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: nSVM: Use dedicated array of MSRPM offsets to merge L0 and L1 bitmapsSean Christopherson
Use a dedicated array of MSRPM offsets to merge L0 and L1 bitmaps, i.e. to merge KVM's vmcb01 bitmap with L1's vmcb12 bitmap. This will eventually allow for the removal of direct_access_msrs, as the only path where tracking the offsets is truly justified is the merge for nested SVM, where merging in chunks is an easy way to batch uaccess reads/writes. Opportunistically omit the x2APIC MSRs from the merge-specific array instead of filtering them out at runtime. Note, disabling interception of DEBUGCTL, XSS, EFER, PAT, GHCB, and TSC_AUX is mutually exclusive with nested virtualization, as KVM passes through those MSRs only for SEV-ES guests, and KVM doesn't support nested virtualization for SEV+ guests. Defer removing those MSRs to a future cleanup in order to make this refactoring as benign as possible. Link: https://lore.kernel.org/r/20250610225737.156318-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Clean up macros related to architectural MSRPM definitionsSean Christopherson
Move SVM's MSR Permissions Map macros to svm.h in anticipation of adding helpers that are available to SVM code, and opportunistically replace a variety of open-coded literals with (hopefully) informative macros. Opportunistically open code ARRAY_SIZE(msrpm_ranges) instead of wrapping it as NUM_MSR_MAPS, which is an ambiguous name even if it were qualified with "SVM_MSRPM". Deliberately leave the ranges as open coded literals, as using macros to define the ranges actually introduces more potential failure points, since both the definitions and the usage have to be careful to use the correct index. The lack of clear intent behind the ranges will be addressed in future patches. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Use ARRAY_SIZE() to iterate over direct_access_msrsSean Christopherson
Drop the unnecessary and dangerous value-terminated behavior of direct_access_msrs, and simply iterate over the actual size of the array. The use in svm_set_x2apic_msr_interception() is especially sketchy, as it relies on unused capacity being zero-initialized, and '0' being outside the range of x2APIC MSRs.

To ensure the array and shadow_msr_intercept stay synchronized, simply assert that their sizes are identical (note the six 64-bit-only MSRs).

Note, direct_access_msrs will soon be removed entirely; keeping the assert synchronized with the array isn't expected to be a long-term maintenance burden.

Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
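The general pattern (a generic C sketch, not the actual KVM arrays): size the loop with ARRAY_SIZE() and statically assert that parallel arrays stay in sync, instead of relying on a zero sentinel and zero-initialized spare capacity:

  #include <assert.h>

  #define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

  static const unsigned int direct_msrs[] = { 0x174, 0x175, 0x176 };
  static unsigned char shadow_flags[3];

  /* Catch the arrays drifting apart at build time. */
  static_assert(ARRAY_SIZE(direct_msrs) == ARRAY_SIZE(shadow_flags),
                "direct_msrs and shadow_flags out of sync");

  static void recalc_all(void)
  {
      /* Iterate the real size; no sentinel, no zero-init games. */
      for (unsigned int i = 0; i < ARRAY_SIZE(direct_msrs); i++)
          shadow_flags[i] = 1;
  }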
2025-05-27Merge tag 'kvm-x86-svm-6.16' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM SVM changes for 6.16:

 - Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to fix a race between AP destroy and VMRUN.

 - Decrypt and dump the VMSA in dump_vmcb() if debugging is enabled for the VM.

 - Add support for ALLOWED_SEV_FEATURES.

 - Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.

 - Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features that utilize those bits.

 - Don't account temporary allocations in sev_send_update_data().

 - Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
2025-05-27Merge tag 'kvm-x86-misc-6.16' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 misc changes for 6.16:

 - Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between SVM and VMX.

 - Advertise support to userspace for WRMSRNS and PREFETCHI.

 - Rescan I/O APIC routes after handling EOI that needed to be intercepted due to the old/previous routing, but not the new/current routing.

 - Add a module param to control and enumerate support for device posted interrupts.

 - Misc cleanups.
2025-05-19KVM: SVM: Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUsManali Shukla
Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUs with Bus Lock Threshold, which is close enough to VMX's Bus Lock Detection VM-Exit to allow reusing KVM_CAP_X86_BUS_LOCK_EXIT.

The biggest difference between the two features is that Threshold is fault-like, whereas Detection is trap-like. To allow the guest to make forward progress, Threshold provides a per-VMCB counter which is decremented every time a bus lock occurs, and a VM-Exit is triggered if and only if the counter is '0'.

To provide Detection-like semantics, initialize the counter to '0', i.e. exit on every bus lock, and when re-executing the guilty instruction, set the counter to '1' to effectively step past the instruction.

Note, in the unlikely scenario that re-executing the instruction doesn't trigger a bus lock, e.g. because the guest has changed memory types or patched the guilty instruction, the bus lock counter will be left at '1', i.e. the guest will be able to do a bus lock on a different instruction. In a perfect world, KVM would ensure the counter is '0' if the guest has made forward progress, e.g. if RIP has changed. But trying to close that hole would incur non-trivial complexity, for marginal benefit; the intent of KVM_CAP_X86_BUS_LOCK_EXIT is to allow userspace to rate-limit bus locks, not to allow for precise detection of problematic guest code. And, it's simply not feasible to fully close the hole, e.g. if an interrupt arrives before the original instruction can re-execute, the guest could step past a different bus lock.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Manali Shukla <manali.shukla@amd.com>
Link: https://lore.kernel.org/r/20250502050346.14274-5-manali.shukla@amd.com
[sean: fix typo in comment]
Signed-off-by: Sean Christopherson <seanjc@google.com>
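A sketch of the counter dance (field name hypothetical, modeling the per-VMCB threshold counter described above): keep it at zero so every bus lock exits, and hand the guest a single "free" bus lock when re-executing the guilty instruction:

  #include <stdint.h>

  struct vmcb_model {
      /* Decremented on each bus lock; VM-Exit iff it's already 0. */
      uint16_t bus_lock_counter;
  };

  static void bus_lock_exit_init(struct vmcb_model *vmcb)
  {
      /* Detection-like semantics: exit on every bus lock. */
      vmcb->bus_lock_counter = 0;
  }

  static void bus_lock_exit_complete(struct vmcb_model *vmcb)
  {
      /* Let the guilty instruction re-execute: its bus lock
       * decrements 1 -> 0 without exiting, and the next bus lock
       * (counter already 0) exits again. */
      vmcb->bus_lock_counter = 1;
  }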
2025-05-08KVM: SVM: Set/clear SRSO's BP_SPEC_REDUCE on 0 <=> 1 VM count transitionsSean Christopherson
Set the magic BP_SPEC_REDUCE bit to mitigate SRSO when running VMs if and only if KVM has at least one active VM. Leaving the bit set at all times unfortunately degrades performance by a wee bit more than expected. Use a dedicated spinlock and counter instead of hooking virtualization enablement, as changing the behavior of kvm.enable_virt_at_load based on SRSO_BP_SPEC_REDUCE is painful, and has its own drawbacks, e.g. could result in performance issues for flows that are sensitive to VM creation latency. Defer setting BP_SPEC_REDUCE until VMRUN is imminent to avoid impacting performance on CPUs that aren't running VMs, e.g. if a setup is using housekeeping CPUs. Setting BP_SPEC_REDUCE in task context, i.e. without blasting IPIs to all CPUs, also helps avoid serializing 1<=>N transitions without incurring a gross amount of complexity (see the Link for details on how ugly coordinating via IPIs gets). Link: https://lore.kernel.org/all/aBOnzNCngyS_pQIW@google.com Fixes: 8442df2b49ed ("x86/bugs: KVM: Add support for SRSO_MSR_FIX") Reported-by: Michael Larabel <Michael@michaellarabel.com> Closes: https://www.phoronix.com/review/linux-615-amd-regression Cc: Borislav Petkov <bp@alien8.de> Tested-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250505180300.973137-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
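The 0 <=> 1 transition logic amounts to a refcount under a dedicated lock, sketched here in miniature (names hypothetical; the lock and MSR write are reduced to stubs, and the real code additionally defers the actual MSR write until VMRUN is imminent):

  #include <stdbool.h>

  static unsigned int nr_srso_vms;    /* protected by srso_lock */

  static void srso_lock(void)   { /* spin_lock(&srso_lock) stub */ }
  static void srso_unlock(void) { /* spin_unlock(&srso_lock) stub */ }

  static void set_bp_spec_reduce(bool on)
  {
      (void)on;    /* stub for toggling the BP_SPEC_REDUCE MSR bit */
  }

  static void svm_srso_vm_init(void)
  {
      srso_lock();
      /* Only the 0 -> 1 transition sets the mitigation bit. */
      if (nr_srso_vms++ == 0)
          set_bp_spec_reduce(true);
      srso_unlock();
  }

  static void svm_srso_vm_destroy(void)
  {
      srso_lock();
      /* Only the 1 -> 0 transition clears it. */
      if (--nr_srso_vms == 0)
          set_bp_spec_reduce(false);
      srso_unlock();
  }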