The counters_read_on_cpu() function warns when called with IRQs
disabled to prevent deadlock in smp_call_function_single(). However,
this warning is spurious when reading counters on the current CPU,
since no IPI is needed for same CPU reads.
Commit 12eb8f4fff24 ("cpufreq: CPPC: Update FIE arch_freq_scale in
ticks for non-PCC regs") changed the CPPC Frequency Invariance Engine
to read AMU counters directly from the scheduler tick for non-PCC
register spaces (like FFH), instead of deferring to a kthread. This
means counters_read_on_cpu() is now called with IRQs disabled from the
tick handler, triggering the warning.
Fix this by restructuring the logic: when IRQs are disabled (tick
context), call the function directly for same-CPU reads. Otherwise
use smp_call_function_single().
Fixes: 997c021abc ("cpufreq: CPPC: Update FIE arch_freq_scale in ticks for non-PCC regs")
Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Sumit Gupta <sumitg@nvidia.com>
Signed-off-by: Will Deacon <will@kernel.org>
The ARM64_WORKAROUND_REPEAT_TLBI workaround is used to mitigate several
errata where broadcast TLBI;DSB sequences don't provide all the
architecturally required synchronization. The workaround performs more
work than necessary, and can have significant overhead. This patch
optimizes the workaround, as explained below.
The workaround was originally added for Qualcomm Falkor erratum 1009 in
commit:
d9ff80f83e ("arm64: Work around Falkor erratum 1009")
As noted in the message for that commit, the workaround is applied even
in cases where it is not strictly necessary.
The workaround was later reused without changes for:
* Arm Cortex-A76 erratum #1286807
SDEN v33: https://developer.arm.com/documentation/SDEN-885749/33-0/
* Arm Cortex-A55 erratum #2441007
SDEN v16: https://developer.arm.com/documentation/SDEN-859338/1600/
* Arm Cortex-A510 erratum #2441009
SDEN v19: https://developer.arm.com/documentation/SDEN-1873351/1900/
The important details to note are as follows:
1. All relevant errata only affect the ordering and/or completion of
memory accesses which have been translated by an invalidated TLB
entry. The actual invalidation of TLB entries is unaffected.
2. The existing workaround is applied to both broadcast and local TLB
invalidation, whereas for all relevant errata it is only necessary to
apply a workaround for broadcast invalidation.
3. The existing workaround replaces every TLBI with a TLBI;DSB;TLBI
sequence, whereas for all relevant errata it is only necessary to
execute a single additional TLBI;DSB sequence after any number of
TLBIs are completed by a DSB.
For example, for a sequence of batched TLBIs:
TLBI <op1>[, <arg1>]
TLBI <op2>[, <arg2>]
TLBI <op3>[, <arg3>]
DSB ISH
... the existing workaround will expand this to:
TLBI <op1>[, <arg1>]
DSB ISH // additional
TLBI <op1>[, <arg1>] // additional
TLBI <op2>[, <arg2>]
DSB ISH // additional
TLBI <op2>[, <arg2>] // additional
TLBI <op3>[, <arg3>]
DSB ISH // additional
TLBI <op3>[, <arg3>] // additional
DSB ISH
... whereas it is sufficient to have:
TLBI <op1>[, <arg1>]
TLBI <op2>[, <arg2>]
TLBI <op3>[, <arg3>]
DSB ISH
TLBI <opX>[, <argX>] // additional
DSB ISH // additional
Using a single additional TBLI and DSB at the end of the sequence can
have significantly lower overhead as each DSB which completes a TLBI
must synchronize with other PEs in the system, with potential
performance effects both locally and system-wide.
4. The existing workaround repeats each specific TLBI operation, whereas
for all relevant errata it is sufficient for the additional TLBI to
use *any* operation which will be broadcast, regardless of which
translation regime or stage of translation the operation applies to.
For example, for a single TLBI:
TLBI ALLE2IS
DSB ISH
... the existing workaround will expand this to:
TLBI ALLE2IS
DSB ISH
TLBI ALLE2IS // additional
DSB ISH // additional
... whereas it is sufficient to have:
TLBI ALLE2IS
DSB ISH
TLBI VALE1IS, XZR // additional
DSB ISH // additional
As the additional TLBI doesn't have to match a specific earlier TLBI,
the additional TLBI can be implemented in separate code, with no
memory of the earlier TLBIs. The additional TLBI can also use a
cheaper TLBI operation.
5. The existing workaround is applied to both Stage-1 and Stage-2 TLB
invalidation, whereas for all relevant errata it is only necessary to
apply a workaround for Stage-1 invalidation.
Architecturally, TLBI operations which invalidate only Stage-2
information (e.g. IPAS2E1IS) are not required to invalidate TLB
entries which combine information from Stage-1 and Stage-2
translation table entries, and consequently may not complete memory
accesses translated by those combined entries. In these cases,
completion of memory accesses is only guaranteed after subsequent
invalidation of Stage-1 information (e.g. VMALLE1IS).
Taking the above points into account, this patch reworks the workaround
logic to reduce overhead:
* New __tlbi_sync_s1ish() and __tlbi_sync_s1ish_hyp() functions are
added and used in place of any dsb(ish) which is used to complete
broadcast Stage-1 TLB maintenance. When the
ARM64_WORKAROUND_REPEAT_TLBI workaround is enabled, these helpers will
execute an additional TLBI;DSB sequence.
For consistency, it might make sense to add __tlbi_sync_*() helpers
for local and stage 2 maintenance. For now I've left those with
open-coded dsb() to keep the diff small.
* The duplication of TLBIs in __TLBI_0() and __TLBI_1() is removed. This
is no longer needed as the necessary synchronization will happen in
__tlbi_sync_s1ish() or __tlbi_sync_s1ish_hyp().
* The additional TLBI operation is chosen to have minimal impact:
- __tlbi_sync_s1ish() uses "TLBI VALE1IS, XZR". This is only used at
EL1 or at EL2 with {E2H,TGE}=={1,1}, where it will target an unused
entry for the reserved ASID in the kernel's own translation regime,
and have no adverse affect.
- __tlbi_sync_s1ish_hyp() uses "TLBI VALE2IS, XZR". This is only used
in hyp code, where it will target an unused entry in the hyp code's
TTBR0 mapping, and should have no adverse effect.
* As __TLBI_0() and __TLBI_1() no longer replace each TLBI with a
TLBI;DSB;TLBI sequence, batching TLBIs is worthwhile, and there's no
need for arch_tlbbatch_should_defer() to consider
ARM64_WORKAROUND_REPEAT_TLBI.
When building defconfig with GCC 15.1.0, compared to v6.19-rc1, this
patch saves ~1KiB of text, makes the vmlinux ~42KiB smaller, and makes
the resulting Image 64KiB smaller:
| [mark@lakrids:~/src/linux]% size vmlinux-*
| text data bss dec hex filename
| 21179831 19660919 708216 41548966 279fca6 vmlinux-after
| 21181075 19660903 708216 41550194 27a0172 vmlinux-before
| [mark@lakrids:~/src/linux]% ls -l vmlinux-*
| -rwxr-xr-x 1 mark mark 157771472 Feb 4 12:05 vmlinux-after
| -rwxr-xr-x 1 mark mark 157815432 Feb 4 12:05 vmlinux-before
| [mark@lakrids:~/src/linux]% ls -l Image-*
| -rw-r--r-- 1 mark mark 41007616 Feb 4 12:05 Image-after
| -rw-r--r-- 1 mark mark 41073152 Feb 4 12:05 Image-before
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oupton@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
Rename our ioremap_prot() implementation to __ioremap_prot() and convert
all arch-internal callers over to the new function.
ioremap_prot() remains as a #define to __ioremap_prot() for
generic_access_phys() and will be subsequently extended to handle user
permissions in 'prot'.
Cc: Zeng Heng <zengheng4@huawei.com>
Cc: Jinjiang Tu <tujinjiang@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
This converts some of the visually simpler cases that have been split
over multiple lines. I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.
Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script. I probably had made it a bit _too_ trivial.
So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.
The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.
As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
Pull KVM updates from Paolo Bonzini:
"Loongarch:
- Add more CPUCFG mask bits
- Improve feature detection
- Add lazy load support for FPU and binary translation (LBT) register
state
- Fix return value for memory reads from and writes to in-kernel
devices
- Add support for detecting preemption from within a guest
- Add KVM steal time test case to tools/selftests
ARM:
- Add support for FEAT_IDST, allowing ID registers that are not
implemented to be reported as a normal trap rather than as an UNDEF
exception
- Add sanitisation of the VTCR_EL2 register, fixing a number of
UXN/PXN/XN bugs in the process
- Full handling of RESx bits, instead of only RES0, and resulting in
SCTLR_EL2 being added to the list of sanitised registers
- More pKVM fixes for features that are not supposed to be exposed to
guests
- Make sure that MTE being disabled on the pKVM host doesn't give it
the ability to attack the hypervisor
- Allow pKVM's host stage-2 mappings to use the Force Write Back
version of the memory attributes by using the "pass-through'
encoding
- Fix trapping of ICC_DIR_EL1 on GICv5 hosts emulating GICv3 for the
guest
- Preliminary work for guest GICv5 support
- A bunch of debugfs fixes, removing pointless custom iterators
stored in guest data structures
- A small set of FPSIMD cleanups
- Selftest fixes addressing the incorrect alignment of page
allocation
- Other assorted low-impact fixes and spelling fixes
RISC-V:
- Fixes for issues discoverd by KVM API fuzzing in
kvm_riscv_aia_imsic_has_attr(), kvm_riscv_aia_imsic_rw_attr(), and
kvm_riscv_vcpu_aia_imsic_update()
- Allow Zalasr, Zilsd and Zclsd extensions for Guest/VM
- Transparent huge page support for hypervisor page tables
- Adjust the number of available guest irq files based on MMIO
register sizes found in the device tree or the ACPI tables
- Add RISC-V specific paging modes to KVM selftests
- Detect paging mode at runtime for selftests
s390:
- Performance improvement for vSIE (aka nested virtualization)
- Completely new memory management. s390 was a special snowflake that
enlisted help from the architecture's page table management to
build hypervisor page tables, in particular enabling sharing the
last level of page tables. This however was a lot of code (~3K
lines) in order to support KVM, and also blocked several features.
The biggest advantages is that the page size of userspace is
completely independent of the page size used by the guest:
userspace can mix normal pages, THPs and hugetlbfs as it sees fit,
and in fact transparent hugepages were not possible before. It's
also now possible to have nested guests and guests with huge pages
running on the same host
- Maintainership change for s390 vfio-pci
- Small quality of life improvement for protected guests
x86:
- Add support for giving the guest full ownership of PMU hardware
(contexted switched around the fastpath run loop) and allowing
direct access to data MSRs and PMCs (restricted by the vPMU model).
KVM still intercepts access to control registers, e.g. to enforce
event filtering and to prevent the guest from profiling sensitive
host state. This is more accurate, since it has no risk of
contention and thus dropped events, and also has significantly less
overhead.
For more information, see the commit message for merge commit
bf2c3138ae ("Merge tag 'kvm-x86-pmu-6.20' ...")
- Disallow changing the virtual CPU model if L2 is active, for all
the same reasons KVM disallows change the model after the first
KVM_RUN
- Fix a bug where KVM would incorrectly reject host accesses to PV
MSRs when running with KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled,
even if those were advertised as supported to userspace,
- Fix a bug with protected guest state (SEV-ES/SNP and TDX) VMs,
where KVM would attempt to read CR3 configuring an async #PF entry
- Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in KVM
(for x86 only) to enforce usage of EXPORT_SYMBOL_FOR_KVM_INTERNAL.
Only a few exports that are intended for external usage, and those
are allowed explicitly
- When checking nested events after a vCPU is unblocked, ignore
-EBUSY instead of WARNing. Userspace can sometimes put the vCPU
into what should be an impossible state, and spurious exit to
userspace on -EBUSY does not really do anything to solve the issue
- Also throw in the towel and drop the WARN on INIT/SIPI being
blocked when vCPU is in Wait-For-SIPI, which also resulted in
playing whack-a-mole with syzkaller stuffing architecturally
impossible states into KVM
- Add support for new Intel instructions that don't require anything
beyond enumerating feature flags to userspace
- Grab SRCU when reading PDPTRs in KVM_GET_SREGS2
- Add WARNs to guard against modifying KVM's CPU caps outside of the
intended setup flow, as nested VMX in particular is sensitive to
unexpected changes in KVM's golden configuration
- Add a quirk to allow userspace to opt-in to actually suppress EOI
broadcasts when the suppression feature is enabled by the guest
(currently limited to split IRQCHIP, i.e. userspace I/O APIC).
Sadly, simply fixing KVM to honor Suppress EOI Broadcasts isn't an
option as some userspaces have come to rely on KVM's buggy behavior
(KVM advertises Supress EOI Broadcast irrespective of whether or
not userspace I/O APIC supports Directed EOIs)
- Clean up KVM's handling of marking mapped vCPU pages dirty
- Drop a pile of *ancient* sanity checks hidden behind in KVM's
unused ASSERT() macro, most of which could be trivially triggered
by the guest and/or user, and all of which were useless
- Fold "struct dest_map" into its sole user, "struct rtc_status", to
make it more obvious what the weird parameter is used for, and to
allow fropping these RTC shenanigans if CONFIG_KVM_IOAPIC=n
- Bury all of ioapic.h, i8254.h and related ioctls (including
KVM_CREATE_IRQCHIP) behind CONFIG_KVM_IOAPIC=y
- Add a regression test for recent APICv update fixes
- Handle "hardware APIC ISR", a.k.a. SVI, updates in
kvm_apic_update_apicv() to consolidate the updates, and to
co-locate SVI updates with the updates for KVM's own cache of ISR
information
- Drop a dead function declaration
- Minor cleanups
x86 (Intel):
- Rework KVM's handling of VMCS updates while L2 is active to
temporarily switch to vmcs01 instead of deferring the update until
the next nested VM-Exit.
The deferred updates approach directly contributed to several bugs,
was proving to be a maintenance burden due to the difficulty in
auditing the correctness of deferred updates, and was polluting
"struct nested_vmx" with a growing pile of booleans
- Fix an SGX bug where KVM would incorrectly try to handle EPCM page
faults, and instead always reflect them into the guest. Since KVM
doesn't shadow EPCM entries, EPCM violations cannot be due to KVM
interference and can't be resolved by KVM
- Fix a bug where KVM would register its posted interrupt wakeup
handler even if loading kvm-intel.ko ultimately failed
- Disallow access to vmcb12 fields that aren't fully supported,
mostly to avoid weirdness and complexity for FRED and other
features, where KVM wants enable VMCS shadowing for fields that
conditionally exist
- Print out the "bad" offsets and values if kvm-intel.ko refuses to
load (or refuses to online a CPU) due to a VMCS config mismatch
x86 (AMD):
- Drop a user-triggerable WARN on nested_svm_load_cr3() failure
- Add support for virtualizing ERAPS. Note, correct virtualization of
ERAPS relies on an upcoming, publicly announced change in the APM
to reduce the set of conditions where hardware (i.e. KVM) *must*
flush the RAP
- Ignore nSVM intercepts for instructions that are not supported
according to L1's virtual CPU model
- Add support for expedited writes to the fast MMIO bus, a la VMX's
fastpath for EPT Misconfig
- Don't set GIF when clearing EFER.SVME, as GIF exists independently
of SVM, and allow userspace to restore nested state with GIF=0
- Treat exit_code as an unsigned 64-bit value through all of KVM
- Add support for fetching SNP certificates from userspace
- Fix a bug where KVM would use vmcb02 instead of vmcb01 when
emulating VMLOAD or VMSAVE on behalf of L2
- Misc fixes and cleanups
x86 selftests:
- Add a regression test for TPR<=>CR8 synchronization and IRQ masking
- Overhaul selftest's MMU infrastructure to genericize stage-2 MMU
support, and extend x86's infrastructure to support EPT and NPT
(for L2 guests)
- Extend several nested VMX tests to also cover nested SVM
- Add a selftest for nested VMLOAD/VMSAVE
- Rework the nested dirty log test, originally added as a regression
test for PML where KVM logged L2 GPAs instead of L1 GPAs, to
improve test coverage and to hopefully make the test easier to
understand and maintain
guest_memfd:
- Remove kvm_gmem_populate()'s preparation tracking and half-baked
hugepage handling. SEV/SNP was the only user of the tracking and it
can do it via the RMP
- Retroactively document and enforce (for SNP) that
KVM_SEV_SNP_LAUNCH_UPDATE and KVM_TDX_INIT_MEM_REGION require the
source page to be 4KiB aligned, to avoid non-trivial complexity for
something that no known VMM seems to be doing and to avoid an API
special case for in-place conversion, which simply can't support
unaligned sources
- When populating guest_memfd memory, GUP the source page in common
code and pass the refcounted page to the vendor callback, instead
of letting vendor code do the heavy lifting. Doing so avoids a
looming deadlock bug with in-place due an AB-BA conflict betwee
mmap_lock and guest_memfd's filemap invalidate lock
Generic:
- Fix a bug where KVM would ignore the vCPU's selected address space
when creating a vCPU-specific mapping of guest memory. Actually
this bug could not be hit even on x86, the only architecture with
multiple address spaces, but it's a bug nevertheless"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (267 commits)
KVM: s390: Increase permitted SE header size to 1 MiB
MAINTAINERS: Replace backup for s390 vfio-pci
KVM: s390: vsie: Fix race in acquire_gmap_shadow()
KVM: s390: vsie: Fix race in walk_guest_tables()
KVM: s390: Use guest address to mark guest page dirty
irqchip/riscv-imsic: Adjust the number of available guest irq files
RISC-V: KVM: Transparent huge page support
RISC-V: KVM: selftests: Add Zalasr extensions to get-reg-list test
RISC-V: KVM: Allow Zalasr extensions for Guest/VM
KVM: riscv: selftests: Add riscv vm satp modes
KVM: riscv: selftests: add Zilsd and Zclsd extension to get-reg-list test
riscv: KVM: allow Zilsd and Zclsd extensions for Guest/VM
RISC-V: KVM: Skip IMSIC update if vCPU IMSIC state is not initialized
RISC-V: KVM: Fix null pointer dereference in kvm_riscv_aia_imsic_rw_attr()
RISC-V: KVM: Fix null pointer dereference in kvm_riscv_aia_imsic_has_attr()
RISC-V: KVM: Remove unnecessary 'ret' assignment
KVM: s390: Add explicit padding to struct kvm_s390_keyop
KVM: LoongArch: selftests: Add steal time test case
LoongArch: KVM: Add paravirt vcpu_is_preempted() support in guest side
LoongArch: KVM: Add paravirt preempt feature in hypervisor side
...
Pull x86 paravirt updates from Borislav Petkov:
- A nice cleanup to the paravirt code containing a unification of the
paravirt clock interface, taming the include hell by splitting the
pv_ops structure and removing of a bunch of obsolete code (Juergen
Gross)
* tag 'x86_paravirt_for_v7.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/paravirt: Use XOR r32,r32 to clear register in pv_vcpu_is_preempted()
x86/paravirt: Remove trailing semicolons from alternative asm templates
x86/pvlocks: Move paravirt spinlock functions into own header
x86/paravirt: Specify pv_ops array in paravirt macros
x86/paravirt: Allow pv-calls outside paravirt.h
objtool: Allow multiple pv_ops arrays
x86/xen: Drop xen_mmu_ops
x86/xen: Drop xen_cpu_ops
x86/xen: Drop xen_irq_ops
x86/paravirt: Move pv_native_*() prototypes to paravirt.c
x86/paravirt: Introduce new paravirt-base.h header
x86/paravirt: Move paravirt_sched_clock() related code into tsc.c
x86/paravirt: Use common code for paravirt_steal_clock()
riscv/paravirt: Use common code for paravirt_steal_clock()
loongarch/paravirt: Use common code for paravirt_steal_clock()
arm64/paravirt: Use common code for paravirt_steal_clock()
arm/paravirt: Use common code for paravirt_steal_clock()
sched: Move clock related paravirt code to kernel/sched
paravirt: Remove asm/paravirt_api_clock.h
x86/paravirt: Move thunk macros to paravirt_types.h
...
Pull VDSO updates from Thomas Gleixner:
- Provide the missing 64-bit variant of clock_getres()
This allows the extension of CONFIG_COMPAT_32BIT_TIME to the vDSO and
finally the removal of 32-bit time types from the kernel and UAPI.
- Remove the useless and broken getcpu_cache from the VDSO
The intention was to provide a trivial way to retrieve the CPU number
from the VDSO, but as the VDSO data is per process there is no way to
make it work.
- Switch get/put_unaligned() from packed struct to memcpy()
The packed struct violates strict aliasing rules which requires to
pass -fno-strict-aliasing to the compiler. As this are scalar values
__builtin_memcpy() turns them into simple loads and stores
- Use __typeof_unqual__() for __unqual_scalar_typeof()
The get/put_unaligned() changes triggered a new sparse warning when
__beNN types are used with get/put_unaligned() as sparse builds add a
special 'bitwise' attribute to them which prevents sparse to evaluate
the Generic in __unqual_scalar_typeof().
Newer sparse versions support __typeof_unqual__() which avoids the
problem, but requires a recent sparse install. So this adds a sanity
check to sparse builds, which validates that sparse is available and
capable of handling it.
- Force inline __cvdso_clock_getres_common()
Compilers sometimes un-inline agressively, which results in function
call overhead and problems with automatic stack variable
initialization.
Interestingly enough the force inlining results in smaller code than
the un-inlined variant produced by GCC when optimizing for size.
* tag 'timers-vdso-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
vdso/gettimeofday: Force inlining of __cvdso_clock_getres_common()
x86/percpu: Make CONFIG_USE_X86_SEG_SUPPORT work with sparse
compiler: Use __typeof_unqual__() for __unqual_scalar_typeof()
powerpc/vdso: Provide clock_getres_time64()
tools headers: Remove unneeded ignoring of warnings in unaligned.h
tools headers: Update the linux/unaligned.h copy with the kernel sources
vdso: Switch get/put_unaligned() from packed struct to memcpy()
parisc: Inline a type punning version of get_unaligned_le32()
vdso: Remove struct getcpu_cache
MIPS: vdso: Provide getres_time64() for 32-bit ABIs
arm64: vdso32: Provide clock_getres_time64()
ARM: VDSO: Provide clock_getres_time64()
ARM: VDSO: Patch out __vdso_clock_getres() if unavailable
x86/vdso: Provide clock_getres_time64() for x86-32
selftests: vDSO: vdso_test_abi: Add test for clock_getres_time64()
selftests: vDSO: vdso_test_abi: Use UAPI system call numbers
selftests: vDSO: vdso_config: Add configurations for clock_getres_time64()
vdso: Add prototype for __vdso_clock_getres_time64()
Pull performance event updates from Ingo Molnar:
"x86 PMU driver updates:
- Add support for the core PMU for Intel Diamond Rapids (DMR) CPUs
(Dapeng Mi)
Compared to previous iterations of the Intel PMU code, there's been
a lot of changes, which center around three main areas:
- Introduce the OFF-MODULE RESPONSE (OMR) facility to replace the
Off-Core Response (OCR) facility
- New PEBS data source encoding layout
- Support the new "RDPMC user disable" feature
- Likewise, a large series adds uncore PMU support for Intel Diamond
Rapids (DMR) CPUs (Zide Chen)
This centers around these four main areas:
- DMR may have two Integrated I/O and Memory Hub (IMH) dies,
separate from the compute tile (CBB) dies. Each CBB and each IMH
die has its own discovery domain.
- Unlike prior CPUs that retrieve the global discovery table
portal exclusively via PCI or MSR, DMR uses PCI for IMH PMON
discovery and MSR for CBB PMON discovery.
- DMR introduces several new PMON types: SCA, HAMVF, D2D_ULA, UBR,
PCIE4, CRS, CPC, ITC, OTC, CMS, and PCIE6.
- IIO free-running counters in DMR are MMIO-based, unlike SPR.
- Also add support for Add missing PMON units for Intel Panther Lake,
and support Nova Lake (NVL), which largely maps to Panther Lake.
(Zide Chen)
- KVM integration: Add support for mediated vPMUs (by Kan Liang and
Sean Christopherson, with fixes and cleanups by Peter Zijlstra,
Sandipan Das and Mingwei Zhang)
- Add Intel cstate driver to support for Wildcat Lake (WCL) CPUs,
which are a low-power variant of Panther Lake (Zide Chen)
- Add core, cstate and MSR PMU support for the Airmont NP Intel CPU
(aka MaxLinear Lightning Mountain), which maps to the existing
Airmont code (Martin Schiller)
Performance enhancements:
- Speed up kexec shutdown by avoiding unnecessary cross CPU calls
(Jan H. Schönherr)
- Fix slow perf_event_task_exit() with LBR callstacks (Namhyung Kim)
User-space stack unwinding support:
- Various cleanups and refactorings in preparation to generalize the
unwinding code for other architectures (Jens Remus)
Uprobes updates:
- Transition from kmap_atomic to kmap_local_page (Keke Ming)
- Fix incorrect lockdep condition in filter_chain() (Breno Leitao)
- Fix XOL allocation failure for 32-bit tasks (Oleg Nesterov)
Misc fixes and cleanups:
- s390: Remove kvm_types.h from Kbuild (Randy Dunlap)
- x86/intel/uncore: Convert comma to semicolon (Chen Ni)
- x86/uncore: Clean up const mismatch (Greg Kroah-Hartman)
- x86/ibs: Fix typo in dc_l2tlb_miss comment (Xiang-Bin Shi)"
* tag 'perf-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
s390: remove kvm_types.h from Kbuild
uprobes: Fix incorrect lockdep condition in filter_chain()
x86/ibs: Fix typo in dc_l2tlb_miss comment
x86/uprobes: Fix XOL allocation failure for 32-bit tasks
perf/x86/intel/uncore: Convert comma to semicolon
perf/x86/intel: Add support for rdpmc user disable feature
perf/x86: Use macros to replace magic numbers in attr_rdpmc
perf/x86/intel: Add core PMU support for Novalake
perf/x86/intel: Add support for PEBS memory auxiliary info field in NVL
perf/x86/intel: Add core PMU support for DMR
perf/x86/intel: Add support for PEBS memory auxiliary info field in DMR
perf/x86/intel: Support the 4 new OMR MSRs introduced in DMR and NVL
perf/core: Fix slow perf_event_task_exit() with LBR callstacks
perf/core: Speed up kexec shutdown by avoiding unnecessary cross CPU calls
uprobes: use kmap_local_page() for temporary page mappings
arm/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
mips/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
arm64/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
riscv/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
perf/x86/intel/uncore: Add Nova Lake support
...
Pull EFI updates from Ard Biesheuvel:
- Quirk the broken EFI framebuffer geometry on the Valve Steam Deck
- Capture the EDID information of the primary display also on non-x86
EFI systems when booting via the EFI stub.
* tag 'efi-next-for-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi: Support EDID information
sysfb: Move edid_info into sysfb_primary_display
sysfb: Pass sysfb_primary_display to devices
sysfb: Replace screen_info with sysfb_primary_display
sysfb: Add struct sysfb_display_info
efi: sysfb_efi: Reduce number of references to global screen_info
efi: earlycon: Reduce number of references to global screen_info
efi: sysfb_efi: Fix efidrmfb and simpledrmfb on Valve Steam Deck
efi: sysfb_efi: Convert swap width and height quirk to a callback
efi: sysfb_efi: Fix lfb_linelength calculation when applying quirks
efi: sysfb_efi: Replace open coded swap with the macro
Pull arm64 updates from Will Deacon:
"There's a little less than normal, probably due to LPC & Christmas/New
Year meaning that a few series weren't quite ready or reviewed in
time. It's still useful across the board, despite the only real
feature being support for the LS64 feature enabling 64-byte atomic
accesses to endpoints that support it.
ACPI:
- Add interrupt signalling support to the AGDI handler
- Add Catalin and myself to the arm64 ACPI MAINTAINERS entry
CPU features:
- Drop Kconfig options for PAN and LSE (these are detected at runtime)
- Add support for 64-byte single-copy atomic instructions (LS64/LS64V)
- Reduce MTE overhead when executing in the kernel on Ampere CPUs
- Ensure POR_EL0 value exposed via ptrace is up-to-date
- Fix error handling on GCS allocation failure
CPU frequency:
- Add CPU hotplug support to the FIE setup in the AMU driver
Entry code:
- Minor optimisations and cleanups to the syscall entry path
- Preparatory rework for moving to the generic syscall entry code
Hardware errata:
- Work around Spectre-BHB on TSV110 processors
- Work around broken CMO propagation on some systems with the SI-L1
interconnect
Miscellaneous:
- Disable branch profiling for arch/arm64/ to avoid issues with
noinstr
- Minor fixes and cleanups (kexec + ubsan, WARN_ONCE() instead of
WARN_ON(), reduction of boolean expression)
- Fix custom __READ_ONCE() implementation for LTO builds when
operating on non-atomic types
Perf and PMUs:
- Support for CMN-600AE
- Be stricter about supported hardware in the CMN driver
- Support for DSU-110 and DSU-120
- Support for the cycles event in the DSU driver (alongside the
dedicated cycles counter)
- Use IRQF_NO_THREAD instead of IRQF_ONESHOT in the cxlpmu driver
- Use !bitmap_empty() as a faster alternative to bitmap_weight()
- Fix SPE error handling when failing to resume profiling
Selftests:
- Add support for the FORCE_TARGETS option to the arm64 kselftests
- Avoid nolibc-specific my_syscall() function
- Add basic test for the LS64 HWCAP
- Extend fp-pidbench to cover additional workload patterns"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (43 commits)
perf/arm-cmn: Reject unsupported hardware configurations
perf: arm_spe: Properly set hw.state on failures
arm64/gcs: Fix error handling in arch_set_shadow_stack_status()
arm64: Fix non-atomic __READ_ONCE() with CONFIG_LTO=y
arm64: poe: fix stale POR_EL0 values for ptrace
kselftest/arm64: Raise default number of loops in fp-pidbench
kselftest/arm64: Add a no-SVE loop after SVE in fp-pidbench
perf/cxlpmu: Replace IRQF_ONESHOT with IRQF_NO_THREAD
arm64: mte: Set TCMA1 whenever MTE is present in the kernel
arm64/ptrace: Return early for ptrace_report_syscall_entry() error
arm64/ptrace: Split report_syscall()
arm64: Remove unused _TIF_WORK_MASK
kselftest/arm64: Add missing file in .gitignore
arm64: errata: Workaround for SI L1 downstream coherency issue
kselftest/arm64: Add HWCAP test for FEAT_LS64
arm64: Add support for FEAT_{LS64, LS64_V}
KVM: arm64: Enable FEAT_{LS64, LS64_V} in the supported guest
arm64: Provide basic EL2 setup for FEAT_{LS64, LS64_V} usage at EL0/1
KVM: arm64: Handle DABT caused by LS64* instructions on unsupported memory
KVM: arm64: Add documentation for KVM_EXIT_ARM_LDST64B
...
Pull kthread updates from Frederic Weisbecker:
"The kthread code provides an infrastructure which manages the
preferred affinity of unbound kthreads (node or custom cpumask)
against housekeeping (CPU isolation) constraints and CPU hotplug
events.
One crucial missing piece is the handling of cpuset: when an isolated
partition is created, deleted, or its CPUs updated, all the unbound
kthreads in the top cpuset become indifferently affine to _all_ the
non-isolated CPUs, possibly breaking their preferred affinity along
the way.
Solve this with performing the kthreads affinity update from cpuset to
the kthreads consolidated relevant code instead so that preferred
affinities are honoured and applied against the updated cpuset
isolated partitions.
The dispatch of the new isolated cpumasks to timers, workqueues and
kthreads is performed by housekeeping, as per the nice Tejun's
suggestion.
As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
from boot defined domain isolation (through isolcpus=) and cpuset
isolated partitions. Housekeeping cpumasks are now modifiable with a
specific RCU based synchronization. A big step toward making
nohz_full= also mutable through cpuset in the future"
* tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (33 commits)
doc: Add housekeeping documentation
kthread: Document kthread_affine_preferred()
kthread: Comment on the purpose and placement of kthread_affine_node() call
kthread: Honour kthreads preferred affinity after cpuset changes
sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN
sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN
kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management
kthread: Include kthreadd to the managed affinity list
kthread: Include unbound kthreads in the managed affinity list
kthread: Refine naming of affinity related fields
PCI: Remove superfluous HK_TYPE_WQ check
sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated()
cpuset: Remove cpuset_cpu_is_isolated()
timers/migration: Remove superfluous cpuset isolation test
cpuset: Propagate cpuset isolation update to timers through housekeeping
cpuset: Propagate cpuset isolation update to workqueue through housekeeping
PCI: Flush PCI probe workqueue on cpuset isolated partition change
sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
sched/isolation: Flush memcg workqueues on cpuset isolated partition change
cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
...
KVM/arm64 updates for 7.0
- Add support for FEAT_IDST, allowing ID registers that are not
implemented to be reported as a normal trap rather than as an UNDEF
exception.
- Add sanitisation of the VTCR_EL2 register, fixing a number of
UXN/PXN/XN bugs in the process.
- Full handling of RESx bits, instead of only RES0, and resulting in
SCTLR_EL2 being added to the list of sanitised registers.
- More pKVM fixes for features that are not supposed to be exposed to
guests.
- Make sure that MTE being disabled on the pKVM host doesn't give it
the ability to attack the hypervisor.
- Allow pKVM's host stage-2 mappings to use the Force Write Back
version of the memory attributes by using the "pass-through'
encoding.
- Fix trapping of ICC_DIR_EL1 on GICv5 hosts emulating GICv3 for the
guest.
- Preliminary work for guest GICv5 support.
- A bunch of debugfs fixes, removing pointless custom iterators stored
in guest data structures.
- A small set of FPSIMD cleanups.
- Selftest fixes addressing the incorrect alignment of page
allocation.
- Other assorted low-impact fixes and spelling fixes.
* kvm-arm64/gicv3-tdir-fixes:
: .
: Address two trapping-related issues when running legacy (i.e. GICv3)
: guests on GICv5 hosts, courtesy of Sascha Bischoff.
: .
KVM: arm64: Correct test for ICH_HCR_EL2_TDIR cap for GICv5 hosts
KVM: arm64: gic: Enable GICv3 CPUIF trapping on GICv5 hosts if required
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/pkvm-no-mte:
: .
: pKVM updates preventing the host from using MTE-related system
: sysrem registers when the feature is disabled from the kernel
: command-line (arm64.nomte), courtesy of Fuad Taba.
:
: From the cover letter:
:
: "If MTE is supported by the hardware (and is enabled at EL3), it remains
: available to lower exception levels by default. Disabling it in the host
: kernel (e.g., via 'arm64.nomte') only stops the kernel from advertising
: the feature; it does not physically disable MTE in the hardware.
:
: The ability to disable MTE in the host kernel is used by some systems,
: such as Android, so that the physical memory otherwise used as tag
: storage can be used for other things (i.e. treated just like the rest of
: memory). In this scenario, a malicious host could still access tags in
: pages donated to a guest using MTE instructions (e.g., STG and LDG),
: bypassing the kernel's configuration."
: .
KVM: arm64: Use kvm_has_mte() in pKVM trap initialization
KVM: arm64: Inject UNDEF when accessing MTE sysregs with MTE disabled
KVM: arm64: Trap MTE access and discovery when MTE is disabled
KVM: arm64: Remove dead code resetting HCR_EL2 for pKVM
Signed-off-by: Marc Zyngier <maz@kernel.org>
When none of the allowed CPUs of a task are online, it gets migrated
to the fallback cpumask which is all the non nohz_full CPUs.
However just like nohz_full CPUs, domain isolated CPUs don't want to be
disturbed by tasks that have lost their CPU affinities.
And since nohz_full rely on domain isolation to work correctly, the
housekeeping mask of domain isolated CPUs should always be a subset of
the housekeeping mask of nohz_full CPUs (there can be CPUs that are
domain isolated but not nohz_full, OTOH there shouldn't be nohz_full
CPUs that are not domain isolated):
HK_TYPE_DOMAIN & HK_TYPE_KERNEL_NOISE == HK_TYPE_DOMAIN
Therefore use HK_TYPE_DOMAIN as the appropriate fallback target for
tasks. Note that cpuset isolated partitions are not supported on those
systems and may result in undefined behaviour.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Tested-by: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
* for-next/misc:
arm64: mm: warn once for ioremap attempts on RAM mappings
arm64: Disable branch profiling for all arm64 code
arm64: kernel: initialize missing kexec_buf->random field
arm64: simplify arch_uprobe_xol_was_trapped return
* for-next/cpufreq:
arm64: topology: Do not warn on missing AMU in cpuhp_topology_online()
arm64: topology: Handle AMU FIE setup on CPU hotplug
cpufreq: Add new helper function returning cpufreq policy
arm64: topology: Skip already covered CPUs when setting freq source
If a process wrote to POR_EL0 and then crashed before a context switch
happened, the coredump would contain an incorrect value for POR_EL0.
The value read in poe_get() would be a stale value left in thread.por_el0. Fix
this by reading the value from the system register, if the target thread is the
current thread.
This matches what gcs/fpsimd do.
Fixes: 1751981992 ("arm64/ptrace: add support for FEAT_POE")
Reported-by: David Spickett <david.spickett@arm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
The original order of checks in the ICH_HCR_EL2_TDIR test returned
with false early in the case where the native GICv3 CPUIF was not
present. The result was that on GICv5 hosts with legacy support -
which do not have the GICv3 CPUIF - the test always returned false.
Reshuffle the checks such that support for GICv5 legacy is checked
prior to checking for the native GICv3 CPUIF.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Fixes: 2a28810cbb ("KVM: arm64: GICv3: Detect and work around the lack of ICV_DIR_EL1 trapping")
Link: https://patch.msgid.link/20251208152724.3637157-4-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
The generic entry abort the syscall_trace_enter() sequence if
ptrace_report_syscall_entry() errors out, but arm64 not.
When ptrace requests interception, it should prevent all subsequent
system-call processing, including audit and seccomp. In preparation for
moving arm64 over to the generic entry code, return early if
ptrace_report_syscall_entry() encounters an error.
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Will Deacon <will@kernel.org>
The generic syscall entry code has the form:
| syscall_trace_enter()
| {
| ptrace_report_syscall_entry()
| }
|
| syscall_exit_work()
| {
| ptrace_report_syscall_exit()
| }
In preparation for moving arm64 over to the generic entry code, split
report_syscall() to two separate enter and exit functions to align
the structure of the arm64 code with syscall_trace_enter() and
syscall_exit_work() from the generic entry code.
No functional changes.
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Will Deacon <will@kernel.org>
Pull arm64 kvm fixes from Paolo Bonzini:
- Ensure early return semantics are preserved for pKVM fault handlers
- Fix case where the kernel runs with the guest's PAN value when
CONFIG_ARM64_PAN is not set
- Make stage-1 walks to set the access flag respect the access
permission of the underlying stage-2, when enabled
- Propagate computed FGT values to the pKVM view of the vCPU at
vcpu_load()
- Correctly program PXN and UXN privilege bits for hVHE's stage-1 page
tables
- Check that the VM is actually using VGICv3 before accessing the GICv3
CPU interface
- Delete some unused code
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: arm64: Invert KVM_PGTABLE_WALK_HANDLE_FAULT to fix pKVM walkers
KVM: arm64: Don't blindly set set PSTATE.PAN on guest exit
KVM: arm64: nv: Respect stage-2 write permssion when setting stage-1 AF
KVM: arm64: Remove unused vcpu_{clear,set}_wfx_traps()
KVM: arm64: Remove unused parameter in synchronize_vcpu_pstate()
KVM: arm64: Remove extra argument for __pvkm_host_{share,unshare}_hyp()
KVM: arm64: Inject UNDEF for a register trap without accessor
KVM: arm64: Copy FGT traps to unprotected pKVM VCPU on VCPU load
KVM: arm64: Fix EL2 S1 XN handling for hVHE setups
KVM: arm64: gic: Check for vGICv3 when clearing TWI
When software issues a Cache Maintenance Operation (CMO) targeting a
dirty cache line, the CPU and DSU cluster may optimize the operation by
combining the CopyBack Write and CMO into a single combined CopyBack
Write plus CMO transaction presented to the interconnect (MCN).
For these combined transactions, the MCN splits the operation into two
separate transactions, one Write and one CMO, and then propagates the
write and optionally the CMO to the downstream memory system or external
Point of Serialization (PoS).
However, the MCN may return an early CompCMO response to the DSU cluster
before the corresponding Write and CMO transactions have completed at
the external PoS or downstream memory. As a result, stale data may be
observed by external observers that are directly connected to the
external PoS or downstream memory.
This erratum affects any system topology in which the following
conditions apply:
- The Point of Serialization (PoS) is located downstream of the
interconnect.
- A downstream observer accesses memory directly, bypassing the
interconnect.
Conditions:
This erratum occurs only when all of the following conditions are met:
1. Software executes a data cache maintenance operation, specifically,
a clean or clean&invalidate by virtual address (DC CVAC or DC
CIVAC), that hits on unique dirty data in the CPU or DSU cache.
This results in a combined CopyBack and CMO being issued to the
interconnect.
2. The interconnect splits the combined transaction into separate Write
and CMO transactions and returns an early completion response to the
CPU or DSU before the write has completed at the downstream memory
or PoS.
3. A downstream observer accesses the affected memory address after the
early completion response is issued but before the actual memory
write has completed. This allows the observer to read stale data
that has not yet been updated at the PoS or downstream memory.
The implementation of workaround put a second loop of CMOs at the same
virtual address whose operation meet erratum conditions to wait until
cache data be cleaned to PoC. This way of implementation mitigates
performance penalty compared to purely duplicate original CMO.
Signed-off-by: Lucas Wei <lucaswei@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
If MTE is not supported by the hardware, or is disabled in the kernel
configuration (`CONFIG_ARM64_MTE=n`) or command line (`arm64.nomte`),
the kernel stops advertising MTE to userspace and avoids using MTE
instructions. However, this is a software-level disable only.
When MTE hardware is present and enabled by EL3 firmware, leaving
`HCR_EL2.ATA` set allows the host to execute MTE instructions (STG, LDG,
etc.) and access allocation tags in physical memory.
Prevent this by clearing `HCR_EL2.ATA` when MTE is disabled. Remove it
from the `HCR_HOST_NVHE_FLAGS` default, and conditionally set it in
`cpu_prepare_hyp_mode()` only when `system_supports_mte()` returns true.
This causes MTE instructions to trap to EL2 when `HCR_EL2.ATA` is
cleared.
Additionally, set `HCR_EL2.TID5` when MTE is disabled. This traps reads
of `GMID_EL1` (Multiple tag transfer ID register) to EL2, preventing the
discovery of MTE parameters (such as tag block size) when the feature is
suppressed.
Early boot code in `head.S` temporarily keeps `HCR_ATA` set to avoid
special-casing initialization paths. This is safe because this code
executes before untrusted code runs and will clear `HCR_ATA` if MTE is
disabled.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260122112218.531948-3-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Merge arm64/for-next/cpufeature in to resolve conflicts resulting from
the removal of CONFIG_PAN.
* arm64/for-next/cpufeature:
arm64: Add support for FEAT_{LS64, LS64_V}
KVM: arm64: Enable FEAT_{LS64, LS64_V} in the supported guest
arm64: Provide basic EL2 setup for FEAT_{LS64, LS64_V} usage at EL0/1
KVM: arm64: Handle DABT caused by LS64* instructions on unsupported memory
KVM: arm64: Add documentation for KVM_EXIT_ARM_LDST64B
KVM: arm64: Add exit to userspace on {LD,ST}64B* outside of memslots
arm64: Unconditionally enable PAN support
arm64: Unconditionally enable LSE support
arm64: Add support for TSV110 Spectre-BHB mitigation
Signed-off-by: Marc Zyngier <maz@kernel.org>
Armv8.7 introduces single-copy atomic 64-byte loads and stores
instructions and its variants named under FEAT_{LS64, LS64_V}.
These features are identified by ID_AA64ISAR1_EL1.LS64 and the
use of such instructions in userspace (EL0) can be trapped.
As st64bv (FEAT_LS64_V) and st64bv0 (FEAT_LS64_ACCDATA) can not be tell
apart, FEAT_LS64 and FEAT_LS64_ACCDATA which will be supported in later
patch will be exported to userspace, FEAT_LS64_V will be enabled only
in kernel.
In order to support the use of corresponding instructions in userspace:
- Make ID_AA64ISAR1_EL1.LS64 visbile to userspace
- Add identifying and enabling in the cpufeature list
- Expose these support of these features to userspace through HWCAP3
and cpuinfo
ld64b/st64b (FEAT_LS64) and st64bv (FEAT_LS64_V) is intended for
special memory (device memory) so requires support by the CPU, system
and target memory location (device that support these instructions).
The HWCAP3_LS64, implies the support of CPU and system (since no
identification method from system, so SoC vendors should advertise
support in the CPU if system also support them).
Otherwise for ld64b/st64b the atomicity may not be guaranteed or a
DABT will be generated, so users (probably userspace driver developer)
should make sure the target memory (device) also have the support.
For st64bv 0xffffffffffffffff will be returned as status result for
unsupported memory so user should check it.
Document the restrictions along with HWCAP3_LS64.
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Oliver Upton <oupton@kernel.org>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>
Signed-off-by: Will Deacon <will@kernel.org>
When CONFIG_CPUMASK_OFFSTACK is not enabled, and resuming from s2ram on
Renesas R-Car H3 (big.LITTLE 4x Cortex-A57 + 4x Cortex-A53), during
enabling of the first little core, a warning message is printed:
AMU: CPU[4] doesn't support AMU counters
This confuses users, as during boot amu_fie_setup() does not print such
a message, unless debugging is enabled (freq_counters_valid() prints
"CPU%d: counters are not supported.\n" at debug level in that case).
Hence drop the warning, freq_counters_valid() has already printed a
debug message anyway.
Fixes: 6fd9be0b7b ("arm64: topology: Handle AMU FIE setup on CPU hotplug")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Lifeng Zheng <zhenglifeng1@huawei.com>
Signed-off-by: Will Deacon <will@kernel.org>
FEAT_PAN has been around since ARMv8.1 (over 11 years ago), has no compiler
dependency (we have our own accessors), and is a great security benefit.
Drop CONFIG_ARM64_PAN, and make the support unconditionnal.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
LSE atomics have been in the architecture since ARMv8.1 (released in
2014), and are hopefully supported by all modern toolchains.
Drop the optional nature of LSE support in the kernel, and always
compile the support in, as this really is very little code. LL/SC
still is the default, and the switch to LSE is done dynamically.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
When SME is supported, Restoring SVE signal context can go wrong in a
few ways, including placing the task into an invalid state where the
kernel may read from out-of-bounds memory (and may potentially take a
fatal fault) and/or may kill the task with a SIGKILL.
(1) Restoring a context with SVE_SIG_FLAG_SM set can place the task into
an invalid state where SVCR.SM is set (and sve_state is non-NULL)
but TIF_SME is clear, consequently resuting in out-of-bounds memory
reads and/or killing the task with SIGKILL.
This can only occur in unusual (but legitimate) cases where the SVE
signal context has either been modified by userspace or was saved in
the context of another task (e.g. as with CRIU), as otherwise the
presence of an SVE signal context with SVE_SIG_FLAG_SM implies that
TIF_SME is already set.
While in this state, task_fpsimd_load() will NOT configure SMCR_ELx
(leaving some arbitrary value configured in hardware) before
restoring SVCR and attempting to restore the streaming mode SVE
registers from memory via sve_load_state(). As the value of
SMCR_ELx.LEN may be larger than the task's streaming SVE vector
length, this may read memory outside of the task's allocated
sve_state, reading unrelated data and/or triggering a fault.
While this can result in secrets being loaded into streaming SVE
registers, these values are never exposed. As TIF_SME is clear,
fpsimd_bind_task_to_cpu() will configure CPACR_ELx.SMEN to trap EL0
accesses to streaming mode SVE registers, so these cannot be
accessed directly at EL0. As fpsimd_save_user_state() verifies the
live vector length before saving (S)SVE state to memory, no secret
values can be saved back to memory (and hence cannot be observed via
ptrace, signals, etc).
When the live vector length doesn't match the expected vector length
for the task, fpsimd_save_user_state() will send a fatal SIGKILL
signal to the task. Hence the task may be killed after executing
userspace for some period of time.
(2) Restoring a context with SVE_SIG_FLAG_SM clear does not clear the
task's SVCR.SM. If SVCR.SM was set prior to restoring the context,
then the task will be left in streaming mode unexpectedly, and some
register state will be combined inconsistently, though the task will
be left in legitimate state from the kernel's PoV.
This can only occur in unusual (but legitimate) cases where ptrace
has been used to set SVCR.SM after entry to the sigreturn syscall,
as syscall entry clears SVCR.SM.
In these cases, the the provided SVE register data will be loaded
into the task's sve_state using the non-streaming SVE vector length
and the FPSIMD registers will be merged into this using the
streaming SVE vector length.
Fix (1) by setting TIF_SME when setting SVCR.SM. This also requires
ensuring that the task's sme_state has been allocated, but as this could
contain live ZA state, it should not be zeroed. Fix (2) by clearing
SVCR.SM when restoring a SVE signal context with SVE_SIG_FLAG_SM clear.
For consistency, I've pulled the manipulation of SVCR, TIF_SVE, TIF_SME,
and fp_type earlier, immediately after the allocation of
sve_state/sme_state, before the restore of the actual register state.
This makes it easier to ensure that these are always modified
consistently, even if a fault is taken while reading the register data
from the signal context. I do not expect any software to depend on the
exact state restored when a fault is taken while reading the context.
Fixes: 85ed24dad2 ("arm64/sme: Implement streaming SVE signal handling")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: <stable@vger.kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
The code to restore a ZA context doesn't attempt to allocate the task's
sve_state before setting TIF_SME. Consequently, restoring a ZA context
can place a task into an invalid state where TIF_SME is set but the
task's sve_state is NULL.
In legitimate but uncommon cases where the ZA signal context was NOT
created by the kernel in the context of the same task (e.g. if the task
is saved/restored with something like CRIU), we have no guarantee that
sve_state had been allocated previously. In these cases, userspace can
enter streaming mode without trapping while sve_state is NULL, causing a
later NULL pointer dereference when the kernel attempts to store the
register state:
| # ./sigreturn-za
| Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
| Mem abort info:
| ESR = 0x0000000096000046
| EC = 0x25: DABT (current EL), IL = 32 bits
| SET = 0, FnV = 0
| EA = 0, S1PTW = 0
| FSC = 0x06: level 2 translation fault
| Data abort info:
| ISV = 0, ISS = 0x00000046, ISS2 = 0x00000000
| CM = 0, WnR = 1, TnD = 0, TagAccess = 0
| GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
| user pgtable: 4k pages, 52-bit VAs, pgdp=0000000101f47c00
| [0000000000000000] pgd=08000001021d8403, p4d=0800000102274403, pud=0800000102275403, pmd=0000000000000000
| Internal error: Oops: 0000000096000046 [#1] SMP
| Modules linked in:
| CPU: 0 UID: 0 PID: 153 Comm: sigreturn-za Not tainted 6.19.0-rc1 #1 PREEMPT
| Hardware name: linux,dummy-virt (DT)
| pstate: 214000c9 (nzCv daIF +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
| pc : sve_save_state+0x4/0xf0
| lr : fpsimd_save_user_state+0xb0/0x1c0
| sp : ffff80008070bcc0
| x29: ffff80008070bcc0 x28: fff00000c1ca4c40 x27: 63cfa172fb5cf658
| x26: fff00000c1ca5228 x25: 0000000000000000 x24: 0000000000000000
| x23: 0000000000000000 x22: fff00000c1ca4c40 x21: fff00000c1ca4c40
| x20: 0000000000000020 x19: fff00000ff6900f0 x18: 0000000000000000
| x17: fff05e8e0311f000 x16: 0000000000000000 x15: 028fca8f3bdaf21c
| x14: 0000000000000212 x13: fff00000c0209f10 x12: 0000000000000020
| x11: 0000000000200b20 x10: 0000000000000000 x9 : fff00000ff69dcc0
| x8 : 00000000000003f2 x7 : 0000000000000001 x6 : fff00000c1ca5b48
| x5 : fff05e8e0311f000 x4 : 0000000008000000 x3 : 0000000000000000
| x2 : 0000000000000001 x1 : fff00000c1ca5970 x0 : 0000000000000440
| Call trace:
| sve_save_state+0x4/0xf0 (P)
| fpsimd_thread_switch+0x48/0x198
| __switch_to+0x20/0x1c0
| __schedule+0x36c/0xce0
| schedule+0x34/0x11c
| exit_to_user_mode_loop+0x124/0x188
| el0_interrupt+0xc8/0xd8
| __el0_irq_handler_common+0x18/0x24
| el0t_64_irq_handler+0x10/0x1c
| el0t_64_irq+0x198/0x19c
| Code: 54000040 d51b4408 d65f03c0 d503245f (e5bb5800)
| ---[ end trace 0000000000000000 ]---
Fix this by having restore_za_context() ensure that the task's sve_state
is allocated, matching what we do when taking an SME trap. Any live
SVE/SSVE state (which is restored earlier from a separate signal
context) must be preserved, and hence this is not zeroed.
Fixes: 39782210eb ("arm64/sme: Implement ZA signal handling")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: <stable@vger.kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
When SVE is supported but SME is not supported, a ptrace write to the
NT_ARM_SVE regset can place the tracee into an invalid state where
(non-streaming) SVE register data is stored in FP_STATE_SVE format but
TIF_SVE is clear. This can result in a later warning from
fpsimd_restore_current_state(), e.g.
WARNING: CPU: 0 PID: 7214 at arch/arm64/kernel/fpsimd.c:383 fpsimd_restore_current_state+0x50c/0x748
When this happens, fpsimd_restore_current_state() will set TIF_SVE,
placing the task into the correct state. This occurs before any other
check of TIF_SVE can possibly occur, as other checks of TIF_SVE only
happen while the FPSIMD/SVE/SME state is live. Thus, aside from the
warning, there is no functional issue.
This bug was introduced during rework to error handling in commit:
9f8bf718f2 ("arm64/fpsimd: ptrace: Gracefully handle errors")
... where the setting of TIF_SVE was moved into a block which is only
executed when system_supports_sme() is true.
Fix this by removing the system_supports_sme() check. This ensures that
TIF_SVE is set for (SVE-formatted) writes to NT_ARM_SVE, at the cost of
unconditionally manipulating the tracee's saved svcr value. The
manipulation of svcr is benign and inexpensive, and we already do
similar elsewhere (e.g. during signal handling), so I don't think it's
worth guarding this with system_supports_sme() checks.
Aside from the above, there is no functional change. The 'type' argument
to sve_set_common() is only set to ARM64_VEC_SME (in ssve_set())) when
system_supports_sme(), so the ARM64_VEC_SME case in the switch statement
is still unreachable when !system_supports_sme(). When
CONFIG_ARM64_SME=n, the only caller of sve_set_common() is sve_set(),
and the compiler can constant-fold for the case where type is
ARM64_VEC_SVE, removing the logic for other cases.
Reported-by: syzbot+d4ab35af21e99d07ce67@syzkaller.appspotmail.com
Fixes: 9f8bf718f2 ("arm64/fpsimd: ptrace: Gracefully handle errors")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: <stable@vger.kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Paravirt clock related functions are available in multiple archs.
In order to share the common parts, move the common static keys
to kernel/sched/ and remove them from the arch specific files.
Make a common paravirt_steal_clock() implementation available in
kernel/sched/cputime.c, guarding it with a new config option
CONFIG_HAVE_PV_STEAL_CLOCK_GEN, which can be selected by an arch
in case it wants to use that common variant.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260105110520.21356-7-jgross@suse.com
We set PSTATE.PAN to 1 on exiting from a guest if PAN support has
been compiled in and that it exists on the HW. However, this is not
necessarily correct.
In a nVHE configuration, there is no notion of PAN at EL2, so setting
PSTATE.PAN to anything is pointless.
Furthermore, not setting PAN to 0 when CONFIG_ARM64_PAN isn't set
means we run with the *guest's* PSTATE.PAN (which might be set to 1),
and we will explode on the next userspace access. Yes, the architecture
is delightful in that particular corner.
Fix the whole thing by always setting PAN to something when running
VHE (which implies PAN support), and only ignore it when running nVHE.
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://msgid.link/20260107124600.2736328-1-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
Currently, when a cpufreq policy is created, the AMU FIE setup process
checks all CPUs in the policy -- including those that are offline. If any
of these CPUs are offline at that time, their AMU capability flag hasn't
been verified yet, leading the check fail. As a result, AMU FIE is not
enabled, even if the CPUs that are online do support it.
Later, when the previously offline CPUs come online and report AMU support,
there's no mechanism in place to re-enable AMU FIE for the policy. This
leaves the entire frequency domain without AMU FIE, despite being eligible.
Restrict the initial AMU FIE check to only those CPUs that are online at
the time the policy is created, and allow CPUs that come online later to
join the policy with AMU FIE enabled.
Signed-off-by: Lifeng Zheng <zhenglifeng1@huawei.com>
Acked-by: Beata Michalska <beata.michalska@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
The TSV110 processor is vulnerable to the Spectre-BHB (Branch History
Buffer) attack, which can be exploited to leak information through
branch prediction side channels. This commit adds the MIDR of TSV110
to the list for software mitigation.
Signed-off-by: Jinqian Yang <yangjinqian1@huawei.com>
Reviewed-by: Zenghui Yu <zenghui.yu@linux.dev>
Signed-off-by: Will Deacon <will@kernel.org>
Replace the global screen_info with sysfb_primary_display of type
struct sysfb_display_info. Adapt all users of screen_info.
Instances of screen_info are defined for x86, loongarch and EFI,
with only one instance compiled into a specific build. Replace all
of them with sysfb_primary_display.
All existing users of screen_info are updated by pointing them to
sysfb_primary_display.screen instead. This introduces some churn to
the code, but has no impact on functionality.
Boot parameters and EFI config tables are unchanged. They transfer
screen_info as before. The logic in EFI's alloc_screen_info() changes
slightly, as it now returns the screen field of sysfb_primary_display.
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> # drivers/pci/
Reviewed-by: Richard Lyu <richard.lyu@suse.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
When we exec a new task we forget to flush the set of locked GCS mode bits.
Since we do flush the rest of the state this means that if GCS is locked
the new task will be unable to enable GCS, it will be locked as being
disabled. Add the expected flush.
Fixes: fc84bc5378 ("arm64/gcs: Context switch GCS state for EL0")
Cc: <stable@vger.kernel.org> # 6.13.x
Reported-by: Yury Khrustalev <Yury.Khrustalev@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Tested-by: Yury Khrustalev <yury.khrustalev@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Since commit 7137a203b2 ("arm64/fpsimd: Permit kernel mode NEON with
IRQs off"), the only condition under which the fallback path is taken
for FP/SIMD preserve/restore across a EFI runtime call is when it is
called from hardirq or NMI context.
In practice, this only happens when the EFI pstore driver is called to
dump the kernel log buffer into a EFI variable under a panic, oops or
emergency_restart() condition, and none of these can be expected to
result in a return to user space for the task in question.
This means that the existing EFI-specific logic for preserving and
restoring SVE/SME state is pointless, and can be removed.
Instead, kill the task, so that an exceedingly unlikely inadvertent
return to user space does not proceed with a corrupted FP/SIMD state.
Also, retain the preserve and restore of the base FP/SIMD state, as that
might belong to kernel mode use of FP/SIMD. (Note that EFI runtime calls
are never invoked reentrantly, even in this case, and so any interrupted
kernel mode FP/SIMD usage will be unrelated to EFI)
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>