Commit Graph

5615 Commits

Author SHA1 Message Date
Ada Couprie Diaz
708fd522b8 arm64/entry: Mask DAIF in cpu_switch_to(), call_on_irq_stack()
commit d42e6c20de upstream.

`cpu_switch_to()` and `call_on_irq_stack()` manipulate SP to change
to different stacks along with the Shadow Call Stack if it is enabled.
Those two stack changes cannot be done atomically and both functions
can be interrupted by SErrors or Debug Exceptions which, though unlikely,
is very much broken : if interrupted, we can end up with mismatched stacks
and Shadow Call Stack leading to clobbered stacks.

In `cpu_switch_to()`, it can happen when SP_EL0 points to the new task,
but x18 stills points to the old task's SCS. When the interrupt handler
tries to save the task's SCS pointer, it will save the old task
SCS pointer (x18) into the new task struct (pointed to by SP_EL0),
clobbering it.

In `call_on_irq_stack()`, it can happen when switching from the task stack
to the IRQ stack and when switching back. In both cases, we can be
interrupted when the SCS pointer points to the IRQ SCS, but SP points to
the task stack. The nested interrupt handler pushes its return addresses
on the IRQ SCS. It then detects that SP points to the task stack,
calls `call_on_irq_stack()` and clobbers the task SCS pointer with
the IRQ SCS pointer, which it will also use !

This leads to tasks returning to addresses on the wrong SCS,
or even on the IRQ SCS, triggering kernel panics via CONFIG_VMAP_STACK
or FPAC if enabled.

This is possible on a default config, but unlikely.
However, when enabling CONFIG_ARM64_PSEUDO_NMI, DAIF is unmasked and
instead the GIC is responsible for filtering what interrupts the CPU
should receive based on priority.
Given the goal of emulating NMIs, pseudo-NMIs can be received by the CPU
even in `cpu_switch_to()` and `call_on_irq_stack()`, possibly *very*
frequently depending on the system configuration and workload, leading
to unpredictable kernel panics.

Completely mask DAIF in `cpu_switch_to()` and restore it when returning.
Do the same in `call_on_irq_stack()`, but restore and mask around
the branch.
Mask DAIF even if CONFIG_SHADOW_CALL_STACK is not enabled for consistency
of behaviour between all configurations.

Introduce and use an assembly macro for saving and masking DAIF,
as the existing one saves but only masks IF.

Cc: <stable@vger.kernel.org>
Signed-off-by: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Reported-by: Cristian Prundeanu <cpru@amazon.com>
Fixes: 59b37fe52f ("arm64: Stash shadow stack pointer in the task struct on interrupt")
Tested-by: Cristian Prundeanu <cpru@amazon.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250718142814.133329-1-ada.coupriediaz@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-01 09:48:43 +01:00
Ryan Roberts
86d9837e46 arm64/mm: Close theoretical race where stale TLB entry remains valid
commit 4b63491838 upstream.

Commit 3ea277194d ("mm, mprotect: flush TLB if potentially racing with
a parallel reclaim leaving stale TLB entries") describes a race that,
prior to the commit, could occur between reclaim and operations such as
mprotect() when using reclaim's tlbbatch mechanism. See that commit for
details but the summary is:

"""
Nadav Amit identified a theoritical race between page reclaim and
mprotect due to TLB flushes being batched outside of the PTL being held.

He described the race as follows:

	CPU0				CPU1
	----				----
					user accesses memory using RW PTE
					[PTE now cached in TLB]
	try_to_unmap_one()
	==> ptep_get_and_clear()
	==> set_tlb_ubc_flush_pending()
					mprotect(addr, PROT_READ)
					==> change_pte_range()
					==> [ PTE non-present - no flush ]

					user writes using cached RW PTE
	...

	try_to_unmap_flush()
"""

The solution was to insert flush_tlb_batched_pending() in mprotect() and
friends to explcitly drain any pending reclaim TLB flushes. In the
modern version of this solution, arch_flush_tlb_batched_pending() is
called to do that synchronisation.

arm64's tlbbatch implementation simply issues TLBIs at queue-time
(arch_tlbbatch_add_pending()), eliding the trailing dsb(ish). The
trailing dsb(ish) is finally issued in arch_tlbbatch_flush() at the end
of the batch to wait for all the issued TLBIs to complete.

Now, the Arm ARM states:

"""
The completion of the TLB maintenance instruction is guaranteed only by
the execution of a DSB by the observer that performed the TLB
maintenance instruction. The execution of a DSB by a different observer
does not have this effect, even if the DSB is known to be executed after
the TLB maintenance instruction is observed by that different observer.
"""

arch_tlbbatch_add_pending() and arch_tlbbatch_flush() conform to this
requirement because they are called from the same task (either kswapd or
caller of madvise(MADV_PAGEOUT)), so either they are on the same CPU or
if the task was migrated, __switch_to() contains an extra dsb(ish).

HOWEVER, arm64's arch_flush_tlb_batched_pending() is also implemented as
a dsb(ish). But this may be running on a CPU remote from the one that
issued the outstanding TLBIs. So there is no architectural gurantee of
synchonization. Therefore we are still vulnerable to the theoretical
race described in Commit 3ea277194d ("mm, mprotect: flush TLB if
potentially racing with a parallel reclaim leaving stale TLB entries").

Fix this by flushing the entire mm in arch_flush_tlb_batched_pending().
This aligns with what the other arches that implement the tlbbatch
feature do.

Cc: <stable@vger.kernel.org>
Fixes: 43b3dfdd04 ("arm64: support batched/deferred tlb shootdown during page reclamation/migration")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://lore.kernel.org/r/20250530152445.2430295-1-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27 11:11:18 +01:00
Mark Rutland
b3cfc1f9f5 arm64/fpsimd: Do not discard modified SVE state
[ Upstream commit 398edaa12f ]

Historically SVE state was discarded deterministically early in the
syscall entry path, before ptrace is notified of syscall entry. This
permitted ptrace to modify SVE state before and after the "real" syscall
logic was executed, with the modified state being retained.

This behaviour was changed by commit:

  8c845e2731 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch")

That commit was intended to speed up workloads that used SVE by
opportunistically leaving SVE enabled when returning from a syscall.
The syscall entry logic was modified to truncate the SVE state without
disabling userspace access to SVE, and fpsimd_save_user_state() was
modified to discard userspace SVE state whenever
in_syscall(current_pt_regs()) is true, i.e. when
current_pt_regs()->syscallno != NO_SYSCALL.

Leaving SVE enabled opportunistically resulted in a couple of changes to
userspace visible behaviour which weren't described at the time, but are
logical consequences of opportunistically leaving SVE enabled:

* Signal handlers can observe the type of saved state in the signal's
  sve_context record. When the kernel only tracks FPSIMD state, the 'vq'
  field is 0 and there is no space allocated for register contents. When
  the kernel tracks SVE state, the 'vq' field is non-zero and the
  register contents are saved into the record.

  As a result of the above commit, 'vq' (and the presence of SVE
  register state) is non-deterministically zero or non-zero for a period
  of time after a syscall. The effective register state is still
  deterministic.

  Hopefully no-one relies on this being deterministic. In general,
  handlers for asynchronous events cannot expect a deterministic state.

* Similarly to signal handlers, ptrace requests can observe the type of
  saved state in the NT_ARM_SVE and NT_ARM_SSVE regsets, as this is
  exposed in the header flags. As a result of the above commit, this is
  now in a non-deterministic state after a syscall. The effective
  register state is still deterministic.

  Hopefully no-one relies on this being deterministic. In general,
  debuggers would have to handle this changing at arbitrary points
  during program flow.

Discarding the SVE state within fpsimd_save_user_state() resulted in
other changes to userspace visible behaviour which are not desirable:

* A ptrace tracer can modify (or create) a tracee's SVE state at syscall
  entry or syscall exit. As a result of the above commit, the tracee's
  SVE state can be discarded non-deterministically after modification,
  rather than being retained as it previously was.

  Note that for co-operative tracer/tracee pairs, the tracer may
  (re)initialise the tracee's state arbitrarily after the tracee sends
  itself an initial SIGSTOP via a syscall, so this affects realistic
  design patterns.

* The current_pt_regs()->syscallno field can be modified via ptrace, and
  can be altered even when the tracee is not really in a syscall,
  causing non-deterministic discarding to occur in situations where this
  was not previously possible.

Further, using current_pt_regs()->syscallno in this way is unsound:

* There are data races between readers and writers of the
  current_pt_regs()->syscallno field.

  The current_pt_regs()->syscallno field is written in interruptible
  task context using plain C accesses, and is read in irq/softirq
  context using plain C accesses. These accesses are subject to data
  races, with the usual concerns with tearing, etc.

* Writes to current_pt_regs()->syscallno are subject to compiler
  reordering.

  As current_pt_regs()->syscallno is written with plain C accesses,
  the compiler is free to move those writes arbitrarily relative to
  anything which doesn't access the same memory location.

  In theory this could break signal return, where prior to restoring the
  SVE state, restore_sigframe() calls forget_syscall(). If the write
  were hoisted after restore of some SVE state, that state could be
  discarded unexpectedly.

  In practice that reordering cannot happen in the absence of LTO (as
  cross compilation-unit function calls happen prevent this reordering),
  and that reordering appears to be unlikely in the presence of LTO.

Additionally, since commit:

  f130ac0ae4 ("arm64: syscall: unmask DAIF earlier for SVCs")

... DAIF is unmasked before el0_svc_common() sets regs->syscallno to the
real syscall number. Consequently state may be saved in SVE format prior
to this point.

Considering all of the above, current_pt_regs()->syscallno should not be
used to infer whether the SVE state can be discarded. Luckily we can
instead use cpu_fp_state::to_save to track when it is safe to discard
the SVE state:

* At syscall entry, after the live SVE register state is truncated, set
  cpu_fp_state::to_save to FP_STATE_FPSIMD to indicate that only the
  FPSIMD portion is live and needs to be saved.

* At syscall exit, once the task's state is guaranteed to be live, set
  cpu_fp_state::to_save to FP_STATE_CURRENT to indicate that TIF_SVE
  must be considered to determine which state needs to be saved.

* Whenever state is modified, it must be saved+flushed prior to
  manipulation. The state will be truncated if necessary when it is
  saved, and reloading the state will set fp_state::to_save to
  FP_STATE_CURRENT, preventing subsequent discarding.

This permits SVE state to be discarded *only* when it is known to have
been truncated (and the non-FPSIMD portions must be zero), and ensures
that SVE state is retained after it is explicitly modified.

For backporting, note that this fix depends on the following commits:

* b2482807fb ("arm64/sme: Optimise SME exit on syscall entry")
* f130ac0ae4 ("arm64: syscall: unmask DAIF earlier for SVCs")
* 929fa99b12 ("arm64/fpsimd: signal: Always save+flush state early")

Fixes: 8c845e2731 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch")
Fixes: f130ac0ae4 ("arm64: syscall: unmask DAIF earlier for SVCs")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-2-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19 15:31:38 +02:00
Mark Rutland
fa65c89f3f arm64/fpsimd: Avoid RES0 bits in the SME trap handler
[ Upstream commit 95507570fb ]

The SME trap handler consumes RES0 bits from the ESR when determining
the reason for the trap, and depends upon those bits reading as zero.
This may break in future when those RES0 bits are allocated a meaning
and stop reading as zero.

For SME traps taken with ESR_ELx.EC == 0b011101, the specific reason for
the trap is indicated by ESR_ELx.ISS.SMTC ("SME Trap Code"). This field
occupies bits [2:0] of ESR_ELx.ISS, and as of ARM DDI 0487 L.a, bits
[24:3] of ESR_ELx.ISS are RES0. ESR_ELx.ISS itself occupies bits [24:0]
of ESR_ELx.

Extract the SMTC field specifically, matching the way we handle ESR_ELx
fields elsewhere, and ensuring that the handler is future-proof.

Fixes: 8bd7f91c03 ("arm64/sme: Implement traps and syscall handling for SME")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-2-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19 15:31:34 +02:00
Ryan Roberts
8ad58a7eba arm64/mm: Check PUD_TYPE_TABLE in pud_bad()
[ Upstream commit bfb1d2b902 ]

pud_bad() is currently defined in terms of pud_table(). Although for some
configs, pud_table() is hard-coded to true i.e. when using 64K base pages
or when page table levels are less than 3.

pud_bad() is intended to check that the pud is configured correctly. Hence
let's open-code the same check that the full version of pud_table() uses
into pud_bad(). Then it always performs the check regardless of the config.

Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20250221044227.1145393-7-anshuman.khandual@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29 11:02:21 +02:00
Ryan Roberts
6215143ad3 arm64/mm: Check pmd_table() in pmd_trans_huge()
[ Upstream commit d1770e9098 ]

Check for pmd_table() in pmd_trans_huge() rather then just checking for the
PMD_TABLE_BIT. But ensure all present-invalid entries are handled correctly
by always setting PTE_VALID before checking with pmd_table().

Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20250221044227.1145393-8-anshuman.khandual@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29 11:02:21 +02:00
Jinqian Yang
e506751b7d arm64: Add support for HIP09 Spectre-BHB mitigation
[ Upstream commit e18c09b204 ]

The HIP09 processor is vulnerable to the Spectre-BHB (Branch History
Buffer) attack, which can be exploited to leak information through
branch prediction side channels. This commit adds the MIDR of HIP09
to the list for software mitigation.

Signed-off-by: Jinqian Yang <yangjinqian1@huawei.com>
Link: https://lore.kernel.org/r/20250325141900.2057314-1-yangjinqian1@huawei.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29 11:02:03 +02:00
James Morse
2176530849 arm64: proton-pack: Add new CPUs 'k' values for branch mitigation
commit efe676a1a7 upstream.

Update the list of 'k' values for the branch mitigation from arm's
website.

Add the values for Cortex-X1C. The MIDR_EL1 value can be found here:
https://developer.arm.com/documentation/101968/0002/Register-descriptions/AArch>

Link: https://developer.arm.com/documentation/110280/2-0/?lang=en
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-18 08:24:58 +02:00
James Morse
38c345fd54 arm64: bpf: Add BHB mitigation to the epilogue for cBPF programs
commit 0dfefc2ea2 upstream.

A malicious BPF program may manipulate the branch history to influence
what the hardware speculates will happen next.

On exit from a BPF program, emit the BHB mititgation sequence.

This is only applied for 'classic' cBPF programs that are loaded by
seccomp.

Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-18 08:24:58 +02:00
James Morse
f2aebb8ec6 arm64: proton-pack: Expose whether the branchy loop k value
commit a1152be30a upstream.

Add a helper to expose the k value of the branchy loop. This is needed
by the BPF JIT to generate the mitigation sequence in BPF programs.

Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-18 08:24:58 +02:00
James Morse
ec5bca57af arm64: proton-pack: Expose whether the platform is mitigated by firmware
commit e7956c92f3 upstream.

is_spectre_bhb_fw_affected() allows the caller to determine if the CPU
is known to need a firmware mitigation. CPUs are either on the list
of CPUs we know about, or firmware has been queried and reported that
the platform is affected - and mitigated by firmware.

This helper is not useful to determine if the platform is mitigated
by firmware. A CPU could be on the know list, but the firmware may
not be implemented. Its affected but not mitigated.

spectre_bhb_enable_mitigation() handles this distinction by checking
the firmware state before enabling the mitigation.

Add a helper to expose this state. This will be used by the BPF JIT
to determine if calling firmware for a mitigation is necessary and
supported.

Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-18 08:24:58 +02:00
James Morse
2a3915e861 arm64: insn: Add support for encoding DSB
commit 63de8abd97 upstream.

To generate code in the eBPF epilogue that uses the DSB instruction,
insn.c needs a heler to encode the type and domain.

Re-use the crm encoding logic from the DMB instruction.

Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-18 08:24:58 +02:00
Anshuman Khandual
686d8f7b59 arm64/boot: Enable EL2 requirements for FEAT_PMUv3p9
commit 858c7bfcb3 upstream.

FEAT_PMUv3p9 registers such as PMICNTR_EL0, PMICFILTR_EL0, and PMUACR_EL1
access from EL1 requires appropriate EL2 fine grained trap configuration
via FEAT_FGT2 based trap control registers HDFGRTR2_EL2 and HDFGWTR2_EL2.
Otherwise such register accesses will result in traps into EL2.

Add a new helper __init_el2_fgt2() which initializes FEAT_FGT2 based fine
grained trap control registers HDFGRTR2_EL2 and HDFGWTR2_EL2 (setting the
bits nPMICNTR_EL0, nPMICFILTR_EL0 and nPMUACR_EL1) to enable access into
PMICNTR_EL0, PMICFILTR_EL0, and PMUACR_EL1 registers.

Also update booting.rst with SCR_EL3.FGTEn2 requirement for all FEAT_FGT2
based registers to be accessible in EL2.

Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kvmarm@lists.linux.dev
Fixes: 0bbff9ed81 ("perf/arm_pmuv3: Add PMUv3.9 per counter EL0 access control")
Fixes: d8226d8cfb ("perf: arm_pmuv3: Add support for Armv9.4 PMU instruction counter")
Tested-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20250227035119.2025171-1-anshuman.khandual@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25 10:48:04 +02:00
Keir Fraser
eec737e17e arm64: mops: Do not dereference src reg for a set operation
commit a13bfa4fe0 upstream.

The source register is not used for SET* and reading it can result in
a UBSAN out-of-bounds array access error, specifically when the MOPS
exception is taken from a SET* sequence with XZR (reg 31) as the
source. Architecturally this is the only case where a src/dst/size
field in the ESR can be reported as 31.

Prior to 2de451a329 the code in do_el0_mops() was benign as the
use of pt_regs_read_reg() prevented the out-of-bounds access.

Fixes: 2de451a329 ("KVM: arm64: Add handler for MOPS exceptions")
Cc: <stable@vger.kernel.org> # 6.12.x
Cc: Kristina Martsenko <kristina.martsenko@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: stable@vger.kernel.org
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Keir Fraser <keirf@google.com>
Reviewed-by: Kristina Martšenko <kristina.martsenko@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20250326110448.3792396-1-keirf@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20 10:15:46 +02:00
Douglas Anderson
3b0f2526c8 arm64: errata: Assume that unknown CPUs _are_ vulnerable to Spectre BHB
commit e403e85383 upstream.

The code for detecting CPUs that are vulnerable to Spectre BHB was
based on a hardcoded list of CPU IDs that were known to be affected.
Unfortunately, the list mostly only contained the IDs of standard ARM
cores. The IDs for many cores that are minor variants of the standard
ARM cores (like many Qualcomm Kyro CPUs) weren't listed. This led the
code to assume that those variants were not affected.

Flip the code on its head and instead assume that a core is vulnerable
if it doesn't have CSV2_3 but is unrecognized as being safe. This
involves creating a "Spectre BHB safe" list.

As of right now, the only CPU IDs added to the "Spectre BHB safe" list
are ARM Cortex A35, A53, A55, A510, and A520. This list was created by
looking for cores that weren't listed in ARM's list [1] as per review
feedback on v2 of this patch [2]. Additionally Brahma A53 is added as
per mailing list feedback [3].

NOTE: this patch will not actually _mitigate_ anyone, it will simply
cause them to report themselves as vulnerable. If any cores in the
system are reported as vulnerable but not mitigated then the whole
system will be reported as vulnerable though the system will attempt
to mitigate with the information it has about the known cores.

[1] https://developer.arm.com/Arm%20Security%20Center/Spectre-BHB
[2] https://lore.kernel.org/r/20241219175128.GA25477@willie-the-truck
[3] https://lore.kernel.org/r/18dbd7d1-a46c-4112-a425-320c99f67a8d@broadcom.com

Fixes: 558c303c97 ("arm64: Mitigate spectre style branch history side channels")
Cc: stable@vger.kernel.org
Reviewed-by: Julius Werner <jwerner@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20250107120555.v4.2.I2040fa004dafe196243f67ebcc647cbedbb516e6@changeid
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20 10:15:34 +02:00
Douglas Anderson
2ff58c5b26 arm64: cputype: Add MIDR_CORTEX_A76AE
commit a9b5bd81b2 upstream.

>From the TRM, MIDR_CORTEX_A76AE has a partnum of 0xDOE and an
implementor of 0x41 (ARM). Add the values.

Cc: stable@vger.kernel.org # dependency of the next fix in the series
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20250107120555.v4.4.I151f3b7ee323bcc3082179b8c60c3cd03308aa94@changeid
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20 10:15:34 +02:00
Douglas Anderson
cb58e90920 arm64: cputype: Add QCOM_CPU_PART_KRYO_3XX_GOLD
[ Upstream commit 401c3333bb ]

Add a definition for the Qualcomm Kryo 300-series Gold cores.

Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Trilok Soni <quic_tsoni@quicinc.com>
Link: https://lore.kernel.org/r/20241219131107.v3.1.I18e0288742871393228249a768e5d56ea65d93dc@changeid
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-20 10:15:12 +02:00
Mark Rutland
cb53828d69 KVM: arm64: Remove VHE host restore of CPACR_EL1.SMEN
[ Upstream commit 407a99c465 ]

When KVM is in VHE mode, the host kernel tries to save and restore the
configuration of CPACR_EL1.SMEN (i.e. CPTR_EL2.SMEN when HCR_EL2.E2H=1)
across kvm_arch_vcpu_load_fp() and kvm_arch_vcpu_put_fp(), since the
configuration may be clobbered by hyp when running a vCPU. This logic
has historically been broken, and is currently redundant.

This logic was originally introduced in commit:

  861262ab86 ("KVM: arm64: Handle SME host state when running guests")

At the time, the VHE hyp code would reset CPTR_EL2.SMEN to 0b00 when
returning to the host, trapping host access to SME state. Unfortunately,
this was unsafe as the host could take a softirq before calling
kvm_arch_vcpu_put_fp(), and if a softirq handler were to use kernel mode
NEON the resulting attempt to save the live FPSIMD/SVE/SME state would
result in a fatal trap.

That issue was limited to VHE mode. For nVHE/hVHE modes, KVM always
saved/restored the host kernel's CPACR_EL1 value, and configured
CPTR_EL2.TSM to 0b0, ensuring that host usage of SME would not be
trapped.

The issue above was incidentally fixed by commit:

  375110ab51 ("KVM: arm64: Fix resetting SME trap values on reset for (h)VHE")

That commit changed the VHE hyp code to configure CPTR_EL2.SMEN to 0b01
when returning to the host, permitting host kernel usage of SME,
avoiding the issue described above. At the time, this was not identified
as a fix for commit 861262ab86.

Now that the host eagerly saves and unbinds its own FPSIMD/SVE/SME
state, there's no need to save/restore the state of the EL0 SME trap.
The kernel can safely save/restore state without trapping, as described
above, and will restore userspace state (including trap controls) before
returning to userspace.

Remove the redundant logic.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250210195226.1215254-5-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
[Update for rework of flags storage -- broonie]
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-28 22:03:32 +01:00
Mark Rutland
d547b363f1 KVM: arm64: Remove VHE host restore of CPACR_EL1.ZEN
[ Upstream commit 459f059be7 ]

When KVM is in VHE mode, the host kernel tries to save and restore the
configuration of CPACR_EL1.ZEN (i.e. CPTR_EL2.ZEN when HCR_EL2.E2H=1)
across kvm_arch_vcpu_load_fp() and kvm_arch_vcpu_put_fp(), since the
configuration may be clobbered by hyp when running a vCPU. This logic is
currently redundant.

The VHE hyp code unconditionally configures CPTR_EL2.ZEN to 0b01 when
returning to the host, permitting host kernel usage of SVE.

Now that the host eagerly saves and unbinds its own FPSIMD/SVE/SME
state, there's no need to save/restore the state of the EL0 SVE trap.
The kernel can safely save/restore state without trapping, as described
above, and will restore userspace state (including trap controls) before
returning to userspace.

Remove the redundant logic.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250210195226.1215254-4-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
[Rework for refactoring of where the flags are stored -- broonie]
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-28 22:03:32 +01:00
Mark Rutland
f19a46cb53 KVM: arm64: Remove host FPSIMD saving for non-protected KVM
[ Upstream commit 8eca7f6d51 ]

Now that the host eagerly saves its own FPSIMD/SVE/SME state,
non-protected KVM never needs to save the host FPSIMD/SVE/SME state,
and the code to do this is never used. Protected KVM still needs to
save/restore the host FPSIMD/SVE state to avoid leaking guest state to
the host (and to avoid revealing to the host whether the guest used
FPSIMD/SVE/SME), and that code needs to be retained.

Remove the unused code and data structures.

To avoid the need for a stub copy of kvm_hyp_save_fpsimd_host() in the
VHE hyp code, the nVHE/hVHE version is moved into the shared switch
header, where it is only invoked when KVM is in protected mode.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250210195226.1215254-3-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-28 22:03:32 +01:00
Fuad Tabba
e6cd28bbbf KVM: arm64: Calculate cptr_el2 traps on activating traps
[ Upstream commit 2fd5b4b0e7 ]

Similar to VHE, calculate the value of cptr_el2 from scratch on
activate traps. This removes the need to store cptr_el2 in every
vcpu structure. Moreover, some traps, such as whether the guest
owns the fp registers, need to be set on every vcpu run.

Reported-by: James Clark <james.clark@linaro.org>
Fixes: 5294afdbf4 ("KVM: arm64: Exclude FP ownership from kvm_vcpu_arch")
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-13-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-28 22:03:32 +01:00
Piotr Jaroszynski
608bbf7ff5 Fix mmu notifiers for range-based invalidates
commit f7edb07ad7 upstream.

Update the __flush_tlb_range_op macro not to modify its parameters as
these are unexepcted semantics. In practice, this fixes the call to
mmu_notifier_arch_invalidate_secondary_tlbs() in
__flush_tlb_range_nosync() to use the correct range instead of an empty
range with start=end. The empty range was (un)lucky as it results in
taking the invalidate-all path that doesn't cause correctness issues,
but can certainly result in suboptimal perf.

This has been broken since commit 6bbd42e2df ("mmu_notifiers: call
invalidate_range() when invalidating TLBs") when the call to the
notifiers was added to __flush_tlb_range(). It predates the addition of
the __flush_tlb_range_op() macro from commit 360839027a ("arm64: tlb:
Refactor the core flush algorithm of __flush_tlb_range") that made the
bug hard to spot.

Fixes: 6bbd42e2df ("mmu_notifiers: call invalidate_range() when invalidating TLBs")

Signed-off-by: Piotr Jaroszynski <pjaroszynski@nvidia.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Raghavendra Rao Ananta <rananta@google.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Nicolin Chen <nicolinc@nvidia.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: iommu@lists.linux.dev
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Link: https://lore.kernel.org/r/20250304085127.2238030-1-pjaroszynski@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-22 12:54:25 -07:00
Ryan Roberts
a684bad77e mm: hugetlb: Add huge page size param to huge_ptep_get_and_clear()
commit 02410ac72a upstream.

In order to fix a bug, arm64 needs to be told the size of the huge page
for which the huge_pte is being cleared in huge_ptep_get_and_clear().
Provide for this by adding an `unsigned long sz` parameter to the
function. This follows the same pattern as huge_pte_clear() and
set_huge_pte_at().

This commit makes the required interface modifications to the core mm as
well as all arches that implement this function (arm64, loongarch, mips,
parisc, powerpc, riscv, s390, sparc). The actual arm64 bug will be fixed
in a separate commit.

Cc: stable@vger.kernel.org
Fixes: 66b3923a1a ("arm64: hugetlb: add support for PTE contiguous bit")
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> # riscv
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> # s390
Link: https://lore.kernel.org/r/20250226120656.2400136-2-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-13 13:02:17 +01:00
Oliver Upton
b9de147b2c KVM: arm64: Ensure a VMID is allocated before programming VTTBR_EL2
commit fa808ed4e1 upstream.

Vladimir reports that a race condition to attach a VMID to a stage-2 MMU
sometimes results in a vCPU entering the guest with a VMID of 0:

| CPU1                                            |   CPU2
|                                                 |
|                                                 | kvm_arch_vcpu_ioctl_run
|                                                 |   vcpu_load             <= load VTTBR_EL2
|                                                 |                            kvm_vmid->id = 0
|                                                 |
| kvm_arch_vcpu_ioctl_run                         |
|   vcpu_load             <= load VTTBR_EL2       |
|                            with kvm_vmid->id = 0|
|   kvm_arm_vmid_update   <= allocates fresh      |
|                            kvm_vmid->id and     |
|                            reload VTTBR_EL2     |
|                                                 |
|                                                 |   kvm_arm_vmid_update <= observes that kvm_vmid->id
|                                                 |                          already allocated,
|                                                 |                          skips reload VTTBR_EL2

Oh yeah, it's as bad as it looks. Remember that VHE loads the stage-2
MMU eagerly but a VMID only gets attached to the MMU later on in the
KVM_RUN loop.

Even in the "best case" where VTTBR_EL2 correctly gets reprogrammed
before entering the EL1&0 regime, there is a period of time where
hardware is configured with VMID 0. That's completely insane. So, rather
than decorating the 'late' binding with another hack, just allocate the
damn thing up front.

Attaching a VMID from vcpu_load() is still rollover safe since
(surprise!) it'll always get called after a vCPU was preempted.

Excuse me while I go find a brown paper bag.

Cc: stable@vger.kernel.org
Fixes: 934bf871f0 ("KVM: arm64: Load the stage-2 MMU context in kvm_vcpu_load_vhe()")
Reported-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250219220737.130842-1-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-07 18:25:37 +01:00
Catalin Marinas
93745f5f23 arm64: mte: Do not allow PROT_MTE on MAP_HUGETLB user mappings
PROT_MTE (memory tagging extensions) is not supported on all user mmap()
types for various reasons (memory attributes, backing storage, CoW
handling). The arm64 arch_validate_flags() function checks whether the
VM_MTE_ALLOWED flag has been set for a vma during mmap(), usually by
arch_calc_vm_flag_bits().

Linux prior to 6.13 does not support PROT_MTE hugetlb mappings. This was
added by commit 25c17c4b55 ("hugetlb: arm64: add mte support").
However, earlier kernels inadvertently set VM_MTE_ALLOWED on
(MAP_ANONYMOUS | MAP_HUGETLB) mappings by only checking for
MAP_ANONYMOUS.

Explicitly check MAP_HUGETLB in arch_calc_vm_flag_bits() and avoid
setting VM_MTE_ALLOWED for such mappings.

Fixes: 9f3419315f ("arm64: mte: Add PROT_MTE support to mmap() and mprotect()")
Cc: <stable@vger.kernel.org> # 5.10.x-6.12.x
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-27 04:30:13 -08:00
Ard Biesheuvel
e5251ae5d3 arm64/mm: Reduce PA space to 48 bits when LPA2 is not enabled
commit bf74bb73cd upstream.

Currently, LPA2 kernel support implies support for up to 52 bits of
physical addressing, and this is reflected in global definitions such as
PHYS_MASK_SHIFT and MAX_PHYSMEM_BITS.

This is potentially problematic, given that LPA2 hardware support is
modeled as a CPU feature which can be overridden, and with LPA2 hardware
support turned off, attempting to map physical regions with address bits
[51:48] set (which may exist on LPA2 capable systems booting with
arm64.nolva) will result in corrupted mappings with a truncated output
address and bogus shareability attributes.

This means that the accepted physical address range in the mapping
routines should be at most 48 bits wide when LPA2 support is configured
but not enabled at runtime.

Fixes: 352b0395b5 ("arm64: Enable 52-bit virtual addressing for 4k and 16k granule configs")
Cc: stable@vger.kernel.org
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241212081841.2168124-9-ardb+git@google.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-17 10:05:05 +01:00
Ard Biesheuvel
8abf613006 arm64/mm: Override PARange for !LPA2 and use it consistently
commit 62cffa496a upstream.

When FEAT_LPA{,2} are not implemented, the ID_AA64MMFR0_EL1.PARange and
TCR.IPS values corresponding with 52-bit physical addressing are
reserved.

Setting the TCR.IPS field to 0b110 (52-bit physical addressing) has side
effects, such as how the TTBRn_ELx.BADDR fields are interpreted, and so
it is important that disabling FEAT_LPA2 (by overriding the
ID_AA64MMFR0.TGran fields) also presents a PARange field consistent with
that.

So limit the field to 48 bits unless LPA2 is enabled, and update
existing references to use the override consistently.

Fixes: 352b0395b5 ("arm64: Enable 52-bit virtual addressing for 4k and 16k granule configs")
Cc: stable@vger.kernel.org
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241212081841.2168124-10-ardb+git@google.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-17 10:05:05 +01:00
Raghavendra Rao Ananta
fe425d5239 KVM: arm64: Get rid of userspace_irqchip_in_use
commit 38d7aacca0 upstream.

Improper use of userspace_irqchip_in_use led to syzbot hitting the
following WARN_ON() in kvm_timer_update_irq():

WARNING: CPU: 0 PID: 3281 at arch/arm64/kvm/arch_timer.c:459
kvm_timer_update_irq+0x21c/0x394
Call trace:
  kvm_timer_update_irq+0x21c/0x394 arch/arm64/kvm/arch_timer.c:459
  kvm_timer_vcpu_reset+0x158/0x684 arch/arm64/kvm/arch_timer.c:968
  kvm_reset_vcpu+0x3b4/0x560 arch/arm64/kvm/reset.c:264
  kvm_vcpu_set_target arch/arm64/kvm/arm.c:1553 [inline]
  kvm_arch_vcpu_ioctl_vcpu_init arch/arm64/kvm/arm.c:1573 [inline]
  kvm_arch_vcpu_ioctl+0x112c/0x1b3c arch/arm64/kvm/arm.c:1695
  kvm_vcpu_ioctl+0x4ec/0xf74 virt/kvm/kvm_main.c:4658
  vfs_ioctl fs/ioctl.c:51 [inline]
  __do_sys_ioctl fs/ioctl.c:907 [inline]
  __se_sys_ioctl fs/ioctl.c:893 [inline]
  __arm64_sys_ioctl+0x108/0x184 fs/ioctl.c:893
  __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
  invoke_syscall+0x78/0x1b8 arch/arm64/kernel/syscall.c:49
  el0_svc_common+0xe8/0x1b0 arch/arm64/kernel/syscall.c:132
  do_el0_svc+0x40/0x50 arch/arm64/kernel/syscall.c:151
  el0_svc+0x54/0x14c arch/arm64/kernel/entry-common.c:712
  el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730
  el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598

The following sequence led to the scenario:
 - Userspace creates a VM and a vCPU.
 - The vCPU is initialized with KVM_ARM_VCPU_PMU_V3 during
   KVM_ARM_VCPU_INIT.
 - Without any other setup, such as vGIC or vPMU, userspace issues
   KVM_RUN on the vCPU. Since the vPMU is requested, but not setup,
   kvm_arm_pmu_v3_enable() fails in kvm_arch_vcpu_run_pid_change().
   As a result, KVM_RUN returns after enabling the timer, but before
   incrementing 'userspace_irqchip_in_use':
   kvm_arch_vcpu_run_pid_change()
       ret = kvm_arm_pmu_v3_enable()
           if (!vcpu->arch.pmu.created)
               return -EINVAL;
       if (ret)
           return ret;
       [...]
       if (!irqchip_in_kernel(kvm))
           static_branch_inc(&userspace_irqchip_in_use);
 - Userspace ignores the error and issues KVM_ARM_VCPU_INIT again.
   Since the timer is already enabled, control moves through the
   following flow, ultimately hitting the WARN_ON():
   kvm_timer_vcpu_reset()
       if (timer->enabled)
          kvm_timer_update_irq()
              if (!userspace_irqchip())
                  ret = kvm_vgic_inject_irq()
                      ret = vgic_lazy_init()
                          if (unlikely(!vgic_initialized(kvm)))
                              if (kvm->arch.vgic.vgic_model !=
                                  KVM_DEV_TYPE_ARM_VGIC_V2)
                                      return -EBUSY;
                  WARN_ON(ret);

Theoretically, since userspace_irqchip_in_use's functionality can be
simply replaced by '!irqchip_in_kernel()', get rid of the static key
to avoid the mismanagement, which also helps with the syzbot issue.

Cc: <stable@vger.kernel.org>
Reported-by: syzbot <syzkaller@googlegroups.com>
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-05 14:02:44 +01:00
Kristina Martsenko
a11e7e3d2a arm64: probes: Disable kprobes/uprobes on MOPS instructions
[ Upstream commit c56c599d90 ]

FEAT_MOPS instructions require that all three instructions (prologue,
main and epilogue) appear consecutively in memory. Placing a
kprobe/uprobe on one of them doesn't work as only a single instruction
gets executed out-of-line or simulated. So don't allow placing a probe
on a MOPS instruction.

Fixes: b7564127ff ("arm64: mops: detect and enable FEAT_MOPS")
Signed-off-by: Kristina Martsenko <kristina.martsenko@arm.com>
Link: https://lore.kernel.org/r/20240930161051.3777828-2-kristina.martsenko@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-05 14:01:10 +01:00
Linus Torvalds
28e43197c4 Merge tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
 "20 hotfixes, 14 of which are cc:stable.

  Three affect DAMON. Lorenzo's five-patch series to address the
  mmap_region error handling is here also.

  Apart from that, various singletons"

* tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mailmap: add entry for Thorsten Blum
  ocfs2: remove entry once instead of null-ptr-dereference in ocfs2_xa_remove()
  signal: restore the override_rlimit logic
  fs/proc: fix compile warning about variable 'vmcore_mmap_ops'
  ucounts: fix counter leak in inc_rlimit_get_ucounts()
  selftests: hugetlb_dio: check for initial conditions to skip in the start
  mm: fix docs for the kernel parameter ``thp_anon=``
  mm/damon/core: avoid overflow in damon_feed_loop_next_input()
  mm/damon/core: handle zero schemes apply interval
  mm/damon/core: handle zero {aggregation,ops_update} intervals
  mm/mlock: set the correct prev on failure
  objpool: fix to make percpu slot allocation more robust
  mm/page_alloc: keep track of free highatomic
  mm: resolve faulty mmap_region() error path behaviour
  mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling
  mm: refactor map_deny_write_exec()
  mm: unconditionally close VMAs on error
  mm: avoid unsafe VMA hook invocation when error arises on mmap hook
  mm/thp: fix deferred split unqueue naming and locking
  mm/thp: fix deferred split queue not partially_mapped
2024-11-10 09:04:27 -08:00
Mario Limonciello
b79276dcac ACPI: processor: Move arch_init_invariance_cppc() call later
arch_init_invariance_cppc() is called at the end of
acpi_cppc_processor_probe() in order to configure frequency invariance
based upon the values from _CPC.

This however doesn't work on AMD CPPC shared memory designs that have
AMD preferred cores enabled because _CPC needs to be analyzed from all
cores to judge if preferred cores are enabled.

This issue manifests to users as a warning since commit 21fb59ab4b
("ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn"):
```
Could not retrieve highest performance (-19)
```

However the warning isn't the cause of this, it was actually
commit 279f838a61 ("x86/amd: Detect preferred cores in
amd_get_boost_ratio_numerator()") which exposed the issue.

To fix this problem, change arch_init_invariance_cppc() into a new weak
symbol that is called at the end of acpi_processor_driver_init().
Each architecture that supports it can declare the symbol to override
the weak one.

Define it for x86, in arch/x86/kernel/acpi/cppc.c, and for all of the
architectures using the generic arch_topology.c code.

Fixes: 279f838a61 ("x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()")
Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219431
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://patch.msgid.link/20241104222855.3959267-1-superm1@kernel.org
[ rjw: Changelog edit ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-11-06 21:31:36 +01:00
Lorenzo Stoakes
5baf8b037d mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling
Currently MTE is permitted in two circumstances (desiring to use MTE
having been specified by the VM_MTE flag) - where MAP_ANONYMOUS is
specified, as checked by arch_calc_vm_flag_bits() and actualised by
setting the VM_MTE_ALLOWED flag, or if the file backing the mapping is
shmem, in which case we set VM_MTE_ALLOWED in shmem_mmap() when the mmap
hook is activated in mmap_region().

The function that checks that, if VM_MTE is set, VM_MTE_ALLOWED is also
set is the arm64 implementation of arch_validate_flags().

Unfortunately, we intend to refactor mmap_region() to perform this check
earlier, meaning that in the case of a shmem backing we will not have
invoked shmem_mmap() yet, causing the mapping to fail spuriously.

It is inappropriate to set this architecture-specific flag in general mm
code anyway, so a sensible resolution of this issue is to instead move the
check somewhere else.

We resolve this by setting VM_MTE_ALLOWED much earlier in do_mmap(), via
the arch_calc_vm_flag_bits() call.

This is an appropriate place to do this as we already check for the
MAP_ANONYMOUS case here, and the shmem file case is simply a variant of
the same idea - we permit RAM-backed memory.

This requires a modification to the arch_calc_vm_flag_bits() signature to
pass in a pointer to the struct file associated with the mapping, however
this is not too egregious as this is only used by two architectures anyway
- arm64 and parisc.

So this patch performs this adjustment and removes the unnecessary
assignment of VM_MTE_ALLOWED in shmem_mmap().

[akpm@linux-foundation.org: fix whitespace, per Catalin]
Link: https://lkml.kernel.org/r/ec251b20ba1964fb64cf1607d2ad80c47f3873df.1730224667.git.lorenzo.stoakes@oracle.com
Fixes: deb0f65628 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Jann Horn <jannh@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 16:49:55 -08:00
Linus Torvalds
d129377639 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
 "ARM64:

   - Fix the guest view of the ID registers, making the relevant fields
     writable from userspace (affecting ID_AA64DFR0_EL1 and
     ID_AA64PFR1_EL1)

   - Correcly expose S1PIE to guests, fixing a regression introduced in
     6.12-rc1 with the S1POE support

   - Fix the recycling of stage-2 shadow MMUs by tracking the context
     (are we allowed to block or not) as well as the recycling state

   - Address a couple of issues with the vgic when userspace
     misconfigures the emulation, resulting in various splats. Headaches
     courtesy of our Syzkaller friends

   - Stop wasting space in the HYP idmap, as we are dangerously close to
     the 4kB limit, and this has already exploded in -next

   - Fix another race in vgic_init()

   - Fix a UBSAN error when faking the cache topology with MTE enabled

  RISCV:

   - RISCV: KVM: use raw_spinlock for critical section in imsic

  x86:

   - A bandaid for lack of XCR0 setup in selftests, which causes trouble
     if the compiler is configured to have x86-64-v3 (with AVX) as the
     default ISA. Proper XCR0 setup will come in the next merge window.

   - Fix an issue where KVM would not ignore low bits of the nested CR3
     and potentially leak up to 31 bytes out of the guest memory's
     bounds

   - Fix case in which an out-of-date cached value for the segments
     could by returned by KVM_GET_SREGS.

   - More cleanups for KVM_X86_QUIRK_SLOT_ZAP_ALL

   - Override MTRR state for KVM confidential guests, making it WB by
     default as is already the case for Hyper-V guests.

  Generic:

   - Remove a couple of unused functions"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (27 commits)
  RISCV: KVM: use raw_spinlock for critical section in imsic
  KVM: selftests: Fix out-of-bounds reads in CPUID test's array lookups
  KVM: selftests: x86: Avoid using SSE/AVX instructions
  KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memory
  KVM: VMX: reset the segment cache after segment init in vmx_vcpu_reset()
  KVM: x86: Clean up documentation for KVM_X86_QUIRK_SLOT_ZAP_ALL
  KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range()
  KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslot
  x86/kvm: Override default caching mode for SEV-SNP and TDX
  KVM: Remove unused kvm_vcpu_gfn_to_pfn_atomic
  KVM: Remove unused kvm_vcpu_gfn_to_pfn
  KVM: arm64: Ensure vgic_ready() is ordered against MMIO registration
  KVM: arm64: vgic: Don't check for vgic_ready() when setting NR_IRQS
  KVM: arm64: Fix shift-out-of-bounds bug
  KVM: arm64: Shave a few bytes from the EL2 idmap code
  KVM: arm64: Don't eagerly teardown the vgic on init error
  KVM: arm64: Expose S1PIE to guests
  KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule
  KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed
  ...
2024-10-21 11:22:04 -07:00
Linus Torvalds
6efbea77b3 Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:

 - Disable software tag-based KASAN when compiling with GCC, as
   functions are incorrectly instrumented leading to a crash early
   during boot

 - Fix pkey configuration for kernel threads when POE is enabled

 - Fix invalid memory accesses in uprobes when targetting load-literal
   instructions

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  kasan: Disable Software Tag-Based KASAN with GCC
  Documentation/protection-keys: add AArch64 to documentation
  arm64: set POR_EL0 for kernel threads
  arm64: probes: Fix uprobes for big-endian kernels
  arm64: probes: Fix simulate_ldr*_literal()
  arm64: probes: Remove broken LDR (literal) uprobe support
2024-10-17 09:51:03 -07:00
Marc Zyngier
afa9b48f32 KVM: arm64: Shave a few bytes from the EL2 idmap code
Our idmap is becoming too big, to the point where it doesn't fit in
a 4kB page anymore.

There are some low-hanging fruits though, such as the el2_init_state
horror that is expanded 3 times in the kernel. Let's at least limit
ourselves to two copies, which makes the kernel link again.

At some point, we'll have to have a better way of doing this.

Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241009204903.GA3353168@thelio-3990X
2024-10-17 09:17:56 +01:00
Mark Rutland
13f8f1e05f arm64: probes: Fix uprobes for big-endian kernels
The arm64 uprobes code is broken for big-endian kernels as it doesn't
convert the in-memory instruction encoding (which is always
little-endian) into the kernel's native endianness before analyzing and
simulating instructions. This may result in a few distinct problems:

* The kernel may may erroneously reject probing an instruction which can
  safely be probed.

* The kernel may erroneously erroneously permit stepping an
  instruction out-of-line when that instruction cannot be stepped
  out-of-line safely.

* The kernel may erroneously simulate instruction incorrectly dur to
  interpretting the byte-swapped encoding.

The endianness mismatch isn't caught by the compiler or sparse because:

* The arch_uprobe::{insn,ixol} fields are encoded as arrays of u8, so
  the compiler and sparse have no idea these contain a little-endian
  32-bit value. The core uprobes code populates these with a memcpy()
  which similarly does not handle endianness.

* While the uprobe_opcode_t type is an alias for __le32, both
  arch_uprobe_analyze_insn() and arch_uprobe_skip_sstep() cast from u8[]
  to the similarly-named probe_opcode_t, which is an alias for u32.
  Hence there is no endianness conversion warning.

Fix this by changing the arch_uprobe::{insn,ixol} fields to __le32 and
adding the appropriate __le32_to_cpu() conversions prior to consuming
the instruction encoding. The core uprobes copies these fields as opaque
ranges of bytes, and so is unaffected by this change.

At the same time, remove MAX_UINSN_BYTES and consistently use
AARCH64_INSN_SIZE for clarity.

Tested with the following:

| #include <stdio.h>
| #include <stdbool.h>
|
| #define noinline __attribute__((noinline))
|
| static noinline void *adrp_self(void)
| {
|         void *addr;
|
|         asm volatile(
|         "       adrp    %x0, adrp_self\n"
|         "       add     %x0, %x0, :lo12:adrp_self\n"
|         : "=r" (addr));
| }
|
|
| int main(int argc, char *argv)
| {
|         void *ptr = adrp_self();
|         bool equal = (ptr == adrp_self);
|
|         printf("adrp_self   => %p\n"
|                "adrp_self() => %p\n"
|                "%s\n",
|                adrp_self, ptr, equal ? "EQUAL" : "NOT EQUAL");
|
|         return 0;
| }

.... where the adrp_self() function was compiled to:

| 00000000004007e0 <adrp_self>:
|   4007e0:       90000000        adrp    x0, 400000 <__ehdr_start>
|   4007e4:       911f8000        add     x0, x0, #0x7e0
|   4007e8:       d65f03c0        ret

Before this patch, the ADRP is not recognized, and is assumed to be
steppable, resulting in corruption of the result:

| # ./adrp-self
| adrp_self   => 0x4007e0
| adrp_self() => 0x4007e0
| EQUAL
| # echo 'p /root/adrp-self:0x007e0' > /sys/kernel/tracing/uprobe_events
| # echo 1 > /sys/kernel/tracing/events/uprobes/enable
| # ./adrp-self
| adrp_self   => 0x4007e0
| adrp_self() => 0xffffffffff7e0
| NOT EQUAL

After this patch, the ADRP is correctly recognized and simulated:

| # ./adrp-self
| adrp_self   => 0x4007e0
| adrp_self() => 0x4007e0
| EQUAL
| #
| # echo 'p /root/adrp-self:0x007e0' > /sys/kernel/tracing/uprobe_events
| # echo 1 > /sys/kernel/tracing/events/uprobes/enable
| # ./adrp-self
| adrp_self   => 0x4007e0
| adrp_self() => 0x4007e0
| EQUAL

Fixes: 9842ceae9f ("arm64: Add uprobe support")
Cc: stable@vger.kernel.org
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241008155851.801546-4-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-10-09 16:56:53 +01:00
Oliver Upton
c268f204f7 KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
Currently, when a nested MMU is repurposed for some other MMU context,
KVM unmaps everything during vcpu_load() while holding the MMU lock for
write. This is quite a performance bottleneck for large nested VMs, as
all vCPU scheduling will spin until the unmap completes.

Start punting the MMU cleanup to a vCPU request, where it is then
possible to periodically release the MMU lock and CPU in the presence of
contention.

Ensure that no vCPU winds up using a stale MMU by tracking the pending
unmap on the S2 MMU itself and requesting an unmap on every vCPU that
finds it.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241007233028.2236133-4-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-08 10:40:27 +01:00
Oliver Upton
3c164eb946 KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed
Right now the nested code allows unmap operations on a shadow stage-2 to
block unconditionally. This is wrong in a couple places, such as a
non-blocking MMU notifier or on the back of a sched_in() notifier as
part of shadow MMU recycling.

Carry through whether or not blocking is allowed to
kvm_pgtable_stage2_unmap(). This 'fixes' an issue where stage-2 MMU
reclaim would precipitate a stack overflow from a pile of kvm_sched_in()
callbacks, all trying to recycle a stage-2 MMU.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241007233028.2236133-3-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-08 10:40:27 +01:00
Linus Torvalds
4563243ede Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
 "ARM64:

   - Fix pKVM error path on init, making sure we do not change critical
     system registers as we're about to fail

   - Make sure that the host's vector length is at capped by a value
     common to all CPUs

   - Fix kvm_has_feat*() handling of "negative" features, as the current
     code is pretty broken

   - Promote Joey to the status of official reviewer, while James steps
     down -- hopefully only temporarly

  x86:

   - Fix compilation with KVM_INTEL=KVM_AMD=n

   - Fix disabling KVM_X86_QUIRK_SLOT_ZAP_ALL when shadow MMU is in use

  Selftests:

   - Fix compilation on non-x86 architectures"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  x86/reboot: emergency callbacks are now registered by common KVM code
  KVM: x86: leave kvm.ko out of the build if no vendor module is requested
  KVM: x86/mmu: fix KVM_X86_QUIRK_SLOT_ZAP_ALL for shadow MMU
  KVM: arm64: Fix kvm_has_feat*() handling of negative features
  KVM: selftests: Fix build on architectures other than x86_64
  KVM: arm64: Another reviewer reshuffle
  KVM: arm64: Constrain the host to the maximum shared SVE VL with pKVM
  KVM: arm64: Fix __pkvm_init_vcpu cptr_el2 error path
2024-10-06 10:53:28 -07:00
Marc Zyngier
a1d402abf8 KVM: arm64: Fix kvm_has_feat*() handling of negative features
Oliver reports that the kvm_has_feat() helper is not behaviing as
expected for negative feature. On investigation, the main issue
seems to be caused by the following construct:

 #define get_idreg_field(kvm, id, fld)				\
 	(id##_##fld##_SIGNED ?					\
	 get_idreg_field_signed(kvm, id, fld) :			\
	 get_idreg_field_unsigned(kvm, id, fld))

where one side of the expression evaluates as something signed,
and the other as something unsigned. In retrospect, this is totally
braindead, as the compiler converts this into an unsigned expression.
When compared to something that is 0, the test is simply elided.

Epic fail. Similar issue exists in the expand_field_sign() macro.

The correct way to handle this is to chose between signed and unsigned
comparisons, so that both sides of the ternary expression are of the
same type (bool).

In order to keep the code readable (sort of), we introduce new
comparison primitives taking an operator as a parameter, and
rewrite the kvm_has_feat*() helpers in terms of these primitives.

Fixes: c62d7a23b9 ("KVM: arm64: Add feature checking helpers")
Reported-by: Oliver Upton <oliver.upton@linux.dev>
Tested-by: Oliver Upton <oliver.upton@linux.dev>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20241002204239.2051637-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-03 19:35:27 +01:00
Mark Rutland
924725707d arm64: cputype: Add Neoverse-N3 definitions
Add cputype definitions for Neoverse-N3. These will be used for errata
detection in subsequent patches.

These values can be found in Table A-261 ("MIDR_EL1 bit descriptions")
in issue 02 of the Neoverse-N3 TRM, which can be found at:

  https://developer.arm.com/documentation/107997/0000/?lang=en

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240930111705.3352047-2-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-01 12:46:54 +01:00
Linus Torvalds
617a814f14 Merge tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
 "Along with the usual shower of singleton patches, notable patch series
  in this pull request are:

   - "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds
     consistency to the APIs and behaviour of these two core allocation
     functions. This also simplifies/enables Rustification.

   - "Some cleanups for shmem" from Baolin Wang. No functional changes -
     mode code reuse, better function naming, logic simplifications.

   - "mm: some small page fault cleanups" from Josef Bacik. No
     functional changes - code cleanups only.

   - "Various memory tiering fixes" from Zi Yan. A small fix and a
     little cleanup.

   - "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and
     simplifications and .text shrinkage.

   - "Kernel stack usage histogram" from Pasha Tatashin and Shakeel
     Butt. This is a feature, it adds new feilds to /proc/vmstat such as

       $ grep kstack /proc/vmstat
       kstack_1k 3
       kstack_2k 188
       kstack_4k 11391
       kstack_8k 243
       kstack_16k 0

     which tells us that 11391 processes used 4k of stack while none at
     all used 16k. Useful for some system tuning things, but
     partivularly useful for "the dynamic kernel stack project".

   - "kmemleak: support for percpu memory leak detect" from Pavel
     Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory.

   - "mm: memcg: page counters optimizations" from Roman Gushchin. "3
     independent small optimizations of page counters".

   - "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from
     David Hildenbrand. Improves PTE/PMD splitlock detection, makes
     powerpc/8xx work correctly by design rather than by accident.

   - "mm: remove arch_make_page_accessible()" from David Hildenbrand.
     Some folio conversions which make arch_make_page_accessible()
     unneeded.

   - "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David
     Finkel. Cleans up and fixes our handling of the resetting of the
     cgroup/process peak-memory-use detector.

   - "Make core VMA operations internal and testable" from Lorenzo
     Stoakes. Rationalizaion and encapsulation of the VMA manipulation
     APIs. With a view to better enable testing of the VMA functions,
     even from a userspace-only harness.

   - "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix
     issues in the zswap global shrinker, resulting in improved
     performance.

   - "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill
     in some missing info in /proc/zoneinfo.

   - "mm: replace follow_page() by folio_walk" from David Hildenbrand.
     Code cleanups and rationalizations (conversion to folio_walk())
     resulting in the removal of follow_page().

   - "improving dynamic zswap shrinker protection scheme" from Nhat
     Pham. Some tuning to improve zswap's dynamic shrinker. Significant
     reductions in swapin and improvements in performance are shown.

   - "mm: Fix several issues with unaccepted memory" from Kirill
     Shutemov. Improvements to the new unaccepted memory feature,

   - "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on
     DAX PUDs. This was missing, although nobody seems to have notied
     yet.

   - "Introduce a store type enum for the Maple tree" from Sidhartha
     Kumar. Cleanups and modest performance improvements for the maple
     tree library code.

   - "memcg: further decouple v1 code from v2" from Shakeel Butt. Move
     more cgroup v1 remnants away from the v2 memcg code.

   - "memcg: initiate deprecation of v1 features" from Shakeel Butt.
     Adds various warnings telling users that memcg v1 features are
     deprecated.

   - "mm: swap: mTHP swap allocator base on swap cluster order" from
     Chris Li. Greatly improves the success rate of the mTHP swap
     allocation.

   - "mm: introduce numa_memblks" from Mike Rapoport. Moves various
     disparate per-arch implementations of numa_memblk code into generic
     code.

   - "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly
     improves the performance of munmap() of swap-filled ptes.

   - "support large folio swap-out and swap-in for shmem" from Baolin
     Wang. With this series we no longer split shmem large folios into
     simgle-page folios when swapping out shmem.

   - "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice
     performance improvements and code reductions for gigantic folios.

   - "support shmem mTHP collapse" from Baolin Wang. Adds support for
     khugepaged's collapsing of shmem mTHP folios.

   - "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect()
     performance regression due to the addition of mseal().

   - "Increase the number of bits available in page_type" from Matthew
     Wilcox. Increases the number of bits available in page_type!

   - "Simplify the page flags a little" from Matthew Wilcox. Many legacy
     page flags are now folio flags, so the page-based flags and their
     accessors/mutators can be removed.

   - "mm: store zero pages to be swapped out in a bitmap" from Usama
     Arif. An optimization which permits us to avoid writing/reading
     zero-filled zswap pages to backing store.

   - "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race
     window which occurs when a MAP_FIXED operqtion is occurring during
     an unrelated vma tree walk.

   - "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of
     the vma_merge() functionality, making ot cleaner, more testable and
     better tested.

   - "misc fixups for DAMON {self,kunit} tests" from SeongJae Park.
     Minor fixups of DAMON selftests and kunit tests.

   - "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang.
     Code cleanups and folio conversions.

   - "Shmem mTHP controls and stats improvements" from Ryan Roberts.
     Cleanups for shmem controls and stats.

   - "mm: count the number of anonymous THPs per size" from Barry Song.
     Expose additional anon THP stats to userspace for improved tuning.

   - "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more
     folio conversions and removal of now-unused page-based APIs.

   - "replace per-quota region priorities histogram buffer with
     per-context one" from SeongJae Park. DAMON histogram
     rationalization.

   - "Docs/damon: update GitHub repo URLs and maintainer-profile" from
     SeongJae Park. DAMON documentation updates.

   - "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and
     improve related doc and warn" from Jason Wang: fixes usage of page
     allocator __GFP_NOFAIL and GFP_ATOMIC flags.

   - "mm: split underused THPs" from Yu Zhao. Improve THP=always policy.
     This was overprovisioning THPs in sparsely accessed memory areas.

   - "zram: introduce custom comp backends API" frm Sergey Senozhatsky.
     Add support for zram run-time compression algorithm tuning.

   - "mm: Care about shadow stack guard gap when getting an unmapped
     area" from Mark Brown. Fix up the various arch_get_unmapped_area()
     implementations to better respect guard areas.

   - "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability
     of mem_cgroup_iter() and various code cleanups.

   - "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge
     pfnmap support.

   - "resource: Fix region_intersects() vs add_memory_driver_managed()"
     from Huang Ying. Fix a bug in region_intersects() for systems with
     CXL memory.

   - "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches
     a couple more code paths to correctly recover from the encountering
     of poisoned memry.

   - "mm: enable large folios swap-in support" from Barry Song. Support
     the swapin of mTHP memory into appropriately-sized folios, rather
     than into single-page folios"

* tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits)
  zram: free secondary algorithms names
  uprobes: turn xol_area->pages[2] into xol_area->page
  uprobes: introduce the global struct vm_special_mapping xol_mapping
  Revert "uprobes: use vm_special_mapping close() functionality"
  mm: support large folios swap-in for sync io devices
  mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
  mm: fix swap_read_folio_zeromap() for large folios with partial zeromap
  mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries
  set_memory: add __must_check to generic stubs
  mm/vma: return the exact errno in vms_gather_munmap_vmas()
  memcg: cleanup with !CONFIG_MEMCG_V1
  mm/show_mem.c: report alloc tags in human readable units
  mm: support poison recovery from copy_present_page()
  mm: support poison recovery from do_cow_fault()
  resource, kunit: add test case for region_intersects()
  resource: make alloc_free_mem_region() works for iomem_resource
  mm: z3fold: deprecate CONFIG_Z3FOLD
  vfio/pci: implement huge_fault support
  mm/arm64: support large pfn mappings
  mm/x86: support large pfn mappings
  ...
2024-09-21 07:29:05 -07:00
Linus Torvalds
4a39ac5b7d Merge tag 'random-6.12-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator updates from Jason Donenfeld:
 "Originally I'd planned on sending each of the vDSO getrandom()
  architecture ports to their respective arch trees. But as we started
  to work on this, we found lots of interesting issues in the shared
  code and infrastructure, the fixes for which the various archs needed
  to base their work.

  So in the end, this turned into a nice collaborative effort fixing up
  issues and porting to 5 new architectures -- arm64, powerpc64,
  powerpc32, s390x, and loongarch64 -- with everybody pitching in and
  commenting on each other's code. It was a fun development cycle.

  This contains:

   - Numerous fixups to the vDSO selftest infrastructure, getting it
     running successfully on more platforms, and fixing bugs in it.

   - Additions to the vDSO getrandom & chacha selftests. Basically every
     time manual review unearthed a bug in a revision of an arch patch,
     or an ambiguity, the tests were augmented.

     By the time the last arch was submitted for review, s390x, v1 of
     the series was essentially fine right out of the gate.

   - Fixes to the the generic C implementation of vDSO getrandom, to
     build and run successfully on all archs, decoupling it from
     assumptions we had (unintentionally) made on x86_64 that didn't
     carry through to the other architectures.

   - Port of vDSO getrandom to LoongArch64, from Xi Ruoyao and acked by
     Huacai Chen.

   - Port of vDSO getrandom to ARM64, from Adhemerval Zanella and acked
     by Will Deacon.

   - Port of vDSO getrandom to PowerPC, in both 32-bit and 64-bit
     varieties, from Christophe Leroy and acked by Michael Ellerman.

   - Port of vDSO getrandom to S390X from Heiko Carstens, the arch
     maintainer.

  While it'd be natural for there to be things to fix up over the course
  of the development cycle, these patches got a decent amount of review
  from a fairly diverse crew of folks on the mailing lists, and, for the
  most part, they've been cooking in linux-next, which has been helpful
  for ironing out build issues.

  In terms of architectures, I think that mostly takes care of the
  important 64-bit archs with hardware still being produced and running
  production loads in settings where vDSO getrandom is likely to help.

  Arguably there's still RISC-V left, and we'll see for 6.13 whether
  they find it useful and submit a port"

* tag 'random-6.12-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (47 commits)
  selftests: vDSO: check cpu caps before running chacha test
  s390/vdso: Wire up getrandom() vdso implementation
  s390/vdso: Move vdso symbol handling to separate header file
  s390/vdso: Allow alternatives in vdso code
  s390/module: Provide find_section() helper
  s390/facility: Let test_facility() generate static branch if possible
  s390/alternatives: Remove ALT_FACILITY_EARLY
  s390/facility: Disable compile time optimization for decompressor code
  selftests: vDSO: fix vdso_config for s390
  selftests: vDSO: fix ELF hash table entry size for s390x
  powerpc/vdso: Wire up getrandom() vDSO implementation on VDSO64
  powerpc/vdso: Wire up getrandom() vDSO implementation on VDSO32
  powerpc/vdso: Refactor CFLAGS for CVDSO build
  powerpc/vdso32: Add crtsavres
  mm: Define VM_DROPPABLE for powerpc/32
  powerpc/vdso: Fix VDSO data access when running in a non-root time namespace
  selftests: vDSO: don't include generated headers for chacha test
  arm64: vDSO: Wire up getrandom() vDSO implementation
  arm64: alternative: make alternative_has_cap_likely() VDSO compatible
  selftests: vDSO: also test counter in vdso_test_chacha
  ...
2024-09-18 15:26:31 +02:00
Peter Xu
3e509c9b03 mm/arm64: support large pfn mappings
Support huge pfnmaps by using bit 56 (PTE_SPECIAL) for "special" on
pmds/puds.  Provide the pmd/pud helpers to set/get special bit.

There's one more thing missing for arm64 which is the pxx_pgprot() for
pmd/pud.  Add them too, which is mostly the same as the pte version by
dropping the pfn field.  These helpers are essential to be used in the new
follow_pfnmap*() API to report valid pgprot_t results.

Note that arm64 doesn't yet support huge PUD yet, but it's still
straightforward to provide the pud helpers that we need altogether.  Only
PMD helpers will make an immediate benefit until arm64 will support huge
PUDs first in general (e.g.  in THPs).

Link: https://lkml.kernel.org/r/20240826204353.2228736-19-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-17 01:07:00 -07:00
Peter Xu
0515e022e1 mm: always define pxx_pgprot()
There're:

  - 8 archs (arc, arm64, include, mips, powerpc, s390, sh, x86) that
  support pte_pgprot().

  - 2 archs (x86, sparc) that support pmd_pgprot().

  - 1 arch (x86) that support pud_pgprot().

Always define them to be used in generic code, and then we don't need to
fiddle with "#ifdef"s when doing so.

Link: https://lkml.kernel.org/r/20240826204353.2228736-9-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-17 01:06:59 -07:00
Linus Torvalds
64dd3b6a79 Merge tag 'for-linus-non-x86' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
 "These are the non-x86 changes (mostly ARM, as is usually the case).
  The generic and x86 changes will come later"

  ARM:

   - New Stage-2 page table dumper, reusing the main ptdump
     infrastructure

   - FP8 support

   - Nested virtualization now supports the address translation
     (FEAT_ATS1A) family of instructions

   - Add selftest checks for a bunch of timer emulation corner cases

   - Fix multiple cases where KVM/arm64 doesn't correctly handle the
     guest trying to use a GICv3 that wasn't advertised

   - Remove REG_HIDDEN_USER from the sysreg infrastructure, making
     things little simpler

   - Prevent MTE tags being restored by userspace if we are actively
     logging writes, as that's a recipe for disaster

   - Correct the refcount on a page that is not considered for MTE tag
     copying (such as a device)

   - When walking a page table to split block mappings, synchronize only
     at the end the walk rather than on every store

   - Fix boundary check when transfering memory using FFA

   - Fix pKVM TLB invalidation, only affecting currently out of tree
     code but worth addressing for peace of mind

  LoongArch:

   - Revert qspinlock to test-and-set simple lock on VM.

   - Add Loongson Binary Translation extension support.

   - Add PMU support for guest.

   - Enable paravirt feature control from VMM.

   - Implement function kvm_para_has_feature().

  RISC-V:

   - Fix sbiret init before forwarding to userspace

   - Don't zero-out PMU snapshot area before freeing data

   - Allow legacy PMU access from guest

   - Fix to allow hpmcounter31 from the guest"

* tag 'for-linus-non-x86' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (64 commits)
  LoongArch: KVM: Implement function kvm_para_has_feature()
  LoongArch: KVM: Enable paravirt feature control from VMM
  LoongArch: KVM: Add PMU support for guest
  KVM: arm64: Get rid of REG_HIDDEN_USER visibility qualifier
  KVM: arm64: Simplify visibility handling of AArch32 SPSR_*
  KVM: arm64: Simplify handling of CNTKCTL_EL12
  LoongArch: KVM: Add vm migration support for LBT registers
  LoongArch: KVM: Add Binary Translation extension support
  LoongArch: KVM: Add VM feature detection function
  LoongArch: Revert qspinlock to test-and-set simple lock on VM
  KVM: arm64: Register ptdump with debugfs on guest creation
  arm64: ptdump: Don't override the level when operating on the stage-2 tables
  arm64: ptdump: Use the ptdump description from a local context
  arm64: ptdump: Expose the attribute parsing functionality
  KVM: arm64: Add memory length checks and remove inline in do_ffa_mem_xfer
  KVM: arm64: Move pagetable definitions to common header
  KVM: arm64: nv: Add support for FEAT_ATS1A
  KVM: arm64: nv: Plumb handling of AT S1* traps from EL2
  KVM: arm64: nv: Make AT+PAN instructions aware of FEAT_PAN3
  KVM: arm64: nv: Sanitise SCTLR_EL1.EPAN according to VM configuration
  ...
2024-09-16 07:38:18 +02:00
Linus Torvalds
114143a595 Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
 "The highlights are support for Arm's "Permission Overlay Extension"
  using memory protection keys, support for running as a protected guest
  on Android as well as perf support for a bunch of new interconnect
  PMUs.

  Summary:

  ACPI:
   - Enable PMCG erratum workaround for HiSilicon HIP10 and 11
     platforms.
   - Ensure arm64-specific IORT header is covered by MAINTAINERS.

  CPU Errata:
   - Enable workaround for hardware access/dirty issue on Ampere-1A
     cores.

  Memory management:
   - Define PHYSMEM_END to fix a crash in the amdgpu driver.
   - Avoid tripping over invalid kernel mappings on the kexec() path.
   - Userspace support for the Permission Overlay Extension (POE) using
     protection keys.

  Perf and PMUs:
   - Add support for the "fixed instruction counter" extension in the
     CPU PMU architecture.
   - Extend and fix the event encodings for Apple's M1 CPU PMU.
   - Allow LSM hooks to decide on SPE permissions for physical
     profiling.
   - Add support for the CMN S3 and NI-700 PMUs.

  Confidential Computing:
   - Add support for booting an arm64 kernel as a protected guest under
     Android's "Protected KVM" (pKVM) hypervisor.

  Selftests:
   - Fix vector length issues in the SVE/SME sigreturn tests
   - Fix build warning in the ptrace tests.

  Timers:
   - Add support for PR_{G,S}ET_TSC so that 'rr' can deal with
     non-determinism arising from the architected counter.

  Miscellaneous:
   - Rework our IPI-based CPU stopping code to try NMIs if regular IPIs
     don't succeed.
   - Minor fixes and cleanups"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (94 commits)
  perf: arm-ni: Fix an NULL vs IS_ERR() bug
  arm64: hibernate: Fix warning for cast from restricted gfp_t
  arm64: esr: Define ESR_ELx_EC_* constants as UL
  arm64: pkeys: remove redundant WARN
  perf: arm_pmuv3: Use BR_RETIRED for HW branch event if enabled
  MAINTAINERS: List Arm interconnect PMUs as supported
  perf: Add driver for Arm NI-700 interconnect PMU
  dt-bindings/perf: Add Arm NI-700 PMU
  perf/arm-cmn: Improve format attr printing
  perf/arm-cmn: Clean up unnecessary NUMA_NO_NODE check
  arm64/mm: use lm_alias() with addresses passed to memblock_free()
  mm: arm64: document why pte is not advanced in contpte_ptep_set_access_flags()
  arm64: Expose the end of the linear map in PHYSMEM_END
  arm64: trans_pgd: mark PTEs entries as valid to avoid dead kexec()
  arm64/mm: Delete __init region from memblock.reserved
  perf/arm-cmn: Support CMN S3
  dt-bindings: perf: arm-cmn: Add CMN S3
  perf/arm-cmn: Refactor DTC PMU register access
  perf/arm-cmn: Make cycle counts less surprising
  perf/arm-cmn: Improve build-time assertion
  ...
2024-09-16 06:55:07 +02:00
Adhemerval Zanella
712676ea2b arm64: vDSO: Wire up getrandom() vDSO implementation
Hook up the generic vDSO implementation to the aarch64 vDSO data page.
The _vdso_rng_data required data is placed within the _vdso_data vvar
page, by using a offset larger than the vdso_data.

The vDSO function requires a ChaCha20 implementation that does not write
to the stack, and that can do an entire ChaCha20 permutation.  The one
provided uses NEON on the permute operation, with a fallback to the
syscall for chips that do not support AdvSIMD.

This also passes the vdso_test_chacha test along with
vdso_test_getrandom. The vdso_test_getrandom bench-single result on
Neoverse-N1 shows:

   vdso: 25000000 times in 0.783884250 seconds
   libc: 25000000 times in 8.780275399 seconds
syscall: 25000000 times in 8.786581518 seconds

A small fixup to arch/arm64/include/asm/mman.h was required to avoid
pulling kernel code into the vDSO, similar to what's already done in
arch/arm64/include/asm/rwonce.h.

Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2024-09-13 17:28:36 +02:00
Mark Rutland
2c2ca3416b arm64: alternative: make alternative_has_cap_likely() VDSO compatible
Currently alternative_has_cap_unlikely() can be used in VDSO code, but
alternative_has_cap_likely() cannot as it references alt_cb_patch_nops,
which is not available when linking the VDSO. This is unfortunate as it
would be useful to have alternative_has_cap_likely() available in VDSO
code.

The use of alt_cb_patch_nops was added in commit:

  d926079f17 ("arm64: alternatives: add shared NOP callback")

... as removing duplicate NOPs within the kernel Image saved areasonable
amount of space.

Given the VDSO code will have nowhere near as many alternative branches
as the main kernel image, this isn't much of a concern, and a few extra
nops isn't a massive problem.

Change alternative_has_cap_likely() to only use alt_cb_patch_nops for
the main kernel image, and allow duplicate NOPs in VDSO code.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2024-09-13 17:28:35 +02:00
Will Deacon
75078ba2b3 Merge branch 'for-next/timers' into for-next/core
* for-next/timers:
  arm64: Implement prctl(PR_{G,S}ET_TSC)
2024-09-12 13:44:03 +01:00