2907 Commits

Author SHA1 Message Date
Linus Torvalds 4f7e89065e Merge tag 'bootconfig-v7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull bootconfig updates from Masami Hiramatsu:

 - bootconfig: move xbc_snprint_cmdline() to lib/bootconfig.c

   Move the xbc_snprint_cmdline() function and its buffer from
   main.c to the shared lib/bootconfig.c parser library so it
   can be reused by userspace tools.

 - render kernel.* subtree as cmdline string with -C

   Add a new -C option to print the kernel.* subtree as a flat
   command-line string at build time, allowing early parameter
   injection without runtime parsing.

* tag 'bootconfig-v7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tools/bootconfig: render kernel.* subtree as cmdline string with -C
  bootconfig: move xbc_snprint_cmdline() to lib/bootconfig.c
2026-06-16 17:29:24 +05:30
Linus Torvalds f8115f0e8a Merge tag 'slab-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:

 - Support for "allocation tokens" (currently available in Clang 22+)
   for smarter partitioning of kmalloc caches based on the allocated
   object type, which can be enabled instead of the "random"
   per-caller-address-hash partitioning.

   It should be able to deterministically separate types containing a
   pointer from those that do not (Marco Elver)

 - Improvements and simplification of the kmem_cache_alloc_bulk() and
   mempool_alloc_bulk() API. This includes adaptation of callers
   (Christoph Hellwig)

 - Performance improvements and cleanups related mostly to sheaves
   refill (Hao Li, Shengming Hu, Vlastimil Babka)

 - Several fixups for the slabinfo tool (Xuewen Wang)

* tag 'slab-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm/slab: do not limit zeroing to orig_size when only red zoning is enabled
  mm/slub: preserve original size in _kmalloc_nolock_noprof retry path
  mm: simplify the mempool_alloc_bulk API
  mm/slab: improve kmem_cache_alloc_bulk
  mm/slub: detach and reattach partial slabs in batch
  mm/slub: introduce helpers for node partial slab state
  mm/slub: use empty sheaf helpers for oversized sheaves
  tools/mm/slabinfo: remove redundant slab->partial assignment
  tools/mm/slabinfo: remove dead assignment in get_obj_and_str()
  tools/mm/slabinfo: Fix trace disable logic inversion
  MAINTAINERS: add slab-related scripts and tools to SLAB ALLOCATOR
  mm/slub: fix typo in sheaves comment
  mm, slab: simplify returning slab in __refill_objects_node()
  mm, slab: add an optimistic __slab_try_return_freelist()
  slab: fix kernel-docs for mm-api
  slab: improve KMALLOC_PARTITION_RANDOM randomness
  slab: support for compiler-assisted type-based slab cache partitioning
  mm/slub: defer freelist construction until after bulk allocation from a new slab
2026-06-16 08:44:43 +05:30
Linus Torvalds 2cbf335f8c Merge tag 'sched-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "SMP load-balancing updates:

   - A large series to introduce infrastructure for cache-aware load
     balancing, with the goal of co-locating tasks that share data
     within the same Last Level Cache (LLC) domain. By improving cache
     locality, the scheduler can reduce cache bouncing and cache misses,
     ultimately improving data access efficiency.

     Implemented by Chen Yu and Tim Chen, based on early prototype work
     by Peter Zijlstra, with fixes by Jianyong Wu, Peter Zijlstra and
     Shrikanth Hegde.

   - A series to simplify CONFIG_SCHED_SMT ifdef usage (Shrikanth Hegde)

  Fair scheduler updates:

   - A series to improve SD_ASYM_CPUCAPACITY scheduling by introducing
     SMT awareness (Andrea Righi, K Prateek Nayak)

   - A series to optimize cfs_rq and sched_entity allocation for better
     data locality (Zecheng Li)

   - A preparatory series to change fair/cgroup scheduling to a single
     runqueue, without the final change (Peter Zijlstra)

   - Auto-manage ext/fair dl_server bandwidth (Andrea Righi)

   - Fix cpu_util runnable_avg arithmetic (Hongyan Xia)

   - Optimize update_tg_load_avg()'s rate-limiting code (Rik van Riel)

   - Allow account_cfs_rq_runtime() to throttle current hierarchy
     (K Prateek Nayak)

   - Update util_est after updating util_avg during dequeue, to fix the
     util signal update logic, which reduces signal noise (Vincent
     Guittot)

  Scheduler topology updates:

   - Allow multiple domains to claim sched_domain_shared (K Prateek
     Nayak)

   - Add parameter to split LLC (Peter Zijlstra)

  Core scheduler updates:

   - Use trace_call__<tp>() to save a static branch (Gabriele Monaco)

  Scheduler statistics updates:

   - Drop now-stale mul_u64_u64_div_u64() cputime over-approximation
     guard (Nicolas Pitre)

  Deadline scheduler updates:

   - Reject debugfs dl_server writes for offline CPUs (Andrea Righi)

   - Fix replenishment logic for non-deferred servers (Yuri Andriaccio)

  RT scheduling updates:

   - Turn RT_PUSH_IPI default off for non PREEMPT_RT (Steven Rostedt)

   - Update default bandwidth for real-time tasks to 1.0 (Yuri
     Andriaccio)

  Proxy scheduling updates:

   - A series to implement Optimized Donor Migration for Proxy Execution
     (John Stultz, Peter Zijlstra)

   - Various proxy scheduling cleanups and fixes (Peter Zijlstra,
     K Prateek Nayak)

  Misc fixes, improvements and cleanups by Aaron Lu, Andrea Righi,
  Zenghui Yu, Chen Yu, Guanyou.Chen, John Stultz, Shrikanth Hegde,
  Peter Zijlstra, Liang Luo and Yiyang Chen"

* tag 'sched-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (91 commits)
  sched/fair: Fix newidle vs core-sched
  sched/deadline: Use task_on_rq_migrating() helper
  sched/core: Combine separate 'else' and 'if' statements
  sched/fair: Fix cpu_util runnable_avg arithmetic
  sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime()
  sched/fair: Move the throttled tasks to a local list in tg_unthrottle_up()
  sched/fair: Call update_curr() before unthrottling the hierarchy
  sched/fair: Use throttled_csd_list for local unthrottle
  sched/fair: Convert cfs bandwidth throttling to use guards
  sched/fair: Allocate cfs_tg_state with percpu allocator
  sched/fair: Remove task_group->se pointer array
  sched/fair: Co-locate cfs_rq and sched_entity in cfs_tg_state
  sched: restore timer_slack_ns when resetting RT policy on fork
  MAINTAINERS: Fix spelling mistake in Peter's name
  sched: Simplify ttwu_runnable()
  sched/proxy: Remove superfluous clear_task_blocked_in()
  sched/proxy: Remove PROXY_WAKING
  sched/proxy: Switch proxy to use p->is_blocked
  sched/proxy: Only return migrate when needed
  sched: Be more strict about p->is_blocked
  ...
2026-06-15 14:50:18 +05:30
Linus Torvalds 764e77d868 Merge tag 'locking-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "Futex updates:

   - Optimize futex hash bucket access patterns (Peter Zijlstra)

   - Large series to address the robust futex unlock race for real, by
     Thomas Gleixner:

      "The robust futex unlock mechanism is racy in respect to the
       clearing of the robust_list_head::list_op_pending pointer because
       unlock and clearing the pointer are not atomic.

       The race window is between the unlock and clearing the pending op
       pointer. If the task is forced to exit in this window, exit will
       access a potentially invalid pending op pointer when cleaning up
       the robust list.

       That happens if another task manages to unmap the object
       containing the lock before the cleanup, which results in an UAF.

       In the worst case this UAF can lead to memory corruption when
       unrelated content has been mapped to the same address by the time
       the access happens.

       User space can't solve this problem without help from the kernel.
       This series provides the kernel side infrastructure to help it
       along:

        1) Combined unlock, pointer clearing, wake-up for the
           contended case

        2) VDSO based unlock and pointer clearing helpers with a
           fix-up function in the kernel when user space was interrupted
           within the critical section.

      ... with help by André Almeida:

        - Add a note about robust list race condition (André Almeida)
        - Add self-tests for robust release operations (André Almeida)

  Context analysis updates:

   - Implement context analysis for 'struct rt_mutex'. (Bart Van Assche)
   - Bump required Clang version to 23 (Marco Elver)

  Guard infrastructure updates:

   - Series to remove NULL check from unconditional guards (Dmitry
     Ilvokhin)

  Lockdep updates:

   - Restore self-test migrate_disable() and sched_rt_mutex state on
     PREEMPT_RT (Karl Mehltretter)

  Membarriers updates:

   - Use per-CPU mutexes for targeted commands (Aniket Gattani)
   - Modernize membarrier_global_expedited with cleanup guards (Aniket
     Gattani)
   - Add rseq stress test for CFS throttle interactions (Aniket Gattani)

  percpu-rwsems updates:

   - Extract __percpu_up_read() to optimize inlining overhead (Dmitry
     Ilvokhin)

  Seqlocks updates:

   - Allow UBSAN_ALIGNMENT to fail optimizing (Heiko Carstens)

  Lock tracing:

   - Add contended_release tracepoint to sleepable locks such as
     mutexes, percpu-rwsems, rtmutexes, rwsems and semaphores (Dmitry
     Ilvokhin)

  MAINTAINERS updates:

   - MAINTAINERS: Add RUST [SYNC] entry (Boqun Feng)

  Misc updates and fixes by Randy Dunlap, YE WEI-HONG, Fabricio Parra,
  Dmitry Ilvokhin and Peter Zijlstra"

* tag 'locking-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (36 commits)
  locking: Add contended_release tracepoint to sleepable locks
  locking/percpu-rwsem: Extract __percpu_up_read()
  tracing/lock: Remove unnecessary linux/sched.h include
  futex: Optimize futex hash bucket access patterns
  rust: sync: completion: Mark inline complete_all and wait_for_completion
  MAINTAINERS: Add RUST [SYNC] entry
  cleanup: Specify nonnull argument index
  selftests: futex: Add tests for robust release operations
  Documentation: futex: Add a note about robust list race condition
  x86/vdso: Implement __vdso_futex_robust_try_unlock()
  x86/vdso: Prepare for robust futex unlock support
  futex: Provide infrastructure to plug the non contended robust futex unlock race
  futex: Add robust futex unlock IP range
  futex: Add support for unlocking robust futexes
  futex: Cleanup UAPI defines
  x86: Select ARCH_MEMORY_ORDER_TSO
  uaccess: Provide unsafe_atomic_store_release_user()
  futex: Provide UABI defines for robust list entry modifiers
  futex: Move futex related mm_struct data into a struct
  futex: Make futex_mm_init() void
  ...
2026-06-15 14:21:14 +05:30
Linus Torvalds b079329b86 Merge tag 'rust-7.2' of gitolite.kernel.org:pub/scm/linux/kernel/git/ojeda/linux
Pull Rust updates from Miguel Ojeda:
 "This one is big due to the vendoring of the `zerocopy` library, which
  allows us to replace a bunch of `unsafe` code dealing with conversions
  between byte sequences and other types with safe alternatives. More
  details on that below (and in its merge commit).

  Toolchain and infrastructure:

   - Introduce support for the 'zerocopy' library [1][2]:

         Fast, safe, compile error. Pick two.

         Zerocopy makes zero-cost memory manipulation effortless. We write
         `unsafe` so you don't have to.

     It essentially provides derivable traits (e.g. 'FromBytes') and
     macros (e.g. 'transmute!') for safely converting between byte
     sequences and other types. Having such support allows us to remove
     some 'unsafe' code.

     It is among the most downloaded Rust crates and it is also used by
     the Rust compiler itself.

     It is licensed under "BSD-2-Clause OR Apache-2.0 OR MIT".

     The crates are imported essentially as-is (only +2/-3 lines needed
     to be adapted), plus SPDX identifiers. Upstream has since added the
     SPDX identifiers as well as one of the tweaks at my request, thus
     reducing our future diffs on updates -- I keep the details in one
     of our usual live lists [3].

     In total, it is about ~39k lines added, ~32k without counting
     'benches/' which are just for documentation purposes.

     The series includes a few Kbuild and rust-analyzer improvements and
     an example patch using it in Nova, removing one 'unsafe impl'.

     I checked that the codegen of an isolated example function (similar
     to the Nova patch on top) is essentially identical. It also turns
     out that (for that particular case) the 'zerocopy' version, even
     with 'debug-assertions' enabled, has no remaining panics, unlike a
     few in the current code (since the compiler can prove the remaining
     'ub_checks' statically).

     So their "fast, safe" does indeed check out -- at least in that
     case.

   - Support AutoFDO. This allows Rust code to be profiled and optimized
     based on the profile. Tested with Rust Binder: ~13% slower without
     AutoFDO in the binderAddInts benchmark (using an app-launch
     benchmark for the profile).

   - Support Software Tag-Based KASAN.

     In addition, fix KASAN Kconfig by requiring Clang.

   - Add Kconfig options for each existing Rust KUnit test suite, such
     as 'CONFIG_RUST_BITMAP_KUNIT_TEST'.

     They are placed within a new menu, 'CONFIG_RUST_KUNIT_TESTS', in
     the new 'rust/kernel/Kconfig.test' file.

   - Support the upcoming Rust 1.98.0 release (expected 2026-08-20):
     lint cleanups and an unstable flag rename.

   - Disable 'rustdoc' documentation inlining for all prelude items,
     which bloats the generated documentation.

   - Ignore (in Git) and clean (in Kbuild) the (rarely) 'rustc'-generated
     '*.long-type-*.txt' files.

  'kernel' crate:

   - Add new 'bitfield' module with the 'bitfield!' macro (extracted
     from the existing 'register!' one), which declares integer types
     that are split into distinct bit fields of arbitrary length.

     Each field is a 'Bounded' of the appropriate bit width (ensuring
     values are properly validated and avoiding implicit data loss) and
     gets several generated getters and setters (infallible, 'const' and
     fallible) as well as associated constants ('_MASK', '_SHIFT' and
     '_RANGE'). It also supports fields that can be converted from/to
     custom types, either fallibly ('?=>') or infallibly ('=>').

     For instance:

         bitfield! {
             struct Rgb(u16) {
                 15:11 blue;
                 10:5 green;
                 4:0 red;
             }
         }

         // Compile-time checks.
         let color = Rgb::zeroed().with_const_green::<0x1f>();

         assert_eq!(color.green(), 0x1f);
         assert_eq!(color.into_raw(), 0x1f << Rgb::GREEN_SHIFT);

     Add as well documentation and a test suite for it, as usual; and
     update the 'register!' macro to use it.

     It will be maintained by Alexandre Courbot (with Yury Norov as
     reviewer) under a new 'MAINTAINERS' entry: 'RUST [BITFIELD]'.

   - 'ptr' module: rework index projection syntax into keyworded syntax
     and introduce panicking variant.

     The keyword syntax ('build:', 'try:', 'panic:') is more explicit
     and paves the way of perhaps adding more flavors in the future,
     e.g. an 'unsafe' index projection.

     For instance, projections now look like this:

         fn f(p: *const [u8; 32]) -> Result {
             // Ok, within bounds, checked at build time.
             project!(p, [build: 1]);

             // Build error.
             project!(p, [build: 128]);

             // `OutOfBound` runtime error (convertible to `ERANGE`).
             project!(p, [try: 128]);

             // Runtime panic.
             project!(p, [panic: 128]);

             Ok(())
         }

     Update as well the users, which now look like e.g.

         // Pointer to the first entry of the GSP message queue.
         let data = project!(self.0.as_ptr(), .gspq.msgq.data[build: 0]);

   - 'build_assert' module: make the module the home of its macros
     instead of rendering them twice.

   - 'sync' module: add 'UniqueArc::as_ptr()' associated function.

   - 'alloc' module:

       - Fix the 'Vec::reserve()' doctest to properly account for the
         existing vector length in the capacity assertion.

       - Fix an incorrect operator in the 'Vec::extend_with()' 'SAFETY'
         comment; add a doc test demonstrating basic usage and the
         zero-length case.

   - Clean imports across several modules to follow the "kernel
     vertical" import style in order to minimize conflicts.

  'pin-init' crate:

   - User visible changes:

       - Do not generate 'non_snake_case' warnings for identifiers that
         are syntactically just users of a field name. This would allow
         all '#[allow(non_snake_case)]' in nova-core to be removed,
         which Gary will send to the nova tree next cycle.

       - Filter non-cfg attributes out properly in derived structs. This
         improves pin-init compatibility with other derive macros.

       - Insert projection types' where clause properly.

   - Other changes:

       - Bump MSRV to 1.82, plus associated cleanups.

       - Overhaul how init slots are projected. The new approach is
         easier to justify with safety comments.

       - Mark more functions as inline, which should help mitigate the
         super-long symbol name issue due to lack of inlining.

  rust-analyzer:

   - Support '--envs' for passing env vars for crates like 'zerocopy'.

  'MAINTAINERS':

   - Add the following reviewers to the 'RUST' entry:
       - Daniel Almeida
       - Tamir Duberstein
       - Alexandre Courbot
       - Onur Özkan

     They have been involved in the Rust for Linux project for about 7
     collective years and bring expertise across several domains, which
     will be very useful to have around in the future.

     Thanks everyone for stepping up!

  And some other fixes, cleanups and improvements"

Link: https://github.com/google/zerocopy [1]
Link: https://docs.rs/zerocopy [2]
Link: https://github.com/Rust-for-Linux/linux/issues/1239 [3]

* tag 'rust-7.2' of gitolite.kernel.org:pub/scm/linux/kernel/git/ojeda/linux: (86 commits)
  MAINTAINERS: add Onur Özkan as Rust reviewer
  MAINTAINERS: add Alexandre Courbot as Rust reviewer
  MAINTAINERS: add Tamir Duberstein as Rust reviewer
  MAINTAINERS: add Daniel Almeida as Rust reviewer
  kbuild: rust: clean `zerocopy-derive` in `mrproper`
  rust: make `build_assert` module the home of related macros
  rust: str: clean unused import for Rust >= 1.98
  rust: str: use the "kernel vertical" imports style
  rust: aref: use the "kernel vertical" imports style
  rust: page: use the "kernel vertical" imports style
  gpu: nova-core: firmware: parse `FalconUCodeDescV2` via `zerocopy`
  rust: prelude: add `zerocopy{,_derive}::FromBytes`
  rust: zerocopy-derive: enable support in kbuild
  rust: zerocopy-derive: add `README.md`
  rust: zerocopy-derive: avoid generating non-ASCII identifiers
  rust: zerocopy-derive: add SPDX License Identifiers
  rust: zerocopy-derive: import crate
  rust: zerocopy: enable support in kbuild
  rust: zerocopy: add `README.md`
  rust: zerocopy: remove float `Display` support
  ...
2026-06-15 09:25:48 +05:30
Linus Torvalds 73f399414a Merge tag 'kbuild-7.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux
Pull Kbuild / Kconfig updates from Nathan Chancellor:
 "Kbuild:

   - Remove broken module linking exclusion for BTF

   - Add documentation around how offset header files work

   - Include unstripped vDSO libraries in pacman packages

   - Bump minimum version of LLVM for building the kernel to 17.0.1 and
     clean up unnecessary workarounds

   - Use a context manager in run-clang-tools

   - Add dist macro value if present to release tag for RPM packages

   - Detect and report truncated buf_printf() output in modpost

   - Add __llvm_covfun and __llvm_covmap to section whitelist in modpost

   - Support Clang's distributed ThinLTO mode

   - Remove architecture specific configurations for AutoFDO and
     Propeller to ease individual architecture maintenance

  Kconfig:

   - Add kconfig-sym-check target to look for dangling Kconfig symbol
     references and invalid tristate literal values

   - Harden against potential NULL pointer dereference

   - Fix typo in Kconfig test comment"

* tag 'kbuild-7.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux: (31 commits)
  kconfig: tests: fix typo in comment
  kconfig: Remove the architecture specific config for Propeller
  kconfig: Remove the architecture specific config for AutoFDO
  modpost: Add __llvm_covfun and __llvm_covmap to section_white_list
  kconfig: add kconfig-sym-check static checker
  kbuild: Remove unnecessary 'T' modifier in cmd_ar_builtin_fixup
  kbuild: distributed build support for Clang ThinLTO
  kbuild: move vmlinux.a build rule to scripts/Makefile.vmlinux_a
  scripts: modpost: detect and report truncated buf_printf() output
  kbuild: rpm-pkg: append %{?dist} macro to Release tag
  run-clang-tools: run multiprocessing.Pool as context manager
  compiler-clang.h: Drop explicit version number from "all" diagnostic macro
  compiler-clang.h: Remove __cleanup -Wunused-variable workaround
  kbuild: Remove check for broken scoping with clang < 17 in CC_HAS_ASM_GOTO_OUTPUT
  x86/entry/vdso32: Remove conditional omission of '.cfi_offset eflags'
  x86/module: Revert "Deal with GOT based stack cookie load on Clang < 17"
  x86/build: Drop unnecessary '-ffreestanding' addition to KBUILD_CFLAGS
  scripts/Makefile.warn: Drop -Wformat handling for clang < 16
  riscv: Drop tautological condition from TOOLCHAIN_NEEDS_OLD_ISA_SPEC
  riscv: Remove tautological condition from selection of ARCH_SUPPORTS_CFI
  ...
2026-06-15 05:01:15 +05:30
Linus Torvalds 7e0e7bd60d Merge tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
 "Features:

   - Reduce pipe->mutex contention by pre-allocating pages outside the
     lock in anon_pipe_write().

     anon_pipe_write() called alloc_page() once per page while holding
     pipe->mutex. The allocation can sleep doing direct reclaim and runs
     memcg charging, which extends the critical section and stalls any
     concurrent reader on the same mutex. Now up to 8 pages are
     pre-allocated before the mutex is taken, leftovers are recycled
     into the per-pipe tmp_page[] cache before unlock, and any remainder
     is released after unlock, keeping the allocator out of the critical
     section on both sides. On a writers x readers sweep with 64KB
     writes against a 1 MB pipe throughput improves 6-28% and average
     write latency drops 5-22%; under memory pressure - when the cost of
     holding the mutex across reclaim is highest - throughput improves
     21-48% and latency drops 17-33%. The microbenchmark is added to
     selftests.

   - uaccess/sockptr: fix the ignored_trailing logic in
     copy_struct_to_user() to behave as documented and the usize check
     in copy_struct_from_sockptr() for user pointers, and add
     copy_struct_{from,to}_bounce_buffer() and copy_struct_to_sockptr()
     helpers for upcoming users (IPPROTO_SMBDIRECT, IPPROTO_QUIC).

   - bpf: add a sleepable bpf_real_inode() kfunc that resolves the real
     inode backing a dentry via d_real_inode(). On overlayfs the inode
     attached to the dentry doesn't carry the underlying device
     information; this is used by the filesystem restriction BPF program
     that was merged into systemd.

   - docs: add guidelines for submitting new filesystems, motivated by
     the maintenance burden abandoned and untestable filesystems impose
     on VFS developers, blocking infrastructure work like folio
     conversions and iomap migration.

  Fixes:

   - libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
     and drop the now-redundant assignments in callers. This began as a
     one-line dma-buf fix for a path_noexec() warning; a pseudo
     filesystem has no reason not to set SB_I_NOEXEC. All init_pseudo()
     callers were audited: the only visible effect is on dma-buf where
     SB_I_NOEXEC silences the warning.

   - Handle set_blocksize() failures in legacy filesystems (bfs, hpfs,
     qnx4, jfs, befs, affs, isofs, minix, ntfs3, omfs). Mounting a
     device with a sector size > PAGE_SIZE crashed roughly half of them;
     the rest had the same missing error handling pattern. Plus a
     follow-up releasing the superblock buffer_head when setting the
     minix v3 block size fails.

   - mount: honour SB_NOUSER in the new mount API.

   - fs/fcntl: fix a SOFTIRQ-unsafe lock order in fasync signaling by
     switching the process-group paths of send_sigio() and send_sigurg()
     from read_lock(&tasklist_lock) to RCU, matching the single-PID
     path.

   - vfs: add an FS_USERNS_DELEGATABLE flag and set it for NFS, fixing
     delegated NFS mounts (fsopen() in a container with the mount
     performed by a privileged daemon) that broke when non-init
     s_user_ns was tied to FS_USERNS_MOUNT.

   - selftests/namespaces: fix a hang in nsid_test where an unreaped
     grandchild kept the TAP pipe write-end open, a waitpid(-1) race in
     listns_efault_test, and a false FAIL on kernels without listns()
     where the tests should SKIP.

   - filelock: fix the break_lease() stub signature for
     CONFIG_FILE_LOCKING=n.

   - init/initramfs_test: wait for the async initramfs unpacking before
     running; the test and do_populate_rootfs() share the parser state.

   - fs/coredump: reduce redundant log noise in
     validate_coredump_safety().

   - iomap: pass the correct length to fserror_report_io() in
     __iomap_write_begin().

   - backing-file: fix the backing_file_open() kerneldoc.

  Cleanups:

   - initramfs: refactor the cpio hex header parsing to use hex2bin()
     instead of the hand-rolled simple_strntoul() which is reverted, and
     extend the initramfs KUnit tests to cover header fields with 0x
     prefixes.

   - Replace __get_free_pages() and friends with kmalloc()/kzalloc()
     across quota, proc, ocfs2/dlm, nilfs2, nfs, nfsd, libfs, jfs, jbd2,
     isofs, fuse, select, namespace, configfs, binfmt_misc, bfs, and the
     do_mounts init code - part of the larger work of replacing page
     allocator calls with kmalloc().

   - Use clear_and_wake_up_bit() in unlock_buffer() and
     journal_end_buffer_io_sync() instead of open-coding the sequence.

   - Drop unused VFS exports: unexport drop_super_exclusive(), remove
     start_removing_user_path_at(), and fold __start_removing_path()
     into start_removing_path().

   - fs/read_write: narrow the __kernel_write() export with
     EXPORT_SYMBOL_FOR_MODULES().

   - vfs: uapi: retire octal and hex constants in favor of (1 << n) for
     the O_ flags. Finding a free bit for a new flag across the
     architectures was needlessly hard with the mixed bases.

   - dcache: add extra sanity checks of dead dentries in dentry_free()
     via a new DENTRY_WARN_ONCE() that also prints d_flags.

   - iov_iter: use kmemdup_array() in dup_iter() to harden the
     allocation against multiplication overflow.

   - fs/pipe: write to ->poll_usage only once.

   - vfs: remove an always-taken if-branch in find_next_fd().

   - dcache: use kmalloc_flex() for struct external_name in __d_alloc().

   - namei: use QSTR() instead of QSTR_INIT() in path_pts().

   - sync_file_range: delete dead S_ISLNK code.

   - Comment fixes: retire a stale comment in fget_task_next() and fix
     assorted spelling mistakes"

* tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (73 commits)
  backing-file: fix backing_file_open() kerneldoc parameter
  iomap: pass the correct len to fserror_report_io in __iomap_write_begin
  vfs: add FS_USERNS_DELEGATABLE flag and set it for NFS
  filelock: fix break_lease() stub signature for CONFIG_FILE_LOCKING=n
  vfs: uapi: retire octal and hex numbers in favor of (1 << n) for O_ flags
  bpf: add bpf_real_inode() kfunc
  fs/read_write: Do not export __kernel_write() to the entire world
  libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callers
  libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
  mount: honour SB_NOUSER in the new mount API
  fs/fcntl: fix SOFTIRQ-unsafe lock order in fasync signaling
  selftests/pipe: add pipe_bench microbenchmark
  fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  fs: retire stale comment in fget_task_next()
  fs: fix spelling mistakes in comment
  bfs: replace get_zeroed_page() with kzalloc()
  binfmt_misc: replace __get_free_page() with kmalloc()
  configfs: replace __get_free_pages() with kzalloc()
  fs/namespace: use __getname() to allocate mntpath buffer
  fs/select: replace __get_free_page() with kmalloc()
  ...
2026-06-15 03:59:45 +05:30
Yury Norov e74b7a3f5a rust: tests: add Kconfig for KUnit test
There are 6 individual Rust KUnit test suites (plus the doctests one). All
the tests are compiled unconditionally now, which adds ~200 kB to the
kernel image for me on x86_64. As Rust matures, this bloating will
inevitably grow.

Add Kconfig.test which includes a RUST_KUNIT_TESTS menu, and all
individual tests under it.

As usual, new tests are all enabled if KUNIT_ALL_TESTS=y.

Suggested-by: Alice Ryhl <aliceryhl@google.com>
Signed-off-by: Yury Norov <ynorov@nvidia.com>
Reviewed-by: David Gow <david@davidgow.net>
Acked-by: Gary Guo <gary@garyguo.net>
Link: https://patch.msgid.link/20260417031531.315281-3-ynorov@nvidia.com
[ Fixed capitalization. Used singular for "API" for consistency.
  Reworded to clarify these are suites and that there exists
  the doctests one (which is the biggest at the moment by
  far). - Miguel ]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2026-06-08 02:30:33 +02:00
Thomas Gleixner 042df0c1d4 futex: Add robust futex unlock IP range
There will be a VDSO function to unlock robust futexes in user space. The
unlock sequence is racy vs. clearing the list_pending_op pointer in the
tasks robust list head. To plug this race the kernel needs to know the
instruction window. As the VDSO is per MM the addresses are stored in
mm_struct::futex.

Architectures which implement support for this have to update these
addresses when the VDSO is (re)mapped and indicate the pending op pointer
size which is matching the IP.

Arguably this could be resolved by chasing mm->context->vdso->image, but
that's architecture specific and requires to touch quite some cache
lines. Having it in mm::futex reduces the cache line impact and avoids
having yet another set of architecture specific functionality.

To support multi size robust list applications (gaming) this provides two
ranges when COMPAT is enabled.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
Link: https://patch.msgid.link/20260602090535.718926819@kernel.org
2026-06-03 11:38:51 +02:00
Peter Zijlstra 1628b25248 sched: Add blocked_donor link to task for smarter mutex handoffs
Add link to the task this task is proxying for, and use it so
the mutex owner can do an intelligent hand-off of the mutex to
the task that the owner is running on behalf.

[jstultz: This patch was split out from larger proxy patch]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260512025635.2840817-8-jstultz@google.com
2026-06-02 12:26:07 +02:00
Nathan Chancellor f3de78cb19 kbuild: Remove check for broken scoping with clang < 17 in CC_HAS_ASM_GOTO_OUTPUT
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the check added to CC_HAS_ASM_GOTO_OUTPUT by
commit e2ffa15b9b ("kbuild: Disable CC_HAS_ASM_GOTO_OUTPUT on clang <
17") can be removed, as the issue it detects is guaranteed to be fixed.

Acked-by: Nicolas Schier <nsc@kernel.org>
Link: https://patch.msgid.link/20260517-bump-minimum-supported-llvm-version-to-17-v2-14-b3b8cda46bdd@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
2026-05-27 15:20:06 -07:00
Jia He ec3f4e0443 init/initramfs_test: wait_for_initramfs() before running
initramfs_test_extract() and friends call unpack_to_rootfs() from a
kunit kthread while do_populate_rootfs() may still be running
asynchronously from rootfs_initcall. unpack_to_rootfs() keeps its
parser state in module-static variables (victim, byte_count, state,
this_header, header_buf, name_buf, ...), so the two writers corrupt
each other.

On arm64 v7.0-rc5+ this oopses early in boot:

  Unable to handle kernel paging request at virtual address ffff80018f9f0ffc
  pc : do_reset+0x3c/0x98
  Call trace:
   do_reset
   initramfs_test_extract
   kunit_try_run_case
  Initramfs unpacking failed: junk within compressed archive

do_reset() faults because 'victim' was overwritten by the boot-time
unpacker; the boot unpacker meanwhile logs the bogus "junk within
compressed archive" on the real initrd because the test wrecked its
state machine.

Add a .suite_init callback that calls wait_for_initramfs() so the async
unpack is quiescent before the first case runs. suite_init runs once per
suite rather than before every individual test case.

Fixes: 83c0b27266 ("initramfs_test: kunit tests for initramfs unpacking")
Signed-off-by: Jia He <justin.he@arm.com>
Link: https://patch.msgid.link/20260519093937.1064628-1-justin.he@arm.com
Reviewed-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-05-27 15:11:02 +02:00
Alice Ryhl 72d33b8bfe rust: kasan: add support for Software Tag-Based KASAN
This adds support for Software Tag-Based KASAN (KASAN_SW_TAGS) when
CONFIG_RUST is enabled. This requires that rustc includes support for
the kernel-hwaddress sanitizer, which is available since 1.96.0 [1].

Unlike with clang, we need to pass -Zsanitizer-recover in addition to
-Zsanitizer because the option is not implied automatically.

The kasan makefile uses different names for the flags depending on
whether CC is clang or gcc, but as we require that CC is clang when
using KASAN, we do not need to try to handle mixed gcc/llvm builds when
Rust is enabled.

Link: https://github.com/rust-lang/rust/pull/153049 [1]
Reviewed-by: Danilo Krummrich <dakr@kernel.org>
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Gary Guo <gary@garyguo.net>
Link: https://patch.msgid.link/20260408-kasan-rust-sw-tags-v3-2-e07964d14363@google.com
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2026-05-27 01:54:22 +02:00
Alice Ryhl 5b271543d0 rust: kasan: KASAN+RUST requires clang
Kernel KASAN involves passing various llvm/gcc specific arguments to
the C and Rust compiler. Since these arguments differ between llvm and
gcc, it's not safe to mix an llvm-based rustc with a gcc build when
kasan is enabled.

Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Gary Guo <gary@garyguo.net>
Cc: stable@vger.kernel.org
Fixes: e3117404b4 ("kbuild: rust: Enable KASAN support")
Link: https://patch.msgid.link/20260408-kasan-rust-sw-tags-v3-1-e07964d14363@google.com
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2026-05-27 01:54:22 +02:00
Christian Brauner (Amutable) 6b1c66c9cc exec_state: relocate dumpable information
The dumpable flag captured at execve() is consulted by
__ptrace_may_access() and several /proc owner / visibility checks.
It lives on mm_struct today, which exit_mm() clears from the task
long before the task itself is reaped.

exec_state is anchored to the execve() that established the current
privilege domain.  CLONE_VM siblings refcount-share the parent's
exec_state via copy_exec_state(); non-CLONE_VM clones allocate a
fresh exec_state inheriting the parent's dumpable mode and user_ns
reference via task_exec_state_copy().  execve() allocates a fresh
instance (via alloc_task_exec_state() in begin_new_exec()) and
installs it under task_lock + exec_update_lock with
task_exec_state_replace().  init_task uses a static instance.

The dumpable mode now lives on task->exec_state->dumpable.
task->mm->flags no longer carries dumpability; MMF_DUMPABLE_MASK is
removed, but MMF_DUMPABLE_BITS is reserved so MMF_DUMP_FILTER_* bit
positions remain stable for the /proc/<pid>/coredump_filter ABI. The
task->user_dumpable cache bit and its assignment in exit_mm() are
removed; readers go through get_dumpable(task) directly.

coredump_params gains a snapshot field cprm.dumpable, populated from
get_dumpable(current) at vfs_coredump() entry, replacing the previous
__get_dumpable(cprm->mm_flags) consumers in fs/coredump.c and
fs/pidfs.c.

The user namespace recorded at execve() is consulted by
__ptrace_may_access() and by /proc/PID/* owner derivation. Move the
captured user_ns onto task_exec_state, which stays attached to the task
past exit_mm() and across exit_files().

bprm grows a user_ns field staged in bprm_mm_init() with the caller's
user_ns, narrowed by would_dump() to the closest privileged ancestor,
and consumed by exec_mmap() via alloc_task_exec_state(bprm->user_ns).
free_bprm() releases the staging reference.

mm_struct loses ->user_ns entirely.  Initializers in init-mm, efi_mm,
and the implicit one in mm_init()/dup_mm()/mm_alloc() are removed;
__mmdrop() drops the matching put_user_ns(). The kthread_use_mm()
WARN_ON_ONCE(!mm->user_ns) is no longer meaningful and goes too.

Reviewed-by: Jann Horn <jannh@google.com>
Link: https://patch.msgid.link/20260520-work-task_exec_state-v3-4-69f895bc1385@kernel.org
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-05-26 11:02:01 +02:00
Mike Rapoport (Microsoft) 3fb2d124b6 init: do_mounts: use kmalloc() for allocations of temporary buffers
Several places in init/do_mounts.c allocate temporary buffers for
filesystem names or options using __get_free_page() or alloc_page().

Usage of alloc_page() APIs is not required there and only creates
unnecessary noise with castings or conversion from struct page to void *.

kmalloc() is a better API for these uses and it also provides better
scalability and more debugging possibilities.

Replace use of __get_free_page() and alloc_page() with kmalloc().

While on it, add a check for -ENOMEM condition in mount_root_generic().

Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Link: https://patch.msgid.link/20260520-init-v1-1-aaf2ebac5ad9@kernel.org
Reviewed-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-05-22 12:12:20 +02:00
Andy Shevchenko ec03d259f6 initramfs: Refactor to use hex2bin() instead of custom approach
There is a simple_strntoul() function used solely as a shortcut
for hex2bin() with proper endianess conversions. Replace that
and drop the unneeded function in the next changes.

This implementation will abort if we fail to parse the cpio header,
instead of using potentially bogus header values.

Co-developed-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20260331070519.5974-5-ddiss@suse.de
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-21 09:32:46 +02:00
Andy Shevchenko a4d6170e86 initramfs: Sort headers alphabetically
Sorting headers alphabetically helps locating duplicates, and makes it
easier to figure out where to insert new headers.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20260331070519.5974-4-ddiss@suse.de
Reviewed-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-21 09:32:46 +02:00
David Disseldorp 19868f7034 initramfs_test: test header fields with 0x hex prefix
cpio header fields are 8-byte hex strings, but one "interesting"
side-effect of our historic simple_str[n]toul() use means that a "0x"
(or "0X") prefixed header field will be successfully processed when
coupled alongside a 6-byte hex remainder string.

"0x" prefix support is contrary to the initramfs specification at
Documentation/driver-api/early-userspace/buffer-format.rst which states:

  The structure of the cpio_header is as follows (all fields contain
  hexadecimal ASCII numbers fully padded with '0' on the left to the
  full width of the field, for example, the integer 4780 is represented
  by the ASCII string "000012ac"):

Test for this corner case by injecting "0x" prefixes into the uid, gid
and namesize cpio header fields. Confirm that init_stat() returns
matching uid and gid values.

This test can be modified in future to expect unpack_to_rootfs() failure
when header validation is changed to properly follow the specification.

Add some missing struct kstat initializations to account for possible
init_stat() failures.

Signed-off-by: David Disseldorp <ddiss@suse.de>
Link: https://patch.msgid.link/20260331070519.5974-3-ddiss@suse.de
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-21 09:32:46 +02:00
David Disseldorp 8d14fe78cb initramfs_test: add fill_cpio() inject_ox parameter
fill_cpio() uses sprintf() to write out the in-memory cpio archive from
an array of struct initramfs_test_cpio. This change allows callers to
modify the cpio sprintf() format string so that future tests can
intentionally corrupt the header with "0x" and "0X" prefixed fields.

Signed-off-by: David Disseldorp <ddiss@suse.de>
Link: https://patch.msgid.link/20260331070519.5974-2-ddiss@suse.de
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-21 09:32:46 +02:00
Peter Zijlstra a26d9208c1 Merge branch 'sched/cache'
Merge the cache aware balancer topic branch.

# Conflicts:
#	kernel/sched/topology.c
2026-05-19 12:18:01 +02:00
Chen Yu 03755348b8 sched/cache: Fix unpaired account_llc_enqueue/dequeue
There is a race condition that, after a task is enqueued
on a runqueue, task_llc(p) may change due to CPU hotplug,
because the llc_id is dynamically allocated and adjusted
at runtime.
Therefore, checking task_llc(p) to determine whether the
task is being dequeued from its preferred LLC is unreliable
and can cause inconsistent values.

To fix this problem, record whether p is enqueued on its
preferred LLC, in order to pair with account_llc_dequeue()
to maintain a consistent nr_pref_llc_running per runqueue.

This bug was reported by sashiko, and the solution was once
suggested by Prateek.

Fixes: 46afe3af7e ("sched/cache: Track LLC-preferred tasks per runqueue")
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/0c8c6a1571d66792a4d2ff0103ba3cc13e059046.1778703694.git.tim.c.chen@linux.intel.com
2026-05-18 21:33:17 +02:00
Marco Elver feb662d916 slab: support for compiler-assisted type-based slab cache partitioning
Rework the general infrastructure around RANDOM_KMALLOC_CACHES into more
flexible KMALLOC_PARTITION_CACHES, with the former being a partitioning
mode of the latter.

Introduce a new mode, KMALLOC_PARTITION_TYPED, which leverages a feature
available in Clang 22 and later, called "allocation tokens" via
__builtin_infer_alloc_token() [1]. Unlike KMALLOC_PARTITION_RANDOM
(formerly RANDOM_KMALLOC_CACHES), this mode deterministically assigns a
slab cache to an allocation of type T, regardless of allocation site.

The builtin __builtin_infer_alloc_token(<malloc-args>, ...) instructs
the compiler to infer an allocation type from arguments commonly passed
to memory-allocating functions and returns a type-derived token ID. The
implementation passes kmalloc-args to the builtin: the compiler performs
best-effort type inference, and then recognizes common patterns such as
`kmalloc(sizeof(T), ...)`, `kmalloc(sizeof(T) * n, ...)`, but also
`(T *)kmalloc(...)`. Where the compiler fails to infer a type the
fallback token (default: 0) is chosen.

Note: kmalloc_obj(..) APIs fix the pattern how size and result type are
expressed, and therefore ensures there's not much drift in which
patterns the compiler needs to recognize. Specifically, kmalloc_obj()
and friends expand to `(TYPE *)KMALLOC(__obj_size, GFP)`, which the
compiler recognizes via the cast to TYPE*.

Clang's default token ID calculation is described as [1]:

   typehashpointersplit: This mode assigns a token ID based on the hash
   of the allocated type's name, where the top half ID-space is reserved
   for types that contain pointers and the bottom half for types that do
   not contain pointers.

Separating pointer-containing objects from pointerless objects and data
allocations can help mitigate certain classes of memory corruption
exploits [2]: attackers who gains a buffer overflow on a primitive
buffer cannot use it to directly corrupt pointers or other critical
metadata in an object residing in a different, isolated heap region.

It is important to note that heap isolation strategies offer a
best-effort approach, and do not provide a 100% security guarantee,
albeit achievable at relatively low performance cost. Note that this
also does not prevent cross-cache attacks: while waiting for future
features like SLAB_VIRTUAL [3] to provide physical page isolation, this
feature should be deployed alongside SHUFFLE_PAGE_ALLOCATOR and
init_on_free=1 to mitigate cross-cache attacks and page-reuse attacks as
much as possible today.

With all that, my kernel (x86 defconfig) shows me a histogram of slab
cache object distribution per /proc/slabinfo (after boot):

  <slab cache>      <objs> <hist>
  kmalloc-part-15    1465  ++++++++++++++
  kmalloc-part-14    2988  +++++++++++++++++++++++++++++
  kmalloc-part-13    1656  ++++++++++++++++
  kmalloc-part-12    1045  ++++++++++
  kmalloc-part-11    1697  ++++++++++++++++
  kmalloc-part-10    1489  ++++++++++++++
  kmalloc-part-09     965  +++++++++
  kmalloc-part-08     710  +++++++
  kmalloc-part-07     100  +
  kmalloc-part-06     217  ++
  kmalloc-part-05     105  +
  kmalloc-part-04    4047  ++++++++++++++++++++++++++++++++++++++++
  kmalloc-part-03     183  +
  kmalloc-part-02     283  ++
  kmalloc-part-01     316  +++
  kmalloc            1422  ++++++++++++++

The above /proc/slabinfo snapshot shows me there are 6673 allocated
objects (slabs 00 - 07) that the compiler claims contain no pointers or
it was unable to infer the type of, and 12015 objects that contain
pointers (slabs 08 - 15). On a whole, this looks relatively sane.

Additionally, when I compile my kernel with -Rpass=alloc-token, which
provides diagnostics where (after dead-code elimination) type inference
failed, I see 186 allocation sites where the compiler failed to identify
a type (down from 966 when I sent the RFC [4]). Some initial review
confirms these are mostly variable sized buffers, but also include
structs with trailing flexible length arrays.

Link: https://clang.llvm.org/docs/AllocToken.html [1]
Link: https://blog.dfsec.com/ios/2025/05/30/blasting-past-ios-18/ [2]
Link: https://lwn.net/Articles/944647/ [3]
Link: https://lore.kernel.org/all/20250825154505.1558444-1-elver@google.com/ [4]
Link: https://discourse.llvm.org/t/rfc-a-framework-for-allocator-partitioning-hints/87434
Acked-by: GONG Ruiqi <gongruiqi1@huawei.com>
Co-developed-by: Harry Yoo (Oracle) <harry@kernel.org>
Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Link: https://patch.msgid.link/20260511200136.3201646-1-elver@google.com
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-05-14 10:44:09 +02:00
Breno Leitao 5a643e4623 bootconfig: move xbc_snprint_cmdline() to lib/bootconfig.c
Move xbc_snprint_cmdline() from init/main.c to lib/bootconfig.c so the
function (and its xbc_namebuf scratch buffer) becomes part of the shared
parser library. tools/bootconfig already compiles lib/bootconfig.c
directly, which lets a follow-up patch reuse the same renderer in the
userspace tool to convert a bootconfig file into a flat cmdline string
at build time.

No functional change.

Link: https://lore.kernel.org/all/20260508-bootconfig_using_tools-v1-1-1132219aa773@debian.org/

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-05-12 09:44:31 +09:00
Linus Torvalds 9055c64567 Merge tag 'memblock-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock
Pull memblock updates from Mike Rapoport:

 - improve debuggability of reserve_mem kernel parameter handling with
   print outs in case of a failure and debugfs info showing what was
   actually reserved

 - Make memblock_free_late() and free_reserved_area() use the same core
   logic for freeing the memory to buddy and ensure it takes care of
   updating memblock arrays when ARCH_KEEP_MEMBLOCK is enabled.

* tag 'memblock-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  x86/alternative: delay freeing of smp_locks section
  memblock: warn when freeing reserved memory before memory map is initialized
  memblock, treewide: make memblock_free() handle late freeing
  memblock: make free_reserved_area() update memblock if ARCH_KEEP_MEMBLOCK=y
  memblock: extract page freeing from free_reserved_area() into a helper
  memblock: make free_reserved_area() more robust
  mm: move free_reserved_area() to mm/memblock.c
  powerpc: opal-core: pair alloc_pages_exact() with free_pages_exact()
  powerpc: fadump: pair alloc_pages_exact() with free_pages_exact()
  memblock: reserve_mem: fix end caclulation in reserve_mem_release_by_name()
  memblock: move reserve_bootmem_range() to memblock.c and make it static
  memblock: Add reserve_mem debugfs info
  memblock: Print out errors on reserve_mem parser
2026-04-18 11:29:14 -07:00
Linus Torvalds 334fbe734e Merge tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:

 - "maple_tree: Replace big node with maple copy" (Liam Howlett)

   Mainly prepararatory work for ongoing development but it does reduce
   stack usage and is an improvement.

 - "mm, swap: swap table phase III: remove swap_map" (Kairui Song)

   Offers memory savings by removing the static swap_map. It also yields
   some CPU savings and implements several cleanups.

 - "mm: memfd_luo: preserve file seals" (Pratyush Yadav)

   File seal preservation to LUO's memfd code

 - "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan
   Chen)

   Additional userspace stats reportng to zswap

 - "arch, mm: consolidate empty_zero_page" (Mike Rapoport)

   Some cleanups for our handling of ZERO_PAGE() and zero_pfn

 - "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu
   Han)

   A robustness improvement and some cleanups in the kmemleak code

 - "Improve khugepaged scan logic" (Vernon Yang)

   Improve khugepaged scan logic and reduce CPU consumption by
   prioritizing scanning tasks that access memory frequently

 - "Make KHO Stateless" (Jason Miu)

   Simplify Kexec Handover by transitioning KHO from an xarray-based
   metadata tracking system with serialization to a radix tree data
   structure that can be passed directly to the next kernel

 - "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas
   Ballasi and Steven Rostedt)

   Enhance vmscan's tracepointing

 - "mm: arch/shstk: Common shadow stack mapping helper and
   VM_NOHUGEPAGE" (Catalin Marinas)

   Cleanup for the shadow stack code: remove per-arch code in favour of
   a generic implementation

 - "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin)

   Fix a WARN() which can be emitted the KHO restores a vmalloc area

 - "mm: Remove stray references to pagevec" (Tal Zussman)

   Several cleanups, mainly udpating references to "struct pagevec",
   which became folio_batch three years ago

 - "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl
   Shutsemau)

   Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail
   pages encode their relationship to the head page

 - "mm/damon/core: improve DAMOS quota efficiency for core layer
   filters" (SeongJae Park)

   Improve two problematic behaviors of DAMOS that makes it less
   efficient when core layer filters are used

 - "mm/damon: strictly respect min_nr_regions" (SeongJae Park)

   Improve DAMON usability by extending the treatment of the
   min_nr_regions user-settable parameter

 - "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka)

   The proper fix for a previously hotfixed SMP=n issue. Code
   simplifications and cleanups ensued

 - "mm: cleanups around unmapping / zapping" (David Hildenbrand)

   A bunch of cleanups around unmapping and zapping. Mostly
   simplifications, code movements, documentation and renaming of
   zapping functions

 - "support batched checking of the young flag for MGLRU" (Baolin Wang)

   Batched checking of the young flag for MGLRU. It's part cleanups; one
   benchmark shows large performance benefits for arm64

 - "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner)

   memcg cleanup and robustness improvements

 - "Allow order zero pages in page reporting" (Yuvraj Sakshith)

   Enhance free page reporting - it is presently and undesirably order-0
   pages when reporting free memory.

 - "mm: vma flag tweaks" (Lorenzo Stoakes)

   Cleanup work following from the recent conversion of the VMA flags to
   a bitmap

 - "mm/damon: add optional debugging-purpose sanity checks" (SeongJae
   Park)

   Add some more developer-facing debug checks into DAMON core

 - "mm/damon: test and document power-of-2 min_region_sz requirement"
   (SeongJae Park)

   An additional DAMON kunit test and makes some adjustments to the
   addr_unit parameter handling

 - "mm/damon/core: make passed_sample_intervals comparisons
   overflow-safe" (SeongJae Park)

   Fix a hard-to-hit time overflow issue in DAMON core

 - "mm/damon: improve/fixup/update ratio calculation, test and
   documentation" (SeongJae Park)

   A batch of misc/minor improvements and fixups for DAMON

 - "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c" (David
   Hildenbrand)

   Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some code
   movement was required.

 - "zram: recompression cleanups and tweaks" (Sergey Senozhatsky)

   A somewhat random mix of fixups, recompression cleanups and
   improvements in the zram code

 - "mm/damon: support multiple goal-based quota tuning algorithms"
   (SeongJae Park)

   Extend DAMOS quotas goal auto-tuning to support multiple tuning
   algorithms that users can select

 - "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao)

   Fix the khugpaged sysfs handling so we no longer spam the logs with
   reams of junk when starting/stopping khugepaged

 - "mm: improve map count checks" (Lorenzo Stoakes)

   Provide some cleanups and slight fixes in the mremap, mmap and vma
   code

 - "mm/damon: support addr_unit on default monitoring targets for
   modules" (SeongJae Park)

   Extend the use of DAMON core's addr_unit tunable

 - "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache)

   Cleanups to khugepaged and is a base for Nico's planned khugepaged
   mTHP support

 - "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand)

   Code movement and cleanups in the memhotplug and sparsemem code

 - "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup
   CONFIG_MIGRATION" (David Hildenbrand)

   Rationalize some memhotplug Kconfig support

 - "change young flag check functions to return bool" (Baolin Wang)

   Cleanups to change all young flag check functions to return bool

 - "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh
   Law and SeongJae Park)

   Fix a few potential DAMON bugs

 - "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo
   Stoakes)

   Convert a lot of the existing use of the legacy vm_flags_t data type
   to the new vma_flags_t type which replaces it. Mainly in the vma
   code.

 - "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes)

   Expand the mmap_prepare functionality, which is intended to replace
   the deprecated f_op->mmap hook which has been the source of bugs and
   security issues for some time. Cleanups, documentation, extension of
   mmap_prepare into filesystem drivers

 - "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes)

   Simplify and clean up zap_huge_pmd(). Additional cleanups around
   vm_normal_folio_pmd() and the softleaf functionality are performed.

* tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm: fix deferred split queue races during migration
  mm/khugepaged: fix issue with tracking lock
  mm/huge_memory: add and use has_deposited_pgtable()
  mm/huge_memory: add and use normal_or_softleaf_folio_pmd()
  mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio()
  mm/huge_memory: separate out the folio part of zap_huge_pmd()
  mm/huge_memory: use mm instead of tlb->mm
  mm/huge_memory: remove unnecessary sanity checks
  mm/huge_memory: deduplicate zap deposited table call
  mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE()
  mm/huge_memory: add a common exit path to zap_huge_pmd()
  mm/huge_memory: handle buggy PMD entry in zap_huge_pmd()
  mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc
  mm/huge: avoid big else branch in zap_huge_pmd()
  mm/huge_memory: simplify vma_is_specal_huge()
  mm: on remap assert that input range within the proposed VMA
  mm: add mmap_action_map_kernel_pages[_full]()
  uio: replace deprecated mmap hook with mmap_prepare in uio_info
  drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare
  mm: allow handling of stacked mmap_prepare hooks in more drivers
  ...
2026-04-15 12:59:16 -07:00
Linus Torvalds 5bdb4078e1 Merge tag 'sched_ext-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:

 - cgroup sub-scheduler groundwork

   Multiple BPF schedulers can be attached to cgroups and the dispatch
   path is made hierarchical. This involves substantial restructuring of
   the core dispatch, bypass, watchdog, and dump paths to be
   per-scheduler, along with new infrastructure for scheduler ownership
   enforcement, lifecycle management, and cgroup subtree iteration

   The enqueue path is not yet updated and will follow in a later cycle

 - scx_bpf_dsq_reenq() generalized to support any DSQ including remote
   local DSQs and user DSQs

   Built on top of this, SCX_ENQ_IMMED guarantees that tasks dispatched
   to local DSQs either run immediately or get reenqueued back through
   ops.enqueue(), giving schedulers tighter control over queueing
   latency

   Also useful for opportunistic CPU sharing across sub-schedulers

 - ops.dequeue() was only invoked when the core knew a task was in BPF
   data structures, missing scheduling property change events and
   skipping callbacks for non-local DSQ dispatches from ops.select_cpu()

   Fixed to guarantee exactly one ops.dequeue() call when a task leaves
   BPF scheduler custody

 - Kfunc access validation moved from runtime to BPF verifier time,
   removing runtime mask enforcement

 - Idle SMT sibling prioritization in the idle CPU selection path

 - Documentation, selftest, and tooling updates. Misc bug fixes and
   cleanups

* tag 'sched_ext-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (134 commits)
  tools/sched_ext: Add explicit cast from void* in RESIZE_ARRAY()
  sched_ext: Make string params of __ENUM_set() const
  tools/sched_ext: Kick home CPU for stranded tasks in scx_qmap
  sched_ext: Drop spurious warning on kick during scheduler disable
  sched_ext: Warn on task-based SCX op recursion
  sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok()
  sched_ext: Remove runtime kfunc mask enforcement
  sched_ext: Add verifier-time kfunc context filter
  sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup()
  sched_ext: Decouple kfunc unlocked-context check from kf_mask
  sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking
  sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask
  sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked
  sched_ext: Drop TRACING access to select_cpu kfuncs
  selftests/sched_ext: Fix wrong DSQ ID in peek_dsq error message
  sched_ext: Documentation: improve accuracy of task lifecycle pseudo-code
  selftests/sched_ext: Improve runner error reporting for invalid arguments
  sched_ext: Documentation: Fix scx_bpf_move_to_local kfunc name
  sched_ext: Documentation: Add ops.dequeue() to task lifecycle
  tools/sched_ext: Fix off-by-one in scx_sdt payload zeroing
  ...
2026-04-15 10:54:24 -07:00
Linus Torvalds 1c3b68f0d5 Merge tag 'sched-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "Fair scheduling updates:
   - Skip SCHED_IDLE rq for SCHED_IDLE tasks (Christian Loehle)
   - Remove superfluous rcu_read_lock() in the wakeup path (K Prateek Nayak)
   - Simplify the entry condition for update_idle_cpu_scan() (K Prateek Nayak)
   - Simplify SIS_UTIL handling in select_idle_cpu() (K Prateek Nayak)
   - Avoid overflow in enqueue_entity() (K Prateek Nayak)
   - Update overutilized detection (Vincent Guittot)
   - Prevent negative lag increase during delayed dequeue (Vincent Guittot)
   - Clear buddies for preempt_short (Vincent Guittot)
   - Implement more complex proportional newidle balance (Peter Zijlstra)
   - Increase weight bits for avg_vruntime (Peter Zijlstra)
   - Use full weight to __calc_delta() (Peter Zijlstra)

  RT and DL scheduling updates:
   - Fix incorrect schedstats for rt and dl thread (Dengjun Su)
   - Skip group schedulable check with rt_group_sched=0 (Michal Koutný)
   - Move group schedulability check to sched_rt_global_validate()
     (Michal Koutný)
   - Add reporting of runtime left & abs deadline to sched_getattr()
     for DEADLINE tasks (Tommaso Cucinotta)

  Scheduling topology updates by K Prateek Nayak:
   - Compute sd_weight considering cpuset partitions
   - Extract "imb_numa_nr" calculation into a separate helper
   - Allocate per-CPU sched_domain_shared in s_data
   - Switch to assigning "sd->shared" from s_data
   - Remove sched_domain_shared allocation with sd_data

  Energy-aware scheduling updates:
   - Filter false overloaded_group case for EAS (Vincent Guittot)
   - PM: EM: Switch to rcu_dereference_all() in wakeup path
     (Dietmar Eggemann)

  Infrastructure updates:
   - Replace use of system_unbound_wq with system_dfl_wq (Marco Crivellari)

  Proxy scheduling updates by John Stultz:
   - Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
   - Minimise repeated sched_proxy_exec() checking
   - Fix potentially missing balancing with Proxy Exec
   - Fix and improve task::blocked_on et al handling
   - Add assert_balance_callbacks_empty() helper
   - Add logic to zap balancing callbacks if we pick again
   - Move attach_one_task() and attach_task() helpers to sched.h
   - Handle blocked-waiter migration (and return migration)
   - Add K Prateek Nayak to scheduler reviewers for proxy execution

  Misc cleanups and fixes by John Stultz, Joseph Salisbury, Peter
  Zijlstra, K Prateek Nayak, Michal Koutný, Randy Dunlap, Shrikanth
  Hegde, Vincent Guittot, Zhan Xusheng, Xie Yuanbin and Vincent Guittot"

* tag 'sched-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
  sched/eevdf: Clear buddies for preempt_short
  sched/rt: Cleanup global RT bandwidth functions
  sched/rt: Move group schedulability check to sched_rt_global_validate()
  sched/rt: Skip group schedulable check with rt_group_sched=0
  sched/fair: Avoid overflow in enqueue_entity()
  sched: Use u64 for bandwidth ratio calculations
  sched/fair: Prevent negative lag increase during delayed dequeue
  sched/fair: Use sched_energy_enabled()
  sched: Handle blocked-waiter migration (and return migration)
  sched: Move attach_one_task and attach_task helpers to sched.h
  sched: Add logic to zap balance callbacks if we pick again
  sched: Add assert_balance_callbacks_empty helper
  sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration
  sched: Fix modifying donor->blocked on without proper locking
  locking: Add task::blocked_lock to serialize blocked_on state
  sched: Fix potentially missing balancing with Proxy Exec
  sched: Minimise repeated sched_proxy_exec() checking
  sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  MAINTAINERS: Add K Prateek Nayak to scheduler reviewers
  sched/core: Get this cpu once in ttwu_queue_cond()
  ...
2026-04-14 13:33:36 -07:00
Linus Torvalds f21f7b5162 Merge tag 'timers-vdso-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull vdso updates from Thomas Gleixner:

 - Make the handling of compat functions consistent and more robust

 - Rework the underlying data store so that it is dynamically allocated,
   which allows the conversion of the last holdout SPARC64 to the
   generic VDSO implementation

 - Rework the SPARC64 VDSO to utilize the generic implementation

 - Mop up the left overs of the non-generic VDSO support in the core
   code

 - Expand the VDSO selftest and make them more robust

 - Allow time namespaces to be enabled independently of the generic VDSO
   support, which was not possible before due to SPARC64 not using it

 - Various cleanups and improvements in the related code

* tag 'timers-vdso-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
  timens: Use task_lock guard in timens_get*()
  timens: Use mutex guard in proc_timens_set_offset()
  timens: Simplify some calls to put_time_ns()
  timens: Add a __free() wrapper for put_time_ns()
  timens: Remove dependency on the vDSO
  vdso/timens: Move functions to new file
  selftests: vDSO: vdso_test_correctness: Add a test for time()
  selftests: vDSO: vdso_test_correctness: Use facilities from parse_vdso.c
  selftests: vDSO: vdso_test_correctness: Handle different tv_usec types
  selftests: vDSO: vdso_test_correctness: Drop SYS_getcpu fallbacks
  selftests: vDSO: vdso_test_gettimeofday: Remove nolibc checks
  Revert "selftests: vDSO: parse_vdso: Use UAPI headers instead of libc headers"
  random: vDSO: Remove ifdeffery
  random: vDSO: Trim vDSO includes
  vdso/datapage: Trim down unnecessary includes
  vdso/datapage: Remove inclusion of gettimeofday.h
  vdso/helpers: Explicitly include vdso/processor.h
  vdso/gettimeofday: Add explicit includes
  random: vDSO: Add explicit includes
  MIPS: vdso: Explicitly include asm/vdso/vdso.h
  ...
2026-04-14 10:53:44 -07:00
Linus Torvalds 5d0d362330 Merge tag 'kbuild-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux
Pull Kbuild/Kconfig updates from Nicolas Schier:
 "Kbuild:
   - reject unexpected values for LLVM=
   - uapi: remove usage of toolchain headers
   - switch from '-fms-extensions' to '-fms-anonymous-structs' when
     available (currently: clang >= 23.0.0)
   - reduce the number of compiler-generated suffixes for clang thin-lto
     build
   - reduce output spam ("GEN Makefile") when building out of tree
   - improve portability for testing headers
   - also test UAPI headers against C++ compilers
   - drop build ID architecture allow-list in vdso_install
   - only run checksyscalls when necessary
   - update the debug information notes in reproducible-builds.rst
   - expand inlining hints with -fdiagnostics-show-inlining-chain

  Kconfig:
   - forbid multiple entries with the same symbol in a choice
   - error out on duplicated kconfig inclusion"

* tag 'kbuild-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux: (35 commits)
  kbuild: expand inlining hints with -fdiagnostics-show-inlining-chain
  kconfig: forbid multiple entries with the same symbol in a choice
  Documentation: kbuild: Update the debug information notes in reproducible-builds.rst
  checksyscalls: move instance functionality into generic code
  checksyscalls: only run when necessary
  checksyscalls: fail on all intermediate errors
  checksyscalls: move path to reference table to a variable
  kbuild: vdso_install: drop build ID architecture allow-list
  kbuild: vdso_install: gracefully handle images without build ID
  kbuild: vdso_install: hide readelf warnings
  kbuild: vdso_install: split out the readelf invocation
  kbuild: uapi: also test UAPI headers against C++ compilers
  kbuild: uapi: provide a C++ compatible dummy definition of NULL
  kbuild: uapi: handle UML in architecture-specific exclusion lists
  kbuild: uapi: move all include path flags together
  kbuild: uapi: move some compiler arguments out of the command definition
  check-uapi: use dummy libc includes
  check-uapi: honor ${CROSS_COMPILE} setting
  check-uapi: link into shared objects
  kbuild: reduce output spam when building out of tree
  ...
2026-04-14 09:18:40 -07:00
Linus Torvalds 4793dae01f Merge tag 'driver-core-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core
Pull driver core updates from Danilo Krummrich:
 "debugfs:
   - Fix NULL pointer dereference in debugfs_create_str()
   - Fix misplaced EXPORT_SYMBOL_GPL for debugfs_create_str()
   - Fix soundwire debugfs NULL pointer dereference from uninitialized
     firmware_file

  device property:
   - Make fwnode flags modifications thread safe; widen the field to
     unsigned long and use set_bit() / clear_bit() based accessors
   - Document how to check for the property presence

  devres:
   - Separate struct devres_node from its "subclasses" (struct devres,
     struct devres_group); give struct devres_node its own release and
     free callbacks for per-type dispatch
   - Introduce struct devres_action for devres actions, avoiding the
     ARCH_DMA_MINALIGN alignment overhead of struct devres
   - Export struct devres_node and its init/add/remove/dbginfo
     primitives for use by Rust Devres<T>
   - Fix missing node debug info in devm_krealloc()
   - Use guard(spinlock_irqsave) where applicable; consolidate unlock
     paths in devres_release_group()

  driver_override:
   - Convert PCI, WMI, vdpa, s390/cio, s390/ap, and fsl-mc to the
     generic driver_override infrastructure, replacing per-bus
     driver_override strings, sysfs attributes, and match logic; fixes a
     potential UAF from unsynchronized access to driver_override in bus
     match() callbacks
   - Simplify __device_set_driver_override() logic

  kernfs:
   - Send IN_DELETE_SELF and IN_IGNORED inotify events on kernfs file
     and directory removal
   - Add corresponding selftests for memcg

  platform:
   - Allow attaching software nodes when creating platform devices via a
     new 'swnode' field in struct platform_device_info
   - Add kerneldoc for struct platform_device_info

  software node:
   - Move software node initialization from postcore_initcall() to
     driver_init(), making it available early in the boot process
   - Move kernel_kobj initialization (ksysfs_init) earlier to support
     the above
   - Remove software_node_exit(); dead code in a built-in unit

  SoC:
   - Introduce of_machine_read_compatible() and of_machine_read_model()
     OF helpers and export soc_attr_read_machine() to replace direct
     accesses to of_root from SoC drivers; also enables
     CONFIG_COMPILE_TEST coverage for these drivers

  sysfs:
   - Constify attribute group array pointers to
     'const struct attribute_group *const *' in sysfs functions,
     device_add_groups() / device_remove_groups(), and struct class

  Rust:
   - Devres:
      - Embed struct devres_node directly in Devres<T> instead of going
        through devm_add_action(), avoiding the extra allocation and the
        unnecessary ARCH_DMA_MINALIGN alignment

   - I/O:
      - Turn IoCapable from a marker trait into a functional trait
        carrying the raw I/O accessor implementation (io_read /
        io_write), providing working defaults for the per-type Io
        methods
      - Add RelaxedMmio wrapper type, making relaxed accessors usable in
        code generic over the Io trait
      - Remove overloaded per-type Io methods and per-backend macros
        from Mmio and PCI ConfigSpace

   - I/O (Register):
      - Add IoLoc trait and generic read/write/update methods to the Io
        trait, making I/O operations parameterizable by typed locations
      - Add register! macro for defining hardware register types with
        typed bitfield accessors backed by Bounded values; supports
        direct, relative, and array register addressing
      - Add write_reg() / try_write_reg() and LocatedRegister trait
      - Update PCI sample driver to demonstrate the register! macro

         Example:

         ```
             register! {
                 /// UART control register.
                 CTRL(u32) @ 0x18 {
                     /// Receiver enable.
                     19:19   rx_enable => bool;
                     /// Parity configuration.
                     14:13   parity ?=> Parity;
                 }

                 /// FIFO watermark and counter register.
                 WATER(u32) @ 0x2c {
                     /// Number of datawords in the receive FIFO.
                     26:24   rx_count;
                     /// RX interrupt threshold.
                     17:16   rx_water;
                 }
             }

             impl WATER {
                 fn rx_above_watermark(&self) -> bool {
                     self.rx_count() > self.rx_water()
                 }
             }

             fn init(bar: &pci::Bar<BAR0_SIZE>) {
                 let water = WATER::zeroed()
                     .with_const_rx_water::<1>(); // > 3 would not compile
                 bar.write_reg(water);

                 let ctrl = CTRL::zeroed()
                     .with_parity(Parity::Even)
                     .with_rx_enable(true);
                 bar.write_reg(ctrl);
             }

             fn handle_rx(bar: &pci::Bar<BAR0_SIZE>) {
                 if bar.read(WATER).rx_above_watermark() {
                     // drain the FIFO
                 }
             }

             fn set_parity(bar: &pci::Bar<BAR0_SIZE>, parity: Parity) {
                 bar.update(CTRL, |r| r.with_parity(parity));
             }
         ```

   - IRQ:
      - Move 'static bounds from where clauses to trait declarations for
        IRQ handler traits

   - Misc:
      - Enable the generic_arg_infer Rust feature
      - Extend Bounded with shift operations, single-bit bool
        conversion, and const get()

  Misc:
   - Make deferred_probe_timeout default a Kconfig option
   - Drop auxiliary_dev_pm_ops; the PM core falls back to driver PM
     callbacks when no bus type PM ops are set
   - Add conditional guard support for device_lock()
   - Add ksysfs.c to the DRIVER CORE MAINTAINERS entry
   - Fix kernel-doc warnings in base.h
   - Fix stale reference to memory_block_add_nid() in documentation"

* tag 'driver-core-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: (67 commits)
  bus: fsl-mc: use generic driver_override infrastructure
  s390/ap: use generic driver_override infrastructure
  s390/cio: use generic driver_override infrastructure
  vdpa: use generic driver_override infrastructure
  platform/wmi: use generic driver_override infrastructure
  PCI: use generic driver_override infrastructure
  driver core: make software nodes available earlier
  software node: remove software_node_exit()
  kernel: ksysfs: initialize kernel_kobj earlier
  MAINTAINERS: add ksysfs.c to the DRIVER CORE entry
  drivers/base/memory: fix stale reference to memory_block_add_nid()
  device property: Document how to check for the property presence
  soundwire: debugfs: initialize firmware_file to empty string
  debugfs: fix placement of EXPORT_SYMBOL_GPL for debugfs_create_str()
  debugfs: check for NULL pointer in debugfs_create_str()
  driver core: Make deferred_probe_timeout default a Kconfig option
  driver core: simplify __device_set_driver_override() clearing logic
  driver core: auxiliary bus: Drop auxiliary_dev_pm_ops
  device property: Make modifications of fwnode "flags" thread safe
  rust: devres: embed struct devres_node directly
  ...
2026-04-13 19:03:11 -07:00
Linus Torvalds d568788baa Merge tag 'hardening-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:

 - randomize_kstack: Improve implementation across arches (Ryan Roberts)

 - lkdtm/fortify: Drop unneeded FORTIFY_STR_OBJECT test

 - refcount: Remove unused __signed_wrap function annotations

* tag 'hardening-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  lkdtm/fortify: Drop unneeded FORTIFY_STR_OBJECT test
  refcount: Remove unused __signed_wrap function annotations
  randomize_kstack: Unify random source across arches
  randomize_kstack: Maintain kstack_offset per task
2026-04-13 17:52:29 -07:00
Linus Torvalds ef3da345cc Merge tag 'vfs-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
 "Features:
   - coredump: add tracepoint for coredump events
   - fs: hide file and bfile caches behind runtime const machinery

  Fixes:
   - fix architecture-specific compat_ftruncate64 implementations
   - dcache: Limit the minimal number of bucket to two
   - fs/omfs: reject s_sys_blocksize smaller than OMFS_DIR_START
   - fs/mbcache: cancel shrink work before destroying the cache
   - dcache: permit dynamic_dname()s up to NAME_MAX

  Cleanups:
   - remove or unexport unused fs_context infrastructure
   - trivial ->setattr cleanups
   - selftests/filesystems: Assume that TIOCGPTPEER is defined
   - writeback: fix kernel-doc function name mismatch for wb_put_many()
   - autofs: replace manual symlink buffer allocation in autofs_dir_symlink
   - init/initramfs.c: trivial fix: FSM -> Finite-state machine
   - fs: remove stale and duplicate forward declarations
   - readdir: Introduce dirent_size()
   - fs: Replace user_access_{begin/end} by scoped user access
   - kernel: acct: fix duplicate word in comment
   - fs: write a better comment in step_into() concerning .mnt assignment
   - fs: attr: fix comment formatting and spelling issues"

* tag 'vfs-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits)
  dcache: permit dynamic_dname()s up to NAME_MAX
  fs: attr: fix comment formatting and spelling issues
  fs: hide file and bfile caches behind runtime const machinery
  fs: write a better comment in step_into() concerning .mnt assignment
  proc: rename proc_notify_change to proc_setattr
  proc: rename proc_setattr to proc_nochmod_setattr
  affs: rename affs_notify_change to affs_setattr
  adfs: rename adfs_notify_change to adfs_setattr
  hfs: update comments on hfs_inode_setattr
  kernel: acct: fix duplicate word in comment
  fs: Replace user_access_{begin/end} by scoped user access
  readdir: Introduce dirent_size()
  coredump: add tracepoint for coredump events
  fs: remove do_sys_truncate
  fs: pass on FTRUNCATE_* flags to do_truncate
  fs: fix archiecture-specific compat_ftruncate64
  fs: remove stale and duplicate forward declarations
  init/initramfs.c: trivial fix: FSM -> Finite-state machine
  autofs: replace manual symlink buffer allocation in autofs_dir_symlink
  fs/mbcache: cancel shrink work before destroying the cache
  ...
2026-04-13 14:20:11 -07:00
Linus Torvalds 26ff969926 Merge tag 'rust-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux
Pull Rust updates from Miguel Ojeda:
 "Toolchain and infrastructure:

   - Bump the minimum Rust version to 1.85.0 (and 'bindgen' to 0.71.1).

     As proposed in LPC 2025 and the Maintainers Summit [1], we are
     going to follow Debian Stable's Rust versions as our minimum
     versions.

     Debian Trixie was released on 2025-08-09 with a Rust 1.85.0 and
     'bindgen' 0.71.1 toolchain, which is a fair amount of time for e.g.
     kernel developers to upgrade.

     Other major distributions support a Rust version that is high
     enough as well, including:

       + Arch Linux.
       + Fedora Linux.
       + Gentoo Linux.
       + Nix.
       + openSUSE Slowroll and openSUSE Tumbleweed.
       + Ubuntu 25.10 and 26.04 LTS. In addition, 24.04 LTS using
         their versioned packages.

     The merged patch series comes with the associated cleanups and
     simplifications treewide that can be performed thanks to both
     bumps, as well as documentation updates.

     In addition, start using 'bindgen''s '--with-attribute-custom-enum'
     feature to set the 'cfi_encoding' attribute for the 'lru_status'
     enum used in Binder.

     Link: https://lwn.net/Articles/1050174/ [1]

   - Add experimental Kconfig option ('CONFIG_RUST_INLINE_HELPERS') that
     inlines C helpers into Rust.

     Essentially, it performs a step similar to LTO, but just for the
     helpers, i.e. very local and fast.

     It relies on 'llvm-link' and its '--internalize' flag, and requires
     a compatible LLVM between Clang and 'rustc' (i.e. same major
     version, 'CONFIG_RUSTC_CLANG_LLVM_COMPATIBLE'). It is only enabled
     for two architectures for now.

     The result is a measurable speedup in different workloads that
     different users have tested. For instance, for the null block
     driver, it amounts to a 2%.

   - Support global per-version flags.

     While we already have per-version flags in many places, we didn't
     have a place to set global ones that depend on the compiler
     version, i.e. in 'rust_common_flags', which sometimes is needed to
     e.g. tweak the lints set per version.

     Use that to allow the 'clippy::precedence' lint for Rust < 1.86.0,
     since it had a change in behavior.

   - Support overriding the crate name and apply it to Rust Binder,
     which wanted the module to be called 'rust_binder'.

   - Add the remaining '__rust_helper' annotations (started in the
     previous cycle).

  'kernel' crate:

   - Introduce the 'const_assert!' macro: a more powerful version of
     'static_assert!' that can refer to generics inside functions or
     implementation bodies, e.g.:

         fn f<const N: usize>() {
             const_assert!(N > 1);
         }

         fn g<T>() {
             const_assert!(size_of::<T>() > 0, "T cannot be ZST");
         }

     In addition, reorganize our set of build-time assertion macros
     ('{build,const,static_assert}!') to live in the 'build_assert'
     module.

     Finally, improve the docs as well to clarify how these are
     different from one another and how to pick the right one to use,
     and their equivalence (if any) to the existing C ones for extra
     clarity.

   - 'sizes' module: add 'SizeConstants' trait.

     This gives us typed 'SZ_*' constants (avoiding casts) for use in
     device address spaces where the address width depends on the
     hardware (e.g. 32-bit MMIO windows, 64-bit GPU framebuffers, etc.),
     e.g.:

         let gpu_heap = 14 * u64::SZ_1M;
         let mmio_window = u32::SZ_16M;

   - 'clk' module: implement 'Send' and 'Sync' for 'Clk' and thus
     simplify the users in Tyr and PWM.

   - 'ptr' module: add 'const_align_up'.

   - 'str' module: improve the documentation of the 'c_str!' macro to
     explain that one should only use it for non-literal cases (for the
     other case we instead use C string literals, e.g. 'c"abc"').

   - Disallow the use of 'CStr::{as_ptr,from_ptr}' and clean one such
     use in the 'task' module.

   - 'sync' module: finish the move of 'ARef' and 'AlwaysRefCounted'
     outside of the 'types' module, i.e. update the last remaining
     instances and finally remove the re-exports.

   - 'error' module: clarify that 'from_err_ptr' can return 'Ok(NULL)',
     including runtime-tested examples.

     The intention is to hopefully prevent UB that assumes the result of
     the function is not 'NULL' if successful. This originated from a
     case of UB I noticed in 'regulator' that created a 'NonNull' on it.

  Timekeeping:

   - Expand the example section in the 'HrTimer' documentation.

   - Mark the 'ClockSource' trait as unsafe to ensure valid values for
     'ktime_get()'.

   - Add 'Delta::from_nanos()'.

  'pin-init' crate:

   - Replace the 'Zeroable' impls for 'Option<NonZero*>' with impls of
     'ZeroableOption' for 'NonZero*'.

   - Improve feature gate handling for unstable features.

   - Declutter the documentation of implementations of 'Zeroable' for
     tuples.

   - Replace uses of 'addr_of[_mut]!' with '&raw [mut]'.

  rust-analyzer:

   - Add type annotations to 'generate_rust_analyzer.py'.

   - Add support for scripts written in Rust ('generate_rust_target.rs',
     'rustdoc_test_builder.rs', 'rustdoc_test_gen.rs').

   - Refactor 'generate_rust_analyzer.py' to explicitly identify host
     and target crates, improve readability, and reduce duplication.

  And some other fixes, cleanups and improvements"

* tag 'rust-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux: (79 commits)
  rust: sizes: add SizeConstants trait for device address space constants
  rust: kernel: update `file_with_nul` comment
  rust: kbuild: allow `clippy::precedence` for Rust < 1.86.0
  rust: kbuild: support global per-version flags
  rust: declare cfi_encoding for lru_status
  docs: rust: general-information: use real example
  docs: rust: general-information: simplify Kconfig example
  docs: rust: quick-start: remove GDB/Binutils mention
  docs: rust: quick-start: remove Nix "unstable channel" note
  docs: rust: quick-start: remove Gentoo "testing" note
  docs: rust: quick-start: add Ubuntu 26.04 LTS and remove subsection title
  docs: rust: quick-start: update minimum Ubuntu version
  docs: rust: quick-start: update Ubuntu versioned packages
  docs: rust: quick-start: openSUSE provides `rust-src` package nowadays
  rust: kbuild: remove "dummy parameter" workaround for `bindgen` < 0.71.1
  rust: kbuild: update `bindgen --rust-target` version and replace comment
  rust: rust_is_available: remove warning for `bindgen` < 0.69.5 && libclang >= 19.1
  rust: rust_is_available: remove warning for `bindgen` 0.66.[01]
  rust: bump `bindgen` minimum supported version to 0.71.1 (Debian Trixie)
  rust: block: update `const_refs_to_static` MSRV TODO comment
  ...
2026-04-13 09:54:20 -07:00
Tim Chen 47d8696b95 sched/cache: Assign preferred LLC ID to processes
With cache-aware scheduling enabled, each task is assigned a
preferred LLC ID. This allows quick identification of the LLC domain
where the task prefers to run, similar to numa_preferred_nid in
NUMA balancing.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/f2ceecba5858680349ad4ce9303a2121f0bb7272.1775065312.git.tim.c.chen@linux.intel.com
2026-04-09 15:49:49 +02:00
Peter Zijlstra (Intel) df0d984759 sched/cache: Introduce infrastructure for cache-aware load balancing
Adds infrastructure to enable cache-aware load balancing,
which improves cache locality by grouping tasks that share resources
within the same cache domain. This reduces cache misses and improves
overall data access efficiency.

In this initial implementation, threads belonging to the same process
are treated as entities that likely share working sets. The mechanism
tracks per-process CPU occupancy across cache domains and attempts to
migrate threads toward cache-hot domains where their process already
has active threads, thereby enhancing locality.

This provides a basic model for cache affinity. While the current code
targets the last-level cache (LLC), the approach could be extended to
other domain types such as clusters (L2) or node-internal groupings.

At present, the mechanism selects the CPU within an LLC that has the
highest recent runtime. Subsequent patches in this series will use this
information in the load-balancing path to guide task placement toward
preferred LLCs.

In the future, more advanced policies could be integrated through NUMA
balancing-for example, migrating a task to its preferred LLC when spare
capacity exists, or swapping tasks across LLCs to improve cache affinity.
Grouping of tasks could also be generalized from that of a process
to be that of a NUMA group, or be user configurable.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/6269a53221b9439b9ca00d18a9d1946fb64d8cff.1775065312.git.tim.c.chen@linux.intel.com
2026-04-09 15:49:47 +02:00
Miguel Ojeda 93553d9922 rust: kbuild: remove "dummy parameter" workaround for bindgen < 0.71.1
Until the version bump of `bindgen`, we needed to pass a dummy parameter
to avoid failing the `--version` call.

Thus remove it.

Reviewed-by: Tamir Duberstein <tamird@kernel.org>
Reviewed-by: Gary Guo <gary@garyguo.net>
Link: https://patch.msgid.link/20260405235309.418950-22-ojeda@kernel.org
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2026-04-07 10:00:24 +02:00
Miguel Ojeda 4ab22c543f rust: remove RUSTC_HAS_COERCE_POINTEE and simplify code
With the Rust version bump in place, the `RUSTC_HAS_COERCE_POINTEE`
Kconfig (automatic) option is always true.

Thus remove the option and simplify the code.

In particular, this includes removing our use of the predecessor unstable
features we used with Rust < 1.84.0 (`coerce_unsized`, `dispatch_from_dyn`
and `unsize`).

Reviewed-by: Tamir Duberstein <tamird@kernel.org>
Acked-by: Danilo Krummrich <dakr@kernel.org>
Reviewed-by: Gary Guo <gary@garyguo.net>
Link: https://patch.msgid.link/20260405235309.418950-11-ojeda@kernel.org
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2026-04-07 10:00:23 +02:00
Miguel Ojeda 9b398d0565 rust: remove RUSTC_HAS_SLICE_AS_FLATTENED and simplify code
With the Rust version bump in place, the `RUSTC_HAS_SLICE_AS_FLATTENED`
Kconfig (automatic) option is always true.

Thus remove the option and simplify the code.

In particular, this includes removing the `slice` module which contained
the temporary slice helpers, i.e. the `AsFlattened` extension trait and
its `impl`s.

Reviewed-by: Tamir Duberstein <tamird@kernel.org>
Reviewed-by: Gary Guo <gary@garyguo.net>
Link: https://patch.msgid.link/20260405235309.418950-10-ojeda@kernel.org
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2026-04-07 10:00:23 +02:00
Miguel Ojeda b28711ac98 rust: simplify RUSTC_VERSION Kconfig conditions
With the Rust version bump in place, several Kconfig conditions based on
`RUSTC_VERSION` are always true.

Thus simplify them.

The minimum supported major LLVM version by our new Rust minimum version
is now LLVM 18, instead of LLVM 16. However, there are no possible
cleanups for `RUSTC_LLVM_VERSION`.

Reviewed-by: Tamir Duberstein <tamird@kernel.org>
Reviewed-by: Gary Guo <gary@garyguo.net>
Link: https://patch.msgid.link/20260405235309.418950-9-ojeda@kernel.org
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2026-04-07 10:00:23 +02:00
David Hildenbrand (Arm) 6ebf98d71f mm: introduce CONFIG_NUMA_MIGRATION and simplify CONFIG_MIGRATION
CONFIG_MEMORY_HOTREMOVE, CONFIG_COMPACTION and CONFIG_CMA all select
CONFIG_MIGRATION, because they require it to work (users).

Only CONFIG_NUMA_BALANCING and CONFIG_BALLOON_MIGRATION depend on
CONFIG_MIGRATION.  CONFIG_BALLOON_MIGRATION is not an actual user, but an
implementation of migration support, so the dependency is correct
(CONFIG_BALLOON_MIGRATION does not make any sense without
CONFIG_MIGRATION).

However, kconfig-language.rst clearly states "In general use select only
for non-visible symbols".  So far CONFIG_MIGRATION is user-visible ... 
and the dependencies rather confusing.

The whole reason why CONFIG_MIGRATION is user-visible is because of
CONFIG_NUMA: some users might want CONFIG_NUMA but not page migration
support.

Let's clean all that up by introducing a dedicated CONFIG_NUMA_MIGRATION
config option for that purpose only.  Make CONFIG_NUMA_BALANCING that so
far depended on CONFIG_NUMA && CONFIG_MIGRATION to depend on
CONFIG_MIGRATION instead.  CONFIG_NUMA_MIGRATION will depend on
CONFIG_NUMA && CONFIG_MMU.

CONFIG_NUMA_MIGRATION is user-visible and will default to "y".  We use
that default so new configs will automatically enable it, just like it was
the case with CONFIG_MIGRATION.  The downside is that some configs that
used to have CONFIG_MIGRATION=n might get it re-enabled by
CONFIG_NUMA_MIGRATION=y, which shouldn't be a problem.

CONFIG_MIGRATION is now a non-visible config option.  Any code that select
CONFIG_MIGRATION (as before) must depend directly or indirectly on
CONFIG_MMU.

CONFIG_NUMA_MIGRATION is responsible for any NUMA migration code, which is
mempolicy migration code, memory-tiering code, and move_pages() code in
migrate.c.  CONFIG_NUMA_BALANCING uses its functionality.

Note that this implies that with CONFIG_NUMA_MIGRATION=n, move_pages()
will not be available even though CONFIG_MIGRATION=y, which is an expected
change.

In migrate.c, we can remove the CONFIG_NUMA check as both
CONFIG_NUMA_MIGRATION and CONFIG_NUMA_BALANCING depend on it.

With this change, CONFIG_MIGRATION is an internal config, all users of
migration selects CONFIG_MIGRATION, and only CONFIG_BALLOON_MIGRATION
depends on it.

Link: https://lkml.kernel.org/r/20260319-config_migration-v1-2-42270124966f@kernel.org
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:33 -07:00
Bartosz Golaszewski 9617b5b62c kernel: ksysfs: initialize kernel_kobj earlier
Software nodes depend on kernel_kobj which is initialized pretty late
into the boot process - as a core_initcall(). Ahead of moving the
software node initialization to driver_init() we must first make
kernel_kobj available before it.

Make ksysfs_init() visible in a new header - ksysfs.h - and call it in
do_basic_setup() right before driver_init().

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Link: https://patch.msgid.link/20260402-nokia770-gpio-swnodes-v5-1-d730db3dd299@oss.qualcomm.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
2026-04-03 19:39:52 +02:00
John Stultz fa4a1ff8ab locking: Add task::blocked_lock to serialize blocked_on state
So far, we have been able to utilize the mutex::wait_lock
for serializing the blocked_on state, but when we move to
proxying across runqueues, we will need to add more state
and a way to serialize changes to this state in contexts
where we don't hold the mutex::wait_lock.

So introduce the task::blocked_lock, which nests under the
mutex::wait_lock in the locking order, and rework the locking
to use it.

Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-5-jstultz@google.com
2026-04-03 14:23:39 +02:00
Mike Rapoport (Microsoft) b2129a3951 memblock: make free_reserved_area() update memblock if ARCH_KEEP_MEMBLOCK=y
On architectures that keep memblock after boot, freeing of reserved memory
with free_reserved_area() is paired with an update of memblock arrays,
usually by a call to memblock_free().

Make free_reserved_area() directly update memblock.reserved when
ARCH_KEEP_MEMBLOCK is enabled.

Remove the now-redundant explicit memblock_free() call from
arm64::free_initmem() and the #ifdef CONFIG_ARCH_KEEP_MEMBLOCK block
from the generic free_initrd_mem().

Link: https://patch.msgid.link/20260323074836.3653702-8-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-04-01 11:20:15 +03:00
Gary Guo e90f97ce20 kbuild: rust: add CONFIG_RUSTC_CLANG_LLVM_COMPATIBLE
This config detects if Rust and Clang have matching LLVM major version.
All IR or bitcode operations (e.g. LTO) rely on LLVM major version to be
matching, otherwise it may generate errors, or worse, miscompile
silently due to change of IR semantics.

It's usually suggested to use the exact same LLVM version, but this can
be difficult to guarantee. Rust's suggestion [1] is also major-version
only, so I think this check is sufficient for the kernel.

Link: https://doc.rust-lang.org/rustc/linker-plugin-lto.html [1]
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Signed-off-by: Gary Guo <gary@garyguo.net>
Signed-off-by: Matthew Maurer <mmaurer@google.com>
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nicolas Schier <nsc@kernel.org>
Tested-by: Nicolas Schier <nsc@kernel.org>
Tested-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://patch.msgid.link/20260203-inline-helpers-v2-1-beb8547a03c9@google.com
[ Fixed typo. - Miguel ]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2026-03-30 02:03:52 +02:00
Thomas Weißschuh 1b6c89285d timens: Remove dependency on the vDSO
Previously, missing time namespace support in the vDSO meant that time
namespaces needed to be disabled globally. This was expressed in a hard
dependency on the generic vDSO library. This also meant that architectures
without any vDSO or only a stub vDSO could not enable time namespaces.
Now that all architectures using a real vDSO are using the generic library,
that dependency is not necessary anymore.

Remove the dependency and let all architectures enable time namespaces.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260326-vdso-timens-decoupling-v2-2-c82693a7775f@linutronix.de
2026-03-26 15:44:23 +01:00
Linus Torvalds d2a43e7f89 Merge tag 'hardening-v7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening fixes from Kees Cook:

 - fix required Clang version for CC_HAS_COUNTED_BY_PTR (Nathan
   Chancellor)

 - update Coccinelle script used for kmalloc_obj

* tag 'hardening-v7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  init/Kconfig: Require a release version of clang-22 for CC_HAS_COUNTED_BY_PTR
  coccinelle: kmalloc_obj: Remove default GFP_KERNEL arg
2026-03-25 14:47:18 -07:00
Ryan Roberts a96ef5848c randomize_kstack: Unify random source across arches
Previously different architectures were using random sources of
differing strength and cost to decide the random kstack offset. A number
of architectures (loongarch, powerpc, s390, x86) were using their
timestamp counter, at whatever the frequency happened to be. Other
arches (arm64, riscv) were using entropy from the crng via
get_random_u16().

There have been concerns that in some cases the timestamp counters may
be too weak, because they can be easily guessed or influenced by user
space. And get_random_u16() has been shown to be too costly for the
level of protection kstack offset randomization provides.

So let's use a common, architecture-agnostic source of entropy; a
per-cpu prng, seeded at boot-time from the crng. This has a few
benefits:

  - We can remove choose_random_kstack_offset(); That was only there to
    try to make the timestamp counter value a bit harder to influence
    from user space [*].

  - The architecture code is simplified. All it has to do now is call
    add_random_kstack_offset() in the syscall path.

  - The strength of the randomness can be reasoned about independently
    of the architecture.

  - Arches previously using get_random_u16() now have much faster
    syscall paths, see below results.

[*] Additionally, this gets rid of some redundant work on s390 and x86.
Before this patch, those architectures called
choose_random_kstack_offset() under arch_exit_to_user_mode_prepare(),
which is also called for exception returns to userspace which were *not*
syscalls (e.g. regular interrupts). Getting rid of
choose_random_kstack_offset() avoids a small amount of redundant work
for the non-syscall cases.

In some configurations, add_random_kstack_offset() will now call
instrumentable code, so for a couple of arches, I have moved the call a
bit later to the first point where instrumentation is allowed. This
doesn't impact the efficacy of the mechanism.

There have been some claims that a prng may be less strong than the
timestamp counter if not regularly reseeded. But the prng has a period
of about 2^113. So as long as the prng state remains secret, it should
not be possible to guess. If the prng state can be accessed, we have
bigger problems.

Additionally, we are only consuming 6 bits to randomize the stack, so
there are only 64 possible random offsets. I assert that it would be
trivial for an attacker to brute force by repeating their attack and
waiting for the random stack offset to be the desired one. The prng
approach seems entirely proportional to this level of protection.

Performance data are provided below. The baseline is v6.18 with rndstack
on for each respective arch. (I)/(R) indicate statistically significant
improvement/regression. arm64 platform is AWS Graviton3 (m7g.metal).
x86_64 platform is AWS Sapphire Rapids (m7i.24xlarge):

+-----------------+--------------+---------------+---------------+
| Benchmark       | Result Class |  per-cpu-prng |  per-cpu-prng |
|                 |              | arm64 (metal) |   x86_64 (VM) |
+=================+==============+===============+===============+
| syscall/getpid  | mean (ns)    |    (I) -9.50% |   (I) -17.65% |
|                 | p99 (ns)     |   (I) -59.24% |   (I) -24.41% |
|                 | p99.9 (ns)   |   (I) -59.52% |   (I) -28.52% |
+-----------------+--------------+---------------+---------------+
| syscall/getppid | mean (ns)    |    (I) -9.52% |   (I) -19.24% |
|                 | p99 (ns)     |   (I) -59.25% |   (I) -25.03% |
|                 | p99.9 (ns)   |   (I) -59.50% |   (I) -28.17% |
+-----------------+--------------+---------------+---------------+
| syscall/invalid | mean (ns)    |   (I) -10.31% |   (I) -18.56% |
|                 | p99 (ns)     |   (I) -60.79% |   (I) -20.06% |
|                 | p99.9 (ns)   |   (I) -61.04% |   (I) -25.04% |
+-----------------+--------------+---------------+---------------+

I tested an earlier version of this change on x86 bare metal and it
showed a smaller but still significant improvement. The bare metal
system wasn't available this time around so testing was done in a VM
instance. I'm guessing the cost of rdtsc is higher for VMs.

Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://patch.msgid.link/20260303150840.3789438-3-ryan.roberts@arm.com
Signed-off-by: Kees Cook <kees@kernel.org>
2026-03-24 21:12:03 -07:00
Ryan Roberts 37beb42560 randomize_kstack: Maintain kstack_offset per task
kstack_offset was previously maintained per-cpu, but this caused a
couple of issues. So let's instead make it per-task.

Issue 1: add_random_kstack_offset() and choose_random_kstack_offset()
expected and required to be called with interrupts and preemption
disabled so that it could manipulate per-cpu state. But arm64, loongarch
and risc-v are calling them with interrupts and preemption enabled. I
don't _think_ this causes any functional issues, but it's certainly
unexpected and could lead to manipulating the wrong cpu's state, which
could cause a minor performance degradation due to bouncing the cache
lines. By maintaining the state per-task those functions can safely be
called in preemptible context.

Issue 2: add_random_kstack_offset() is called before executing the
syscall and expands the stack using a previously chosen random offset.
choose_random_kstack_offset() is called after executing the syscall and
chooses and stores a new random offset for the next syscall. With
per-cpu storage for this offset, an attacker could force cpu migration
during the execution of the syscall and prevent the offset from being
updated for the original cpu such that it is predictable for the next
syscall on that cpu. By maintaining the state per-task, this problem
goes away because the per-task random offset is updated after the
syscall regardless of which cpu it is executing on.

Fixes: 39218ff4c6 ("stack: Optionally randomize kernel stack offset each syscall")
Closes: https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/
Cc: stable@vger.kernel.org
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://patch.msgid.link/20260303150840.3789438-2-ryan.roberts@arm.com
Signed-off-by: Kees Cook <kees@kernel.org>
2026-03-24 21:12:03 -07:00
Cheng-Yang Chou e73b1d7210 sched_ext: Fix build errors and unused label warning in non-cgroup configs
When building with SCHED_CLASS_EXT=y but CGROUPS=n, clang reports errors
for undeclared cgroup_put() and cgroup_get() calls, and a warning for the
unused err_stop_helper label.

EXT_SUB_SCHED is def_bool y depending only on SCHED_CLASS_EXT, but it
fundamentally requires cgroups (cgroup_path, cgroup_get, cgroup_put,
cgroup_id, etc.). Add the missing CGROUPS dependency to EXT_SUB_SCHED in
init/Kconfig.

Guard cgroup_put() and cgroup_get() in the common paths with:
  #if defined(CONFIG_EXT_GROUP_SCHED) || defined(CONFIG_EXT_SUB_SCHED)

Guard the err_stop_helper label with #ifdef CONFIG_EXT_SUB_SCHED since
all gotos targeting it are inside that same ifdef block.

Tested with both CGROUPS enabled and disabled.

Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202603210903.IrKhPd6k-lkp@intel.com/
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-22 10:02:11 -10:00