Files
linux-stable-mirror/arch/xtensa/include/asm/page.h
T
David Hildenbrand 8e38607aa4 treewide: provide a generic clear_user_page() variant
Patch series "mm: folio_zero_user: clear page ranges", v11.

This series adds clearing of contiguous page ranges for hugepages.

The series improves on the current discontiguous clearing approach in two
ways:

  - clear pages in a contiguous fashion.
  - use batched clearing via clear_pages() wherever exposed.

The first is useful because it allows us to make much better use of
hardware prefetchers.

The second, enables advertising the real extent to the processor.  Where
specific instructions support it (ex.  string instructions on x86; "mops"
on arm64 etc), a processor can optimize based on this because, instead of
seeing a sequence of 8-byte stores, or a sequence of 4KB pages, it sees a
larger unit being operated on.

For instance, AMD Zen uarchs (for extents larger than LLC-size) switch to
a mode where they start eliding cacheline allocation.  This is helpful not
just because it results in higher bandwidth, but also because now the
cache is not evicting useful cachelines and replacing them with zeroes.

Demand faulting a 64GB region shows performance improvement:

 $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5

                       baseline              +series
                   (GBps +- %stdev)      (GBps +- %stdev)

   pg-sz=2MB       11.76 +- 1.10%        25.34 +- 1.18% [*]   +115.47%  	preempt=*

   pg-sz=1GB       24.85 +- 2.41%        39.22 +- 2.32%       + 57.82%  	preempt=none|voluntary
   pg-sz=1GB         (similar)           52.73 +- 0.20% [#]   +112.19%  	preempt=full|lazy

 [*] This improvement is because switching to sequential clearing
  allows the hardware prefetchers to do a much better job.

 [#] For pg-sz=1GB a large part of the improvement is because of the
  cacheline elision mentioned above. preempt=full|lazy improves upon
  that because, not needing explicit invocations of cond_resched() to
  ensure reasonable preemption latency, it can clear the full extent
  as a single unit. In comparison the maximum extent used for
  preempt=none|voluntary is PROCESS_PAGES_NON_PREEMPT_BATCH (32MB).

  When provided the full extent the processor forgoes allocating
  cachelines on this path almost entirely.

  (The hope is that eventually, in the fullness of time, the lazy
   preemption model will be able to do the same job that none or
   voluntary models are used for, allowing us to do away with
   cond_resched().)

Raghavendra also tested previous version of the series on AMD Genoa and
sees similar improvement [1] with preempt=lazy.

  $ perf bench mem map -p $page-size -f populate -s 64GB -l 10

                    base               patched              change
   pg-sz=2MB       12.731939 GB/sec    26.304263 GB/sec     106.6%
   pg-sz=1GB       26.232423 GB/sec    61.174836 GB/sec     133.2%


This patch (of 8):

Let's drop all variants that effectively map to clear_page() and provide
it in a generic variant instead.

We'll use the macro clear_user_page to indicate whether an architecture
provides it's own variant.

Also, clear_user_page() is only called from the generic variant of
clear_user_highpage(), so define it only if the architecture does not
provide a clear_user_highpage().  And, for simplicity define it in
linux/highmem.h.

Note that for parisc, clear_page() and clear_user_page() map to
clear_page_asm(), so we can just get rid of the custom clear_user_page()
implementation.  There is a clear_user_page_asm() function on parisc, that
seems to be unused.  Not sure what's up with that.

Link: https://lkml.kernel.org/r/20260107072009.1615991-1-ankur.a.arora@oracle.com
Link: https://lkml.kernel.org/r/20260107072009.1615991-2-ankur.a.arora@oracle.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Co-developed-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ankur Arora <ankur.a.arora@oracle.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzessutek Wilk <konrad.wilk@oracle.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:24:39 -08:00

178 lines
5.0 KiB
C

/*
* include/asm-xtensa/page.h
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version2 as
* published by the Free Software Foundation.
*
* Copyright (C) 2001 - 2007 Tensilica Inc.
*/
#ifndef _XTENSA_PAGE_H
#define _XTENSA_PAGE_H
#include <linux/const.h>
#include <asm/processor.h>
#include <asm/types.h>
#include <asm/cache.h>
#include <asm/kmem_layout.h>
#include <vdso/page.h>
#ifdef CONFIG_MMU
#define PAGE_OFFSET XCHAL_KSEG_CACHED_VADDR
#define PHYS_OFFSET XCHAL_KSEG_PADDR
#define MAX_LOW_PFN (PHYS_PFN(XCHAL_KSEG_PADDR) + \
PHYS_PFN(XCHAL_KSEG_SIZE))
#else
#define PAGE_OFFSET _AC(CONFIG_DEFAULT_MEM_START, UL)
#define PHYS_OFFSET _AC(CONFIG_DEFAULT_MEM_START, UL)
#define MAX_LOW_PFN PHYS_PFN(0xfffffffful)
#endif
/*
* Cache aliasing:
*
* If the cache size for one way is greater than the page size, we have to
* deal with cache aliasing. The cache index is wider than the page size:
*
* | |cache| cache index
* | pfn |off| virtual address
* |xxxx:X|zzz|
* | : | |
* | \ / | |
* |trans.| |
* | / \ | |
* |yyyy:Y|zzz| physical address
*
* When the page number is translated to the physical page address, the lowest
* bit(s) (X) that are part of the cache index are also translated (Y).
* If this translation changes bit(s) (X), the cache index is also afected,
* thus resulting in a different cache line than before.
* The kernel does not provide a mechanism to ensure that the page color
* (represented by this bit) remains the same when allocated or when pages
* are remapped. When user pages are mapped into kernel space, the color of
* the page might also change.
*
* We use the address space VMALLOC_END ... VMALLOC_END + DCACHE_WAY_SIZE * 2
* to temporarily map a patch so we can match the color.
*/
#if DCACHE_WAY_SIZE > PAGE_SIZE
# define DCACHE_ALIAS_ORDER (DCACHE_WAY_SHIFT - PAGE_SHIFT)
# define DCACHE_ALIAS_MASK (PAGE_MASK & (DCACHE_WAY_SIZE - 1))
# define DCACHE_ALIAS(a) (((a) & DCACHE_ALIAS_MASK) >> PAGE_SHIFT)
# define DCACHE_ALIAS_EQ(a,b) ((((a) ^ (b)) & DCACHE_ALIAS_MASK) == 0)
#else
# define DCACHE_ALIAS_ORDER 0
# define DCACHE_ALIAS(a) ((void)(a), 0)
#endif
#define DCACHE_N_COLORS (1 << DCACHE_ALIAS_ORDER)
#if ICACHE_WAY_SIZE > PAGE_SIZE
# define ICACHE_ALIAS_ORDER (ICACHE_WAY_SHIFT - PAGE_SHIFT)
# define ICACHE_ALIAS_MASK (PAGE_MASK & (ICACHE_WAY_SIZE - 1))
# define ICACHE_ALIAS(a) (((a) & ICACHE_ALIAS_MASK) >> PAGE_SHIFT)
# define ICACHE_ALIAS_EQ(a,b) ((((a) ^ (b)) & ICACHE_ALIAS_MASK) == 0)
#else
# define ICACHE_ALIAS_ORDER 0
#endif
#ifdef __ASSEMBLER__
#define __pgprot(x) (x)
#else
/*
* These are used to make use of C type-checking..
*/
typedef struct { unsigned long pte; } pte_t; /* page table entry */
typedef struct { unsigned long pgd; } pgd_t; /* PGD table entry */
typedef struct { unsigned long pgprot; } pgprot_t;
typedef struct page *pgtable_t;
#define pte_val(x) ((x).pte)
#define pgd_val(x) ((x).pgd)
#define pgprot_val(x) ((x).pgprot)
#define __pte(x) ((pte_t) { (x) } )
#define __pgd(x) ((pgd_t) { (x) } )
#define __pgprot(x) ((pgprot_t) { (x) } )
# include <asm-generic/getorder.h>
struct page;
struct vm_area_struct;
extern void clear_page(void *page);
extern void copy_page(void *to, void *from);
/*
* If we have cache aliasing and writeback caches, we might have to do
* some extra work
*/
#if defined(CONFIG_MMU) && DCACHE_WAY_SIZE > PAGE_SIZE
extern void clear_page_alias(void *vaddr, unsigned long paddr);
extern void copy_page_alias(void *to, void *from,
unsigned long to_paddr, unsigned long from_paddr);
#define clear_user_highpage clear_user_highpage
void clear_user_highpage(struct page *page, unsigned long vaddr);
#define __HAVE_ARCH_COPY_USER_HIGHPAGE
void copy_user_highpage(struct page *to, struct page *from,
unsigned long vaddr, struct vm_area_struct *vma);
#else
# define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
#endif
/*
* This handles the memory map. We handle pages at
* XCHAL_KSEG_CACHED_VADDR for kernels with 32 bit address space.
* These macros are for conversion of kernel address, not user
* addresses.
*/
#define ARCH_PFN_OFFSET (PHYS_OFFSET >> PAGE_SHIFT)
#ifdef CONFIG_MMU
static inline unsigned long ___pa(unsigned long va)
{
unsigned long off = va - PAGE_OFFSET;
if (off >= XCHAL_KSEG_SIZE)
off -= XCHAL_KSEG_SIZE;
#ifndef CONFIG_XIP_KERNEL
return off + PHYS_OFFSET;
#else
if (off < XCHAL_KSEG_SIZE)
return off + PHYS_OFFSET;
off -= XCHAL_KSEG_SIZE;
if (off >= XCHAL_KIO_SIZE)
off -= XCHAL_KIO_SIZE;
return off + XCHAL_KIO_PADDR;
#endif
}
#define __pa(x) ___pa((unsigned long)(x))
#else
#define __pa(x) \
((unsigned long) (x) - PAGE_OFFSET + PHYS_OFFSET)
#endif
#define __va(x) \
((void *)((unsigned long) (x) - PHYS_OFFSET + PAGE_OFFSET))
#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
#define page_to_virt(page) __va(page_to_pfn(page) << PAGE_SHIFT)
#define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
#endif /* __ASSEMBLER__ */
#include <asm-generic/memory_model.h>
#endif /* _XTENSA_PAGE_H */