Message-ID: <1a738e0a-ac11-4cd3-be2f-6b6e7cb4980a@nvidia.com>
Date: Tue, 28 Nov 2023 19:42:36 -0800
From: John Hubbard <jhubbard@...dia.com>
To: Ryan Roberts <ryan.roberts@....com>,
Andrew Morton <akpm@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>,
"Yin Fengwei" <fengwei.yin@...el.com>,
David Hildenbrand <david@...hat.com>,
"Yu Zhao" <yuzhao@...gle.com>,
Catalin Marinas <catalin.marinas@....com>,
"Anshuman Khandual" <anshuman.khandual@....com>,
Yang Shi <shy828301@...il.com>,
"Huang, Ying" <ying.huang@...el.com>, Zi Yan <ziy@...dia.com>,
Luis Chamberlain <mcgrof@...nel.org>,
Itaru Kitayama <itaru.kitayama@...il.com>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
David Rientjes <rientjes@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>,
Hugh Dickins <hughd@...gle.com>,
Kefeng Wang <wangkefeng.wang@...wei.com>
CC: <linux-mm@...ck.org>, <linux-arm-kernel@...ts.infradead.org>,
<linux-kernel@...r.kernel.org>
Subject: Re: [RESEND PATCH v7 03/10] mm: thp: Introduce per-size thp sysfs
interface
On 11/22/23 08:29, Ryan Roberts wrote:
> In preparation for adding support for anonymous small-sized THP,
> introduce new sysfs structure that will be used to control the new
> behaviours. A new directory is added under transparent_hugepage for each
> supported THP size, and contains an `enabled` file, which can be set to
> "global" (to inherrit the global setting), "always", "madvise" or
> "never". For now, the kernel still only supports PMD-sized anonymous
> THP, so only 1 directory is populated.
>
> The first half of the change converts transhuge_vma_suitable() and
> hugepage_vma_check() so that they take a bitfield of orders for which
> the user wants to determine support, and the functions filter out all
> the orders that can't be supported, given the current sysfs
> configuration and the VMA dimensions. If there is only 1 order set in
> the input then the output can continue to be treated like a boolean;
> this is the case for most call sites.
>
> The second half of the change implements the new sysfs interface. It has
> been done so that each supported THP size has a `struct thpsize`, which
> describes the relevant metadata and is itself a kobject. This is pretty
> minimal for now, but should make it easy to add new per-thpsize files to
> the interface if needed in future (e.g. per-size defrag). Rather than
> keep the `enabled` state directly in the struct thpsize, I've elected to
> directly encode it into huge_anon_orders_[always|madvise|global]
> bitfields since this reduces the amount of work required in
> transhuge_vma_suitable() which is called for every page fault.
>
> The remainder is copied from Documentation/admin-guide/mm/transhuge.rst,
> as modified by this commit. See that file for further details.
>
> Transparent Hugepage Support for anonymous memory can be entirely
> disabled (mostly for debugging purposes) or only enabled inside
> MADV_HUGEPAGE regions (to avoid the risk of consuming more memory
> resources) or enabled system wide. This can be achieved
> per-supported-THP-size with one of::
>
> echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
> echo madvise >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
> echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
>
> where <size> is the hugepage size being addressed, the available sizes
> for which vary by system. Alternatively it is possible to specify that
> a given hugepage size will inherrit the global enabled setting::
>
> echo global >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
>
> The global (legacy) enabled setting can be set as follows::
>
> echo always >/sys/kernel/mm/transparent_hugepage/enabled
> echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> echo never >/sys/kernel/mm/transparent_hugepage/enabled
>
> By default, PMD-sized hugepages have enabled="global" and all other
> hugepage sizes have enabled="never". If enabling multiple hugepage
> sizes, the kernel will select the most appropriate enabled size for a
> given allocation.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@....com>
> ---
> Documentation/admin-guide/mm/transhuge.rst | 74 ++++--
> Documentation/filesystems/proc.rst | 6 +-
> fs/proc/task_mmu.c | 3 +-
> include/linux/huge_mm.h | 100 +++++---
> mm/huge_memory.c | 263 +++++++++++++++++++--
> mm/khugepaged.c | 16 +-
> mm/memory.c | 6 +-
> mm/page_vma_mapped.c | 3 +-
> 8 files changed, 387 insertions(+), 84 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index b0cc8243e093..52565e0bd074 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -45,10 +45,23 @@ components:
> the two is using hugepages just because of the fact the TLB miss is
> going to run faster.
>
> +As well as PMD-sized THP described above, it is also possible to
> +configure the system to allocate "small-sized THP" to back anonymous
Here's one of the places to change to the new name, which lately is
"multi-size THP", or mTHP or m_thp for short. (I've typed "multi-size"
instead of "multi-sized" because the 'd' doesn't add significantly to
the meaning, and if in doubt, shorter is better.)
> +memory (for example 16K, 32K, 64K, etc). These THPs continue to be
> +PTE-mapped, but in many cases can still provide similar benefits to
> +those outlined above: Page faults are significantly reduced (by a
> +factor of e.g. 4, 8, 16, etc), but latency spikes are much less
> +prominent because the size of each page isn't as huge as the PMD-sized
> +variant and there is less memory to clear in each page fault. Some
> +architectures also employ TLB compression mechanisms to squeeze more
> +entries in when a set of PTEs are virtually and physically contiguous
> +and approporiately aligned. In this case, TLB misses will occur less
> +often.
> +
OK, all of the above still seems like it can remain the same.
> THP can be enabled system wide or restricted to certain tasks or even
> memory ranges inside task's address space. Unless THP is completely
> disabled, there is ``khugepaged`` daemon that scans memory and
> -collapses sequences of basic pages into huge pages.
> +collapses sequences of basic pages into PMD-sized huge pages.
>
> The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
> interface and using madvise(2) and prctl(2) system calls.
> @@ -95,12 +108,29 @@ Global THP controls
> Transparent Hugepage Support for anonymous memory can be entirely disabled
> (mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
> regions (to avoid the risk of consuming more memory resources) or enabled
> -system wide. This can be achieved with one of::
> +system wide. This can be achieved per-supported-THP-size with one of::
> +
> + echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
> + echo madvise >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
> + echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
> +
> +where <size> is the hugepage size being addressed, the available sizes
> +for which vary by system. Alternatively it is possible to specify that
> +a given hugepage size will inherrit the global enabled setting::
typo: inherrit
> +
> + echo global >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
> +
> +The global (legacy) enabled setting can be set as follows::
>
> echo always >/sys/kernel/mm/transparent_hugepage/enabled
> echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> echo never >/sys/kernel/mm/transparent_hugepage/enabled
>
> +By default, PMD-sized hugepages have enabled="global" and all other
> +hugepage sizes have enabled="never". If enabling multiple hugepage
> +sizes, the kernel will select the most appropriate enabled size for a
> +given allocation.
> +
This is slightly murky. I wonder if "inherited" is a little more directly
informative than "global"; it certainly felt that way the first time I ran
this and poked at it.
And a few trivial examples would be a nice touch.
And so overall with a few other minor tweaks, I'd suggest this:
...
where <size> is the hugepage size being addressed, the available sizes
for which vary by system.
For example:
echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
Alternatively it is possible to specify that a given hugepage size will inherit
the top-level "enabled" value:
echo inherited >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
For example:
echo inherited >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
The top-level setting (for use with "inherited") can be set by issuing one of
the following commands::
echo always >/sys/kernel/mm/transparent_hugepage/enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
echo never >/sys/kernel/mm/transparent_hugepage/enabled
By default, PMD-sized hugepages have enabled="inherited" and all other
hugepage sizes have enabled="never".
> It's also possible to limit defrag efforts in the VM to generate
> anonymous hugepages in case they're not immediately free to madvise
> regions or to never try to defrag memory and simply fallback to regular
> @@ -146,25 +176,34 @@ madvise
> never
> should be self-explanatory.
>
> -By default kernel tries to use huge zero page on read page fault to
> -anonymous mapping. It's possible to disable huge zero page by writing 0
> -or enable it back by writing 1::
> +By default kernel tries to use huge, PMD-mappable zero page on read
> +page fault to anonymous mapping. It's possible to disable huge zero
> +page by writing 0 or enable it back by writing 1::
>
> echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
> echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
>
> -Some userspace (such as a test program, or an optimized memory allocation
> -library) may want to know the size (in bytes) of a transparent hugepage::
> +Some userspace (such as a test program, or an optimized memory
> +allocation library) may want to know the size (in bytes) of a
> +PMD-mappable transparent hugepage::
>
> cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
>
> -khugepaged will be automatically started when
> -transparent_hugepage/enabled is set to "always" or "madvise, and it'll
> -be automatically shutdown if it's set to "never".
> +khugepaged will be automatically started when one or more hugepage
> +sizes are enabled (either by directly setting "always" or "madvise",
> +or by setting "global" while the global enabled is set to "always" or
> +"madvise"), and it'll be automatically shutdown when the last hugepage
> +size is disabled (either by directly setting "never", or by setting
> +"global" while the global enabled is set to "never").
>
> Khugepaged controls
> -------------------
>
> +.. note::
> + khugepaged currently only searches for opportunities to collapse to
> + PMD-sized THP and no attempt is made to collapse to small-sized
> + THP.
> +
> khugepaged runs usually at low frequency so while one may not want to
> invoke defrag algorithms synchronously during the page faults, it
> should be worth invoking defrag at least in khugepaged. However it's
> @@ -282,10 +321,11 @@ force
> Need of application restart
> ===========================
>
> -The transparent_hugepage/enabled values and tmpfs mount option only affect
> -future behavior. So to make them effective you need to restart any
> -application that could have been using hugepages. This also applies to the
> -regions registered in khugepaged.
> +The transparent_hugepage/enabled and
> +transparent_hugepage/hugepages-<size>kB/enabled values and tmpfs mount
> +option only affect future behavior. So to make them effective you need
> +to restart any application that could have been using hugepages. This
> +also applies to the regions registered in khugepaged.
>
> Monitoring usage
> ================
> @@ -308,6 +348,10 @@ frequently will incur overhead.
> There are a number of counters in ``/proc/vmstat`` that may be used to
> monitor how successfully the system is providing huge pages for use.
>
> +.. note::
> + Currently the below counters only record events relating to
> + PMD-sized THP. Events relating to small-sized THP are not included.
Here's another spot to rename to "multi-size THP".
> +
> thp_fault_alloc
> is incremented every time a huge page is successfully
> allocated to handle a page fault.
> @@ -413,7 +457,7 @@ for huge pages.
> Optimizing the applications
> ===========================
>
> -To be guaranteed that the kernel will map a 2M page immediately in any
> +To be guaranteed that the kernel will map a THP immediately in any
> memory region, the mmap region has to be hugepage naturally
> aligned. posix_memalign() can provide that guarantee.
>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 49ef12df631b..f8e8dd1fd148 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -528,9 +528,9 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
> does not take into account swapped out page of underlying shmem objects.
> "Locked" indicates whether the mapping is locked in memory or not.
>
> -"THPeligible" indicates whether the mapping is eligible for allocating THP
> -pages as well as the THP is PMD mappable or not - 1 if true, 0 otherwise.
> -It just shows the current status.
> +"THPeligible" indicates whether the mapping is eligible for allocating
> +naturally aligned THP pages of any currently enabled size. 1 if true, 0
> +otherwise. It just shows the current status.
"It just shows the current status"...as opposed to what? I'm missing an
educational opportunity here--not sure what this means. :)
>
> "VmFlags" field deserves a separate description. This member represents the
> kernel flags associated with the particular virtual memory area in two letter
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 51e0ec658457..2e25362ca9fa 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -865,7 +865,8 @@ static int show_smap(struct seq_file *m, void *v)
> __show_smap(m, &mss, false);
>
> seq_printf(m, "THPeligible: %8u\n",
> - hugepage_vma_check(vma, vma->vm_flags, true, false, true));
> + !!hugepage_vma_check(vma, vma->vm_flags, true, false, true,
> + THP_ORDERS_ALL));
>
> if (arch_pkeys_enabled())
> seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma));
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index fa0350b0812a..7d6f7d96b039 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -67,6 +67,21 @@ extern struct kobj_attribute shmem_enabled_attr;
> #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>
> +/*
> + * Mask of all large folio orders supported for anonymous THP.
> + */
> +#define THP_ORDERS_ALL_ANON BIT(PMD_ORDER)
> +
> +/*
> + * Mask of all large folio orders supported for file THP.
> + */
> +#define THP_ORDERS_ALL_FILE (BIT(PMD_ORDER) | BIT(PUD_ORDER))
Is there something about file THP that implies PUD_ORDER? This
deserves an explanatory comment, I think.
> +
> +/*
> + * Mask of all large folio orders supported for THP.
> + */
> +#define THP_ORDERS_ALL (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_FILE)
> +
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> #define HPAGE_PMD_SHIFT PMD_SHIFT
> #define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
> @@ -78,13 +93,18 @@ extern struct kobj_attribute shmem_enabled_attr;
>
> extern unsigned long transparent_hugepage_flags;
>
> -#define hugepage_flags_enabled() \
> - (transparent_hugepage_flags & \
> - ((1<<TRANSPARENT_HUGEPAGE_FLAG) | \
> - (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)))
> -#define hugepage_flags_always() \
> - (transparent_hugepage_flags & \
> - (1<<TRANSPARENT_HUGEPAGE_FLAG))
Are macros actually required? If not, static inlines would be nicer.
> +bool hugepage_flags_enabled(void);
> +
> +static inline int first_order(unsigned long orders)
> +{
> + return fls_long(orders) - 1;
> +}
> +
> +static inline int next_order(unsigned long *orders, int prev)
> +{
> + *orders &= ~BIT(prev);
> + return first_order(*orders);
> +}
>
> /*
> * Do the below checks:
> @@ -97,23 +117,39 @@ extern unsigned long transparent_hugepage_flags;
> * - For all vmas, check if the haddr is in an aligned HPAGE_PMD_SIZE
> * area.
> */
> -static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
> - unsigned long addr)
> -{
> - unsigned long haddr;
> -
> - /* Don't have to check pgoff for anonymous vma */
> - if (!vma_is_anonymous(vma)) {
> - if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
> - HPAGE_PMD_NR))
> - return false;
> +static inline unsigned long transhuge_vma_suitable(struct vm_area_struct *vma,
> + unsigned long addr, unsigned long orders)
Changing this from a bool to a mask of orders is quite significant, and
both the function name and the function-level comment documentation need
to also be adjusted. Here, perhaps one of these names would work:
transhuge_vma_suitable_orders()
transhuge_vma_orders()
> +{
> + int order;
> +
> + /*
> + * Iterate over orders, highest to lowest, removing orders that don't
> + * meet alignment requirements from the set. Exit loop at first order
> + * that meets requirements, since all lower orders must also meet
> + * requirements.
Should we add a little note here, to the effect of, "this is unilaterally
taking over a certain amount of allocation-like policy, by deciding how
to search for folios of a certain size"?
Or is this The Only Way To Do It, after all? I know we had some discussion
about it, and intuitively it feels reasonable, but wanted to poke at it
one last time anyway.
> + */
> +
> + order = first_order(orders);
> +
> + while (orders) {
> + unsigned long hpage_size = PAGE_SIZE << order;
> + unsigned long haddr = ALIGN_DOWN(addr, hpage_size);
> +
> + if (haddr >= vma->vm_start &&
> + haddr + hpage_size <= vma->vm_end) {
> + if (!vma_is_anonymous(vma)) {
> + if (IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) -
> + vma->vm_pgoff,
> + hpage_size >> PAGE_SHIFT))
> + break;
> + } else
> + break;
> + }
> +
> + order = next_order(&orders, order);
> }
>
> - haddr = addr & HPAGE_PMD_MASK;
> -
> - if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
> - return false;
> - return true;
> + return orders;
> }
>
> static inline bool file_thp_enabled(struct vm_area_struct *vma)
> @@ -130,8 +166,9 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
> !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
> }
>
> -bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
> - bool smaps, bool in_pf, bool enforce_sysfs);
> +unsigned long hugepage_vma_check(struct vm_area_struct *vma,
> + unsigned long vm_flags, bool smaps, bool in_pf,
> + bool enforce_sysfs, unsigned long orders);
>
> #define transparent_hugepage_use_zero_page() \
> (transparent_hugepage_flags & \
> @@ -267,17 +304,18 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)
> return false;
> }
>
> -static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
> - unsigned long addr)
> +static inline unsigned long transhuge_vma_suitable(struct vm_area_struct *vma,
> + unsigned long addr, unsigned long orders)
> {
> - return false;
> + return 0;
> }
>
> -static inline bool hugepage_vma_check(struct vm_area_struct *vma,
> - unsigned long vm_flags, bool smaps,
> - bool in_pf, bool enforce_sysfs)
> +static inline unsigned long hugepage_vma_check(struct vm_area_struct *vma,
> + unsigned long vm_flags, bool smaps,
> + bool in_pf, bool enforce_sysfs,
> + unsigned long orders)
> {
> - return false;
> + return 0;
> }
>
> static inline void folio_prep_large_rmappable(struct folio *folio) {}
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 6eb55f97a3d2..0b545d56420c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -74,12 +74,60 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> static atomic_t huge_zero_refcount;
> struct page *huge_zero_page __read_mostly;
> unsigned long huge_zero_pfn __read_mostly = ~0UL;
> +static unsigned long huge_anon_orders_always __read_mostly;
> +static unsigned long huge_anon_orders_madvise __read_mostly;
> +static unsigned long huge_anon_orders_global __read_mostly;
>
> -bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
> - bool smaps, bool in_pf, bool enforce_sysfs)
> +#define hugepage_global_enabled() \
> + (transparent_hugepage_flags & \
> + ((1<<TRANSPARENT_HUGEPAGE_FLAG) | \
> + (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)))
> +
> +#define hugepage_global_always() \
> + (transparent_hugepage_flags & \
> + (1<<TRANSPARENT_HUGEPAGE_FLAG))
> +
Here again, I'd like to request that we avoid macros, as I don't think
they are strictly required.
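Something like this, perhaps (untested, just a sketch of the shape I have in
mind; same semantics as the macros above):

static inline bool hugepage_global_enabled(void)
{
	return transparent_hugepage_flags &
	       ((1 << TRANSPARENT_HUGEPAGE_FLAG) |
		(1 << TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG));
}

static inline bool hugepage_global_always(void)
{
	return transparent_hugepage_flags &
	       (1 << TRANSPARENT_HUGEPAGE_FLAG);
}

That also gets you the compiler's type checking for free.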
> +bool hugepage_flags_enabled(void)
> {
> + /*
> + * We cover both the anon and the file-backed case here; we must return
> + * true if globally enabled, even when all anon sizes are set to never.
> + * So we don't need to look at huge_anon_orders_global.
> + */
> + return hugepage_global_enabled() ||
> + huge_anon_orders_always ||
> + huge_anon_orders_madvise;
> +}
> +
> +/**
> + * hugepage_vma_check - determine which hugepage orders can be applied to vma
> + * @vma: the vm area to check
> + * @vm_flags: use these vm_flags instead of vma->vm_flags
> + * @smaps: whether answer will be used for smaps file
> + * @in_pf: whether answer will be used by page fault handler
> + * @enforce_sysfs: whether sysfs config should be taken into account
> + * @orders: bitfield of all orders to consider
> + *
> + * Calculates the intersection of the requested hugepage orders and the allowed
> + * hugepage orders for the provided vma. Permitted orders are encoded as a set
> + * bit at the corresponding bit position (bit-2 corresponds to order-2, bit-3
> + * corresponds to order-3, etc). Order-0 is never considered a hugepage order.
> + *
> + * Return: bitfield of orders allowed for hugepage in the vma. 0 if no hugepage
> + * orders are allowed.
> + */
> +unsigned long hugepage_vma_check(struct vm_area_struct *vma,
> + unsigned long vm_flags, bool smaps, bool in_pf,
> + bool enforce_sysfs, unsigned long orders)
Here again, a bool return type has been changed to a bitfield. Let's
also update the function name. Maybe one of these:
hugepage_vma_orders()
hugepage_vma_allowable_orders()
?
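And just to illustrate why the bitfield return still reads well at the call
sites: a caller that passes a single order can keep treating the result as a
boolean, while a caller passing several orders can walk the result with the
helpers this patch adds. Roughly like this (hypothetical call-site fragment,
untested, only to show the intended usage of the return value):

	unsigned long orders;
	int order;

	/* Which of the anon orders are allowed for this vma? */
	orders = hugepage_vma_check(vma, vma->vm_flags, false, true, true,
				    THP_ORDERS_ALL_ANON);
	/* Filter out orders that don't fit/align at this address. */
	orders = transhuge_vma_suitable(vma, addr, orders);

	/* Try each remaining order, highest first, until one succeeds. */
	for (order = first_order(orders); orders;
	     order = next_order(&orders, order)) {
		/* ... attempt a folio allocation of this order ... */
	}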
thanks,
--
John Hubbard
NVIDIA