Message-ID: <004ed23d-5571-4474-b7fe-7bc08817c165@redhat.com>
Date: Tue, 5 Dec 2023 17:57:58 +0100
From: David Hildenbrand <david@...hat.com>
To: Ryan Roberts <ryan.roberts@....com>,
Andrew Morton <akpm@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>,
Yin Fengwei <fengwei.yin@...el.com>,
Yu Zhao <yuzhao@...gle.com>,
Catalin Marinas <catalin.marinas@....com>,
Anshuman Khandual <anshuman.khandual@....com>,
Yang Shi <shy828301@...il.com>,
"Huang, Ying" <ying.huang@...el.com>, Zi Yan <ziy@...dia.com>,
Luis Chamberlain <mcgrof@...nel.org>,
Itaru Kitayama <itaru.kitayama@...il.com>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
John Hubbard <jhubbard@...dia.com>,
David Rientjes <rientjes@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>,
Hugh Dickins <hughd@...gle.com>,
Kefeng Wang <wangkefeng.wang@...wei.com>,
Barry Song <21cnbao@...il.com>,
Alistair Popple <apopple@...dia.com>
Cc: linux-mm@...ck.org, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs
interface
On 04.12.23 11:20, Ryan Roberts wrote:
> In preparation for adding support for anonymous multi-size THP,
> introduce new sysfs structure that will be used to control the new
> behaviours. A new directory is added under transparent_hugepage for each
> supported THP size, and contains an `enabled` file, which can be set to
> "inherit" (to inherit the global setting), "always", "madvise" or
> "never". For now, the kernel still only supports PMD-sized anonymous
> THP, so only 1 directory is populated.
>
> The first half of the change converts transhuge_vma_suitable() and
> hugepage_vma_check() so that they take a bitfield of orders for which
> the user wants to determine support, and the functions filter out all
> the orders that can't be supported, given the current sysfs
> configuration and the VMA dimensions. If there is only 1 order set in
> the input then the output can continue to be treated like a boolean;
> this is the case for most call sites. The resulting functions are
> renamed to thp_vma_suitable_orders() and thp_vma_allowable_orders()
> respectively.
>
> The second half of the change implements the new sysfs interface. It has
> been done so that each supported THP size has a `struct thpsize`, which
> describes the relevant metadata and is itself a kobject. This is pretty
> minimal for now, but should make it easy to add new per-thpsize files to
> the interface if needed in future (e.g. per-size defrag). Rather than
> keep the `enabled` state directly in the struct thpsize, I've elected to
> directly encode it into huge_anon_orders_[always|madvise|inherit]
> bitfields since this reduces the amount of work required in
> thp_vma_allowable_orders() which is called for every page fault.
>
> See Documentation/admin-guide/mm/transhuge.rst, as modified by this
> commit, for details of how the new sysfs interface works.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@....com>
> ---
Some comments, mostly regarding thp_vma_allowable_orders() and friends.
In general, LGTM. I'll have to go over the order logic once again; I
got a bit lost once we started mixing anon and file orders.
[...]
Doc updates all looked good to me, skimming over them.
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index fa0350b0812a..bd0eadd3befb 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
[...]
> +static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
> + unsigned long addr, unsigned long orders)
> +{
> + int order;
> +
> + /*
> + * Iterate over orders, highest to lowest, removing orders that don't
> + * meet alignment requirements from the set. Exit loop at first order
> + * that meets requirements, since all lower orders must also meet
> + * requirements.
> + */
> +
> + order = first_order(orders);
nit: "highest_order" or "largest_order" would be more expressive
regarding the actual semantics.
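Untested sketch, assuming first_order() is just a lookup of the highest
set bit via fls_long() from linux/bitops.h, as I'd expect:

static inline int highest_order(unsigned long orders)
{
	/* The highest set bit corresponds to the largest order. */
	return fls_long(orders) - 1;
}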
> +
> + while (orders) {
> + unsigned long hpage_size = PAGE_SIZE << order;
> + unsigned long haddr = ALIGN_DOWN(addr, hpage_size);
> +
> + if (haddr >= vma->vm_start &&
> + haddr + hpage_size <= vma->vm_end) {
> + if (!vma_is_anonymous(vma)) {
> + if (IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) -
> + vma->vm_pgoff,
> + hpage_size >> PAGE_SHIFT))
> + break;
> + } else
> + break;
Comment: Coding style wants you to use if () {} else {} (braces on both
branches once one of them needs braces).
But I'd recommend restructuring the conditions:

	if (haddr < vma->vm_start ||
	    haddr + hpage_size > vma->vm_end)
		continue;

	/* Don't have to check pgoff for anonymous vma */
	if (vma_is_anonymous(vma))
		break;

	if (IS_ALIGNED((...
		break;
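Spelled out completely (untested; note that the loop tail's
order = next_order(&orders, order) from your patch has to happen before
the continue, so it moves into that branch):

	order = highest_order(orders);

	while (orders) {
		unsigned long hpage_size = PAGE_SIZE << order;
		unsigned long haddr = ALIGN_DOWN(addr, hpage_size);

		/* Drop this order if the range doesn't fit the VMA. */
		if (haddr < vma->vm_start ||
		    haddr + hpage_size > vma->vm_end) {
			order = next_order(&orders, order);
			continue;
		}

		/* Don't have to check pgoff for anonymous vma */
		if (vma_is_anonymous(vma))
			break;

		if (IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
			       hpage_size >> PAGE_SHIFT))
			break;

		order = next_order(&orders, order);
	}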
[...]
> +/**
> + * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
> + * @vma: the vm area to check
> + * @vm_flags: use these vm_flags instead of vma->vm_flags
> + * @smaps: whether answer will be used for smaps file
> + * @in_pf: whether answer will be used by page fault handler
> + * @enforce_sysfs: whether sysfs config should be taken into account
> + * @orders: bitfield of all orders to consider
> + *
> + * Calculates the intersection of the requested hugepage orders and the allowed
> + * hugepage orders for the provided vma. Permitted orders are encoded as a set
> + * bit at the corresponding bit position (bit-2 corresponds to order-2, bit-3
> + * corresponds to order-3, etc). Order-0 is never considered a hugepage order.
> + *
> + * Return: bitfield of orders allowed for hugepage in the vma. 0 if no hugepage
> + * orders are allowed.
> + */
> +unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> + unsigned long vm_flags, bool smaps,
> + bool in_pf, bool enforce_sysfs,
> + unsigned long orders)
> +{
> + /* Check the intersection of requested and supported orders. */
> + orders &= vma_is_anonymous(vma) ?
> + THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;
> + if (!orders)
> + return 0;
Comment: if this is called from some hot path, we might want to move as
much as possible into a header, so we can avoid the function call
entirely when, e.g., THP is completely disabled.
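For illustration only (untested; __thp_vma_allowable_orders() is a
made-up name for the remaining out-of-line part):

/* include/linux/huge_mm.h */
unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
					 unsigned long vm_flags, bool smaps,
					 bool in_pf, bool enforce_sysfs,
					 unsigned long orders);

static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
						     unsigned long vm_flags,
						     bool smaps, bool in_pf,
						     bool enforce_sysfs,
						     unsigned long orders)
{
	/* Cheap checks first, so the common "disabled" case stays inline. */
	orders &= vma_is_anonymous(vma) ?
			THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;
	if (!orders)
		return 0;

	return __thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf,
					  enforce_sysfs, orders);
}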
> +
> if (!vma->vm_mm) /* vdso */
> - return false;
> + return 0;
>
> /*
> * Explicitly disabled through madvise or prctl, or some
> @@ -88,16 +141,16 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
> * */
> if ((vm_flags & VM_NOHUGEPAGE) ||
> test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
> - return false;
> + return 0;
> /*
> * If the hardware/firmware marked hugepage support disabled.
> */
> if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
> - return false;
> + return 0;
>
> /* khugepaged doesn't collapse DAX vma, but page fault is fine. */
> if (vma_is_dax(vma))
> - return in_pf;
> + return in_pf ? orders : 0;
>
> /*
> * khugepaged special VMA and hugetlb VMA.
> @@ -105,17 +158,29 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
> * VM_MIXEDMAP set.
> */
> if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
> - return false;
> + return 0;
>
> /*
> - * Check alignment for file vma and size for both file and anon vma.
> + * Check alignment for file vma and size for both file and anon vma by
> + * filtering out the unsuitable orders.
> *
> * Skip the check for page fault. Huge fault does the check in fault
> - * handlers. And this check is not suitable for huge PUD fault.
> + * handlers.
> */
> - if (!in_pf &&
> - !transhuge_vma_suitable(vma, (vma->vm_end - HPAGE_PMD_SIZE)))
> - return false;
> + if (!in_pf) {
> + int order = first_order(orders);
> + unsigned long addr;
> +
> + while (orders) {
> + addr = vma->vm_end - (PAGE_SIZE << order);
> + if (thp_vma_suitable_orders(vma, addr, BIT(order)))
> + break;
Comment: you'd want a "thp_vma_suitable_order" helper here. But maybe
the compiler is smart enough to optimize the loop and everything else out.
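Untested, but something as simple as:

static inline bool thp_vma_suitable_order(struct vm_area_struct *vma,
					  unsigned long addr, int order)
{
	return thp_vma_suitable_orders(vma, addr, BIT(order));
}

would already let you write thp_vma_suitable_order(vma, addr, order) in
this loop.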
[...]
> +
> +static ssize_t thpsize_enabled_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + int order = to_thpsize(kobj)->order;
> + ssize_t ret = count;
> +
> + if (sysfs_streq(buf, "always")) {
> + set_bit(order, &huge_anon_orders_always);
> + clear_bit(order, &huge_anon_orders_inherit);
> + clear_bit(order, &huge_anon_orders_madvise);
> + } else if (sysfs_streq(buf, "inherit")) {
> + set_bit(order, &huge_anon_orders_inherit);
> + clear_bit(order, &huge_anon_orders_always);
> + clear_bit(order, &huge_anon_orders_madvise);
> + } else if (sysfs_streq(buf, "madvise")) {
> + set_bit(order, &huge_anon_orders_madvise);
> + clear_bit(order, &huge_anon_orders_always);
> + clear_bit(order, &huge_anon_orders_inherit);
> + } else if (sysfs_streq(buf, "never")) {
> + clear_bit(order, &huge_anon_orders_always);
> + clear_bit(order, &huge_anon_orders_inherit);
> + clear_bit(order, &huge_anon_orders_madvise);
Note: I was wondering for a second if some concurrent changes could
lead to an inconsistent state. I think in the worst case we'll simply
end up with "never" on races.
[...]
> static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
> {
> int err;
> + struct thpsize *thpsize;
> + unsigned long orders;
> + int order;
> +
> + /*
> + * Default to setting PMD-sized THP to inherit the global setting and
> + * disable all other sizes. powerpc's PMD_ORDER isn't a compile-time
> + * constant so we have to do this here.
> + */
> + huge_anon_orders_inherit = BIT(PMD_ORDER);
>
> *hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
> if (unlikely(!*hugepage_kobj)) {
> @@ -434,8 +631,24 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
> goto remove_hp_group;
> }
>
> + orders = THP_ORDERS_ALL_ANON;
> + order = first_order(orders);
> + while (orders) {
> + thpsize = thpsize_create(order, *hugepage_kobj);
> + if (IS_ERR(thpsize)) {
> + pr_err("failed to create thpsize for order %d\n", order);
> + err = PTR_ERR(thpsize);
> + goto remove_all;
> + }
> + list_add(&thpsize->node, &thpsize_list);
> + order = next_order(&orders, order);
> + }
> +
> return 0;
>
[...]
> page = compound_head(page);
> @@ -5116,7 +5116,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> return VM_FAULT_OOM;
> retry_pud:
> if (pud_none(*vmf.pud) &&
> - hugepage_vma_check(vma, vm_flags, false, true, true)) {
> + thp_vma_allowable_orders(vma, vm_flags, false, true, true,
> + BIT(PUD_ORDER))) {
> ret = create_huge_pud(&vmf);
> if (!(ret & VM_FAULT_FALLBACK))
> return ret;
> @@ -5150,7 +5151,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> goto retry_pud;
>
> if (pmd_none(*vmf.pmd) &&
> - hugepage_vma_check(vma, vm_flags, false, true, true)) {
> + thp_vma_allowable_orders(vma, vm_flags, false, true, true,
> + BIT(PMD_ORDER))) {
Comment: A helper like "thp_vma_allowable_order(vma, PMD_ORDER)" might
make this easier to read -- and the implementation will be faster.
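e.g. (untested):

static inline bool thp_vma_allowable_order(struct vm_area_struct *vma,
					   unsigned long vm_flags, bool smaps,
					   bool in_pf, bool enforce_sysfs,
					   int order)
{
	return thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf,
					enforce_sysfs, BIT(order));
}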
> ret = create_huge_pmd(&vmf);
> if (!(ret & VM_FAULT_FALLBACK))
> return ret;
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index e0b368e545ed..64da127cc267 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -268,7 +268,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> * cleared *pmd but not decremented compound_mapcount().
> */
> if ((pvmw->flags & PVMW_SYNC) &&
> - transhuge_vma_suitable(vma, pvmw->address) &&
> + thp_vma_suitable_orders(vma, pvmw->address,
> + BIT(PMD_ORDER)) &&
Comment: Similarly, a helper like "thp_vma_suitable_order(vma,
PMD_ORDER)" might make this easier to read.
> (pvmw->nr_pages >= HPAGE_PMD_NR)) {
> spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
>
--
Cheers,
David / dhildenb