lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <0cbb1f29-61a2-40ab-8b78-f7550ef41532@lucifer.local>
Date: Tue, 20 May 2025 16:31:47 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Usama Arif <usamaarif642@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, david@...hat.com,
        linux-mm@...ck.org, hannes@...xchg.org, shakeel.butt@...ux.dev,
        riel@...riel.com, ziy@...dia.com, laoar.shao@...il.com,
        baolin.wang@...ux.alibaba.com, Liam.Howlett@...cle.com,
        npache@...hat.com, ryan.roberts@....com, vbabka@...e.cz,
        jannh@...gle.com, Arnd Bergmann <arnd@...db.de>,
        linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
        kernel-team@...a.com
Subject: Re: [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside
 of hugepage_madvise

On Tue, May 20, 2025 at 03:57:35PM +0100, Usama Arif wrote:
>
>
> On 20/05/2025 15:43, Lorenzo Stoakes wrote:
> > This commit message is really poor. You're also not mentioning that you're
> > changing s390 behaviour?
> >
> > On Mon, May 19, 2025 at 11:29:53PM +0100, Usama Arif wrote:
> >> This is so that flag setting can be resused later in other functions,
> >
> > Typo.
> >
> >> to reduce code duplication (including the s390 exception).
> >>
> >> No functional change intended with this patch.
> >
> > I'm pretty sure somebody reviewed that this should just be merged with whatever
> > uses this? I'm not sure this is all that valuable as you're not really changing
> > this structurally very much.
> >
>
> Please see patch 2 where hugepage_set_vmflags is reused.
> I was just trying to follow your feedback from previous revision that the flag
> setting and s390 code part is duplicate code and should be common in the prctl
> and madvise function.

Sure, but I think it'd be better as part of that patch probably. Perhaps I
was thinking of another comment in reference to a 'no function change'
remark.

>
> I realize I messed up the arg not having vma and the order of the if statement.

I am getting the strong impression here that you're rushing :)

I strongly suggest slowing thing down here. We're in RC7, this is (or
should be) an RFC for us to explore concepts. There's no need for it.

I appreciate your input and enthusiasm, but clearly rushing is causing you
to make mistakes. I get it, we've all been there.

But right now we have what 5 maybe? THP series in-flight at the same time,
all touching similar stuff, and it'll make everybody's lives easier and
less chaotic if we take a little more time to assess.

We are ultimately going to choose what's best for the kernel, there's no
'race' as to which series is 'ready' first.

>
> >>
> >> Signed-off-by: Usama Arif <usamaarif642@...il.com>
> >
> > Yeah I'm not a fan of this patch, it's buggy and really unclear what the
> > purpose is here.
>
> No functional change was intended (I realized the order below broke it but can be fixed).
>
> In the previous revision it was:
> +       case PR_SET_THP_POLICY:
> +               if (arg3 || arg4 || arg5)
> +                       return -EINVAL;
> +               if (mmap_write_lock_killable(me->mm))
> +                       return -EINTR;
> +               switch (arg2) {
> +               case PR_DEFAULT_MADV_HUGEPAGE:
> +                       if (!hugepage_global_enabled())
> +                               error = -EPERM;
> +#ifdef CONFIG_S390
> +                       /*
> +                       * qemu blindly sets MADV_HUGEPAGE on all allocations, but s390
> +                       * can't handle this properly after s390_enable_sie, so we simply
> +                       * ignore the madvise to prevent qemu from causing a SIGSEGV.
> +                       */
> +                       else if (mm_has_pgste(vma->vm_mm))
> +                               error = -EPERM;
> +#endif
> +                       else {
> +                               me->mm->def_flags &= ~VM_NOHUGEPAGE;
> +                               me->mm->def_flags |= VM_HUGEPAGE;
> +                               process_default_madv_hugepage(me->mm, MADV_HUGEPAGE);
> +                       }
> +                       break;
> ...
>
> Now with this hugepage_set_vmflags, it would be
>
> +       case PR_SET_THP_POLICY:
> +               if (arg3 || arg4 || arg5)
> +                       return -EINVAL;
> +               if (mmap_write_lock_killable(mm))
> +                       return -EINTR;
> +               switch (arg2) {
> +               case PR_DEFAULT_MADV_HUGEPAGE:
> +                       if (!hugepage_global_enabled())
> +                               error = -EPERM;
> +                       error = hugepage_set_vmflags(&mm->def_flags, MADV_HUGEPAGE);
> +                       if (!error)
> +                               process_default_madv_hugepage(mm, MADV_HUGEPAGE);
> +                       break;
>
>
> I am happy to go with either of the methods above, but was just trying to
> incorporate your feedback :)
>
> Would you like the method from previous version?

I'm going to go ahead and overlook what would be in the UK 100% a
deployment of the finest British sarcasm here, and assume not intended :)

Very obviously we do not want to duplicate architecture-specific code. I'm
a little concerned you're ok with both (imagine if one changed but not the
other for instance), but clearly this series is unmergeable without
de-duplicating this.

My objections here are that you submitted a totally broken patch with a
poor commit message that seems that it could well be merged with the
subsequent patch.

I also have concerns about your levels of testing here - you completely
broken MADV_NOHUGEPAGE but didn't notice? Are you running self-tests? Do we
have one that'd pick that up? If not, can we have one like that?

Thanks!

>
> >
> >> ---
> >>  include/linux/huge_mm.h |  1 +
> >>  mm/khugepaged.c         | 26 +++++++++++++++++---------
> >>  2 files changed, 18 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> index 2f190c90192d..23580a43787c 100644
> >> --- a/include/linux/huge_mm.h
> >> +++ b/include/linux/huge_mm.h
> >> @@ -431,6 +431,7 @@ change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >>  			__split_huge_pud(__vma, __pud, __address);	\
> >>  	}  while (0)
> >>
> >> +int hugepage_set_vmflags(unsigned long *vm_flags, int advice);
> >>  int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
> >>  		     int advice);
> >>  int madvise_collapse(struct vm_area_struct *vma,
> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> index b04b6a770afe..ab3427c87422 100644
> >> --- a/mm/khugepaged.c
> >> +++ b/mm/khugepaged.c
> >> @@ -346,8 +346,7 @@ struct attribute_group khugepaged_attr_group = {
> >>  };
> >>  #endif /* CONFIG_SYSFS */
> >>
> >> -int hugepage_madvise(struct vm_area_struct *vma,
> >> -		     unsigned long *vm_flags, int advice)
> >> +int hugepage_set_vmflags(unsigned long *vm_flags, int advice)
> >
> >
> >>  {
> >>  	switch (advice) {
> >>  	case MADV_HUGEPAGE:
> >> @@ -358,16 +357,10 @@ int hugepage_madvise(struct vm_area_struct *vma,
> >>  		 * ignore the madvise to prevent qemu from causing a SIGSEGV.
> >>  		 */
> >>  		if (mm_has_pgste(vma->vm_mm))
> >
> > This is broken, you refer to vma which doesn't exist.
> >
> > As the kernel bots are telling you...
> >
> >> -			return 0;
> >> +			return -EPERM;
> >
> > Why are you now returning an error?
> >
> > This seems like a super broken way of making the caller return 0. Just make this
> > whole thing a bool return if you're going to treat it like a boolean function.
> >
> >>  #endif
> >>  		*vm_flags &= ~VM_NOHUGEPAGE;
> >>  		*vm_flags |= VM_HUGEPAGE;
> >> -		/*
> >> -		 * If the vma become good for khugepaged to scan,
> >> -		 * register it here without waiting a page fault that
> >> -		 * may not happen any time soon.
> >> -		 */
> >> -		khugepaged_enter_vma(vma, *vm_flags);
> >>  		break;
> >>  	case MADV_NOHUGEPAGE:
> >>  		*vm_flags &= ~VM_HUGEPAGE;
> >> @@ -383,6 +376,21 @@ int hugepage_madvise(struct vm_area_struct *vma,
> >>  	return 0;
> >>  }
> >>
> >> +int hugepage_madvise(struct vm_area_struct *vma,
> >> +		     unsigned long *vm_flags, int advice)
> >> +{
> >> +	if (advice == MADV_HUGEPAGE && !hugepage_set_vmflags(vm_flags, advice)) {
> >
> > So now you've completely broken MADV_NOHUGEPAGE haven't you?
> >
>
> Yeah order needs to be reversed.
>
> >> +		/*
> >> +		 * If the vma become good for khugepaged to scan,
> >> +		 * register it here without waiting a page fault that
> >> +		 * may not happen any time soon.
> >> +		 */
> >> +		khugepaged_enter_vma(vma, *vm_flags);
> >> +	}
> >> +
> >> +	return 0;
> >> +}
> >> +
> >>  int __init khugepaged_init(void)
> >>  {
> >>  	mm_slot_cache = KMEM_CACHE(khugepaged_mm_slot, 0);
> >> --
> >> 2.47.1
> >>
>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ