Date:   Thu, 28 Apr 2022 12:34:07 -0400
From:   Peter Xu <peterx@...hat.com>
To:     David Hildenbrand <david@...hat.com>
Cc:     Bibo Mao <maobibo@...ngson.cn>,
        Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, Yang Shi <shy828301@...il.com>
Subject: Re: [PATCH v3] mm/khugepaged: sched to numa node when collapse huge
 page

On Thu, Apr 28, 2022 at 05:17:07PM +0200, David Hildenbrand wrote:
> On 17.03.22 07:50, Bibo Mao wrote:
> > collapse huge page will copy huge page from general small pages,
> > dest node is calculated from most one of source pages, however
> > THP daemon is not scheduled on dest node. The performance may be
> > poor since huge page copying across nodes, also cache is not used
> > for target node. With this patch, khugepaged daemon switches to
> > the same numa node with huge page. It saves copying time and makes
> > use of local cache better.
> > 
> > With this patch, specint 2006 base performance is improved with 6%
> > on Loongson 3C5000L platform with 32 cores and 8 numa nodes.
> 
> If it helps, that's nice as long as it doesn't hurt other cases.
> 
> > 
> > Signed-off-by: Bibo Mao <maobibo@...ngson.cn>
> > ---
> > changelog:
> > V2: remove node record for thp daemon
> > V3: remove unlikely statement
> > ---
> >  mm/khugepaged.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 131492fd1148..b3cf0885f5a2 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1066,6 +1066,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> >  	struct vm_area_struct *vma;
> >  	struct mmu_notifier_range range;
> >  	gfp_t gfp;
> > +	const struct cpumask *cpumask;
> >  
> >  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >  
> > @@ -1079,6 +1080,13 @@ static void collapse_huge_page(struct mm_struct *mm,
> >  	 * that. We will recheck the vma after taking it again in write mode.
> >  	 */
> >  	mmap_read_unlock(mm);
> > +
> > +	/* sched to specified node before huage page memory copy */
> 
> huage? I assume "huge"
> 
> > +	if (task_node(current) != node) {
> > +		cpumask = cpumask_of_node(node);
> > +		if (!cpumask_empty(cpumask))
> > +			set_cpus_allowed_ptr(current, cpumask);
> > +	}
> 
> I wonder if that will always be optimized out without NUMA and if we
> want to check for IS_ENABLED(CONFIG_NUMA).
> 
> 
> Regarding comments from others, I agree: I think what we'd actually want
> is something like "try to reschedule to one of these CPUs immediately.
> If they are all busy, just stay here."
> 
> 
> Also, I do wonder if there could already be scenarios where someone
> wants to let khugepaged run only on selected housekeeping CPUs (e.g.,
> when pinning VCPUs in a VM to physical CPUs). It might even degrade the
> VM performance in that case if we schedule something unrelated on these
> CPUs. (I don't know which interfaces we might already have to configure
> housekeeping CPUs for kthreads).
> 
> I can spot in kernel/kthread.c:kthread()
> 
> set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_KTHREAD));
> 
> Hmmmmm ...

Yes, that's a valid point. For RT, afaik, many users tune the kernel threads
specifically on demand by pinning them.  So I'm not sure whether this new
algorithm could break some existing users, by either (1) trying to pin
khugepaged onto some isolated cores (which can cause spikes?), or (2) messing
up the admin's previous pin settings on the khugepaged kthread.
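
To make point (2) concrete, here is a minimal user-space sketch (not kernel
code, and not what the patch does) of one way to respect an existing pin:
only narrow the thread to CPUs that are both on the target node and already
allowed, and otherwise leave the mask alone. The cpu_mask_t type and
pick_collapse_mask() are hypothetical names for illustration; a real version
would have to work with struct cpumask and the scheduler's rules.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model: one bit per CPU, up to 64 CPUs. */
typedef uint64_t cpu_mask_t;

/*
 * Return the mask the thread should run on for a collapse that targets
 * one node.
 *   allowed:   CPUs the admin (or housekeeping setup) currently permits
 *   node_cpus: CPUs that belong to the target node
 * If no permitted CPU is on the target node, keep the existing mask
 * instead of overriding the admin's choice.
 */
static cpu_mask_t pick_collapse_mask(cpu_mask_t allowed, cpu_mask_t node_cpus)
{
	cpu_mask_t both = allowed & node_cpus;

	assert(allowed != 0);	/* a thread must always have somewhere to run */
	return both ? both : allowed;
}
```

With allowed = 0x0F and node_cpus = 0x3C this picks 0x0C; with allowed = 0x03
and node_cpus = 0xF0 there is no overlap, so it stays on 0x03.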

The other thing is that khugepaged's movement across cores seems to be quite
random, because the pages it scans can be unpredictably spread across
different NUMA nodes. So logically it can easily start bouncing on some hosts,
and that does sound questionable, as I raised previously (purely as a
question) on the 2nd point, irrespective of the benchmark result.
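
As a rough illustration of the bouncing concern, the user-space sketch below
counts how often a thread would change nodes if it always followed the target
node of each collapse in sequence. count_node_switches() and its inputs are
hypothetical; this only models the "follow the target node" policy and says
nothing about the actual cost of a migration.

```c
#include <stddef.h>

/*
 * Count how many times a thread would switch NUMA nodes if it moved to the
 * target node of every collapse, in order.
 *   target_nodes: target node of each successive collapse
 *   n:            number of collapses
 *   start_node:   node the thread starts on
 */
static size_t count_node_switches(const int *target_nodes, size_t n,
				  int start_node)
{
	size_t switches = 0;
	int cur = start_node;

	for (size_t i = 0; i < n; i++) {
		if (target_nodes[i] != cur) {
			cur = target_nodes[i];	/* thread follows the pages */
			switches++;
		}
	}
	return switches;
}
```

For example, starting on node 0 with targets {0, 1, 1, 2, 0}, the thread
switches nodes three times in five collapses.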

-- 
Peter Xu
