linux-kernel - Re: [PATCH] sched/numa: Stagger NUMA balancing scan periods for new threads

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <20180427073021.76egd5j3tyj4kp7d@techsingularity.net>
Date:   Fri, 27 Apr 2018 08:30:21 +0100
From:   Mel Gorman <mgorman@...hsingularity.net>
To:     Ingo Molnar <mingo@...nel.org>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Mike Galbraith <efault@....de>,
        Matt Fleming <matt@...eblueprint.co.uk>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] sched/numa: Stagger NUMA balancing scan periods for new
 threads

On Fri, Apr 27, 2018 at 08:50:57AM +0200, Ingo Molnar wrote:
> >                             4.17.0-rc1             4.17.0-rc1
> >                                vanilla           stagger-v1r1
> > sys-time-bt.D         48.78 (   0.00%)       48.22 (   1.15%)
> > sys-time-cg.D         25.31 (   0.00%)       26.63 (  -5.22%)
> > sys-time-ep.D          1.65 (   0.00%)        0.62 (  62.42%)
> > sys-time-is.D         40.05 (   0.00%)       24.45 (  38.95%)
> > sys-time-lu.D         37.55 (   0.00%)       29.02 (  22.72%)
> > sys-time-mg.D         47.52 (   0.00%)       34.92 (  26.52%)
> > sys-time-sp.D        119.01 (   0.00%)      109.05 (   8.37%)
> > sys-time-ua.D         51.52 (   0.00%)       45.13 (  12.40%)
> > 
> > NUMA scan activity is reduced as well as other balancing activity.
> > 
> > NUMA alloc local               1042828     1342670
> > NUMA base PTE updates        140481138    93577468
> > NUMA huge PMD updates           272171      180766
> > NUMA page range updates      279832690   186129660
> > NUMA hint faults               1395972     1193897
> > NUMA hint local faults          877925      855053
> > NUMA hint local percent             62          71
> > NUMA pages migrated           12057909     9158023
> 
> Looks like a nice reduction in scanning overhead - which was always the main worry 
> with the fault based active NUMA balancing technique.
> 
> I have a couple of minor code cleanliness nit:
> 

No problem.

> > +void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
> > +{
> > +	int mm_users = 0;
> > +
> > +	if (p->mm) {
> > +		mm_users = atomic_read(&p->mm->mm_users);
> > +		if (mm_users == 1) {
> > +			p->mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
> > +			p->mm->numa_scan_seq = 0;
> > +		}
> > +	}
> > +	p->node_stamp = 0ULL;
> > +	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
> > +	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
> > +	p->numa_work.next = &p->numa_work;
> > +	p->numa_faults = NULL;
> > +	p->last_task_numa_placement = 0;
> > +	p->last_sum_exec_runtime = 0;
> > +	p->numa_group = NULL;
> 
> While this is pre-existing code that you moved, could we please use a bit more 
> organization to make this more readable:
> 
> 	p->node_stamp			= 0ULL;
> 	p->numa_scan_seq		= p->mm ? p->mm->numa_scan_seq : 0;
> 	p->numa_scan_period		= sysctl_numa_balancing_scan_delay;
> 	p->numa_work.next		= &p->numa_work;
> 	p->numa_faults			= NULL;
> 	p->last_task_numa_placement	= 0;
> 	p->last_sum_exec_runtime	= 0;
> 	p->numa_group			= NULL;
> 
> ?
> 
> This form made me notice a detail: the 0ULL asymmetry looks weird, this integer 
> literal type specification is entirely superfluous here, we can just write '0'.
> 

Can do. I'll revise it and should send a new version on Monday. You're
right that the type is superfluous, this was simply a code movement so I
kept the structure.

> > +	/* New address space */
> > +	if (!(clone_flags & CLONE_VM)) {
> > +		p->numa_preferred_nid = -1;
> > +		return;
> > +	}
> > +
> > +	/* New thread, use existing preferred nid but stagger scans */
> > +	if (p->mm) {
> > +		unsigned int delay;
> > +
> > +		delay = min_t(unsigned int, task_scan_max(current),
> > +			current->numa_scan_period * mm_users * NSEC_PER_MSEC);
> > +		delay += 2 * TICK_NSEC;
> > +		p->numa_preferred_nid = current->numa_preferred_nid;
> > +		p->node_stamp = delay;
> > +	}
> 
> So this is a fork time function, shouldn't p->numa_preferred_nid be equal to 
> current->numa_preferred_nid already?
> 

It should but I see no harm in being explicit.

> This is what happens in the !p->mm && CLONE_VM case anyway, right?
> 

!p->mm should never have changed numa_preferred_nid and it's useless
information anyway. Kernel threads do not have a user address space to
sample and migrate.

> So we could leave out the superfluous assignment and make it clear in a comment 
> that we inherit the parent's ->numa_preferred_nid intentionally.
> 

I can do that.

> Also, there's a lot of p->mm use, could we add this helper local variable to 
> simplify the code some more:
> 
> 	struct mm_struct *mm = p->mm;
> 
> ?

That should be harmless.

Thanks!

-- 
Mel Gorman
SUSE Labs