linux-kernel - Re: [PATCH 0/18] Basic scheduler support for automatic NUMA balancing V5

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20130731161141.GX2296@suse.de>
Date:	Wed, 31 Jul 2013 17:11:41 +0100
From:	Mel Gorman <mgorman@...e.de>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
	Ingo Molnar <mingo@...nel.org>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Linux-MM <linux-mm@...ck.org>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 0/18] Basic scheduler support for automatic NUMA
 balancing V5

On Wed, Jul 31, 2013 at 05:30:18PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 31, 2013 at 12:57:19PM +0100, Mel Gorman wrote:
> 
> > > Right, so what Ingo did is have the scan rate depend on the convergence.
> > > What exactly did you dislike about that?
> > > 
> > 
> > It depended entirely on properly detecting if we are converged or not. As
> > things like false share detection within THP is still not there I was
> > worried that it was too easy to make the wrong decision here and keep it
> > pinned at the maximum scan rate.
> > 
> > > We could define the convergence as all the faults inside the interleave
> > > mask vs the total faults, and then run at: min + (1 - c)*(max-min).
> > > 
> > 
> > And when we have such things properly in place then I think we can kick
> > away the current crutch.
> 
> OK, so I'll go write that patch I suppose ;-)
> 
> > > Ah, well the reasoning on that was that all this NUMA business is
> > > 'expensive' so we'd better only bother with tasks that persist long
> > > enough for it to pay off.
> > > 
> > 
> > Which is fair enough but tasks that lasted *just* longer than the interval
> > still got punished. Processes running with a slightly slower CPU gets
> > hurts meaning that it would be a difficult bug report to digest.
> > 
> > > In that regard it makes perfect sense to wait a fixed amount of runtime
> > > before we start scanning.
> > > 
> > > So it was not a pure hack to make kbuild work again.. that is did was
> > > good though.
> > > 
> > 
> > Maybe we should reintroduce the delay then but I really would prefer that
> > it was triggered on some sort of event.
> 
> Humm:
> 
> kernel/sched/fair.c:
> 
> /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
> unsigned int sysctl_numa_balancing_scan_delay = 1000;
> 
> 
> kernel/sched/core.c:__sched_fork():
> 
> 	numa_scan_period = sysctl_numa_balancing_scan_delay
> 
> 
> It seems its still there, no need to resuscitate.
> 

Yes, reverting 5bca23035391928c4c7301835accca3551b96cc2 effectively restores
the behaviour you are looking for. It just seems very crude. Then again,
I also should not have left the scan delay on top of the first_nid
check.

> I share your preference for a clear event, although nothing really comes
> to mind. The entire multi-process space seems devoid of useful triggers.
> 

RSS was another option it felt as arbitrary as a plain delay.

Should I revert 5bca23035391928c4c7301835accca3551b96cc2 with an
explanation that it potentially is completely useless in the purely
multi-process shared case?

> > > On that rate-limit, this looks to be a hard-coded number unrelated to
> > > the actual hardware.
> > 
> > Guesstimate.
> > 
> > > I think we should at the very least make it a
> > > configurable number and preferably scale the number with the SLIT info.
> > > Or alternatively actually measure the node to node bandwidth.
> > > 
> > 
> > Ideally we should just kick it away because scan rate limiting works
> > properly. Lets not make it a tunable just yet so we can avoid having to
> > deprecate it later.
> 
> I'm not seeing how the rate-limit as per the convergence is going to
> help here.

It should reduce the potential number of NUMA hinting faults that can be
incurred. However, I accept your point because even it does not directly
avoid a large number of migration events.

> Suppose we migrate the task to another node and its going to
> stay there. Then our convergence is going down to 0 (all our memory is
> remote) so we end up at the max scan rate migrating every single page
> ASAP.
> 
> This would completely and utterly saturate any interconnect.
> 

Good point and we'd arrive back at rate limiting the migration in an
attempt to avoid it.

> Also, in the case we don't have a fully connected system the memory
> transfers will need multiple hops, which greatly complicates the entire
> accounting trick :-)
> 

Also unfortunately true. The larger the machine, the more likely this
becomes.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/