linux-kernel - Re: RFC vmstat: On demand vmstat threads

Message-ID: <alpine.DEB.2.02.1309201930590.4089@ionos.tec.linutronix.de>
Date:	Fri, 20 Sep 2013 23:03:05 +0200 (CEST)
From:	Thomas Gleixner <tglx@...utronix.de>
To:	Frederic Weisbecker <fweisbec@...il.com>
cc:	Christoph Lameter <cl@...ux.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Gilad Ben-Yossef <gilad@...yossef.com>,
	Tejun Heo <tj@...nel.org>, John Stultz <johnstul@...ibm.com>,
	Mike Frysinger <vapier@...too.org>,
	Minchan Kim <minchan.kim@...il.com>,
	Hakan Akkan <hakanakkan@...il.com>,
	Max Krasnyansky <maxk@...lcomm.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Linux-MM <linux-mm@...ck.org>,
	Peter Zijlstra <peterz@...radead.org>
Subject: Re: RFC vmstat: On demand vmstat threads

On Fri, 20 Sep 2013, Frederic Weisbecker wrote:
> On Fri, Sep 20, 2013 at 12:41:02PM +0200, Thomas Gleixner wrote:
> > On Thu, 19 Sep 2013, Christoph Lameter wrote:
> > > On Thu, 19 Sep 2013, Thomas Gleixner wrote:
> > > 
> > > > The vmstat accounting is not the only thing which we want to delegate
> > > > to dedicated core(s) for the full NOHZ mode.
> > > >
> > > > So instead of playing broken games with explicitly not exposed core
> > > > code variables, we should implement a core code facility which is
> > > > aware of the NOHZ details and provides a sane way to delegate stuff to
> > > > a certain subset of CPUs.
> > > 
> > > I would be happy to use such a facility. Otherwise I would just be adding
> > > yet another kernel option or boot parameter I guess.
> > 
> > Uuurgh, no.
> > 
> > The whole delegation stuff is necessary not just for vmstat. We have
> > the same issue for scheduler stats and other parts of the kernel, so
> > we are better off in having a core facility to schedule such functions
> > in consistency with the current full NOHZ state.
> 
> Agreed.
> 
> So we have the choice between having this performed from callers in
> the kernel with functions that enforce the affinity of some
> asynchronous tasks, like "schedule_on_timekeeper()" or
> "schedule_on_housekeepers()" with workqueues for example.

Why do you need different targets?

> Or we can add an interface to define the affinity of such things from
> userspace, at the ....

We already have the relevant information in the kernel. And it's not
too hard to come up with a rather simple and robust scheme for this.

In the following I use the terms enter/leave isolation mode in this
sense:

    Enter/leave isolation mode is when the full NOHZ mode is
    enabled/disabled for a cpu, not when the CPU actually
    enters/leaves that state (i.e. single cpu bound userspace task).

So what you want is something like this:

int housekeeping_register(int (*cb)(struct cpumask *mask),
			  unsigned period_ms, bool always);

cb: 	    the callback to execute. it processes the data for all cores
	    which are set in the cpumask handed in by the housekeeping
	    scheduler.

period_ms:  period of the callback, can be 0 for immediate
	    one time execution

always:     the always argument tells the core code whether to schedule
	    the callback unconditionally. If false it only schedules it
	    when the core enters isolation mode.
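
A minimal userspace sketch of what registering against such an
interface could look like. Only the housekeeping_register() prototype
comes from the proposal above; the one-word struct cpumask stand-in,
the fixed-size table, hk_should_schedule(), the count_cpus() example
callback and the -1 error return are assumptions made for
illustration, not existing kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HK_MAX_CALLBACKS 8	/* assumption: small fixed table */

/* Stand-in for the kernel's struct cpumask: one bit per cpu. */
struct cpumask {
	unsigned long bits;
};

/* One registered housekeeping callback (hypothetical layout). */
struct hk_entry {
	int (*cb)(struct cpumask *mask);
	unsigned period_ms;	/* 0 means immediate one time execution */
	bool always;		/* schedule even without isolated cpus */
};

static struct hk_entry hk_table[HK_MAX_CALLBACKS];
static size_t hk_count;

/* Record the callback; returns 0 on success, -1 on bad input or full table. */
int housekeeping_register(int (*cb)(struct cpumask *mask),
			  unsigned period_ms, bool always)
{
	if (!cb || hk_count == HK_MAX_CALLBACKS)
		return -1;
	hk_table[hk_count++] = (struct hk_entry){ cb, period_ms, always };
	return 0;
}

/*
 * "always" entries are scheduled unconditionally, the others only
 * once some cpu has entered isolation mode.
 */
bool hk_should_schedule(const struct hk_entry *e, bool any_isolated)
{
	return e->always || any_isolated;
}

/* Example callback: counts the cpus it was asked to process. */
int count_cpus(struct cpumask *mask)
{
	int n = 0;

	assert(mask != NULL);
	for (unsigned long b = mask->bits; b; b &= b - 1)
		n++;
	return n;
}
```

A vmstat style user would register with always = true, a scheduler
stats style user with always = false and a slow period.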

In the beginning we simply schedule the callbacks on each online cpu,
if the always bit is set. For the callbacks which are registered with
the always bit off, we schedule them only on entry into isolation
mode.

Now when a cpu becomes isolated we stop the callback scheduling on
that cpu and assign it to the cpu with the smallest NUMA
distance. So that cpu will process the data for itself and for the
newly isolated cpu.

When a cpu leaves isolation mode then it gets its housekeeping task
assigned back.

We need to be clever about the NOHZ idle interaction. If a cpu has
been assigned more than its own data to process, then it shouldn't use a
deferrable timer. CPUs which only take care of their own data can use
a deferrable timer.
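
The delegation and timer rules above can be sketched in plain C as
follows. This is a toy model under stated assumptions: a fixed 4 cpu
system, a made-up distance table standing in for real NUMA distances,
and hypothetical helper names (hk_init, enter_isolation,
leave_isolation, may_use_deferrable_timer). It also refuses to isolate
the last remaining housekeeping cpu, which the text above does not
discuss.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical distance table: cpus 0/1 and 2/3 share a node. */
static const int distance[NR_CPUS][NR_CPUS] = {
	{ 10, 10, 20, 20 },
	{ 10, 10, 20, 20 },
	{ 20, 20, 10, 10 },
	{ 20, 20, 10, 10 },
};

static bool isolated[NR_CPUS];
static int handler[NR_CPUS];	/* handler[c]: cpu processing c's data */

void hk_init(void)
{
	for (int c = 0; c < NR_CPUS; c++) {
		isolated[c] = false;
		handler[c] = c;
	}
}

/* Closest non isolated cpu other than @cpu, or -1 if there is none. */
static int nearest_housekeeper(int cpu)
{
	int best = -1;

	for (int c = 0; c < NR_CPUS; c++) {
		if (c == cpu || isolated[c])
			continue;
		if (best < 0 || distance[cpu][c] < distance[cpu][best])
			best = c;
	}
	return best;
}

/* Hand over @cpu's data, and anything it handled for others. */
int enter_isolation(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (nearest_housekeeper(cpu) < 0)
		return -1;	/* somebody has to do the housekeeping */
	isolated[cpu] = true;
	for (int c = 0; c < NR_CPUS; c++)
		if (handler[c] == cpu)
			handler[c] = nearest_housekeeper(c);
	return 0;
}

/* The cpu gets its own housekeeping task assigned back. */
void leave_isolation(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	isolated[cpu] = false;
	handler[cpu] = cpu;
}

/* Only cpus which take care of nothing but their own data may defer. */
bool may_use_deferrable_timer(int cpu)
{
	for (int c = 0; c < NR_CPUS; c++)
		if (c != cpu && handler[c] == cpu)
			return false;
	return true;
}
```

In this model isolating cpu 3 moves its data to cpu 2, so cpu 2 must
stop using a deferrable timer while cpus 0 and 1 may keep theirs.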

This works out of the box for stuff like vmstat, where the callback is
already done in a workqueue and we can register them with always =
true.

The scheduler stats are a slightly different beast, but it's not
rocket science to handle that.

We register the callback with always = false. So for a bog standard
system nothing happens, except the registering. Once the full NOHZ
mode is enabled on a cpu we schedule the work with a reasonably slow
period (e.g. 1 sec) on a non isolated cpu. That's where stuff gets
interesting.

On the isolated cpu we might still execute the scheduler tick because
we did not yet reach a condition to disable it. So we need to protect
the on cpu accounting against the scheduled one on the remote
cpu. Unfortunately that requires locking. The only reasonable lock
here is the runqueue lock of the isolated cpu. Though this sounds worse
than it is. We take the cpu local rq lock from the tick anyway in
scheduler_tick(). So we can move the account_process_tick() call to
this code. Zero impact for the non isolated case.

In the isolated case we might only get contention when the isolated
cpu was not yet able to disable the tick, but the remote update is
going to be slow anyway and that update can exit early when it notices
that the last on cpu update was less than a tick away.
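
A rough userspace model of that locking scheme, to make the early
exit concrete. Everything here is an assumption for illustration:
struct rq_model, the C11 atomic_flag spinlock standing in for the
runqueue lock, the 4 ms tick length (HZ=250) and all function names.
It is not how the scheduler implements its accounting.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TICK_NS 4000000ULL	/* assumption: HZ=250 */

/* Toy runqueue: a lock plus the accounting state it protects. */
struct rq_model {
	atomic_flag lock;
	uint64_t last_update_ns;	/* time of the last accounting */
	uint64_t accounted_ns;		/* runtime accounted so far */
};

void rq_init(struct rq_model *rq, uint64_t now_ns)
{
	atomic_flag_clear(&rq->lock);
	rq->last_update_ns = now_ns;
	rq->accounted_ns = 0;
}

static void rq_lock(struct rq_model *rq)
{
	while (atomic_flag_test_and_set_explicit(&rq->lock,
						 memory_order_acquire))
		;	/* spin */
}

static void rq_unlock(struct rq_model *rq)
{
	atomic_flag_clear_explicit(&rq->lock, memory_order_release);
}

/* Shared accounting step; the caller must hold the lock. */
static void account_locked(struct rq_model *rq, uint64_t now_ns)
{
	assert(now_ns >= rq->last_update_ns);	/* monotonic clock */
	rq->accounted_ns += now_ns - rq->last_update_ns;
	rq->last_update_ns = now_ns;
}

/* Tick path on the cpu itself: it takes the rq lock anyway. */
void local_tick(struct rq_model *rq, uint64_t now_ns)
{
	rq_lock(rq);
	account_locked(rq, now_ns);
	rq_unlock(rq);
}

/*
 * Slow remote path: exits early when the cpu accounted for itself
 * less than a tick ago. Returns true if it did the accounting.
 */
bool remote_update(struct rq_model *rq, uint64_t now_ns)
{
	bool updated = false;

	rq_lock(rq);
	if (now_ns - rq->last_update_ns >= TICK_NS) {
		account_locked(rq, now_ns);
		updated = true;
	}
	rq_unlock(rq);
	return updated;
}
```

As long as the isolated cpu still ticks, remote_update() only takes
and drops the lock. Once the tick is gone, the remote cpu does the
accounting with the slow period.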

Now if we run the remote update with a slow period (1 sec) there might
be some delay in the stats, but once the cpu has vanished into the
user space while(1) mode we can really live with the slightly
inaccurate accumulation.

The only other issue might be posix cpu timers. For a start I really
would just ignore them. There are other means to watchdog a task
runtime, but we can extend the remote slow update scheme to posix cpu
timers as well if the need arises.

Thanks,

	tglx