linux-kernel - Re: Requirements to control kernel isolation/nohz

Message-ID: <20200909223400.GA20541@lenoir>
Date:   Thu, 10 Sep 2020 00:34:01 +0200
From:   Frederic Weisbecker <frederic@...nel.org>
To:     Marcelo Tosatti <mtosatti@...hat.com>
Cc:     Phil Auld <pauld@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        "Paul E. McKenney" <paulmck@...nel.org>,
        Joel Fernandes <joel@...lfernandes.org>,
        linux-kernel@...r.kernel.org
Subject: Re: Requirements to control kernel isolation/nohz_full at runtime

On Thu, Sep 03, 2020 at 03:23:59PM -0300, Marcelo Tosatti wrote:
> On Tue, Sep 01, 2020 at 12:46:41PM +0200, Frederic Weisbecker wrote:
> > == Unbound affinity ==
> > 
> > Restore the wide affinity of kernel threads, workqueues, timers, etc.
> > But take care of cpumasks that have been set through other
> > interfaces: sysfs, procfs, etc.
> 
> We were looking at a userspace interface: what a proper one would look
> like (unified, similar to the isolcpus= interface) and how to implement it:
> 
> The simplest idea for interface seemed to be exposing the integer list of
> CPUs and isolation flags to userspace (probably via sysfs).
> 
> The scheme would allow flags to be separately enabled/disabled,
> with not all flags being necessarily toggleable (could for example
> disallow nohz_full= toggling until it is implemented, but allow for
> other isolation features to be toggleable).
> 
> This would require per-flag housekeeping_masks (instead of a single one).

Right, I think cpusets provide exactly that.
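
To make the per-flag idea concrete, here is a minimal userspace sketch, not
kernel code. Every name in it (iso_flag, iso_state, iso_init, iso_set) is
hypothetical, the 64-CPU limit only comes from using a single word as the
mask, and refusing to empty a housekeeping mask is my own assumption:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical isolation flags; each one gets its own housekeeping mask. */
enum iso_flag { ISO_TIMER, ISO_WQ, ISO_KTHREAD, ISO_NOHZ_FULL, ISO_NR_FLAGS };

struct iso_state {
	uint64_t possible;                   /* bit n set: CPU n exists */
	uint64_t housekeeping[ISO_NR_FLAGS]; /* bit n set: CPU n does housekeeping */
	bool toggleable[ISO_NR_FLAGS];       /* may this flag change at runtime? */
};

/* Start with every CPU doing housekeeping work for every flag. */
static void iso_init(struct iso_state *s, unsigned int nr_cpus)
{
	assert(nr_cpus >= 1 && nr_cpus <= 64);
	s->possible = nr_cpus == 64 ? UINT64_MAX : (UINT64_C(1) << nr_cpus) - 1;
	for (int f = 0; f < ISO_NR_FLAGS; f++) {
		s->housekeeping[f] = s->possible;
		/* Assumption: nohz_full toggling is refused until implemented. */
		s->toggleable[f] = (f != ISO_NOHZ_FULL);
	}
}

/* Isolate (or un-isolate) a set of CPUs for one flag; false means refused. */
static bool iso_set(struct iso_state *s, enum iso_flag f, uint64_t cpus,
		    bool isolate)
{
	uint64_t next;

	if (!s->toggleable[f])
		return false;
	cpus &= s->possible;
	next = isolate ? (s->housekeeping[f] & ~cpus)
		       : (s->housekeeping[f] | cpus);
	if (next == 0) /* assumption: keep at least one housekeeping CPU */
		return false;
	s->housekeeping[f] = next;
	return true;
}
```

With 8 CPUs, iso_set(&s, ISO_TIMER, 0x0c, true) isolates CPUs 2-3 for timers
only and leaves the other flags' masks untouched, while the same call with
ISO_NOHZ_FULL is refused.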

> Back to the userspace interface, you mentioned earlier that cpusets
> was a possibility for it. However:
> 
> "Cpusets provide a Linux kernel mechanism to constrain which CPUs and
> Memory Nodes are used by a process or set of processes.
> 
> The Linux kernel already has a pair of mechanisms to specify on which
> CPUs a task may be scheduled (sched_setaffinity) and on which Memory
> Nodes it may obtain memory (mbind, set_mempolicy).
> 
> Cpusets extends these two mechanisms as follows:"
> 
> The isolation flags do not necessarily have anything to do with
> tasks, but with CPUs: a given feature is disabled or enabled on a
> given CPU. 
> No?

When cpusets are set as exclusive, they become strict CPU properties.
I think we'll need to enforce the exclusive property to set the isolated
flags.

Then you're free to move the tasks you like into any isolated cpusets.
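
A rough sketch of what "enforce the exclusive property" could mean, again as
a userspace model rather than real cpuset code. The struct and both functions
are hypothetical; the exclusive field only stands in for the existing
cpuset.cpu_exclusive setting, and iso_flags is an assumed new attribute:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of sibling cpusets; not the kernel's struct cpuset. */
struct cpuset_sketch {
	uint64_t cpus;          /* CPUs owned by this set */
	bool exclusive;         /* stands in for cpuset.cpu_exclusive */
	unsigned int iso_flags; /* hypothetical isolation flags */
};

/* True if the set is marked exclusive and shares no CPU with a sibling. */
static bool cs_is_exclusive(const struct cpuset_sketch *sets, size_t n,
			    size_t idx)
{
	assert(idx < n);
	if (!sets[idx].exclusive)
		return false;
	for (size_t i = 0; i < n; i++)
		if (i != idx && (sets[i].cpus & sets[idx].cpus))
			return false;
	return true;
}

/* Isolation flags are only accepted on exclusive sets; false means refused. */
static bool cs_set_iso_flags(struct cpuset_sketch *sets, size_t n, size_t idx,
			     unsigned int flags)
{
	if (!cs_is_exclusive(sets, n, idx))
		return false;
	sets[idx].iso_flags = flags;
	return true;
}
```

Once the CPUs of a set cannot be shared, a flag set on that cpuset reads as a
property of those CPUs, whatever tasks happen to be attached to it.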

> Regarding locking of the masks, since housekeeping_masks can be read
> from hot paths (e.g. get_nohz_timer_target) it seems RCU is a natural
> fit, so userspace would:
> 
> 1) use interface to change cpumask for a given feature:
> 
> 	-> set_rcu_pointer
> 	-> wait for grace period

Yep, could be a solution.
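
The publish-then-wait sequence can be illustrated in userspace with C11
atomics. This is only a toy model: the names (hk_mask, hk_read, hk_update)
are hypothetical, and the reader counter is a crude stand-in for a grace
period. Real RCU readers write no shared state, which is exactly why RCU
suits hot paths. In the kernel the corresponding primitives would be
rcu_assign_pointer(), rcu_dereference() and synchronize_rcu().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct hk_mask {
	uint64_t cpus; /* bit n set: CPU n does housekeeping */
};

static _Atomic(struct hk_mask *) hk_current; /* currently published mask */
static atomic_int hk_readers;                /* readers inside a section */

/* Reader side: must only be called after a first hk_update(). */
static uint64_t hk_read(void)
{
	uint64_t cpus;

	atomic_fetch_add(&hk_readers, 1);         /* enter read section */
	cpus = atomic_load(&hk_current)->cpus;    /* fetch published mask */
	atomic_fetch_sub(&hk_readers, 1);         /* leave read section */
	return cpus;
}

/*
 * Writer side: publish the new mask, then wait until no reader that might
 * still hold the old pointer remains. Returns the old mask, which the
 * caller may now free or reuse.
 */
static struct hk_mask *hk_update(struct hk_mask *new_mask)
{
	struct hk_mask *old;

	assert(new_mask != NULL);
	old = atomic_exchange(&hk_current, new_mask); /* publish */
	while (atomic_load(&hk_readers) != 0)         /* toy grace period */
		;
	return old;
}
```

After hk_update() returns, every later hk_read() sees the new mask, so step 2
can start triggering the actions that depend on it.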

> 2) proceed to trigger actions that rely on housekeeping_cpumask, 
> to validate the cpumask at 1) is being used.

Exactly. I guess we can simply call directly into the subsystems (timers,
workqueue, kthreads, ...) from the isolation code upon cpumask update.
This way we avoid the ordering surprises that would come with a notifier.
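
As a sketch of the difference: with direct calls, the order in which
subsystems see the new mask is spelled out in one function instead of
depending on notifier registration order or priorities. The hooks below are
hypothetical stubs that only record that they ran, and the order shown is
arbitrary:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum subsys { SUBSYS_TIMERS, SUBSYS_WORKQUEUE, SUBSYS_KTHREADS };

/* Record of which hook ran when, for illustration only. */
static enum subsys call_log[8];
static size_t call_count;

static void log_call(enum subsys who)
{
	assert(call_count < sizeof(call_log) / sizeof(call_log[0]));
	call_log[call_count++] = who;
}

/* Hypothetical per-subsystem hooks; real ones would migrate work. */
static void timers_update(uint64_t hk)    { (void)hk; log_call(SUBSYS_TIMERS); }
static void workqueue_update(uint64_t hk) { (void)hk; log_call(SUBSYS_WORKQUEUE); }
static void kthreads_update(uint64_t hk)  { (void)hk; log_call(SUBSYS_KTHREADS); }

/* Called by the isolation code once the new housekeeping mask is visible. */
static void isolation_update(uint64_t housekeeping)
{
	/* The sequence is explicit here and easy to audit or change. */
	timers_update(housekeeping);
	workqueue_update(housekeeping);
	kthreads_update(housekeeping);
}
```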

> Regarding nohz_full=, a way to get an immediate implementation
> (without handling the issues you mention above) would be to boot
> with a set of CPUs as "nohz_full toggleable" and others not. For
> the nohz_full toggleable ones, you'd introduce a per-CPU tick
> dependency that is enabled/disabled at runtime. Probably better
> to avoid this one if possible...

Right, but you would still have all the overhead that comes with nohz_full
(kernel entry/exit tracking, RCU userspace extended grace period, RCU callbacks
offloaded, vtime accounting, ...). It will become really interesting once we
can switch all that overhead off.

Thanks.