Date:	Thu, 8 Oct 2015 15:28:10 -0700
From:	Andy Lutomirski <luto@...capital.net>
To:	Christoph Lameter <cl@...ux.com>
Cc:	Chris Metcalf <cmetcalf@...hip.com>,
	Luiz Capitulino <lcapitulino@...hat.com>,
	Gilad Ben Yossef <giladb@...hip.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Rik van Riel <riel@...hat.com>, Tejun Heo <tj@...nel.org>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Viresh Kumar <viresh.kumar@...aro.org>,
	Catalin Marinas <catalin.marinas@....com>,
	Will Deacon <will.deacon@....com>,
	"linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: crazy idea: big percpu lock (Re: task isolation)

On Thu, Oct 8, 2015 at 3:01 PM, Christoph Lameter <cl@...ux.com> wrote:
> On Thu, 8 Oct 2015, Andy Lutomirski wrote:
>
>> It seems to me that a big part of the problem is that there's all
>> kinds of per-cpu deferred housekeeping work that can be done on the
>> CPU in question without any complicated or heavyweight locking but
>> that can't be done remotely without a mess.  This presumably includes
>> vmstat, draining the LRU list, etc.  This is a problem for allowing
>> CPUs to spend a long time without any interrupts.
>
> Well it's not a problem if the task does a prctl to ask for the kernel to
> quiet down. In that case we can simply flush all the pending stuff on the
> cpu that owns the percpu section.
>

Will this really end up working?  I can see two problems:

1. It's rather expensive.  For processes that still make syscalls, just
not very many, it means you're paying the full forced quiesce on every
return to user mode.

2. It only really makes sense for work that results from local kernel
actions, happens once, and won't recur.  I admit that I don't know how
many of the offenders are like this, but I can imagine there being
some periodic tasks that could be done locally or remotely with a big
percpu lock.

>> I want to propose a new primitive that might go a long way toward
>> solving this issue.  The new primitive would be called the "big percpu
>> lock".  Non-nohz CPUs would hold their big percpu lock all the time.
>> Nohz CPUs would hold it all the time unless idle.  Full nohz cpus
>> would hold it all the time except when idle or in user mode.  No CPU
>> promises to hold it while processing an NMI or similar NMI-like work.
>
> Not sure that there is an issue to solve. So this is a lock per cpu that
> signals that the processor can handle its per cpu data alone. If it's not
> held then other cpus can access the percpu data remotely?
>
>> This should help in a ton of cases.
>>
>> For vunmap global kernel TLB flushes, we could stick the flushes in a
>> list of deferred flushes to be processed on entry, and that list would
>> be protected by the big percpu lock.  For any kind of draining of
>> non-NMI-safe percpu data (LRU, vmstat, whatever), we could have a
>> housekeeping cpu try to do it using the big percpu lock.
>
> Ok what is the problem with using the cpu that owns the percpu data to
> flush it?

Nothing, but only if flushing gets the job done.

> Or simply ignore the situation until the cpu is entering the
> kernel again?

Maybe.  I wonder if, for things like vmstat, that would be better in
general (not just NOHZ).  We have task_work nowadays...
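Rough sketch of what I mean -- every name here is made up, including
the fold helper, and a real version would guard against queueing the
same callback_head twice:

#include <linux/percpu.h>
#include <linux/sched.h>
#include <linux/task_work.h>

static DEFINE_PER_CPU(struct callback_head, vmstat_fold_twork);

static void vmstat_fold_fn(struct callback_head *head)
{
        /*
         * Runs in task context on the owning CPU, on the way back to
         * user mode, so plain this_cpu operations are safe here.
         */
        fold_this_cpu_vm_stats();       /* hypothetical exported helper */
}

/*
 * A housekeeping CPU notices that @cpu has stale counters and pokes
 * the task running there instead of sending an IPI.
 */
static int queue_vmstat_fold(struct task_struct *tsk, int cpu)
{
        struct callback_head *twork = &per_cpu(vmstat_fold_twork, cpu);

        init_task_work(twork, vmstat_fold_fn);
        return task_work_add(tsk, twork, true);
}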

> Caches can be useful later again when the process wants to
> allocate memory etc. We would have to repopulate them if we flush them.

True.  But we don't need to flush them at all until there's memory
pressure, and the big percpu lock solves this particular problem quite
nicely -- a remote CPU can simply drain the cache itself instead of
using an IPI.
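Concretely, something like this (pure sketch: big_percpu_lock doesn't
exist anywhere, and a real version would have to initialize the locks
and cover more than the LRU):

#include <linux/percpu.h>
#include <linux/smp.h>
#include <linux/spinlock.h>
#include <linux/swap.h>         /* lru_add_drain(), lru_add_drain_cpu() */

static DEFINE_PER_CPU(spinlock_t, big_percpu_lock);

static void drain_lru_ipi(void *info)
{
        lru_add_drain();        /* kernel entry took the lock for us */
}

/* Called under memory pressure by some housekeeping CPU. */
static void drain_lru_remote(int cpu)
{
        spinlock_t *lock = &per_cpu(big_percpu_lock, cpu);

        if (spin_trylock(lock)) {
                /*
                 * cpu is idle or in user mode, so its pagevecs are
                 * quiescent and we can safely drain them from here.
                 */
                lru_add_drain_cpu(cpu);
                spin_unlock(lock);
        } else {
                /* cpu is busy in the kernel; make it do the work. */
                smp_call_function_single(cpu, drain_lru_ipi, NULL, 1);
        }
}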

>
>> There's a race here that affects task isolation.  On exit to user
>> mode, there's no obvious way to tell that an IPI is already pending.
>> We could add that, too: whenever we send an IPI to a nohz_full CPU, we
>> increment a percpu pending IPI count, then try to get the big percpu
>> lock, and then, if we fail, send the IPI.  IOW, we might want a helper
>> that takes a remote big percpu lock or calls a remote function that
>> guards against this race.
>>
>> Thoughts?  Am I nuts?
>
> Generally having a lock that signals that others can access the per cpu
> data may make sense. However, what is the overhead of handling that lock?
>
> One definitely does not want to handle that in latency critical sections.

On the accessing side, it's just a standard try spinlock operation.
On the nohz side, it's a spinlock acquire on entry and a spinlock
release on exit.  This is actually probably considerably cheaper than
whatever the context tracking code already does, but it does put a
lower bound on how cheap we can make it.
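Spelled out, the fast paths plus the race-avoiding remote helper I
mentioned above might look like this (sketch only; every symbol is
hypothetical):

#include <linux/atomic.h>
#include <linux/percpu.h>
#include <linux/smp.h>
#include <linux/spinlock.h>

static DEFINE_PER_CPU(spinlock_t, big_percpu_lock);
static DEFINE_PER_CPU(atomic_t, bpl_pending_ipi);

/* Called next to the context tracking hooks on nohz_full CPUs. */
static inline void bpl_kernel_entry(void)
{
        spin_lock(this_cpu_ptr(&big_percpu_lock));
}

static inline void bpl_kernel_exit(void)
{
        /*
         * A task-isolation exit path would also wait here for this
         * CPU's bpl_pending_ipi count to drop to zero before
         * returning to user mode.
         */
        spin_unlock(this_cpu_ptr(&big_percpu_lock));
}

/*
 * Run @func against @cpu's percpu data: remotely under the lock if
 * the CPU is out of the kernel, by IPI otherwise.  Bumping the
 * pending count before the trylock is what closes the race against
 * the exit-to-user path.
 */
static void bpl_remote_call(int cpu, smp_call_func_t func, void *info)
{
        atomic_inc(&per_cpu(bpl_pending_ipi, cpu));
        if (spin_trylock(&per_cpu(big_percpu_lock, cpu))) {
                func(info);
                spin_unlock(&per_cpu(big_percpu_lock, cpu));
        } else {
                smp_call_function_single(cpu, func, info, 1);
        }
        atomic_dec(&per_cpu(bpl_pending_ipi, cpu));
}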

>
> And one cannot handle the lock in interrupt disabled sections like IPIs.
> But if one can remotely acquire that lock then no IPI is needed anymore if
> the only thing we want to do is manipulate per cpu data.

Sure you can.  If you're in an IPI handler, you have the lock -- the
CPU is in the kernel, and the lock is held whenever the CPU is in the
kernel.

>
> There is a complication that many of these flushing functions are written
> using this_cpu operations that can only be run on the cpu owning the per
> cpu section because the per cpu base is different on other processors. If
> you want to change that then more expensive instructions have to be used.
> So you end up with two different versions of the function.
>

That's a fair point.
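For each such function the split would look something like this (names
made up):

#include <linux/atomic.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(long, pending);
static atomic_long_t total;

/* Local flavor: cheap segment-relative this_cpu ops. */
static void fold_pending_local(void)
{
        atomic_long_add(this_cpu_xchg(pending, 0), &total);
}

/* Remote flavor: needs per_cpu_ptr() and the target's lock held. */
static void fold_pending_remote(int cpu)
{
        atomic_long_add(xchg(per_cpu_ptr(&pending, cpu), 0), &total);
}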

How many of these things are there?

--Andy