linux-kernel - Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20150507124437.GB17443@gmail.com>
Date:	Thu, 7 May 2015 14:44:37 +0200
From:	Ingo Molnar <mingo@...nel.org>
To:	Andy Lutomirski <luto@...capital.net>
Cc:	fweisbec@...hat.com, Paolo Bonzini <pbonzini@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>, X86 ML <x86@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>,
	Heiko Carstens <heiko.carstens@...ibm.com>,
	Mike Galbraith <umgwanakikbuti@...il.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Rik van Riel <riel@...hat.com>, williams@...hat.com
Subject: Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable
 & enable from context tracking on syscall entry

* Andy Lutomirski <luto@...capital.net> wrote:

> > > We cannot take the lock_trace(task) from irq context, and we 
> > > probably do not need to anyway, since we do not care about a 
> > > precise stack trace for the task.
> >
> > So one worry with this and similar approaches of statistically 
> > detecting user mode would be the fact that on the way out to 
> > user-space we don't really destroy the previous call trace - we 
> > just pop off the stack (non-destructively), restore RIPs and are 
> > gone.
> >
> > We'll need that percpu flag I suspect.
> >
> > And once we have the flag, we can get rid of the per syscall RCU 
> > callback as well, relatively easily: with CMPXCHG (in 
> > synchronize_rcu()!) we can reliably sample whether a CPU is in 
> > user mode right now, while the syscall entry/exit path does not 
> > use any atomics, we can just use a simple MOV.
> >
> > Once we observe 'user mode', then we have observed quiescent state 
> > and synchronize_rcu() can continue. If we've observed kernel mode 
> > we can frob the remote task's TIF_ flags to make it go into a 
> > quiescent state publishing routine on syscall-return.
> >
> 
> How does that work?
>
> If the exit code writes the per-cpu flag and then checks 
> TIF_whatever, we need a barrier to avoid a race where we end up in 
> user mode without seeing the flag.

No, the TIF_RCU_SYNC flag would be independent of the user-mode flag.

Also, the user-mode flag would be changed from the 'kernel-mode' value 
to the 'user-mode' value after we do the regular TIF checks - just 
before we hit user-space.

The TIF_RCU_QS thing is just a fancy way for synchronize_rcu() (being 
executed on some other CPU not doing RT work) to intelligently wait 
for the remote (RT work doing) CPU to finish executing kernel code, 
without polling or so.

And yes, the TIF_RCU_QS handler on the remote CPU would have to 
execute an SMP barrier, before it allows an RCU quiescent state to 
pass. Note that the regular RCU code for quiescent state publishing 
already has that barrier, typically something like:

	this_cpu_inc(rcu_qs_ctr);

That in itself is enough for synchronize_rcu(), it could do a small 
timing loop with short sleeps, until it waits for rcu_qs_ctr to 
increase.

> I think the right solution is to accept that race and just have the 
> RCU code send an IPI (or check again) if it sees too long of a 
> period elapse in kernel mode.

I don't think there's any need for an IPI.

> I think the flag should be a counter, though.  That way a workload 
> that makes lots of syscalls will be quickly detected as going 
> through quiescent states even if it's never actually observed in 
> user mode.

Flag write is easier on the CPU than an INC/DEC: only a store to a 
percpu location, no load needed, and no value dependencies. The store 
will be program ordered, so it's a spin_unlock() work-alike.

> > The only hard requirement of this scheme from the RCU 
> > synchronization POV is that all kernel contexts that may touch RCU 
> > state need to flip this flag reliably to 'kernel mode': i.e. all 
> > irq handlers, traps, NMIs and all syscall variants need to do 
> > this.
> >
> > But once it's there, it's really neat.
> 
> We already have to do this with the current code.  I went through 
> and checked all of the IST entries a couple versions ago.

Yeah - I just mean if the primary flag is _not_ TIF driven (which is 
my suggestion, and which I thought you suggested as well) then all 
code paths need to be covered.

Things like the irq-tracking callbacks are debugging and 
instrumentation code - the user/kernel mode flag would be a hard 
requirement on RCU correctness: missing a flag update might cause the 
kernel to crash in non-obvious ways.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/