Date:	Tue, 19 Jan 2010 12:30:38 -0500
From:	Steven Rostedt <rostedt@...dmis.org>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>,
	linux-kernel@...r.kernel.org,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Oleg Nesterov <oleg@...hat.com>, Ingo Molnar <mingo@...e.hu>,
	akpm@...ux-foundation.org, josh@...htriplett.org,
	tglx@...utronix.de, Valdis.Kletnieks@...edu, dhowells@...hat.com,
	laijs@...fujitsu.com, dipankar@...ibm.com
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory
 barrier (v5)

On Tue, 2010-01-19 at 17:47 +0100, Peter Zijlstra wrote:
> On Thu, 2010-01-14 at 11:26 -0500, Mathieu Desnoyers wrote:

> > It's this scenario that is causing the problem. Let's consider this
> > execution:
> > 
> >        CPU 0 (membarrier)                  CPU 1 (another mm -> our mm)
> >        <kernel-space>                      <kernel-space>
> >                                            switch_mm()
> >                                              smp_mb()
> >                                              clear_mm_cpumask()
> >                                              set_mm_cpumask()
> >                                              smp_mb() (by load_cr3() on x86)
> >                                            switch_to()
> >        mm_cpumask includes CPU 1
> >        rcu_read_lock()
> >        if (CPU 1 mm != our mm)
> >          skip CPU 1.
> >        rcu_read_unlock()
> >                                              current = next (1)
> 
> OK, so on x86 current uses esp and will be flipped somewhere in the
> switch_to() magic; cpu_curr(cpu) as used by CPU 0 uses rq->curr, which
> will be set before context_switch(), and that always implies a mb() for
> non-matching ->mm's [*]


This explanation by Mathieu still does not show the issue. He finally
explained it correctly here:

 http://lkml.org/lkml/2010/1/14/319

The issue is before switch_to(). If the read side loads a pointer but
gets preempted into the kernel, and the kernel updates rq->curr to the
next task (before the switch_to()), then when the write side does
sys_membarrier(), the check:

if (cpu_curr(1)->mm != our mm)

evaluates true and wrongly skips CPU 1, so no memory barrier happens
there yet.

Then the userspace side of the writer frees or changes the data, but
the reader's load is still in flight on the bus (still before the
switch_to()). The reader can then end up with a bad pointer.

Here's a simple explanation:

	CPU 0				CPU 1
	----------			-----------
	<userspace>			<userspace>
	obj = get_obj();
					rcu_read_lock();
					obj = get_obj(); <load of obj>
					<obj still being retrieved from memory>
					<interrupt, schedule out>
					<kernelspace>
					rq->curr = next;
					<bus stall>
					<obj still in memory>

	sys_membarrier();
	<kernel space>
	if (cpu_curr(1)->mm != mm) <-- true, so CPU 1 is skipped
	<userspace>

	modify object

					switch_to()
					<obj finally is loaded>
					<schedule kernel thread>
					<schedule task back>
					<userspace>
					refer(obj) <-- corruption!
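
To make that window concrete, here is a minimal sketch of the kind of
check being discussed. This is my illustration, not Mathieu's actual
patch: membarrier_ipi() and the loop shape are assumptions, though
cpu_curr(), mm_cpumask() and smp_call_function_single() are the real
scheduler/SMP interfaces:

	/*
	 * Illustrative sketch only.  The race: cpu_curr(cpu) reads
	 * rq->curr, which the scheduler sets to the next task *before*
	 * switch_to(), so we can skip a CPU whose previous task still
	 * has user-space loads in flight.
	 */
	static void membarrier_ipi(void *unused)
	{
		smp_mb();	/* promote the remote barrier() to smp_mb() */
	}

	SYSCALL_DEFINE0(membarrier)
	{
		int cpu;

		for_each_cpu(cpu, mm_cpumask(current->mm)) {
			if (cpu_curr(cpu)->mm != current->mm)
				continue;	/* the racy skip shown above */
			smp_call_function_single(cpu, membarrier_ipi,
						 NULL, 1);
		}
		return 0;
	}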

> 
> >                                            <switch back to user-space>
> >                                            read-lock()
> >                                              read gp, store local gp
> >                                              barrier()
> >                                              access critical section (2)
> > 
> > So if we don't have any memory barrier between (1) and (2), the memory
> > operations can be reordered in such a way that CPU 0 will not send IPI
> > to a CPU that would need to have its barrier() promoted into a
> > smp_mb().
> 
> OK, so I'm utterly failing to make sense of the above; do you need more
> than the 2 CPUs discussed to make it go boom?

Just the 2 CPUs, as explained above.

> 
> > Replacing these kernel rcu_read_lock/unlock() by rq locks ensures that
> > when the scheduler runs concurrently on another CPU, _all_ the scheduling
> > code is executed atomically wrt the spin lock taken on cpu 0.
> 
> Sure, but taking the rq->lock is fairly heavy-handed.

Yes, but for now it seems to be the only safe way.
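
For the record, the shape of that heavier but safe variant would be
roughly the following (again my sketch, not the actual patch; it reuses
the hypothetical membarrier_ipi() above and assumes access to cpu_rq()
and rq->lock, which live in kernel/sched.c; whether that lock is a
spin_lock or a raw_spin_lock depends on the tree):

	/*
	 * Sketch: take the remote runqueue lock so the ->curr/->mm
	 * reads are serialized against the whole scheduling path
	 * (rq->curr is only updated with rq->lock held), closing the
	 * window shown in the diagram.
	 */
	for_each_cpu(cpu, mm_cpumask(current->mm)) {
		struct rq *rq = cpu_rq(cpu);
		unsigned long flags;
		int need_ipi;

		raw_spin_lock_irqsave(&rq->lock, flags);
		need_ipi = (rq->curr->mm == current->mm);
		raw_spin_unlock_irqrestore(&rq->lock, flags);

		if (need_ipi)
			smp_call_function_single(cpu, membarrier_ipi,
						 NULL, 1);
	}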

> 
> > When x86 uses iret to return to user-space, we have a serializing
> > instruction. But if it uses sysexit, or if we are on a different
> > architecture, are we sure that a memory barrier is issued before
> > returning to user-space?
> 
> [*] and possibly also for matching ->mm's, because:
> 
> OK, so I had a quick look at the switch_to() magic, and from what I can
> make of it, it implies an mb, if only because poking at the segment
> registers implies LOCK semantics.

The problem is that this race can occur before we ever get to
switch_to().
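
Concretely, the ordering inside schedule() that opens the window looks
roughly like this (paraphrased from the scheduler of that era, heavily
trimmed; not verbatim kernel code):

	/* inside schedule(), once a different next task is picked: */
	rq->curr = next;		/* cpu_curr() sees the new task here */
	context_switch(rq, prev, next);	/* switch_to() and its implied
					   mb only happen in here */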

-- Steve

