Message-ID: <alpine.DEB.1.10.0811252313420.32523@alien.or.mcafeemobile.com>
Date:	Wed, 26 Nov 2008 11:36:29 -0800 (PST)
From:	Davide Libenzi <davidel@...ilserver.org>
To:	Tejun Heo <htejun@...il.com>
cc:	Oleg Nesterov <oleg@...hat.com>,
	Eric Van Hensbergen <ericvh@...il.com>,
	Ron Minnich <rminnich@...dia.gov>, Ingo Molnar <mingo@...e.hu>,
	Christoph Hellwig <hch@...radead.org>,
	Miklos Szeredi <mszeredi@...e.cz>,
	Brad Boyer <flar@...andria.com>,
	Al Viro <viro@...iv.linux.org.uk>,
	Roland McGrath <roland@...hat.com>,
	Mauro Carvalho Chehab <mchehab@...radead.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] poll: allow f_op->poll to sleep, take#5

On Wed, 26 Nov 2008, Tejun Heo wrote:

> Hello,
> 
> Davide Libenzi wrote:
> > Look, pollwake() does:
> > 
> > w1) WR triggered (1)
> > w2) WMB
> > w3) WR task->state (RUNNING)
> > 
> > While poll_schedule_timeout() does:
> > 
> > s1) WR task->state (TASK_INTERRUPTIBLE)
> > s2) MB
> > s3) RD triggered
> > s4) IF0 => RD task->state (if !RUNNING -> sleep)
> s5) after waking up, WR triggered to zero
> 
> > The only risk is that w3 precedes s1, so that we go to sleep even though a 
> > wakeup has been issued. But if w3 is visible, w1 is visible too, which 
> > means that 'triggered' is visible in s3 (there's an MB in s2). So we skip 
> > the schedule_hrtimeout_range(). So IMO you need no barriers on 'triggered'.
> > If you feel you need barriers, do you mind explaining a sequence of events 
> > that makes a barrier-free version break?
> 
> s5 from the previous iteration could happen after w1 during the next
> iteration and the test in s4 of the next iteration will miss the
> event, so the event could get lost on the iterations which is not the
> first one, no?

Hmmm, I just noticed that the set_current_state(TASK_INTERRUPTIBLE) at the 
beginning of the ->poll() loop has been dropped (which makes sense, since 
->poll() can now sleep). So the iterations after the first become the 
interesting ones.
Device side, via wakeup():

w1) WR dev->events
w2) WR triggered (1)
w3) WMB
w4) WR task->state (RUNNING)

On the poller side:

s1) WR task->state (TASK_INTERRUPTIBLE)
s2) MB
s3) RD triggered
s4) IF0 => RD task->state (if !RUNNING -> sleep)
s5) WR triggered (0)
s6) RD dev->events

Now, it is very likely that there is some full MB after w1, since the 
events (AKA internal manipulation of the device/file structure) happen 
inside a spinlocked region. So, if the write at s5 is actually able to 
override the one at w2, the dev->events set at w1 are likely to be 
visible at the immediately following ->poll() loop.
To be sure though, independently of the device/file event setting 
behavior, IMO we need the following.
Device side:

w1) WR dev->events
w2) MB
w3) WR triggered (1)
w4) WMB
w5) WR task->state (RUNNING)

Poller side:

s1) WR task->state (TASK_INTERRUPTIBLE)
s2) MB
s3) RD triggered
s4) IF0 => RD task->state (if !RUNNING -> sleep)
s5) WR triggered (0)
s6) MB
s7) RD dev->events
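
The two sequences above can be sketched in userspace C11 as follows. This is
only an illustration under stated assumptions, not the kernel code: the
variables dev_events, triggered and task_running and the three functions are
hypothetical stand-ins, MB is modelled as a seq_cst fence, and WMB is
approximated by a release fence (C11 has no exact equivalent, and a release
fence is somewhat stronger than a write-only barrier).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the state discussed in this thread. */
static atomic_uint dev_events;   /* models dev->events */
static atomic_int  triggered;    /* models the poll "triggered" flag */
static atomic_int  task_running; /* 1 models task->state == RUNNING */

/* Device side, steps w1..w5. */
static void device_wakeup(unsigned int ev)
{
    atomic_fetch_or_explicit(&dev_events, ev, memory_order_relaxed); /* w1 */
    atomic_thread_fence(memory_order_seq_cst);                       /* w2: MB */
    atomic_store_explicit(&triggered, 1, memory_order_relaxed);      /* w3 */
    atomic_thread_fence(memory_order_release);                       /* w4: ~WMB */
    atomic_store_explicit(&task_running, 1, memory_order_relaxed);   /* w5 */
}

/* Poller side, steps s1..s4: returns true if the poller would sleep. */
static bool poller_should_sleep(void)
{
    atomic_store_explicit(&task_running, 0, memory_order_relaxed);   /* s1 */
    atomic_thread_fence(memory_order_seq_cst);                       /* s2: MB */
    if (atomic_load_explicit(&triggered, memory_order_relaxed))      /* s3 */
        return false;
    return !atomic_load_explicit(&task_running, memory_order_relaxed); /* s4 */
}

/* Poller side, steps s5..s7: clear the flag, then rescan the events. */
static unsigned int poller_rescan(void)
{
    atomic_store_explicit(&triggered, 0, memory_order_relaxed);      /* s5 */
    atomic_thread_fence(memory_order_seq_cst);                       /* s6: MB */
    return atomic_load_explicit(&dev_events, memory_order_relaxed);  /* s7 */
}
```

The pairing to look at is w1/w2/w3 against s5/s6/s7: with a full barrier on
both sides, the two sides cannot both miss each other's write. Either the
rescan at s7 sees the event stored at w1, or the store at w3 lands after the
clear at s5 and 'triggered' stays set for the next pass.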

That is, an MB before w3 (triggered=1) and a set_mb(triggered,0) at 
s5+s6. The spinlock on the queue taken before entering pollwake() is not 
enough to guarantee the required ordering, since a LOCK does not guarantee 
that memory operations issued before it have completed before the 
operations that follow it.
Without the MB at w2, the sequence [w3, s5, s7, w1] becomes possible, in 
which case we miss the event *and* sleep.
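
To make that interleaving concrete, here is a small single-threaded replay
in C. It does not exercise real memory reordering; it only applies the four
relevant steps in a given global order and reports whether the event ends up
lost. All names (struct poll_state, enum step, event_lost, the two sample
orders) are hypothetical and exist only for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two shared variables involved. */
struct poll_state {
    unsigned int dev_events; /* models dev->events */
    int triggered;           /* models the "triggered" flag */
};

enum step {
    W1_SET_EVENTS,      /* device: WR dev->events */
    W3_SET_TRIGGERED,   /* device: WR triggered (1) */
    S5_CLEAR_TRIGGERED, /* poller: WR triggered (0) */
    S7_READ_EVENTS      /* poller: RD dev->events */
};

/* Order that becomes possible without the MB at w2. */
static const enum step reordered[] = {
    W3_SET_TRIGGERED, S5_CLEAR_TRIGGERED, S7_READ_EVENTS, W1_SET_EVENTS
};

/* Order that respects w1 -> w3 and s5 -> s7. */
static const enum step barriered[] = {
    S5_CLEAR_TRIGGERED, S7_READ_EVENTS, W1_SET_EVENTS, W3_SET_TRIGGERED
};

/*
 * Replay the steps in the given order. The event is lost when the device
 * has posted it, the poller's rescan did not see it, and 'triggered' is
 * left at zero, so the next pass would go to sleep.
 */
static bool event_lost(const enum step *order, size_t n)
{
    struct poll_state st = { 0, 0 };
    unsigned int seen = 0;

    for (size_t i = 0; i < n; i++) {
        switch (order[i]) {
        case W1_SET_EVENTS:      st.dev_events = 1;    break;
        case W3_SET_TRIGGERED:   st.triggered = 1;     break;
        case S5_CLEAR_TRIGGERED: st.triggered = 0;     break;
        case S7_READ_EVENTS:     seen = st.dev_events; break;
        }
    }
    return st.dev_events != 0 && seen == 0 && st.triggered == 0;
}
```

Replaying 'reordered' reports a lost event, while 'barriered' does not: the
rescan still misses the event there, but 'triggered' is left set, so the
poller skips the sleep and picks the event up on the next iteration.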

- Davide
