Message-ID: <alpine.DEB.1.10.0811252313420.32523@alien.or.mcafeemobile.com>
Date: Wed, 26 Nov 2008 11:36:29 -0800 (PST)
From: Davide Libenzi <davidel@...ilserver.org>
To: Tejun Heo <htejun@...il.com>
cc: Oleg Nesterov <oleg@...hat.com>,
Eric Van Hensbergen <ericvh@...il.com>,
Ron Minnich <rminnich@...dia.gov>, Ingo Molnar <mingo@...e.hu>,
Christoph Hellwig <hch@...radead.org>,
Miklos Szeredi <mszeredi@...e.cz>,
Brad Boyer <flar@...andria.com>,
Al Viro <viro@...iv.linux.org.uk>,
Roland McGrath <roland@...hat.com>,
Mauro Carvalho Chehab <mchehab@...radead.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] poll: allow f_op->poll to sleep, take#5
On Wed, 26 Nov 2008, Tejun Heo wrote:
> Hello,
>
> Davide Libenzi wrote:
> > Look, pollwake() does:
> >
> > w1) WR triggered (1)
> > w2) WMB
> > w3) WR task->state (RUNNING)
> >
> > While poll_schedule_timeout() does:
> >
> > s1) WR task->state (TASK_INTERRUPTIBLE)
> > s2) MB
> > s3) RD triggered
> > s4) IF0 => RD task->state (if !RUNNING -> sleep)
> s5) after waking up, WR triggered to zero
>
> > The only risk is that w3 precedes s1, so that we go to sleep even though a
> > wakeup has been issued. But if w3 is visible, w1 is visible too, that
> > means that 'triggered' is visible in s3 (there's a MB in s2). So we skip
> > the schedule_hrtimeout_range(). So IMO you need no barriers on 'triggered'.
> > If you feel you need barriers, do you mind explaining a sequence of events
> > that makes a barrier-free version break?
>
> s5 from the previous iteration could happen after w1 during the next
> iteration and the test in s4 of the next iteration will miss the
> event, so the event could get lost on the iterations which is not the
> first one, no?
Hmmm, I just noticed that the set_current_state(TASK_INTERRUPTIBLE) at the
beginning of the ->poll() loop has been dropped (and it makes sense since
now ->poll() can sleep). So the iterations after the first become the
interesting ones.
Device side, via wakeup():
w1) WR dev->events
w2) WR triggered (1)
w3) WMB
w4) WR task->state (RUNNING)
On the poller side:
s1) WR task->state (TASK_INTERRUPTIBLE)
s2) MB
s3) RD triggered
s4) IF0 => RD task->state (if !RUNNING -> sleep)
s5) WR triggered (0)
s6) RD dev->events
Now, it is very likely that there is some full MB after w1, since the
events (AKA internal manipulation of the device/file structure) happen
inside a spinlocked region. So, if the write at s5 does manage to
overwrite the one at w2, the dev->events set at w1 are likely to be
visible at the immediately following ->poll() loop.
To be sure though, independently of how the device/file sets its events,
IMO we need ...
Device side:
w1) WR dev->events
w2) MB
w3) WR triggered (1)
w4) WMB
w5) WR task->state (RUNNING)
Poller side:
s1) WR task->state (TASK_INTERRUPTIBLE)
s2) MB
s3) RD triggered
s4) IF0 => RD task->state (if !RUNNING -> sleep)
s5) WR triggered (0)
s6) MB
s7) RD dev->events
That is, an MB before w3 (triggered=1) and a set_mb(triggered,0) at
s5+s6. The spinlock on the queue taken before entering pollwake() is not
enough to guarantee the required ordering, since a LOCK does not guarantee
that memory operations issued before it complete before those issued
after it.
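To make the proposed placement concrete, here is a rough userspace sketch of
the two sides using C11 atomics. This is not the kernel code: struct
poll_sketch and the sketch_*() helpers are made-up names, MB is approximated
with a seq_cst fence, and WMB with a release fence (which is at least as
strong for store-store ordering). The step labels match the listing above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the state discussed above. */
struct poll_sketch {
	atomic_int dev_events;   /* models dev->events */
	atomic_int triggered;    /* models the 'triggered' flag */
	atomic_int task_running; /* 1 = RUNNING, 0 = TASK_INTERRUPTIBLE */
};

/* Device side, steps w1..w5. */
static void sketch_wake(struct poll_sketch *p, int events)
{
	assert(events != 0);
	atomic_fetch_or_explicit(&p->dev_events, events,
				 memory_order_relaxed);                    /* w1 */
	atomic_thread_fence(memory_order_seq_cst);                         /* w2: MB */
	atomic_store_explicit(&p->triggered, 1, memory_order_relaxed);     /* w3 */
	atomic_thread_fence(memory_order_release);                         /* w4: ~WMB */
	atomic_store_explicit(&p->task_running, 1, memory_order_relaxed);  /* w5 */
}

/* Poller side, steps s1..s4: returns true if the poller would sleep. */
static bool sketch_should_sleep(struct poll_sketch *p)
{
	atomic_store_explicit(&p->task_running, 0, memory_order_relaxed);  /* s1 */
	atomic_thread_fence(memory_order_seq_cst);                         /* s2: MB */
	if (atomic_load_explicit(&p->triggered, memory_order_relaxed))     /* s3 */
		return false;
	return !atomic_load_explicit(&p->task_running,
				     memory_order_relaxed);                /* s4 */
}

/* Poller side, steps s5..s7: clear 'triggered', then rescan the events. */
static int sketch_rescan(struct poll_sketch *p)
{
	atomic_store_explicit(&p->triggered, 0, memory_order_relaxed);     /* s5 */
	atomic_thread_fence(memory_order_seq_cst);                         /* s6: MB */
	return atomic_load_explicit(&p->dev_events, memory_order_relaxed); /* s7 */
}
```

The s5+s6 pair in sketch_rescan() is the part that plays the role of
set_mb(triggered,0); dropping either seq_cst fence at w2 or s6 is what
reopens the window described next.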
Without the MB at w2, the ordering [w3, s5, s7, w1] could happen, which
would make us miss the event *and* sleep.
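The bad ordering can be replayed by hand. The sketch below is only a
single-threaded model: it applies the four steps in a given global order to
plain ints and reports whether the poller ends up with no event seen and
'triggered' cleared, which is the state in which the next iteration sleeps.
The names (replay_state, replay_loses_event) are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* The four steps that matter, in the numbering without the MBs. */
enum step { W1, W3, S5, S7 };

struct replay_state {
	int dev_events; /* written at w1 */
	int triggered;  /* written at w3, cleared at s5 */
	int seen;       /* what the poller read at s7 */
};

static void replay_apply(struct replay_state *st, enum step s)
{
	switch (s) {
	case W1: st->dev_events = 1; break;     /* WR dev->events */
	case W3: st->triggered = 1; break;      /* WR triggered (1) */
	case S5: st->triggered = 0; break;      /* WR triggered (0) */
	case S7: st->seen = st->dev_events; break; /* RD dev->events */
	}
}

/*
 * Returns true when the event is lost: the poller saw nothing at s7 and
 * 'triggered' is clear, so nothing stops it from going to sleep.
 */
static bool replay_loses_event(const enum step *order, int n)
{
	struct replay_state st = { 0, 0, 0 };

	assert(n > 0);
	for (int i = 0; i < n; i++)
		replay_apply(&st, order[i]);
	return st.seen == 0 && st.triggered == 0;
}
```

Feeding it { W3, S5, S7, W1 } reports a lost event, while { W1, W3, S5, S7 }
(what the MB at w2 enforces) does not. Note that { S5, S7, W1, W3 } is also
fine: the poller misses the event at s7, but 'triggered' stays set, so it
does not sleep.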
- Davide