[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250625152702.JiI8qdk-@linutronix.de>
Date: Wed, 25 Jun 2025 17:27:02 +0200
From: Nam Cao <namcao@...utronix.de>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc: Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>,
John Ogness <john.ogness@...utronix.de>,
Clark Williams <clrkwllms@...nel.org>,
Steven Rostedt <rostedt@...dmis.org>, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-rt-devel@...ts.linux.dev,
linux-rt-users@...r.kernel.org, Joe Damato <jdamato@...tly.com>,
Martin Karsten <mkarsten@...terloo.ca>,
Jens Axboe <axboe@...nel.dk>,
Frederic Weisbecker <frederic@...nel.org>,
Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCH v3] eventpoll: Fix priority inversion problem
On Wed, Jun 25, 2025 at 04:50:31PM +0200, Sebastian Andrzej Siewior wrote:
> On 2025-05-27 11:08:36 [+0200], Nam Cao wrote:
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> > @@ -1867,19 +1704,18 @@ static int ep_send_events(struct eventpoll *ep,
> > init_poll_funcptr(&pt, NULL);
> >
> > mutex_lock(&ep->mtx);
> > - ep_start_scan(ep, &txlist);
> >
> > - /*
> > - * We can loop without lock because we are passed a task private list.
> > - * Items cannot vanish during the loop we are holding ep->mtx.
> > - */
> > - list_for_each_entry_safe(epi, tmp, &txlist, rdllink) {
> > + while (res < maxevents) {
> > struct wakeup_source *ws;
> > + struct llist_node *n;
> > __poll_t revents;
> >
> > - if (res >= maxevents)
> > + n = llist_del_first(&ep->rdllist);
> > + if (!n)
> > break;
> >
> > + epi = llist_entry(n, struct epitem, rdllink);
> > +
> > /*
> > * Activate ep->ws before deactivating epi->ws to prevent
> > * triggering auto-suspend here (in case we reactive epi->ws
> > @@ -1896,21 +1732,30 @@ static int ep_send_events(struct eventpoll *ep,
> > __pm_relax(ws);
> > }
> >
> > - list_del_init(&epi->rdllink);
> > -
> > /*
> > * If the event mask intersect the caller-requested one,
> > * deliver the event to userspace. Again, we are holding ep->mtx,
> > * so no operations coming from userspace can change the item.
> > */
> > revents = ep_item_poll(epi, &pt, 1);
> > - if (!revents)
> > + if (!revents) {
> > + init_llist_node(n);
> > +
> > + /*
> > + * Just in case epi becomes ready after ep_item_poll() above, but before
> > + * init_llist_node(). Make sure to add it to the ready list, otherwise an
> > + * event may be lost.
> > + */
>
> So why not llist_del_first_init() at the top? Wouldn't this avoid the
> add below?
Look at that function:
static inline struct llist_node *llist_del_first_init(struct llist_head *head)
{
struct llist_node *n = llist_del_first(head);
// BROKEN: another task does llist_add() here for the same node
if (n)
init_llist_node(n);
return n;
}
It is not atomic to another task doing llist_add() to the same node.
init_llist_node() would then put the list in an inconsistent state.
To be sure, I tried your suggestion. Systemd sometimes failed to boot, and
my stress test crashed instantly.
>
> > + if (unlikely(ep_item_poll(epi, &pt, 1))) {
> > + ep_pm_stay_awake(epi);
> > + epitem_ready(epi);
> > + }
> > continue;
> > + }
> >
> > events = epoll_put_uevent(revents, epi->event.data, events);
> > if (!events) {
> > - list_add(&epi->rdllink, &txlist);
> > - ep_pm_stay_awake(epi);
> > + llist_add(&epi->rdllink, &ep->rdllist);
>
> That epitem_ready() above and this llist_add() add epi back where it was
> retrieved from. Wouldn't it loop in this case?
This is the EFAULT case, we are giving up, therefore we put the item back
and bail out. Therefore no loop.
If we have already done at least one item, then we report that to user. If
none, then we report -EFAULT. Regardless, this current item is not
"successfully consumed", so we put it back for the others to take it. We
are done here.
> I think you can avoid the add above and here adding it to txlist would
> avoid the loop. (It returns NULL if the copy-to-user failed so I am not
> sure why another retry will change something but the old code did it,
> too so).
>
> > if (!res)
> > res = -EFAULT;
> > break;
>
> One note: The old code did "list_add() + ep_pm_stay_awake()". Now you do
> "ep_pm_stay_awake() + epitem_ready()". epitem_ready() adds the item
> conditionally to the list so you may do ep_pm_stay_awake() without
> adding it to the list because it already is. Looking through
> ep_pm_stay_awake() it shouldn't do any harm except incrementing a
> counter again.
Yes, it shouldn't do any harm.
Thanks for reviewing, I know this lockless thing is annoying to look at.
Nam
Powered by blists - more mailing lists