Message-ID: <aflo53gea7i6cyy22avn7mqxb3xboakgjwnmj4bqmjp6oafejj@owgv35lly7zq>
Date: Wed, 17 Sep 2025 14:49:43 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Nam Cao <namcao@...utronix.de>
Cc: Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>, Shuah Khan <shuah@...nel.org>,
Davidlohr Bueso <dave@...olabs.net>, Soheil Hassas Yeganeh <soheil@...gle.com>,
Khazhismel Kumykov <khazhy@...gle.com>, Willem de Bruijn <willemb@...gle.com>,
Eric Dumazet <edumazet@...gle.com>, Jens Axboe <axboe@...nel.dk>, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
On Fri, Jul 18, 2025 at 09:52:29AM +0200, Nam Cao wrote:
> ep_events_available() checks for available events by looking at ep->rdllist
> and ep->ovflist. However, this is done without a lock, so the returned
> value is not reliable: it is possible that both checks on ep->rdllist and
> ep->ovflist come up false while ep_start_scan() or ep_done_scan() is being
> executed on other CPUs, even though events are available.
>
> This bug can be observed by:
>
> 1. Create an eventpoll with at least one ready level-triggered event
>
> 2. Create multiple threads that do epoll_wait() with zero timeout. The
> threads do not consume the events, therefore all epoll_wait() calls
> should return at least one event.
>
> If one thread is executing ep_events_available() while another thread is
> executing ep_start_scan() or ep_done_scan(), epoll_wait() may wrongly
> return no event for the former thread.
>
> This reproducer is implemented as TEST(epoll65) in
> tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
>
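
For readers following along, the core of such a reproducer can be sketched
like this (my own simplification in plain C using an eventfd, not the actual
TEST(epoll65) code; run_reproducer() and its parameters are my naming):

```c
#include <assert.h>
#include <pthread.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

struct waiter_arg {
	int epfd;
	int ncalls;
	long misses;
};

static void *waiter(void *p)
{
	struct waiter_arg *w = p;
	struct epoll_event ev;

	for (int i = 0; i < w->ncalls; i++) {
		/*
		 * Zero timeout; the level-triggered event is never
		 * drained, so every call should report it.
		 */
		if (epoll_wait(w->epfd, &ev, 1, 0) == 0)
			w->misses++;
	}
	return NULL;
}

/* Returns how many epoll_wait() calls spuriously saw no events. */
static long run_reproducer(int nthreads, int ncalls)
{
	struct epoll_event ev = { .events = EPOLLIN };
	pthread_t th[nthreads];
	struct waiter_arg args[nthreads];
	long total = 0;

	int epfd = epoll_create1(0);
	int efd = eventfd(1, 0);	/* counter 1: readable and stays readable */
	assert(epfd >= 0 && efd >= 0);
	assert(epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) == 0);

	for (int i = 0; i < nthreads; i++) {
		args[i] = (struct waiter_arg){ .epfd = epfd, .ncalls = ncalls };
		pthread_create(&th[i], NULL, waiter, &args[i]);
	}
	for (int i = 0; i < nthreads; i++) {
		pthread_join(th[i], NULL);
		total += args[i].misses;
	}
	close(efd);
	close(epfd);
	return total;
}
```

With a single waiter there is nothing to race against, so a nonzero count
needs at least two threads; a main() that calls run_reproducer(4, 1000) and
prints the result is all that is missing. Compile with -pthread.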
> Fix it by skipping ep_events_available(), just call ep_try_send_events()
> directly.
>
> epoll_sendevents() (io_uring) suffers the same problem, fix that as well.
>
> There is still ep_busy_loop(), which uses ep_events_available() without
> the lock, but that is probably okay (?) for busy-polling.
>
I'll say upfront I'm not an epoll person, just looked here out of
curiosity.
I can agree there is a bug. The event is generated before any of the
threads even exist and they only poll for it, none of them consume it.
However, the commit message fails to explain why the change fixes
anything, and I think your patch de facto reverts e59d3c64cba6 ("epoll:
eliminate unnecessary lock for zero timeout"). Looking at that diff,
the point was to avoid the expensive lock trip when timeout == 0 and there
are no events.
As for the bug itself, from my reading the ep_start_scan()/ep_done_scan()
pair transiently disturbs the state checked by ep_events_available(),
resulting in false negatives. The locked check then works because by the
time you acquire the lock, the damage has been undone.
Given the commits referenced in Fixes:, I suspect the real fix would of
course be to stop destroying that state in the first place.
But if that's not feasible, I would check whether a sequence counter around
this would do the trick -- then the racy upfront ep_events_available(ep)
check would become safe, with smaller overhead than your proposal in the
no-event case, but with higher overhead when there is something to report.
My proposal is trivial to implement; I have no idea if it will get
buy-in though.
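
To illustrate what I mean, here is a userspace analogue of that scheme
(C11 atomics standing in for the kernel's seqcount_t, booleans standing in
for the rdllist/ovflist state; all names are mine, not actual eventpoll
code):

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_uint seq;				/* even = stable, odd = scan in flight */
static _Atomic bool rdllist_empty;		/* stand-in for list_empty(&ep->rdllist) */
static _Atomic bool ovflist_active;		/* stand-in for ovflist holding events */

/* Writer side: bracket the state-disturbing scan with seq bumps. */
static void start_scan(void)
{
	atomic_fetch_add_explicit(&seq, 1, memory_order_release);	/* now odd */
	/* Simulate splicing rdllist away: state transiently looks empty. */
	atomic_store_explicit(&rdllist_empty, true, memory_order_relaxed);
}

static void done_scan(void)
{
	/* Simulate splicing the events back. */
	atomic_store_explicit(&rdllist_empty, false, memory_order_relaxed);
	atomic_fetch_add_explicit(&seq, 1, memory_order_release);	/* even again */
}

/* Reader side: retry while a scan may have disturbed the state. */
static bool events_available(void)
{
	unsigned int s;
	bool avail;

	for (;;) {
		s = atomic_load_explicit(&seq, memory_order_acquire);
		if (s & 1)
			continue;	/* writer in the critical window, retry */
		avail = !atomic_load_explicit(&rdllist_empty, memory_order_relaxed) ||
			atomic_load_explicit(&ovflist_active, memory_order_relaxed);
		if (atomic_load_explicit(&seq, memory_order_acquire) == s)
			return avail;	/* no scan raced with us */
	}
}
```

In the kernel this would presumably be read_seqcount_begin()/
read_seqcount_retry() around the existing checks in ep_events_available(),
with the write side bracketing ep_start_scan()/ep_done_scan(). The reader
spins while a scan is in flight, which is bounded since the scan windows
are short.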
> Fixes: c5a282e9635e ("fs/epoll: reduce the scope of wq lock in epoll_wait()")
> Fixes: e59d3c64cba6 ("epoll: eliminate unnecessary lock for zero timeout")
> Fixes: ae3a4f1fdc2c ("eventpoll: add epoll_sendevents() helper")
> Signed-off-by: Nam Cao <namcao@...utronix.de>
> Cc: stable@...r.kernel.org
> ---
> fs/eventpoll.c | 16 ++--------------
> 1 file changed, 2 insertions(+), 14 deletions(-)
>
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 0fbf5dfedb24..541481eafc20 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -2022,7 +2022,7 @@ static int ep_schedule_timeout(ktime_t *to)
> static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> int maxevents, struct timespec64 *timeout)
> {
> - int res, eavail, timed_out = 0;
> + int res, eavail = 1, timed_out = 0;
> u64 slack = 0;
> wait_queue_entry_t wait;
> ktime_t expires, *to = NULL;
> @@ -2041,16 +2041,6 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> timed_out = 1;
> }
>
> - /*
> - * This call is racy: We may or may not see events that are being added
> - * to the ready list under the lock (e.g., in IRQ callbacks). For cases
> - * with a non-zero timeout, this thread will check the ready list under
> - * lock and will add to the wait queue. For cases with a zero
> - * timeout, the user by definition should not care and will have to
> - * recheck again.
> - */
> - eavail = ep_events_available(ep);
> -
> while (1) {
> if (eavail) {
> res = ep_try_send_events(ep, events, maxevents);
> @@ -2496,9 +2486,7 @@ int epoll_sendevents(struct file *file, struct epoll_event __user *events,
> * Racy call, but that's ok - it should get retried based on
> * poll readiness anyway.
> */
> - if (ep_events_available(ep))
> - return ep_try_send_events(ep, events, maxevents);
> - return 0;
> + return ep_try_send_events(ep, events, maxevents);
> }
>
> /*
> --
> 2.39.5
>