Message-ID: <20250716074110.AY8PBtBi@linutronix.de>
Date: Wed, 16 Jul 2025 09:41:10 +0200
From: Nam Cao <namcao@...utronix.de>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Christian Brauner <brauner@...nel.org>, Xi Ruoyao <xry111@...111.site>,
	Frederic Weisbecker <frederic@...nel.org>,
	Valentin Schneider <vschneid@...hat.com>,
	Alexander Viro <viro@...iv.linux.org.uk>, Jan Kara <jack@...e.cz>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
	John Ogness <john.ogness@...utronix.de>,
	K Prateek Nayak <kprateek.nayak@....com>,
	Clark Williams <clrkwllms@...nel.org>,
	Steven Rostedt <rostedt@...dmis.org>, linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org, linux-rt-devel@...ts.linux.dev,
	linux-rt-users@...r.kernel.org, Joe Damato <jdamato@...tly.com>,
	Martin Karsten <mkarsten@...terloo.ca>,
	Jens Axboe <axboe@...nel.dk>, stable@...r.kernel.org
Subject: Re: [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock

On Tue, Jul 15, 2025 at 09:42:52AM -0700, Linus Torvalds wrote:
> On Tue, 15 Jul 2025 at 05:47, Nam Cao <namcao@...utronix.de> wrote:
> >
> >  fs/eventpoll.c | 139 +++++++++----------------------------------------
> >  1 file changed, 26 insertions(+), 113 deletions(-)
> 
> Yeah, this is more like the kind of diffstat I like to see for
> eventpoll. Plus it makes things fundamentally simpler.
> 
> It might be worth looking at ep_poll_callback() - the only case that
> had read_lock_irqsave() - and seeing if perhaps some of the tests
> inside the lock might be done optimistically, or delayed to after the
> lock.
> 
> For example, the whole wakeup sequence looks like it should be done
> *after* the ep->lock has been released, because it uses its own lock
> (ie the
> 
>                 if (sync)
>                         wake_up_sync(&ep->wq);
>                 else
>                         wake_up(&ep->wq);
> 
> thing uses the wq lock, and there is nothing that ep->lock protects
> there as far as I can tell.
> 
> So I think this has some potential for _simple_ optimizations, but I'm
> not sure how worth it it is.

Actually, ep->lock does protect this part: ep_poll() touches ep->wq
without holding the waitqueue's lock, so the wait and wakeup sides
serialize on ep->lock instead.
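
Concretely, the queueing side currently looks like this (a sketch
reconstructed from the lines the diff below removes, simplified):

	spin_lock_irq(&ep->lock);
	/*
	 * Barrierless: the wakeup side tests waitqueue_active() under
	 * the same ep->lock, so no explicit barrier is needed here.
	 */
	__set_current_state(TASK_INTERRUPTIBLE);
	eavail = ep_events_available(ep);
	if (!eavail)
		__add_wait_queue_exclusive(&ep->wq, &wait);
	spin_unlock_irq(&ep->lock);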

We could do something like the diff below. But things are not so simple
anymore.
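
The "not so simple" part is the lockless waitqueue_active() check: once
the wakeup no longer runs under ep->lock, it is only safe with the
ordering described in the waitqueue_active() kerneldoc, roughly this
pattern (paraphrased from include/linux/wait.h):

	/*
	 * CPU0 - waker                  CPU1 - waiter
	 *
	 *                               for (;;) {
	 * cond = true;                    prepare_to_wait(&wq, &wait, state);
	 * smp_mb();                       if (cond)
	 * if (waitqueue_active(&wq))          break;
	 *         wake_up(&wq);           schedule();
	 *                               }
	 *                               finish_wait(&wq, &wait);
	 */

In the diff, the waiter side gets that ordering from
prepare_to_wait_exclusive() setting the task state, and the waker side
relies on the spin_unlock_irqrestore() of ep->lock, hence the new
comment in ep_poll_callback().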

Best regards,
Nam

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a171f7e7dacc..5e6c7da606e7 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1252,8 +1252,6 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	unsigned long flags;
 	int ewake = 0;
 
-	spin_lock_irqsave(&ep->lock, flags);
-
 	ep_set_busy_poll_napi_id(epi);
 
 	/*
@@ -1263,7 +1261,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	 * until the next EPOLL_CTL_MOD will be issued.
 	 */
 	if (!(epi->event.events & ~EP_PRIVATE_BITS))
-		goto out_unlock;
+		goto out;
 
 	/*
 	 * Check the events coming with the callback. At this stage, not
@@ -1272,7 +1270,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	 * test for "key" != NULL before the event match test.
 	 */
 	if (pollflags && !(pollflags & epi->event.events))
-		goto out_unlock;
+		goto out;
 
 	/*
 	 * If we are transferring events to userspace, we can hold no locks
@@ -1280,6 +1278,8 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	 * semantics). All the events that happen during that period of time are
 	 * chained in ep->ovflist and requeued later on.
 	 */
+	spin_lock_irqsave(&ep->lock, flags);
+
 	if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
 		if (epi->next == EP_UNACTIVE_PTR) {
 			epi->next = READ_ONCE(ep->ovflist);
@@ -1292,9 +1292,13 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 		ep_pm_stay_awake_rcu(epi);
 	}
 
+	spin_unlock_irqrestore(&ep->lock, flags);
+
 	/*
 	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
 	 * wait list.
+	 *
+	 * The barrier needed by waitqueue_active() comes from spin_unlock_irqrestore().
 	 */
 	if (waitqueue_active(&ep->wq)) {
 		if ((epi->event.events & EPOLLEXCLUSIVE) &&
@@ -1321,9 +1325,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	if (waitqueue_active(&ep->poll_wait))
 		pwake++;
 
-out_unlock:
-	spin_unlock_irqrestore(&ep->lock, flags);
-
+out:
 	/* We have to call this outside the lock */
 	if (pwake)
 		ep_poll_safewake(ep, epi, pollflags & EPOLL_URING_WAKE);
@@ -2001,13 +2003,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		init_wait(&wait);
 		wait.func = ep_autoremove_wake_function;
 
-		spin_lock_irq(&ep->lock);
-		/*
-		 * Barrierless variant, waitqueue_active() is called under
-		 * the same lock on wakeup ep_poll_callback() side, so it
-		 * is safe to avoid an explicit barrier.
-		 */
-		__set_current_state(TASK_INTERRUPTIBLE);
+		prepare_to_wait_exclusive(&ep->wq, &wait, TASK_INTERRUPTIBLE);
 
 		/*
 		 * Do the final check under the lock. ep_start/done_scan()
@@ -2016,10 +2012,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		 * period of time although events are pending, so lock is
 		 * important.
 		 */
+		spin_lock_irq(&ep->lock);
 		eavail = ep_events_available(ep);
-		if (!eavail)
-			__add_wait_queue_exclusive(&ep->wq, &wait);
-
 		spin_unlock_irq(&ep->lock);
 
 		if (!eavail)
@@ -2036,7 +2030,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		eavail = 1;
 
 		if (!list_empty_careful(&wait.entry)) {
-			spin_lock_irq(&ep->lock);
+			spin_lock_irq(&ep->wq.lock);
 			/*
 			 * If the thread timed out and is not on the wait queue,
 			 * it means that the thread was woken up after its
@@ -2047,7 +2041,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 			if (timed_out)
 				eavail = list_empty(&wait.entry);
 			__remove_wait_queue(&ep->wq, &wait);
-			spin_unlock_irq(&ep->lock);
+			spin_unlock_irq(&ep->wq.lock);
 		}
 	}
 }
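
For completeness, prepare_to_wait_exclusive() takes the waitqueue's own
lock, queues the entry as exclusive and sets the task state, roughly
like this (simplified from kernel/sched/wait.c; the exact signature
varies between kernel versions):

	void prepare_to_wait_exclusive(struct wait_queue_head *wq_head,
				       struct wait_queue_entry *wq_entry,
				       int state)
	{
		unsigned long flags;

		wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
		spin_lock_irqsave(&wq_head->lock, flags);
		if (list_empty(&wq_entry->entry))
			__add_wait_queue_entry_tail(wq_head, wq_entry);
		set_current_state(state);	/* implies smp_mb() */
		spin_unlock_irqrestore(&wq_head->lock, flags);
	}

With that, ep_poll() arms the wait entry under ep->wq.lock instead of
ep->lock, which is what lets ep_poll_callback() drop ep->lock before
doing the wakeups.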
