Message-ID: <20251028175639.2567832-1-kuniyu@google.com>
Date: Tue, 28 Oct 2025 17:56:28 +0000
From: Kuniyuki Iwashima <kuniyu@...gle.com>
To: Christian Brauner <brauner@...nel.org>, Jens Axboe <axboe@...nel.dk>
Cc: Dave Hansen <dave.hansen@...ux.intel.com>,
David Laight <david.laight.linux@...il.com>,
Linus Torvalds <torvalds@...ux-foundation.org>, Eric Dumazet <edumazet@...gle.com>,
Kuniyuki Iwashima <kuniyu@...gle.com>, Kuniyuki Iwashima <kuni1840@...il.com>, linux-kernel@...r.kernel.org,
Dave Hansen <dave.hansen@...el.com>
Subject: [PATCH v2] epoll: Use user_write_access_begin() and unsafe_put_user()
in epoll_put_uevent().

epoll_put_uevent() calls __put_user() twice; each call expands into a
call to an out-of-line helper, __put_user_nocheck_4() and
__put_user_nocheck_8() respectively.

Both helpers wrap the mov with a stac/clac pair, which is expensive
on the AMD EPYC 7B12 (Zen 2) 64-Core Processor platform.

__put_user_nocheck_4 /proc/kcore [Percent: local period]
Percent │
89.91 │ stac
0.19 │ mov %eax,(%rcx)
0.15 │ xor %ecx,%ecx
9.69 │ clac
0.06 │ ← retq

This overhead stood out while testing neper/udp_rr with 1000 flows
per thread:

Overhead Shared O Symbol
10.08% [kernel] [k] _copy_to_iter
7.12% [kernel] [k] ip6_output
6.40% [kernel] [k] sock_poll
5.71% [kernel] [k] move_addr_to_user
4.39% [kernel] [k] __put_user_nocheck_4
...
1.06% [kernel] [k] ep_try_send_events
... ^- epoll_put_uevent() was inlined
0.78% [kernel] [k] __put_user_nocheck_8

Let's use user_write_access_begin() and unsafe_put_user() in
epoll_put_uevent().

We saw 2% more pps with udp_rr by saving a stac/clac pair.
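
For illustration only (a hedged sketch that just mirrors the diff
below; no code is added anywhere else), the per-event cost changes
roughly like this:

	/* Before: each __put_user() is an out-of-line call doing its own
	 * stac ... clac, so writing one epoll_event pays two SMAP round
	 * trips.
	 */
	if (__put_user(revents, &uevent->events) ||	/* call -> stac, mov, clac */
	    __put_user(data, &uevent->data))		/* call -> stac, mov, clac */
		return NULL;

	/* After: one explicit user-access window around both stores, so
	 * writing one epoll_event pays a single stac/clac pair.  (The
	 * efault: unwind path with user_access_end() is in the diff below.)
	 */
	if (!user_write_access_begin(uevent, sizeof(*uevent)))	/* access_ok() + stac */
		return NULL;
	unsafe_put_user(revents, &uevent->events, efault);	/* plain mov */
	unsafe_put_user(data, &uevent->data, efault);		/* plain mov */
	user_access_end();					/* clac */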

Before:

# nstat > /dev/null; sleep 10; nstat | grep -i udp
Udp6InDatagrams 2184011 0.0

@ep_try_send_events_ns:
[256, 512) 2796601 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 627863 |@@@@@@@@@@@ |
[1K, 2K) 166403 |@@@ |
[2K, 4K) 10437 | |
[4K, 8K) 1396 | |
[8K, 16K) 116 | |

After:

# nstat > /dev/null; sleep 10; nstat | grep -i udp
Udp6InDatagrams 2232730 0.0

@ep_try_send_events_ns:
[256, 512) 2900655 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 622045 |@@@@@@@@@@@ |
[1K, 2K) 172831 |@@@ |
[2K, 4K) 17687 | |
[4K, 8K) 1103 | |
[8K, 16K) 174 | |

Another option would be to use can_do_masked_user_access() and
masked_user_access_begin(), but we saw a 3% regression with that
approach (see the Link below).
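
For reference, that alternative would look roughly like the sketch
below (hedged; the exact variant that was measured is in the Link
below, and masked_user_access_begin() only takes effect on
architectures that support masked user access):

	if (can_do_masked_user_access())
		uevent = masked_user_access_begin(uevent);
	else if (!user_write_access_begin(uevent, sizeof(*uevent)))
		return NULL;

	unsafe_put_user(revents, &uevent->events, efault);
	unsafe_put_user(data, &uevent->data, efault);
	user_access_end();
	return uevent + 1;
efault:
	user_access_end();
	return NULL;
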
Link: https://lore.kernel.org/lkml/20251028053330.2391078-1-kuniyu@google.com/
Suggested-by: Eric Dumazet <edumazet@...gle.com>
Suggested-by: Dave Hansen <dave.hansen@...el.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@...gle.com>
---
v2:
* Drop patch 1
* Use user_write_access_begin() instead of a bare stac (Dave Hansen)
v1: https://lore.kernel.org/lkml/20251023000535.2897002-1-kuniyu@google.com/
---
include/linux/eventpoll.h | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index ccb478eb174b..31a1b11e4ddf 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -82,11 +82,15 @@ static inline struct epoll_event __user *
 epoll_put_uevent(__poll_t revents, __u64 data,
		 struct epoll_event __user *uevent)
 {
-	if (__put_user(revents, &uevent->events) ||
-	    __put_user(data, &uevent->data))
+	if (!user_write_access_begin(uevent, sizeof(*uevent)))
 		return NULL;
-
-	return uevent+1;
+	unsafe_put_user(revents, &uevent->events, efault);
+	unsafe_put_user(data, &uevent->data, efault);
+	user_access_end();
+	return uevent + 1;
+efault:
+	user_access_end();
+	return NULL;
 }
 #endif
 
--
2.51.1.851.g4ebd6896fd-goog