Message-ID: <20230905214235.320571-1-peterx@redhat.com>
Date: Tue, 5 Sep 2023 17:42:28 -0400
From: Peter Xu <peterx@...hat.com>
To: linux-kernel@...r.kernel.org, linux-mm@...ck.org
Cc: Anish Moorthy <amoorthy@...gle.com>,
Axel Rasmussen <axelrasmussen@...gle.com>,
Alexander Viro <viro@...iv.linux.org.uk>,
Mike Kravetz <mike.kravetz@...cle.com>,
Peter Zijlstra <peterz@...radead.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Mike Rapoport <rppt@...nel.org>,
Christian Brauner <brauner@...nel.org>, peterx@...hat.com,
linux-fsdevel@...r.kernel.org,
Andrea Arcangeli <aarcange@...hat.com>,
Ingo Molnar <mingo@...hat.com>,
James Houghton <jthoughton@...gle.com>,
Nadav Amit <nadav.amit@...il.com>
Subject: [PATCH 0/7] mm/userfaultfd/poll: Scale userfaultfd wakeups
Userfaultfd is the kind of file that doesn't need wake-all semantics: if
there is a message enqueued (for either a fault address or an event), we
only need to wake up one service thread to handle it. Waking up more
threads normally just wastes CPU cycles; more importantly, it simply
doesn't scale.
Andrea used to have a patch that made read() wakeups O(1), but it never
hit upstream. This is my effort to upstream it (it is a one-liner..); on
top of that I also made poll() O(1) on wakeup (more or less bringing
EPOLLEXCLUSIVE semantics to poll()), with some tests showing the effect.
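To give an idea of the shape of the changes (a sketch only, not the actual
diffs; see the patches for the real thing): on the read() side,
userfaultfd_ctx_read() currently queues the reader on fd_wqh
non-exclusively, and making that enqueue exclusive is the one-liner. The
poll() side needs a bit more plumbing because the enqueue happens inside
the poll_queue_proc callback, hence the new poll_flags /
POLL_ENQUEUE_EXCLUSIVE bits.

/* Sketch of the read() side idea (fs/userfaultfd.c, heavily simplified) */
	spin_lock_irq(&ctx->fd_wqh.lock);
	/* was: __add_wait_queue(&ctx->fd_wqh, &wait); */
	__add_wait_queue_exclusive(&ctx->fd_wqh, &wait);
	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		/* try to dequeue one message (fault or event)... */
		/* ...otherwise drop the lock, schedule(), and retry */
	}
	__remove_wait_queue(&ctx->fd_wqh, &wait);
	__set_current_state(TASK_RUNNING);
	spin_unlock_irq(&ctx->fd_wqh.lock);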
To verify this, I added a test called uffd-perf (leveraging the refactored
uffd selftest suite) that measures the messaging channel latencies on
wakeups; the effect of the waitqueue optimizations shows up in the new test:
Constants: 40 uffd threads, on N_CPUS=40, memsize=512M
Units: milliseconds (to finish the test); diff (%) is (after-before)/after
|-----------------+--------+-------+------------|
| test case | before | after | diff (%) |
|-----------------+--------+-------+------------|
| workers=8,poll | 1762 | 1133 | -55.516328 |
| workers=8,read | 1437 | 585 | -145.64103 |
| workers=16,poll | 1117 | 1097 | -1.8231541 |
| workers=16,read | 1159 | 759 | -52.700922 |
| workers=32,poll | 1001 | 973 | -2.8776978 |
| workers=32,read | 866 | 713 | -21.458626 |
|-----------------+--------+-------+------------|
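For reference, each worker in a test like this is essentially a
poll()+read() service loop on the uffd, roughly like the sketch below (the
real uffd-perf.c reuses the uffd-common.c helpers, so details differ); with
many such threads parked on the same uffd, the wake-up behaviour is exactly
what the numbers above stress:

#include <poll.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Rough sketch of one service thread; error handling trimmed. */
static void *worker(void *arg)
{
	int uffd = *(int *)arg;
	struct uffd_msg msg;
	struct pollfd pfd = { .fd = uffd, .events = POLLIN };

	for (;;) {
		if (poll(&pfd, 1, -1) < 0)
			break;
		/* Another worker may have raced us to the message. */
		if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
			continue;
		if (msg.event == UFFD_EVENT_PAGEFAULT) {
			struct uffdio_copy copy = { 0 };

			/* .dst/.src/.len/.mode filled in by the real test */
			ioctl(uffd, UFFDIO_COPY, &copy);
		}
	}
	return NULL;
}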
The more threads hanging on the fd_wqh, the bigger the difference shown in
the numbers. "8 worker threads" is the worst case here because it means up
to 40-8=32 threads can be hanging idle on the fd_wqh queue.
In real life there can be more workers than this, but a small number of
active worker threads will cause a similar effect.
This is currently based on Andrew's mm-unstable branch, but it should be
applicable to most not-so-old trees.
Comments welcome, thanks.
Andrea Arcangeli (1):
mm/userfaultfd: Make uffd read() wait event exclusive
Peter Xu (6):
poll: Add a poll_flags for poll_queue_proc()
poll: POLL_ENQUEUE_EXCLUSIVE
fs/userfaultfd: Use exclusive waitqueue for poll()
selftests/mm: Replace uffd_read_mutex with a semaphore
selftests/mm: Create uffd_fault_thread_create|join()
selftests/mm: uffd perf test
drivers/vfio/virqfd.c | 4 +-
drivers/vhost/vhost.c | 2 +-
drivers/virt/acrn/irqfd.c | 2 +-
fs/aio.c | 2 +-
fs/eventpoll.c | 2 +-
fs/select.c | 9 +-
fs/userfaultfd.c | 8 +-
include/linux/poll.h | 25 ++-
io_uring/poll.c | 4 +-
mm/memcontrol.c | 4 +-
net/9p/trans_fd.c | 3 +-
tools/testing/selftests/mm/Makefile | 2 +
tools/testing/selftests/mm/uffd-common.c | 65 +++++++
tools/testing/selftests/mm/uffd-common.h | 7 +
tools/testing/selftests/mm/uffd-perf.c | 207 +++++++++++++++++++++++
tools/testing/selftests/mm/uffd-stress.c | 53 +-----
virt/kvm/eventfd.c | 2 +-
17 files changed, 337 insertions(+), 64 deletions(-)
create mode 100644 tools/testing/selftests/mm/uffd-perf.c
--
2.41.0