Message-ID: <CAJHvVcjQR95KVfu2qv3hepkLWkH5J8qRG_BazHKSXoGoGnUATg@mail.gmail.com>
Date: Thu, 7 Sep 2023 12:18:29 -0700
From: Axel Rasmussen <axelrasmussen@...gle.com>
To: Peter Xu <peterx@...hat.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
Anish Moorthy <amoorthy@...gle.com>,
Alexander Viro <viro@...iv.linux.org.uk>,
Mike Kravetz <mike.kravetz@...cle.com>,
Peter Zijlstra <peterz@...radead.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Mike Rapoport <rppt@...nel.org>,
Christian Brauner <brauner@...nel.org>,
linux-fsdevel@...r.kernel.org,
Andrea Arcangeli <aarcange@...hat.com>,
Ingo Molnar <mingo@...hat.com>,
James Houghton <jthoughton@...gle.com>,
Nadav Amit <nadav.amit@...il.com>
Subject: Re: [PATCH 0/7] mm/userfaultfd/poll: Scale userfaultfd wakeups
On Tue, Sep 5, 2023 at 2:42 PM Peter Xu <peterx@...hat.com> wrote:
>
> Userfaultfd is the type of file that doesn't need wake-all semantics: if
> there is a message enqueued (for either a fault address or an event), we
> only need to wake up one service thread to handle it. Waking up more is
> normally a waste of cpu cycles and, more importantly, it just doesn't
> scale.
Hi Peter,
I took a quick look over the series and didn't see anything
objectionable. I was planning to actually test the series out and then
send out R-b's, but it will take some additional time (next week).
In the meantime, I was curious about the use case. A design I've seen
for VM live migration is to have 1 thread reading events off the uffd,
and then have many threads actually resolving the fault events that
come in (e.g. fetching pages over the network, issuing UFFDIO_COPY or
UFFDIO_CONTINUE, or whatever). In that design, since we only have a
single reader anyway, I think this series doesn't help.
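For concreteness, a rough sketch of that single-reader design (setup and
error handling omitted; for illustration the fault is resolved inline with
a zero-filled page, so the helper below is made up rather than real
migration code, which would hand the address to a resolver pool instead):

/*
 * Sketch only: one thread drains uffd messages and resolves each fault
 * with UFFDIO_COPY.  In the real design the address would be queued to
 * worker threads that first fetch the page contents over the network.
 */
#include <linux/userfaultfd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

static void resolve_fault(int uffd, unsigned long addr, size_t psz)
{
	static char *page;		/* stand-in for a fetched page */
	struct uffdio_copy copy;

	if (!page)
		page = aligned_alloc(psz, psz);
	memset(page, 0, psz);

	copy.dst = addr & ~(psz - 1);	/* page-align the faulting address */
	copy.src = (unsigned long)page;
	copy.len = psz;
	copy.mode = 0;
	ioctl(uffd, UFFDIO_COPY, &copy);	/* installs the page, wakes the faulter */
}

static void *uffd_reader(void *arg)
{
	int uffd = *(int *)arg;
	size_t psz = sysconf(_SC_PAGESIZE);
	struct uffd_msg msg;

	while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
		if (msg.event == UFFD_EVENT_PAGEFAULT)
			resolve_fault(uffd, msg.arg.pagefault.address, psz);
	}
	return NULL;
}
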
But, I'm curious if you have data indicating that > 1 reader is more
performant overall? I suspect it might be the case that, with "enough"
vCPUs, it makes sense to do so, but I don't have benchmark data to
tell me what that tipping point is yet.
OTOH, if one reader is plenty in ~all cases, optimizing this path is
less important.
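By ">1 reader" I just mean N threads all blocked in read() on the same
uffd, e.g. reusing the reader loop from the sketch above; with the
exclusive wait-event in this series, one enqueued message would wake one
of them instead of all N. Roughly:

#include <pthread.h>

void *uffd_reader(void *arg);		/* reader loop from the sketch above */

static void start_readers(int *uffd, int nr_readers)
{
	pthread_t tid;
	int i;

	for (i = 0; i < nr_readers; i++)
		pthread_create(&tid, NULL, uffd_reader, uffd);
}
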
>
> Andrea used to have a patch that made read() O(1), but it never hit
> upstream. This is my effort to upstream it (the read() change is a
> one-liner..), and on top of that I also made poll() O(1) on wakeup
> (more or less bringing EPOLLEXCLUSIVE to poll()), with some tests
> showing the effect.
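
If I follow, the poll() side is more or less the wake-one behavior epoll
users can already opt into with EPOLLEXCLUSIVE; my (possibly inaccurate)
sketch of that existing pattern, one epoll instance per poller thread:

#include <linux/userfaultfd.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Each poller owns its own epoll instance and attaches the shared uffd
 * with EPOLLEXCLUSIVE, so an enqueued message wakes only one poller. */
static void *uffd_poller(void *arg)
{
	int uffd = *(int *)arg;
	int epfd = epoll_create1(0);
	struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };
	struct uffd_msg msg;

	ev.data.fd = uffd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, uffd, &ev);

	for (;;) {
		struct epoll_event out;

		if (epoll_wait(epfd, &out, 1, -1) == 1)
			read(uffd, &msg, sizeof(msg));	/* then resolve the fault;
							   with O_NONBLOCK a racing
							   loser just sees EAGAIN */
	}
	return NULL;
}
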
>
> To verify this, I added a test called uffd-perf (leveraging the refactored
> uffd selftest suite) that measures the messaging channel latencies on
> wakeups; the effect of the waitqueue optimizations shows up in the new test:
>
> Constants: 40 uffd threads, on N_CPUS=40, memsize=512M
> Units: milliseconds (to finish the test)
> |-----------------+--------+-------+------------|
> | test case | before | after | diff (%) |
> |-----------------+--------+-------+------------|
> | workers=8,poll | 1762 | 1133 | -55.516328 |
> | workers=8,read | 1437 | 585 | -145.64103 |
> | workers=16,poll | 1117 | 1097 | -1.8231541 |
> | workers=16,read | 1159 | 759 | -52.700922 |
> | workers=32,poll | 1001 | 973 | -2.8776978 |
> | workers=32,read | 866 | 713 | -21.458626 |
> |-----------------+--------+-------+------------|
>
> The more threads hanging on fd_wqh, the bigger the difference shown in
> the numbers. "8 worker threads" is the worst case here because it means
> up to 40-8=32 threads can sit idle on the fd_wqh queue.
>
> In real life there can be more workers than this, but a small number of
> active worker threads will cause a similar effect.
>
> This is currently based on Andrew's mm-unstable branch, but it should
> apply to most reasonably recent trees.
>
> Comments welcomed, thanks.
>
> Andrea Arcangeli (1):
> mm/userfaultfd: Make uffd read() wait event exclusive
>
> Peter Xu (6):
> poll: Add a poll_flags for poll_queue_proc()
> poll: POLL_ENQUEUE_EXCLUSIVE
> fs/userfaultfd: Use exclusive waitqueue for poll()
> selftests/mm: Replace uffd_read_mutex with a semaphore
> selftests/mm: Create uffd_fault_thread_create|join()
> selftests/mm: uffd perf test
>
> drivers/vfio/virqfd.c | 4 +-
> drivers/vhost/vhost.c | 2 +-
> drivers/virt/acrn/irqfd.c | 2 +-
> fs/aio.c | 2 +-
> fs/eventpoll.c | 2 +-
> fs/select.c | 9 +-
> fs/userfaultfd.c | 8 +-
> include/linux/poll.h | 25 ++-
> io_uring/poll.c | 4 +-
> mm/memcontrol.c | 4 +-
> net/9p/trans_fd.c | 3 +-
> tools/testing/selftests/mm/Makefile | 2 +
> tools/testing/selftests/mm/uffd-common.c | 65 +++++++
> tools/testing/selftests/mm/uffd-common.h | 7 +
> tools/testing/selftests/mm/uffd-perf.c | 207 +++++++++++++++++++++++
> tools/testing/selftests/mm/uffd-stress.c | 53 +-----
> virt/kvm/eventfd.c | 2 +-
> 17 files changed, 337 insertions(+), 64 deletions(-)
> create mode 100644 tools/testing/selftests/mm/uffd-perf.c
>
> --
> 2.41.0
>