lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CA+EESO5nvzka0KzFGzdGgiCWPLg7XD-8jA9=NTUOKFy-56orUg@mail.gmail.com>
Date:   Mon, 9 Oct 2023 17:29:08 +0100
From:   Lokesh Gidra <lokeshgidra@...gle.com>
To:     David Hildenbrand <david@...hat.com>
Cc:     Suren Baghdasaryan <surenb@...gle.com>, akpm@...ux-foundation.org,
        viro@...iv.linux.org.uk, brauner@...nel.org, shuah@...nel.org,
        aarcange@...hat.com, peterx@...hat.com, hughd@...gle.com,
        mhocko@...e.com, axelrasmussen@...gle.com, rppt@...nel.org,
        willy@...radead.org, Liam.Howlett@...cle.com, jannh@...gle.com,
        zhangpeng362@...wei.com, bgeffon@...gle.com,
        kaleshsingh@...gle.com, ngeoffray@...gle.com, jdduke@...gle.com,
        linux-mm@...ck.org, linux-fsdevel@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org,
        kernel-team@...roid.com
Subject: Re: [PATCH v3 2/3] userfaultfd: UFFDIO_MOVE uABI

On Mon, Oct 9, 2023 at 5:24 PM David Hildenbrand <david@...hat.com> wrote:
>
> On 09.10.23 18:21, Suren Baghdasaryan wrote:
> > On Mon, Oct 9, 2023 at 7:38 AM David Hildenbrand <david@...hat.com> wrote:
> >>
> >> On 09.10.23 08:42, Suren Baghdasaryan wrote:
> >>> From: Andrea Arcangeli <aarcange@...hat.com>
> >>>
> >>> Implement the uABI of UFFDIO_MOVE ioctl.
> >>> UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
> >>> needs pages to be allocated [1]. However, with UFFDIO_MOVE, if pages are
> >>> available (in userspace) for recycling, as is usually the case in heap
> >>> compaction algorithms, then we can avoid the page allocation and memcpy
> >>> (done by UFFDIO_COPY). Also, since the pages are recycled in the
> >>> userspace, we avoid the need to release (via madvise) the pages back to
> >>> the kernel [2].
> >>> We see over 40% reduction (on a Google pixel 6 device) in the compacting
> >>> thread’s completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was
> >>> measured using a benchmark that emulates a heap compaction implementation
> >>> using userfaultfd (to allow concurrent accesses by application threads).
> >>> More details of the usecase are explained in [2].
> >>> Furthermore, UFFDIO_MOVE enables moving swapped-out pages without
> >>> touching them within the same vma. Today, it can only be done by mremap,
> >>> however it forces splitting the vma.
> >>>
> >>> [1] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarcange@redhat.com/
> >>> [2] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyjniNVjp0Aw@mail.gmail.com/
> >>>
> >>> Update for the ioctl_userfaultfd(2)  manpage:
> >>>
> >>>      UFFDIO_MOVE
> >>>          (Since Linux xxx)  Move a continuous memory chunk into the
> >>>          userfault registered range and optionally wake up the blocked
> >>>          thread. The source and destination addresses and the number of
> >>>          bytes to move are specified by the src, dst, and len fields of
> >>>          the uffdio_move structure pointed to by argp:
> >>>
> >>>              struct uffdio_move {
> >>>                  __u64 dst;    /* Destination of move */
> >>>                  __u64 src;    /* Source of move */
> >>>                  __u64 len;    /* Number of bytes to move */
> >>>                  __u64 mode;   /* Flags controlling behavior of move */
> >>>                  __s64 move;   /* Number of bytes moved, or negated error */
> >>>              };
> >>>
> >>>          The following value may be bitwise ORed in mode to change the
> >>>          behavior of the UFFDIO_MOVE operation:
> >>>
> >>>          UFFDIO_MOVE_MODE_DONTWAKE
> >>>                 Do not wake up the thread that waits for page-fault
> >>>                 resolution
> >>>
> >>>          UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES
> >>>                 Allow holes in the source virtual range that is being moved.
> >>>                 When not specified, the holes will result in ENOENT error.
> >>>                 When specified, the holes will be accounted as successfully
> >>>                 moved memory. This is mostly useful to move hugepage aligned
> >>>                 virtual regions without knowing if there are transparent
> >>>                 hugepages in the regions or not, but preventing the risk of
> >>>                 having to split the hugepage during the operation.
> >>>
> >>>          The move field is used by the kernel to return the number of
> >>>          bytes that was actually moved, or an error (a negated errno-
> >>>          style value).  If the value returned in move doesn't match the
> >>>          value that was specified in len, the operation fails with the
> >>>          error EAGAIN.  The move field is output-only; it is not read by
> >>>          the UFFDIO_MOVE operation.
> >>>
> >>>          The operation may fail for various reasons. Usually, remapping of
> >>>          pages that are not exclusive to the given process fail; once KSM
> >>>          might deduplicate pages or fork() COW-shares pages during fork()
> >>>          with child processes, they are no longer exclusive. Further, the
> >>>          kernel might only perform lightweight checks for detecting whether
> >>>          the pages are exclusive, and return -EBUSY in case that check fails.
> >>>          To make the operation more likely to succeed, KSM should be
> >>>          disabled, fork() should be avoided or MADV_DONTFORK should be
> >>>          configured for the source VMA before fork().
> >>>
> >>>          This ioctl(2) operation returns 0 on success.  In this case, the
> >>>          entire area was moved.  On error, -1 is returned and errno is
> >>>          set to indicate the error.  Possible errors include:
> >>>
> >>>          EAGAIN The number of bytes moved (i.e., the value returned in
> >>>                 the move field) does not equal the value that was
> >>>                 specified in the len field.
> >>>
> >>>          EINVAL Either dst or len was not a multiple of the system page
> >>>                 size, or the range specified by src and len or dst and len
> >>>                 was invalid.
> >>>
> >>>          EINVAL An invalid bit was specified in the mode field.
> >>>
> >>>          ENOENT
> >>>                 The source virtual memory range has unmapped holes and
> >>>                 UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES is not set.
> >>>
> >>>          EEXIST
> >>>                 The destination virtual memory range is fully or partially
> >>>                 mapped.
> >>>
> >>>          EBUSY
> >>>                 The pages in the source virtual memory range are not
> >>>                 exclusive to the process. The kernel might only perform
> >>>                 lightweight checks for detecting whether the pages are
> >>>                 exclusive. To make the operation more likely to succeed,
> >>>                 KSM should be disabled, fork() should be avoided or
> >>>                 MADV_DONTFORK should be configured for the source virtual
> >>>                 memory area before fork().
> >>>
> >>>          ENOMEM Allocating memory needed for the operation failed.
> >>>
> >>>          ESRCH
> >>>                 The faulting process has exited at the time of a
> >>>                 UFFDIO_MOVE operation.
> >>>
> >>
> >> A general comment simply because I realized that just now: does anything
> >> speak against limiting the operations now to a single MM?
> >>
> >> The use cases I heard so far don't need it. If ever required, we could
> >> consider extending it.
> >>
> >> Let's reduce complexity and KIS unless really required.
> >
> > Let me check if there are use cases that require moves between MMs.
> > Andrea seems to have put considerable effort to make it work between
> > MMs and it would be a pity to lose that. I can send a follow-up patch
> > to recover that functionality and even if it does not get merged, it
> > can be used in the future as a reference. But first let me check if we
> > can drop it.

For the compaction use case that we have it's fine to limit it to
single MM. However, for general use I think Peter will have a better
idea.
>
> Yes, that sounds reasonable. Unless the big important use cases requires
> moving pages between processes, let's leave that as future work for now.
>
> --
> Cheers,
>
> David / dhildenb
>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ