Message-ID: <CAK1f24kG72ogeh04SkngQ_=BW4nomZAejmRK3qrzLKvUKF1Tsw@mail.gmail.com>
Date: Thu, 18 Jan 2024 09:46:50 +0800
From: Lance Yang <ioworker0@...il.com>
To: "Zach O'Keefe" <zokeefe@...gle.com>
Cc: akpm@...ux-foundation.org, songmuchun@...edance.com, 
	linux-kernel@...r.kernel.org, Yang Shi <shy828301@...il.com>, 
	Peter Xu <peterx@...hat.com>, David Hildenbrand <david@...hat.com>, 
	Michael Knyszek <mknyszek@...gle.com>, Minchan Kim <minchan@...nel.org>, Michal Hocko <mhocko@...e.com>, 
	linux-mm@...ck.org
Subject: Re: [PATCH v1 1/2] mm/madvise: introduce MADV_TRY_COLLAPSE for
 attempted synchronous hugepage collapse

Hey Zach,

Thanks for taking the time to review!

On Thu, Jan 18, 2024 at 1:11 AM Zach O'Keefe <zokeefe@...gle.com> wrote:
>
> [+linux-mm & others]
>
> On Tue, Jan 16, 2024 at 9:02 PM Lance Yang <ioworker0@...il.com> wrote:
> >
> > This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].
> >
> > Introduce a new madvise mode, MADV_TRY_COLLAPSE, that allows users to
> > make a least-effort attempt at a synchronous collapse of memory at
> > their own expense.
> >
> > The only difference from MADV_COLLAPSE is that the new hugepage allocation
> > avoids direct reclaim and/or compaction, quickly failing on allocation errors.
> >
> > The benefits of this approach are:
> >
> > * CPU is charged to the process that wants to spend the cycles for the THP
> > * Avoid unpredictable timing of khugepaged collapse
> > * Prevent unpredictable stalls caused by direct reclaim and/or compaction
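
For readers skimming the thread, here is a minimal userspace sketch of the
intended call, assuming the advice value proposed below (26) ends up in the
uapi headers; the helper name is purely illustrative:

#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_TRY_COLLAPSE
#define MADV_TRY_COLLAPSE 26    /* value proposed by this patch */
#endif

/* Best-effort request to collapse [addr, addr + len) into THPs,
 * without the kernel entering direct reclaim or compaction. */
static void try_collapse(void *addr, size_t len)
{
        if (madvise(addr, len, MADV_TRY_COLLAPSE) != 0)
                perror("madvise(MADV_TRY_COLLAPSE)");
}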
> >
> > Semantics
> >
> > This call is independent of the system-wide THP sysfs settings, but will
> > fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
> > multiple VMAs, the semantics of the collapse over each VMA is independent
> > from the others.  This implies a hugepage cannot cross a VMA boundary.  If
> > collapse of a given hugepage-aligned/sized region fails, the operation may
> > continue to attempt collapsing the remainder of memory specified.
> >
> > The memory ranges provided must be page-aligned, but are not required to
> > be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
> > start/end of the range will be clamped to the first/last hugepage-aligned
> > address covered by said range.  The memory ranges must span at least one
> > hugepage-sized region.
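
As a concrete illustration of the clamping, assuming a 2 MiB PMD-sized
hugepage (the kernel derives the real size from HPAGE_PMD_SIZE, and madvise()
performs this clamping itself; the helpers below only show the effective
range):

#include <stdint.h>

#define HPAGE_SIZE (2UL << 20)  /* assumed PMD-sized hugepage: 2 MiB */

/* First hugepage-aligned address at or above start. */
static inline uintptr_t hpage_round_up(uintptr_t start)
{
        return (start + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
}

/* Last hugepage boundary at or below end. */
static inline uintptr_t hpage_round_down(uintptr_t end)
{
        return end & ~(HPAGE_SIZE - 1);
}

/* The range spans at least one hugepage-sized region iff
 * hpage_round_up(start) + HPAGE_SIZE <= hpage_round_down(end). */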
> >
> > All non-resident pages covered by the range will first be
> > swapped/faulted-in, before being internally copied onto a freshly
> > allocated hugepage.  Unmapped pages will have their data directly
> > initialized to 0 in the new hugepage.  However, for every eligible
> > hugepage aligned/sized region to-be collapsed, at least one page must
> > currently be backed by memory (a PMD covering the address range must
> > already exist).
> >
> > Allocation of the new hugepage will not enter direct reclaim and/or
> > compaction, and fails quickly if no hugepage is immediately available. On
> > systems with multiple NUMA nodes, the hugepage will be allocated from the
> > node providing the most native pages. This operation acts on the current
> > state of the specified process and makes no persistent changes or
> > guarantees about how pages will be mapped, constructed, or faulted in the
> > future.
> >
> > Return Value
> >
> > If all hugepage-sized/aligned regions covered by the provided range were
> > either successfully collapsed, or were already PMD-mapped THPs, this
> > operation will be deemed successful.  On success, madvise(2) returns 0.
> > Else, -1 is returned and errno is set to indicate the error for the
> > most-recently attempted hugepage collapse.  Note that many failures might
> > have occurred, since the operation may continue to collapse in the event a
> > single hugepage-sized/aligned region fails.
> >
> >         ENOMEM  Memory allocation failed or VMA not found
> >         EBUSY   Memcg charging failed
> >         EAGAIN  Required resource temporarily unavailable.  Try again
> >                 might succeed.
> >         EINVAL  Other error: No PMD found, subpage doesn't have Present
> >                 bit set, "Special" page not backed by struct page, VMA
> >                 incorrectly sized, address not page-aligned, ...
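
A hedged sketch of how a caller might react to these return values (the retry
policy is entirely the caller's choice, not something the kernel mandates;
MADV_TRY_COLLAPSE again stands in for the value this patch assigns):

#include <errno.h>
#include <sys/mman.h>

#ifndef MADV_TRY_COLLAPSE
#define MADV_TRY_COLLAPSE 26    /* value proposed by this patch */
#endif

/* 0: range is (now) PMD-mapped; 1: transient failure, maybe retry later;
 * -1: give up and keep using base pages. */
static int collapse_or_skip(void *addr, size_t len)
{
        if (madvise(addr, len, MADV_TRY_COLLAPSE) == 0)
                return 0;
        /*
         * Failure is the expected cheap path here: ENOMEM/EBUSY mean no
         * hugepage could be had without reclaim, compaction or memcg
         * pressure; EINVAL means the range is not eligible at all.
         */
        return errno == EAGAIN ? 1 : -1;
}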
> >
> > Use Cases
> >
> > An immediate user of this new functionality is the Go runtime heap
> > allocator, which manages memory in hugepage-sized chunks. In the past,
> > whether for a chunk newly allocated through mmap() or one reused after
> > being released with madvise(MADV_DONTNEED), the allocator eagerly tried to
> > back the memory with huge pages using madvise(MADV_HUGEPAGE)[2] and
> > madvise(MADV_COLLAPSE)[3] respectively. Both approaches, however, could
> > enter direct reclaim and/or compaction, leading to unpredictable
> > stalls[4]. With madvise(MADV_TRY_COLLAPSE), the allocator can attempt to
> > back memory with huge pages without risking such stalls.
> >
> > [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
> > [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
> > [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
> > [4] https://github.com/golang/go/issues/63334
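
In C terms (the Go runtime does the equivalent internally), the pattern would
look roughly like the sketch below; chunk alignment handling is elided, and a
real allocator keeps its chunks 2 MiB-aligned:

#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_TRY_COLLAPSE
#define MADV_TRY_COLLAPSE 26    /* value proposed by this patch */
#endif

#define CHUNK_SIZE (2UL << 20)  /* hugepage-sized allocator chunk */

/* Carve out a chunk and opportunistically back it with a THP. */
static void *alloc_chunk(void)
{
        void *chunk = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (chunk == MAP_FAILED)
                return NULL;

        /* Touch one page so a page table exists, then try a no-stall
         * collapse; on failure we simply keep using base pages. */
        *(volatile char *)chunk = 0;
        madvise(chunk, CHUNK_SIZE, MADV_TRY_COLLAPSE);

        return chunk;
}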
>
> Thanks for the patch, Lance, and thanks for providing the links above,
> referring to issues Go has seen.
>
> I've reached out to the Go team to try and understand their use case,
> and how we could help. It's not immediately clear whether a
> lighter-weight MADV_COLLAPSE is the answer, but it could turn out to
> be.
>
> That said, with respect to the implementation, should a need for a
> lighter-weight MADV_COLLAPSE be warranted, I'd personally like to see
> process_madvise(2) be the "v2" of madvise(2), where we can start

I agree with you; it's a good idea!

> leveraging the forward-facing flags argument for these different
> advice flavors. We'd need to safely revert v5.10 commit a68a0262abdaa
> ("mm/madvise: remove racy mm ownership check") so that
> process_madvise(2) can always operate on self. IIRC, this was ~ the
> plan we landed on during MADV_COLLAPSE dev discussions (i.e. pick a
> sane default, and implement options in flags down the line).
>
> That flag could be a MADV_F_COLLAPSE_LIGHT, where we use a lighter

The name MADV_F_COLLAPSE_LIGHT sounds great for the flag, and its
semantics are very clear.
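
Purely for illustration, and with the big caveat that neither
MADV_F_COLLAPSE_LIGHT nor any use of process_madvise(2)'s flags argument
exists today (the kernel currently requires flags == 0), the userspace side
might eventually look something like this, assuming SYS_process_madvise and
SYS_pidfd_open are available in the libc headers:

#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif
#define MADV_F_COLLAPSE_LIGHT (1U << 0)  /* hypothetical, not in any uapi */

/* Ask for a light-weight collapse of [addr, addr + len) in the current
 * process via process_madvise(2) with an advice-flavor flag. */
static int collapse_light_self(void *addr, size_t len)
{
        struct iovec iov = { .iov_base = addr, .iov_len = len };
        int pidfd = syscall(SYS_pidfd_open, getpid(), 0);
        long ret;

        if (pidfd < 0)
                return -1;
        ret = syscall(SYS_process_madvise, pidfd, &iov, 1UL,
                      MADV_COLLAPSE, MADV_F_COLLAPSE_LIGHT);
        close(pidfd);
        return ret < 0 ? -1 : 0;
}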

Thanks again for your review and your suggestion!
Lance

> allocation context, as well as, for example, only do a local
> lru_add_drain() vs lru_add_drain_all(). But I'll refrain from thinking
> too hard about it just yet.
>
> Best,
> Zach
>
>
>
>
> > Signed-off-by: Lance Yang <ioworker0@...il.com>
> > ---
> >  arch/alpha/include/uapi/asm/mman.h           |  1 +
> >  arch/mips/include/uapi/asm/mman.h            |  1 +
> >  arch/parisc/include/uapi/asm/mman.h          |  1 +
> >  arch/xtensa/include/uapi/asm/mman.h          |  1 +
> >  include/linux/huge_mm.h                      |  5 +++--
> >  include/uapi/asm-generic/mman-common.h       |  1 +
> >  mm/khugepaged.c                              | 19 ++++++++++++++++---
> >  mm/madvise.c                                 |  8 +++++++-
> >  tools/include/uapi/asm-generic/mman-common.h |  1 +
> >  9 files changed, 32 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> > index 763929e814e9..44aa1f57a982 100644
> > --- a/arch/alpha/include/uapi/asm/mman.h
> > +++ b/arch/alpha/include/uapi/asm/mman.h
> > @@ -77,6 +77,7 @@
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> >  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +#define MADV_TRY_COLLAPSE      26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> > index c6e1fc77c996..1ae16e5d7dfc 100644
> > --- a/arch/mips/include/uapi/asm/mman.h
> > +++ b/arch/mips/include/uapi/asm/mman.h
> > @@ -104,6 +104,7 @@
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> >  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +#define MADV_TRY_COLLAPSE      26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> > index 68c44f99bc93..f8d016ee1f98 100644
> > --- a/arch/parisc/include/uapi/asm/mman.h
> > +++ b/arch/parisc/include/uapi/asm/mman.h
> > @@ -71,6 +71,7 @@
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> >  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +#define MADV_TRY_COLLAPSE      26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> >  #define MADV_HWPOISON     100          /* poison a page for testing */
> >  #define MADV_SOFT_OFFLINE 101          /* soft offline page for testing */
> > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> > index 1ff0c858544f..c495d1b39c83 100644
> > --- a/arch/xtensa/include/uapi/asm/mman.h
> > +++ b/arch/xtensa/include/uapi/asm/mman.h
> > @@ -112,6 +112,7 @@
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> >  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +#define MADV_TRY_COLLAPSE      26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 5adb86af35fc..e1af75aa18fb 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -303,7 +303,7 @@ int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
> >                      int advice);
> >  int madvise_collapse(struct vm_area_struct *vma,
> >                      struct vm_area_struct **prev,
> > -                    unsigned long start, unsigned long end);
> > +                    unsigned long start, unsigned long end, bool is_try);
> >  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
> >                            unsigned long end, long adjust_next);
> >  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> > @@ -450,7 +450,8 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
> >
> >  static inline int madvise_collapse(struct vm_area_struct *vma,
> >                                    struct vm_area_struct **prev,
> > -                                  unsigned long start, unsigned long end)
> > +                                  unsigned long start, unsigned long end,
> > +                                  bool is_try)
> >  {
> >         return -EINVAL;
> >  }
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index 6ce1f1ceb432..a9e5273db5f6 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -78,6 +78,7 @@
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> >  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +#define MADV_TRY_COLLAPSE      26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 2b219acb528e..c22703155b6e 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -96,6 +96,7 @@ static struct kmem_cache *mm_slot_cache __ro_after_init;
> >
> >  struct collapse_control {
> >         bool is_khugepaged;
> > +       bool is_try;
> >
> >         /* Num pages scanned per node */
> >         u32 node_load[MAX_NUMNODES];
> > @@ -1058,10 +1059,14 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> >  static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
> >                               struct collapse_control *cc)
> >  {
> > -       gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> > -                    GFP_TRANSHUGE);
> >         int node = hpage_collapse_find_target_node(cc);
> >         struct folio *folio;
> > +       gfp_t gfp;
> > +
> > +       if (cc->is_khugepaged)
> > +               gfp = alloc_hugepage_khugepaged_gfpmask();
> > +       else
> > +               gfp = cc->is_try ? GFP_TRANSHUGE_LIGHT : GFP_TRANSHUGE;
> >
> >         if (!hpage_collapse_alloc_folio(&folio, gfp, node, &cc->alloc_nmask)) {
> >                 *hpage = NULL;
> > @@ -2697,7 +2702,7 @@ static int madvise_collapse_errno(enum scan_result r)
> >  }
> >
> >  int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > -                    unsigned long start, unsigned long end)
> > +                    unsigned long start, unsigned long end, bool is_try)
> >  {
> >         struct collapse_control *cc;
> >         struct mm_struct *mm = vma->vm_mm;
> > @@ -2718,6 +2723,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >         if (!cc)
> >                 return -ENOMEM;
> >         cc->is_khugepaged = false;
> > +       cc->is_try = is_try;
> >
> >         mmgrab(mm);
> >         lru_add_drain_all();
> > @@ -2773,6 +2779,13 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >                         result = collapse_pte_mapped_thp(mm, addr, true);
> >                         mmap_read_unlock(mm);
> >                         goto handle_result;
> > +               /* MADV_TRY_COLLAPSE: fail quickly */
> > +               case SCAN_ALLOC_HUGE_PAGE_FAIL:
> > +               case SCAN_CGROUP_CHARGE_FAIL:
> > +                       if (cc->is_try) {
> > +                               last_fail = result;
> > +                               goto out_maybelock;
> > +                       }
> >                 /* Whitelisted set of results where continuing OK */
> >                 case SCAN_PMD_NULL:
> >                 case SCAN_PTE_NON_PRESENT:
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 912155a94ed5..5a359bcd286c 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior)
> >         case MADV_POPULATE_READ:
> >         case MADV_POPULATE_WRITE:
> >         case MADV_COLLAPSE:
> > +       case MADV_TRY_COLLAPSE:
> >                 return 0;
> >         default:
> >                 /* be safe, default to 1. list exceptions explicitly */
> > @@ -1082,8 +1083,10 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> >                 if (error)
> >                         goto out;
> >                 break;
> > +       case MADV_TRY_COLLAPSE:
> > +               return madvise_collapse(vma, prev, start, end, true);
> >         case MADV_COLLAPSE:
> > -               return madvise_collapse(vma, prev, start, end);
> > +               return madvise_collapse(vma, prev, start, end, false);
> >         }
> >
> >         anon_name = anon_vma_name(vma);
> > @@ -1178,6 +1181,7 @@ madvise_behavior_valid(int behavior)
> >         case MADV_HUGEPAGE:
> >         case MADV_NOHUGEPAGE:
> >         case MADV_COLLAPSE:
> > +       case MADV_TRY_COLLAPSE:
> >  #endif
> >         case MADV_DONTDUMP:
> >         case MADV_DODUMP:
> > @@ -1368,6 +1372,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> >   *             transparent huge pages so the existing pages will not be
> >   *             coalesced into THP and new pages will not be allocated as THP.
> >   *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
> > + *  MADV_TRY_COLLAPSE - similar to COLLAPSE, but avoids direct reclaim
> > + *             and/or compaction.
> >   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
> >   *             from being included in its core dump.
> >   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
> > index 6ce1f1ceb432..a9e5273db5f6 100644
> > --- a/tools/include/uapi/asm-generic/mman-common.h
> > +++ b/tools/include/uapi/asm-generic/mman-common.h
> > @@ -78,6 +78,7 @@
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> >  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +#define MADV_TRY_COLLAPSE      26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> > --
> > 2.33.1
> >
