linux-kernel - Re: [PATCH v2 1/1] mm/madvise: add MADV_F_COLLAPSE_LIGHT to process

Message-ID: <CAK1f24kQTUQsNf05cxJ_HsaLg9rTBWh8K64PCVbeudW8G7pStg@mail.gmail.com>
Date: Fri, 19 Jan 2024 10:37:14 +0800
From: Lance Yang <ioworker0@...il.com>
To: Yang Shi <shy828301@...il.com>
Cc: "Zach O'Keefe" <zokeefe@...gle.com>, Michal Hocko <mhocko@...e.com>, akpm@...ux-foundation.org, 
	david@...hat.com, songmuchun@...edance.com, peterx@...hat.com, 
	mknyszek@...gle.com, minchan@...nel.org, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, linux-api@...r.kernel.org
Subject: Re: [PATCH v2 1/1] mm/madvise: add MADV_F_COLLAPSE_LIGHT to process_madvise()

Hey Yang,

Thanks for taking the time to review!

On Fri, Jan 19, 2024 at 3:00 AM Yang Shi <shy828301@...il.com> wrote:
>
> On Thu, Jan 18, 2024 at 6:59 AM Zach O'Keefe <zokeefe@...gle.com> wrote:
> >
> > On Thu, Jan 18, 2024 at 5:43 AM Michal Hocko <mhocko@...e.com> wrote:
> > >
> > > Dang, forgot to cc linux-api...
> > >
> > > On Thu 18-01-24 14:40:19, Michal Hocko wrote:
> > > > On Thu 18-01-24 20:03:46, Lance Yang wrote:
> > > > [...]
> > > >
> > > > before we discuss the semantic, let's focus on the usecase.
> > > >
> > > > > Use Cases
> > > > >
> > > > > An immediate user of this new functionality is the Go runtime heap allocator
> > > > > that manages memory in hugepage-sized chunks. In the past, whether it was a
> > > > > newly allocated chunk through mmap() or a reused chunk released by
> > > > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> > > > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> > > > > respectively. However, both approaches resulted in performance issues; for
> > > > > both scenarios, there could be entries into direct reclaim and/or compaction,
> > > > > leading to unpredictable stalls[4]. Now, the allocator can confidently use
> > > > > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.
> >
> > Aside: The thought was a MADV_F_COLLAPSE_LIGHT _flag_; so it'd be
> > process_madvise(..., MADV_COLLAPSE, MADV_F_COLLAPSE_LIGHT)
> >
> > > > IIUC the primary reason is the cost of the huge page allocation which
> > > > can be really high if the memory is heavily fragmented and it is called
> > > > synchronously from the process directly, correct? Can that be worked
> > > > around by process_madvise and performing the operation from a different
> > > > context? Are there any other reasons to have a different mode?
> > > >
> > > > I mean I can think of a more relaxed (opportunistic) MADV_COLLAPSE -
> > > > e.g. non blocking one to make sure that the caller doesn't really block
> > > > on resource contention (be it locks or memory availability) because that
> > > > matches our non-blocking interface in other areas but having a LIGHT
> > > > operation sounds really vague and the exact semantic would be
> > > > implementation specific and might change over time. Non-blocking has a
> > > > clear semantic but it is not really clear whether that is what you
> > > > really need/want.
> >
> > IIUC, usecase from Go is unbounded latency due to sync compaction in a
> > context where the latency is unacceptable. Working w/ them to
> > understand how things can be improved -- it's possible the changes can
> > occur entirely on their side, w/o any additional kernel support.
> >
> > The non-blocking case awkwardly sits between MADV_COLLAPSE today, and
> > khugepaged; esp when common case is that the allocation can probably
> > be satisfied in fast path.
> >
> > The suggestion for something like "LIGHT" was intentionally vague
> > because it could allow for other optimizations / changes down the
> > line, as you point out. I think that might be a win, vs tying to a
> > specific optimization (e.g. like a MADV_F_COLLAPSE_NODEFRAG). But I
> > could be alone on that front, given the design of
> > /sys/kernel/mm/transparent_hugepage.
>
> Per the description Go marks the address spaces with MADV_HUGEPAGE. It
> means the application really wants to have huge page back the address
> space so kernel will try as hard as possible to get huge page. This is
> the default behavior of MADV_HUGEPAGE. If they don't want to enter
> direct reclaim, they can configure the defrag mode to "defer", which
> means no direct reclaim and wakeup kswapd and kcompactd, and rely on
> khugepaged to install huge page later on. But this mode is not
> supported by khugepaged defrag, so MADV_COLLAPSE may not support it
> (IIRC MADV_COLLAPSE uses khugepaged defrag mode). Or they can just not
> call MADV_HUGEPAGE and leave the decision to the users, IIRC Java does
> so (specifying a flag to indicate use huge page or not by the users).

Thank you for providing insights into the Go use cases with MADV_HUGEPAGE
and the configuration options for defrag mode.

Considering the limitations with the "defer" mode, it becomes apparent that
there is a gap in addressing scenarios where an application desires a
lighter-weight alternative to MADV_HUGEPAGE.

MADV_F_COLLAPSE_LIGHT aims to fill this gap by providing a more flexible and
opportunistic approach, catering to applications in latency-sensitive
environments that seek performance improvements with huge pages but prefer to
avoid direct reclaim and compaction. This option can serve as a valuable
addition for users who want more control over the behavior without the
constraints of existing configurations.
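
For illustration only, here is a minimal Python sketch of what a best-effort
caller can do today with the existing MADV_COLLAPSE advice: attempt the
collapse and carry on with base pages if it fails. It does not use
MADV_F_COLLAPSE_LIGHT, which is only a proposal in this thread, so the call
may still enter direct reclaim and/or compaction. The helper name
try_collapse() and the 2 MiB huge page size are assumptions made for the
example.

```python
import mmap

# MADV_COLLAPSE is 25 in the Linux uapi headers (Linux 6.1+). Python's mmap
# module may not export the name, so fall back to the raw value.
MADV_COLLAPSE = getattr(mmap, "MADV_COLLAPSE", 25)
HPAGE_SIZE = 2 * 1024 * 1024  # assumption: 2 MiB PMD-sized huge pages


def try_collapse(region, offset=0, length=HPAGE_SIZE):
    """Ask the kernel to collapse part of `region` into huge pages.

    Hypothetical helper: returns True if madvise() succeeded and False
    otherwise, so the caller can keep using base pages on failure.
    """
    try:
        region.madvise(MADV_COLLAPSE, offset, length)
        return True
    except (OSError, ValueError, AttributeError):
        # Older kernel, THP disabled, no huge page available, bad range,
        # or a platform without madvise(): treat all of these as "no".
        return False


if __name__ == "__main__":
    # Private anonymous mapping, touched so there is something to collapse.
    region = mmap.mmap(-1, 2 * HPAGE_SIZE,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    region[:HPAGE_SIZE] = b"\0" * HPAGE_SIZE
    print("collapsed" if try_collapse(region) else "left as base pages")
```

Treating failure as a normal outcome keeps the caller simple, but it bounds
nothing about latency, which is the gap discussed above.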

In the era of cloud-native computing, it's challenging for users to be aware
of the THP configurations on all nodes in a cluster, let alone have
fine-grained control over them. Simply disabling the use of huge pages due to
concerns about potential direct reclaim and compaction may be regrettable, as
users are deprived of the opportunity to experiment with huge page
allocations. However, relying solely on MADV_HUGEPAGE introduces the risk of
unpredictable stalls, making it a trade-off that users must carefully
consider.
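
As a rough sketch of what checking a node's THP configuration involves, the
Python snippet below reads the files under
/sys/kernel/mm/transparent_hugepage and returns the active choice, which the
kernel marks with square brackets (for example
"always defer defer+madvise [madvise] never"). The helper names
selected_mode() and read_thp_mode() are assumptions made for the example, not
an existing tool.

```python
from pathlib import Path

THP_DIR = Path("/sys/kernel/mm/transparent_hugepage")


def selected_mode(text):
    """Return the bracketed (active) choice from a THP sysfs line, or None."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_thp_mode(name, base=THP_DIR):
    """Read one THP setting such as "enabled" or "defrag".

    Returns None when the file is missing or unreadable, e.g. on a kernel
    built without THP support.
    """
    try:
        return selected_mode((Path(base) / name).read_text())
    except OSError:
        return None


if __name__ == "__main__":
    for name in ("enabled", "defrag"):
        print(name, "=", read_thp_mode(name))
```

A runtime or agent would have to run a check like this on every node, and it
would still only observe the setting rather than control it.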

By introducing MADV_F_COLLAPSE_LIGHT, we offer users a more flexible and
controllable solution in cloud-native environments, allowing them to better
balance performance requirements and resource management. This opt-in,
lightweight alternative gives users more choices to meet the diverse needs of
different scenarios.

Thanks again for your review and your suggestion!
Lance

>
> >
> > But circling back, I agree w/ you that the first order of business is to
> > iron out a real usecase. As of right now, it's not clear something
> > like this is required or helpful.
> >
> > Thanks,
> > Zach
> >
> >
> >
> >
> > > > > [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
> > > > > [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
> > > > > [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
> > > > > [4] https://github.com/golang/go/issues/63334
> > > > >
> > > > > [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/
> > > > --
> > > > Michal Hocko
> > > > SUSE Labs
> > >
> > > --
> > > Michal Hocko
> > > SUSE Labs