Message-ID: <CAK1f24=cG4S2MzdMuZOvSmzrj0QFSXTeuVpfuObhUNGxDMSKWg@mail.gmail.com>
Date: Sat, 20 Jan 2024 10:34:37 +0800
From: Lance Yang <ioworker0@...il.com>
To: Andi Kleen <ak@...ux.intel.com>
Cc: akpm@...ux-foundation.org, zokeefe@...gle.com, songmuchun@...edance.com, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v1 1/2] mm/madvise: introduce MADV_TRY_COLLAPSE for
 attempted synchronous hugepage collapse

Hey Andi,

Thanks for taking the time to review!

We are currently discussing this at
Link: https://lore.kernel.org/all/20240118120347.61817-1-ioworker0@gmail.com/

On Fri, Jan 19, 2024 at 9:41 PM Andi Kleen <ak@...ux.intel.com> wrote:
>
> Lance Yang <ioworker0@...il.com> writes:
>
> > This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].
> >
> > Introduce a new madvise mode, MADV_TRY_COLLAPSE, that allows users to
> > make a least-effort attempt at a synchronous collapse of memory at
> > their own expense.
> >
> > The only difference from MADV_COLLAPSE is that the new hugepage allocation
> > avoids direct reclaim and/or compaction, quickly failing on allocation errors.
> >
> > The benefits of this approach are:
> >
> > * CPU is charged to the process that wants to spend the cycles for the THP
> > * Avoid unpredictable timing of khugepaged collapse
> > * Prevent unpredictable stalls caused by direct reclaim and/or
> > compaction
>
> I haven't completely followed the discussion, but it seems your second
> and third point could be addressed by an asynchronous THP fault without
> any new APIs: allocate 2MB while failing quickly, then on failure get
> a 4K page and provide it to the process, while asking khugepaged to
> convert the page ASAP in the background, but only after
> it has managed to allocate a fresh 2MB page, to minimize the
> process-visible downtime.
>
> I suppose that would be much more predictable, although there would be a
> slight risk of overwhelming khugepaged. The latter could be
> addressed by using a scalable workqueue that allocates more threads
> when needed.

Thank you for your suggestion!

Unfortunately, AFAIK, the default THP defrag behavior on most Linux
distros is that MADV_HUGEPAGE blocks at fault time while the kernel
eagerly reclaims and compacts memory to satisfy the hugepage allocation.

In the era of cloud-native computing, it is challenging for users to
know the THP configuration on every node in a cluster, let alone
control it at a fine granularity. Simply disabling huge pages out of
concern over potential direct reclaim and/or compaction is regrettable,
since it denies users the chance to benefit from hugepage allocations
at all. Relying solely on MADV_HUGEPAGE, however, carries the risk of
unpredictable stalls, a trade-off that users must weigh carefully.

MADV_COLLAPSE, since its introduction into the kernel, is not governed
by the defrag mode, which opens the door to more fine-grained control
over the hugepage allocation strategy.

BR,
Lance
>
> -Andi
>
