Message-ID: <CAK1f24=eVDNRN3k4Oz62VZ4M=1igVQ9E-0vxDe=5M8HWrAb-8Q@mail.gmail.com>
Date: Fri, 19 Jan 2024 22:08:44 +0800
From: Lance Yang <ioworker0@...il.com>
To: Michal Hocko <mhocko@...e.com>
Cc: akpm@...ux-foundation.org, zokeefe@...gle.com, david@...hat.com, 
	songmuchun@...edance.com, shy828301@...il.com, peterx@...hat.com, 
	mknyszek@...gle.com, minchan@...nel.org, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 1/1] mm/madvise: add MADV_F_COLLAPSE_LIGHT to process_madvise()

On Fri, Jan 19, 2024 at 8:51 PM Michal Hocko <mhocko@...e.com> wrote:
>
> On Fri 19-01-24 10:03:05, Lance Yang wrote:
> > Hey Michal,
> >
> > Thanks for taking the time to review!
> >
> > On Thu, Jan 18, 2024 at 9:40 PM Michal Hocko <mhocko@...e.com> wrote:
> > >
> > > On Thu 18-01-24 20:03:46, Lance Yang wrote:
> > > [...]
> > >
> > > before we discuss the semantic, let's focus on the usecase.
> > >
> > > > Use Cases
> > > >
> > > > An immediate user of this new functionality is the Go runtime heap allocator
> > > > that manages memory in hugepage-sized chunks. In the past, whether it was a
> > > > newly allocated chunk through mmap() or a reused chunk released by
> > > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> > > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> > > > respectively. However, both approaches resulted in performance issues; for
> > > > both scenarios, there could be entries into direct reclaim and/or compaction,
> > > > leading to unpredictable stalls[4]. Now, the allocator can confidently use
> > > > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.
> > >
> > > IIUC the primary reason is the cost of the huge page allocation which
> > > can be really high if the memory is heavily fragmented and it is called
> > > synchronously from the process directly, correct? Can that be worked
> >
> > Yes, that's correct.
> >
> > > around by process_madvise and performing the operation from a different
> > > context? Are there any other reasons to have a different mode?
> >
> > In latency-sensitive scenarios, some applications aim to enhance performance
> > by utilizing huge pages as much as possible. At the same time, in case of
> > allocation failure, they prefer a quick return without triggering direct memory
> > reclamation and compaction.
>
> Could you elaborate some more on why?

Previously, the Go runtime attempted to mark all new memory as MADV_HUGEPAGE
on Linux and managed its hugepage eligibility status. Unfortunately, the
default THP behavior on most Linux distros is that MADV_HUGEPAGE blocks while
the kernel eagerly reclaims and compacts memory to allocate a hugepage.
This direct reclaim and compaction is unbounded and may result in significant
application thread stalls. In really bad cases, the stalls can exceed hundreds
of milliseconds or even seconds.

The overall strategy of trying to keep hugepages for the heap unbroken is,
however, sound. So the Go runtime adopted MADV_COLLAPSE as an alternative.
See https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af

Later, a Google production service experienced a performance regression with
the Go runtime's use of MADV_COLLAPSE. For now, the Go runtime has rolled back
the usage of MADV_COLLAPSE.
See https://github.com/golang/go/issues/63334

If there were a more relaxed (opportunistic) MADV_COLLAPSE, it would avoid
direct reclaim and/or compaction and quickly fail on allocation errors. This
could be beneficial for similar use cases.
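
For context, here is a minimal userspace sketch of how an allocator might call
today's MADV_COLLAPSE on one hugepage-sized chunk and treat failure as
non-fatal. The helper names chunk_alloc() and chunk_try_collapse() and the
2 MiB CHUNK_SIZE are assumptions for illustration, not Go runtime code. Note
that this existing call can still enter direct reclaim and/or compaction; a
relaxed mode would only change how quickly the call gives up, not how the
caller handles the result.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* value from the Linux uapi headers (Linux 6.1+) */
#endif

/* Assumption: 2 MiB PMD-sized huge pages, as on x86-64. */
#define CHUNK_SIZE ((size_t)2 << 20)

/*
 * Reserve one CHUNK_SIZE-aligned anonymous chunk by over-allocating and
 * trimming the unaligned head and tail. Returns NULL on failure.
 */
static void *chunk_alloc(void)
{
	size_t len = 2 * CHUNK_SIZE;
	char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return NULL;

	uintptr_t aligned = ((uintptr_t)raw + CHUNK_SIZE - 1) &
			    ~((uintptr_t)CHUNK_SIZE - 1);
	char *chunk = (char *)aligned;
	char *chunk_end = chunk + CHUNK_SIZE;
	char *raw_end = raw + len;

	if (chunk > raw)
		munmap(raw, (size_t)(chunk - raw));
	if (raw_end > chunk_end)
		munmap(chunk_end, (size_t)(raw_end - chunk_end));
	return chunk;
}

/*
 * Ask the kernel to back the chunk with a huge page right now.
 * Returns 0 on success, or -1 with errno set by madvise(). Failure is
 * not fatal: the chunk simply stays backed by base pages.
 */
static int chunk_try_collapse(void *chunk)
{
	assert(((uintptr_t)chunk & (CHUNK_SIZE - 1)) == 0);
	return madvise(chunk, CHUNK_SIZE, MADV_COLLAPSE);
}
```

A caller would touch or reuse the chunk, call chunk_try_collapse(), and carry
on regardless of the return value (for example, EINVAL on kernels built
without THP support).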

BR,
Lance

>
> > > I mean I can think of a more relaxed (opportunistic) MADV_COLLAPSE -
> > > e.g. non blocking one to make sure that the caller doesn't really block
> > > on resource contention (be it locks or memory availability) because that
> > > matches our non-blocking interface in other areas but having a LIGHT
> > > operation sounds really vague and the exact semantic would be
> > > implementation specific and might change over time. Non-blocking has a
> > > clear semantic but it is not really clear whether that is what you
> > > really need/want.
> >
> > Could you provide me with some suggestions regarding the naming of a
> > more relaxed (opportunistic) MADV_COLLAPSE?
>
> Naming is not all that important at this stage (it could be
> MADV_COLLAPSE_NOBLOCK for example). The primary question is whether
> non-blocking in general is the desired behavior or the implementation
> should try but not too hard.
>
> --
> Michal Hocko
> SUSE Labs