linux-kernel - Re: [PATCH] selftests/mm: Introduce a test program to assess swap entry allocation for thp

Message-ID: <CAGsJ_4yH4xOBtch1XEq4hPz9inWC+c60Wa84XxU-BXFg_1ga-Q@mail.gmail.com>
Date: Mon, 24 Jun 2024 19:55:04 +1200
From: Barry Song <21cnbao@...il.com>
To: "Huang, Ying" <ying.huang@...el.com>
Cc: Ryan Roberts <ryan.roberts@....com>, David Hildenbrand <david@...hat.com>, akpm@...ux-foundation.org, 
	shuah@...nel.org, linux-mm@...ck.org, chrisl@...nel.org, hughd@...gle.com, 
	kaleshsingh@...gle.com, kasong@...cent.com, linux-kernel@...r.kernel.org, 
	linux-kselftest@...r.kernel.org, Barry Song <v-songbaohua@...o.com>
Subject: Re: [PATCH] selftests/mm: Introduce a test program to assess swap
 entry allocation for thp_swapout

On Mon, Jun 24, 2024 at 7:01 PM Huang, Ying <ying.huang@...el.com> wrote:
>
> Barry Song <21cnbao@...il.com> writes:
>
> > On Mon, Jun 24, 2024 at 3:44 PM Huang, Ying <ying.huang@...el.com> wrote:
> >>
> >> Barry Song <21cnbao@...il.com> writes:
> >>
> >> > On Fri, Jun 21, 2024 at 9:24 PM Huang, Ying <ying.huang@...el.com> wrote:
> >> >>
> >> >> Barry Song <21cnbao@...il.com> writes:
> >> >>
> >> >> > On Fri, Jun 21, 2024 at 7:25 PM Ryan Roberts <ryan.roberts@....com> wrote:
> >> >> >>
> >> >> >> On 20/06/2024 12:34, David Hildenbrand wrote:
> >> >> >> > On 20.06.24 11:04, Ryan Roberts wrote:
> >> >> >> >> On 20/06/2024 01:26, Barry Song wrote:
> >> >> >> >>> From: Barry Song <v-songbaohua@...o.com>
> >> >> >> >>>
> >> >> >> >>> Both Ryan and Chris have been utilizing the small test program to aid
> >> >> >> >>> in debugging and identifying issues with swap entry allocation. While
> >> >> >> >>> a real or intricate workload might be more suitable for assessing the
> >> >> >> >>> correctness and effectiveness of the swap allocation policy, a small
> >> >> >> >>> test program presents a simpler means of understanding the problem and
> >> >> >> >>> initially verifying the improvements being made.
> >> >> >> >>>
> >> >> >> >>> Let's endeavor to integrate it into the self-test suite. Although it
> >> >> >> >>> presently only accommodates 64KB and 4KB, I'm optimistic that we can
> >> >> >> >>> expand its capabilities to support multiple sizes and simulate more
> >> >> >> >>> complex systems in the future as required.
> >> >> >> >>
> >> >> >> >> I'll try to summarize the thread with Huang Ying by suggesting this test program
> >> >> >> >> is "necessary but not sufficient" to exhaustively test the mTHP swap-out path.
> >> >> >> >> I've certainly found it useful and think it would be a valuable addition to the
> >> >> >> >> tree.
> >> >> >> >>
> >> >> >> >> That said, I'm not convinced it is a selftest; IMO a selftest should provide a
> >> >> >> >> clear pass/fail result against some criteria and must be able to be run
> >> >> >> >> automatically by (e.g.) a CI system.
> >> >> >> >
> >> >> >> > Likely we should then consider moving other such performance-related thingies
> >> >> >> > out of the selftests?
> >> >> >>
> >> >> >> Yes, that would get my vote. But of the 4 tests you mentioned that use
> >> >> >> clock_gettime(), it looks like transhuge-stress is the only one that doesn't
> >> >> >> have a pass/fail result, so is probably the only candidate for moving.
> >> >> >>
> >> >> >> The others either use the times as a timeout and determine failure if the
> >> >> >> action didn't occur within the timeout (e.g. ksm_tests.c) or use it to add some
> >> >> >> supplemental performance information to an otherwise functionality-oriented test.
> >> >> >
> >> >> > Thank you very much, Ryan. I think you've found a better home for this
> >> >> > tool. I will send v2, relocating it to tools/mm and adding a function
> >> >> > to swap in either whole mTHPs or a portion of mTHPs, selected by
> >> >> > "-a" (aligned swapin).
> >> >> >
> >> >> > So basically, we will have
> >> >> >
> >> >> > 1. Use MADV_PAGEOUT for rapid swap-out, putting the swap allocation code under
> >> >> > heavy load in a short time.
> >> >> >
> >> >> > 2. Use MADV_DONTNEED to simulate the behavior of libc and Java heap in freeing
> >> >> > memory, as well as for munmap, app exits, or OOM killer scenarios. This ensures
> >> >> > new mTHP is always generated, released or swapped out, similar to the behavior
> >> >> > on a PC or Android phone where many applications are frequently started and
> >> >> > terminated.
> >> >>
> >> >> MADV_DONTNEED on 64KB of memory followed by memset() simulates exactly
> >> >> the large folio swap-in, which hasn't been merged upstream.  I
> >> >> don't think that this kind of trick is a good idea.
> >> >
> >> > I disagree. This is how userspace heaps can manage memory
> >> > deallocation.
> >>
> >> Sorry, I don't understand how.  Can you show some examples?  Such as
> >> strace log with 64KB aligned MADV_DONTNEED?
> >
> > In the Java heap and in memory allocators such as jemalloc and Scudo, memory
> > is freed using MADV_DONTNEED when either free() is called or garbage
> > collection occurs. In Android, the Java heap is freed in chunks aligned to
> > 64KB or larger.
>
> Originally, I had heard that MADV_FREE is used by jemalloc.  Now, I
> know that they use MADV_DONTNEED too.  Thanks!
>
> Although I still doubt that the libc/Java allocators will free pages in
> exactly 64KB sizes (IIUC, they should free pages in much larger chunks), I
> agree that MADV_DONTNEED is a way to create fragmentation in swap
> devices.

Right.

They don't always free memory in exact 64KB sizes or in the mTHP size, but we
need to define a minimum granularity. Typically, when many objects are
freed, they combine into a larger free block, which is then released to
the kernel all at once.

As an example, libc might map lots of 4MB VMAs and classify them into
different size categories, some for small objects and others for larger ones.
While attempts are made to consolidate adjacent free blocks to reduce
system calls, MADV_DONTNEED is often issued at the minimum granularity
for small objects when merging is temporarily impractical: we don't always
find two or more adjacent memory blocks in which all the objects have been
released :-)


>
> > In Scudo and jemalloc, there is a configuration option to set the
> > management granularity. This granularity is set to match the mTHP size
> > (though the default value is 16KB in the latest Android if we don't run
> > mTHP). Otherwise, you could end up with millions of partial unmap
> > operations, which would severely degrade the performance of mTHP.
> >
> > Imagine libc/Java functioning like a slab allocator. When kfree() is
> > called, some pages may become completely unoccupied and can be returned
> > to the buddy allocator. In userspace, memory is given back to the kernel
> > in a similar manner, typically using MADV_DONTNEED. Therefore,
> > MADV_DONTNEED is the most common memory reclamation behavior in Android,
> > coming with free(), delete or GC.
> >
> > Imagine a system with extensive malloc, free, new, and delete operations,
> > where objects are constantly being created and destroyed.
> >
> > On the other hand, whether libc/Java use MADV_DONTNEED to free memory is
> > not crucial, although they do. We need a method to simulate the lifecycle
> > of applications (exiting and starting anew) on PCs or Android phones. It
> > doesn't matter if you use MADV_DONTNEED or munmap to achieve this.
> >
> > It is important to note that mTHP currently operates on a one-shot basis
> > (after swap-out, you never get them back as mTHP, as we don't support
> > large folio swap-in). For the test program, we need a method to generate
> > new mTHPs continuously. Without this, after the initial iterations, we
> > would be left with only small folios, rendering the entire test program
> > *pointless*.
>
> I understand the requirements for new mTHPs.
>
> >>
> >> > Additionally, in the event of an application exit, munmap, or OOM killer, the
> >> > amount of freed memory can be much larger than 64KB. The primary purpose
> >> > of using MADV_DONTNEED is to release anonymous memory and generate
> >> > new mTHP so that the iteration can continue. Otherwise, the test program
> >> > becomes entirely pointless, as we only have large folios at the beginning.
> >> > That is exactly why Chris has failed to find his bugs by using other small
> >> > programs.
> >>
> >> I still don't understand how 64KB-aligned MADV_DONTNEED is used for the
> >> libc/Java heap or munmap in a practical way.  After more thought,
> >> though, I think 64KB-aligned MADV_DONTNEED can simulate, to some degree,
> >> the fragmentation effect of process exit if the 64KB folios in these
> >> processes are swapped out without splitting.  If you have no other
> >> practical use cases, I suggest making this explicit with comments in
> >> the program.
> >>
>
> [snip]
>
> --
> Best Regards,
> Huang, Ying