Message-ID: <22b24ce9-d143-4b5f-87da-bf68e4fa46d3@redhat.com>
Date: Wed, 17 Jan 2024 19:41:04 +0100
From: David Hildenbrand <david@...hat.com>
To: Zach O'Keefe <zokeefe@...gle.com>, Lance Yang <ioworker0@...il.com>
Cc: akpm@...ux-foundation.org, songmuchun@...edance.com,
linux-kernel@...r.kernel.org, Yang Shi <shy828301@...il.com>,
Peter Xu <peterx@...hat.com>, Michael Knyszek <mknyszek@...gle.com>,
Minchan Kim <minchan@...nel.org>, Michal Hocko <mhocko@...e.com>,
linux-mm@...ck.org
Subject: Re: [PATCH v1 1/2] mm/madvise: introduce MADV_TRY_COLLAPSE for
attempted synchronous hugepage collapse

On 17.01.24 18:10, Zach O'Keefe wrote:
> [+linux-mm & others]
>
> On Tue, Jan 16, 2024 at 9:02 PM Lance Yang <ioworker0@...il.com> wrote:
>>
>> This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].
>>
>> Introduce a new madvise mode, MADV_TRY_COLLAPSE, that allows users to
>> make a least-effort attempt at a synchronous collapse of memory at
>> their own expense.
>>
>> The only difference from MADV_COLLAPSE is that the new hugepage allocation
>> avoids direct reclaim and/or compaction, quickly failing on allocation errors.
>>
>> The benefits of this approach are:
>>
>> * CPU is charged to the process that wants to spend the cycles for the THP
>> * Avoid unpredictable timing of khugepaged collapse
>> * Prevent unpredictable stalls caused by direct reclaim and/or compaction
>>
>> Semantics
>>
>> This call is independent of the system-wide THP sysfs settings, but will
>> fail for memory marked VM_NOHUGEPAGE. If the ranges provided span
>> multiple VMAs, the semantics of the collapse over each VMA is independent
>> from the others. This implies a hugepage cannot cross a VMA boundary. If
>> collapse of a given hugepage-aligned/sized region fails, the operation may
>> continue to attempt collapsing the remainder of memory specified.
>>
>> The memory ranges provided must be page-aligned, but are not required to
>> be hugepage-aligned. If the memory ranges are not hugepage-aligned, the
>> start/end of the range will be clamped to the first/last hugepage-aligned
>> address covered by said range. The memory ranges must span at least one
>> hugepage-sized region.
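>>
>> As a rough sketch (assuming 2 MiB PMD-sized hugepages; the helper names
>> below are illustrative, not part of this patch), the clamping amounts to
>> rounding the start up and the end down to hugepage boundaries:
>>
>>   #include <stddef.h>
>>   #include <stdint.h>
>>
>>   /* Clamp [start, start + len) to PMD-sized hugepage boundaries,
>>    * assuming 2 MiB hugepages. */
>>   #define HPAGE_PMD_SIZE (2UL << 20)
>>
>>   static inline uintptr_t hpage_start(uintptr_t start)
>>   {
>>           /* first hugepage-aligned address covered by the range */
>>           return (start + HPAGE_PMD_SIZE - 1) & ~(HPAGE_PMD_SIZE - 1);
>>   }
>>
>>   static inline uintptr_t hpage_end(uintptr_t start, size_t len)
>>   {
>>           /* end of the range, rounded down to a hugepage boundary */
>>           return (start + len) & ~(HPAGE_PMD_SIZE - 1);
>>   }
>>
>>   /* The range qualifies only if hpage_end() - hpage_start() >= HPAGE_PMD_SIZE. */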
>>
>> All non-resident pages covered by the range will first be
>> swapped/faulted-in, before being internally copied onto a freshly
>> allocated hugepage. Unmapped pages will have their data directly
>> initialized to 0 in the new hugepage. However, for every eligible
>> hugepage aligned/sized region to-be collapsed, at least one page must
>> currently be backed by memory (a PMD covering the address range must
>> already exist).
>>
>> Allocation for the new hugepage will not enter direct reclaim and/or
>> compaction, instead failing quickly if no hugepage can be allocated
>> immediately. When the system has multiple NUMA nodes, the hugepage will be
>> allocated from the node providing the most native pages. The operation acts
>> on the current state of the specified process and makes no persistent
>> changes or guarantees on how pages will be mapped, constructed, or faulted
>> in the future.
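>>
>> As a rough kernel-side sketch (not necessarily how this patch implements
>> it), the difference from MADV_COLLAPSE boils down to the GFP flags used
>> for the hugepage allocation, e.g.:
>>
>>   #include <linux/gfp.h>
>>   #include <linux/huge_mm.h>
>>
>>   /* Sketch only: GFP_TRANSHUGE_LIGHT is GFP_TRANSHUGE minus the reclaim
>>    * flags, so the allocation fails fast instead of reclaiming or
>>    * compacting; 'node' is the NUMA node chosen as described above. */
>>   static struct folio *collapse_alloc_folio_light(int node)
>>   {
>>           gfp_t gfp = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
>>
>>           return __folio_alloc(gfp, HPAGE_PMD_ORDER, node, NULL);
>>   }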
>>
>> Return Value
>>
>> If all hugepage-sized/aligned regions covered by the provided range were
>> either successfully collapsed, or were already PMD-mapped THPs, this
>> operation will be deemed successful. On success, madvise(2) returns 0.
>> Else, -1 is returned and errno is set to indicate the error for the
>> most-recently attempted hugepage collapse. Note that many failures might
>> have occurred, since the operation may continue collapsing the remaining
>> regions even if a single hugepage-sized/aligned region fails.
>>
>> ENOMEM  Memory allocation failed or VMA not found
>> EBUSY   Memcg charging failed
>> EAGAIN  Required resource temporarily unavailable. Trying again
>>         might succeed.
>> EINVAL  Other error: no PMD found, subpage doesn't have the Present
>>         bit set, "Special" page not backed by struct page, VMA
>>         incorrectly sized, address not page-aligned, ...
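>>
>> A minimal userspace sketch of the proposed call, assuming <sys/mman.h>
>> from kernel headers carrying this series (i.e. that it defines
>> MADV_TRY_COLLAPSE); the retry policy is illustrative only:
>>
>>   #include <errno.h>
>>   #include <stdio.h>
>>   #include <string.h>
>>   #include <sys/mman.h>
>>
>>   /* Try to collapse [addr, addr + len) into hugepages without risking
>>    * reclaim/compaction stalls; treat transient failures as "retry later". */
>>   static int try_collapse(void *addr, size_t len)
>>   {
>>           if (madvise(addr, len, MADV_TRY_COLLAPSE) == 0)
>>                   return 0;   /* all covered regions are now (or already were) PMD-mapped */
>>
>>           switch (errno) {
>>           case EAGAIN:        /* resource temporarily unavailable */
>>           case ENOMEM:        /* allocation failed or VMA not found */
>>           case EBUSY:         /* memcg charging failed */
>>                   return 1;   /* the caller may retry later */
>>           default:
>>                   fprintf(stderr, "madvise(MADV_TRY_COLLAPSE): %s\n",
>>                           strerror(errno));
>>                   return -1;
>>           }
>>   }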
>>
>> Use Cases
>>
>> An immediate user of this new functionality is the Go runtime heap allocator
>> that manages memory in hugepage-sized chunks. In the past, whether it was a
>> newly allocated chunk through mmap() or a reused chunk released by
>> madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
>> huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
>> respectively. However, both approaches caused performance issues: in either
>> scenario the kernel could enter direct reclaim and/or compaction, leading to
>> unpredictable stalls[4]. With madvise(MADV_TRY_COLLAPSE), the allocator can
>> attempt to back memory with huge pages without risking such stalls.
>>
>> [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
>> [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
>> [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
>> [4] https://github.com/golang/go/issues/63334
>
> Thanks for the patch, Lance, and thanks for providing the links above,
> referring to issues Go has seen.
>
> I've reached out to the Go team to try and understand their use case,
> and how we could help. It's not immediately clear whether a
> lighter-weight MADV_COLLAPSE is the answer, but it could turn out to
> be.
>
> That said, with respect to the implementation, should a lighter-weight
> MADV_COLLAPSE turn out to be warranted, I'd personally like to see
> process_madvise(2) be the "v2" of madvise(2), where we can start
> leveraging the forward-facing flags argument for these different
> advice flavors. We'd need to safely revert v5.10 commit a68a0262abdaa
> ("mm/madvise: remove racy mm ownership check") so that
> process_madvise(2) can always operate on self. IIRC, this was ~ the
> plan we landed on during MADV_COLLAPSE dev discussions (i.e. pick a
> sane default, and implement options in flags down the line).

+1, using process_madvise() would likely be the right approach.
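
Just to illustrate the direction (purely a sketch: the MADV_F_LIGHT flag
below is made up, and today process_madvise(2) requires flags == 0), such
a call on the current process could look roughly like:

  #include <sys/syscall.h>
  #include <sys/uio.h>
  #include <unistd.h>

  #ifndef MADV_COLLAPSE
  #define MADV_COLLAPSE 25             /* from the uapi mman headers */
  #endif
  #define MADV_F_LIGHT 0x1             /* hypothetical "no reclaim" flag */

  /* Sketch only: process_madvise(2) on the calling process itself, with a
   * made-up flag selecting a lighter-weight collapse.  Raw syscalls are
   * used to avoid depending on libc wrappers. */
  static long collapse_self_light(void *addr, size_t len)
  {
          struct iovec iov = { .iov_base = addr, .iov_len = len };
          int pidfd = (int)syscall(SYS_pidfd_open, getpid(), 0);
          long ret;

          if (pidfd < 0)
                  return -1;

          ret = syscall(SYS_process_madvise, pidfd, &iov, 1UL,
                        MADV_COLLAPSE, MADV_F_LIGHT);
          close(pidfd);
          return ret;
  }
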
--
Cheers,
David / dhildenb