linux-kernel - Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

Message-ID: <87h6dp479n.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Wed, 19 Jun 2024 17:21:56 +0800
From: "Huang, Ying" <ying.huang@...el.com>
To: Chris Li <chrisl@...nel.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>,  Kairui Song
 <kasong@...cent.com>,  Ryan Roberts <ryan.roberts@....com>,
  linux-kernel@...r.kernel.org,  linux-mm@...ck.org,  Barry Song
 <baohua@...nel.org>
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster
 order

Chris Li <chrisl@...nel.org> writes:

> On Mon, Jun 17, 2024 at 11:56 PM Huang, Ying <ying.huang@...el.com> wrote:
>>
>> Chris Li <chrisl@...nel.org> writes:
>>
>> > That is in general true with all kernel development regardless of
>> > using options or not. If there is a bug in my patch, I will need to
>> > debug and fix it or the patch might be reverted.
>> >
>> > I don't see that as a reason to take the option path or not. The
>> > option just means the user taking this option will need to understand
>> > the trade off and accept the defined behavior of that option.
>>
>> User configuration knobs are not forbidden in the Linux kernel.  But we are
>> more careful about them because they introduce ABI which we need to
>> maintain forever.  And they are hard for users to use.  Optimizing
>> automatically is generally the better solution.  So, I suggest you
>> think more about the automatic solution before diving into a new
>> option.
>
> I did, see my reply. Right now there are just no other options.
>
>>
>> >>
>> >> >> So, I prefer the transparent methods.  Just like THP vs. hugetlbfs.
>> >> >
>> >> > Me too. I prefer transparent over reservation if it can achieve the
>> >> > same goal. Do we have a fully transparent method specced out? How do we
>> >> > achieve full transparency and also avoid fragmentation caused by
>> >> > mixed-order allocation/free?
>> >> >
>> >> > Keep in mind that we are still in the early stage of the mTHP swap
>> >> > development, I can have the reservation patch relatively easily. If
>> >> > you come up with a better transparent method patch which can achieve
>> >> > the same goal later, we can use it instead.
>> >>
>> >> Because we are still in the early stage, I think that we should try to
>> >> improve the transparent solution first.  Personally, what I don't like is
>> >> that we don't work on the transparent solution because we have the
>> >> reservation solution.
>> >
>> > Do you have a road map or the design for the transparent solution you can share?
>> > I am interested to know what short-term steps (e.g. within a month) you
>> > have in mind for this transparent solution, so we can compare the
>> > different approaches. I can't reason much just by the name
>> > "transparent solution" itself. Need more technical details.
>> >
>> > Right now we have a clear usage case we want to support, the swap
>> > in/out mTHP with bigger zsmalloc buffers. We can start with the
>> > limited usage case first then move to more general ones.
>>
>> TBH, this is what I don't like.  It appears that you refuse to think
>> about the transparent (or automatic) solution.
>
> Actually, that is not true; you made a wrong assumption about what I
> have considered. I want to find out what you have in mind to compare
> the near term solutions.

Sorry about my wrong assumption.

> In my recent LSF slides I already listed 3 options to address this
> fragmentation problem.
> From easy to hard:
> 1) Assign cluster an order on allocation and remember the cluster
> order. (short term).
> That is this patch series
> 2) Buddy allocation on the swap entry (longer term)
> 3) Folio write out compound discontinuous swap entry. (ultimate)
>
> I also considered 4), which I did not put into the slide, because it
> is less effective than 3)
> 4) migrating the swap entries, which requires scanning page table entries.
> I briefly mentioned it during the session.

Or you need something like an rmap, which isn't easy.

> 3) might qualify as your transparent solution. It is just much
> harder to implement.
> Even when we have 3), having some form of 1) can be beneficial as
> well. (less IO count, no indirect layer of swap offset).
>
>>
>> I haven't thought about them thoroughly, but at least we may think about
>>
>> - promoting a low-order non-full cluster when we find free high-order
>>   swap entries in it.
>>
>> - stealing a low order non-full cluster with low usage count for
>>   high-order allocation.
>
> Now we are talking.
> The two above fall well within 2), the buddy allocator.
> But the buddy allocator will not be able to address all fragmentation
> issues, because the allocator does not control the life cycle of
> the swap entry.
> It will not help Barry's zsmalloc use case much because Android
> likes to keep the swapfile full. I can already see that.

I think that a buddy-like allocator (not exactly the buddy algorithm) will
help with fragmentation.  And it will help more users because it works
automatically.

I don't think they are too hard to implement.  We can try to find
some simple solution first.  So, I think that we don't need to push
them to the long term.  At least, they can be done before introducing
a high-order cluster reservation ABI.  Then, we can evaluate the benefit
and overhead of the reservation ABI.
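
To make the comparison concrete, here is a rough user-space sketch of how
per-cluster order tracking (option 1 in the quoted list) could be combined
with the promote/steal fallback suggested above.  It is only a toy model:
every name (toy_cluster, toy_swap_alloc, ...), the 8-slot cluster size and
the "lowest usage wins" policy are assumptions for illustration, not the
code in this patch series or in mm/swapfile.c.

```c
#include <assert.h>
#include <string.h>

#define TOY_CLUSTER_SLOTS 8     /* assumed tiny cluster, for illustration only */
#define TOY_NR_CLUSTERS   3     /* assumed tiny swap device */
#define TOY_ORDER_NONE    (-1)  /* cluster is completely free */

struct toy_cluster {
    int order;                            /* order this cluster currently serves */
    int used;                             /* number of slots in use */
    unsigned char map[TOY_CLUSTER_SLOTS]; /* 1 = slot in use */
};

struct toy_swap {
    struct toy_cluster c[TOY_NR_CLUSTERS];
};

static void toy_swap_init(struct toy_swap *s)
{
    memset(s, 0, sizeof(*s));
    for (int i = 0; i < TOY_NR_CLUSTERS; i++)
        s->c[i].order = TOY_ORDER_NONE;
}

/* Find a naturally aligned free run of (1 << order) slots, or return -1. */
static int toy_find_run(const struct toy_cluster *c, int order)
{
    int n = 1 << order;

    for (int off = 0; off + n <= TOY_CLUSTER_SLOTS; off += n) {
        int busy = 0;

        for (int i = 0; i < n; i++)
            busy |= c->map[off + i];
        if (!busy)
            return off;
    }
    return -1;
}

/* Mark the run as used and (re)assign the cluster to this order. */
static long toy_take(struct toy_swap *s, int ci, int off, int order)
{
    struct toy_cluster *c = &s->c[ci];

    memset(&c->map[off], 1, (size_t)1 << order);
    c->used += 1 << order;
    c->order = order;   /* on the steal path this is the "promotion" */
    return (long)ci * TOY_CLUSTER_SLOTS + off;
}

/* Returns the first slot index of the allocation, or -1 on failure. */
static long toy_swap_alloc(struct toy_swap *s, int order)
{
    int off, best = -1, best_off = -1;

    /* Pass 0: a cluster already serving this order.  Pass 1: a free cluster. */
    for (int pass = 0; pass < 2; pass++) {
        int want = pass == 0 ? order : TOY_ORDER_NONE;

        for (int i = 0; i < TOY_NR_CLUSTERS; i++)
            if (s->c[i].order == want &&
                (off = toy_find_run(&s->c[i], order)) >= 0)
                return toy_take(s, i, off, order);
    }

    /* Fallback: steal the least-used lower-order cluster with a free run. */
    for (int i = 0; i < TOY_NR_CLUSTERS; i++) {
        if (s->c[i].order >= order)
            continue;
        off = toy_find_run(&s->c[i], order);
        if (off >= 0 && (best < 0 || s->c[i].used < s->c[best].used)) {
            best = i;
            best_off = off;
        }
    }
    return best < 0 ? -1 : toy_take(s, best, best_off, order);
}

static void toy_swap_free(struct toy_swap *s, long slot, int order)
{
    struct toy_cluster *c = &s->c[slot / TOY_CLUSTER_SLOTS];

    memset(&c->map[slot % TOY_CLUSTER_SLOTS], 0, (size_t)1 << order);
    c->used -= 1 << order;
    if (c->used == 0)
        c->order = TOY_ORDER_NONE;  /* whole cluster is free again */
}
```

Note that in this toy a promoted cluster keeps its old low-order entries,
and lower-order allocations never fall back into a higher-order cluster;
a real allocator would need a policy for both cases.
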

>> - freeing more swap entries when swap devices become fragmented.
>
> That requires scanning the page table to free the swap entry, basically 4).

No.  You can just scan the page table of the current process in
do_swap_page() and try to swap in and free more swap entries.  That
doesn't work well for shared pages.  However, I think that it can
help quite a few workloads.
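
A very rough user-space sketch of that idea follows.  It is not kernel
code: the toy_* names, the flat PTE array, the fixed 4-entry scan window
and the "fragmented" flag are all assumptions made up for illustration.
Only the faulting process's own table is scanned, so slots referenced by
more than one mapping are skipped, which is the shared-page limitation
mentioned above.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_PTES  16    /* assumed size of the toy page table */
#define TOY_NR_SLOTS 16    /* assumed size of the toy swap device */
#define TOY_WINDOW   4     /* assumed number of neighbouring PTEs to scan */
#define TOY_PRESENT  (-1)  /* PTE maps a page in memory */

struct toy_mm {
    int pte[TOY_NR_PTES];        /* TOY_PRESENT or a swap slot index */
};

struct toy_swapdev {
    int refcount[TOY_NR_SLOTS];  /* mappings referencing each slot, 0 = free */
};

/* "Swap in" one PTE; returns 1 if that also freed the swap slot. */
static int toy_swap_in_one(struct toy_mm *mm, struct toy_swapdev *dev,
                           size_t idx)
{
    int slot = mm->pte[idx];

    if (slot == TOY_PRESENT)
        return 0;
    mm->pte[idx] = TOY_PRESENT;
    dev->refcount[slot]--;
    return dev->refcount[slot] == 0;
}

/*
 * Toy fault handler: always swaps in the faulting PTE.  If the device is
 * considered fragmented, it also scans the aligned window around the fault
 * and swaps in neighbours whose slot is not shared, so those slots can be
 * freed.  Returns the number of swap slots freed.
 */
static int toy_do_swap_page(struct toy_mm *mm, struct toy_swapdev *dev,
                            size_t fault_idx, int fragmented)
{
    int freed = toy_swap_in_one(mm, dev, fault_idx);
    size_t start = fault_idx - fault_idx % TOY_WINDOW;

    if (!fragmented)
        return freed;

    for (size_t i = start; i < start + TOY_WINDOW && i < TOY_NR_PTES; i++) {
        int slot = mm->pte[i];

        if (slot == TOY_PRESENT)
            continue;
        if (dev->refcount[slot] != 1)
            continue;   /* shared: other mappers are not visible from here */
        freed += toy_swap_in_one(mm, dev, i);
    }
    return freed;
}
```
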

> It is all about investment and return. 1) is relatively easy to
> implement and with good improvement and return.

[snip]

--
Best Regards,
Huang, Ying