linux-kernel - Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87plq4hpox.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Mon, 19 Aug 2024 16:39:10 +0800
From: "Huang, Ying" <ying.huang@...el.com>
To: Chris Li <chrisl@...nel.org>
Cc: Hugh Dickins <hughd@...gle.com>,  Andrew Morton
 <akpm@...ux-foundation.org>,  Kairui Song <kasong@...cent.com>,  Ryan
 Roberts <ryan.roberts@....com>,  Kalesh Singh <kaleshsingh@...gle.com>,
  linux-kernel@...r.kernel.org,  linux-mm@...ck.org,  Barry Song
 <baohua@...nel.org>
Subject: Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap
 cluster order

Chris Li <chrisl@...nel.org> writes:

> On Thu, Aug 8, 2024 at 1:38 AM Huang, Ying <ying.huang@...el.com> wrote:
>>
>> Chris Li <chrisl@...nel.org> writes:
>>
>> > On Wed, Aug 7, 2024 at 12:59 AM Huang, Ying <ying.huang@...el.com> wrote:
>> >>
>> >> Hi, Chris,
>> >>
>> >> Chris Li <chrisl@...nel.org> writes:
>> >>
>> >> > This is the short term solutions "swap cluster order" listed
>> >> > in my "Swap Abstraction" discussion slice 8 in the recent
>> >> > LSF/MM conference.
>> >> >
>> >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
>> >> > orders" is introduced, it only allocates the mTHP swap entries
>> >> > from the new empty cluster list.  It has a fragmentation issue
>> >> > reported by Barry.
>> >> >
>> >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
>> >> >
>> >> > The reason is that all the empty clusters have been exhausted while
>> >> > there are plenty of free swap entries in the cluster that are
>> >> > not 100% free.
>> >> >
>> >> > Remember the swap allocation order in the cluster.
>> >> > Keep track of the per order non full cluster list for later allocation.
>> >> >
>> >> > This series gives the swap SSD allocation a new separate code path
>> >> > from the HDD allocation. The new allocator use cluster list only
>> >> > and do not global scan swap_map[] without lock any more.
>> >>
>> >> This sounds good.  Can we use SSD allocation method for HDD too?
>> >> We may not need a swap entry allocator optimized for HDD.
>> >
>> > Yes, that is the plan as well. That way we can completely get rid of
>> > the old scan_swap_map_slots() code.
>>
>> Good!
>>
>> > However, considering the size of the series, let's focus on the
>> > cluster allocation path first, get it tested and reviewed.
>>
>> OK.
>>
>> > For HDD optimization, mostly just the new block allocations portion
>> > need some separate code path from the new cluster allocator to not do
>> > the per cpu allocation.  Allocating from the non free list doesn't
>> > need to change too
>>
>> I suggest not consider HDD optimization at all.  Just use SSD algorithm
>> to simplify.
>
> Adding a global next allocating CI rather than the per CPU next CI
> pointer is pretty trivial as well. It is just a different way to fetch
> the next cluster pointer.

For HDD optimization, I mean original no struct swap_cluster_info etc.

>>
>> >>
>> >> Hi, Hugh,
>> >>
>> >> What do you think about this?
>> >>
>> >> > This streamline the swap allocation for SSD. The code matches the
>> >> > execution flow much better.
>> >> >
>> >> > User impact: For users that allocate and free mix order mTHP swapping,
>> >> > It greatly improves the success rate of the mTHP swap allocation after the
>> >> > initial phase.
>> >> >
>> >> > It also performs faster when the swapfile is close to full, because the
>> >> > allocator can get the non full cluster from a list rather than scanning
>> >> > a lot of swap_map entries.
>> >>
>> >> Do you have some test results to prove this?  Or which test below can
>> >> prove this?
>> >
>> > The two zram tests are already proving this. The system time
>> > improvement is about 2% on my low CPU count machine.
>> > Kairui has a higher core count machine and the difference is higher
>> > there. The theory is that higher CPU count has higher contentions.
>>
>> I will interpret this as the performance is better in theory.  But
>> there's almost no measurable results so far.
>
> I am trying to understand why don't see the performance improvement in
> the zram setup in my cover letter as a measurable result?

IIUC, there's no benchmark score difference, just system time.  And the
number is low too.

For Kairui's test, does all performance improvement come from "swapfile
is close to full"?

>>
>> > The 2% system time number does not sound like much. But consider this
>> > two factors:
>> > 1) swap allocator only takes a small percentage of the overall workload.
>> > 2) The new allocator does more work.
>> > The old allocator has a time tick budget. It will abort and fail to
>> > find an entry when it runs out of time budget, even though there are
>> > still some free entries on the swapfile.
>>
>> What is the time tick budget you mentioned?
>
> I was under the impression that the previous swap entry allocation
> code will not scan 100% of the swapfile if there is only one entry
> left.
> Please let me know if my understanding is not correct.
>
>         /* time to take a break? */
>         if (unlikely(--latency_ration < 0)) {
>                 if (n_ret)
>                         goto done;
>                 spin_unlock(&si->lock);
>                 cond_resched();
>                 spin_lock(&si->lock);
>                 latency_ration = LATENCY_LIMIT;
>         }

IIUC, this is to reduce latency via cond_resched().  If n_ret != 0, we
have allocated some swap entries successfully, it's OK to return to
reduce allocation latency.

>
>>
>> > The new allocator can get to the last few free swap entries if it is
>> > available. If not then, the new swap allocator will work harder on
>> > swap cache reclaim.
>> >
>> > From the swap cache reclaim aspect, it is very hard to optimize the
>> > swap cache reclaim in the old allocation path because the scan
>> > position is randomized.
>> > The full list and frag list both design to help reduce the repeat
>> > reclaim attempt of the swap cache.
>>
>> [snip]
>>

--
Best Regards,
Huang, Ying