linux-kernel - Re: [PATCH 0/2] mm/zswap: optimize the scalability of zswap rb-tree

Message-ID: <d7e0c051-e1c4-4658-af06-cbbb2e5e3bfb@bytedance.com>
Date: Fri, 19 Jan 2024 14:40:20 +0800
From: Chengming Zhou <zhouchengming@...edance.com>
To: Yosry Ahmed <yosryahmed@...gle.com>, Johannes Weiner <hannes@...xchg.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org,
 linux-mm@...ck.org, Chris Li <chriscli@...gle.com>,
 Nhat Pham <nphamcs@...il.com>
Subject: Re: [PATCH 0/2] mm/zswap: optimize the scalability of zswap rb-tree

On 2024/1/19 02:37, Yosry Ahmed wrote:
> On Thu, Jan 18, 2024 at 10:07 AM Johannes Weiner <hannes@...xchg.org> wrote:
>>
>> On Thu, Jan 18, 2024 at 09:30:12AM -0800, Yosry Ahmed wrote:
>>> On Thu, Jan 18, 2024 at 7:34 AM Johannes Weiner <hannes@...xchg.org> wrote:
>>>>
>>>> On Wed, Jan 17, 2024 at 10:37:22AM -0800, Yosry Ahmed wrote:
>>>>> On Wed, Jan 17, 2024 at 1:23 AM Chengming Zhou
>>>>> <zhouchengming@...edance.com> wrote:
>>>>>>
>>>>>> When testing zswap performance with a kernel build -j32 in a tmpfs
>>>>>> directory, I found that the zswap rb-tree does not scale well, because
>>>>>> it is protected by a single spinlock. That causes heavy lock contention
>>>>>> when multiple tasks call zswap_store/load concurrently.
>>>>>>
>>>>>> So a simple solution is to split the single zswap rb-tree into multiple
>>>>>> rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This idea is
>>>>>> from the commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
>>>>>>
>>>>>> Although this method can't eliminate the spinlock contention completely, it
>>>>>> can mitigate much of it. Below are the results of a kernel build
>>>>>> in tmpfs with the zswap shrinker enabled:
>>>>>>
>>>>>>      linux-next  zswap-lock-optimize
>>>>>> real 1m9.181s    1m3.820s
>>>>>> user 17m44.036s  17m40.100s
>>>>>> sys  7m37.297s   4m54.622s
>>>>>>
>>>>>> So there are clear improvements. This is complementary to the ongoing
>>>>>> zswap xarray conversion by Chris, so I think we can merge this first.
>>>>>> I have just refreshed and resent this for further discussion.
>>>>>
>>>>> The reason why I think we should wait for the xarray patch(es) is
>>>>> there is a chance we may see less improvement from splitting the tree
>>>>> if it were an xarray. If we merge this series first, there is no way to
>>>>> know.
>>>>
>>>> I mentioned this before, but I disagree quite strongly with this
>>>> general sentiment.
>>>>
>>>> Chengming's patches are simple, mature, and have convincing
>>>> numbers. IMO it's poor form to hold something like that for "let's see
>>>> how our other experiment works out". The only exception would be if we
>>>> all agree that the earlier change flies in the face of the overall
>>>> direction we want to pursue, which I don't think is the case here.
>>>
>>> My intention was not to delay merging these patches until the xarray
>>> patches are merged in. It was only to wait until the xarray patches
>>> are *posted*, so that we can redo the testing on top of them and
>>> verify that the gains are still there. That should have been around
>>> now, but the xarray patches were posted in a form that does not allow
>>> this testing (because we still have a lock on the read path), so I am
>>> less inclined.
>>>
>>> My rationale was that if the gains from splitting the tree become
>>> minimal after we switch to an xarray, we won't know. It's more
>>> difficult to remove optimizations than to add them, because we may
>>> cause a regression. I am kind of paranoid about having code sitting
>>> around that we don't have full information about how much it's needed.
>>
>> Yeah I understand that fear.
>>
>> I expect the splitting to help more than the move to xarray because
>> it's the writes that are hot. Luckily in this case it should be fairly
>> easy to differential-test after it's been merged by changing that tree
>> lookup macro/function locally to always return &trees[type][0], right?
> 
> Yeah that's exactly what I had in mind. Once we have a version of the
> xarray patch without the locking on the read side we can test with
> that. Chengming, does this sound reasonable to you?

That's OK, sounds reasonable to me. I agree with Johannes: we will need
both, since the xarray still takes a spinlock on the write side, so it's
clearly better to split the tree. As for testing, we can make the tree
lookup always return &trees[type][0].
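
For reference, the lookup and the differential test discussed above can be sketched in user space roughly as follows. This is only an illustration, not the code from the patches: struct sketch_tree, MAX_TYPES, MAX_TREES, nr_trees_for(), tree_for() and the force_single_tree switch are all hypothetical names. SWAP_ADDRESS_SPACE_SHIFT is 14 as in the swap cache code, which gives 64M per tree assuming 4K pages.

```c
#include <assert.h>
#include <stddef.h>

/* Same granularity as the swap cache: 2^14 pages per tree (64M with 4K pages). */
#define SWAP_ADDRESS_SPACE_SHIFT 14
#define SWAP_ADDRESS_SPACE_PAGES (1UL << SWAP_ADDRESS_SPACE_SHIFT)

/* Hypothetical limits, chosen only to keep this sketch self-contained. */
#define MAX_TYPES 2
#define MAX_TREES 64

/* Stand-in for one rb-tree root plus the spinlock that protects it. */
struct sketch_tree {
    int lock;
    void *root;
};

static struct sketch_tree trees[MAX_TYPES][MAX_TREES];

/* Hypothetical switch: set to 1 to funnel every lookup into tree 0. */
static int force_single_tree;

/* Number of trees needed to cover nr_pages swap slots, rounded up. */
static size_t nr_trees_for(unsigned long nr_pages)
{
    return (nr_pages + SWAP_ADDRESS_SPACE_PAGES - 1) >> SWAP_ADDRESS_SPACE_SHIFT;
}

/* Pick the tree (and therefore the lock) that covers a given swap offset. */
static struct sketch_tree *tree_for(unsigned int type, unsigned long offset)
{
    size_t idx = force_single_tree ? 0 : offset >> SWAP_ADDRESS_SPACE_SHIFT;

    assert(type < MAX_TYPES && idx < MAX_TREES);
    return &trees[type][idx];
}
```

With force_single_tree left at 0, offsets in different 64M ranges map to different trees and so to different locks. Setting it to 1 restores the single-tree behaviour, which is the comparison point for the differential test.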

Thanks!