Message-ID: <a9d7103f-efcb-4ab8-91b9-6f1737789c49@arm.com>
Date:   Fri, 13 Oct 2023 17:31:20 +0100
From:   Ryan Roberts <ryan.roberts@....com>
To:     "Huang, Ying" <ying.huang@...el.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        David Hildenbrand <david@...hat.com>,
        Matthew Wilcox <willy@...radead.org>,
        Gao Xiang <xiang@...nel.org>, Yu Zhao <yuzhao@...gle.com>,
        Yang Shi <shy828301@...il.com>, Michal Hocko <mhocko@...e.com>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [RFC PATCH v1 0/2] Swap-out small-sized THP without splitting

On 11/10/2023 07:37, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@....com> writes:
> 
> [...]
> 
>> Finally on testing, I've run the mm selftests and see no regressions, but I
>> don't think there is anything in there specifically aimed towards swap? Are
>> there any functional or performance tests that I should run? It would certainly
>> be good to confirm I haven't regressed PMD-size THP swap performance.
> 
> I have used the swap sub test case of vm-scalability to test.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/

I ended up using `usemem`, which is the core of this test suite, but deviated
from the pre-canned test case so that I could use anonymous memory and get
numbers for small-sized THP (this is a very useful tool - thanks for pointing it
out!).

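Something along these lines should be enough to build it (the repo is the one
linked above; I haven't re-checked the exact make invocation, so treat this as
a sketch):

    git clone https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git
    cd vm-scalability
    make    # should produce the usemem binary used below
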
I've run the tests on an Ampere Altra, set up with a 35G block ram device as
the swap device, and from inside a memcg limited to 40G of memory. I've then run
`usemem` with 70 processes (each with its own core), each allocating and writing
1G of memory. I've repeated everything 5 times and taken the mean and stdev:

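For reference, the setup looked roughly like the sketch below. It is
illustrative only: module parameters, cgroup paths and the usemem options are
approximations and may not exactly match what I ran, and the THP configuration
used to select each "alloc size" row is omitted.

    # 35G ram-backed block device used as the swap device
    # (brd's rd_size is in KiB)
    modprobe brd rd_nr=1 rd_size=$((35 * 1024 * 1024))
    mkswap /dev/ram0
    swapon /dev/ram0

    # memcg limited to 40G (cgroup v2; path is illustrative)
    mkdir /sys/fs/cgroup/swap-test
    echo 40G > /sys/fs/cgroup/swap-test/memory.max
    echo $$ > /sys/fs/cgroup/swap-test/cgroup.procs

    # 70 processes, each allocating and writing 1G of anonymous memory
    # (-n sets the process count; check usemem's usage output for exact flags)
    ./usemem -n 70 1G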

Mean Performance Improvement vs 4K/baseline

| alloc size |            baseline |    remove-huge-flag | swap-file-small-thp |
|            |  v6.6-rc4+anonfolio |           + patch 1 |           + patch 2 |
|:-----------|--------------------:|--------------------:|--------------------:|
| 4K Page    |                0.0% |                2.3% |                9.1% |
| 64K THP    |              -44.1% |              -46.3% |               30.6% |
| 2M THP     |               56.0% |               54.2% |               60.1% |


Standard Deviation as Percentage of Mean

| alloc size |            baseline |    remove-huge-flag | swap-file-small-thp |
|            |  v6.6-rc4+anonfolio |           + patch 1 |           + patch 2 |
|:-----------|--------------------:|--------------------:|--------------------:|
| 4K Page    |                3.4% |                7.1% |                1.7% |
| 64K THP    |                1.9% |                5.6% |                7.7% |
| 2M THP     |                1.9% |                2.1% |                3.2% |


I don't see any meaningful performance cost to removing the HUGE flag, so
hopefully this gives us confidence to move forward with patch 1.

You can indeed see the performance regression in the baseline when THP is
configured to allocate small-sized THP only (in this case 64K). And you can see
that the regression is fixed by patch 2, which avoids splitting the THP and thus
avoids the extra TLBIs. This correlates with what I saw in the kernel
compilation workload.

Huang Ying, based on these results, do you still want me to pursue a per-cpu
solution to avoid potential contention on the swap info lock? If so, I proposed
in the thread against patch 2 to do this in the swap_slots layer rather than in
swapfile.c directly (I'm not sure how your original proposal would actually
work?). But based on these results, it's not obvious to me that there is a
definite problem here, and it might be simpler to avoid the complexity?

Thanks,
Ryan

> 
> --
> Best Regards,
> Huang, Ying
