Message-ID: <5313c721-9cf1-4ecd-ac23-1eeddabd691f@gmail.com>
Date: Mon, 21 Oct 2024 11:40:14 +0100
From: Usama Arif <usamaarif642@...il.com>
To: Barry Song <21cnbao@...il.com>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org, hannes@...xchg.org,
david@...hat.com, willy@...radead.org, kanchana.p.sridhar@...el.com,
yosryahmed@...gle.com, nphamcs@...il.com, chengming.zhou@...ux.dev,
ryan.roberts@....com, ying.huang@...el.com, riel@...riel.com,
shakeel.butt@...ux.dev, kernel-team@...a.com, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org
Subject: Re: [RFC 0/4] mm: zswap: add support for zswapin of large folios
On 21/10/2024 06:09, Barry Song wrote:
> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@...il.com> wrote:
>>
>> After large folio zswapout support added in [1], this patch adds
>> support for zswapin of large folios to bring it on par with zram.
>> This series makes sure that the benefits of large folios (fewer
>> page faults, batched PTE and rmap manipulation, reduced lru list,
>> TLB coalescing (for arm64 and amd)) are not lost at swap out when
>> using zswap.
>>
>> It builds on top of [2] which added large folio swapin support for
>> zram and provides the same level of large folio swapin support as
>> zram, i.e. only supporting swap count == 1.
>>
>> Patch 1 skips swapcache for swapping in zswap pages, this should improve
>> no readahead swapin performance [3], and also allows us to build on large
>> folio swapin support added in [2], hence is a prerequisite for patch 3.
>>
>> Patch 3 adds support for large folio zswapin. This patch does not add
>> support for hybrid backends (i.e. folios partly present in swap and zswap).
>>
>> The main performance benefit comes from maintaining large folios *after*
>> swapin, large folio performance improvements have been mentioned in previous
>> series posted on it [2],[4], so have not added those. Below is a simple
>> microbenchmark to measure the time needed *for* zswpin of 1G memory (along
>> with memory integrity check).
>>
>>                                | no mTHP (ms) | 1M mTHP enabled (ms)
>> Base kernel                    |     1165     |        1163
>> Kernel with mTHP zswpin series |     1203     |         738
>
> Hi Usama,
> Do you know where this minor regression for non-mTHP comes from?
> As you even have skipped swapcache for small folios in zswap in patch1,
> that part should have some gain? is it because of zswap_present_test()?
>
Hi Barry,
The microbenchmark does a sequential read of 1G of memory, so it probably
isn't very representative of real-world use cases. This also means that
swap_vma_readahead is able to accurately read ahead all pages in its window.
With this patch series, if doing 4K swapin, you get 1G/4K calls of fast
do_swap_page. Without this series, you get 1G/(4K * readahead window) slow
do_swap_page calls. I added some prints and saw 8 pages being read ahead
in one do_swap_page. The larger number of calls causes the slight
regression (even though they are quite fast). I think in a realistic scenario,
where the readahead window won't be as large, there won't be a regression.
The cost of zswap_present_test in the whole call stack of swapping in a page
is very low and I think it can be ignored.
I think the more interesting thing is what Kanchana pointed out in
https://lore.kernel.org/all/f2f2053f-ec5f-46a4-800d-50a3d2e61bff@gmail.com/
I am curious, did you see this when testing large folio swapin and compression
at 4K granularity? It looks like swap thrashing, so I think it would be common
between zswap and zram. I don't have larger granularity zswap compression done,
which is why I think there is a regression in time taken. (It could be because
it's tested on Intel as well).
Thanks,
Usama
>>
>> The time measured was pretty consistent between runs (~1-2% variation).
>> There is 36% improvement in zswapin time with 1M folios. The percentage
>> improvement is likely to be more if the memcmp is removed.
>>
>> diff --git a/tools/testing/selftests/cgroup/test_zswap.c b/tools/testing/selftests/cgroup/test_zswap.c
>> index 40de679248b8..77068c577c86 100644
>> --- a/tools/testing/selftests/cgroup/test_zswap.c
>> +++ b/tools/testing/selftests/cgroup/test_zswap.c
>> @@ -9,6 +9,8 @@
>> #include <string.h>
>> #include <sys/wait.h>
>> #include <sys/mman.h>
>> +#include <sys/time.h>
>> +#include <malloc.h>
>>
>> #include "../kselftest.h"
>> #include "cgroup_util.h"
>> @@ -407,6 +409,74 @@ static int test_zswap_writeback_disabled(const char *root)
>> return test_zswap_writeback(root, false);
>> }
>>
>> +static int zswapin_perf(const char *cgroup, void *arg)
>> +{
>> +	long pagesize = sysconf(_SC_PAGESIZE);
>> +	size_t memsize = MB(1*1024);
>> +	char buf[pagesize];
>> +	int ret = -1;
>> +	char *mem;
>> +	struct timeval start, end;
>> +
>> +	mem = (char *)memalign(2*1024*1024, memsize);
>> +	if (!mem)
>> +		return ret;
>> +
>> +	/*
>> +	 * Fill half of each page with increasing data, and keep other
>> +	 * half empty, this will result in data that is still compressible
>> +	 * and ends up in zswap, with material zswap usage.
>> +	 */
>> +	for (int i = 0; i < pagesize; i++)
>> +		buf[i] = i < pagesize/2 ? (char) i : 0;
>> +
>> +	for (int i = 0; i < memsize; i += pagesize)
>> +		memcpy(&mem[i], buf, pagesize);
>> +
>> +	/* Try and reclaim allocated memory */
>> +	if (cg_write_numeric(cgroup, "memory.reclaim", memsize)) {
>> +		ksft_print_msg("Failed to reclaim all of the requested memory\n");
>> +		goto out;
>> +	}
>> +
>> +	gettimeofday(&start, NULL);
>> +	/* zswpin */
>> +	for (int i = 0; i < memsize; i += pagesize) {
>> +		if (memcmp(&mem[i], buf, pagesize)) {
>> +			ksft_print_msg("invalid memory\n");
>> +			goto out;
>> +		}
>> +	}
>> +	gettimeofday(&end, NULL);
>> +	printf ("zswapin took %fms to run.\n", (end.tv_sec - start.tv_sec)*1000 + (double)(end.tv_usec - start.tv_usec) / 1000);
>> +	ret = 0;
>> +out:
>> +	free(mem);
>> +	return ret;
>> +}
>> +
>> +static int test_zswapin_perf(const char *root)
>> +{
>> +	int ret = KSFT_FAIL;
>> +	char *test_group;
>> +
>> +	test_group = cg_name(root, "zswapin_perf_test");
>> +	if (!test_group)
>> +		goto out;
>> +	if (cg_create(test_group))
>> +		goto out;
>> +
>> +	if (cg_run(test_group, zswapin_perf, NULL))
>> +		goto out;
>> +
>> +	ret = KSFT_PASS;
>> +out:
>> +	cg_destroy(test_group);
>> +	free(test_group);
>> +	return ret;
>> +}
>> +
>> /*
>> * When trying to store a memcg page in zswap, if the memcg hits its memory
>> * limit in zswap, writeback should affect only the zswapped pages of that
>> @@ -584,6 +654,7 @@ struct zswap_test {
>> T(test_zswapin),
>> T(test_zswap_writeback_enabled),
>> T(test_zswap_writeback_disabled),
>> + T(test_zswapin_perf),
>> T(test_no_kmem_bypass),
>> T(test_no_invasive_cgroup_shrink),
>> };
>>
>> [1] https://lore.kernel.org/all/20241001053222.6944-1-kanchana.p.sridhar@intel.com/
>> [2] https://lore.kernel.org/all/20240821074541.516249-1-hanchuanhua@oppo.com/
>> [3] https://lore.kernel.org/all/1505886205-9671-5-git-send-email-minchan@kernel.org/T/#u
>> [4] https://lwn.net/Articles/955575/
>>
>> Usama Arif (4):
>> mm/zswap: skip swapcache for swapping in zswap pages
>> mm/zswap: modify zswap_decompress to accept page instead of folio
>> mm/zswap: add support for large folio zswapin
>> mm/zswap: count successful large folio zswap loads
>>
>> Documentation/admin-guide/mm/transhuge.rst | 3 +
>> include/linux/huge_mm.h | 1 +
>> include/linux/zswap.h | 6 ++
>> mm/huge_memory.c | 3 +
>> mm/memory.c | 16 +--
>> mm/page_io.c | 2 +-
>> mm/zswap.c | 120 ++++++++++++++-------
>> 7 files changed, 99 insertions(+), 52 deletions(-)
>>
>> --
>> 2.43.5
>>
>
> Thanks
> barry