Message-ID: <e73c801a-8f36-44d8-a267-dd993aeccf35@gmail.com>
Date: Wed, 14 Aug 2024 11:13:35 +0100
From: Usama Arif <usamaarif642@...il.com>
To: Andi Kleen <ak@...ux.intel.com>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org, hannes@...xchg.org,
riel@...riel.com, shakeel.butt@...ux.dev, roman.gushchin@...ux.dev,
yuzhao@...gle.com, david@...hat.com, baohua@...nel.org,
ryan.roberts@....com, rppt@...nel.org, willy@...radead.org,
cerasuolodomenico@...il.com, corbet@....net, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH v3 0/6] mm: split underutilized THPs
On 13/08/2024 18:22, Andi Kleen wrote:
> Usama Arif <usamaarif642@...il.com> writes:
>>
>> This patch-series is an attempt to mitigate the issue of running out of
>> memory when THP is always enabled. During runtime whenever a THP is being
>> faulted in or collapsed by khugepaged, the THP is added to a list.
>> Whenever memory reclaim happens, the kernel runs the deferred_split
>> shrinker which goes through the list and checks if the THP was underutilized,
>> i.e. how many of the base 4K pages of the entire THP were zero-filled.
>
> Sometimes when writing a benchmark I fill things with zero explicitly
> to avoid faults later. For example if you want to measure memory
> read bandwidth you need to fault the pages first, but that fault
> pattern may well be zero.
>
> With your patch if there is memory pressure there are two effects:
>
> - If things are remapped to the zero page the benchmark
> reading memory may give unrealistically good results because
> what it thinks is a big memory area is actually only backed
> by a single page.
>
> - If I expect to write I may end up with an unexpected zeropage->real
> memory fault if the pages got remapped.
>
> I expect such patterns can happen without benchmarking too.
> I could see it being a problem for latency sensitive applications.
>
> Now you could argue that this all should only happen under memory
> pressure and when that happens things may be slow anyways and your
> patch will still be an improvement.
>
> Maybe that's true but there might be still corner cases
> which are negatively impacted by this. I don't have a good solution
> other than a tunable, but I expect it will cause problems for someone.
>
There are currently two knobs to control the behaviour of the THP low-utilization shrinker introduced in this series.
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none:
The current default value for this is HPAGE_PMD_NR - 1 (511 for x86). If set to 511, the shrinker will immediately remove the folio from the deferred_list (please see the first if statement in thp_underutilized in patch 5) and no split is attempted. Not a single page is checked at this point and there are no memory accesses done that could impact performance.
If someone sets it to 510, the scan will exit as soon as a single page containing non-zero data is encountered (the else part in thp_underutilized). A rough sketch of this check follows the description of the second knob below.
/sys/kernel/mm/transparent_hugepage/thp_low_util_shrinker:
Introduced in patch 6. If someone really doesn't want to enable the shrinker, they can set this to false. The folio will then not be added to the _deferred_list at fault or collapse time, and it will be as if these patches didn't exist. Personally, I don't think it's absolutely necessary to have this, but I added it in case someone comes up with a corner case.
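To make the interaction between max_ptes_none and the shrinker concrete, here is a rough, simplified sketch of the decision logic described above. This is not the actual code from patch 5: the function and variable names (thp_underutilized, khugepaged_max_ptes_none) follow the cover letter, but locking, refcounting and the exact threshold arithmetic are simplified assumptions, so treat it purely as an illustration.

/*
 * Simplified illustration of the per-folio check described above.
 * Not the actual patch code: locking and the exact threshold
 * arithmetic used in patch 5 are omitted.
 */
static bool thp_underutilized(struct folio *folio)
{
	int num_zero_pages = 0, num_filled_pages = 0;
	/* Illustrative limit on non-zero pages before giving up. */
	int filled_limit = HPAGE_PMD_NR - 1 - khugepaged_max_ptes_none;
	long i;

	/*
	 * Default max_ptes_none (HPAGE_PMD_NR - 1, i.e. 511 on x86):
	 * bail out right away, not a single page is read and no
	 * split is attempted.
	 */
	if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
		return false;

	for (i = 0; i < folio_nr_pages(folio); i++) {
		void *kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
		bool zero_filled = !memchr_inv(kaddr, 0, PAGE_SIZE);

		kunmap_local(kaddr);

		if (!zero_filled) {
			/*
			 * Early exit on non-zero data: with max_ptes_none
			 * set to 510 the very first non-zero page ends
			 * the scan and the folio is left alone.
			 */
			if (++num_filled_pages >= filled_limit)
				return false;
		} else if (++num_zero_pages > khugepaged_max_ptes_none) {
			/* Mostly zero-filled: worth splitting. */
			return true;
		}
	}
	return false;
}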
For the first effect you mentioned: with the default behaviour of the patches (max_ptes_none set to 511), there will be no splitting of THPs, so you will get the same performance as without the series.
If there is some benchmark that fills all of system memory with zero pages, causing the shrinker to run, and someone has changed max_ptes_none, kept thp_low_util_shrinker enabled, and all the benchmark does is read those pages back, thus giving good memory results, then that benchmark is not really useful, and the good results it gives are not unrealistic but a consequence of these patches. The stress example in the cover letter is one such case: with these patches you can run stress, or any other benchmark that behaves like this, and still run other memory-consuming applications at the same time, so the improvement is not unrealistic.
For the second effect, memory faults affecting latency-sensitive applications: if THP is always enabled and such applications are running out of memory, causing the shrinker to run, then the higher priority should be having memory to run at all (which the shrinker provides), rather than stalling for memory under pressure, which leads to latency spikes and possibly to the OOM killer being invoked and killing the application.
I think we should focus on real-world applications, for which I have posted numbers in the cover letter, and not tailor this for some benchmarks. If there is some real-world low-latency application where you could show these patches causing an issue, I would be happy to look into it. But again, with the default max_ptes_none of 511, there wouldn't be one.
> The other problem I have with your patch is that it may cause the kernel
> to pollute CPU caches in the background, which again will cause noise in
> the system. Instead of plain memchr_inv, you should probably use some
> primitive to bypass caches or use a NTA prefetch hint at least.
>
A few points on this:
- the page is checked in 2 places, at shrink time and at split time, so having the page in cache is useful and needed.
- there is stuff like this already done in the kernel when there is memory pressure, e.g. at swap time [1]. It's not memchr_inv, but it does exactly the same thing as memchr_inv.
- At the time the shrinker runs, one of the highest priorities of the kernel/system is to get free memory. We should not try to make this slower by messing around with caches.
I think the current behaviour in the patches is good because of the above points. Also, I don't think there is a standard way of doing an NTA prefetch across all architectures: the x86 prefetch [2] does it, but the arm64 prefetch [3] uses pldl1keep, i.e. keeps the data in the L1 cache, which is the opposite of what an NTA prefetch is intended to do. A rough sketch of the kind of whole-page scan being discussed follows the links below.
[1] https://elixir.bootlin.com/linux/v6.10.4/source/mm/zswap.c#L1390
[2] https://elixir.bootlin.com/linux/v6.10.4/source/arch/x86/include/asm/processor.h#L614
[3] https://elixir.bootlin.com/linux/v6.10.4/source/arch/arm64/include/asm/processor.h#L360
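For reference, the zswap-style check mentioned in [1] (not memchr_inv, but equivalent in effect) has roughly the following shape. This is not the actual zswap code, just an illustration of the kind of whole-page read that already happens under memory pressure; the function name and signature here are made up for the example.

/*
 * Illustration only: detect a page filled with one repeating word.
 * For the all-zeroes case this is equivalent in effect to
 * !memchr_inv(addr, 0, PAGE_SIZE) -- either way the whole page is
 * read once while the system is already under memory pressure.
 */
static bool page_is_same_filled(const void *addr, unsigned long *value)
{
	const unsigned long *word = addr;
	unsigned int pos, last = PAGE_SIZE / sizeof(*word) - 1;

	/* Compare every word against the last one; bail on a mismatch. */
	for (pos = 0; pos < last; pos++) {
		if (word[pos] != word[last])
			return false;
	}
	*value = word[last];
	return true;
}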