Date:   Sat, 8 Feb 2020 10:08:23 +0100
From:   Vlastimil Babka <vbabka@...e.cz>
To:     Rik van Riel <riel@...riel.com>, Ivan Babrou <ivan@...udflare.com>,
        linux-mm@...ck.org, Mel Gorman <mgorman@...hsingularity.net>
Cc:     linux-kernel <linux-kernel@...r.kernel.org>,
        kernel-team <kernel-team@...udflare.com>,
        Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: Reclaim regression after 1c30844d2dfe

On 2/8/20 12:05 AM, Rik van Riel wrote:
> On Fri, 2020-02-07 at 14:54 -0800, Ivan Babrou wrote:
>> This change from the 5.5 timeframe:
>>
>> * https://github.com/torvalds/linux/commit/1c30844d2dfe
>>
>>> mm: reclaim small amounts of memory when an external fragmentation
>>> event occurs
>>
>> introduced undesired effects in our environment.
>>
>> * NUMA with 2 x CPU
>> * 128GB of RAM
>> * THP disabled
>> * Upgraded from 4.19 to 5.4
>>
>> Before the upgrade we saw free memory hover at around 1.4GB with no
>> spikes. After the upgrade we saw some machines decide that they need
>> a lot more than that, with frequent spikes above 10GB, often only on
>> a single NUMA node.
>>
>> We can see kswapd quite active in balance_pgdat (it didn't look like
>> it slept at all):
>>
>> $ ps uax | fgrep kswapd
>> root       1850 23.0  0.0      0     0 ?        R    Jan30 1902:24
>> [kswapd0]
>> root       1851  1.8  0.0      0     0 ?        S    Jan30 152:16
>> [kswapd1]
>>
>> This in turn massively increased pressure on the page cache, which
>> did not go over well with services that depend on quick responses
>> from a local cache backed by solid storage.
>>
>> Here's what it looked like when I zeroed vm.watermark_boost_factor:
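
(For reference, the boost can be disabled at runtime by zeroing the
sysctl, e.g. as root:

  # sysctl -w vm.watermark_boost_factor=0

or by writing 0 to /proc/sys/vm/watermark_boost_factor.)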
> 
> We have observed the same thing, even on single node systems.
> 
> I have some hacky patches to apply the watermark_boost thing on
> a per pgdat basis, which seems to resolve the issue, but I have
> not yet found the time to get the locking for that correct.

I wonder why a per-pgdat basis would help in general (might help some
corner cases?). Because I guess fundamentally the issue is the part
"reclaim an amount of memory relative to the size of the high watermark
and the watermark_boost_factor until the boost is cleared".
That means no matter how much memory is already free, it will keep
reclaiming until nr_boost_reclaim reaches zero. This danger of runaway
reclaim wouldn't be there if it only reclaimed up to the boosted
watermark (or some watermark derived from that).
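
To make that concrete, here is a simplified user-space model of that
behavior (not the actual mm/vmscan.c code; the names mirror the
kernel's, the numbers and the reclaim batch size are made up):

/*
 * Simplified user-space model of the behavior described above --
 * NOT the actual mm/vmscan.c code.  The names mirror the kernel's;
 * the numbers and the reclaim batch size are made up.
 */
#include <stdio.h>

struct zone_model {
    unsigned long free_pages;      /* pages currently free */
    unsigned long high_wmark;      /* normal high watermark */
    unsigned long watermark_boost; /* boost from a fragmentation event */
};

int main(void)
{
    struct zone_model zone = {
        .free_pages      = 2500000, /* ~10GB of 4K pages already free */
        .high_wmark      = 100000,
        .watermark_boost = 50000,
    };
    unsigned long initially_free = zone.free_pages;

    /*
     * Current behavior: the target is sized from the boost alone and
     * is worked off until it hits zero, regardless of how much memory
     * is already free.
     */
    unsigned long nr_boost_reclaim = zone.watermark_boost;
    unsigned long reclaimed = 0;

    while (nr_boost_reclaim > 0) {
        unsigned long batch = 32; /* pretend per-iteration shrink result */

        nr_boost_reclaim -= batch < nr_boost_reclaim ? batch : nr_boost_reclaim;
        reclaimed += batch;
        zone.free_pages += batch;
    }
    printf("reclaimed %lu pages although %lu were already free\n",
           reclaimed, initially_free);

    /*
     * Alternative suggested above: stop as soon as free pages reach the
     * boosted watermark, which here is true before any reclaim happens.
     */
    if (initially_free >= zone.high_wmark + zone.watermark_boost)
        printf("boosted watermark already met, nothing to reclaim\n");

    return 0;
}

In the model the amount reclaimed depends only on the boost, while the
alternative check stops before doing any work because the boosted
watermark is already met.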

But yeah, it's also weird that if you have so much free memory, you keep
getting the external fragmentation events that wake up kswapd for
boosting in the first place. That's worth investigating too.

> Given how rare the watermark boosting is, maybe the answer is
> just to use atomics? Not sure :)
> 
