linux-kernel - Re: [PATCH RESEND] mm: drop lruvec->lru_lock if contended when skipping folio

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <29e481af-b5e1-4320-a672-8251f5099595@gmail.com>
Date: Tue, 20 Aug 2024 11:45:11 -0400
From: Usama Arif <usamaarif642@...il.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: yuzhao@...gle.com, david@...hat.com, leitao@...ian.org,
 huangzhaoyang@...il.com, bharata@....com, willy@...radead.org,
 vbabka@...e.cz, linux-kernel@...r.kernel.org, kernel-team@...a.com,
 Johannes Weiner <hannes@...xchg.org>, zhaoyang.huang@...soc.com,
 Rik van Riel <riel@...riel.com>
Subject: Re: [PATCH RESEND] mm: drop lruvec->lru_lock if contended when
 skipping folio



On 20/08/2024 02:17, Andrew Morton wrote:
> On Mon, 19 Aug 2024 19:46:48 +0100 Usama Arif <usamaarif642@...il.com> wrote:
> 
>> lruvec->lru_lock is highly contended and is held when calling
>> isolate_lru_folios. If the lru has a large number of CMA folios
>> consecutively, while the allocation type requested is not MIGRATE_MOVABLE,
>> isolate_lru_folios can hold the lock for a very long time while it
>> skips those. vmscan_lru_isolate tracepoint showed that skipped can go
>> above 70k in production and lockstat shows that waittime-max is x1000
>> higher without this patch.
>> This can cause lockups [1] and high memory pressure for extended periods of
>> time [2]. Hence release the lock if its contended when skipping a folio to
>> give other tasks a chance to acquire it and not stall.
>>
>> ...
>>
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1695,8 +1695,14 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
>>  		if (folio_zonenum(folio) > sc->reclaim_idx ||
>>  				skip_cma(folio, sc)) {
>>  			nr_skipped[folio_zonenum(folio)] += nr_pages;
>> -			move_to = &folios_skipped;
>> -			goto move;
>> +			list_move(&folio->lru, &folios_skipped);
>> +			if (!spin_is_contended(&lruvec->lru_lock))
>> +				continue;
>> +			if (!list_empty(dst))
>> +				break;
>> +			spin_unlock_irq(&lruvec->lru_lock);
>> +			cond_resched();
>> +			spin_lock_irq(&lruvec->lru_lock);
>>  		}
> 
> Oh geeze ugly thing.  Must we do this?
> 
> The games that function plays with src, dst and move_to are a bit hard
> to follow.  Some tasteful comments explaining what's going on would
> help.
> 
> Also that test of !list_empty(dst).  It would be helpful to comment the
> dynamics which are happening in this case - why we're testing dst here.
> 
> 

So Johannes pointed out to me that this is not going to properly fix the problem of holding the lru_lock for a long time introduced in [1] because of 2 reasons:
- the task that is doing lock break is hoarding folios on folios_skipped and making the lru shorter, I didn't see it in the usecase I was trying, but it could be that yielding the lock to the other task is not of much use as it is going to go through a much shorter lru list or even an empty lru list and would OOM, while the folio it is looking for is on folios_skipped. We would be substituting one OOM problem for another with this patch.
- Compaction code goes through pages by pfn and not using the list, as this patch does not clear lru flag, compaction could claim this folio.

The patch in [1] is severely breaking production at Meta and its not a proper fix to the problem that the commit was trying to be solved. It results in holding the lru_lock for a very significant amount of time, stalling all other processes trying to claim memory, creating very high memory pressure for large periods of time and causing OOM.

The way forward would be to revert it and try to come up with a longer term solution that the original commit tried to solve. If no one is opposed to it, I will wait a couple of days for comments and send a revert patch.

[1] https://lore.kernel.org/all/1685501461-19290-1-git-send-email-zhaoyang.huang@unisoc.com/

Thanks,
Usama