Message-ID: <0b8e22e0-7ba0-49ac-9e05-8f473b3c8ee3@gmail.com>
Date: Wed, 21 Aug 2024 14:33:48 -0400
From: Usama Arif <usamaarif642@...il.com>
To: Zhaoyang Huang <huangzhaoyang@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, yuzhao@...gle.com,
david@...hat.com, leitao@...ian.org, bharata@....com, willy@...radead.org,
vbabka@...e.cz, linux-kernel@...r.kernel.org, kernel-team@...a.com,
Johannes Weiner <hannes@...xchg.org>, zhaoyang.huang@...soc.com,
Rik van Riel <riel@...riel.com>
Subject: Re: [PATCH RESEND] mm: drop lruvec->lru_lock if contended when
skipping folio
On 21/08/2024 01:51, Zhaoyang Huang wrote:
> On Tue, Aug 20, 2024 at 11:45 PM Usama Arif <usamaarif642@...il.com> wrote:
>>
>>
>>
>> On 20/08/2024 02:17, Andrew Morton wrote:
>>> On Mon, 19 Aug 2024 19:46:48 +0100 Usama Arif <usamaarif642@...il.com> wrote:
>>>
>>>> lruvec->lru_lock is highly contended and is held when calling
>>>> isolate_lru_folios. If the lru contains a long consecutive run of CMA
>>>> folios while the allocation type requested is not MIGRATE_MOVABLE,
>>>> isolate_lru_folios can hold the lock for a very long time while it
>>>> skips them. The vmscan_lru_isolate tracepoint showed that skipped can
>>>> go above 70k in production, and lockstat showed that waittime-max is
>>>> 1000x higher without this patch.
>>>> This can cause lockups [1] and high memory pressure for extended
>>>> periods of time [2]. Hence release the lock if it's contended when
>>>> skipping a folio, to give other tasks a chance to acquire it and not
>>>> stall.
>>>>
>>>> ...
>>>>
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -1695,8 +1695,14 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
>>>>  		if (folio_zonenum(folio) > sc->reclaim_idx ||
>>>>  				skip_cma(folio, sc)) {
>>>>  			nr_skipped[folio_zonenum(folio)] += nr_pages;
>>>> -			move_to = &folios_skipped;
>>>> -			goto move;
>>>> +			list_move(&folio->lru, &folios_skipped);
>>>> +			if (!spin_is_contended(&lruvec->lru_lock))
>>>> +				continue;
>>>> +			if (!list_empty(dst))
>>>> +				break;
>>>> +			spin_unlock_irq(&lruvec->lru_lock);
>>>> +			cond_resched();
>>>> +			spin_lock_irq(&lruvec->lru_lock);
>>>>  		}
>>>
>>> Oh geeze ugly thing. Must we do this?
>>>
>>> The games that function plays with src, dst and move_to are a bit hard
>>> to follow. Some tasteful comments explaining what's going on would
>>> help.
>>>
>>> Also that test of !list_empty(dst). It would be helpful to comment the
>>> dynamics which are happening in this case - why we're testing dst here.
>>>
>>>
>>
>> So Johannes pointed out to me that this is not going to properly fix the long lru_lock hold times introduced in [1], for 2 reasons:
>> - the task doing the lock break is hoarding folios on folios_skipped, making the lru shorter. I didn't see it in the usecase I was trying, but yielding the lock may not help the other task much: it would then walk a much shorter (or even empty) lru list and OOM, while the folios it is looking for sit on folios_skipped. We would be substituting one OOM problem for another with this patch.
> Even without the original patch, other tasks cannot get folios either
> if they cannot use CMA after the reclaim. I am ok with your fix
> patch, except that the 'if (!list_empty(dst))' puzzles me.
The 'if (!list_empty(dst))' was there so that if the spinlock is contended while we are still skipping, shrink_active/inactive_list can go ahead and use the folios already isolated to dst, rather than continuing to skip (see the annotated sketch below).
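To spell out the dynamics Andrew asked about, here is the skip path from the fix again with comments added; the code is the hunk quoted above, and the comments are my annotation of it, not part of the patch:

	if (folio_zonenum(folio) > sc->reclaim_idx ||
			skip_cma(folio, sc)) {
		nr_skipped[folio_zonenum(folio)] += nr_pages;
		/* Park the unusable folio on a local skip list. */
		list_move(&folio->lru, &folios_skipped);

		/* Nobody is waiting for lru_lock: keep scanning. */
		if (!spin_is_contended(&lruvec->lru_lock))
			continue;

		/*
		 * Contended, but we already isolated something to
		 * dst: stop here so the caller can reclaim what we
		 * have instead of skipping further while others wait.
		 */
		if (!list_empty(dst))
			break;

		/* Contended with nothing isolated yet: yield. */
		spin_unlock_irq(&lruvec->lru_lock);
		cond_resched();
		spin_lock_irq(&lruvec->lru_lock);
	}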
Both the original patch and this fix substitute one OOM problem for another.
>> - Compaction code goes through pages by pfn rather than via the list; since this patch does not clear the lru flag, compaction could claim a folio that is sitting on folios_skipped.
>>
>> The patch in [1] is severely breaking production at Meta, and it's not a proper fix to the problem the commit was trying to solve. It results in holding the lru_lock for a very significant amount of time, stalling all other processes trying to claim memory, creating very high memory pressure for long periods of time, and causing OOMs.
>>
>> The way forward would be to revert it and try to come up with a longer-term solution to the problem the original commit tried to solve. If no one is opposed, I will wait a couple of days for comments and then send a revert patch.
> I mainly focus on Android systems, which do not hit the scenarios
> servers encounter. I agree with reverting it if this fix cannot be
> accepted.
>>
Yeah, I think the 2nd point about compaction is the bigger issue. We could try fixing it by clearing the lru flag, but the first point would remain. I will send a revert for the original patch; the compaction race is sketched below.
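For completeness, the compaction side of the race, roughly as the pfn walk in isolate_migratepages_block() (mm/compaction.c) looks. This is a heavily simplified sketch, and the comment describing the interaction is my reading of the problem, not code or text from the thread:

	/*
	 * Compaction scans a pfn range directly rather than walking
	 * the lru list.
	 */
	for (; low_pfn < end_pfn; low_pfn++) {
		struct page *page = pfn_to_page(low_pfn);
		struct folio *folio = page_folio(page);

		/* ... various validity checks elided ... */

		/*
		 * Isolation is gated on the lru flag. A folio parked
		 * on reclaim's private folios_skipped list still has
		 * the flag set, so once reclaim drops lru_lock to
		 * break contention, compaction can win this test and
		 * take a folio that is still linked into
		 * folios_skipped.
		 */
		if (!folio_test_clear_lru(folio))
			goto isolate_fail;

		/* ... isolate the folio for migration ... */
	}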
Thanks
>> [1] https://lore.kernel.org/all/1685501461-19290-1-git-send-email-zhaoyang.huang@unisoc.com/
>>
>> Thanks,
>> Usama