Message-ID: <207a00a2-0895-4086-97ae-d31ead423cf8@redhat.com>
Date: Tue, 8 Apr 2025 12:04:42 +0200
From: David Hildenbrand <david@...hat.com>
To: Jinjiang Tu <tujinjiang@...wei.com>, yangge1116@....com,
akpm@...ux-foundation.org
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org, stable@...r.kernel.org,
21cnbao@...il.com, baolin.wang@...ux.alibaba.com,
aneesh.kumar@...ux.ibm.com, liuzixing@...on.cn,
Kefeng Wang <wangkefeng.wang@...wei.com>
Subject: Re: [PATCH V4] mm/gup: Clear the LRU flag of a page before adding to
LRU batch
On 08.04.25 10:47, Jinjiang Tu wrote:
>
> On 2025/4/1 22:33, David Hildenbrand wrote:
>> On 27.03.25 12:16, Jinjiang Tu wrote:
>>>
>>> On 2025/3/26 20:46, David Hildenbrand wrote:
>>>> On 26.03.25 13:42, Jinjiang Tu wrote:
>>>>> Hi,
>>>>>
>>>>
>>>> Hi!
>>>>
>>>>> We noticed a 12.3% performance regression in the LibMicro pwrite
>>>>> testcase due to
>>>>> commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before
>>>>> adding to LRU batch").
>>>>>
>>>>> The testcase is executed as follows, and the file is a tmpfs file.
>>>>> pwrite -E -C 200 -L -S -W -N "pwrite_t1k" -s 1k -I 500 -f $TFILE
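>>>>>
>>>>> For reference, the access pattern boils down to roughly this C sketch
>>>>> (hypothetical, not the actual LibMicro source; file path and iteration
>>>>> count are illustrative):
>>>>>
>>>>> #include <fcntl.h>
>>>>> #include <unistd.h>
>>>>>
>>>>> int main(void)
>>>>> {
>>>>> 	/* assumes the file lives on tmpfs, e.g. /tmp/pwrite_t1k */
>>>>> 	int fd = open("/tmp/pwrite_t1k", O_CREAT | O_RDWR, 0600);
>>>>> 	char buf[1024] = { 0 };
>>>>>
>>>>> 	/* rewrite the same 1KB page at offset 0 over and over */
>>>>> 	for (long i = 0; i < 100000; i++)
>>>>> 		pwrite(fd, buf, sizeof(buf), 0);
>>>>>
>>>>> 	close(fd);
>>>>> 	return 0;
>>>>> }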
>>>>
>>>> Do we know how much that reflects real workloads? (IOW, how much
>>>> should we care)
>>>
>>> No, it's hard to say.
>>>
>>>>
>>>>>
>>>>> This testcase writes 1KB (only one page) to the tmpfs file and
>>>>> repeats this step many times. The flame
>>>>> graph shows the performance regression comes from
>>>>> folio_mark_accessed() and workingset_activation().
>>>>>
>>>>> folio_mark_accessed() is called for the same page many times.
>>>>> Before this patch, each call would
>>>>> add the page to cpu_fbatches.activate. When the fbatch was full, it
>>>>> was drained and the page
>>>>> was promoted to the active list. After that, folio_mark_accessed() did
>>>>> nothing in later calls.
>>>>>
>>>>> But after this patch, the folio's LRU flag is cleared after it is
>>>>> added to cpu_fbatches.activate. From then on,
>>>>> folio_mark_accessed() never calls folio_activate() again because the
>>>>> page lacks the LRU flag, so
>>>>> the fbatch never fills up and the folio is never marked active;
>>>>> later folio_mark_accessed()
>>>>> calls always invoke workingset_activation(), leading to the
>>>>> performance regression.
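>>>>>
>>>>> The relevant control flow is roughly the following (a simplified
>>>>> sketch from memory, omitting the MGLRU and unevictable paths; see
>>>>> mm/swap.c for the real code):
>>>>>
>>>>> void folio_mark_accessed(struct folio *folio)
>>>>> {
>>>>> 	if (!folio_test_referenced(folio)) {
>>>>> 		folio_set_referenced(folio);
>>>>> 	} else if (!folio_test_active(folio)) {
>>>>> 		if (folio_test_lru(folio))
>>>>> 			/* queues the folio, clearing its LRU flag */
>>>>> 			folio_activate(folio);
>>>>> 		else
>>>>> 			/* LRU flag gone: searches the add batch in vain */
>>>>> 			__lru_cache_activate_folio(folio);
>>>>> 		folio_clear_referenced(folio);
>>>>> 		/* keeps running on later calls until a drain happens */
>>>>> 		workingset_activation(folio);
>>>>> 	}
>>>>> }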
>>>>
>>>> Would there be a good place to drain the LRU to effectively get that
>>>> processed? (we can always try draining if the LRU flag is not set)
>>>
>>> Maybe we could search the cpu_fbatches.activate of the
>>> local CPU in __lru_cache_activate_folio()? Draining the other fbatches
>>> would be meaningless.
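>>>
>>> Something like this hypothetical helper (untested; names guessed from
>>> mm/swap.c):
>>>
>>> /* Hypothetical: does the local CPU's activate batch hold @folio? */
>>> static bool activate_batch_contains(struct folio *folio)
>>> {
>>> 	struct folio_batch *fbatch;
>>> 	bool found = false;
>>> 	int i;
>>>
>>> 	local_lock(&cpu_fbatches.lock);
>>> 	fbatch = this_cpu_ptr(&cpu_fbatches.activate);
>>> 	for (i = folio_batch_count(fbatch) - 1; i >= 0; i--) {
>>> 		if (fbatch->folios[i] == folio) {
>>> 			found = true;
>>> 			break;
>>> 		}
>>> 	}
>>> 	local_unlock(&cpu_fbatches.lock);
>>> 	return found;
>>> }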
>>
>> So the current behavior is that folio_mark_accessed() will end up
>> calling folio_activate()
>> once, and then __lru_cache_activate_folio() until the LRU is drained
>> (which can
>> take a looong time).
>>
>> The old behavior was that folio_mark_accessed() would keep calling
>> folio_activate() until
>> the LRU was drained, simply because it was full of "all the same pages"
>> :). Only *after*
>> the LRU was drained would folio_mark_accessed() actually not do
>> anything (desired behavior).
>>
>> So the overhead comes primarily from __lru_cache_activate_folio()
>> searching through
>> the list more often, because the LRU gets drained less
>> frequently; and
>> it never finds the folio in there in this case.
>>
>> So ... it used to be suboptimal before, now it's more suboptimal I
>> guess?! :)
>>
>> We'd need a way to better identify "this folio is already queued for
>> activation". Searching
>> that list as well would be one option, but the whole "search the list"
>> approach is nasty.
>>
>> Maybe we can simply set the folio as active when staging it for
>> activation, after having
>> cleared the LRU flag? We already do that during folio_add.
>>
>> As the LRU flag was cleared, nobody should be messing with that folio
>> until the cache is
>> drained and the move is successful.
>>
>>
>> Pretty sure this doesn't work, but just to throw out an idea:
>>
>> From c26e1c0ceda6c818826e5b89dc7c7c9279138f80 Mon Sep 17 00:00:00 2001
>> From: David Hildenbrand <david@...hat.com>
>> Date: Tue, 1 Apr 2025 16:31:56 +0200
>> Subject: [PATCH] test
>>
>> Signed-off-by: David Hildenbrand <david@...hat.com>
>> ---
>> mm/swap.c | 21 ++++++++++++++++-----
>> 1 file changed, 16 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/swap.c b/mm/swap.c
>> index fc8281ef42415..bbf9aa76db87f 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -175,6 +175,8 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
>>  	folios_put(fbatch);
>>  }
>>
>> +static void lru_activate(struct lruvec *lruvec, struct folio *folio);
>> +
>>  static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
>>  		struct folio *folio, move_fn_t move_fn,
>>  		bool on_lru, bool disable_irq)
>> @@ -191,6 +193,10 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
>>  	else
>>  		local_lock(&cpu_fbatches.lock);
>>
>> +	/* We'll only perform the actual list move deferred. */
>> +	if (move_fn == lru_activate)
>> +		folio_set_active(folio);
>> +
>>  	if (!folio_batch_add(this_cpu_ptr(fbatch), folio) || folio_test_large(folio) ||
>>  	    lru_cache_disabled())
>>  		folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
>> @@ -299,12 +305,14 @@ static void lru_activate(struct lruvec *lruvec, struct folio *folio)
>>  {
>>  	long nr_pages = folio_nr_pages(folio);
>>
>> -	if (folio_test_active(folio) || folio_test_unevictable(folio))
>> -		return;
>> -
>> +	/*
>> +	 * We set the folio active after clearing the LRU flag, and set the
>> +	 * LRU flag only after moving it to the right list.
>> +	 */
>> +	VM_WARN_ON_ONCE(!folio_test_active(folio));
>> +	VM_WARN_ON_ONCE(folio_test_unevictable(folio));
>>
>>  	lruvec_del_folio(lruvec, folio);
>> -	folio_set_active(folio);
>>  	lruvec_add_folio(lruvec, folio);
>>  	trace_mm_lru_activate(folio);
>>
>> @@ -342,7 +350,10 @@ void folio_activate(struct folio *folio)
>>  		return;
>>
>>  	lruvec = folio_lruvec_lock_irq(folio);
>> -	lru_activate(lruvec, folio);
>> +	if (!folio_test_unevictable(folio)) {
>> +		folio_set_active(folio);
>> +		lru_activate(lruvec, folio);
>> +	}
>>  	unlock_page_lruvec_irq(lruvec);
>>  	folio_set_lru(folio);
>>  }
>
> I tested with the patch, and the performance regression disappears.
>
> By the way, I see folio_test_unevictable() is called in lru_deactivate, lru_lazyfree, etc.
> The unevictable flag is only set after the caller clears the LRU flag. IIUC, since commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before
> adding to LRU batch"), folios in an fbatch can't have the unevictable flag set, so there is no need to check the unevictable flag in lru_deactivate, lru_lazyfree, etc.?
I was asking myself the exact same question when crafting this patch.
Sounds like a cleanup worth investigating! :)
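
Just to sketch the direction (untested, mirroring the VM_WARN_ON_ONCE
approach from the patch above):

static void lru_deactivate(struct lruvec *lruvec, struct folio *folio)
{
	/*
	 * Folios are isolated (LRU flag cleared) before being queued, and
	 * unevictable folios are never queued since commit 33dfe9204f29,
	 * so the check could become a warning:
	 */
	VM_WARN_ON_ONCE(folio_test_unevictable(folio));

	/* ... existing deactivation logic unchanged ... */
}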
Do you have capacity to look into that, and to turn my proposal into a
proper patch? It might take me a bit until I get to it.
--
Cheers,
David / dhildenb