Message-ID: <d2ae87ee-8ee3-0758-a433-8c937e5e3fb5@intel.com>
Date: Thu, 20 Jul 2023 20:02:11 +0800
From: "Yin, Fengwei" <fengwei.yin@...el.com>
To: Yosry Ahmed <yosryahmed@...gle.com>,
Hugh Dickins <hughd@...gle.com>
CC: Yu Zhao <yuzhao@...gle.com>, <linux-mm@...ck.org>,
<linux-kernel@...r.kernel.org>, <akpm@...ux-foundation.org>,
<willy@...radead.org>, <david@...hat.com>, <ryan.roberts@....com>,
<shy828301@...il.com>
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle
large folio
On 7/19/2023 11:44 PM, Yosry Ahmed wrote:
> On Wed, Jul 19, 2023 at 7:26 AM Hugh Dickins <hughd@...gle.com> wrote:
>>
>> On Wed, 19 Jul 2023, Yin Fengwei wrote:
>>>>>>>>>>>> Could this also happen with a normal 4K page? I mean, when a user tries to munlock
>>>>>>>>>>>> a normal 4K page while that 4K page is isolated, so it becomes an unevictable page?
>>>>>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
>>>>>>>>>>> cpu 2 is isolating the folio for any purpose:
>>>>>>>>>>>
>>>>>>>>>>> cpu1                                cpu2
>>>>>>>>>>>                                     isolate folio
>>>>>>>>>>> folio_test_clear_lru() // 0
>>>>>>>>>>>                                     putback folio // add to unevictable list
>>>>>>>>>>> folio_test_clear_mlocked()
>>>>>>>>                                        folio_set_lru()
>>> Let's wait for the response from Hugh and Yu. :)
>>
>> I haven't been able to give it enough thought, but I suspect you are right:
>> that the current __munlock_folio() is deficient when folio_test_clear_lru()
>> fails.
>>
>> (Though it has not been reported as a problem in practice: perhaps because
>> so few places try to isolate from the unevictable "list".)
>>
>> I forget what my order of development was, but it's likely that I first
>> wrote the version for our own internal kernel - which used our original
>> lruvec locking, which did not depend on getting PG_lru first (having got
>> lru_lock, it checked memcg, then tried again if that had changed).
>
> Right. Just holding the lruvec lock without clearing PG_lru would not
> protect against memcg movement in this case.
>
>>
>> I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
>> but it turned out to work okay - elsewhere; but it looks as if I missed
>> its implication when adapting __munlock_page() for upstream.
>>
>> If I were trying to fix this __munlock_folio() race myself (sorry, I'm
>> not), I would first look at that aspect: instead of folio_test_clear_lru()
>> behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
>> spin waiting for PG_lru here?
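Just to make sure I follow the suggestion, here is a minimal sketch of what such a
helper might look like. The name folio_wait_clear_lru() and everything in the body
are purely illustrative (nothing like this exists in the tree today):

/*
 * Hypothetical helper, following the suggestion above: instead of treating
 * folio_test_clear_lru() as a trylock, spin until a concurrent isolator has
 * put the folio back (restoring PG_lru) and then claim it ourselves.
 * Termination/failure handling is deliberately omitted here.
 */
static void folio_wait_clear_lru(struct folio *folio)
{
        while (!folio_test_clear_lru(folio))
                cpu_relax();    /* wait for the putback to set PG_lru again */
}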
>
> +Matthew Wilcox
>
> It seems to me that before 70dea5346ea3 ("mm/swap: convert lru_add to
> a folio_batch"), __pagevec_lru_add_fn() (aka lru_add_fn()) used to do
> folio_set_lru() before checking folio_evictable(). While this is
> probably extraneous since folio_batch_move_lru() will set it again
> afterwards, it's probably harmless given that the lruvec lock is held
> throughout (so no one can complete the folio isolation anyway), and
> given that there were no problems introduced by this extra
> folio_set_lru() as far as I can tell.
After checking the related code: yes, it looks fine to move folio_set_lru()
before the if (folio_evictable(folio)) check in lru_add_fn(), because the
lru lock is held there.
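Roughly, that would look like the sketch below. This is not a formal patch: the
lru_add_fn() body is written from memory and simplified, so details may not match
the current tree exactly.

static void lru_add_fn(struct lruvec *lruvec, struct folio *folio)
{
        int was_unevictable = folio_test_clear_unevictable(folio);
        long nr_pages = folio_nr_pages(folio);

        VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);

        /*
         * Set PG_lru before deciding which list the folio goes on.  This
         * should be harmless because the lru lock is held throughout: a
         * racing isolator which clears PG_lru here still has to take the
         * lru lock before it can remove the folio from the list.
         */
        folio_set_lru(folio);

        if (folio_evictable(folio)) {
                if (was_unevictable)
                        __count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);
        } else {
                folio_clear_active(folio);
                folio_set_unevictable(folio);
                if (!was_unevictable)
                        __count_vm_events(UNEVICTABLE_PGCULLED, nr_pages);
        }

        lruvec_add_folio(lruvec, folio);
}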
>
> If we restore folio_set_lru() to lru_add_fn(), and revert 2262ace60713
> ("mm/munlock:
> delete smp_mb() from __pagevec_lru_add_fn()") to restore the strict
> ordering between manipulating PG_lru and PG_mlocked, I suppose we can
> get away without having to spin. Again, that would only be possible if
> reworking mlock_count [1] is acceptable. Otherwise, we can't clear
> PG_mlocked before PG_lru in __munlock_folio().
What about the following change, which moves the PG_mlocked manipulation before
the PG_lru check in __munlock_folio()?
diff --git a/mm/mlock.c b/mm/mlock.c
index 0a0c996c5c21..514f0d5bfbfd 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -122,7 +122,9 @@ static struct lruvec *__mlock_new_folio(struct folio *folio, struct lruvec *lruv
 static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec)
 {
         int nr_pages = folio_nr_pages(folio);
-        bool isolated = false;
+        bool isolated = false, mlocked = true;
+
+        mlocked = folio_test_clear_mlocked(folio);
 
         if (!folio_test_clear_lru(folio))
                 goto munlock;
@@ -134,13 +136,17 @@ static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec
                 /* Then mlock_count is maintained, but might undercount */
                 if (folio->mlock_count)
                         folio->mlock_count--;
-                if (folio->mlock_count)
+                if (folio->mlock_count) {
+                        if (mlocked)
+                                folio_set_mlocked(folio);
                         goto out;
+                }
         }
         /* else assume that was the last mlock: reclaim will fix it if not */
 
 munlock:
-        if (folio_test_clear_mlocked(folio)) {
+        if (mlocked) {
                 __zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
                 if (isolated || !folio_test_unevictable(folio))
                         __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
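With PG_mlocked cleared up front, my understanding is that the interleaving quoted
earlier would turn into something like the following (leaving aside, for now, the
memory-ordering question around reverting 2262ace60713):

cpu1                                        cpu2
                                            isolate folio
folio_test_clear_mlocked()
folio_test_clear_lru() // 0, goto munlock
NR_MLOCK / event accounting
                                            putback folio // folio_evictable()
                                                          // now sees !mlocked,
                                                          // folio goes to an
                                                          // evictable list
                                            folio_set_lru()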
>
> I am not saying this is necessarily better than spinning, just a note
> (and perhaps selfishly making [1] more appealing ;)).
>
> [1] https://lore.kernel.org/lkml/20230618065719.1363271-1-yosryahmed@google.com/
>
>>
>> Hugh