Message-ID: <c85a035b-70d2-47df-b2c3-db255c98f6ff@huawei.com>
Date: Mon, 21 Oct 2024 17:56:10 +0800
From: chenridong <chenridong@...wei.com>
To: Barry Song <21cnbao@...il.com>, Chen Ridong <chenridong@...weicloud.com>
CC: Kefeng Wang <wangkefeng.wang@...wei.com>, <akpm@...ux-foundation.org>,
<linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>,
<wangweiyang2@...wei.com>, Michal Hocko <mhocko@...e.com>, Johannes Weiner
<hannes@...xchg.org>, Yosry Ahmed <yosryahmed@...gle.com>, Yu Zhao
<yuzhao@...gle.com>, David Hildenbrand <david@...hat.com>, Matthew Wilcox
<willy@...radead.org>, Ryan Roberts <ryan.roberts@....com>
Subject: Re: [PATCH v3] mm/vmscan: stop the loop if enough pages have been
page_out
On 2024/10/21 17:42, Barry Song wrote:
> On Mon, Oct 21, 2024 at 9:14 PM Chen Ridong <chenridong@...weicloud.com> wrote:
>>
>>
>>
>> On 2024/10/21 12:44, Barry Song wrote:
>>> On Fri, Oct 11, 2024 at 7:49 PM chenridong <chenridong@...wei.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2024/10/11 0:17, Barry Song wrote:
>>>>> On Thu, Oct 10, 2024 at 4:59 PM Kefeng Wang <wangkefeng.wang@...wei.com> wrote:
>>>>>>
>>>>>> Hi Ridong,
>>>>>>
>>>>>> This should be the first version for upstream, and the issue only
>>>>>> occurs when a large folio is split.
>>>>>>
>>>>>> Adding more CCs to see if there's more feedback.
>>>>>>
>>>>>>
>>>>>> On 2024/10/10 16:18, Chen Ridong wrote:
>>>>>>> From: Chen Ridong <chenridong@...wei.com>
>>>>>>>
>>>>>>> An issue was found with the following testing steps:
>>>>>>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y
>>>>>>> 2. Mount memcg v1, and create memcg named test_memcg and set
>>>>>>> usage_in_bytes=2.1G, memsw.usage_in_bytes=3G.
>>>>>>> 3. Create a 1G swap file, and allocate 2.2G anon memory in test_memcg.
>>>>>>>
>>>>>>> It was found that:
>>>>>>>
>>>>>>> cat memory.usage_in_bytes
>>>>>>> 2144940032
>>>>>>> cat memory.memsw.usage_in_bytes
>>>>>>> 2255056896
>>>>>>>
>>>>>>> free -h
>>>>>>> total used free
>>>>>>> Mem: 31Gi 2.1Gi 27Gi
>>>>>>> Swap: 1.0Gi 618Mi 405Mi
>>>>>>>
>>>>>>> As shown above, test_memcg used only about 100M of swap, but 600M+ of swap
>>>>>>> was in use system-wide, which means about 500M may be wasted because other
>>>>>>> memcgs cannot use that swap space.
>>>>>>>
>>>>>>> It can be explained as follows:
>>>>>>> 1. When entering shrink_inactive_list, folios are isolated from the LRU from
>>>>>>> tail to head. For simplicity, suppose it just takes folioN from the LRU.
>>>>>>>
>>>>>>> inactive lru: folio1<->folio2<->folio3...<->folioN-1
>>>>>>> isolated list: folioN
>>>>>>>
>>>>>>> 2. In shrink_page_list, if folioN is a THP, it may be split and added to
>>>>>>> the swap cache folio by folio. After being added to the swap cache, each
>>>>>>> folio is submitted for writeback to swap, which is asynchronous.
>>>>>>> When shrink_page_list finishes, the isolated folio list is moved back to
>>>>>>> the head of the inactive LRU. The inactive LRU may then look like this,
>>>>>>> with 512 folios having been moved to the head of the inactive LRU:
>>>>>>>
>>>>>>> folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
>>>>>>>
>>>>>>> 3. When a folio's writeback I/O completes, the folio may be rotated to the
>>>>>>> tail of the LRU. The following LRU list is expected, with the folios that
>>>>>>> were added to the swap cache rotated to the tail of the LRU, so that they
>>>>>>> can be reclaimed as soon as possible:
>>>>>>>
>>>>>>> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>>>>>>>
>>>>>>> 4. However, shrink_page_list and folio writeback are asynchronous. If the THP
>>>>>>> is split, shrink_page_list loops at least 512 times, which means that some
>>>>>>> folios can finish writeback before shrink_page_list itself completes, and
>>>>>>> their rotation to the tail of the LRU then fails. The LRU may look like
>>>>>>> this:
>>>>>
>>>>> I assume you’re referring to PMD-mapped THP, but your code also modifies
>>>>> mTHP, which might not be that large. For instance, it could be a 16KB mTHP.
>>>>>
>>>>>>>
>>>>>>> folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
>>>>>>> folioN51<->folioN52<->...folioN511<->folioN512
>>>>>>>
>>>>>>> Although those folios (N1-N50) have finished writing back, they are still
>>>>>>> at the head of the LRU. Since isolation scans the LRU from tail to head,
>>>>>>> it is difficult to scan those folios again.
>>>>>>>
>>>>>>> As described above, a large number of folios may be added to the swap cache
>>>>>>> but not reclaimed in time, which may reduce reclaim efficiency and prevent
>>>>>>> other memcgs from using this swap space even if they trigger OOM.
>>>>>>>
>>>>>>> To fix this issue, stop looping if a THP has been split and nr_pageout is
>>>>>>> greater than nr_to_reclaim.
>>>>>>>
>>>>>>> Signed-off-by: Chen Ridong <chenridong@...wei.com>
>>>>>>> ---
>>>>>>> mm/vmscan.c | 16 +++++++++++++++-
>>>>>>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>> index 749cdc110c74..fd8ad251eda2 100644
>>>>>>> --- a/mm/vmscan.c
>>>>>>> +++ b/mm/vmscan.c
>>>>>>> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>>>> LIST_HEAD(demote_folios);
>>>>>>> unsigned int nr_reclaimed = 0;
>>>>>>> unsigned int pgactivate = 0;
>>>>>>> - bool do_demote_pass;
>>>>>>> + bool do_demote_pass, splited = false;
>>>>>>> struct swap_iocb *plug = NULL;
>>>>>>>
>>>>>>> folio_batch_init(&free_folios);
>>>>>>> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>>>>
>>>>>>> cond_resched();
>>>>>>>
>>>>>>> + /*
>>>>>>> + * If a large folio has been split, many folios are added
>>>>>>> + * to folio_list. Looping through the entire list takes
>>>>>>> + * too much time, which may prevent folios that have completed
>>>>>>> + * writeback from rotating to the tail of the lru. Just
>>>>>>> + * stop looping if nr_pageout is greater than nr_to_reclaim.
>>>>>>> + */
>>>>>>> + if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
>>>>>>> + break;
>>>>>
>>>>> I’m not entirely sure about the theory behind comparing stat->nr_pageout
>>>>> with sc->nr_to_reclaim. However, the condition might still hold true even
>>>>> if you’ve split a relatively small “large folio,” such as 16kB?
>>>>>
>>>>
>>>> Why compare stat->nr_pageout with sc->nr_to_reclaim? Because if all the pages
>>>> that have already been paged out can be reclaimed, then enough pages will be
>>>> reclaimed once they all finish writeback. Thus, there may be no need to page
>>>> out more.
>>>>
>>>> If a small large folio (16 kB) has been split, shrink_folio_list may return
>>>> early without all of the pages in folio_list being paged out, but I think
>>>> that is fine. It can page out more pages the next time it enters
>>>> shrink_folio_list if not enough pages have been reclaimed.
>>>>
>>>> However, if pages that have been paged out are still at the head of the
>>>> LRU, it is difficult to scan these pages again. In this case, not only
>>>> might it "waste" some swap memory, but it also has to page out more pages.
>>>>
>>>> Considering the above, I sent this patch. It may not be a perfect
>>>> solution, but I think it's a good option to consider, and I am wondering
>>>> if anyone has a better solution.
>>>
>>> Hi Ridong,
>>> My overall understanding is that you haven't clearly described your problem;
>>> in particular, I don't understand what your points 3 and 4 mean:
>>>
>>>> 3. When a folio's writeback I/O completes, the folio may be rotated to the
>>>> tail of the LRU. The following LRU list is expected, with the folios that
>>>> were added to the swap cache rotated to the tail of the LRU, so that they
>>>> can be reclaimed as soon as possible:
>>>>
>>>> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>>>
>>> > 4. However, shrink_page_list and folio writeback are asynchronous. If the THP
>>> > is split, shrink_page_list loops at least 512 times, which means that some
>>> > folios can finish writeback before shrink_page_list itself completes, and
>>> > their rotation to the tail of the LRU then fails. The LRU may look like
>>> > this:
>>>
>>> Can you please describe it in a more readable way?
>>>
>>> I feel the diagram below is somehow wrong:
>>> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>>>
>>> You mentioned "rotate"; how could "rotate" make:
>>> folioN512<->folioN511<->...folioN1 in (2)
>>> become
>>> folioN1<->...folioN511<->folioN512 in (3)?
>>>
>>
>> I am sorry for any confusion.
>>
>> If the THP is split, folioN1, folioN2, folioN3, ... folioN512 are submitted
>> for writeback one by one. Assume that folioN1, folioN2, folioN3, ...
>> folioN512 complete in order.
>>
>> Original:
>> folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
>>
>> folioN1 finishes writeback and is rotated to the tail of the LRU:
>> folioN512<->folioN511<->...folioN2<->folio1<->folio2...<->folioN-1<->folioN1
>>
>> folioN2 finishes:
>> folioN512<->folioN511<->...folioN3<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2
>>
>> folioN3 finishes:
>> folioN512<->folioN511<->...folioN4<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2<->folioN3
>>
>> ...
>>
>> folioN512 finishes:
>> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>>
>> When all the folios have finished, the LRU might look like this:
>> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>
> understood, thanks!
>
> Let me try to understand the following part:
>
>> 4:
>> folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
>> folioN51<->folioN52<->...folioN511<->folioN512
>
> > Although those folios (N1-N50) have finished writing back, they are still
> > at the head of the LRU. Since isolation scans the LRU from tail to head,
> > it is difficult to scan those folios again.
>
> What is the reason that "those folios (N1-N50) have finished writing back,
> yet they remain at the head of the LRU"? Is it because their writeback_end
> occurred while we were still looping in shrink_folio_list(), causing
> folio_end_writeback()'s folio_rotate_reclaimable() to fail in moving
> these folios, which are still in the "folio_list", to the tail of the LRU?
>
Yes, you are right.
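
For reference, here is a rough sketch of why the rotation is skipped while the
folios are still on the isolated list. It is paraphrased from
folio_rotate_reclaimable() in mm/swap.c, with the per-CPU batching omitted;
lru_rotate_to_tail() below is only a placeholder, not a real kernel function:

/*
 * Simplified sketch, paraphrased from mm/swap.c (not the exact mainline
 * code). folio_end_writeback() calls this when PG_reclaim was set by
 * pageout() in shrink_folio_list().
 */
void folio_rotate_reclaimable(struct folio *folio)
{
        /*
         * Folios that are still sitting on shrink_folio_list()'s private
         * folio_list were isolated from the LRU, so folio_test_lru() is
         * false here and the rotation is silently skipped. That is the
         * case described above.
         */
        if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
            !folio_test_unevictable(folio) && folio_test_lru(folio)) {
                folio_get(folio);
                /* move the folio to the tail of its LRU list */
                lru_rotate_to_tail(folio);      /* placeholder only */
        }
}

So the rotation can only happen once the folio has been put back on the LRU,
which is too late if shrink_folio_list is still looping over the split folios.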
>>
>>> BTW, writeback isn't always async; it can be synchronous for zram and
>>> SYNC_IO swap. In that case, your patch might change the LRU order. I mean,
>>> for example, when an mTHP becomes cold, we always reclaim all of it, rather
>>> than reclaiming part of it and putting the remaining small folios back at
>>> the head of the LRU.
>>>
>>
>> Yes, this can change.
>> Although it may put some of the small folios back at the head of the LRU, it
>> can return from shrink_folio_list in time without causing much additional I/O.
>>
>> Now that you understand this issue, do you have any suggestions for fixing
>> it? My patch may not be a perfect way to fix it.
>>
>
> My point is that synchronous I/O, like zRAM, doesn't have this issue and
> doesn't require this fix, as writeback is always completed without
> asynchronous latency.
>
I have tested zRAM and found no issues.
This is the first version posted upstream, and I don't know whether this fix
will be accepted. If it is accepted, perhaps the patch could be modified to
apply only to asynchronous I/O, for example along the lines sketched below.
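
Only as an illustration (untested, and not part of the posted patch), the
restriction could look roughly like this, where async_pageout is a
hypothetical local flag in shrink_folio_list():

        bool async_pageout = false;     /* hypothetical, alongside "splited" */

        /*
         * After pageout() has successfully submitted a folio that came
         * from a split, remember whether its swap device completes
         * writeback asynchronously (devices with SWP_SYNCHRONOUS_IO,
         * such as zram, finish before pageout() returns and never miss
         * the rotation).
         */
        if (folio_test_swapcache(folio)) {
                struct swap_info_struct *si = swp_swap_info(folio->swap);

                if (!data_race(si->flags & SWP_SYNCHRONOUS_IO))
                        async_pageout = true;
        }

        /* The early break from the patch would then become: */
        if (unlikely(splited && async_pageout &&
                     stat->nr_pageout > sc->nr_to_reclaim))
                break;

Whether the extra per-folio device check is worth the complexity is of course
open to discussion.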
Best regards,
Ridong
>
>> Best regards,
>> Ridong
>>
>
> Thanks
> Barry