Message-ID: <05f2d04c-333d-4298-8c7a-d5adeac5df82@intel.com>
Date: Tue, 27 Feb 2024 16:33:33 +0800
From: Yin Fengwei <fengwei.yin@...el.com>
To: Barry Song <21cnbao@...il.com>
CC: Ryan Roberts <ryan.roberts@....com>, Lance Yang <ioworker0@...il.com>,
	David Hildenbrand <david@...hat.com>, <akpm@...ux-foundation.org>,
	<linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>, <mhocko@...e.com>,
	<minchan@...nel.org>, <peterx@...hat.com>, <shy828301@...il.com>,
	<songmuchun@...edance.com>, <wangkefeng.wang@...wei.com>,
	<zokeefe@...gle.com>
Subject: Re: [PATCH 1/1] mm/madvise: enhance lazyfreeing with mTHP in
 madvise_free



On 2/27/24 15:54, Barry Song wrote:
> On Tue, Feb 27, 2024 at 8:42 PM Yin Fengwei <fengwei.yin@...el.com> wrote:
>>
>>
>>
>> On 2/27/24 15:21, Barry Song wrote:
>>> On Tue, Feb 27, 2024 at 8:11 PM Barry Song <21cnbao@...il.com> wrote:
>>>>
>>>> On Tue, Feb 27, 2024 at 8:02 PM Yin Fengwei <fengwei.yin@...el.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2/27/24 14:40, Barry Song wrote:
>>>>>> On Tue, Feb 27, 2024 at 7:14 PM Yin Fengwei <fengwei.yin@...el.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2/27/24 10:17, Barry Song wrote:
>>>>>>>>> For example, if we hit a folio that is only partially mapped to the range, don't
>>>>>>>>> split it; just unmap the part that is mapped in the range. Let page reclaim decide
>>>>>>>>> whether to split the large folio (if it's not mapped to any other range, it will be
>>>>>>>>> freed as a whole large folio; if part of it is still mapped to another range, page
>>>>>>>>> reclaim can decide whether to split it or ignore it for the current reclaim cycle).
>>>>>>>> Yes, we can, but we still have to play the PTE-checking game to avoid adding
>>>>>>>> folios to the reclaim list multiple times.
>>>>>>>>
>>>>>>>> I don't see much difference between splitting in madvise and splitting
>>>>>>>> in vmscan, as our real purpose is to avoid splitting entirely mapped
>>>>>>>> large folios. For partially mapped large folios, if we split in madvise, then
>>>>>>>> we don't need to play the game of skipping folios while iterating PTEs.
>>>>>>>> If we don't split in madvise, we have to make sure the large folio is
>>>>>>>> added to the reclaim list only once, by checking whether PTEs belong to the
>>>>>>>> previously added folio.
>>>>>>>
>>>>>>> If the partially mapped large folio is unmapped from the range, the related PTEs
>>>>>>> become none. How could the folio be added to the reclaim list multiple times?
>>>>>>
>>>>>> in case we have 16 PTEs in a large folio.
>>>>>> PTE0 present
>>>>>> PTE1 present
>>>>>> PTE2 present
>>>>>> PTE3  none
>>>>>> PTE4 present
>>>>>> PTE5 none
>>>>>> PTE6 present
>>>>>> ....
>>>>>> the current code is scanning PTE one by one.
>>>>>> while scanning PTE0, we have added the folio. then PTE1, PTE2, PTE4, PTE6...
>>>>> No. Before we detect that the folio is fully mapped to the range, we can't add
>>>>> it to the reclaim list, because a partially mapped folio shouldn't be added.
>>>>> Only after scanning PTE15 do we know it's fully mapped.
>>>>
>>>> You never know that PTE15 is the last one mapping the large folio; PTE15 can
>>>> be mapping a completely different folio from the one PTE0 maps.
>>>>
>>>>>
>>>>> So, when scanning PTE0, we will not add folio. Then when hit PTE3, we know
>>>>> this is a partial mapped large folio. We will unmap it. Then all 16 PTEs
>>>>> become none.
>>>>
>>>> I don't understand why all 16PTEs become none as we set PTEs to none.
>>>> we set PTEs to swap entries till try_to_unmap_one called by vmscan.
>>>>
>>>>>
>>>>> If the large folio is fully mapped, the folio will be added to the reclaim list
>>>>> after we scan PTE15 and know it's fully mapped.
>>>>
>>>> our approach is calling pte_batch_pte while meeting the first pte, if
>>>> pte_batch_pte = 16,
>>>> then we add this folio to reclaim_list and skip the left 15 PTEs.
>>>
>>> Let's compare two different implementations for a partially mapped large folio
>>> with 8 PTEs as below,
>>>
>>> PTE0 present for large folio1
>>> PTE1 present for large folio1
>>> PTE2 present for another folio2
>>> PTE3 present for another folio3
>>> PTE4 present for large folio1
>>> PTE5 present for large folio1
>>> PTE6 present for another folio4
>>> PTE7 present for another folio5
>>>
>>> If we don't split in madvise(depend on vmscan to split after adding
>>> folio1), we will have
>> Let me clarify something here:
>>
>> I prefer that we don't split large folio here. Instead, we unmap the
>> large folio from this VMA range (I think you missed the unmap operation
>> I mentioned).
> 
> I don't understand why we unmap as this is a MADV_PAGEOUT not
> an unmap. unmapping totally changes the semantics. Would you like
> to show pseudo code?
Oh. Yes. MADV_PAGEOUT is not suitable.

What about MADV_FREE?

> 
> for MADV_PAGEOUT on swap-out, the last step is writing swap entries
> to replace PTEs which are present. I don't understand how an unmap
> can be involved in this process.
> 
>>
>> The intention is to try our best to avoid splitting the large folio. If
>> the folio is only partially mapped to this VMA range, it's likely that it
>> will be reclaimed as a whole large folio, which reduces lru and zone lock
>> contention compared to splitting the large folio.
> 
> which also brings negative side effects such as redundant I/O.
> For example, if you have only one subpage left in a large folio,
> pageout will still write nr_pages subpages into swap, then immediately
> free them in swap.
> 
>>
>> What I am not sure about is that unmapping from a specific VMA range is not
>> available, and whether it's worth adding.
> 
> I think we might have the possibility to have some complex code to
> add folio1, folio2, folio3, folio4 and folio5 in the above example into
> reclaim_list while avoiding splitting folio1. but i really don't understand
> how unmap will work.
> 
>>
>>> to make sure folio1, folio2, folio3, folio4, folio5 are added to
>>> reclaim_list by doing a complex
>>> game while scanning these 8 PTEs.
>>>
>>> if we split in madvise, they become:
>>>
>>> PTE0 present for large folioA - split from folio1
>>> PTE1 present for large folioB - split from folio1
>>> PTE2 present for another folio2
>>> PTE3 present for another folio3
>>> PTE4 present for large folioC - split from folio1
>>> PTE5 present for large folioD - split from folio1
>>> PTE6 present for another folio4
>>> PTE7 present for another folio5
>>>
>>> we simply add the above 8 folios into reclaim_list one by one.
>>>
>>> I would vote for splitting for partial mapped large folio in madvise.
>>>
> 
> Thanks
> Barry
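
To make the 8-PTE comparison above concrete, here is a rough user-space
simulation in Python. It is only a sketch, not kernel code: the list-of-names
PTE model, the `seen` set used to skip already-queued folios, and both function
names are assumptions made up for illustration.

```python
# Hypothetical model of the 8-PTE example: each slot names the folio it maps.
PTES = ["folio1", "folio1", "folio2", "folio3",
        "folio1", "folio1", "folio4", "folio5"]


def reclaim_without_split(ptes):
    """Scan PTEs one by one without splitting.

    Extra bookkeeping (here a simple set) is needed so that folio1,
    which is mapped by four non-contiguous PTEs, is queued only once.
    """
    reclaim_list = []
    seen = set()
    for folio in ptes:
        if folio is None or folio in seen:
            continue  # none PTE, or folio already queued
        seen.add(folio)
        reclaim_list.append(folio)
    return reclaim_list


def reclaim_with_split(ptes, large_folios):
    """Split partially mapped large folios first.

    After the split every present PTE maps its own small folio, so each
    one can be queued directly with no duplicate check.
    """
    reclaim_list = []
    for index, folio in enumerate(ptes):
        if folio is None:
            continue
        if folio in large_folios:
            folio = f"{folio}/sub{index}"  # stand-in for folioA, folioB, ...
        reclaim_list.append(folio)
    return reclaim_list


if __name__ == "__main__":
    print(reclaim_without_split(PTES))
    print(reclaim_with_split(PTES, {"folio1"}))
```

Without splitting, the reclaim list ends up with 5 entries and depends on the
duplicate check; with splitting it has 8 entries and the loop body stays
trivial, which is the trade-off being argued in the thread.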
