Message-ID: <60049ec1-df14-4c3f-b3dd-5d771c2ceac4@redhat.com>
Date: Mon, 15 Apr 2024 17:40:43 +0200
From: David Hildenbrand <david@...hat.com>
To: Yang Shi <shy828301@...il.com>, Zi Yan <ziy@...dia.com>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
 Matthew Wilcox <willy@...radead.org>, Ryan Roberts <ryan.roberts@....com>,
 Barry Song <21cnbao@...il.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to
 deferred split list

On 13.04.24 00:29, Yang Shi wrote:
> On Fri, Apr 12, 2024 at 2:06 PM Zi Yan <ziy@...dia.com> wrote:
>>
>> On 12 Apr 2024, at 15:32, David Hildenbrand wrote:
>>
>>> On 12.04.24 16:35, Zi Yan wrote:
>>>> On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
>>>>
>>>>> On 11.04.24 17:32, Zi Yan wrote:
>>>>>> From: Zi Yan <ziy@...dia.com>
>>>>>>
>>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>>>>> if any page in the folio loses its final mapping. It is possible that
>>>>>> the folio is fully unmapped, in which case it is unnecessary to add the
>>>>>> folio to the deferred split list at all. Fix it by checking the folio's
>>>>>> mapcount before adding it to the deferred split list.
>>>>>>
>>>>>> Signed-off-by: Zi Yan <ziy@...dia.com>
>>>>>> ---
>>>>>>     mm/rmap.c | 9 ++++++---
>>>>>>     1 file changed, 6 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index 2608c40dffad..d599a772e282 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>>                     enum rmap_level level)
>>>>>>     {
>>>>>>             atomic_t *mapped = &folio->_nr_pages_mapped;
>>>>>> -  int last, nr = 0, nr_pmdmapped = 0;
>>>>>> +  int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>>>>>             enum node_stat_item idx;
>>>>>>             __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>>                             break;
>>>>>>                     }
>>>>>>    -                atomic_sub(nr_pages, &folio->_large_mapcount);
>>>>>> +          mapcount = atomic_sub_return(nr_pages,
>>>>>> +                                       &folio->_large_mapcount) + 1;
>>>>>
>>>>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.
>>>>
>>>> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
>>>> below, but to save an atomic op, I chose to read mapcount here.
>>>
>>> Some points:
>>>
>>> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
>>> RMW that returns a value -- and how they interact with memory barriers.
>>> Further, note that relaxed variants are only optimized on some architectures.
>>>
>>> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
>>> access that should not be refetched. Usually cheaper than most other stuff
>>> that involves atomics.
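
To make (1) concrete, a minimal sketch of the different flavours (kernel
atomic API; the comments describe the documented ordering categories, not
any particular architecture):

	atomic_t *m = &folio->_large_mapcount;
	int v;

	/* plain load, essentially READ_ONCE(): no RMW, no implied ordering */
	v = atomic_read(m);

	/* RMW without a return value: atomic, but no ordering implied */
	atomic_sub(nr_pages, m);

	/* RMW that returns a value: fully ordered, which can mean an extra
	 * memory barrier on some architectures */
	v = atomic_sub_return(nr_pages, m);

	/* relaxed variant: returns a value but implies no ordering; only
	 * actually cheaper on some architectures */
	v = atomic_sub_return_relaxed(nr_pages, m);
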
>>
>> I should have checked the actual implementation instead of being fooled
>> by the name. Will read about it. Thanks.
>>
>>>
>>> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
>>> to figure out if the folio is now completely unmapped.
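
Spelled out as a sketch (where "mapped" is the local pointer to
folio->_nr_pages_mapped in __folio_remove_rmap()):

	bool fully_unmapped;

	/* (a) via the large mapcount ... */
	fully_unmapped = folio_large_mapcount(folio) == 0;

	/* (b) ... or via the number of mapped pages of the folio */
	fully_unmapped = !atomic_read(mapped);
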
>>>
>>> (3) There is one fundamental issue: if we are not batch-unmapping the whole
>>> thing, we will still add the folios to the deferred split queue. Migration
>>> would still do that, as would unmapping a folio that is covered by multiple VMAs.
>>>
>>> (4) We should really avoid making common operations slower only to make
>>> some unreliable stats less unreliable.
>>>
>>>
>>> We should likely do something like the following, which might even be a bit
>>> faster in some cases, because by checking _deferred_list ahead of time we
>>> avoid a function call when unmapping individual PTEs:
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 2608c40dffad..356598b3dc3c 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>                   * page of the folio is unmapped and at least one page
>>>                   * is still mapped.
>>>                   */
>>> -               if (folio_test_large(folio) && folio_test_anon(folio))
>>> -                       if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
>>> -                               deferred_split_folio(folio);
>>> +               if (folio_test_large(folio) && folio_test_anon(folio) &&
>>> +                   (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
>>> +                   atomic_read(mapped) &&
>>> +                   data_race(list_empty(&folio->_deferred_list)))
>>
>> data_race() might not be needed, as Ryan pointed out [1].
>>
>>> +                       deferred_split_folio(folio);
>>>          }
>>>
>>> I also thought about handling the scenario where we unmap the whole
>>> thing in smaller chunks. We could detect "!atomic_read(mapped)", check
>>> that the folio is on the deferred split list, and simply remove it
>>> from that list, incrementing a THP_UNDO_DEFERRED_SPLIT_PAGE event.
>>>
>>> But it would be racy with concurrent remapping of the folio (might happen with
>>> anon folios in corner cases I guess).
>>>
>>> What we can do is the following, though:
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index dc30139590e6..f05cba1807f2 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio)
>>>          ds_queue = get_deferred_split_queue(folio);
>>>          spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>>          if (!list_empty(&folio->_deferred_list)) {
>>> +               if (folio_test_pmd_mappable(folio))
>>> +                       count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE);
>>>                  ds_queue->split_queue_len--;
>>>                  list_del_init(&folio->_deferred_list);
>>>          }
>>>
>>> Adding the right event of course.
>>>
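
(For completeness: wiring up such an event would go through the usual
vm_event plumbing; a sketch only, with THP_UNDO_DEFERRED_SPLIT_PAGE being
the placeholder name used above.)

	/* include/linux/vm_event_item.h, in the CONFIG_TRANSPARENT_HUGEPAGE
	 * section, next to the existing entry: */
	THP_DEFERRED_SPLIT_PAGE,
	THP_UNDO_DEFERRED_SPLIT_PAGE,

	/* mm/vmstat.c, matching string in vmstat_text[]: */
	"thp_deferred_split_page",
	"thp_undo_deferred_split_page",
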
>>>
>>> Then it's easy to filter out these "temporarily added to the list, but never split
>>> before the folio was freed" cases.
>>
>> So instead of making THP_DEFERRED_SPLIT_PAGE precise, use
>> THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE? That should work.
> 
> It is definitely possible that the THPs on the deferred split queue are
> freed instead of split. For example, 1M of a 2M THP is unmapped, then
> later the remaining 1M is unmapped, or the process exits before memory
> pressure happens. So how can we tell it was "temporarily added to the
> list"? Then THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE
> actually just counts how many pages are still on the deferred split queue.

Not quite, I think. I don't think we have a counter that counts how many 
large folios on the deferred list were split; I think we only have 
THP_SPLIT_PAGE.

We could have
* THP_DEFERRED_SPLIT_PAGE
* THP_UNDO_DEFERRED_SPLIT_PAGE
* THP_PERFORM_DEFERRED_SPLIT_PAGE

Maybe that would catch more cases (not sure if all, though). Then, you 
could tell how many are still on that list: THP_DEFERRED_SPLIT_PAGE - 
THP_UNDO_DEFERRED_SPLIT_PAGE - THP_PERFORM_DEFERRED_SPLIT_PAGE.

That could give one a clearer picture of how deferred splitting interacts 
with actual splitting (possibly under memory pressure), which is the whole 
reason deferred splitting was added in the first place.
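
From user space, that arithmetic would boil down to something like this
(a sketch: only thp_deferred_split_page exists in /proc/vmstat today, the
other two names correspond to the counters proposed above):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char name[64];
	long long val, queued = 0, undone = 0, performed = 0;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 1;
	while (fscanf(f, "%63s %lld", name, &val) == 2) {
		if (!strcmp(name, "thp_deferred_split_page"))
			queued = val;
		else if (!strcmp(name, "thp_undo_deferred_split_page"))
			undone = val;		/* hypothetical counter */
		else if (!strcmp(name, "thp_perform_deferred_split_page"))
			performed = val;	/* hypothetical counter */
	}
	fclose(f);
	/* rough estimate of folios still sitting on the deferred split list */
	printf("%lld\n", queued - undone - performed);
	return 0;
}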

> It may be useful. However, the counter is typically used to estimate
> how many THPs are partially unmapped during a period of time.

I'd say it's a bit of an abuse of that counter; or rather, reading 
something into the counter that it never reliably represented.

I can easily write a program that keeps sending your counter to infinity 
simply by triggering that behavior in a loop, so it's all a bit shaky.
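
Roughly along these lines (a sketch only; it assumes THP is enabled in
"always" or "madvise" mode so that the fault actually installs a PMD-sized
folio):

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define THP_SIZE (2UL << 20)	/* assumes 2 MiB PMD-sized THP */

int main(void)
{
	for (;;) {
		/* over-allocate so we can carve out a THP-aligned 2 MiB window */
		size_t len = 2 * THP_SIZE;
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		char *thp = (char *)(((uintptr_t)p + THP_SIZE - 1) &
				     ~(uintptr_t)(THP_SIZE - 1));

		madvise(thp, THP_SIZE, MADV_HUGEPAGE);
		memset(thp, 1, THP_SIZE);	/* fault it in, hopefully as a THP */

		/* partial unmap: the folio goes onto the deferred split list
		 * and thp_deferred_split_page is bumped ... */
		munmap(thp, THP_SIZE / 2);
		/* ... then the rest goes away and the folio is freed without
		 * ever having been split */
		munmap(p, len);
	}
	return 0;
}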

Something like Ryan's script makes more sense, where you get a clearer 
picture of what's mapped where and how. That information can be much more 
valuable than just knowing whether a folio is mapped fully or partially 
(again, relevant for dealing with memory waste).

-- 
Cheers,

David / dhildenb

