Date: Tue, 9 Apr 2024 12:42:39 +0200
From: David Hildenbrand <david@...hat.com>
To: Ryan Roberts <ryan.roberts@....com>, Barry Song <21cnbao@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
 Matthew Wilcox <willy@...radead.org>, Huang Ying <ying.huang@...el.com>,
 Gao Xiang <xiang@...nel.org>, Yu Zhao <yuzhao@...gle.com>,
 Yang Shi <shy828301@...il.com>, Michal Hocko <mhocko@...e.com>,
 Kefeng Wang <wangkefeng.wang@...wei.com>, Chris Li <chrisl@...nel.org>,
 Lance Yang <ioworker0@...il.com>, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH v7 2/7] mm: swap: free_swap_and_cache_nr() as batched
 free_swap_and_cache()

On 09.04.24 11:55, Ryan Roberts wrote:
> On 09/04/2024 10:41, Barry Song wrote:
>> On Tue, Apr 9, 2024 at 9:24 PM David Hildenbrand <david@...hat.com> wrote:
>>>
>>> On 09.04.24 11:22, Barry Song wrote:
>>>> On Tue, Apr 9, 2024 at 8:51 PM Barry Song <21cnbao@...il.com> wrote:
>>>>>
>>>>> On Tue, Apr 9, 2024 at 6:40 AM Ryan Roberts <ryan.roberts@....com> wrote:
>>>>>>
>>>>>> Now that we no longer have a convenient flag in the cluster to determine
>>>>>> if a folio is large, free_swap_and_cache() will take a reference and
>>>>>> lock a large folio much more often, which could lead to contention and
>>>>>> (e.g.) failure to split large folios, etc.
>>>>>>
>>>>>> Let's solve that problem by batch freeing swap and cache with a new
>>>>>> function, free_swap_and_cache_nr(), to free a contiguous range of swap
>>>>>> entries together. This allows us to first drop a reference to each swap
>>>>>> slot before we try to release the cache folio. This means we only try to
>>>>>> release the folio once, only taking the reference and lock once - much
>>>>>> better than the previous 512 times for the 2M THP case.
>>>>>>
>>>>>> Contiguous swap entries are gathered in zap_pte_range() and
>>>>>> madvise_free_pte_range() in a similar way to how present ptes are
>>>>>> already gathered in zap_pte_range().
>>>>>>
>>>>>> While we are at it, let's simplify by converting the return type of both
>>>>>> functions to void. The return value was used only by zap_pte_range() to
>>>>>> print a bad pte, and was ignored by everyone else, so the extra
>>>>>> reporting wasn't exactly guaranteed. We will still get the warning with
>>>>>> most of the information from get_swap_device(). With the batch version,
>>>>>> we wouldn't know which pte was bad anyway so could print the wrong one.
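>>>>>>
>>>>>> For illustration, the caller side then looks roughly like the sketch
>>>>>> below (a sketch only, not the exact hunk from this patch; the
>>>>>> surrounding zap_pte_range() context such as max_nr, ptent and
>>>>>> tlb->fullmm is assumed, and swap_pte_batch() is the batching helper
>>>>>> added to mm/internal.h in this series):
>>>>>>
>>>>>>         if (!non_swap_entry(entry)) {
>>>>>>                 /* Count the run of ptes holding contiguous swap entries. */
>>>>>>                 nr = swap_pte_batch(pte, max_nr, ptent);
>>>>>>                 /* Clear the whole run of swap ptes in one call... */
>>>>>>                 clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>>>>>                 /* ...and drop the swap refs, touching the folio only once. */
>>>>>>                 free_swap_and_cache_nr(entry, nr);
>>>>>>         }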
>>>>>>
>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@....com>
>>>>>> ---
>>>>>>    include/linux/pgtable.h | 29 ++++++++++++
>>>>>>    include/linux/swap.h    | 12 +++--
>>>>>>    mm/internal.h           | 63 ++++++++++++++++++++++++++
>>>>>>    mm/madvise.c            | 12 +++--
>>>>>>    mm/memory.c             | 13 +++---
>>>>>>    mm/swapfile.c           | 97 +++++++++++++++++++++++++++++++++--------
>>>>>>    6 files changed, 195 insertions(+), 31 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>> index a3fc8150b047..75096025fe52 100644
>>>>>> --- a/include/linux/pgtable.h
>>>>>> +++ b/include/linux/pgtable.h
>>>>>> @@ -708,6 +708,35 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm,
>>>>>>    }
>>>>>>    #endif
>>>>>>
>>>>>> +#ifndef clear_not_present_full_ptes
>>>>>> +/**
>>>>>> + * clear_not_present_full_ptes - Clear multiple not present PTEs which are
>>>>>> + *                              consecutive in the pgtable.
>>>>>> + * @mm: Address space the ptes represent.
>>>>>> + * @addr: Address of the first pte.
>>>>>> + * @ptep: Page table pointer for the first entry.
>>>>>> + * @nr: Number of entries to clear.
>>>>>> + * @full: Whether we are clearing a full mm.
>>>>>> + *
>>>>>> + * May be overridden by the architecture; otherwise, implemented as a simple
>>>>>> + * loop over pte_clear_not_present_full().
>>>>>> + *
>>>>>> + * Context: The caller holds the page table lock.  The PTEs are all not present.
>>>>>> + * The PTEs are all in the same PMD.
>>>>>> + */
>>>>>> +static inline void clear_not_present_full_ptes(struct mm_struct *mm,
>>>>>> +               unsigned long addr, pte_t *ptep, unsigned int nr, int full)
>>>>>> +{
>>>>>> +       for (;;) {
>>>>>> +               pte_clear_not_present_full(mm, addr, ptep, full);
>>>>>> +               if (--nr == 0)
>>>>>> +                       break;
>>>>>> +               ptep++;
>>>>>> +               addr += PAGE_SIZE;
>>>>>> +       }
>>>>>> +}
>>>>>> +#endif
>>>>>> +
>>>>>>    #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
>>>>>>    extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
>>>>>>                                 unsigned long address,
>>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>>> index f6f78198f000..5737236dc3ce 100644
>>>>>> --- a/include/linux/swap.h
>>>>>> +++ b/include/linux/swap.h
>>>>>> @@ -471,7 +471,7 @@ extern int swap_duplicate(swp_entry_t);
>>>>>>    extern int swapcache_prepare(swp_entry_t);
>>>>>>    extern void swap_free(swp_entry_t);
>>>>>>    extern void swapcache_free_entries(swp_entry_t *entries, int n);
>>>>>> -extern int free_swap_and_cache(swp_entry_t);
>>>>>> +extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>>>>>>    int swap_type_of(dev_t device, sector_t offset);
>>>>>>    int find_first_swap(dev_t *device);
>>>>>>    extern unsigned int count_swap_pages(int, int);
>>>>>> @@ -520,8 +520,9 @@ static inline void put_swap_device(struct swap_info_struct *si)
>>>>>>    #define free_pages_and_swap_cache(pages, nr) \
>>>>>>           release_pages((pages), (nr));
>>>>>>
>>>>>> -/* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */
>>>>>> -#define free_swap_and_cache(e) is_pfn_swap_entry(e)
>>>>>> +static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr)
>>>>>> +{
>>>>>> +}
>>>>>>
>>>>>>    static inline void free_swap_cache(struct folio *folio)
>>>>>>    {
>>>>>> @@ -589,6 +590,11 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
>>>>>>    }
>>>>>>    #endif /* CONFIG_SWAP */
>>>>>>
>>>>>> +static inline void free_swap_and_cache(swp_entry_t entry)
>>>>>> +{
>>>>>> +       free_swap_and_cache_nr(entry, 1);
>>>>>> +}
>>>>>> +
>>>>>>    #ifdef CONFIG_MEMCG
>>>>>>    static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>>>>>>    {
>>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>>> index 3bdc8693b54f..de68705624b0 100644
>>>>>> --- a/mm/internal.h
>>>>>> +++ b/mm/internal.h
>>>>>> @@ -11,6 +11,8 @@
>>>>>>    #include <linux/mm.h>
>>>>>>    #include <linux/pagemap.h>
>>>>>>    #include <linux/rmap.h>
>>>>>> +#include <linux/swap.h>
>>>>>> +#include <linux/swapops.h>
>>>>>>    #include <linux/tracepoint-defs.h>
>>>>>>
>>>>>>    struct folio_batch;
>>>>>> @@ -189,6 +191,67 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>>>>
>>>>>>           return min(ptep - start_ptep, max_nr);
>>>>>>    }
>>>>>> +
>>>>>> +/**
>>>>>> + * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
>>>>>> + * @pte: The initial pte state; is_swap_pte(pte) must be true.
>>>>>> + *
>>>>>> + * Increments the swap offset, while maintaining all other fields, including
>>>>>> + * swap type, and any swp pte bits. The resulting pte is returned.
>>>>>> + */
>>>>>> +static inline pte_t pte_next_swp_offset(pte_t pte)
>>>>>> +{
>>>>>> +       swp_entry_t entry = pte_to_swp_entry(pte);
>>>>>> +       pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
>>>>>> +                                                  swp_offset(entry) + 1));
>>>>>> +
>>>>>> +       if (pte_swp_soft_dirty(pte))
>>>>>> +               new = pte_swp_mksoft_dirty(new);
>>>>>> +       if (pte_swp_exclusive(pte))
>>>>>> +               new = pte_swp_mkexclusive(new);
>>>>>> +       if (pte_swp_uffd_wp(pte))
>>>>>> +               new = pte_swp_mkuffd_wp(new);
>>>>>
>>>>> I don't quite understand this. If this page table entry is exclusive,
>>>>> will its subsequent page table entry also be exclusive without
>>>>> question?
>>>>> In try_to_unmap_one(), exclusive is per-subpage, not per-folio:
>>>>>
>>>>>                   anon_exclusive = folio_test_anon(folio) &&
>>>>>                                    PageAnonExclusive(subpage);
>>>>>
>>>>> The same question also applies to dirty, wp, etc.
>>>>
>>>> Sorry for the noise. You are right. Based on your new version, I think I should
>>>> entirely drop:
>>>>
>>>> [PATCH v2 3/5] mm: swap_pte_batch: add an output argument to reture if
>>>> all swap entries are exclusive
>>>
>>> Yes. If we ever want to ignore some bits, we should likely add flags to
>>> change the behavior, like for folio_pte_batch().
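>>>
>>> Roughly along these lines, as a sketch of the flag idea (the flag name
>>> is made up here, and swap_pte_batch() would need to grow a flags
>>> parameter):
>>>
>>>         pte_t a = pte, b = expected_pte;
>>>
>>>         if (flags & SPB_IGNORE_EXCLUSIVE) {
>>>                 /* Compare everything except the exclusive bit. */
>>>                 a = pte_swp_clear_exclusive(a);
>>>                 b = pte_swp_clear_exclusive(b);
>>>         }
>>>         if (!pte_same(a, b))
>>>                 break;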
>>>
>>> For swapin, you really want the exclusive bits to match, though.
>>
>> I am not quite sure I really need the exclusive bits to match. I can either
>> drop my 3/5 or ignore the exclusive bit as below (if any entry is not
>> exclusive, swapin won't reuse the large folio, but it can still map it
>> entirely read-only):
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index cae39c372bfc..5726e729c9ee 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -253,10 +253,22 @@ static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte,
>>                  *any_shared |= !pte_swp_exclusive(pte);
>>
>>          while (ptep < end_ptep) {
>> +               pte_t ignore_exclusive_pte;
>> +               pte_t ignore_exclusive_expected_pte;
>>                  pte = ptep_get(ptep);
>>
>> -               if (!pte_same(pte, expected_pte))
>> -                       break;
>> +               if (any_shared) {
>> +                       ignore_exclusive_pte = pte;
>> +                       ignore_exclusive_expected_pte = expected_pte;
>> +                       ignore_exclusive_pte = pte_swp_clear_exclusive(ignore_exclusive_pte);
>> +                       ignore_exclusive_expected_pte = pte_swp_clear_exclusive(expected_pte);
>> +
>> +                       if (!pte_same(ignore_exclusive_pte, ignore_exclusive_expected_pte))
>> +                               break;
>> +               } else {
>> +                       if (!pte_same(pte, expected_pte))
>> +                               break;
>> +               }
>>
>>                  if (any_shared)
>>                          *any_shared |= !pte_swp_exclusive(pte);
> 
> I'll leave David to comment on this proposal; I'm not sure I understand all the
> details. The code change does look a bit "busy" though - sometimes that can be
> an indicator :)

I'd prefer to keep it simple initially. The devil is in the details for
the refault case: in the past, dropping an exclusive flag could have been
problematic for some O_DIRECT workloads that were not using FOLL_PIN.
IIUC, some such workloads still remain.

More details can be found at
https://lore.kernel.org/all/20230113171026.582290-1-david@redhat.com/


It might all change a bit if I manage to get folio_test_anon_exclusive() 
flying. The current plan is that all individual swp PTEs would still 
have pte_swp_exclusive() set, and we would *not* clear the 
folio_test_anon_exclusive() flag while the folio is still in the 
swapcache [which we do today to make fork() handling easier].

That will make exclusivity handling on refault significantly easier for
large folios.

-- 
Cheers,

David / dhildenb

