linux-kernel - Re: [PATCH] mm/huge_memory: Avoid PMD-size page cache if needed

Message-ID: <a168f908-3906-43e3-8676-360809ed5c8d@redhat.com>
Date: Sat, 13 Jul 2024 19:25:34 +1000
From: Gavin Shan <gshan@...hat.com>
To: David Hildenbrand <david@...hat.com>, Matthew Wilcox <willy@...radead.org>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
 akpm@...ux-foundation.org, william.kucharski@...cle.com,
 ryan.roberts@....com, shan.gavin@...il.com
Subject: Re: [PATCH] mm/huge_memory: Avoid PMD-size page cache if needed

On 7/13/24 11:03 AM, David Hildenbrand wrote:
> On 12.07.24 07:39, Gavin Shan wrote:
>>
>> David, I can help to clean it up. Could you please help to confirm the following
> 
> Thanks!
> 
>> changes are exactly what you're suggesting? Hopefully, there is nothing I've missed.
>> The original issue can be fixed by the changes. With the changes applied, madvise(MADV_COLLAPSE)
>> returns with errno -22 in the test program.
>>
>> The Fixes tag needs to be adjusted as well.
>>
>> Fixes: 3485b88390b0 ("mm: thp: introduce multi-size THP sysfs interface")
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2aa986a5cd1b..45909efb0ef0 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -74,7 +74,12 @@ extern struct kobj_attribute shmem_enabled_attr;
>>    /*
>>     * Mask of all large folio orders supported for file THP.
>>     */
>> -#define THP_ORDERS_ALL_FILE    (BIT(PMD_ORDER) | BIT(PUD_ORDER))
> 
> DAX doesn't have any MAX_PAGECACHE_ORDER restrictions (like hugetlb). So this should be
> 
> /*
>   * FSDAX never splits folios, so the MAX_PAGECACHE_ORDER limit does not
>   * apply here.
>   */
> #define THP_ORDERS_ALL_FILE_DAX (BIT(PMD_ORDER) | BIT(PUD_ORDER))
> 
> Something like that
> 

Ok. It will be corrected in v2.

>> +#define THP_ORDERS_ALL_FILE_DAX                \
>> +       ((BIT(PMD_ORDER) | BIT(PUD_ORDER)) & (BIT(MAX_PAGECACHE_ORDER + 1) - 1))
>> +#define THP_ORDERS_ALL_FILE_DEFAULT    \
>> +       ((BIT(MAX_PAGECACHE_ORDER + 1) - 1) & ~BIT(0))
>> +#define THP_ORDERS_ALL_FILE            \
>> +       (THP_ORDERS_ALL_FILE_DAX | THP_ORDERS_ALL_FILE_DEFAULT)
> 
> Maybe we can get rid of THP_ORDERS_ALL_FILE (to prevent abuse) and fixup
> THP_ORDERS_ALL instead.
> 

Sure, it will be removed in v2.
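
For reference, here is a rough userspace sketch of how the two masks would evaluate once the DAX mask drops the MAX_PAGECACHE_ORDER clamp, as suggested above. The numeric orders are assumptions chosen only for illustration (a configuration where a PMD-size folio is 512MB); they are not taken from this thread, and BIT() is redefined locally so the snippet compiles outside the kernel. order_supported() is a hypothetical helper, not a kernel function.

```c
/* Illustrative values only, assumed for this sketch. */
#define PMD_ORDER            13  /* assumed: 512MB PMD-size folio with 64KB base pages */
#define PUD_ORDER            26  /* assumed, purely illustrative */
#define MAX_PAGECACHE_ORDER  11  /* assumed page cache limit */

/* Local stand-in for the kernel's BIT() macro. */
#define BIT(n) (1UL << (n))

/* DAX: no MAX_PAGECACHE_ORDER clamp, only PMD and PUD orders. */
#define THP_ORDERS_ALL_FILE_DAX \
	(BIT(PMD_ORDER) | BIT(PUD_ORDER))

/* Other file mappings: orders 1..MAX_PAGECACHE_ORDER (order 0 excluded). */
#define THP_ORDERS_ALL_FILE_DEFAULT \
	((BIT(MAX_PAGECACHE_ORDER + 1) - 1) & ~BIT(0))

/* Hypothetical helper: is a given order present in a mask? */
int order_supported(unsigned long mask, unsigned int order)
{
	return (mask & BIT(order)) != 0;
}
```

Under these assumed values, THP_ORDERS_ALL_FILE_DEFAULT covers orders 1 through 11, so order_supported(THP_ORDERS_ALL_FILE_DEFAULT, PMD_ORDER) is false while the same query against THP_ORDERS_ALL_FILE_DAX is true.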

>>    /*
>>     * Mask of all large folio orders supported for THP.
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 2120f7478e55..4690f33afaa6 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -88,9 +88,17 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>>           bool smaps = tva_flags & TVA_SMAPS;
>>           bool in_pf = tva_flags & TVA_IN_PF;
>>           bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS;
>> +       unsigned long supported_orders;
>> +
>>           /* Check the intersection of requested and supported orders. */
>> -       orders &= vma_is_anonymous(vma) ?
>> -                       THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;
>> +       if (vma_is_anonymous(vma))
>> +               supported_orders = THP_ORDERS_ALL_ANON;
>> +       else if (vma_is_dax(vma))
>> +               supported_orders = THP_ORDERS_ALL_FILE_DAX;
>> +       else
>> +               supported_orders = THP_ORDERS_ALL_FILE_DEFAULT;
> 
> This is what I had in mind.
> 
> But, do we have to special-case shmem as well or will that be handled correctly?
> 

With the previous fixes and this one, I don't see any remaining case where
shmem can end up with a 512MB page cache folio, which would exceed
MAX_PAGECACHE_ORDER. Hopefully, I haven't missed anything in the code inspection.

- regular read/write paths: covered by the previous fixes
- synchronous readahead: covered by the previous fixes
- asynchronous readahead: page size granularity, no huge page
- page fault handling: covered by the previous fixes
- collapsing PTEs to PMD: to be covered by this patch
- swapin: shouldn't produce a 512MB huge page, since no such huge pages exist at swapout time
- other cases I missed (?)
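
To illustrate the "collapsing PTEs to PMD" case, below is a simplified userspace model of the order selection in the proposed __thp_vma_allowable_orders() change. It is only a sketch: enum vma_kind and allowable_orders() are hypothetical stand-ins for the vma_is_anonymous()/vma_is_dax() checks, the numeric orders are assumed values for a configuration with 512MB PMD-size folios, and the THP_ORDERS_ALL_ANON definition is my assumption of the anonymous mask rather than something quoted in this thread.

```c
/* Illustrative values only, assumed for this sketch. */
#define PMD_ORDER            13
#define PUD_ORDER            26
#define MAX_PAGECACHE_ORDER  11

/* Local stand-in for the kernel's BIT() macro. */
#define BIT(n) (1UL << (n))

/* Assumed anonymous mask: orders 2..PMD_ORDER. */
#define THP_ORDERS_ALL_ANON \
	((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
#define THP_ORDERS_ALL_FILE_DAX \
	(BIT(PMD_ORDER) | BIT(PUD_ORDER))
#define THP_ORDERS_ALL_FILE_DEFAULT \
	((BIT(MAX_PAGECACHE_ORDER + 1) - 1) & ~BIT(0))

/* Hypothetical stand-in for the VMA type checks. */
enum vma_kind { VMA_ANON, VMA_DAX, VMA_FILE };

/* Intersect the requested orders with what the mapping type supports. */
unsigned long allowable_orders(enum vma_kind kind, unsigned long requested)
{
	unsigned long supported_orders;

	if (kind == VMA_ANON)
		supported_orders = THP_ORDERS_ALL_ANON;
	else if (kind == VMA_DAX)
		supported_orders = THP_ORDERS_ALL_FILE_DAX;
	else
		supported_orders = THP_ORDERS_ALL_FILE_DEFAULT;

	return requested & supported_orders;
}
```

With these assumed values, a PMD-order request on a regular file mapping yields an empty mask, which matches the collapse being refused in the test program, while the same request on an anonymous or DAX mapping is kept.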

Thanks,
Gavin