Date: Thu, 9 May 2024 09:10:36 +0800
From: Baolin Wang <baolin.wang@...ux.alibaba.com>
To: Ryan Roberts <ryan.roberts@....com>, akpm@...ux-foundation.org,
 hughd@...gle.com
Cc: willy@...radead.org, david@...hat.com, ioworker0@...il.com,
 wangkefeng.wang@...wei.com, ying.huang@...el.com, 21cnbao@...il.com,
 shy828301@...il.com, ziy@...dia.com, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/8] mm: memory: extend finish_fault() to support large
 folio



On 2024/5/8 18:47, Ryan Roberts wrote:
> On 08/05/2024 10:31, Baolin Wang wrote:
>>
>>
>> On 2024/5/8 16:53, Ryan Roberts wrote:
>>> On 08/05/2024 04:44, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/5/7 18:37, Ryan Roberts wrote:
>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>> Add support for establishing large folio mappings in finish_fault(), in
>>>>>> preparation for supporting multi-size THP allocation of anonymous shmem
>>>>>> pages in the following patches.
>>>>>>
>>>>>> Signed-off-by: Baolin Wang <baolin.wang@...ux.alibaba.com>
>>>>>> ---
>>>>>>     mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
>>>>>>     1 file changed, 33 insertions(+), 10 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index eea6e4984eae..936377220b77 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -4747,9 +4747,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>>>     {
>>>>>>         struct vm_area_struct *vma = vmf->vma;
>>>>>>         struct page *page;
>>>>>> +    struct folio *folio;
>>>>>>         vm_fault_t ret;
>>>>>>         bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>>>>>>                   !(vma->vm_flags & VM_SHARED);
>>>>>> +    int type, nr_pages, i;
>>>>>> +    unsigned long addr = vmf->address;
>>>>>>           /* Did we COW the page? */
>>>>>>         if (is_cow)
>>>>>> @@ -4780,24 +4783,44 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>>>                 return VM_FAULT_OOM;
>>>>>>         }
>>>>>>     +    folio = page_folio(page);
>>>>>> +    nr_pages = folio_nr_pages(folio);
>>>>>> +
>>>>>> +    if (unlikely(userfaultfd_armed(vma))) {
>>>>>> +        nr_pages = 1;
>>>>>> +    } else if (nr_pages > 1) {
>>>>>> +        unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>>>>>> +        unsigned long end = start + nr_pages * PAGE_SIZE;
>>>>>> +
>>>>>> +        /* In case the folio size in page cache beyond the VMA limits. */
>>>>>> +        addr = max(start, vma->vm_start);
>>>>>> +        nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT;
>>>>>> +
>>>>>> +        page = folio_page(folio, (addr - start) >> PAGE_SHIFT);
>>>>>
>>>>> I still don't really follow the logic in this else if block. Isn't it possible
>>>>> that finish_fault() gets called with a page from a folio that isn't aligned
>>>>> with
>>>>> vmf->address?
>>>>>
>>>>> For example, let's say we have a file whose size is 64K and which is
>>>>> cached in a single large folio in the page cache. But the file is mapped
>>>>> into a process at VA 16K to 80K. Let's say we fault on the first page
>>>>> (VA=16K). You will calculate
>>>>
>>>> For shmem, this doesn't happen because the VA is aligned with the hugepage size
>>>> in the shmem_get_unmapped_area() function. See patch 7.
>>>
>>> Certainly agree that shmem can always make sure that it packs a vma in a way
>>> such that its folios are naturally aligned in VA when faulting in memory. If you
>>> mremap it, that alignment will be lost; I don't think that would be a problem
>>
>> When mremapping it, it will also call shmem_get_unmapped_area() to align the
>> VA, but for mremap() with the MAP_FIXED flag, as David pointed out, yes, this
>> patch may not work perfectly.
> 
> Assuming this works similarly to anon mTHP, remapping to an arbitrary address
> shouldn't be a problem within a single process; the previously allocated folios
> will now be unaligned, but they will be correctly mapped so it doesn't break
> anything. And new faults will allocate folios so that they are as large as
> allowed by the sysfs interface AND which do not overlap with any non-none pte
> AND which are naturally aligned. It's when you start sharing with other
> processes that the fun and games start...
> 
>>
>>> for a single process; mremap will take care of moving the ptes correctly and
>>> this path is not involved.
>>>
>>> But what about the case when a process mmaps a shmem region, then forks, then
>>> the child mremaps the shmem region. Then the parent faults in a THP into the
>>> region (nicely aligned). Then the child faults in the same offset in the region
>>> and gets the THP that the parent allocated; that THP will be aligned in the
>>> parent's VM space but not in the child's.
>>
>> Sorry, I did not get your point here. IIUC, the child's VA will also be
>> aligned if the child's mremap() does not set MAP_FIXED, since the child's
>> mremap will still call shmem_get_unmapped_area() to find an aligned new VA.
> 
> In general, you shouldn't be relying on the vma bounds being aligned to a THP
> boundary.
> 
>> Please correct me if I missed your point.
> 
> (I'm not 100% sure this is definitely how it works, but seems the only sane way
> to me):
> 
> Let's imagine we have a process that maps 4 pages of shared anon memory at VA=64K:
> 
>    mmap(64K, 16K, PROT_X, MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, ...)
> 
> Then it forks a child, and the child moves the mapping to VA=68K:
> 
>    mremap(64K, 16K, 16K, MREMAP_FIXED | MREMAP_MAYMOVE, 68K)
> 
> Then the parent writes to address 64K (offset 0 in the shared region); this will
> fault and cause a 16K mTHP to be allocated and mapped, covering the whole region
> at 64K-80K in the parent.
> 
> Then the child reads address 68K (offset 0 in the shared region); this will
> fault and cause the previously allocated 16K folio to be looked up and it must
> be mapped in the child between 68K-84K. This is not naturally aligned in the child.
> 
> For the child, your code will incorrectly calculate start/end as 64K-80K.

OK, so you set the MREMAP_FIXED flag, just as David pointed out. Yes, it
will not be aligned in the child in this case. Thanks for the explanation.
