Message-ID: <fc7d3a7b-e4a9-48c3-9df3-5489c62789e7@linux.alibaba.com>
Date: Thu, 9 May 2024 09:10:36 +0800
From: Baolin Wang <baolin.wang@...ux.alibaba.com>
To: Ryan Roberts <ryan.roberts@....com>, akpm@...ux-foundation.org,
hughd@...gle.com
Cc: willy@...radead.org, david@...hat.com, ioworker0@...il.com,
wangkefeng.wang@...wei.com, ying.huang@...el.com, 21cnbao@...il.com,
shy828301@...il.com, ziy@...dia.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/8] mm: memory: extend finish_fault() to support large
folio
On 2024/5/8 18:47, Ryan Roberts wrote:
> On 08/05/2024 10:31, Baolin Wang wrote:
>>
>>
>> On 2024/5/8 16:53, Ryan Roberts wrote:
>>> On 08/05/2024 04:44, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/5/7 18:37, Ryan Roberts wrote:
>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>> Add support for establishing large folio mappings in finish_fault(), in
>>>>>> preparation for supporting multi-size THP allocation of anonymous shmem
>>>>>> pages in the following patches.
>>>>>>
>>>>>> Signed-off-by: Baolin Wang <baolin.wang@...ux.alibaba.com>
>>>>>> ---
>>>>>> mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
>>>>>> 1 file changed, 33 insertions(+), 10 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index eea6e4984eae..936377220b77 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -4747,9 +4747,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>>>  {
>>>>>>          struct vm_area_struct *vma = vmf->vma;
>>>>>>          struct page *page;
>>>>>> +        struct folio *folio;
>>>>>>          vm_fault_t ret;
>>>>>>          bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>>>>>>                        !(vma->vm_flags & VM_SHARED);
>>>>>> +        int type, nr_pages, i;
>>>>>> +        unsigned long addr = vmf->address;
>>>>>>
>>>>>>          /* Did we COW the page? */
>>>>>>          if (is_cow)
>>>>>> @@ -4780,24 +4783,44 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>>>                          return VM_FAULT_OOM;
>>>>>>          }
>>>>>>
>>>>>> +        folio = page_folio(page);
>>>>>> +        nr_pages = folio_nr_pages(folio);
>>>>>> +
>>>>>> +        if (unlikely(userfaultfd_armed(vma))) {
>>>>>> +                nr_pages = 1;
>>>>>> +        } else if (nr_pages > 1) {
>>>>>> +                unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>>>>>> +                unsigned long end = start + nr_pages * PAGE_SIZE;
>>>>>> +
>>>>>> +                /* In case the folio in the page cache extends beyond the VMA limits. */
>>>>>> +                addr = max(start, vma->vm_start);
>>>>>> +                nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT;
>>>>>> +
>>>>>> +                page = folio_page(folio, (addr - start) >> PAGE_SHIFT);
>>>>>
>>>>> I still don't really follow the logic in this else if block. Isn't it possible
>>>>> that finish_fault() gets called with a page from a folio that isn't aligned
>>>>> with
>>>>> vmf->address?
>>>>>
>>>>> For example, let's say we have a file whose size is 64K and which is cached
>>>>> in a
>>>>> single large folio in the page cache. But the file is mapped into a process at
>>>>> VA 16K to 80K. Let's say we fault on the first page (VA=16K). You will
>>>>> calculate
>>>>
>>>> For shmem, this doesn't happen because the VA is aligned with the hugepage size
>>>> in the shmem_get_unmapped_area() function. See patch 7.
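
(To make that concrete: a minimal userspace sketch of the alignment idea,
not the actual patch 7 code. It assumes a 64K hugepage and returns an
address congruent to the file offset modulo the hugepage size, so large
folios end up naturally aligned in VA:)

#include <stdio.h>

#define PAGE_SHIFT  12
#define HPAGE_SIZE  (64UL << 10)   /* illustrative 64K hugepage */

static unsigned long align_addr(unsigned long addr, unsigned long pgoff)
{
        unsigned long off = (pgoff << PAGE_SHIFT) & (HPAGE_SIZE - 1);

        /* Round up to a hugepage boundary, then re-add the sub-hugepage
         * offset of the file position. */
        return ((addr + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1)) + off;
}

int main(void)
{
        /* Mapping file offset 0 near VA 16K lands on a 64K boundary. */
        printf("%lu\n", align_addr(16UL << 10, 0));  /* prints 65536 */
        return 0;
}
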
>>>
>>> Certainly agree that shmem can always make sure that it packs a vma in a way
>>> such that its folios are naturally aligned in VA when faulting in memory. If you
>>> mremap it, that alignment will be lost; I don't think that would be a problem
>>
>> When you mremap it, it will also call shmem_get_unmapped_area() to align the
>> VA; but for mremap() with the MREMAP_FIXED flag, as David pointed out, yes,
>> this patch may not work perfectly.
>
> Assuming this works similarly to anon mTHP, remapping to an arbitrary address
> shouldn't be a problem within a single process; the previously allocated folios
> will now be unaligned, but they will be correctly mapped so it doesn't break
> anything. And new faults will allocate folios so that they are as large as
> allowed by the sysfs interface AND which do not overlap with any non-none pte
> AND which are naturally aligned. It's when you start sharing with other
> processes that the fun and games start...
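
(For context, a rough sketch of that selection rule as I understand it,
with illustrative names rather than the kernel's: walk down from the
largest candidate order and take the first one that is enabled via sysfs,
naturally aligned within the VMA, and whose ptes are all none:)

#include <stdbool.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Illustrative stand-ins for the kernel's sysfs and pte checks. */
static bool order_enabled_by_sysfs(int order) { return order <= 4; }
static bool ptes_are_none(unsigned long addr, int nr)
{
        (void)addr; (void)nr;
        return true;
}

static int pick_order(unsigned long addr, unsigned long vm_start,
                      unsigned long vm_end, int max_order)
{
        for (int order = max_order; order > 0; order--) {
                unsigned long size = PAGE_SIZE << order;
                unsigned long start = addr & ~(size - 1);

                if (!order_enabled_by_sysfs(order))
                        continue;
                /* Must be naturally aligned and fully inside the VMA. */
                if (start < vm_start || start + size > vm_end)
                        continue;
                /* Must not overlap any already-populated pte. */
                if (!ptes_are_none(start, 1 << order))
                        continue;
                return order;
        }
        return 0;       /* fall back to a single page */
}
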
>
>>
>>> for a single process; mremap will take care of moving the ptes correctly and
>>> this path is not involved.
>>>
>>> But what about the case when a process mmaps a shmem region, then forks, then
>>> the child mremaps the shmem region. Then the parent faults in a THP into the
>>> region (nicely aligned). Then the child faults in the same offset in the region
>>> and gets the THP that the parent allocated; that THP will be aligned in the
>>> parent's VM space but not in the child's.
>>
>> Sorry, I did not get your point here. IIUC, the child's VA will also be
>> aligned if the child's mremap does not set MREMAP_FIXED, since the child's
>> mremap will still call shmem_get_unmapped_area() to find an aligned new VA.
>
> In general, you shouldn't be relying on the vma bounds being aligned to a THP
> boundary.
>
>> Please correct me if I missed your point.
>
> (I'm not 100% sure this is definitely how it works, but it seems the only sane
> way to me):
>
> Let's imagine we have a process that maps 4 pages of shared anon memory at VA=64K:
>
> mmap(64K, 16K, PROT_X, MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, ...)
>
> Then it forks a child, and the child moves the mapping to VA=68K:
>
> mremap(64K, 16K, 16K, MREMAP_FIXED | MREMAP_MAYMOVE, 68K)
>
> Then the parent writes to address 64K (offset 0 in the shared region); this will
> fault and cause a 16K mTHP to be allocated and mapped, covering the whole region
> at 64K-80K in the parent.
>
> Then the child reads address 68K (offset 0 in the shared region); this will
> fault and cause the previously allocated 16K folio to be looked up and it must
> be mapped in the child between 68K-84K. This is not naturally aligned in the child.
>
> For the child, your code will incorrectly calculate start/end as 64K-80K.
OK, so you set the MREMAP_FIXED flag, just as David pointed out. Yes, it
will not be aligned in the child in this case. Thanks for the explanation.
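
For the record, plugging the child's numbers into the hunk's arithmetic
confirms this; below is a small standalone sketch (4K pages assumed, names
mirror the patch's variables):

#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define K(x)       ((unsigned long)(x) << 10)

int main(void)
{
        /* Child after mremap: vma covers 68K-84K, fault at 68K; the 16K
         * folio the parent allocated backs offset 0 of the region. */
        unsigned long vm_start = K(68), vm_end = K(84), address = K(68);
        unsigned long nr_pages = 4;     /* folio_nr_pages() */

        unsigned long start = address & ~(nr_pages * PAGE_SIZE - 1);
        unsigned long end = start + nr_pages * PAGE_SIZE;
        unsigned long addr = start > vm_start ? start : vm_start;
        unsigned long limit = end < vm_end ? end : vm_end;

        nr_pages = (limit - addr) >> PAGE_SHIFT;

        /* Prints start=64K end=80K addr=68K nr_pages=3 page_idx=1, yet
         * VA 68K actually maps folio page 0: the wrong pages get mapped. */
        printf("start=%luK end=%luK addr=%luK nr_pages=%lu page_idx=%lu\n",
               start >> 10, end >> 10, addr >> 10, nr_pages,
               (addr - start) >> PAGE_SHIFT);
        return 0;
}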