Message-ID: <CAGsJ_4z60mrjuQ5qKCKn0+knk_M1dy=NsH4nVLqe5Khue_5gFw@mail.gmail.com>
Date: Mon, 3 Jun 2024 17:28:47 +1200
From: Barry Song <21cnbao@...il.com>
To: Baolin Wang <baolin.wang@...ux.alibaba.com>
Cc: akpm@...ux-foundation.org, hughd@...gle.com, willy@...radead.org, 
	david@...hat.com, wangkefeng.wang@...wei.com, ying.huang@...el.com, 
	ryan.roberts@....com, shy828301@...il.com, ziy@...dia.com, 
	ioworker0@...il.com, da.gomez@...sung.com, p.raghav@...sung.com, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 1/6] mm: memory: extend finish_fault() to support large folio

On Thu, May 30, 2024 at 2:04 PM Baolin Wang
<baolin.wang@...ux.alibaba.com> wrote:
>
> Add large folio mapping establishment support for finish_fault() as a preparation,
> to support multi-size THP allocation of anonymous shmem pages in the following
> patches.
>
> Signed-off-by: Baolin Wang <baolin.wang@...ux.alibaba.com>
> ---
>  mm/memory.c | 58 ++++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 48 insertions(+), 10 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index eef4e482c0c2..435187ff7ea4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4831,9 +4831,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>  {
>         struct vm_area_struct *vma = vmf->vma;
>         struct page *page;
> +       struct folio *folio;
>         vm_fault_t ret;
>         bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>                       !(vma->vm_flags & VM_SHARED);
> +       int type, nr_pages, i;
> +       unsigned long addr = vmf->address;
>
>         /* Did we COW the page? */
>         if (is_cow)
> @@ -4864,24 +4867,59 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>                         return VM_FAULT_OOM;
>         }
>
> +       folio = page_folio(page);
> +       nr_pages = folio_nr_pages(folio);
> +
> +       /*
> +        * Using per-page fault to maintain the uffd semantics, and same
> +        * approach also applies to non-anonymous-shmem faults to avoid
> +        * inflating the RSS of the process.

I don't think this comment explains the root cause.
For non-shmem files the memory has already been allocated, so avoiding
RSS inflation doesn't seem very useful; the memory footprint is what we
really care about. Is the real reason that we want to rely on the
per-subpage read-ahead hints to determine the read-ahead size? Is that
why we don't map nr_pages for non-shmem files, even though doing so
could save up to nr_pages - 1 page faults?
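
To put a number on the "nr_pages - 1" above, here is a rough userspace
sketch. The helper name and the example sizes (a 64KiB folio with 4KiB
base pages) are only illustrative assumptions, not anything taken from
the patch:

```c
#include <assert.h>

/*
 * Illustrative model only: number of extra page faults avoided by mapping
 * a whole folio in one fault instead of one PTE per fault.
 */
static unsigned long faults_saved(unsigned long folio_bytes,
				  unsigned long page_bytes)
{
	unsigned long nr_pages;

	assert(page_bytes != 0 && folio_bytes % page_bytes == 0);
	nr_pages = folio_bytes / page_bytes;
	return nr_pages - 1;	/* the first fault is taken either way */
}
```

With those assumed sizes, faults_saved(65536, 4096) covers 16 pages with
a single fault, so 15 later faults would be avoided.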

> +        */
> +       if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) {
> +               nr_pages = 1;
> +       } else if (nr_pages > 1) {
> +               pgoff_t idx = folio_page_idx(folio, page);
> +               /* The page offset of vmf->address within the VMA. */
> +               pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
> +
> +               /*
> +                * Fallback to per-page fault in case the folio size in page
> +                * cache beyond the VMA limits.
> +                */
> +               if (unlikely(vma_off < idx ||
> +                            vma_off + (nr_pages - idx) > vma_pages(vma))) {
> +                       nr_pages = 1;
> +               } else {
> +                       /* Now we can set mappings for the whole large folio. */
> +                       addr = vmf->address - idx * PAGE_SIZE;
> +                       page = &folio->page;
> +               }
> +       }
> +
>         vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> -                                     vmf->address, &vmf->ptl);
> +                                      addr, &vmf->ptl);
>         if (!vmf->pte)
>                 return VM_FAULT_NOPAGE;
>
>         /* Re-check under ptl */
> -       if (likely(!vmf_pte_changed(vmf))) {
> -               struct folio *folio = page_folio(page);
> -               int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
> -
> -               set_pte_range(vmf, folio, page, 1, vmf->address);
> -               add_mm_counter(vma->vm_mm, type, 1);
> -               ret = 0;
> -       } else {
> -               update_mmu_tlb(vma, vmf->address, vmf->pte);
> +       if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
> +               update_mmu_tlb(vma, addr, vmf->pte);
>                 ret = VM_FAULT_NOPAGE;
> +               goto unlock;
> +       } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {

In what case can't we use !pte_range_none(vmf->pte, 1) for nr_pages == 1,
so that the code for nr_pages == 1 and nr_pages > 1 can be unified?

I think this has been discussed before, but I've forgotten the reason.
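
Roughly what I have in mind, as an untested userspace model rather than
kernel code. model_pte_t and both helpers are hypothetical stand-ins (an
entry of 0 plays the role of a none PTE); they only mimic what
pte_range_none() appears to do in the hunk above:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's pte_t; 0 means "no mapping". */
typedef unsigned long model_pte_t;

/* Models pte_range_none(): true only if all nr entries are empty. */
static bool model_pte_range_none(const model_pte_t *pte, int nr)
{
	assert(nr > 0);
	for (int i = 0; i < nr; i++)
		if (pte[i] != 0)
			return false;
	return true;
}

/* Unified re-check: one test covers nr_pages == 1 and nr_pages > 1. */
static bool model_can_map(const model_pte_t *pte, int nr_pages)
{
	return model_pte_range_none(pte, nr_pages);
}
```

This deliberately ignores whatever vmf_pte_changed() does beyond a plain
none-check, which may be exactly the reason the two paths differ.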

> +               for (i = 0; i < nr_pages; i++)
> +                       update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
> +               ret = VM_FAULT_NOPAGE;
> +               goto unlock;
>         }
>
> +       folio_ref_add(folio, nr_pages - 1);
> +       set_pte_range(vmf, folio, page, nr_pages, addr);
> +       type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
> +       add_mm_counter(vma->vm_mm, type, nr_pages);
> +       ret = 0;
> +
> +unlock:
>         pte_unmap_unlock(vmf->pte, vmf->ptl);
>         return ret;
>  }
> --
> 2.39.3
>
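
For anyone following along, the VMA-limit fallback in the hunk above can
be modelled in userspace as below. This is only a sketch under my reading
of the condition; the function name is made up and every value is a plain
page count rather than a pgoff_t:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model of the fallback condition. All values are in pages:
 *   idx       - index of the faulting page within the folio
 *   vma_off   - offset of the faulting page within the VMA
 *   nr_pages  - number of pages in the folio
 *   vma_pages - size of the VMA
 */
static bool model_must_fallback(unsigned long idx, unsigned long vma_off,
				unsigned long nr_pages, unsigned long vma_pages)
{
	assert(idx < nr_pages);
	/* The folio would start before the VMA, or end after it. */
	return vma_off < idx || vma_off + (nr_pages - idx) > vma_pages;
}
```

For example, a 16-page folio faulting at idx 3 falls back when vma_off is
2 (the folio would start one page before the VMA), and fits when vma_off
is 3 in a 16-page VMA.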

Thanks
Barry